Around 2014, I was idly browsing YouTube when I noticed something new in the resolution menu: 4K. TV makers like Sony, LG, and Samsung were already pushing 4K sets hard, but I always thought, “What’s the point when there’s no real 4K content yet?” Now, with YouTube finally offering it, I had my excuse to cave in to all that advertising.
I clicked 4K, watched the spinner for a moment… and then my browser froze. I shut down other tabs, closed every background app, and tried again. A couple of frames played before everything ground to a halt and the tab crashed. Not one to give up easily, I downloaded “Big Buck Bunny” in 4K from the Blender Foundation and popped open VLC.
A ten-minute clip, a gigabyte worth of pixels—buffering. One or two seconds of playback, then another pause to catch up. And this was on my trusty Dell desktop with an Intel i5. More than capable of handling spreadsheets, web development, and just about everything I threw at it… except 4K video.
Fast forward about a decade to a Raspberry Pi 5, a tiny $60 computer drawing under 10 watts. It breezes through 4K with zero stutter. What gives? Moore’s Law alone can’t account for that kind of leap: this little board doesn’t have 16x more transistors than my old desktop CPU.
The real secret sauce here is hardware acceleration: purpose-built video decode blocks, specialized instruction sets, and system-on-chip designs that offload and parallelize the heavy lifting.
Myth of Raw Power
Growing up, the key metric I followed on Intel hardware was raw power: the clock speed, listed in GHz on the spec sheet. Whenever a new CPU came out, I looked at its maximum frequency to build a mental model of how much better it was than the previous generation. But this is a trap.
Take my old desktop: an Intel i5 with 4 physical cores and a clock speed above 3GHz. On paper, it has more raw power than a tiny Raspberry Pi 5 or a smartphone. But when it comes to something like playing 4K video, or even running a complex JavaScript benchmark, those so-called underpowered devices run circles around it.
1. General-Purpose vs Specialized Performance
Old-school CPUs like the i5 are general-purpose workhorses. They’re designed to do everything reasonably well: compile code, render web pages, run Excel spreadsheets, maybe even game a little. But they’re not optimized to do any one thing incredibly well.
In contrast, devices like the Raspberry Pi or an iPhone are built with specific workloads in mind. The Pi is built around a system-on-a-chip (SoC) that includes a dedicated video decoder, so it can play 4K video while the CPU stays mostly idle. It’s not doing more work. It’s doing different work, with the right hardware.
2. More Cores ≠ Faster Speed
Here’s another classic misunderstanding: more cores don’t mean faster performance, at least not for everything. Many tasks, like video playback or executing a single-threaded JavaScript function, don’t scale well across cores. So while your desktop might have more cores, if a task only uses one, what matters most is how fast and efficient that one core is.
This is where Apple’s single-threaded performance on iPhones blows people’s minds. Their custom ARM cores are so optimized for this kind of task that they beat out many desktop CPUs in benchmarks designed to test just one core at a time.
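To put a number on why extra cores stop helping, here is a minimal sketch using Amdahl’s law, the standard formula for the best-case speedup when only part of a task can run in parallel. The function name and the fractions below are my own illustrative choices, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Best-case speedup when only part of a task can use extra cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always takes the same time; only the rest is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only, not measurements.
    for fraction in (0.0, 0.5, 0.9):
        for cores in (1, 4, 16):
            speedup = amdahl_speedup(fraction, cores)
            print(f"parallel={fraction:.0%} cores={cores:2d} speedup={speedup:.2f}x")
```

A fully single-threaded task (fraction 0.0) gets no benefit from any number of cores, and a task that is half serial can never reach 2x no matter how many cores you add.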
3. Power Budget Makes a Huge Difference
And then the elephant in the room. Power. A desktop CPU might run at 65W or more, generating a lot of heat that has to be managed by fans, heatsinks, and airflow. Devices like the Pi or iPhone operate under a 5–10W power budget. They’re designed to be efficient first, not just fast.
This means all their performance gains have to come from smart design (smarter instructions, lower-latency memory, and better workload distribution), not just from throwing more silicon and watts at the problem.
The Rise of Dedicated Hardware
If general-purpose CPUs were the Swiss Army knives of the early 2000s, today’s computing is more like a workshop full of specialized tools.
Let’s go back to our 4K video example. My old desktop tried to play Big Buck Bunny using its CPU and software decoding. It’s like trying to chop a tree down with a butter knife. Sure, the CPU can decode video, but that doesn’t mean it should.
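For a rough sense of the workload, here is some back-of-the-envelope arithmetic. The file size and duration come from my clip; the frame rate (30 fps), the pixel format (8-bit YUV 4:2:0, about 1.5 bytes per pixel), and treating a gigabyte as 10**9 bytes are assumptions I’m making for illustration.

```python
WIDTH, HEIGHT = 3840, 2160      # 4K UHD frame size
FPS = 30                        # assumed frame rate
BYTES_PER_PIXEL = 1.5           # assumed 8-bit YUV 4:2:0
FILE_BYTES = 1_000_000_000      # "a gigabyte", taken as 10**9 bytes
DURATION_S = 10 * 60            # ten-minute clip


def decoded_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Raw data rate the decoder has to produce, before any display work."""
    return width * height * bytes_per_pixel * fps


def compressed_bytes_per_second(file_bytes, duration_s):
    """Average data rate of the compressed file on disk."""
    return file_bytes / duration_s


if __name__ == "__main__":
    decoded = decoded_bytes_per_second(WIDTH, HEIGHT, FPS, BYTES_PER_PIXEL)
    compressed = compressed_bytes_per_second(FILE_BYTES, DURATION_S)
    print(f"pixels per frame:  {WIDTH * HEIGHT:,}")
    print(f"decoded stream:    {decoded / 1e6:.0f} MB/s")
    print(f"compressed stream: {compressed / 1e6:.2f} MB/s")
    print(f"expansion factor:  {decoded / compressed:.0f}x")
```

Under these assumptions the decoder has to turn under 2 MB/s of compressed data into roughly 373 MB/s of pixels, every second, without falling behind.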
Modern devices, like the Raspberry Pi or an iPhone, don’t even ask the CPU to decode 4K video. That job is handed off to a specialized block baked right into the SoC: a hardware video decoder. These are tiny silicon blocks optimized to process video streams with near-zero CPU usage. Think of them as little workers that only do one thing, but they do it really, really well.
Let’s say you’re watching a 4K video on a Raspberry Pi 5:
- Instead of streaming every frame through the CPU, the system routes the compressed video data straight to the H.265 (HEVC) hardware decoder (assuming the video is HEVC-encoded).
- This decoder handles decompression, frame reconstruction, color adjustment, and even timing—all in hardware.
- Meanwhile, the CPU sits back and sips power, ready for other tasks.
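The routing decision in the steps above can be sketched as a toy model. Everything here is hypothetical: `HARDWARE_CODECS`, `DecodePlan`, and `plan_decode` are names I made up, and a real player would ask the operating system or driver what the hardware supports rather than consult a hard-coded table.

```python
from dataclasses import dataclass

# Hypothetical capability table: codecs this imaginary device can decode
# in a fixed-function block. Real systems query the driver for this.
HARDWARE_CODECS = frozenset({"hevc"})


@dataclass(frozen=True)
class DecodePlan:
    codec: str
    path: str         # "hardware" or "software"
    cpu_heavy: bool   # True when the CPU does the per-pixel work


def plan_decode(codec: str, hardware_codecs=HARDWARE_CODECS) -> DecodePlan:
    """Use the hardware block when it supports the codec, else fall back."""
    name = codec.lower()
    if name in hardware_codecs:
        return DecodePlan(name, "hardware", cpu_heavy=False)
    return DecodePlan(name, "software", cpu_heavy=True)


if __name__ == "__main__":
    for codec in ("HEVC", "vp9"):
        print(plan_decode(codec))
```

The fallback branch is the situation my old desktop was in: no hardware path for the stream, so the CPU ends up doing all the per-pixel work itself.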
This separation of duties is what makes modern low-power devices so capable. And it's not just video. Dedicated hardware shows up all over the place:
- Neural engines on iPhones for running machine learning models without touching the CPU or GPU.
- Audio DSPs that handle noise cancellation and voice detection on smart speakers and earbuds.
- Network offloading that manages Wi-Fi and Bluetooth with minimal system overhead.
Dedicated hardware means better performance at lower power. You’re not just saving energy, you’re making room. If your CPU isn’t burning cycles decoding video or processing sensor input, it can focus on higher-level tasks, like running apps or managing the operating system.
It also makes hardware cheaper and cooler. A Raspberry Pi can run 4K video with no fan. Your desktop might need a mini turbine just to stay cool under load.
SIMD: Doing More With Every Tick
If dedicated hardware is about adding new tools to your toolbox, SIMD is about using the tools you already have way more efficiently.
SIMD stands for Single Instruction, Multiple Data. The idea is pretty simple: instead of telling your CPU to add two numbers together, why not add four pairs of numbers in one go?
Your CPU already knows how to do basic math: add, subtract, multiply, move data around. But with SIMD, it gets a vector unit, a special part of the chip that can handle wide chunks of data in parallel. For example, a regular instruction adds one pair of 32-bit numbers. A SIMD instruction can add several pairs at once: four with 128-bit registers (ARM’s NEON, Intel’s SSE) or eight with 256-bit registers (Intel’s AVX).
This pays off in media-heavy workloads like video decoding, audio processing, image manipulation, and even machine learning inference. Instead of processing each pixel or audio sample individually, SIMD handles chunks of them in parallel, dramatically speeding things up.
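Plain Python can’t issue SIMD instructions directly, so the sketch below is only a counting model: it tallies how many simulated "instructions" a scalar loop needs versus a vector unit with a given number of lanes. The function names and the default of 8 lanes (eight 32-bit values per instruction) are my own assumptions for illustration.

```python
def scalar_add(a, b):
    """Add element by element: one simulated instruction per pair."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def simd_add(a, b, lanes=8):
    """Add up to `lanes` pairs per simulated instruction."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result, instructions = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # One "instruction" covers the whole chunk, even a partial final one.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    a = list(range(20))
    b = list(range(20))
    scalar_result, scalar_count = scalar_add(a, b)
    simd_result, simd_count = simd_add(a, b)
    print("same answer:", scalar_result == simd_result)
    print("scalar instructions:", scalar_count)
    print("simd instructions:  ", simd_count)
```

Both versions produce the same sums; the vector version just gets there in far fewer steps, which is the whole point of SIMD.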
Why don't we use SIMD everywhere?
Here’s the fun twist: even though desktop CPUs like Intel’s Core i7 have powerful SIMD capabilities (SSE, AVX, etc.), they’re often underused. Why?
- Software has to be explicitly written or compiled to use SIMD.
- Many apps are built for cross-platform compatibility and don’t always take full advantage of SIMD instructions.
- On the flip side, ARM chips in phones and Raspberry Pis often use SIMD aggressively because they have to squeeze every bit of efficiency out of their lower-power hardware.
Apple’s A-series chips are a perfect example. Their NEON SIMD engine is highly tuned and heavily used throughout iOS—for video processing, AR, photo filters, even basic animations. That’s part of why an iPhone can feel so fast and responsive while sipping battery.
SIMD vs Dedicated Hardware
SIMD is still “software-controlled.” You get a speed boost, but you’re still running those instructions on a general-purpose CPU. With dedicated hardware (like a video decoder), the chip just takes over the task entirely.
But when there’s no specialized block—or when you’re doing a task that changes constantly—SIMD is the next best thing. It’s general enough to apply to lots of problems, but powerful enough to make them run 2x, 4x, even 8x faster.
Minding the Temperature
Smartphones, Raspberry Pis, even some laptops, can do amazing things without sounding like a jet engine. How is that possible? Your computer, no matter how powerful, is ruled by a simple constraint: its power budget. How much energy it can use, and how much heat it can afford to generate.
- My old desktop had a 400W power supply, and the CPU alone could draw 65–95W under load.
- A Raspberry Pi 5? It maxes out around 10W.
- An iPhone? Somewhere in the 5W or less range when it’s doing heavy lifting.
That low power draw comes back to the innovations we’ve talked about:
1. Dedicated Hardware = Less Work
When a chip has built-in hardware blocks for video decoding, image processing, or AI, the main CPU doesn't have to lift a finger. Less work means less heat. Think of it like having helpers do chores for you while you relax.
2. SIMD = Efficient Work
Even when the CPU is doing the work, SIMD helps it get more done per clock cycle. Instead of running in circles, it sprints smartly, processing multiple things at once. More work, fewer cycles, less power.
3. Smaller Chips, Shorter Paths
Modern SoCs (systems on a chip) like the ones in the Pi 5 or Apple’s A-series are incredibly compact. Shorter distances between components mean lower latency and less energy lost to signaling. Many also benefit from newer manufacturing nodes (like 7nm and 5nm), which pack more transistors into smaller spaces that use less power.
4. Smarter Thermal Management
Mobile chips are designed with temperature in mind from the start. They aggressively manage clocks and voltage, scaling performance up just enough for the task and then backing off immediately. This keeps them cool without sacrificing responsiveness.
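Why does scaling clocks and voltage help so much? A common first-order approximation says the dynamic (switching) power of a chip is proportional to voltage squared times frequency. The sketch below applies that rule; it ignores leakage and everything else, and the scale factors are illustrative numbers I picked, not figures from any real chip.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power relative to baseline, using P proportional to V^2 * f."""
    if voltage_scale <= 0 or frequency_scale <= 0:
        raise ValueError("scale factors must be positive")
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Illustrative operating points: (voltage scale, frequency scale).
    for v, f in ((1.0, 1.0), (0.9, 0.75), (0.8, 0.5)):
        power = relative_dynamic_power(v, f)
        print(f"V x{v:.2f}, f x{f:.2f} -> power x{power:.2f}")
```

Because voltage enters squared, halving the clock while dropping the voltage to 80% cuts dynamic power to about a third of the baseline in this model, which is why backing off quickly pays so well.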
5. Software Knows the Rules
Modern operating systems are smarter about power too. They know when to shift work from the CPU to a low-power DSP, when to dim the screen, when to suspend apps in the background. All this adds up to a system that performs better by doing less.
Does Moore’s Law Matter?
Jim Keller, legendary architect of AMD’s K7/K8/Zen cores, Apple’s A-series chips, Tesla’s Autopilot SoC, and Intel’s silicon roadmap, has a refreshingly broad view of Moore’s Law. In his 2020 appearance on the Lex Fridman podcast, he argued that Moore’s Law isn’t just about transistor counts, but about expectation:
“People think Moore’s Law is one thing—transistors get smaller—but under the sheets there’s literally thousands of innovations, each with its own diminishing-return curve. The result? What we experience as an exponential trend. If one curve plateaus, another kicks in, and the overall story keeps going.”
That anticipation of more capacity every two to three years profoundly shapes how architects design chips. As Keller put it:
“As you’re writing a new software stack, the hardware underneath keeps getting faster. That expectation influences every decision—if Moore’s Law will continue, you optimize one way; if it’s slowing, you take a different path.”
In other words, Moore’s Law is as much a planning tool as a physical phenomenon. It forces teams to manage growing complexity and to prepare architectures that can swallow twice as many transistors (and features) the next cycle around.
Moore’s Law endures not because it’s a law of physics, but because the industry treats it like one. The belief in regular capacity increases drives architects to design for tomorrow’s silicon today. And even if the classic two-year rhythm has slipped, that very expectation continues to spark the innovations, both big and small, that keep our devices steadily more capable.
In my search to understand why the old Dell couldn’t play a simple ten-minute, 1GB 4K file while the Pi 5 coasts through it, I learned a few things:
- Raw specs lie. GHz ratings, core counts, and gigabytes of RAM don’t tell the whole story. Real-world performance depends on how hardware and software work together.
- Dedicated hardware wins. Purpose-built video decoders, neural engines, DSPs, and network offload blocks handle specialized tasks at a fraction of the power, freeing the CPU to focus on the rest.
- SIMD scales smartly. Wide-vector instructions (NEON, AVX, etc.) let CPUs process many data elements in parallel, squeezing out massive speedups for media and signal processing.
- Efficiency is everything. Modern SoCs use smaller process nodes, aggressive thermal–power management, and software aware of hardware quirks to stay cool, quiet, and efficient.
- Moore’s Law drives design. Even as transistor-density gains slow, the expectation of regular improvements forces architects to innovate, layering new pipelines, packaging techniques, and materials to keep the curve climbing.
In the end, it’s not about having the biggest, hottest, most power-hungry machine. It’s about having the right machine—one that combines specialized blocks, clever instruction-level tricks, and holistic power management. That’s why my little Raspberry Pi can laugh in the face of 4K video while my old desktop coughs and splutters.
And that, dear reader, is the essence of modern computing: smarter design trumps brute force every time.