The cleanest way to understand hardware is to keep asking the same question at every scale: what is expensive here, and what structure reduces that expense? At the transistor level, the expense is unreliable analog behavior. At the gate level, it is ambiguity. At the processor level, it is time waiting on data. At the system level, it is moving information and burning power. Architecture is the shape left behind after engineers optimize around those costs.
1. Start with a switch
A transistor is not magic. It is a controllable device that can allow or block current flow. In modern digital chips, we mostly use transistors not as analog amplifiers but as tiny electrical switches. If the voltage on the control terminal crosses a threshold, the transistor turns "on" enough to conduct. If not, it stays "off" enough to block.
Real transistors are analog, noisy, temperature-sensitive devices. Digital design works because we stop treating every voltage as meaningful and instead reserve wide voltage regions for only two meanings: logical 0 and logical 1. The trick is not that nature is digital. The trick is that we build robust abstractions on top of analog physics.
From transistors to logic gates
Once transistors can pull a node up or down, we can arrange them into gates such as NOT, AND, OR, XOR, NAND, and NOR. These gates are small physical circuits that implement truth tables. A gate receives bits and, because of how its transistors are connected, settles to the correct output bit. For example:
NOT 1 = 0
NOT 0 = 1
AND(1,1) = 1
AND(1,0) = 0
OR(1,0) = 1
XOR(1,0) = 1
NAND and NOR are especially important because they are functionally complete: in principle, you can build any Boolean function out of enough of them. That means an enormous chip can still be understood as a very large composition of tiny yes-no decisions.
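Functional completeness is easy to check by hand for small cases. The sketch below is a software model, not a circuit description: it treats a single `nand` function as the only primitive and derives NOT, AND, OR, and XOR from it. The helper names (`not_`, `and_`, `or_`, `xor_`) are illustrative choices, not names from any library.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """The only primitive: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def not_(a: int) -> int:
    # Tying both NAND inputs together inverts the signal.
    return nand(a, a)


def and_(a: int, b: int) -> int:
    # AND is NAND followed by an inverter.
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    # De Morgan: a OR b == NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))


def xor_(a: int, b: int) -> int:
    # The classic four-NAND XOR construction.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


# Exhaustively compare each derived gate against Python's bitwise operators.
for a, b in product((0, 1), repeat=2):
    assert and_(a, b) == (a & b)
    assert or_(a, b) == (a | b)
    assert xor_(a, b) == (a ^ b)
for a in (0, 1):
    assert not_(a) == 1 - a
```

The same exercise works with NOR as the sole primitive. Real cell libraries still include many gate types because a dedicated gate is usually smaller and faster than its NAND-only equivalent.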
Visual: the abstraction ladder
This is the core progression of the whole article: each layer solves a problem the previous layer could not solve alone.
Visual: logic is a controlled funnel
Bits are not nature’s preferred representation. They are a deliberate simplification that makes large systems buildable.
2. Gates alone are not enough: you also need memory and time
If all you had were pure combinational gates, a chip would be a complicated calculator: inputs go in, outputs come out, and the circuit has no memory of what happened before. Real computation requires state. A machine must remember values, count steps, track whether a branch was taken, hold an instruction, or cache a recent result.
A flip-flop stores one bit. Put many together and you get a register. Put registers plus control logic together and you get a register file, a queue, or a pipeline stage. Put those next to arithmetic blocks and suddenly you have the skeleton of a processor.
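As a rough behavioral sketch of that build-up, the model below stores one bit per flip-flop, groups flip-flops into a register, and wraps the register with a small piece of "next value" logic to form a counter. The class names and the `tick` method are hypothetical; a call to `tick` stands in for one rising clock edge, and real hardware timing (setup, hold, propagation delay) is ignored.

```python
class DFlipFlop:
    """One bit of state. tick() models a rising clock edge capturing input d."""

    def __init__(self) -> None:
        self.q = 0

    def tick(self, d: int) -> int:
        self.q = d & 1
        return self.q


class Register:
    """A register is just `width` flip-flops that share a clock."""

    def __init__(self, width: int) -> None:
        self.bits = [DFlipFlop() for _ in range(width)]

    def read(self) -> int:
        return sum(ff.q << i for i, ff in enumerate(self.bits))

    def tick(self, value: int, enable: bool = True) -> int:
        # With enable low, the register holds its old value across the edge.
        if enable:
            for i, ff in enumerate(self.bits):
                ff.tick((value >> i) & 1)
        return self.read()


class Counter:
    """State (a register) plus combinational logic (an incrementer)."""

    def __init__(self, width: int) -> None:
        self.reg = Register(width)

    def tick(self) -> int:
        # Bits above `width` are dropped, so the count wraps around.
        return self.reg.tick(self.reg.read() + 1)


counter = Counter(width=3)
values = [counter.tick() for _ in range(9)]  # counts 1..7, wraps to 0, then 1
```

The point of the sketch is the split it makes visible: the incrementer has no memory, the register has no logic, and a useful machine appears only when the two are wired into a loop and stepped by a clock.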
How arithmetic appears
Addition is a great example of bottom-up design. A half adder adds two single bits and produces a sum and a carry. A full adder adds two bits plus an incoming carry. Chain full adders together and you can add 32-bit or 64-bit numbers. Layer multiplication, division, comparison, shifting, and fused operations on top of that and you get an arithmetic logic unit, or ALU.
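The chaining can be sketched in a few lines. This is a bit-level software model of a ripple-carry adder, the simplest way to chain full adders; production ALUs typically use faster carry schemes, so treat it as an illustration of the composition rather than of any real design. The function names are illustrative.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two bits. Returns (sum, carry)."""
    return a ^ b, a & b


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add two bits plus an incoming carry, built from two half adders."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2


def ripple_carry_add(x: int, y: int, width: int = 32) -> tuple[int, int]:
    """Add two unsigned `width`-bit numbers. Returns (result, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        # Each bit position waits for the carry from the position below it.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


assert ripple_carry_add(5, 7, width=8) == (12, 0)
assert ripple_carry_add(255, 1, width=8) == (0, 1)  # overflow sets carry_out
```

The loop makes the main weakness visible: the carry has to ripple through every bit position in sequence, which is why wide adders in practice spend extra gates on computing carries in parallel.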
So the answer to "how do chips work?" is partly "they compute Boolean functions," but the more useful answer is "they constantly move, transform, and remember bits according to a timed choreography."
3. Why real chips are not just giant piles of logic
At small scales, logic seems like the star. At large scales, wires, memory, and power take over. In advanced chips, moving a bit can cost as much as or more than computing on it. The hardest part of modern design is often not the math block itself, but feeding it data quickly enough and cheaply enough.
- Compute: ALUs, tensor units, schedulers
- Storage: registers, SRAM, caches
- Control: decode, branch, dispatch
- Communication: links, buses, NoCs, memory PHYs
This is why floorplanning matters. A chip is a physical object. Units that communicate heavily are kept near each other. High-bandwidth memories are placed carefully. Clock trees are balanced. Hot spots are managed. Routing congestion is real. If two designs have the same abstract algorithm but one requires long, energy-hungry wires and the other keeps data local, the second can win by a lot.
The hierarchy that shapes everything
- Registers are tiny and fast, but scarce.
- SRAM caches are larger and slower, but still close to compute.
- Off-chip DRAM is huge, but much slower and far more energy-expensive per access.
- Interconnects decide whether many blocks cooperate efficiently or sit idle waiting.
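One common way to reason about such a hierarchy is the average memory access time: each level costs its own latency, plus the cost of going one level further down for the fraction of accesses that miss. The sketch below implements that textbook recurrence. The latencies and hit rates in the example are made-up placeholders chosen only to show the shape of the effect, not measurements of any real chip.

```python
def average_access_time(levels: list[tuple[float, float]],
                        backing_latency: float) -> float:
    """Average cycles per access for a stack of caches.

    levels: (latency_cycles, hit_rate) pairs, closest to compute first.
    backing_latency: cycles for the final level (for example off-chip DRAM),
    which is assumed to always hit.
    """
    time = backing_latency
    # Work outward-in: each level adds its latency and passes misses down.
    for latency, hit_rate in reversed(levels):
        time = latency + (1.0 - hit_rate) * time
    return time


# Hypothetical numbers, for illustration only.
good_locality = average_access_time([(4, 0.95), (30, 0.80)], backing_latency=300)
poor_locality = average_access_time([(4, 0.60), (30, 0.50)], backing_latency=300)
```

With these placeholder inputs the poor-locality case comes out roughly nine times slower per access, even though the hardware is identical. The workload's reuse pattern, not the compute units, sets the pace.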
Once you see hardware this way, architectural diversity stops looking arbitrary. Different machines are different compromises in the fight among parallelism, latency, bandwidth, flexibility, and energy.
Visual: a chip floorplan is a traffic map
Past a certain scale, wires and memory placement shape performance at least as much as the compute units themselves.
Visual: why specialization keeps happening
We did not move beyond ever-bigger monolithic CPUs out of boredom. Power density and data movement forced new forms.
Dark silicon, thermals, and why "just make it bigger" stops working
A crucial modern constraint is that transistor counts can keep rising even when usable power density does not. This creates the dark silicon problem: not every transistor on a large chip can be driven hard at the same time without violating thermal or power-delivery limits. That is one reason the industry stopped getting free wins from simply increasing clock speed and building ever more aggressive general-purpose cores.
Specialization is therefore not only about elegance or performance. It is also about staying within a power envelope. A well-targeted accelerator can deliver more useful work per joule than lighting up a large block of general-purpose machinery to do the same task awkwardly.
Packaging is now part of architecture
This is also where modern packaging enters the story. Chiplets, silicon interposers, stacked memory, and high-bandwidth memory are not side details. They are architectural tools for reducing the penalty of distance. If a monolithic die becomes too large, too yield-sensitive, or too bandwidth-starved, designers increasingly split systems into better-shaped pieces and reconnect them with faster, denser links.
In that sense, floorplanning has expanded into package planning. The old question was "where should blocks go on this die?" The new question is often "which blocks deserve their own die, their own memory stack, or their own lane on the package substrate?"
4. Why CPUs, GPUs, ASICs, accelerators, TPUs, and FPGAs look so different
CPU: optimized for low-latency, unpredictable work
A CPU is the generalist athlete of computing. It is built for tasks where the next instruction depends on the previous result, where memory accesses may be irregular, and where branches may go all over the place. That makes CPUs control-heavy. They spend a lot of transistor budget on branch prediction, out-of-order execution, speculation, caches, rename logic, and sophisticated scheduling.
Why CPUs look this way
Because single-thread latency matters. CPUs try very hard to keep a few instruction streams moving even when programs are messy.
What CPUs sacrifice
Area and energy efficiency per arithmetic operation. A lot of die space goes to making irregular work run fast.
GPU: optimized for throughput on many similar operations
A GPU starts from a different assumption: instead of a few unpredictable threads, assume you have thousands of operations that are similar enough to run together. Graphics made this natural first, because many pixels and vertices go through nearly the same math. AI later loved GPUs for the same reason: matrix operations expose huge amounts of data parallelism.
So GPUs devote much more area to arithmetic lanes and much less to per-thread control sophistication. Rather than making one thread blisteringly fast, they keep many threads in flight and hide memory latency by switching among them. The shape of a GPU says, "I expect abundant parallel work; give me throughput."
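The latency-hiding argument can be made concrete with a deliberately simple model. Assume every thread alternates between a fixed burst of compute and a fixed memory stall, and that switching between threads is free. Both are simplifications, and the function names and cycle counts below are hypothetical.

```python
def threads_to_hide_latency(compute_cycles: int, stall_cycles: int) -> int:
    """Minimum resident threads so the arithmetic units never sit idle.

    While one thread stalls, the others must supply at least `stall_cycles`
    of compute, so (n - 1) * compute_cycles >= stall_cycles.
    """
    return -(-stall_cycles // compute_cycles) + 1  # ceiling division, plus one


def utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the arithmetic units are busy, capped at 1.0."""
    busy = threads * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, busy / period)


# Illustrative case: 10 cycles of math per 90 cycles of memory wait.
single = utilization(1, compute_cycles=10, stall_cycles=90)
needed = threads_to_hide_latency(compute_cycles=10, stall_cycles=90)
hidden = utilization(needed, compute_cycles=10, stall_cycles=90)
```

In this toy case a single thread keeps the arithmetic units busy 10% of the time, while ten resident threads keep them fully busy. The memory did not get faster; the machine simply always has something else ready to run, which is why thread context storage takes up so much of a GPU's area.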
ASIC: optimized for one problem family extremely well
An application-specific integrated circuit hardwires a narrower set of assumptions. If you know the workload in advance, you can remove huge amounts of general-purpose overhead. Maybe you know the dataflow, precision, buffer sizes, operation mix, or communication pattern. Then the chip can become drastically more efficient than a CPU or GPU for that target.
The catch is obvious: flexibility drops. An ASIC wins when the workload is important enough, stable enough, and high-volume enough to justify a custom physical implementation.
Accelerator: a broad category built around a hotspot
"Accelerator" is the umbrella term. It usually means a block or device designed to offload a specific hot region of computation from the CPU. Video codecs, cryptography engines, NPUs, packet processors, ray tracing blocks, and AI inference engines are all accelerators. The common pattern is simple: find something that is frequent, expensive, and structured enough to specialize.
TPU: a matrix-oriented ASIC shaped by AI dataflow
A TPU is a specific kind of AI accelerator whose identity comes from a very strong bet: a large share of useful machine learning can be expressed as dense linear algebra plus simple surrounding operations. If that bet holds, you want big matrix-multiply fabrics, carefully staged on-chip memory, deterministic dataflow, and enough bandwidth to keep tensor units busy.
In other words, a TPU-like design says the center of gravity is not speculative control flow. It is moving tensors through multiply-accumulate arrays with ruthless efficiency.
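To give a feel for what "moving tensors through multiply-accumulate arrays" means, here is a small cycle-by-cycle simulation of a systolic array. It uses an output-stationary layout (each cell owns one output element) because that variant is the simplest to follow; this is an assumption made for teaching purposes and not a description of how any particular TPU is wired. Square matrices are assumed, and all names are illustrative.

```python
def systolic_matmul(a: list[list[int]], b: list[list[int]]):
    """Multiply two n x n matrices on a simulated n x n grid of MAC cells.

    Rows of `a` enter from the left and columns of `b` enter from the top,
    each skewed by one cycle per row or column. Every cycle, each cell
    multiplies the two values passing through it and adds to its own sum.
    Returns (product, cycles_used).
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]     # one accumulator per cell
    a_reg = [[0] * n for _ in range(n)]   # values flowing left to right
    b_reg = [[0] * n for _ in range(n)]   # values flowing top to bottom
    cycles = 3 * n - 2                    # time for the last operand pair to arrive

    for t in range(cycles):
        for i in range(n):
            # Shift row i one cell to the right, then feed the left edge.
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < n else 0
        for j in range(n):
            # Shift column j one cell down, then feed the top edge.
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < n else 0
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
assert product == [[19, 22], [43, 50]]
```

Notice what is absent: no cell fetches an instruction, computes an address, or predicts a branch. Each operand enters the array once and is reused by every cell it passes through, which is the locality that matrix-oriented hardware is built to exploit.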
FPGA: optimized for post-fabrication reconfigurability
An FPGA looks different because its purpose is different. Instead of fixing the circuit permanently at manufacturing time, it contains configurable logic blocks, programmable interconnect, and embedded resources like DSP slices and RAMs. You can rewire the hardware after fabrication.
That makes FPGAs incredibly valuable when you need custom pipelines, unusual interfaces, low-volume special hardware, or rapid iteration without paying for a new chip tape-out. But configurability costs area, speed, and power efficiency compared with a custom ASIC. Reprogrammability is not free; it is paid for in overhead.
| Architecture | Best at | Main design bet | Main cost |
|---|---|---|---|
| CPU | Irregular, branchy, latency-sensitive code | Smarter control can rescue messy workloads | Lower efficiency per operation |
| GPU | Massively parallel throughput | Many similar threads can hide latency | Less ideal for serial or highly irregular work |
| ASIC | One domain at extreme efficiency | Known workload justifies fixed hardware | Poor flexibility and high non-recurring engineering (NRE) cost |
| Accelerator | Offloading a hotspot | Specialize only where it matters | Integration complexity and limited scope |
| TPU | Tensor-heavy ML compute | Dense linear algebra dominates | Narrower operating envelope |
| FPGA | Custom data paths with flexibility | Hardware should be rewritable | Efficiency overhead versus ASIC |
The deeper pattern: what changes is where the chip spends transistors
This is the unifying answer. Every architecture is made from the same underlying ingredients: transistors, wires, memory, and timing. What changes is the allocation. CPUs spend more on control. GPUs spend more on replicated arithmetic and thread context. TPUs spend more on matrix fabrics and local buffering. FPGAs spend more on programmability. ASICs spend more on the exact path their target workload needs and less on everything else.
Visual: where the transistor budget goes
These bars are conceptual rather than numeric, but they capture the central pattern: every architecture is a different spending plan for the same physical currency.
5. Why AI hardware keeps converging on dataflow and locality
AI workloads make one lesson painfully clear: arithmetic density alone is not enough. If weights, activations, and KV caches keep bouncing out to expensive memory, the compute units starve. That is why so much AI hardware design is really about memory hierarchy, systolic dataflow, on-chip SRAM capacity, reduced precision formats, and interconnect topology.
This is also why the line between "GPU" and "AI accelerator" keeps blurring. Once AI becomes central, GPUs acquire tensor cores and larger shared memories, while accelerators borrow lessons from GPU scheduling, packaging, and software stacks. The market categories differ, but the physics keeps forcing everyone toward the same core questions: where is the data, how often can it be reused, and how expensive is the next byte?
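The reuse question can be illustrated with a tiled matrix multiply that counts how many words it pulls from "far" memory. The model below is a sketch under simple assumptions: a tile of each input is loaded on chip once per use, output accumulators stay on chip for free, the matrices are square, and the tile size divides the matrix size evenly. The function name and the counting scheme are illustrative.

```python
def tiled_matmul(a: list[list[int]], b: list[list[int]], tile: int):
    """Multiply n x n matrices tile by tile. Returns (product, words_loaded)."""
    n = len(a)
    assert n % tile == 0, "sketch assumes the tile size divides n"
    c = [[0] * n for _ in range(n)]
    words_loaded = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Model one transfer of an A tile and a B tile into on-chip memory.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                words_loaded += 2 * tile * tile
                # Every loaded word is now reused `tile` times from local storage.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, words_loaded


n = 4
a = [[i * n + j for j in range(n)] for i in range(n)]
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

for tile in (1, 2, 4):
    result, words = tiled_matmul(a, identity, tile)
    assert result == a                      # same answer at every tile size
    assert words == 2 * n ** 3 // tile      # traffic shrinks as tiles grow
```

The arithmetic is identical in all three runs: 64 multiply-accumulates. Only the traffic changes, from 128 words with no reuse down to 32 when the whole problem fits on chip. In this model, off-chip traffic is inversely proportional to tile size, which is why on-chip SRAM capacity carries so much weight in these designs.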
Visual: what AI hardware is trying to do all day
This is why locality, systolic flow, tiling, quantization, and SRAM capacity dominate so many AI hardware conversations.
6. Why the human brain looks so different from all of them
The brain is not a digital chip in the ordinary engineering sense. Neurons are slow, noisy, adaptive, electrochemical devices. Spikes are sparse and event-driven. Learning changes the structure over time. Memory and computation are deeply entangled rather than cleanly separated into ALUs here and DRAM over there.
Yet the brain obeys the same meta-rule as every architecture above: its shape reflects the costs it had to optimize against. Biology faced radically different constraints than semiconductor engineering.
Silicon priorities
Precise timing, repeatability, high-speed switching, exact arithmetic, deterministic manufacture, dense short-range wiring.
Biology priorities
Self-repair, development from local rules, extreme energy frugality, robustness to noise, adaptation, and learning in a changing world.
A modern transistor can switch in picoseconds to nanoseconds. A neuron fires on millisecond timescales. By raw switching speed, silicon wins easily. But the brain is not trying to maximize clock frequency. It draws its capability from massive parallelism, local learning, event sparsity, and a representation scheme that tolerates noise gracefully.
Put more sharply: silicon mostly scales temporally. It tries to make each device switch very fast, coordinate those switches precisely, and squeeze many sequential steps into a short time window. Biology scales much more spatially. It relies on enormous numbers of relatively slow units and an even larger number of synaptic connections, accepting low-frequency operation in exchange for a colossally parallel, adaptive structure.
Silicon's strategy: separate compute, memory, and control enough to reason about them cleanly, fight noise, and synchronize with clocks.
Biology's strategy: co-locate storage and adaptation at synapses, exploit analog dynamics, fire only when useful, and let learning reshape the circuit.
The two are optimizing different objective functions under different materials, fabrication methods, and failure models.
That is why neuromorphic hardware exists as a research direction at all. Engineers keep noticing that some aspects of intelligence might benefit from more event-driven, local-memory, lower-precision, brain-inspired structures than conventional von Neumann machines offer.
Visual: silicon scales in time
Visual: the brain scales in space
7. The big synthesis
Logic gates explain how digital hardware can represent and transform information. But they do not by themselves explain why modern compute devices look so different. The reason is that architecture is the art of spending a finite transistor, wire, memory, area, and power budget on the bottlenecks that matter most for a given workload.
- If your world is branchy and unpredictable, build something CPU-like.
- If your world contains oceans of parallel arithmetic, build something GPU-like.
- If your workload is stable and valuable enough, build an ASIC.
- If your hotspot is narrow but important, add an accelerator.
- If dense matrix math dominates, shape the machine into TPU-like tensor dataflow.
- If you need hardware that can change after deployment, choose FPGA-style reconfigurability.
- If you need lifelong adaptation under brutal energy constraints, nature points toward something more brain-like.
So the shortest honest answer to the title question is this: chips work by arranging switches into logic, logic into stateful machines, and stateful machines into data-moving systems. And all the strange-looking compute architectures around us are really just different answers to the same design problem: what should be fixed, what should be flexible, what should be local, and what should be parallel?
8. A compact mental model to keep
| Level | Question to ask | What matters most |
|---|---|---|
| Transistor | Can I control current reliably? | Noise margins, switching behavior |
| Gate | Can I express a Boolean rule? | Truth tables, composition |
| Sequential block | Can I remember and update state? | Clocks, latches, registers |
| Core | Can I keep useful work in flight? | Scheduling, control, locality |
| Chip | Can I feed compute fast enough? | Memory hierarchy, interconnect, thermals |
| Architecture | What bottleneck am I built around? | Workload structure and energy budget |
If you remember only one thing, remember this: hardware categories are not arbitrary brand labels. They are physical opinions about which kind of work deserves the most silicon.