Writings

Technical essays, explained with enough depth to be useful.

Long-form writing across AI infrastructure, memory systems, local agent runtimes, and accelerator architecture. This section is where project ideas and patent-adjacent concepts get room to breathe as essays rather than just landing-page summaries.

Current essays

198 essays · updated
RL Inference, GB300, Close-to-Metal

25 min read

Close-to-Metal RL Inference at GB300 Scale

A deep systems-level reference on close-to-metal RL inference at GB300 scale: persistent decode workers, hugepage-backed KV arenas, GPU/NIC command rings, NUMA locality, TLB/IOMMU reduction, cache coherency, and async reward handoff — with full C code.

Advanced Packaging, Hybrid Bonding, GPU Fabrics

18 min read

GPU-to-GPU Hybrid Bonding: From Multi-GPU Communication to Multi-Die GPU Fabrics

A systems-level analysis of GPU-to-GPU hybrid bonding, advanced packaging, chiplet GPUs, AI accelerator fabrics, and the future of multi-die GPU architecture — why the next scaling curve depends less on raw FLOPs and more on how cheaply silicon can talk to silicon.

KV Cache, Attention Kernels, Agentic Inference

18 min read

Intent Attention Kernel: Intent-Aware KV Execution for Agentic Long-Context Inference

A systems-level essay on why long-context attention should not pretend every KV block is equally useful — semantic block metadata, IntentQuant-KV precision policies, and what runtime-intent-aware KV execution looks like in the long-context regime.

MLCC, Components, Markets

25 min read

MLCC: The Invisible Backbone of Modern Electronics

A complete ground-up guide to the world's most essential passive component — from ceramic dielectric physics and manufacturing to a $22B market shaped by AI servers, electric vehicles, and geopolitical supply shocks.

HBM, Thermal Engineering, Packaging

13 min read

SK Hynix's iHBM: Inside the Technology Redefining AI Memory Cooling

A systems-level analysis of SK Hynix iHBM, HBM thermal bottlenecks, MR-MUF packaging, HBM5 power density, and why memory cooling is becoming strategic.

NVIDIA, AI Factory, Control Plane

10 min read

NVIDIA Vera and the Control Plane of the AI Factory

A systems-level analysis of NVIDIA Vera, Rubin, and why the AI factory is becoming an orchestration and memory-movement problem, not just a GPU story.

Semiconductors, Packaging, Tau Scaling

14 min read

Huawei's Tau Scaling Law and the Beginning of Distance-Centric Computing

A systems-level analysis of Huawei's Tau Scaling Law, LogicFolding, distance-centric computing, verification complexity, and why test and packaging infrastructure may matter more.

AI Fundamentals, LLM Tokens, Context Window

8 min read

What Is a Token? The Atom of AI Language

A clear explainer of what tokens are, how tokenization works, why context windows and pricing depend on them, and why they shape model behavior.

Disaggregated Memory, DRAM Rack, KV Cache

12 min read

Disaggregated Memory for Long-Context AI Systems

A vendor-neutral reference architecture for standalone DRAM racks serving long-context AI systems, from fabric topology and KV-cache offload to failure modes and deployment criteria.

KV ASIC, Chip Architecture, DMA · CXL · HBM

18 min read

KV ASIC Part 3: The Chip

A deep technical architecture essay opening up the proposed KV infrastructure ASIC: floorplan, DMA engines, metadata SRAM, compression, prefetching, timing paths, and cluster topology.

KV Cache, Inference Infrastructure, HBM · CXL

15 min read

Why KV Cache Needs Dedicated Infrastructure

A systems architecture essay on why KV cache has become first-class infrastructure for long-context inference, and why generic memory systems are not enough.

GQA · MQA, KV Cache, LLM Inference

11 min read

GQA and MQA: Attention, KV Cache, and Faster LLM Inference

A ground-up guide to Multi-Query and Grouped-Query Attention, why they shrink KV cache memory, and how they make long-context LLM inference cheaper and faster.

CXL Memory, KV Cache, AI Inference

16 min read

CXL and the Search for a New Memory Tier in AI Inference

A systems essay on where CXL fits in the AI memory hierarchy, from GPU HBM and warm KV cache tiers to CPU DRAM, NVMe, and long-context serving pressure.

KV Cache, ASIC Design, CXL · HBM

20 min read

Beyond CXL: The Case for a KV-Aware Memory ASIC

A systems architecture essay on why generic CXL memory is the wrong abstraction for transformer inference, and how semantic KV-aware memory control could reduce HBM pressure.

Retimers, PCIe · CXL, AI Infrastructure

20 min read

Retimers: The Hidden Signal Infrastructure of the AI Chip Era

A systems-level guide to why retimers matter in AI servers, from PCIe 5/6 and CXL signal integrity to vendor positioning, active cables, optics, and the bear case.

Part 2, Inference Memory, NUMA · PCIe · GPU

22 min read

Part 2: How Memory Timings Are Actually Configured in Modern Systems

Part 2 extends the memory-timings deep dive into inference performance: Linux runtime tuning, hugepages, NUMA binding, PCIe topology, GPU placement, and verification commands.

Linux Memory, DRAM Timings, NUMA · CXL · HBM

18 min read

How Memory Timings Are Actually Configured in Modern Systems

A practical deep dive into how modern systems configure memory timings across BIOS, firmware training, Linux runtime tuning, NUMA placement, CXL tiering, and HBM locality.

GPU Kernels, AI Agents, Compiler Systems

10 min read

The Rise of AI Kernel Engineers

Autonomous agents are starting to design, benchmark, debug, and optimize the low-level CUDA and Triton kernels that determine real AI system throughput.

Semiconductors, Computer Architecture, AI Hardware

19 min read

How Chips Actually Work: From Logic Gates to GPUs, ASICs, TPUs, FPGAs, Accelerators, and the Brain

A ground-up guide to how transistors become logic, logic becomes stateful machines, and architecture diverges into CPUs, GPUs, ASICs, TPUs, FPGAs, AI accelerators, and brain-like systems.

AI Accelerators, Memory Fabric, Inference

8 min read

Zhenwu M890: Why Alibaba Bet on Memory Fabrics, Not Just FLOPS

A systems architecture analysis of Alibaba Zhenwu M890, explaining why AI infrastructure is shifting from tensor FLOPS to memory orchestration, HBM, KV cache, and rack-scale fabrics.

Linux Kernel, Board Bring-Up, Accelerators

21 min read

First Silicon Is Not the Finish Line: Linux Board Bring-Up for x86, Arm, and Accelerators

A systems engineering guide to Linux board bring-up for x86, Arm, and accelerators, covering firmware, ACPI, Device Tree, drivers, DMA, IOMMU, runtimes, and debugging.

HBM, DDR5, AI Memory Wall

12 min read

HBM, DDR, SODIMM and the AI Memory Wall

A systems-level deep dive into HBM, DDR5, SODIMM, LPDDR, CXL, KV-cache pressure, packaging IP, and the future AI memory wall.

Inference, Memory Wall, KV Cache

9 min read

The Quadrillion-Token Era Has Arrived

A systems deep dive into the quadrillion-token era, KV-cache memory pressure, prefill/decode disaggregation, token warehouses, and the AI memory wall.

Disaggregated Memory, CXL, AI Infrastructure

16 min read

Disaggregated Memory — The Architecture Reshaping AI Infrastructure

A technical deep dive into disaggregated memory for AI infrastructure, covering CXL, NVLink, memory pooling, vendor ecosystems, latency, and data-center design.

CPO, Silicon Photonics, AI Networking

16 min read

Why Co-Packaged Optics Is Crucial in Semiconductors

A technical essay on why co-packaged optics matters for AI semiconductors, bandwidth density, switch power, optical I/O, and data-center fabric scaling.

Scaling Laws, Pretraining, LLM Research

17 min read

Pretraining Scaling: The Engine Behind Modern AI

A deep technical explainer on scaling laws, Chinchilla compute-optimal training, data limits, inference economics, and why modern AI progress was predictable enough to plan around.

KV Cache, Memory Systems, Patent Concept

25 min read

GhostKV: Attention Without Storing Full KV

A technical concept essay on query-time bounded elimination, semantic witnesses, and reconstruction-first KV memory for long-context transformer inference.

Memory, Orchestration, Inference

22 min read

The Memory Wall Is Moving

A deep technical essay on why AI inference is shifting from a compute bottleneck to a memory-bandwidth and orchestration problem.

Patent, SRAM, Transformer Inference

16 min read

The SRAM Insight: How I Built a Patent Around AI's Oldest Memory Habit

A patent-backed essay on attention-sink-aware SRAM placement, explaining how sink-token behavior can reroute the hottest KV state into faster on-chip memory.

SRAM, Heterogeneous Inference, Decode Bottleneck

17 min read

SRAM as the Specialist: Heterogeneous Inference and the Decode Bottleneck

A deep technical essay on why SRAM-heavy accelerators fit decode-side FFN and MoE paths, and how heterogeneous inference reframes the AI chip memory debate.

MFU, BF16 · FP8 · FP4, Training Systems

18 min read

MFU, BF16, FP8 and AI Numeric Formats: The Complete Guide

A complete guide to model FLOPs utilization (MFU), low-precision formats, tensor-core throughput, and how to read AI training efficiency claims without getting fooled by the denominator.

PagedAttention, KV Cache, Memory Systems

19 min read

The Ideal PagedAttention Stack: Hardware + Software for Long-Context Inference

A systems essay on KV paging, FlashAttention-style tiling, SRAM-sized decompression windows, cold-memory prefetch, and the hardware long-context inference really wants.

FlashDecode, Long-Context Inference, Kernel Analysis

17 min read

Four Things Nobody Has Written About FlashDecode

An original technical analysis of FlashDecode covering kernel crossover math, reduction inversion, CXL asymmetry, and runtime switching for long-context inference.

KV Cache, TurboQuant, Transformer Memory

19 min read

KV Cache, Transformer Memory, and Why TurboQuant Matters

A deep technical essay on KV cache growth, tiled attention, outliers, and why long-context inference is increasingly a memory systems problem rather than just a compute problem.

AI Compiler, Memory Scheduling, Deterministic Inference

22 min read

The Compiler Becomes the Memory Scheduler

Why runtime dynamism is becoming too expensive for AI inference, and how compilers are evolving into full memory traffic planners across DMA, tensor placement, and deterministic execution windows.

KV Cache, Attention Memory, Inference Systems

18 min read

The Next AI Bottleneck Isn't FLOPs. It's Attention Memory.

A systems-level deep dive into KV cache growth, attention memory, and why the next AI architecture war is rapidly becoming a memory war.

CXL, Coherence, AI Infrastructure

20 min read

Coherent Fabrics: The Memory Highways Behind Agentic AI

A deep technical essay on coherent fabrics, CXL, NVLink, cache coherence, and why shared-state coordination is becoming the hardware bottleneck for agentic AI systems.

AI Infrastructure, Memory Systems, CXL

18 min read

The Memory Wall Is the New Compute Wall

A deep technical essay on how CXL, coherent memory, SmartNICs, GPUDirect Storage, unified memory, and KV-cache routing all converge on the same core AI infrastructure problem: moving state efficiently.

HBM · DDR5, Memory Placement, Inference Serving

14 min read

Host RAM vs HBM for Inference: What Really Lives Where

Device-local KV, host spill, pinned buffers, DMA paths, and the physical constraints that determine what your serving stack should put where — and why getting it wrong costs 10× on tail latency.

AI Infrastructure, Memory Systems, CXL

18 min read

The Memory Wall — How AI Infrastructure Is Being Rewired From the Ground Up

A technical deep-dive into how CXL, SmartNICs, GPUDirect Storage, and unified memory are solving the memory movement bottleneck in modern AI systems.

KV Cache, Memory Systems, Inference

18 min read

KV Cache Is a Memory System

Attention indexing, VRAM, HBM, huge pages, TLB pressure, and pinned memory explained from first principles for modern inference stacks.

India Patent, Inference Architecture, Memory Systems

12 min read

Why AI Inference Needs a Weight Delivery Architecture

A public-facing note on Indian patent application no. 202641059412 and the case for treating model-weight delivery as its own inference architecture problem.

CPU, Inference, Microarchitecture

16 min read

Why LLM Prefill and Decode Suffer from Cache Misses, Branch Stalls, and Pipeline Bubbles on CPUs

A systems-level explainer on how cache misses, branch stalls, and pipeline bubbles shape CPU inference performance in transformer prefill and decode.

Kernel, Memory, Control Plane

21 min read

/dev/mem_hint: A Kernel Control Plane for AI Memory Systems

A deep technical primer on /dev/mem_hint, the kernel-mediated hint path linking AI runtimes, PMU classifiers, MSR/MMIO/CXL conduits, and memory PHY policy engines.

Patent, Memory, AI Systems

14 min read

Teaching Computers to Remember Smarter

A patent essay on software-defined, workload-aware adaptive memory signaling for AI systems, letting runtimes tune memory timing, power, and latency by phase.

HBM, Memory, Residency

8 min read

HBM Is Not a Cache: What High-Bandwidth Memory Actually Does in AI Systems

A technical essay on why HBM is not a magical cache, but a scarce local working tier whose value depends on residency policy, bandwidth budgeting, and orchestration.

MoE, Networking, Topology

7 min read

MoE Is a Networking Problem Wearing a Model Costume

A systems essay on why sparse expert routing becomes a fabric, transport, and queueing problem long before it becomes a pure model-efficiency win.

Power, Scheduling, Runtime

6 min read

Power Is Becoming a Scheduling Constraint

A technical essay on why rack power, thermal headroom, and electrical state are becoming live runtime inputs for admission, placement, and useful-throughput control.

Inference, Queueing, Tail Latency

13 min read

Inference Is a Queueing System: Why AI Serving Lives or Dies by Arrival Curves, Batching, and Tail Latency

A systems essay on why serving quality is shaped by queue depth, prefill/decode asymmetry, batching policy, and memory-driven service-time variance.

Runtime, Scheduler, Control Plane

11 min read

The Scheduler Is the Product: Why AI Infrastructure Moats Are Becoming Policy Engines

A technical essay on why the deepest AI infrastructure advantage is moving into placement, admission, routing, and state-management policy.

Memory, Locality, Miss Penalty

11 min read

The Real Cost of a Miss: Unifying Cache Misses, KV Misses, Fabric Misses, and Remote Fetch in AI Systems

A memory-systems essay on how misses really behave in AI infrastructure, from HBM and KV locality to fabric fetch and byte-movement economics.

Memory, Rambus, CXL

12 min read

Rambus: The Hidden Bottleneck Bet in the AI Memory Supercycle

A deep technical essay on Rambus, from MRDIMM and interface chips to CXL, security IP, and why AI infrastructure is becoming memory-interface bound.

LLMs, DeepSeek, Open Weights

38 min read

DeepSeek V4: The Model That Refuses to Stop Shocking the World

A technical essay on DeepSeek V4, from hybrid attention and million-token context to pricing, Huawei inference infrastructure, and the geopolitics of open weights.

Semiconductors, Intel, Foundry

36 min read

Intel 18A & 14A: The Nodes That Could Rewrite Chip History

A technical essay on Intel 18A and 14A, from RibbonFET and PowerVia to High-NA EUV, foundry economics, packaging, and the race with TSMC.

ARM, CPUs, Data Center

29 min read

ARM Data Center CPU Report: Comprehensive Technical Analysis

A technical deep dive into Graviton4, Axion, Cobalt 100, AmpereOne, Grace, and Qualcomm across architecture, benchmarks, software, and platform tradeoffs.

Memory, LPDDR5X, AI Servers

7 min read

Rambus SOCAMM2: The Modular LPDDR Memory Layer for AI Servers

A systems-level primer on SOCAMM2, LPDDR5X server modules, Rambus support silicon, and where modular memory fits in the AI server hierarchy.

CPO, Laser Arrays, AI Networking

20 min read

High-Power Laser Arrays for Co-Packaged Optics: The Hidden Power Plant of AI Interconnects

A systems primer on high-power laser arrays, external laser sources, silicon photonics, packaging, and why CPO optics are becoming central to AI data-center networks.

Semiconductors, Packaging, Substrates

22 min read

The Substrate Layer: What Sits Between the Chip and Everything Else

A deep dive into semiconductor substrates, from ABF and BT to ceramic, glass, fan-out, and the advanced packaging stack underneath modern AI chips.

Storage, Linux, GPU Data Path

18 min read

The Brains Behind AI Storage, v2: Smart Controllers, Linux Drivers, and the Full Path from read() to GPU

A deep technical primer on smart storage controllers, Linux drivers, NVMe queues, DMA, IOMMU, and the full AI server path from read() to GPU HBM.

Storage, NVMe, Data Movement

15 min read

AI Storage Primer: From NAND Physics to GPUDirect and the Memory Wall

A technical primer on AI storage, from NAND and SSD controllers to GPUDirect Storage, HDD economics, HBF, and the memory wall in modern AI systems.

AI CPUs, CPU Architecture, AI Infrastructure

17 min read

Vera, Venice, AGI, Clearwater: The Coming Wave of AI CPUs

A technical essay on the next wave of AI CPUs: Intel Clearwater Forest, AMD Venice, Arm AGI, NVIDIA Vera, memory coherency, power, and host orchestration.

CPU, AI Infrastructure, Runtime Systems

18 min read

The CPU Is Back: Why AI Broke the GPU-Only Illusion

A technical essay on why CPUs are becoming central again in AI infrastructure: memory movement, GPU orchestration, agentic workloads, power, and runtime control.

AI Power, Accelerators, Power Delivery

18 min read

The AI Power Delivery Stack: How Modern Accelerators Are Really Powered, and What Each Vendor Actually Does

A technical guide to the AI power delivery stack from grid to chip, mapping Vicor, MPS, TI, ADI, Renesas, Infineon, Delta, Flex, Murata, Bel, Navitas, and Wolfspeed.

AI Power, 800V DC, Semiconductors

22 min read

The 800V DC Era Is Here — and It's Rewriting Who Builds AI Power

A technical essay on why NVIDIA and AI data centers are moving toward 800V DC power, and how SiC and GaN semiconductors reshape the infrastructure stack.

AI Chips, Logic Bridge, On-Package NUMA

16 min read

The Next-Generation AI Chip: Inside the Logic Bridge, Realistic Floorplans, and the Rise of On-Package NUMA

A technical essay on next-generation AI chip floorplans, logic bridges, HBM expansion, on-package NUMA, SRAM locality, and where optical interconnects fit.

AI Chips, HBM, Chiplets

18 min read

The Next-Generation AI Chip: From Flat HBM to an On-Package NUMA Fabric

A technical architecture essay on how AI accelerators may evolve from flat HBM layouts into topology-aware on-package NUMA fabrics with chiplets, SRAM layers, and optical links.

NVLink, Scale-Up Fabric, AI Networking

18 min read

NVLink Switch Is Not NVLink: The Scale-Up Fabric Architecture Nobody Fully Explains

A systems-level explanation of NVLink Switch, NVL72 scale-up fabrics, GPU coherence boundaries, and where scale-up ends and scale-out begins.

InfiniBand, 800G Ethernet, Scale-Out

20 min read

800G Ethernet vs. InfiniBand: The AI Scale-Out Decision Nobody Documents Honestly

A technical comparison of 800G Ethernet and InfiniBand for AI scale-out, covering congestion control, collectives, latency, cost, and operational tradeoffs.

CXL, Memory Systems, Protocols

19 min read

CXL Is Three Protocols in a Trenchcoat: What .io, .mem, and .cache Actually Do

A ground-up explanation of CXL.io, CXL.mem, and CXL.cache, and why CXL is really a family of protocols for coherent memory expansion.

NCCL, Collectives, GPU Clusters

21 min read

NCCL Internals: The Collective Communications Layer Nobody Reads — But Everyone Depends On

A deep technical walkthrough of NCCL internals, collective communication, rings, trees, channels, topology awareness, and why AI clusters depend on it.

AI Hardware, Test Economics, Semiconductors

16 min read

The Cost of Testing AI Chips Is Exploding

Why wafer probe, burn-in, ATE, high-speed I/O validation, thermal control, and advanced packaging are turning AI chip test into a first-order cost constraint.

AI Hardware, Semiconductor Test, Validation

18 min read

The Hidden Backbone of AI Hardware: How Chips Are Tested Before They Power the World

A practical map of wafer probe, burn-in, ATE, high-speed I/O validation, handlers, and system-level test for modern AI chips.

AI Compilers, MLIR, Triton

20 min read

The Compiler Is the New Kernel: Why MLIR/Triton/XLA Are the Most Underrated Layer in AI Infrastructure

Why MLIR, Triton, and XLA are becoming the control layer between AI hardware, kernel generation, memory policy, and inference efficiency.

AI Networking, Topology, Inference

22 min read

Fat-Tree Is Lying to You: Network Topology as a First-Class Inference Constraint

Why fat-tree assumptions break under MoE routing, prefill-decode disaggregation, KV transfer, and topology-aware inference scheduling.

Inference Economics, Tokens, AI Business

24 min read

The Inference Unit Economics Time Bomb: Why $/Token Will Collapse (and What Survives)

A first-principles look at token pricing, the physical memory-and-cooling cost floor, and which AI infrastructure moats survive.

CPO, Silicon Photonics, AI Networking

35 min read

Co-Packaged Optics (CPO): The End of Pluggable Transceivers — A Ground-Up Guide

A ground-up technical guide to co-packaged optics, pluggable transceiver limits, silicon photonics, and AI data-center fabric power.

Semiconductors, Equipment, Manufacturing

12 min read

The Machines That Build the Machines Pt. 3 — Beyond Lithography

A technical essay on the semiconductor manufacturing stack beyond lithography: deposition, etch, CMP, implant, metrology, clean, and subsystems.

ASML, EUV, Semiconductors

12 min read

Inside ASML: The Product Line That Prints the Future (and Why EUV Took 30 Years)

A technical deep dive into ASML lithography systems, from DUV and immersion to Low-NA EUV, High-NA EUV, and why manufacturable EUV took three decades.

Semiconductors, Equipment, TSMC

28 min read

The Machines That Build the Machines: Inside Lam, AMAT, ASML, KLA and TSMC's Node Race

A technical deep dive into the equipment stack behind advanced nodes: how Lam, Applied Materials, ASML, KLA, and TSMC fit together from etch and deposition to EUV and yield.

Infrastructure, GPUs, CPUs

6 min read

GPU-to-CPU Deployment Ratios in Modern Server Deployments

GPU:CPU planning is becoming workload-specific again. For long-context, retrieval-heavy, and orchestration-heavy inference, CPU capacity, memory bandwidth,...

Quantum, Simulation, Chemistry

10 min read

Quantum Series #5: Quantum Simulation — Why Chemistry and Pharma Actually Care

From a 70-year algorithm to medical AI: how quantum computing flips the resource game for molecules, materials, and drugs.

Training, Memory, Checkpointing

12 min read

Gradient Checkpointing Is a Memory Problem: Training Efficiency From First Principles

The inference memory essays in this series talk about KV cache, weight residency, and HBM pressure. Training has its own version of all three — and the...

MoE, Inference Cost, Routing

13 min read

The Router Tax: What Nobody Accounts For in MoE Inference Cost

Mixture-of-Experts models are sold as a 10× efficiency win: 10× more parameters, same per-token compute. That math is correct. But it omits the router — the...

LLM Inference, Memory Bandwidth, Decode

14 min read

Weight Streaming: Why Your Model's Weights Are the Real Decode Bottleneck

The KV cache gets all the attention. But for large models at low batch sizes, loading transformer layer weights from HBM dominates decode latency just as...

Quantum, NISQ, Algorithms

13 min read

Quantum Series #4: Why VQE/QAOA Don't Scale Yet — The Barren Plateau Problem

A thesis-driven essay on why VQE and QAOA hit barren plateaus, routing overhead, and noise walls long before meaningful NISQ advantage.

Quantum, Error Correction, QEC

9 min read

Quantum Series #3: Quantum Error Correction — Why 1 Good Qubit Costs 1,000 Bad Ones

A technical essay on surface codes, fault-tolerance thresholds, and the brutal redundancy required to turn noisy physical qubits into useful logical ones.

Quantum, Cryptography, Security

11 min read

Quantum Computing and Cryptography: What Breaks, What Survives, What Comes Next

A technical essay on which cryptographic systems quantum computers actually threaten, which survive, and how post-quantum migration should be understood.

Quantum, Algorithms, Computer Science

12 min read

Quantum Computing: Structured Interference at Scale

A technical essay on quantum computing as structured interference, covering superposition, entanglement, amplitude amplification, error correction, and where speedups actually come from.

LLMs, Architecture, Survey

10 min read

LLM Architecture Survey: From Transformers to Modern Variants

A technical survey of large language model architectures covering transformer backbones, attention variants, sparsity, MoE, normalization, positional encoding, and systems tradeoffs.

Quantum, Primer, Computer Science

15 min read

Quantum Computing: A Ground-Up Technical Primer

A ground-up technical primer on quantum computing covering qubits, gates, entanglement, circuits, error correction, and the path from theory to useful machines.

AI Infrastructure, Systems Synthesis, Memory Systems

17 min read

The Whole Stack: Everything Between a Transistor and a Token

A synthesis essay connecting transistor physics, memory hierarchies, inference systems, agents, and economics into one view of the AI stack.

AI Infrastructure, Model Architecture, MoE Systems

20 min read

Sparse MQA + Fused MoE Mega Kernel + Hyper-connections: The Trifecta Behind Modern Frontier Models

A technical essay on three systems ideas shaping frontier models: sparse MQA, fused MoE mega kernels, and hyper-connections as a throughput and efficiency stack.

AI Infrastructure, Networking, MoE Inference

14 min read

The PCIe Tax: How Host-Staged Networking Steals Half Your GPU Bandwidth

A technical essay on the host-side copy tax in inference systems, why PCIe crossings burn usable bandwidth, and how GPUDirect RDMA plus DPUs recover MoE throughput.

AI Hardware, HBM, Memory Architecture

16 min read

HBM Explained: How High Bandwidth Memory Is Actually Built

A technical essay on how HBM is physically constructed, from stacked DRAM dies and TSVs to silicon interposers, and why packaging physics sets the AI bandwidth ceiling.

GPU Architecture, AI Hardware, Tensor Cores

17 min read

Inside the GPU: SMs, Warps, Tensor Cores, and Why the Architecture Looks the Way It Does

A technical essay unpacking SMs, warps, tensor cores, register files, and why GPU design choices make sense for parallel workloads but strain during autoregressive decode.

Semiconductors, Process Nodes, Transistor Physics

17 min read

What "3nm" Actually Means: Process Nodes, Transistor Physics, and Why Scaling Is Slowing

A technical essay on what modern node names really mean, how FinFET and GAA devices extend scaling, and why Moore's Law is slowing for AI chip design.

Systems Architecture, CXL Memory, Long Context

12 min read

KV Fabrics: Treating Context as a Distributed Filesystem

A technical essay on turning KV cache into rack-scale infrastructure across HBM, CXL DRAM, and NVMe, with DPUs acting as the metadata control plane.

Inference, Memory, AI Systems

22 min read

The Memory Bottleneck: Why Inference Speed Is a Memory Problem, Not a Compute Problem

A technical essay on why AI inference latency is usually constrained by memory bandwidth and hierarchy design, not raw arithmetic throughput.

Systems Architecture, Agentic AI, Hardware Offload

12 min read

The DPU as Agent Memory Controller: Offloading Orchestration

Why the 78% CPU / 31% GPU problem in agentic inference isn't a software bug — it's an architecture problem. Using a BlueField-3 DPU to bypass the host and feed the GPU directly.

Systems Architecture, AI Infrastructure, Memory Systems

12 min read

Why Agentic AI Is a CPU and DRAM Problem, Not Just a GPU Problem

The center of gravity for AI inference is shifting from dense matmuls to stateful orchestration, context movement, and memory capacity. For three years, "AI scaling" meant buy more GPUs. Agentic AI breaks that model.

Systems Architecture, Hardware, Instruction Sets

8 min read

RISC vs CISC: Where Complexity Lives

RISC and CISC aren't processor brands — they're philosophies. One says make the hardware simple and push work to software. The other says make the hardware clever to make software easier. Modern chips use both ideas, but understanding the trade-off explains almost everything about why your phone's processor looks nothing like your laptop's from 1995.

Agent Systems, Inference Architecture, Multi-Model Scheduling

23 min read

The Agent Topology Problem: Inference Scheduling Across Model Boundaries

Agentic AI workloads — orchestrators calling planners calling retrievers calling verifiers — are not well-served by inference infrastructure designed for isolated single-model requests.

Inference Economics, AI Infrastructure, Cost Modelling

25 min read

The True Cost of a Token: Inference Economics from Silicon to SLO

OpenAI, Anthropic, Google, and a dozen open-source providers all quote $/1M-token prices. Nobody explains what those numbers are made of. This essay derives the cost of an output token from first principles.

Memory Systems, MCOS Series, Bandwidth Contracts

12 min read

Multi-Tenant KV Fabrics: Bandwidth Contracts for Shared LLM Memory Systems

KV cache bandwidth must be partitioned like network bandwidth. With compiler contracts, token buckets in hardware, and page coloring, multi-tenant tail latency improves 3.6x and throughput improves 28 to 31 percent.

Energy Constraints, MCOS Series, Joules-per-Token

14 min read

The Power Contract SLO: Making Joules-per-Token Schedulable

Power is not a limit to hit and then throttle. Power is a budget to allocate. By making joules-per-token a first class SLO, we can turn blind firmware throttling into global, proactive contract scheduling.

Hardware Interconnects, CXL 3.0, PCIe Gen5

6 min read

Why AI Servers Need a Memory Fabric, Not Just a Faster Bus

The physical limits of the motherboard have been reached. HBM is fast, but the moment an AI workload spills to host memory over PCIe, performance falls off a 75x cliff. Here is why CXL memory disaggregation is dismantling the definition of a server.

Numerics, AI Hardware, Systems Math

9 min read

Floating Point in AI: The Hidden Numbers Running Your Models

AI is powered by billions of approximate numbers. Once you understand floating point constraints—from fp32 to fp8—a lot of AI hardware design, quantization, and training stability suddenly clicks.

Semiconductors, Supply Chain, Geopolitics

44 min read

The Semiconductor Ecosystem: Every Layer, Every Player, Every Dependency

A comprehensive guide to the global semiconductor supply chain — from EDA tools and chip design through foundries, equipment, materials, memory, packaging, CPO, and test. 100+ companies across 7 countries, with detailed chokepoint analysis.

TPU Architecture, Memory Systems, XLA Compiler

7 min read

Software-Managed Memory Is the TPU's Real Advantage — and Its Hardest Abstraction

No hardware caches. No L1/L2. The TPU exposes raw scratchpad SRAM to the XLA compiler — and that changes everything about memory determinism vs. generality.

TPU Interconnect, OCS, 3D Torus

7 min read

Optical Circuit Switching and the 3D Torus: Why TPU Pods Are a Different Kind of Supercomputer

GPU clusters route packets through switches. TPU pods route light through MEMS mirrors. The physical network reshapes itself to match the computation.

TPU vs GPU, HBM Strategy, Systems Economics

6 min read

Why TPUs Have Less HBM Than GPUs — and Why That's a Design Choice, Not a Limitation

NVIDIA bets on fat nodes with maximum per-GPU memory. Google bets on thin nodes with maximum inter-chip bandwidth. Both are rational — they optimize for different cost functions.

NAND Flash AI Infrastructure Storage Systems

· 7 min read

NAND Flash Is the Invisible Backbone of Every AI Cluster

The GPU gets the credit. The SSD does the lifting. A systems essay on the five roles NAND flash plays — from checkpoint absorption to KV-cache offloading and weight staging.

HBF Memory Architecture NAND

· 7 min read

High Bandwidth Flash: The Missing Tier Between HBM and NVMe in AI Inference

8–16× HBM capacity, within 2.2% of HBM read performance. How NAND-based HBF is designed to close the 300× bandwidth gap in the AI memory hierarchy.

Capacity Planning NAND Supply AI Clusters

· 7 min read

The Storage Geometry of a 100,000-GPU Cluster: How Much NAND Does AI Actually Need?

A bottom-up accounting of flash capacity in hyperscale AI — from 15 TB per node to 3 exabytes per cluster. The numbers, the form factors, and the supply-chain pressure.

LLM Serving Memory Fairness Multi-Tenant

· 8 min read

The Scheduling Tax of Multi-Tenant Inference: Why Fairness and Throughput Are a Memory Problem

When a 128K-context request shares a GPU with forty 2K requests, the scheduler sees equal slots. The memory system sees a 64× asymmetry — and the minnows pay the price.

Batch Scheduling Memory Architecture Inference Systems

· 7 min read

Inference Batch Geometry: Why the Shape of a Batch Determines Its Memory Cost More Than Its Size

Batch size is a number. Batch geometry is a distribution. Two batch-32 configurations can differ by 10× in memory cost — and the memory system sees the distribution.

Compute-Memory Tradeoffs KV Cache Inference Architecture

· 9 min read

The Recompute-vs-Transfer Frontier: When Redoing Work Is Cheaper Than Moving Bytes

For some KV pages, recomputing attention from scratch costs less than fetching from host RAM. A systems essay on the fourth action in eviction policy design.

KV Cache Eviction Policy Ablation Study

· 15 min read

KV Hierarchy Lab: Regret-Aware Eviction and Trace-Driven Policy Evaluation

A trace-driven research harness for studying KV-cache eviction, quantization, and prefetch tradeoffs — featuring a regret-aware eviction policy with multi-seed ablation results.

Power AI Infrastructure Grid to GPU

· 29 min read

The Power Stack: How AI Scale-Out Gets Its Electricity — From Grid to GPU

A technical and industrial essay on the full power ecosystem behind AI data centers, from generation and transmission to UPS, switchgear, and GPU delivery.

Vera Rubin Cooling AI Facilities

· 23 min read

Why Vera Rubin Changes Everything About Cooling: 45°C Supply Temperature, Fan-Free Trays, and the Infrastructure Behind Them

A systems essay on how Vera Rubin NVL72 changes AI data center cooling with 45°C supply temperatures, fan-free trays, hose-free design, and liquid-cooled busbars.

Blackwell Cooling Stack AI Infrastructure

· 22 min read

The Cooling Stack Is the New Critical Path: How Blackwell GB300 NVL72 Racks Manage 142 kW

A systems essay on the five-layer direct liquid cooling architecture of Blackwell GB300 NVL72 racks and the suppliers behind each thermal layer.

vLLM GPU Thermals Scheduler Internals

· 13 min read

vLLM Internals: Where to Actually Cut Batch Size When the GPU Is Melting

A systems essay on where thermal control actually belongs inside vLLM, from scheduler decisions and swap behavior to tensor-parallel batch cutting.

HBM Thermals Observability GPU Infra

· 22 min read

Why HBM Thermal Throttling Is Silent: Reading the Tea Leaves in nvidia-smi

A technical guide to detecting silent HBM thermal throttling on H100 and H200 clusters when standard GPU temperature dashboards look deceptively healthy.

HBM Thermals KV Cache Inference Systems

· 18 min read

Thermal Debt Is a Memory Problem — How Hot Dies Throttle Your KV Prefetch

A systems essay on HBM thermal telemetry, KV fetch stalls, and why hot memory dies turn thermal debt into an inference scheduling problem.

Photonics Memory Fabrics MoE Serving

· 19 min read

CPO Isn’t About Power. It’s About Making Memory Disaggregation Schedulable

A systems essay on why co-packaged optics matters because it makes disaggregated memory and expert movement schedulable under tight latency budgets.

LLM Inference KV Cache HBM Thermals

· 11 min read

HBM Throttling-Safe KV Admission and ReuseNet for LLM Inference

A technical essay on thermal-safe KV admission, HBM backpressure, reuse prediction, and production-serving policy design for H100 and H200 inference clusters.

Photonics AI Networking Market Map

· 30 min read

The Photonics Stack: Who Builds What for AI Networking — Part 1

A technical primer mapping the AI optical networking stack across fiber, lasers, transceivers, DSPs, switches, and test infrastructure through the companies building each layer.

Transformer Inference KV Cache Memory Policy

· 17 min read

The Attention Sink Problem: Why Transformer Inference Wastes More Memory Than You Think

A systems essay on attention sink tokens, structural KV cache waste, and why long-context serving needs memory-policy-aware treatment of hot-but-low-utility tokens.

AI Clusters Thermals Reliability

· 19 min read

Thermal Debt in AI Clusters: The Silent Degradation Loop Nobody Is Measuring

A systems essay on how dense GPU racks accumulate thermal debt, why point-in-time observability misses it, and what thermally-aware control planes should measure.

Developer Tools Docs-to-Video Local-First

· 9 min read

TechDemoForge — A Local-First Engine for Turning Technical Docs into Demo Videos

A detailed product and architecture essay on TechDemoForge, its workflow, repo structure, and why local-first technical demo generation is useful.

Inference Systems Disaggregation KV Cache

· 17 min read

Prefill-Decode Disaggregation: Why the Next Big Inference Architecture Splits the Job in Two

A systems essay on why prefill and decode should be split across different hardware pools and why the real engineering challenge becomes memory orchestration.

Inference Systems Memory Policy Speculative Decoding

· 17 min read

Speculative Decoding Is a Memory Problem

A systems essay on draft/verify KV pressure, rollback fragmentation, and why speculation pays off only when memory policy is designed for it.

Long Context Sparse Serving Memory Systems

· 8 min read

The Next Frontier in Long-Context Inference: Memory-Orchestrated Sparse Serving

A systems essay on sparse attention serving, hierarchical KV residency, predictive prefetch, and why long-context wins increasingly come from memory policy.

Photonics Scale-Up AI Infrastructure

· 8 min read

Scale-Out Was Yesterday. Scale-Up Optics Is the Next Battle

A systems-first essay on why the next optical contest in AI infrastructure is shifting inward toward dense rack-scale and scale-up fabrics.

Photonics AI Networking Materials

· 6 min read

InP vs Silicon Photonics vs VCSEL: The Materials Stack Behind AI Networking

A deeper systems-first essay on why AI networking will be built from a layered materials stack rather than a single optical winner.

AI Infrastructure Interconnect Power Density

· 7 min read

The Real AI Bottleneck Is Moving From Compute to Interconnect Power Density

A deeper systems essay on why the next constraint is the power, heat, and topology cost of moving bits across clusters.

Photonics AI Clusters Systems Design

· 15 min read

Photonics Is No Longer a Component Story — It Is Becoming the Operating System of AI Clusters

A systems essay on how optics is moving into the scheduler, topology planner, reliability model, and power logic of next-generation AI infrastructure.

Optics AI Infrastructure Interconnect

· 15 min read

CPO, LPO, DSP, and VCSEL: What Actually Matters for AI Infrastructure

A practical hardware-focused companion essay on optical architectures, power budgets, serviceability, and the real tradeoffs hyperscalers optimize for.

AI Runtime Memory Scheduling Inference Systems

· 12 min read

The Memory Scheduler Is the New Critical Path in AI Inference

A systems essay on explicit data movement, KV cache management, tiered memory placement, DMA orchestration, and why scheduler quality now directly determines inference efficiency.

AI Hardware Memory Fabrics Systems Architecture

· 18 min read

Why Cache Coherency Is the Wrong Default for AI Machines

A long-form systems argument for selective coherency, explicit tensor movement, CXL.mem over universal coherence, and schedule-first design in large-model infrastructure.

HBM KV Cache AI Inference

· 16 min read

When VRAM Stops Being a Weight Warehouse

A systems primer on HBM as a bounded working set, KV cache dominance, PagedAttention, weight offload, and why scheduled weight streaming is the next step in inference architecture.

Virtual Memory Storage Systems Systems Architecture

· 11 min read

When 128 TB Stops Feeling Infinite: Address Space, mmap, and the Quiet Limits of Modern Systems

A systems essay on 48-bit virtual addressing, mmap-heavy designs, storage density, 5-level paging, and why explicit data orchestration becomes the long-term answer.

AI Compilers Memory Policy MCOS

· 16 min read

Memory Intent IR: Why AI Compilers Must Emit Memory Plans

A systems essay on compiler-emitted memory intent, object semantics, workload phases, reuse confidence, and why hardware orchestration needs structured plans instead of blind guesses.

MCOS Fabric Control AI Memory Systems

· 14 min read

MCOS-HFC: A Hardware Fabric Controller for Memory-Centric AI Systems

A technical essay on explicit memory intent, residency maps, regret-aware eviction, recomputation-vs-transfer arbitration, atomic doorbells, and DPU embodiments for AI memory fabrics.

MCOS AI Memory Fabrics Hardware Control

· 6 min read

MCOS Must Live in Hardware: From JBOD to Intelligent AI Memory Fabrics

A systems essay on why software-only memory orchestration hits a ceiling and why the real future is hardware-resident movement control near the fabric, the tiers, and the accelerators.

MCOS AI Systems Memory Control

· 8 min read

MCOS: A Memory-Centric Operating System for the Future of AI Systems

A technical essay on memory placement, movement, residency, reuse, admission, and eviction as first-class scheduling decisions rather than passive implementation details.

RDMA GPU Data Paths Storage Fabrics

· 8 min read

Why "Disk → RDMA → GPU" Is Still Fragmented Today

A technical essay on why local direct-storage acceleration and network direct-memory acceleration still do not add up to one universal GPU-native end-to-end storage fabric.

On-Chip Memory GPU Systems HBM→SRAM

· 8 min read

From SSD to GPU to SRAM: Why the Last Bottleneck Is Now On-Chip

A technical essay on why eliminating host-side bounce buffers shifts the real bottleneck inward, toward deterministic HBM↔SRAM orchestration inside the accelerator.

AI Systems Data Movement Zero-Copy

· 16 min read

Bounce Buffers: The Hidden Tax on Modern AI Systems

A technical essay on hidden staging buffers, GPUDirect-era dataflow, and why eliminating unnecessary copies matters across storage, network, memory, and accelerator paths.

RDMA AI Infrastructure Disaggregated Inference

· 10 min read

RDMA in the Age of AI: Zero-Copy, KV Cache Transfer, and the New Glue Layer

A detailed technical essay on RDMA, zero-copy realities, and why "RDMA exists" still does not mean true end-to-end zero-copy in disaggregated inference.

Disaggregated Inference KV Cache Transfer AI Infrastructure

· 7 min read

Seam Orchestrator: Workload-Aware KV Routing in Disaggregated Inference

A technical essay on policy above transport for KV movement, workload-aware admissibility, swappable glue layers, Scenarios E and F, and experiment-backed results.

AI Infrastructure Reliability Control Planes

· 12 min read

AI Cluster Reliability Beyond Fault-Tolerant Parallelism

A technical essay on gray failures, checkpoint economics, cooling-compute seams, and seam-aware control planes for modern AI cluster reliability.

AI Infrastructure Reliability Cluster Operations

· 10 min read

The Next AI Cluster Failure Won’t Look Like a GPU Failure

A technical essay on seam failures across facilities, fabrics, heterogeneous inference pools, checkpoint economics, and why component dashboards miss the most expensive cluster incidents.

Edge AI Compiler Runtime Power Control

· 5 min read

Adaptive Compiler–Runtime Power Contract for Energy-Optimal Edge Inference

A technical explainer on authenticated power contracts, alternative execution plans, safe switching boundaries, and runtime enforcement for edge inference systems.

Inference Architecture DMA Scheduling On-Chip Buffers

· 4 min read

Deterministic Memory-Orchestrated Inference Using DMA and Bounded On-Chip Buffers

A technical deep dive into compiler-scheduled DMA, explicit fences, bounded SRAM, and why deterministic buffer legality changes the inference memory system.

Edge Inference Energy Scheduling ARM Systems

· 4 min read

SLA-Constrained Energy-Aware Inference Scheduling on ARM Edge Systems

A technical essay on latency-aware policy selection across model variants, DMA strategy, memory residency, and accelerator performance-state control on edge systems.

AI Infrastructure Weight Residency Runtime Control

· 8 min read

Predictive Weight Orchestration: Runtime Control for Multi-Tier Weight Residency

A technical essay on HBM pressure, predictive multi-tier placement, precision state transitions, MoE router-history signals, and bandwidth-aware runtime scheduling.

Agent Systems Control Plane Developer Tooling

· 8 min read

Mobile Agent Control: A Vendor-Neutral Control Plane for Terminal-Native Coding Agents

A systems essay on Android-first operations, FastAPI supervision, runtime adapters, WebSocket telemetry, and safe control of local coding agents across machines.

Developer Tooling Web Performance Telemetry

· 5 min read

Introducing ChromeLens: Systems-Grade Web Performance Telemetry

An introduction to ChromeLens, deterministic CDP tracing, interactive flow profiling, and the hydration penalty behind complex modern web applications.

Cache Hierarchy AI Infrastructure Memory Systems

· 11 min read

What Bigger L2 Actually Buys You

A detailed technical essay on larger L2 caches in AI systems, miss-rate reduction, average access time, bandwidth relief, energy tradeoffs, and where bigger L2 stops being enough.

AI Infrastructure Memory Hierarchy Chip Architecture

· 12 min read

Why AI Needs a New Memory Hierarchy, Not Just Bigger Caches

A detailed technical essay on AI-native residency fabrics, class-aware memory, weights, KV cache, experts, and why bigger generic caches are not the full answer.

Computer Architecture LLM Decode Memory Hierarchy

· 13 min read

SRMIC-X1: Rethinking the Memory Hierarchy for LLM Decode

A technical essay on residency-first decode acceleration, distributed on-package SRAM, HRM/HBM crossover behavior, and why autoregressive inference is fundamentally memory-bound.

GPU Observability NVIDIA H100/H200 Telemetry

· 9 min read

How to Measure GPU Underutilization on NVIDIA H100 and H200

A practical systems essay on rolling-window low-utilization metrics, sampled idle behavior, power telemetry, and what point-in-time GPU utilization misses.

Portfolio Tools Options Analysis Local Software

· 7 min read

Schwab Portfolio Tools and the Case for Local, Practical Portfolio Infrastructure

A technical essay on privacy-first local portfolio tooling, Schwab CSV analysis, options workflows, IV crush scenarios, and practical risk reporting without platform theater.

LLM Inference Memory Hierarchy Controller Design

· 8 min read

vOrchestrate and the Case for Controller-Centric Memory Policy in LLM Inference

A technical essay on dynamic multi-tier weight residency orchestration across HBM, DRAM, and NVMe, with scoring, guardrails, state transitions, and simulation-first evaluation.

Computational Biology Explainability Agent Systems

· 8 min read

MHC Atlas OS and the Case for Explainable Structure-Guided Prioritization

A technical essay on runtime-agnostic, policy-governed, structure-guided experimental prioritization using AlphaFold-derived data, explainable scoring, and decision memory.

Long Context Memory Policy AI Infrastructure

· 7 min read

Long-Context Inference Needs Better Memory Policy, Not Just More Memory

A technical essay on predictive context region orchestration, region-level attention and residency control, speculative promotion, reversible demotion, and coherence-aware long-context inference.

AI Systems Memory Hierarchy Bandwidth

· 8 min read

The Real Tax in AI Systems Is Moving Bytes

A technical essay on bandwidth amplification, repeated refill across the hierarchy, and why better AI machines need stronger movement discipline, not just more compute and memory.

AI Infrastructure SRAM Residency Patent-linked

· 13 min read

Hardware-Enforced On-Chip Memory Residency for Neural Network Inference Accelerators

A deep technical essay on the bandwidth cost of repeatedly reloading hot weights during autoregressive inference, and why a wired on-chip residency primitive changes the machine rather than merely nudging the policy.

Local AI Middleware Open Source

· 6 min read

gemma4-wdc: A middleware layer that stops local agents from doing the same work twice

A laptop-first systems write-up on shared execution units, bounded admission windows, and why local multi-agent workflows waste more backend work than most people realize.

HBM LLM Inference Memory Policy

· 23 min read

HBM Fragmentation Guard: Confidence-Gated Residency Control for AI Accelerators

A systems essay on why LRU is the wrong default for HBM residency under LLM serving pressure, and how confidence gating, thrash budgets, and safe-window compaction change allocator behavior.