<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Manish K. Lach RSS</title>
    <link>https://manishklach.github.io/writings.html</link>
    <description>Technical essays on AI infrastructure, memory systems, runtimes, and accelerator architecture.</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 19 Apr 2026 00:00:00 GMT</lastBuildDate>
    <item>
      <title>The Whole Stack: Everything Between a Transistor and a Token</title>
      <link>https://manishklach.github.io/writings/the-whole-stack-everything-between-a-transistor-and-a-token.html</link>
      <description>Eighty-plus essays on AI infrastructure eventually converge on one insight: every performance number, every cost figure, every architectural decision is a consequence of a smaller, more physical decision made lower in the stack. This essay connects all of them — from process nodes to token pricing — and shows where the binding constraint actually lives in 2026.</description>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-whole-stack-everything-between-a-transistor-and-a-token.html</guid>
    </item>
    <item>
      <title>Sparse MQA + Fused MoE Mega Kernel + Hyper-connections: The Trifecta Behind Modern Frontier Models</title>
      <link>https://manishklach.github.io/writings/sparse-mqa-fused-moe-hyperconnections-frontier-models.html</link>
      <description>Three architectural innovations each solve one wall that stops dense transformers from scaling — memory, compute, and depth. Together they unlock 2M-token context, trillion-parameter capacity, and 120-layer depth on existing H100 clusters. Here is exactly how each one works and why they are inseparable.</description>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/sparse-mqa-fused-moe-hyperconnections-frontier-models.html</guid>
    </item>
    <item>
      <title>Multi-Tenant KV Fabrics: Bandwidth Contracts for Shared LLM Memory Systems</title>
      <link>https://manishklach.github.io/writings/multi-tenant-kv-fabrics.html</link>
      <description>Manish K L — April 2026. MCOS series, essay 2 of 7. Full version.</description>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/multi-tenant-kv-fabrics.html</guid>
    </item>
    <item>
      <title>What &quot;3nm&quot; Actually Means: Process Nodes, Transistor Physics, and Why Scaling Is Slowing</title>
      <link>https://manishklach.github.io/writings/what-3nm-actually-means.html</link>
      <description>When TSMC announces its N3 process or Intel announces 18A, what is actually changing about the transistor? What does the node number mean — and why does it no longer mean what it used to? This essay explains the physics of transistor scaling from FinFET to Gate-All-Around, why Moore&apos;s Law is slowing, and what it means for AI chip design.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/what-3nm-actually-means.html</guid>
    </item>
    <item>
      <title>The PCIe Tax: How Host-Staged Networking Steals Half Your GPU Bandwidth</title>
      <link>https://manishklach.github.io/writings/the-pcie-tax-why-bypassing-the-host-doubles-moe-throughput.html</link>
      <description>We buy H100s for 3.35 TB/s of HBM3e, then strangle them with a 25-year-old I/O model that forces every byte through the CPU. The fix is not faster PCIe — it is zero crossings.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-pcie-tax-why-bypassing-the-host-doubles-moe-throughput.html</guid>
    </item>
    <item>
      <title>KV Fabrics: Treating Context as a Distributed Filesystem</title>
      <link>https://manishklach.github.io/writings/kv-fabrics-treating-context-as-a-distributed-filesystem.html</link>
      <description>In &quot;The DPU is the New NIC,&quot; we argued that the unit of scheduling has shifted from packets to prompts, and that DPUs must become prompt-aware memory controllers. This essay extends that argument: if prompts are stateful, then the state they produce—the KV cache—must be stored, addressed, and shared like files, not like malloc buffers.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/kv-fabrics-treating-context-as-a-distributed-filesystem.html</guid>
    </item>
    <item>
      <title>Inside the GPU: SMs, Warps, Tensor Cores, and Why the Architecture Looks the Way It Does</title>
      <link>https://manishklach.github.io/writings/inside-the-gpu-sm-warps-tensor-cores.html</link>
      <description>The GPU is the central hardware artefact of the AI era, yet it is almost always treated as a black box — a thing that runs matrix multiplies and produces tokens. This essay opens the box: what lives inside an SM, why warps exist, how tensor cores achieve their throughput, and what the architecture&apos;s design choices mean for AI workloads specifically.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inside-the-gpu-sm-warps-tensor-cores.html</guid>
    </item>
    <item>
      <title>The Memory Bottleneck: Why Inference Speed Is a Memory Problem, Not a Compute Problem</title>
      <link>https://manishklach.github.io/writings/inference-speed-is-a-memory-problem.html</link>
      <description>Every AI processor debate eventually circles back to the same misdiagnosis: we need more compute. The actual constraint is almost never the arithmetic. It is the speed at which memory can feed data to the cores that are already sitting idle, waiting.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inference-speed-is-a-memory-problem.html</guid>
    </item>
    <item>
      <title>HBM Explained: How High Bandwidth Memory Is Actually Built</title>
      <link>https://manishklach.github.io/writings/hbm-how-it-is-actually-built.html</link>
      <description>Every essay in this corpus mentions HBM. None of them explains what it is at the physical level — how DRAM dies are stacked, what a TSV actually does, why the interposer exists, and why stacking is the only path to bandwidth density that silicon physics allows.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hbm-how-it-is-actually-built.html</guid>
    </item>
    <item>
      <title>The True Cost of a Token: Inference Economics from Silicon to SLO</title>
      <link>https://manishklach.github.io/writings/true-cost-of-a-token.html</link>
      <description>A bottom-up cost model connecting HBM bandwidth, cooling kilowatts, NAND geometry, and interconnect physics to the number that actually matters: dollars per million output tokens.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/true-cost-of-a-token.html</guid>
    </item>
    <item>
      <title>RISC vs CISC: Where Complexity Lives</title>
      <link>https://manishklach.github.io/writings/risc-cisc.html</link>
      <description>RISC and CISC aren&apos;t processor brands — they&apos;re philosophies. One says make the hardware simple and push work to software. The other says make the hardware clever to make software easier. Modern chips use both ideas, but understanding the trade-off explains almost everything about why your phone&apos;s processor looks nothing like your laptop&apos;s from 1995.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/risc-cisc.html</guid>
    </item>
    <item>
      <title>The DPU as Agent Memory Controller: Offloading Orchestration from the Host</title>
      <link>https://manishklach.github.io/writings/dpu-agent-memory-controller.html</link>
      <description>Why the 78% CPU / 31% GPU problem in agentic inference isn&apos;t a software bug — it&apos;s an architecture problem.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/dpu-agent-memory-controller.html</guid>
    </item>
    <item>
      <title>Why Agentic AI Is a CPU and DRAM Problem, Not Just a GPU Problem</title>
      <link>https://manishklach.github.io/writings/agentic-ai-cpu-dram-problem.html</link>
      <description>The center of gravity for AI inference is shifting from dense matmuls to stateful orchestration, context movement, and memory capacity.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/agentic-ai-cpu-dram-problem.html</guid>
    </item>
    <item>
      <title>The Agent Topology Problem: Inference Scheduling Across Model Boundaries</title>
      <link>https://manishklach.github.io/writings/agent-topology-problem.html</link>
      <description>Every essay in this corpus has treated inference as one model, one request. The real 2026 workload is a directed graph of model calls. When the unit of scheduling shifts from a request to a pipeline, everything we know about KV locality, memory residency, and latency SLOs must be rebuilt from scratch.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/agent-topology-problem.html</guid>
    </item>
    <item>
      <title>The Power Contract SLO: Making Joules-per-Token Schedulable from Grid to SRAM</title>
      <link>https://manishklach.github.io/writings/power-contract-slo.html</link>
      <description>Manish K L — April 2026. MCOS series, essay 4 of 7.</description>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/power-contract-slo.html</guid>
    </item>
    <item>
      <title>Why AI Servers Need a Memory Fabric, Not Just a Faster Bus</title>
      <link>https://manishklach.github.io/writings/pcie-bottleneck-cxl-escape.html</link>
      <description>The discrete server motherboard is becoming an architectural bottleneck. While local HBM remains the ideal execution state, AI workloads inevitably spill to host memory—triggering a severe bandwidth cliff. Here is a definitive look at why upgrading PCIe generations won&apos;t solve the latency problem, and why the industry is migrating toward CXL-attached tiering and memory fabrics.</description>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/pcie-bottleneck-cxl-escape.html</guid>
    </item>
    <item>
      <title>Floating Point in AI: The Hidden Numbers Running Your Models</title>
      <link>https://manishklach.github.io/writings/floating-point-in-ai.html</link>
      <description>AI is not powered by magic. It is powered by billions of approximate numbers, stored in carefully chosen formats, pushed through specialized hardware at absurd scale. Once you understand floating point, a lot of AI hardware, quantization, and training stability suddenly clicks.</description>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/floating-point-in-ai.html</guid>
    </item>
    <item>
      <title>Why Vera Rubin Changes Everything About Cooling: 45°C Supply Temperature, Fan-Free Trays, and the Infrastructure Behind Them</title>
      <link>https://manishklach.github.io/writings/vera-rubin-cooling-45c-supply-temperature.html</link>
      <description>The Blackwell GB300 NVL72 established a new baseline for AI rack cooling: 142 kW per rack, five-layer direct liquid cooling, 450 quick disconnects per rack, 30–35°C CDU supply water. Vera Rubin NVL72 discards most of those constraints. It operates at 45°C supply temperature, eliminates fans and hoses from the tray design entirely, liquid-cools the busbar, and processes 187–227 kW per rack. Each of those changes cascades through the entire facility infrastructure stack.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/vera-rubin-cooling-45c-supply-temperature.html</guid>
    </item>
    <item>
      <title>Software-Managed Memory Is the TPU&apos;s Real Advantage</title>
      <link>https://manishklach.github.io/writings/tpu-software-managed-memory-real-advantage.html</link>
      <description>No hardware caches. No L1/L2. The TPU exposes raw scratchpad SRAM to the XLA compiler — and that changes everything about how the memory system behaves under load.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/tpu-software-managed-memory-real-advantage.html</guid>
    </item>
    <item>
      <title>Optical Circuit Switching and the 3D Torus</title>
      <link>https://manishklach.github.io/writings/tpu-optical-circuit-switching-3d-torus.html</link>
      <description>GPU clusters route packets through electrical switches. TPU pods route light through MEMS mirrors. The physical network reshapes itself to match the computation&apos;s communication pattern.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/tpu-optical-circuit-switching-3d-torus.html</guid>
    </item>
    <item>
      <title>Why TPUs Have Less HBM Than GPUs</title>
      <link>https://manishklach.github.io/writings/tpu-less-hbm-design-choice-not-limitation.html</link>
      <description>And why that&apos;s a design choice, not a limitation. NVIDIA bets on fat nodes with maximum per-GPU memory. Google bets on thin nodes with maximum inter-chip bandwidth. Both are rational — they optimize for different cost functions.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/tpu-less-hbm-design-choice-not-limitation.html</guid>
    </item>
    <item>
      <title>The Semiconductor Ecosystem</title>
      <link>https://manishklach.github.io/writings/the-semiconductor-ecosystem.html</link>
      <description>Every modern technology — from the phone in your pocket to the GPU training the world&apos;s largest models — depends on a semiconductor supply chain of staggering complexity. This post maps the entire ecosystem: the companies, the dependencies, the chokepoints, and the technologies that make a chip possible. If you want to understand why ASML, TSMC, and Lam Research matter, start here.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-semiconductor-ecosystem.html</guid>
    </item>
    <item>
      <title>The Storage Geometry of a 100,000-GPU Cluster</title>
      <link>https://manishklach.github.io/writings/storage-geometry-100k-gpu-cluster-nand-demand.html</link>
      <description>How much NAND does AI actually need? A full accounting — from per-node E1.S drives to petabyte-scale checkpoint reservoirs, dataset lakes, and the NAND supply pressure that hyperscale AI is creating.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/storage-geometry-100k-gpu-cluster-nand-demand.html</guid>
    </item>
    <item>
      <title>The Recompute-vs-Transfer Frontier</title>
      <link>https://manishklach.github.io/writings/recompute-vs-transfer-frontier-inference.html</link>
      <description>When redoing work is cheaper than moving bytes — and why eviction policies should know the difference between pages worth storing and pages worth recomputing.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/recompute-vs-transfer-frontier-inference.html</guid>
    </item>
    <item>
      <title>NAND Flash Is the Invisible Backbone of Every AI Cluster</title>
      <link>https://manishklach.github.io/writings/nand-flash-invisible-backbone-ai-clusters.html</link>
      <description>The GPU gets the credit. The SSD does the lifting. A systems essay on the five roles NAND flash plays in modern AI training and inference infrastructure — from checkpoint absorption to KV-cache offloading.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/nand-flash-invisible-backbone-ai-clusters.html</guid>
    </item>
    <item>
      <title>The Scheduling Tax of Multi-Tenant Inference</title>
      <link>https://manishklach.github.io/writings/multi-tenant-inference-memory-fairness.html</link>
      <description>Why fairness and throughput in shared inference pools are fundamentally a memory problem — and why the scheduler&apos;s view of &quot;fair&quot; diverges from the memory system&apos;s reality by up to 64×.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/multi-tenant-inference-memory-fairness.html</guid>
    </item>
    <item>
      <title>KV Hierarchy Lab: Regret-Aware Eviction</title>
      <link>https://manishklach.github.io/writings/kv-hierarchy-lab-regret-aware-eviction-trace-driven-policy-evaluation.html</link>
      <description>A trace-driven research harness for studying KV-cache residency, eviction, quantization, and prefetch tradeoffs — featuring a regret-aware eviction policy with multi-seed ablation results on synthetic traces.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/kv-hierarchy-lab-regret-aware-eviction-trace-driven-policy-evaluation.html</guid>
    </item>
    <item>
      <title>Inference Batch Geometry</title>
      <link>https://manishklach.github.io/writings/inference-batch-geometry-memory-cost.html</link>
      <description>Why the shape of a batch — the distribution of sequence lengths, generation phases, and quantization formats within it — determines its memory cost more than its size.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inference-batch-geometry-memory-cost.html</guid>
    </item>
    <item>
      <title>High Bandwidth Flash: The Missing Tier</title>
      <link>https://manishklach.github.io/writings/high-bandwidth-flash-hbf-missing-tier-ai-inference.html</link>
      <description>HBM is fast but small and expensive. NVMe is large but slow. HBF (High Bandwidth Flash) is the NAND-based memory tier designed to sit between them — offering 8–16× the capacity of HBM while coming within 2.2% of its read performance for inference workloads.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/high-bandwidth-flash-hbf-missing-tier-ai-inference.html</guid>
    </item>
    <item>
      <title>The Cooling Stack Is the New Critical Path: How Blackwell GB300 NVL72 Racks Manage 142 kW</title>
      <link>https://manishklach.github.io/writings/blackwell-gb300-nvl72-cooling-stack.html</link>
      <description>Cooling is no longer a facilities afterthought that gets sorted out after the GPUs arrive. For the Blackwell GB300 NVL72, the thermal system is a precision five-layer architecture with specific component vendors, defined thermal budgets at every interface, and failure modes that propagate directly into training throughput and inference latency. This is how it actually works — from the cold plate on the die to the chiller outside the building.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/blackwell-gb300-nvl72-cooling-stack.html</guid>
    </item>
    <item>
      <title>The Power Stack: How AI Scale-Out Gets Its Electricity — From Grid to GPU</title>
      <link>https://manishklach.github.io/writings/ai-power-stack-grid-to-gpu.html</link>
      <description>A single Vera Rubin NVL72 rack draws up to 227 kW. A 100,000-GPU training cluster draws somewhere between 300 MW and 1 GW continuously. An AI factory the scale Meta is building in Louisiana may eventually draw 5 GW — approximately half the average electricity consumption of New York City. None of that power is free, and none of it arrives at the GPU without passing through a stack of technologies built by a specific set of companies, each owning a distinct and non-interchangeable piece.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/ai-power-stack-grid-to-gpu.html</guid>
    </item>
    <item>
      <title>Why HBM Thermal Throttling Is Silent: Reading the Tea Leaves in nvidia-smi</title>
      <link>https://manishklach.github.io/writings/why-hbm-thermal-throttling-is-silent.html</link>
      <description>On Hopper, nvidia-smi will happily show GPU Temp: 60°C while your HBM3e silently throttles at 95°C+, dropping memory bandwidth from 3.1 TB/s to 2.3 TB/s and spiking your vLLM p99 latency by 30%+. The core-temp sensor and power-throttle flags are no longer sufficient to diagnose memory-bound inference stalls. You need NVML&apos;s NVML_TEMPERATURE_MEMORY enum or DCGM field 252. This post explains the full signal path — why NVIDIA hides it, how to detect it across all three methods, and how to act on it.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-hbm-thermal-throttling-is-silent.html</guid>
    </item>
    <item>
      <title>Why &quot;Disk → RDMA → GPU&quot; Is Still Fragmented Today</title>
      <link>https://manishklach.github.io/writings/why-disk-rdma-gpu-is-still-fragmented-today.html</link>
      <description>We have local direct-storage acceleration and network direct-memory acceleration, but we still do not have a universal GPU-native end-to-end storage fabric. The missing layer is orchestration.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-disk-rdma-gpu-is-still-fragmented-today.html</guid>
    </item>
    <item>
      <title>vLLM Internals: Where to Actually Cut Batch Size When the GPU is Melting</title>
      <link>https://manishklach.github.io/writings/vllm-internals-cut-batch-size-gpu-melting.html</link>
      <description>Production vLLM deployments running at 90%+ GPU util will eventually hit thermal limits. Counterintuitively, reducing max_num_seqs during a thermal event often does nothing. This post traces exactly where batch size is enforced, why decode batches ignore your knobs, and how to build a thermal-aware scheduler that actually saves your GPUs.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/vllm-internals-cut-batch-size-gpu-melting.html</guid>
    </item>
    <item>
      <title>Thermal Debt Is a Memory Problem — How Hot Dies Throttle Your KV Prefetch</title>
      <link>https://manishklach.github.io/writings/thermal-debt-memory-problem-hot-dies-throttle-kv-prefetch.html</link>
      <description>If you profile a large-context LLM job on an H100 cluster, you will see a familiar pattern: prefill starts at 2.1x nominal tokens/s, then slowly sags to 1.7x over the first 90 seconds. Your first instinct is power capping. You check nvidia-smi and see the GPU at 690W, below the 700W limit. Clocks are stable. PCIe is idle. The NVLink counters show no contention. Yet the KV load units are stalling.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/thermal-debt-memory-problem-hot-dies-throttle-kv-prefetch.html</guid>
    </item>
    <item>
      <title>Seam Orchestrator: Workload-Aware KV Routing</title>
      <link>https://manishklach.github.io/writings/seam-orchestrator-workload-aware-kv-routing.html</link>
      <description>Experiments in admissibility, capacity-aware routing, hysteresis, and replacement-feasible glue layers above transport — for disaggregated inference systems where the policy question matters as much as the transport question.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/seam-orchestrator-workload-aware-kv-routing.html</guid>
    </item>
    <item>
      <title>RDMA in the Age of AI</title>
      <link>https://manishklach.github.io/writings/rdma-in-the-age-of-ai.html</link>
      <description>Remote Direct Memory Access is old enough to feel familiar and new enough to matter again. In the AI era, RDMA is no longer just an HPC detail — it sits in the critical path of GPU clusters, KV-cache transfer, and disaggregated inference. The real question isn&apos;t &quot;is the link fast?&quot; It&apos;s where copies still remain, and what software layer has the right to reason about them.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/rdma-in-the-age-of-ai.html</guid>
    </item>
    <item>
      <title>MCOS Must Live in Hardware</title>
      <link>https://manishklach.github.io/writings/mcos-must-live-in-hardware-ai-memory-fabrics.html</link>
      <description>From JBOD to NAS, storage became powerful when it stopped being passive. The same shift is coming for AI memory systems: data movement needs a brain, and that brain cannot live only in software.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mcos-must-live-in-hardware-ai-memory-fabrics.html</guid>
    </item>
    <item>
      <title>MCOS: A Memory-Centric Operating System for AI</title>
      <link>https://manishklach.github.io/writings/mcos-memory-centric-operating-system-ai-systems.html</link>
      <description>A software-defined control layer that treats memory placement, movement, residency, reuse, admission, and eviction as first-class scheduling objects — not passive implementation details.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mcos-memory-centric-operating-system-ai-systems.html</guid>
    </item>
    <item>
      <title>MCOS-HFC: A Hardware Fabric Controller for Memory-Centric AI</title>
      <link>https://manishklach.github.io/writings/mcos-hfc-hardware-fabric-controller-memory-centric-ai-systems.html</link>
      <description>Software defines the intent. Hardware enforces the residency, movement, admission, eviction, and handoff — near the fabric and the accelerator, without touching the CPU hot path.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mcos-hfc-hardware-fabric-controller-memory-centric-ai-systems.html</guid>
    </item>
    <item>
      <title>gemma4-wdc: A middleware layer that stops local agents from doing the same work twice</title>
      <link>https://manishklach.github.io/writings/local-agent-deduplication-middleware.html</link>
      <description>As local multi-agent systems get practical, the next bottleneck is not the models. It is the duplicated downstream work underneath them. This essay explains the middleware primitive, the bounded admission window, the safety case, and the laptop-scale prototype I built to make that behavior visible.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/local-agent-deduplication-middleware.html</guid>
    </item>
    <item>
      <title>HBM Throttling-Safe KV Admission and ReuseNet for LLM Inference</title>
      <link>https://manishklach.github.io/writings/hbm-throttling-safe-kv-admission-reusenet-llm-inference.html</link>
      <description>A production-grade admission policy with thermal backpressure and learned reuse probability that eliminates HBM thermal throttling while maintaining 99.2% P50 TTFT under load.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hbm-throttling-safe-kv-admission-reusenet-llm-inference.html</guid>
    </item>
    <item>
      <title>HBM Fragmentation Guard: Confidence-Gated Residency Control for AI Accelerators</title>
      <link>https://manishklach.github.io/writings/hbm-fragmentation-guard.html</link>
      <description>Why LRU is the wrong eviction policy for LLM inference, and what to do about it.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hbm-fragmentation-guard.html</guid>
    </item>
    <item>
      <title>From SSD to GPU to SRAM: Why the Last Bottleneck Is Now On-Chip</title>
      <link>https://manishklach.github.io/writings/from-ssd-to-gpu-to-sram-last-bottleneck-on-chip.html</link>
      <description>Once host-memory bounce buffers are reduced, the bottleneck shifts inward. The next frontier is not just storage-to-GPU delivery, but deterministic movement between HBM and on-chip SRAM.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/from-ssd-to-gpu-to-sram-last-bottleneck-on-chip.html</guid>
    </item>
    <item>
      <title>CPO Isn’t About Power. It’s About Making Memory Disaggregation Schedulable</title>
      <link>https://manishklach.github.io/writings/cpo-making-memory-disaggregation-schedulable.html</link>
      <description>Every major analyst report on Co-Packaged Optics opens with the same chart: watts per terabit. CPO at 5 pJ/bit versus 15 pJ/bit for pluggables. LPO somewhere in the middle at 7 pJ/bit if you squint and assume perfect link conditions. The narrative is that we need CPO because the 51.2T and 102.4T switch radixes will melt faceplates.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/cpo-making-memory-disaggregation-schedulable.html</guid>
    </item>
    <item>
      <title>When 128 TB Stops Feeling Infinite</title>
      <link>https://manishklach.github.io/writings/address-space-mmap-quiet-limits-modern-systems.html</link>
      <description>For years, 48-bit virtual addressing sounded effectively unbounded. Then storage kept scaling, mmap stayed convenient, and modern systems quietly wandered into a new class of limit: not RAM exhaustion, but address-space exhaustion.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/address-space-mmap-quiet-limits-modern-systems.html</guid>
    </item>
    <item>
      <title>Thermal Debt in AI Clusters: The Silent Degradation Loop Nobody Is Measuring</title>
      <link>https://manishklach.github.io/writings/thermal-debt-ai-clusters.html</link>
      <description>Dense GPU racks do not fail suddenly. They degrade. Thermal energy accumulates across requests, across shifts, across weeks — in the substrate, in the power delivery, in the packaging. Current observability stacks measure point-in-time temperatures and call the cluster healthy. This essay maps the degradation loop, its failure modes, and what a thermally-aware control plane would need to know.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/thermal-debt-ai-clusters.html</guid>
    </item>
    <item>
      <title>TechDemoForge: A Local-First Engine for Turning Technical Docs into Demo Videos</title>
      <link>https://manishklach.github.io/writings/techdemoforge-local-first-engine-turning-technical-docs-into-demo-videos.html</link>
      <description>A detailed look at what TechDemoForge is for, why it is different from generic AI video tooling, how the repository is structured, and where MiniMax-M2.7 fits in the workflow.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/techdemoforge-local-first-engine-turning-technical-docs-into-demo-videos.html</guid>
    </item>
    <item>
      <title>The Photonics Stack: Who Builds What for AI Networking</title>
      <link>https://manishklach.github.io/writings/photonics-stack-ai-networking-part-1.html</link>
      <description>AI clusters need extraordinary amounts of light. Not metaphorically — literally. Every GPU-to-GPU communication in a training cluster travels as photons over fiber. The companies that generate, shape, modulate, route, amplify, and test those photons are the unseen infrastructure of the AI era. This is a technical map of eight of them: what each one actually builds, where it sits in the stack, and why its piece of the problem is hard.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/photonics-stack-ai-networking-part-1.html</guid>
    </item>
    <item>
      <title>The Attention Sink Problem: Why Transformer Inference Wastes More Memory Than You Think</title>
      <link>https://manishklach.github.io/writings/attention-sink-problem-transformer-inference-memory-waste.html</link>
      <description>Transformers are forced by softmax to attend somewhere — and when there is no semantically meaningful target, they pile attention mass onto a single structural artifact: the first token. That token is useless for prediction. But its KV state must live in memory forever. This is the attention sink problem, and it is quietly inflating your cache budget at every sequence length.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/attention-sink-problem-transformer-inference-memory-waste.html</guid>
    </item>
    <item>
      <title>The Real AI Bottleneck Is Moving From Compute to Interconnect Power Density</title>
      <link>https://manishklach.github.io/writings/the-real-ai-bottleneck-is-moving-from-compute-to-interconnect-power-density.html</link>
      <description>For years the conversation was about FLOPS. Then it became about memory bandwidth. Now the harder wall is not raw transport speed but the full stack cost of moving bits: SerDes power, retimers, DSP burden, switch radix pressure, optical engines, cooling overhead, routing complexity, and the topology tax of making many accelerators behave like one machine.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-real-ai-bottleneck-is-moving-from-compute-to-interconnect-power-density.html</guid>
    </item>
    <item>
      <title>Speculative Decoding Is a Memory Problem</title>
      <link>https://manishklach.github.io/writings/speculative-decoding-is-a-memory-problem.html</link>
      <description>Everyone talks about speculative decoding as a compute trick — a draft model generates tokens cheaply, the target model verifies them in parallel, and the system enjoys superlinear throughput. That framing is correct but incomplete. The real constraint, at serving scale, is memory: dual KV caches, divergence-driven rollback, verify-phase bandwidth spikes, and draft residency pressure that fights directly with concurrent request capacity.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/speculative-decoding-is-a-memory-problem.html</guid>
    </item>
    <item>
      <title>Scale-Out Was Yesterday. Scale-Up Optics Is the Next Battle</title>
      <link>https://manishklach.github.io/writings/scale-out-was-yesterday-scale-up-optics-is-the-next-battle.html</link>
      <description>For years, optical networking discussion centered on classic scale-out: rack-to-rack, row-to-row, and datacenter interconnect. AI is changing the center of gravity.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/scale-out-was-yesterday-scale-up-optics-is-the-next-battle.html</guid>
    </item>
    <item>
      <title>Prefill-Decode Disaggregation: Why the Next Big Inference Architecture Splits the Job in Two</title>
      <link>https://manishklach.github.io/writings/prefill-decode-disaggregation-why-the-next-big-inference-architecture-splits-the-job-in-two.html</link>
      <description>Prefill and decode have opposite resource profiles. Prefill is compute-bound — it processes the entire input context in a single large matrix multiply and wants high FLOP density. Decode is memory-bandwidth-bound — it generates one token per step and spends most of its time reading weights and KV cache from HBM. Running them on the same GPU imposes a hardware compromise that satisfies neither. Disaggregation separates them — and exposes a new class of memory orchestration problems.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/prefill-decode-disaggregation-why-the-next-big-inference-architecture-splits-the-job-in-two.html</guid>
    </item>
    <item>
      <title>The Next Frontier in Long-Context Inference: Memory-Orchestrated Sparse Serving</title>
      <link>https://manishklach.github.io/writings/memory-orchestrated-sparse-serving-next-frontier-in-long-context-inference.html</link>
      <description>Sparse attention is an important step, but it is not the full answer. Once sparse decode becomes cheap enough computationally, the bottleneck shifts to something less glamorous and more decisive: KV residency and movement policy.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/memory-orchestrated-sparse-serving-next-frontier-in-long-context-inference.html</guid>
    </item>
    <item>
      <title>InP vs Silicon Photonics vs VCSEL: The Materials Stack Behind AI Networking</title>
      <link>https://manishklach.github.io/writings/inp-vs-silicon-photonics-vs-vcsel-ai-networking.html</link>
      <description>After bandwidth, power density, and failure domains, the next question is physical: which materials will AI networking actually be built from?</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inp-vs-silicon-photonics-vs-vcsel-ai-networking.html</guid>
    </item>
    <item>
      <title>Why Cache Coherency Is the Wrong Default for AI Machines</title>
      <link>https://manishklach.github.io/writings/why-cache-coherency-is-the-wrong-default-for-ai-machines.html</link>
      <description>Classical cache coherency made CPUs programmable and pleasant. AI machines live under a different regime: giant working sets, mostly read-heavy model state, throughput-first economics, and increasingly explicit data movement. In that world, strict coherency stops looking like a universal good and starts looking like an expensive default.</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-cache-coherency-is-the-wrong-default-for-ai-machines.html</guid>
    </item>
    <item>
      <title>The Memory Scheduler Is the New Critical Path in AI Inference</title>
      <link>https://manishklach.github.io/writings/the-memory-scheduler-is-the-new-critical-path-in-ai-inference.html</link>
      <description>When hardware coherency steps back, software scheduling steps forward. As AI machines move from transparent coherence to explicit data movement, the memory scheduler quietly becomes the most performance-critical component in the stack — and most systems aren&apos;t treating it that way yet.</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-memory-scheduler-is-the-new-critical-path-in-ai-inference.html</guid>
    </item>
    <item>
      <title>Photonics Is No Longer a Component Story — It Is Becoming the Operating System of AI Clusters</title>
      <link>https://manishklach.github.io/writings/photonics-becoming-the-operating-system-of-ai-clusters.html</link>
      <description>The industry still talks about optics as if it lives inside modules, pluggables, and line cards. AI is changing that. Once bandwidth, power density, thermals, and collective communication start dominating cluster design, optics stops being a passive transport layer and starts becoming something the scheduler, runtime, and topology planner have to understand.</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/photonics-becoming-the-operating-system-of-ai-clusters.html</guid>
    </item>
    <item>
      <title>CPO, LPO, DSP, and VCSEL: What Actually Matters for AI Infrastructure</title>
      <link>https://manishklach.github.io/writings/cpo-lpo-dsp-vcsel-what-actually-matters-for-ai-infrastructure.html</link>
      <description>The optics conversation is getting noisy. Co-packaged optics, linear pluggables, DSP-heavy modules, VCSEL scale-up links, silicon photonics, InP — it is easy to turn all of it into buzzwords. The useful question is simpler: which technology solves which bottleneck, for which part of the AI fabric, under which operational constraints?</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/cpo-lpo-dsp-vcsel-what-actually-matters-for-ai-infrastructure.html</guid>
    </item>
    <item>
      <title>When VRAM Stops Being a Weight Warehouse</title>
      <link>https://manishklach.github.io/writings/when-vram-stops-being-a-weight-warehouse.html</link>
      <description>A systems primer on HBM, KV cache, PagedAttention, weight offload — and why the real end state isn&apos;t smarter offloading, it&apos;s treating GPU memory as a bounded execution working set that weights pass through, not live in.</description>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/when-vram-stops-being-a-weight-warehouse.html</guid>
    </item>
    <item>
      <title>Memory Intent IR: The Missing Compiler Output</title>
      <link>https://manishklach.github.io/writings/memory-intent-ir-why-ai-compilers-must-emit-memory-plans.html</link>
      <description>Compute graphs tell the system what to calculate. They stay silent about how the objects involved are expected to live, compete, expire, and move. That silence is becoming a first-order performance tax.</description>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/memory-intent-ir-why-ai-compilers-must-emit-memory-plans.html</guid>
    </item>
    <item>
      <title>How to Measure GPU Underutilization on NVIDIA H100 and H200</title>
      <link>https://manishklach.github.io/writings/how-to-measure-gpu-underutilization-on-nvidia-h100-and-h200.html</link>
      <description>Point-in-time GPU busy metrics miss a lot. Rolling-window low-utilization time, sampled idle-state behavior, and power draw tell a much better story about whether an expensive GPU is actually doing meaningful work over time.</description>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/how-to-measure-gpu-underutilization-on-nvidia-h100-and-h200.html</guid>
    </item>
    <item>
      <title>Why AI Needs a New Memory Hierarchy, Not Just Bigger Caches</title>
      <link>https://manishklach.github.io/writings/why-ai-needs-a-new-memory-hierarchy-not-just-bigger-caches.html</link>
      <description>Bigger L1, L2, and L3 can absolutely help. But if AI systems are going to be designed from scratch, the real opportunity is not a larger generic cache pyramid. It is an AI-native, class-aware residency hierarchy built for weights, KV cache, activations, and experts.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-ai-needs-a-new-memory-hierarchy-not-just-bigger-caches.html</guid>
    </item>
    <item>
      <title>What Bigger L2 Actually Buys You</title>
      <link>https://manishklach.github.io/writings/what-bigger-l2-actually-buys-you.html</link>
      <description>Larger L2 can be one of the highest-ROI levers in AI systems. But the real win is not usually a dramatic reduction in cache-hit latency. It is lower miss rate, lower off-chip pressure, lower energy per useful operation, and a more stable machine under real inference load.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/what-bigger-l2-actually-buys-you.html</guid>
    </item>
    <item>
      <title>vOrchestrate and the Case for Controller-Centric Memory Policy in LLM Inference</title>
      <link>https://manishklach.github.io/writings/vorchestrate-controller-centric-memory-policy.html</link>
      <description>Why dynamic multi-tier weight residency should be treated as a control problem across HBM, DRAM, and NVMe, not just a static quantization or offload problem.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/vorchestrate-controller-centric-memory-policy.html</guid>
    </item>
    <item>
      <title>The Real Tax in AI Systems Is Moving Bytes</title>
      <link>https://manishklach.github.io/writings/the-real-tax-in-ai-systems-is-moving-bytes.html</link>
      <description>More SRAM, more HBM, and more bandwidth all help. But the deeper problem is that many AI systems keep moving, staging, and reloading the same bytes far more often than the workload actually requires.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-real-tax-in-ai-systems-is-moving-bytes.html</guid>
    </item>
    <item>
      <title>SRMIC-X1: A Residency-First Architecture for LLM Decode Acceleration</title>
      <link>https://manishklach.github.io/writings/srmic-x1-rethinking-memory-hierarchy-llm-decode.html</link>
      <description>What happens when you redesign the memory hierarchy around the actual access patterns of autoregressive inference — replacing HBM as the primary decode tier with distributed on-package SRAM, connected by a purpose-built fabric with 2× the bandwidth?</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/srmic-x1-rethinking-memory-hierarchy-llm-decode.html</guid>
    </item>
    <item>
      <title>SLA-Constrained Energy-Aware Inference Scheduling on ARM Edge Systems</title>
      <link>https://manishklach.github.io/writings/sla-constrained-energy-aware-inference-scheduling-arm-edge-systems.html</link>
      <description>A technical essay on a runtime controller that co-optimizes latency SLA, energy, memory residency, DMA policy, model variant selection, and performance-state control for edge inference deployments.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/sla-constrained-energy-aware-inference-scheduling-arm-edge-systems.html</guid>
    </item>
    <item>
      <title>Schwab Portfolio Tools and the Case for Local, Practical Portfolio Infrastructure</title>
      <link>https://manishklach.github.io/writings/schwab-portfolio-tools-local-practical-portfolio-infrastructure.html</link>
      <description>Why privacy-first, local CSV tooling can be more useful than glossy dashboards when the real job is understanding spread exposure, uncovered short puts, IV crush, and portfolio risk.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/schwab-portfolio-tools-local-practical-portfolio-infrastructure.html</guid>
    </item>
    <item>
      <title>Not static quantization. Not simple weight offload. A runtime controller for multi-tier weight residency.</title>
      <link>https://manishklach.github.io/writings/predictive-weight-orchestration-runtime-control-for-multi-tier-weight-residency.html</link>
      <description>Predictive Multi-Tier Weight Residency and Precision Orchestration reframes inference under HBM pressure as a runtime control problem. Instead of treating weights as static assets, it treats them as managed operating state whose precision, placement, and transfer priority must be decided continuously.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/predictive-weight-orchestration-runtime-control-for-multi-tier-weight-residency.html</guid>
    </item>
    <item>
      <title>A vendor-neutral mobile control plane for terminal-native coding agents</title>
      <link>https://manishklach.github.io/writings/mobile-agent-control-vendor-neutral-control-plane-terminal-native-coding-agents.html</link>
      <description>Mobile Agent Control tackles a very specific systems problem: once coding agents become terminal-native, multi-runtime, and machine-local, someone still needs a safe way to launch them, monitor them, steer them, and audit them from anywhere. This repository turns that operational gap into a concrete product architecture.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mobile-agent-control-vendor-neutral-control-plane-terminal-native-coding-agents.html</guid>
    </item>
    <item>
      <title>MHC Atlas OS and the Case for Explainable Structure-Guided Prioritization</title>
      <link>https://manishklach.github.io/writings/mhc-atlas-os-explainable-structure-guided-prioritization.html</link>
      <description>Why a runtime-agnostic, policy-governed, agent-driven system is a stronger foundation for experimental prioritization than opaque black-box prediction alone.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mhc-atlas-os-explainable-structure-guided-prioritization.html</guid>
    </item>
    <item>
      <title>Long-Context Inference Needs Better Memory Policy, Not Just More Memory</title>
      <link>https://manishklach.github.io/writings/long-context-inference-needs-better-memory-policy.html</link>
      <description>Why long-context inference should be managed as region-level policy across both attention mode and memory residency tier, with predictive promotion, reversible demotion, and coherence-aware guardrails.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/long-context-inference-needs-better-memory-policy.html</guid>
    </item>
    <item>
      <title>Introducing ChromeLens: The MRI of Web Performance</title>
      <link>https://manishklach.github.io/writings/introducing-chromelens-systems-grade-web-performance-telemetry.html</link>
      <description>Standard tools like Lighthouse give you a thermometer rating. ChromeLens gives you a deeply technical MRI scanner for your rendering pipeline.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/introducing-chromelens-systems-grade-web-performance-telemetry.html</guid>
    </item>
    <item>
      <title>Hardware-Enforced On-Chip Memory Residency for Neural Network Inference Accelerators</title>
      <link>https://manishklach.github.io/writings/hardware-enforced-on-chip-memory-residency.html</link>
      <description>AI Infrastructure Essay · Patent Application 202641043359</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hardware-enforced-on-chip-memory-residency.html</guid>
    </item>
    <item>
      <title>Deterministic Memory-Orchestrated Inference Using DMA and Bounded On-Chip Buffers</title>
      <link>https://manishklach.github.io/writings/deterministic-memory-orchestrated-inference-dma-bounded-on-chip-buffers.html</link>
      <description>A technical deep dive into a compiler-scheduled inference architecture where DMA, explicit fences, and bounded on-chip buffers pull DRAM off the compute critical path.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/deterministic-memory-orchestrated-inference-dma-bounded-on-chip-buffers.html</guid>
    </item>
    <item>
      <title>Bounce Buffers: The Hidden Tax on Modern AI Systems</title>
      <link>https://manishklach.github.io/writings/bounce-buffers-hidden-tax-ai-systems.html</link>
      <description>Why extra copies and intermediate staging buffers have become one of the least visible, yet most consequential, bottlenecks in long-context inference, retrieval-heavy serving, agentic loops, and modern GPU-centric infrastructure.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/bounce-buffers-hidden-tax-ai-systems.html</guid>
    </item>
    <item>
      <title>AI Cluster Reliability Beyond Fault-Tolerant Parallelism: Gray Failures, Checkpoint Economics, and Seam-Aware Control Planes</title>
      <link>https://manishklach.github.io/writings/beyond-fault-tolerant-parallelism-ai-cluster-reliability.html</link>
      <description>The next frontier in AI infrastructure may not be bigger clusters or cleverer collective communication alone. It may be a new class of seam-aware control architectures that detect gray failures, contain failure amplification, orchestrate checkpoint economics, and couple facilities, fabric, storage, runtime, and business criticality into one system.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/beyond-fault-tolerant-parallelism-ai-cluster-reliability.html</guid>
    </item>
    <item>
      <title>The Next AI Cluster Failure Won’t Look Like a GPU Failure</title>
      <link>https://manishklach.github.io/writings/ai-cluster-failure-seams.html</link>
      <description>The expensive failure in modern AI infrastructure is often not a dead GPU or a crashed node. It is lost productive throughput across a tightly coupled system: cooling, fabric, accelerator pools, runtime behavior, and the schedule attached to the training or inference program.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/ai-cluster-failure-seams.html</guid>
    </item>
    <item>
      <title>Adaptive Compiler–Runtime Power Contract for Energy-Optimal Edge Inference</title>
      <link>https://manishklach.github.io/writings/adaptive-compiler-runtime-power-contract-energy-optimal-edge-inference.html</link>
      <description>A technical explainer on contract-driven inference execution across SRAM, HBM, DRAM, and NVM, with authenticated plans, safe switching boundaries, and runtime enforcement.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/adaptive-compiler-runtime-power-contract-energy-optimal-edge-inference.html</guid>
    </item>
  </channel>
</rss>
