<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Manish K. Lach RSS</title>
    <link>https://manishklach.github.io/writings.html</link>
    <description>Technical essays on AI infrastructure, memory systems, runtimes, and accelerator architecture.</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 19 Apr 2026 00:00:00 GMT</lastBuildDate>
    <item>
      <title>The Whole Stack: Everything Between a Transistor and a Token</title>
      <link>https://manishklach.github.io/writings/the-whole-stack-everything-between-a-transistor-and-a-token.html</link>
      <description>Eighty-plus essays on AI infrastructure eventually converge on one insight: every performance number, every cost figure, every architectural decision is a consequence of a smaller, more physical decision made lower in the stack. This essay connects all of them — from process nodes to token pricing — and shows where the binding constraint actually lives in 2026.</description>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-whole-stack-everything-between-a-transistor-and-a-token.html</guid>
    </item>
    <item>
      <title>Sparse MQA + Fused MoE Mega Kernel + Hyper-connections: The Trifecta Behind Modern Frontier Models</title>
      <link>https://manishklach.github.io/writings/sparse-mqa-fused-moe-hyperconnections-frontier-models.html</link>
      <description>Three architectural innovations each solve one wall that stops dense transformers from scaling — memory, compute, and depth. Together they unlock 2M-token context, trillion-parameter capacity, and 120-layer depth on existing H100 clusters. Here is exactly how each one works and why they are inseparable.</description>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/sparse-mqa-fused-moe-hyperconnections-frontier-models.html</guid>
    </item>
    <item>
      <title>Multi-Tenant KV Fabrics: Bandwidth Contracts for Shared LLM Memory Systems</title>
      <link>https://manishklach.github.io/writings/multi-tenant-kv-fabrics.html</link>
      <description>Manish K L — April 2026. MCOS series, essay 2 of 7. Full version.</description>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/multi-tenant-kv-fabrics.html</guid>
    </item>
    <item>
      <title>What &quot;3nm&quot; Actually Means: Process Nodes, Transistor Physics, and Why Scaling Is Slowing</title>
      <link>https://manishklach.github.io/writings/what-3nm-actually-means.html</link>
      <description>When TSMC announces its N3 process or Intel announces 18A, what is actually changing about the transistor? What does the node number mean — and why does it no longer mean what it used to? This essay explains the physics of transistor scaling from FinFET to Gate-All-Around, why Moore&apos;s Law is slowing, and what it means for AI chip design.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/what-3nm-actually-means.html</guid>
    </item>
    <item>
      <title>The PCIe Tax: How Host-Staged Networking Steals Half Your GPU Bandwidth</title>
      <link>https://manishklach.github.io/writings/the-pcie-tax-why-bypassing-the-host-doubles-moe-throughput.html</link>
      <description>We buy H100s for 3.35 TB/s of HBM3e, then strangle them with a 25-year-old I/O model that forces every byte through the CPU. The fix is not faster PCIe — it is zero crossings.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-pcie-tax-why-bypassing-the-host-doubles-moe-throughput.html</guid>
    </item>
    <item>
      <title>KV Fabrics: Treating Context as a Distributed Filesystem</title>
      <link>https://manishklach.github.io/writings/kv-fabrics-treating-context-as-a-distributed-filesystem.html</link>
      <description>In &quot;The DPU is the New NIC,&quot; we argued that the unit of scheduling has shifted from packets to prompts, and that DPUs must become prompt-aware memory controllers. This essay extends that argument: if prompts are stateful, then the state they produce—the KV cache—must be stored, addressed, and shared like files, not like malloc buffers.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/kv-fabrics-treating-context-as-a-distributed-filesystem.html</guid>
    </item>
    <item>
      <title>Inside the GPU: SMs, Warps, Tensor Cores, and Why the Architecture Looks the Way It Does</title>
      <link>https://manishklach.github.io/writings/inside-the-gpu-sm-warps-tensor-cores.html</link>
      <description>The GPU is the central hardware artefact of the AI era, yet it is almost always treated as a black box — a thing that runs matrix multiplies and produces tokens. This essay opens the box: what lives inside an SM, why warps exist, how tensor cores achieve their throughput, and what the architecture&apos;s design choices mean for AI workloads specifically.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inside-the-gpu-sm-warps-tensor-cores.html</guid>
    </item>
    <item>
      <title>The Memory Bottleneck: Why Inference Speed Is a Memory Problem, Not a Compute Problem</title>
      <link>https://manishklach.github.io/writings/inference-speed-is-a-memory-problem.html</link>
      <description>Every AI processor debate eventually circles back to the same misdiagnosis: we need more compute. The actual constraint is almost never the arithmetic. It is the speed at which memory can feed data to the cores that are already sitting idle, waiting.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inference-speed-is-a-memory-problem.html</guid>
    </item>
    <item>
      <title>HBM Explained: How High Bandwidth Memory Is Actually Built</title>
      <link>https://manishklach.github.io/writings/hbm-how-it-is-actually-built.html</link>
      <description>Every essay in this corpus mentions HBM. None of them explains what it is at the physical level — how DRAM dies are stacked, what a TSV actually does, why the interposer exists, and why stacking is the only path to bandwidth density that silicon physics allows.</description>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hbm-how-it-is-actually-built.html</guid>
    </item>
    <item>
      <title>The True Cost of a Token: Inference Economics from Silicon to SLO</title>
      <link>https://manishklach.github.io/writings/true-cost-of-a-token.html</link>
      <description>A bottom-up cost model connecting HBM bandwidth, cooling kilowatts, NAND geometry, and interconnect physics to the number that actually matters: dollars per million output tokens.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/true-cost-of-a-token.html</guid>
    </item>
    <item>
      <title>RISC vs CISC: Where Complexity Lives</title>
      <link>https://manishklach.github.io/writings/risc-cisc.html</link>
      <description>RISC and CISC aren&apos;t processor brands — they&apos;re philosophies. One says make the hardware simple and push work to software. The other says make the hardware clever to make software easier. Modern chips use both ideas, but understanding the trade-off explains almost everything about why your phone&apos;s processor looks nothing like your laptop&apos;s from 1995.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/risc-cisc.html</guid>
    </item>
    <item>
      <title>The DPU as Agent Memory Controller: Offloading Orchestration from the Host</title>
      <link>https://manishklach.github.io/writings/dpu-agent-memory-controller.html</link>
      <description>Why the 78% CPU / 31% GPU problem in agentic inference isn&apos;t a software bug — it&apos;s an architecture problem.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/dpu-agent-memory-controller.html</guid>
    </item>
    <item>
      <title>Why Agentic AI Is a CPU and DRAM Problem, Not Just a GPU Problem</title>
      <link>https://manishklach.github.io/writings/agentic-ai-cpu-dram-problem.html</link>
      <description>The center of gravity for AI inference is shifting from dense matmuls to stateful orchestration, context movement, and memory capacity.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/agentic-ai-cpu-dram-problem.html</guid>
    </item>
    <item>
      <title>The Agent Topology Problem: Inference Scheduling Across Model Boundaries</title>
      <link>https://manishklach.github.io/writings/agent-topology-problem.html</link>
      <description>Every essay in this corpus has treated inference as one model, one request. The real 2026 workload is a directed graph of model calls. When the unit of scheduling shifts from a request to a pipeline, everything we know about KV locality, memory residency, and latency SLOs must be rebuilt from scratch.</description>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/agent-topology-problem.html</guid>
    </item>
    <item>
      <title>The Power Contract SLO: Making Joules-per-Token Schedulable from Grid to SRAM</title>
      <link>https://manishklach.github.io/writings/power-contract-slo.html</link>
      <description>Manish K L — April 2026. MCOS series, essay 4 of 7.</description>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/power-contract-slo.html</guid>
    </item>
    <item>
      <title>Why AI Servers Need a Memory Fabric, Not Just a Faster Bus</title>
      <link>https://manishklach.github.io/writings/pcie-bottleneck-cxl-escape.html</link>
      <description>The discrete server motherboard is becoming an architectural bottleneck. While local HBM remains the ideal execution state, AI workloads inevitably spill to host memory—triggering a severe bandwidth cliff. Here is a definitive look at why upgrading PCIe generations won&apos;t solve the latency problem, and why the industry is migrating toward CXL-attached tiering and memory fabrics.</description>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/pcie-bottleneck-cxl-escape.html</guid>
    </item>
    <item>
      <title>Floating Point in AI: The Hidden Numbers Running Your Models</title>
      <link>https://manishklach.github.io/writings/floating-point-in-ai.html</link>
      <description>AI is not powered by magic. It is powered by billions of approximate numbers, stored in carefully chosen formats, pushed through specialized hardware at absurd scale. Once you understand floating point, a lot of AI hardware, quantization, and training stability suddenly clicks.</description>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/floating-point-in-ai.html</guid>
    </item>
    <item>
      <title>Why Vera Rubin Changes Everything About Cooling: 45°C Supply Temperature, Fan-Free Trays, and the Infrastructure Behind Them</title>
      <link>https://manishklach.github.io/writings/vera-rubin-cooling-45c-supply-temperature.html</link>
      <description>The Blackwell GB300 NVL72 established a new baseline for AI rack cooling: 142 kW per rack, five-layer direct liquid cooling, 450 quick disconnects per rack, 30–35°C CDU supply water. Vera Rubin NVL72 discards most of those constraints. It operates at 45°C supply temperature, eliminates fans and hoses from the tray design entirely, liquid-cools the busbar, and processes 187–227 kW per rack. Each of those changes cascades through the entire facility infrastructure stack.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/vera-rubin-cooling-45c-supply-temperature.html</guid>
    </item>
    <item>
      <title>Software-Managed Memory Is the TPU&apos;s Real Advantage</title>
      <link>https://manishklach.github.io/writings/tpu-software-managed-memory-real-advantage.html</link>
      <description>No hardware caches. No L1/L2. The TPU exposes raw scratchpad SRAM to the XLA compiler — and that changes everything about how the memory system behaves under load.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/tpu-software-managed-memory-real-advantage.html</guid>
    </item>
    <item>
      <title>Optical Circuit Switching and the 3D Torus</title>
      <link>https://manishklach.github.io/writings/tpu-optical-circuit-switching-3d-torus.html</link>
      <description>GPU clusters route packets through electrical switches. TPU pods route light through MEMS mirrors. The physical network reshapes itself to match the computation&apos;s communication pattern.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/tpu-optical-circuit-switching-3d-torus.html</guid>
    </item>
    <item>
      <title>Why TPUs Have Less HBM Than GPUs</title>
      <link>https://manishklach.github.io/writings/tpu-less-hbm-design-choice-not-limitation.html</link>
      <description>And why that&apos;s a design choice, not a limitation. NVIDIA bets on fat nodes with maximum per-GPU memory. Google bets on thin nodes with maximum inter-chip bandwidth. Both are rational — they optimize for different cost functions.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/tpu-less-hbm-design-choice-not-limitation.html</guid>
    </item>
    <item>
      <title>The Semiconductor Ecosystem</title>
      <link>https://manishklach.github.io/writings/the-semiconductor-ecosystem.html</link>
      <description>Every modern technology — from the phone in your pocket to the GPU training the world&apos;s largest models — depends on a semiconductor supply chain of staggering complexity. This post maps the entire ecosystem: the companies, the dependencies, the chokepoints, and the technologies that make a chip possible. If you want to understand why ASML, TSMC, and Lam Research matter, start here.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-semiconductor-ecosystem.html</guid>
    </item>
    <item>
      <title>The Storage Geometry of a 100,000-GPU Cluster</title>
      <link>https://manishklach.github.io/writings/storage-geometry-100k-gpu-cluster-nand-demand.html</link>
      <description>How much NAND does AI actually need? A full accounting — from per-node E1.S drives to petabyte-scale checkpoint reservoirs, dataset lakes, and the NAND supply pressure that hyperscale AI is creating.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/storage-geometry-100k-gpu-cluster-nand-demand.html</guid>
    </item>
    <item>
      <title>The Recompute-vs-Transfer Frontier</title>
      <link>https://manishklach.github.io/writings/recompute-vs-transfer-frontier-inference.html</link>
      <description>When redoing work is cheaper than moving bytes — and why eviction policies should know the difference between pages worth storing and pages worth recomputing.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/recompute-vs-transfer-frontier-inference.html</guid>
    </item>
    <item>
      <title>NAND Flash Is the Invisible Backbone of Every AI Cluster</title>
      <link>https://manishklach.github.io/writings/nand-flash-invisible-backbone-ai-clusters.html</link>
      <description>The GPU gets the credit. The SSD does the lifting. A systems essay on the five roles NAND flash plays in modern AI training and inference infrastructure — from checkpoint absorption to KV-cache offloading.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/nand-flash-invisible-backbone-ai-clusters.html</guid>
    </item>
    <item>
      <title>The Scheduling Tax of Multi-Tenant Inference</title>
      <link>https://manishklach.github.io/writings/multi-tenant-inference-memory-fairness.html</link>
      <description>Why fairness and throughput in shared inference pools are fundamentally a memory problem — and why the scheduler&apos;s view of &quot;fair&quot; diverges from the memory system&apos;s reality by up to 64×.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/multi-tenant-inference-memory-fairness.html</guid>
    </item>
    <item>
      <title>KV Hierarchy Lab: Regret-Aware Eviction</title>
      <link>https://manishklach.github.io/writings/kv-hierarchy-lab-regret-aware-eviction-trace-driven-policy-evaluation.html</link>
      <description>A trace-driven research harness for studying KV-cache residency, eviction, quantization, and prefetch tradeoffs — featuring a regret-aware eviction policy with multi-seed ablation results on synthetic traces.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/kv-hierarchy-lab-regret-aware-eviction-trace-driven-policy-evaluation.html</guid>
    </item>
    <item>
      <title>Inference Batch Geometry</title>
      <link>https://manishklach.github.io/writings/inference-batch-geometry-memory-cost.html</link>
      <description>Why the shape of a batch — the distribution of sequence lengths, generation phases, and quantization formats within it — determines its memory cost more than its size.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inference-batch-geometry-memory-cost.html</guid>
    </item>
    <item>
      <title>High Bandwidth Flash: The Missing Tier</title>
      <link>https://manishklach.github.io/writings/high-bandwidth-flash-hbf-missing-tier-ai-inference.html</link>
      <description>HBM is fast but small and expensive. NVMe is large but slow. HBF (High Bandwidth Flash) is the NAND-based memory tier designed to sit between them — offering 8–16× the capacity of HBM while coming within 2.2% of its read performance for inference workloads.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/high-bandwidth-flash-hbf-missing-tier-ai-inference.html</guid>
    </item>
    <item>
      <title>The Cooling Stack Is the New Critical Path: How Blackwell GB300 NVL72 Racks Manage 142 kW</title>
      <link>https://manishklach.github.io/writings/blackwell-gb300-nvl72-cooling-stack.html</link>
      <description>Cooling is no longer a facilities afterthought that gets sorted out after the GPUs arrive. For the Blackwell GB300 NVL72, the thermal system is a precision five-layer architecture with specific component vendors, defined thermal budgets at every interface, and failure modes that propagate directly into training throughput and inference latency. This is how it actually works — from the cold plate on the die to the chiller outside the building.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/blackwell-gb300-nvl72-cooling-stack.html</guid>
    </item>
    <item>
      <title>The Power Stack: How AI Scale-Out Gets Its Electricity — From Grid to GPU</title>
      <link>https://manishklach.github.io/writings/ai-power-stack-grid-to-gpu.html</link>
      <description>A single Vera Rubin NVL72 rack draws up to 227 kW. A 100,000-GPU training cluster draws somewhere between 300 MW and 1 GW continuously. An AI factory the scale Meta is building in Louisiana may eventually draw 5 GW — approximately half the average electricity consumption of New York City. None of that power is free, and none of it arrives at the GPU without passing through a stack of technologies built by a specific set of companies, each owning a distinct and non-interchangeable piece.</description>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/ai-power-stack-grid-to-gpu.html</guid>
    </item>
    <item>
      <title>Why HBM Thermal Throttling Is Silent: Reading the Tea Leaves in nvidia-smi</title>
      <link>https://manishklach.github.io/writings/why-hbm-thermal-throttling-is-silent.html</link>
      <description>On Hopper, nvidia-smi will happily show GPU Temp: 60°C while your HBM3e silently throttles at 95°C+, dropping memory bandwidth from 3.1 TB/s to 2.3 TB/s and spiking your vLLM p99 latency by 30%+. The core-temp sensor and power-throttle flags are no longer sufficient to diagnose memory-bound inference stalls. You need NVML&apos;s NVML_TEMPERATURE_MEMORY enum or DCGM field 252. This post explains the full signal path — why NVIDIA hides it, how to detect it across all three methods, and how to act on it.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-hbm-thermal-throttling-is-silent.html</guid>
    </item>
    <item>
      <title>Why &quot;Disk → RDMA → GPU&quot; Is Still Fragmented Today</title>
      <link>https://manishklach.github.io/writings/why-disk-rdma-gpu-is-still-fragmented-today.html</link>
      <description>We have local direct-storage acceleration and network direct-memory acceleration, but we still do not have a universal GPU-native end-to-end storage fabric. The missing layer is orchestration.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-disk-rdma-gpu-is-still-fragmented-today.html</guid>
    </item>
    <item>
      <title>vLLM Internals: Where to Actually Cut Batch Size When the GPU is Melting</title>
      <link>https://manishklach.github.io/writings/vllm-internals-cut-batch-size-gpu-melting.html</link>
      <description>Production vLLM deployments running at 90%+ GPU util will eventually hit thermal limits. Counterintuitively, reducing max_num_seqs during a thermal event often does nothing. This post traces exactly where batch size is enforced, why decode batches ignore your knobs, and how to build a thermal-aware scheduler that actually saves your GPUs.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/vllm-internals-cut-batch-size-gpu-melting.html</guid>
    </item>
    <item>
      <title>Thermal Debt Is a Memory Problem — How Hot Dies Throttle Your KV Prefetch</title>
      <link>https://manishklach.github.io/writings/thermal-debt-memory-problem-hot-dies-throttle-kv-prefetch.html</link>
      <description>If you profile a large-context LLM job on an H100 cluster, you will see a familiar pattern: prefill starts at 2.1x nominal tokens/s, then slowly sags to 1.7x over the first 90 seconds. Your first instinct is power capping. You check nvidia-smi and see the GPU at 690W, below the 700W limit. Clocks are stable. PCIe is idle. The NVLink counters show no contention. Yet the KV load units are stalling.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/thermal-debt-memory-problem-hot-dies-throttle-kv-prefetch.html</guid>
    </item>
    <item>
      <title>Seam Orchestrator: Workload-Aware KV Routing</title>
      <link>https://manishklach.github.io/writings/seam-orchestrator-workload-aware-kv-routing.html</link>
      <description>Experiments in admissibility, capacity-aware routing, hysteresis, and replacement-feasible glue layers above transport — for disaggregated inference systems where the policy question matters as much as the transport question.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/seam-orchestrator-workload-aware-kv-routing.html</guid>
    </item>
    <item>
      <title>RDMA in the Age of AI</title>
      <link>https://manishklach.github.io/writings/rdma-in-the-age-of-ai.html</link>
      <description>Remote Direct Memory Access is old enough to feel familiar and new enough to matter again. In the AI era, RDMA is no longer just an HPC detail — it sits in the critical path of GPU clusters, KV-cache transfer, and disaggregated inference. The real question isn&apos;t &quot;is the link fast?&quot; It&apos;s where copies still remain, and what software layer has the right to reason about them.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/rdma-in-the-age-of-ai.html</guid>
    </item>
    <item>
      <title>MCOS Must Live in Hardware</title>
      <link>https://manishklach.github.io/writings/mcos-must-live-in-hardware-ai-memory-fabrics.html</link>
      <description>From JBOD to NAS, storage became powerful when it stopped being passive. The same shift is coming for AI memory systems: data movement needs a brain, and that brain cannot live only in software.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mcos-must-live-in-hardware-ai-memory-fabrics.html</guid>
    </item>
    <item>
      <title>MCOS: A Memory-Centric Operating System for AI</title>
      <link>https://manishklach.github.io/writings/mcos-memory-centric-operating-system-ai-systems.html</link>
      <description>A software-defined control layer that treats memory placement, movement, residency, reuse, admission, and eviction as first-class scheduling objects — not passive implementation details.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mcos-memory-centric-operating-system-ai-systems.html</guid>
    </item>
    <item>
      <title>MCOS-HFC: A Hardware Fabric Controller for Memory-Centric AI</title>
      <link>https://manishklach.github.io/writings/mcos-hfc-hardware-fabric-controller-memory-centric-ai-systems.html</link>
      <description>Software defines the intent. Hardware enforces the residency, movement, admission, eviction, and handoff — near the fabric and the accelerator, without touching the CPU hot path.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mcos-hfc-hardware-fabric-controller-memory-centric-ai-systems.html</guid>
    </item>
    <item>
      <title>gemma4-wdc: A middleware layer that stops local agents from doing the same work twice</title>
      <link>https://manishklach.github.io/writings/local-agent-deduplication-middleware.html</link>
      <description>As local multi-agent systems get practical, the next bottleneck is not the models. It is the duplicated downstream work underneath them. This essay explains the middleware primitive, the bounded admission window, the safety case, and the laptop-scale prototype I built to make that behavior visible.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/local-agent-deduplication-middleware.html</guid>
    </item>
    <item>
      <title>HBM Throttling-Safe KV Admission and ReuseNet for LLM Inference</title>
      <link>https://manishklach.github.io/writings/hbm-throttling-safe-kv-admission-reusenet-llm-inference.html</link>
      <description>A production-grade admission policy with thermal backpressure and learned reuse probability that eliminates HBM thermal throttling while maintaining 99.2% P50 TTFT under load.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hbm-throttling-safe-kv-admission-reusenet-llm-inference.html</guid>
    </item>
    <item>
      <title>HBM Fragmentation Guard: Confidence-Gated Residency Control for AI Accelerators</title>
      <link>https://manishklach.github.io/writings/hbm-fragmentation-guard.html</link>
      <description>Why LRU is the wrong eviction policy for LLM inference, and what to do about it.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hbm-fragmentation-guard.html</guid>
    </item>
    <item>
      <title>From SSD to GPU to SRAM: Why the Last Bottleneck Is Now On-Chip</title>
      <link>https://manishklach.github.io/writings/from-ssd-to-gpu-to-sram-last-bottleneck-on-chip.html</link>
      <description>Once host-memory bounce buffers are reduced, the bottleneck shifts inward. The next frontier is not just storage-to-GPU delivery, but deterministic movement between HBM and on-chip SRAM.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/from-ssd-to-gpu-to-sram-last-bottleneck-on-chip.html</guid>
    </item>
    <item>
      <title>CPO Isn’t About Power. It’s About Making Memory Disaggregation Schedulable</title>
      <link>https://manishklach.github.io/writings/cpo-making-memory-disaggregation-schedulable.html</link>
      <description>Every major analyst report on Co-Packaged Optics opens with the same chart: watts per terabit. CPO at 5 pJ/bit versus 15 pJ/bit for pluggables. LPO somewhere in the middle at 7 pJ/bit if you squint and assume perfect link conditions. The narrative is that we need CPO because the 51.2T and 102.4T switch radixes will melt faceplates.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/cpo-making-memory-disaggregation-schedulable.html</guid>
    </item>
    <item>
      <title>When 128 TB Stops Feeling Infinite</title>
      <link>https://manishklach.github.io/writings/address-space-mmap-quiet-limits-modern-systems.html</link>
      <description>For years, 48-bit virtual addressing sounded effectively unbounded. Then storage kept scaling, mmap stayed convenient, and modern systems quietly wandered into a new class of limit: not RAM exhaustion, but address-space exhaustion.</description>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/address-space-mmap-quiet-limits-modern-systems.html</guid>
    </item>
    <item>
      <title>Thermal Debt in AI Clusters: The Silent Degradation Loop Nobody Is Measuring</title>
      <link>https://manishklach.github.io/writings/thermal-debt-ai-clusters.html</link>
      <description>Dense GPU racks do not fail suddenly. They degrade. Thermal energy accumulates across requests, across shifts, across weeks — in the substrate, in the power delivery, in the packaging. Current observability stacks measure point-in-time temperatures and call the cluster healthy. This essay maps the degradation loop, its failure modes, and what a thermally-aware control plane would need to know.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/thermal-debt-ai-clusters.html</guid>
    </item>
    <item>
      <title>TechDemoForge: A Local-First Engine for Turning Technical Docs into Demo Videos</title>
      <link>https://manishklach.github.io/writings/techdemoforge-local-first-engine-turning-technical-docs-into-demo-videos.html</link>
      <description>A detailed look at what TechDemoForge is for, why it is different from generic AI video tooling, how the repository is structured, and where MiniMax-M2.7 fits in the workflow.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/techdemoforge-local-first-engine-turning-technical-docs-into-demo-videos.html</guid>
    </item>
    <item>
      <title>The Photonics Stack: Who Builds What for AI Networking</title>
      <link>https://manishklach.github.io/writings/photonics-stack-ai-networking-part-1.html</link>
      <description>AI clusters need extraordinary amounts of light. Not metaphorically — literally. Every GPU-to-GPU communication in a training cluster travels as photons over fiber. The companies that generate, shape, modulate, route, amplify, and test those photons are the unseen infrastructure of the AI era. This is a technical map of eight of them: what each one actually builds, where it sits in the stack, and why its piece of the problem is hard.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/photonics-stack-ai-networking-part-1.html</guid>
    </item>
    <item>
      <title>The Attention Sink Problem: Why Transformer Inference Wastes More Memory Than You Think</title>
      <link>https://manishklach.github.io/writings/attention-sink-problem-transformer-inference-memory-waste.html</link>
      <description>Transformers are forced by softmax to attend somewhere — and when there is no semantically meaningful target, they pile attention mass onto a single structural artifact: the first token. That token is useless for prediction. But its KV state must live in memory forever. This is the attention sink problem, and it is quietly inflating your cache budget at every sequence length.</description>
      <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/attention-sink-problem-transformer-inference-memory-waste.html</guid>
    </item>
    <item>
      <title>The Real AI Bottleneck Is Moving From Compute to Interconnect Power Density</title>
      <link>https://manishklach.github.io/writings/the-real-ai-bottleneck-is-moving-from-compute-to-interconnect-power-density.html</link>
      <description>For years the conversation was about FLOPS. Then it became about memory bandwidth. Now the harder wall is not raw transport speed but the full stack cost of moving bits: SerDes power, retimers, DSP burden, switch radix pressure, optical engines, cooling overhead, routing complexity, and the topology tax of making many accelerators behave like one machine.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-real-ai-bottleneck-is-moving-from-compute-to-interconnect-power-density.html</guid>
    </item>
    <item>
      <title>Speculative Decoding Is a Memory Problem</title>
      <link>https://manishklach.github.io/writings/speculative-decoding-is-a-memory-problem.html</link>
      <description>Everyone talks about speculative decoding as a compute trick — a draft model generates tokens cheaply, the target model verifies them in parallel, and the system enjoys superlinear throughput. That framing is correct but incomplete. The real constraint, at serving scale, is memory: dual KV caches, divergence-driven rollback, verify-phase bandwidth spikes, and draft residency pressure that fights directly with concurrent request capacity.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/speculative-decoding-is-a-memory-problem.html</guid>
    </item>
    <item>
      <title>Scale-Out Was Yesterday. Scale-Up Optics Is the Next Battle</title>
      <link>https://manishklach.github.io/writings/scale-out-was-yesterday-scale-up-optics-is-the-next-battle.html</link>
      <description>For years, optical networking discussion centered on classic scale-out: rack-to-rack, row-to-row, and datacenter interconnect. AI is changing the center of gravity.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/scale-out-was-yesterday-scale-up-optics-is-the-next-battle.html</guid>
    </item>
    <item>
      <title>Prefill-Decode Disaggregation: Why the Next Big Inference Architecture Splits the Job in Two</title>
      <link>https://manishklach.github.io/writings/prefill-decode-disaggregation-why-the-next-big-inference-architecture-splits-the-job-in-two.html</link>
      <description>Prefill and decode have opposite resource profiles. Prefill is compute-bound — it processes the entire input context in a single large matrix multiply and wants high FLOP density. Decode is memory-bandwidth-bound — it generates one token per step and spends most of its time reading weights and KV cache from HBM. Running them on the same GPU imposes a hardware compromise that satisfies neither. Disaggregation separates them — and exposes a new class of memory orchestration problems.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/prefill-decode-disaggregation-why-the-next-big-inference-architecture-splits-the-job-in-two.html</guid>
    </item>
    <item>
      <title>The Next Frontier in Long-Context Inference: Memory-Orchestrated Sparse Serving</title>
      <link>https://manishklach.github.io/writings/memory-orchestrated-sparse-serving-next-frontier-in-long-context-inference.html</link>
      <description>Sparse attention is an important step, but it is not the full answer. Once sparse decode becomes cheap enough computationally, the bottleneck shifts to something less glamorous and more decisive: KV residency and movement policy.</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/memory-orchestrated-sparse-serving-next-frontier-in-long-context-inference.html</guid>
    </item>
    <item>
      <title>InP vs Silicon Photonics vs VCSEL: The Materials Stack Behind AI Networking</title>
      <link>https://manishklach.github.io/writings/inp-vs-silicon-photonics-vs-vcsel-ai-networking.html</link>
      <description>After bandwidth, power density, and failure domains, the next question is physical: which materials will AI networking actually be built from?</description>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/inp-vs-silicon-photonics-vs-vcsel-ai-networking.html</guid>
    </item>
    <item>
      <title>Why Cache Coherency Is the Wrong Default for AI Machines</title>
      <link>https://manishklach.github.io/writings/why-cache-coherency-is-the-wrong-default-for-ai-machines.html</link>
      <description>Classical cache coherency made CPUs programmable and pleasant. AI machines live under a different regime: giant working sets, mostly read-heavy model state, throughput-first economics, and increasingly explicit data movement. In that world, strict coherency stops looking like a universal good and starts looking like an expensive default.</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-cache-coherency-is-the-wrong-default-for-ai-machines.html</guid>
    </item>
    <item>
      <title>The Memory Scheduler Is the New Critical Path in AI Inference</title>
      <link>https://manishklach.github.io/writings/the-memory-scheduler-is-the-new-critical-path-in-ai-inference.html</link>
      <description>When hardware coherency steps back, software scheduling steps forward. As AI machines move from transparent coherence to explicit data movement, the memory scheduler quietly becomes the most performance-critical component in the stack — and most systems aren&apos;t treating it that way yet.</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-memory-scheduler-is-the-new-critical-path-in-ai-inference.html</guid>
    </item>
    <item>
      <title>Photonics Is No Longer a Component Story — It Is Becoming the Operating System of AI Clusters</title>
      <link>https://manishklach.github.io/writings/photonics-becoming-the-operating-system-of-ai-clusters.html</link>
      <description>The industry still talks about optics as if it lives inside modules, pluggables, and line cards. AI is changing that. Once bandwidth, power density, thermals, and collective communication start dominating cluster design, optics stops being a passive transport layer and starts becoming something the scheduler, runtime, and topology planner have to understand.</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/photonics-becoming-the-operating-system-of-ai-clusters.html</guid>
    </item>
    <item>
      <title>CPO, LPO, DSP, and VCSEL: What Actually Matters for AI Infrastructure</title>
      <link>https://manishklach.github.io/writings/cpo-lpo-dsp-vcsel-what-actually-matters-for-ai-infrastructure.html</link>
      <description>The optics conversation is getting noisy. Co-packaged optics, linear pluggables, DSP-heavy modules, VCSEL scale-up links, silicon photonics, InP — it is easy to turn all of it into buzzwords. The useful question is simpler: which technology solves which bottleneck, for which part of the AI fabric, under which operational constraints?</description>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/cpo-lpo-dsp-vcsel-what-actually-matters-for-ai-infrastructure.html</guid>
    </item>
    <item>
      <title>When VRAM Stops Being a Weight Warehouse</title>
      <link>https://manishklach.github.io/writings/when-vram-stops-being-a-weight-warehouse.html</link>
      <description>A systems primer on HBM, KV cache, PagedAttention, weight offload — and why the real end state isn&apos;t smarter offloading, it&apos;s treating GPU memory as a bounded execution working set that weights pass through, not live in.</description>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/when-vram-stops-being-a-weight-warehouse.html</guid>
    </item>
    <item>
      <title>Memory Intent IR: The Missing Compiler Output</title>
      <link>https://manishklach.github.io/writings/memory-intent-ir-why-ai-compilers-must-emit-memory-plans.html</link>
      <description>Compute graphs tell the system what to calculate. They stay silent about how the objects involved are expected to live, compete, expire, and move. That silence is becoming a first-order performance tax.</description>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/memory-intent-ir-why-ai-compilers-must-emit-memory-plans.html</guid>
    </item>
    <item>
      <title>How to Measure GPU Underutilization on NVIDIA H100 and H200</title>
      <link>https://manishklach.github.io/writings/how-to-measure-gpu-underutilization-on-nvidia-h100-and-h200.html</link>
      <description>Point-in-time GPU busy metrics miss a lot. Rolling-window low-utilization time, sampled idle-state behavior, and power draw tell a much better story about whether an expensive GPU is actually doing meaningful work over time.</description>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/how-to-measure-gpu-underutilization-on-nvidia-h100-and-h200.html</guid>
    </item>
    <item>
      <title>Why AI Needs a New Memory Hierarchy, Not Just Bigger Caches</title>
      <link>https://manishklach.github.io/writings/why-ai-needs-a-new-memory-hierarchy-not-just-bigger-caches.html</link>
      <description>Bigger L1, L2, and L3 can absolutely help. But if AI systems are going to be designed from scratch, the real opportunity is not a larger generic cache pyramid. It is an AI-native, class-aware residency hierarchy built for weights, KV cache, activations, and experts.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/why-ai-needs-a-new-memory-hierarchy-not-just-bigger-caches.html</guid>
    </item>
    <item>
      <title>What Bigger L2 Actually Buys You</title>
      <link>https://manishklach.github.io/writings/what-bigger-l2-actually-buys-you.html</link>
      <description>Larger L2 can be one of the highest-ROI levers in AI systems. But the real win is not usually a dramatic reduction in cache-hit latency. It is lower miss rate, lower off-chip pressure, lower energy per useful operation, and a more stable machine under real inference load.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/what-bigger-l2-actually-buys-you.html</guid>
    </item>
    <item>
      <title>vOrchestrate and the Case for Controller-Centric Memory Policy in LLM Inference</title>
      <link>https://manishklach.github.io/writings/vorchestrate-controller-centric-memory-policy.html</link>
      <description>Why dynamic multi-tier weight residency should be treated as a control problem across HBM, DRAM, and NVMe, not just a static quantization or offload problem.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/vorchestrate-controller-centric-memory-policy.html</guid>
    </item>
    <item>
      <title>The Real Tax in AI Systems Is Moving Bytes</title>
      <link>https://manishklach.github.io/writings/the-real-tax-in-ai-systems-is-moving-bytes.html</link>
      <description>More SRAM, more HBM, and more bandwidth all help. But the deeper problem is that many AI systems keep moving, staging, and reloading the same bytes far more often than the workload actually requires.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/the-real-tax-in-ai-systems-is-moving-bytes.html</guid>
    </item>
    <item>
      <title>SRMIC-X1: A Residency-First Architecture for LLM Decode Acceleration</title>
      <link>https://manishklach.github.io/writings/srmic-x1-rethinking-memory-hierarchy-llm-decode.html</link>
      <description>What happens when you redesign the memory hierarchy around the actual access patterns of autoregressive inference — replacing HBM as the primary decode tier with distributed on-package SRAM, connected by a purpose-built fabric with 2× the bandwidth?</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/srmic-x1-rethinking-memory-hierarchy-llm-decode.html</guid>
    </item>
    <item>
      <title>SLA-Constrained Energy-Aware Inference Scheduling on ARM Edge Systems</title>
      <link>https://manishklach.github.io/writings/sla-constrained-energy-aware-inference-scheduling-arm-edge-systems.html</link>
      <description>A technical essay on a runtime controller that co-optimizes latency SLA, energy, memory residency, DMA policy, model variant selection, and performance-state control for edge inference deployments.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/sla-constrained-energy-aware-inference-scheduling-arm-edge-systems.html</guid>
    </item>
    <item>
      <title>Schwab Portfolio Tools and the Case for Local, Practical Portfolio Infrastructure</title>
      <link>https://manishklach.github.io/writings/schwab-portfolio-tools-local-practical-portfolio-infrastructure.html</link>
      <description>Why privacy-first, local CSV tooling can be more useful than glossy dashboards when the real job is understanding spread exposure, uncovered short puts, IV crush, and portfolio risk.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/schwab-portfolio-tools-local-practical-portfolio-infrastructure.html</guid>
    </item>
    <item>
      <title>Not static quantization. Not simple weight offload. A runtime controller for multi-tier weight residency.</title>
      <link>https://manishklach.github.io/writings/predictive-weight-orchestration-runtime-control-for-multi-tier-weight-residency.html</link>
      <description>Predictive Multi-Tier Weight Residency and Precision Orchestration reframes inference under HBM pressure as a runtime control problem. Instead of treating weights as static assets, it treats them as managed operating state whose precision, placement, and transfer priority must be decided continuously.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/predictive-weight-orchestration-runtime-control-for-multi-tier-weight-residency.html</guid>
    </item>
    <item>
      <title>A vendor-neutral mobile control plane for terminal-native coding agents</title>
      <link>https://manishklach.github.io/writings/mobile-agent-control-vendor-neutral-control-plane-terminal-native-coding-agents.html</link>
      <description>Mobile Agent Control tackles a very specific systems problem: once coding agents become terminal-native, multi-runtime, and machine-local, someone still needs a safe way to launch them, monitor them, steer them, and audit them from anywhere. This repository turns that operational gap into a concrete product architecture.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mobile-agent-control-vendor-neutral-control-plane-terminal-native-coding-agents.html</guid>
    </item>
    <item>
      <title>MHC Atlas OS and the Case for Explainable Structure-Guided Prioritization</title>
      <link>https://manishklach.github.io/writings/mhc-atlas-os-explainable-structure-guided-prioritization.html</link>
      <description>Why a runtime-agnostic, policy-governed, agent-driven system is a stronger foundation for experimental prioritization than opaque black-box prediction alone.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/mhc-atlas-os-explainable-structure-guided-prioritization.html</guid>
    </item>
    <item>
      <title>Long-Context Inference Needs Better Memory Policy, Not Just More Memory</title>
      <link>https://manishklach.github.io/writings/long-context-inference-needs-better-memory-policy.html</link>
      <description>Why long-context inference should be managed as region-level policy across both attention mode and memory residency tier, with predictive promotion, reversible demotion, and coherence-aware guardrails.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/long-context-inference-needs-better-memory-policy.html</guid>
    </item>
    <item>
      <title>Introducing ChromeLens: The MRI of Web Performance</title>
      <link>https://manishklach.github.io/writings/introducing-chromelens-systems-grade-web-performance-telemetry.html</link>
      <description>Standard tools like Lighthouse give you a thermometer rating. ChromeLens gives you a deeply technical MRI scanner for your rendering pipeline.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/introducing-chromelens-systems-grade-web-performance-telemetry.html</guid>
    </item>
    <item>
      <title>Hardware-Enforced On-Chip Memory Residency for Neural Network Inference Accelerators</title>
      <link>https://manishklach.github.io/writings/hardware-enforced-on-chip-memory-residency.html</link>
      <description>AI Infrastructure Essay · Patent Application 202641043359</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/hardware-enforced-on-chip-memory-residency.html</guid>
    </item>
    <item>
      <title>Deterministic Memory-Orchestrated Inference Using DMA and Bounded On-Chip Buffers</title>
      <link>https://manishklach.github.io/writings/deterministic-memory-orchestrated-inference-dma-bounded-on-chip-buffers.html</link>
      <description>A technical deep dive into a compiler-scheduled inference architecture where DMA, explicit fences, and bounded on-chip buffers pull DRAM off the compute critical path.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/deterministic-memory-orchestrated-inference-dma-bounded-on-chip-buffers.html</guid>
    </item>
    <item>
      <title>Bounce Buffers: The Hidden Tax on Modern AI Systems</title>
      <link>https://manishklach.github.io/writings/bounce-buffers-hidden-tax-ai-systems.html</link>
      <description>Why extra copies and intermediate staging buffers have become one of the least visible, yet most consequential, bottlenecks in long-context inference, retrieval-heavy serving, agentic loops, and modern GPU-centric infrastructure.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/bounce-buffers-hidden-tax-ai-systems.html</guid>
    </item>
    <item>
      <title>AI Cluster Reliability Beyond Fault-Tolerant Parallelism: Gray Failures, Checkpoint Economics, and Seam-Aware Control Planes</title>
      <link>https://manishklach.github.io/writings/beyond-fault-tolerant-parallelism-ai-cluster-reliability.html</link>
      <description>The next frontier in AI infrastructure may not be bigger clusters or cleverer collective communication alone. It may be a new class of seam-aware control architectures that detect gray failures, contain failure amplification, orchestrate checkpoint economics, and couple facilities, fabric, storage, runtime, and business criticality into one system.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/beyond-fault-tolerant-parallelism-ai-cluster-reliability.html</guid>
    </item>
    <item>
      <title>The Next AI Cluster Failure Won’t Look Like a GPU Failure</title>
      <link>https://manishklach.github.io/writings/ai-cluster-failure-seams.html</link>
      <description>The expensive failure in modern AI infrastructure is often not a dead GPU or a crashed node. It is lost productive throughput across a tightly coupled system: cooling, fabric, accelerator pools, runtime behavior, and the schedule attached to the training or inference program.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/ai-cluster-failure-seams.html</guid>
    </item>
    <item>
      <title>Adaptive Compiler–Runtime Power Contract for Energy-Optimal Edge Inference</title>
      <link>https://manishklach.github.io/writings/adaptive-compiler-runtime-power-contract-energy-optimal-edge-inference.html</link>
      <description>A technical explainer on contract-driven inference execution across SRAM, HBM, DRAM, and NVM, with authenticated plans, safe switching boundaries, and runtime enforcement.</description>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <guid>https://manishklach.github.io/writings/adaptive-compiler-runtime-power-contract-energy-optimal-edge-inference.html</guid>
    </item>
  </channel>
</rss>
