MANISH AI - Writings

Current essays

198 essays · updated May 30, 2026

May 30, 2026 · 25 min read

Close-to-Metal RL Inference at GB300 Scale

A deep systems-level reference covering persistent decode workers, hugepage-backed KV arenas, GPU/NIC command rings, NUMA locality, TLB/IOMMU reduction, cache coherency, and async reward handoff, with full C code.

Close-to-Metal RL Inference at GB300 Scale

GPU-to-GPU Hybrid Bonding: From Multi-GPU Communication to Multi-Die GPU Fabrics

Intent Attention Kernel: Intent-Aware KV Execution for Agentic Long-Context Inference

MLCC: The Invisible Backbone of Modern Electronics

SK Hynix's iHBM: Inside the Technology Redefining AI Memory Cooling

NVIDIA Vera and the Control Plane of the AI Factory

Huawei's Tau Scaling Law and the Beginning of Distance-Centric Computing

What Is a Token? The Atom of AI Language

Disaggregated Memory for Long-Context AI Systems

KV ASIC Part 3: The Chip

Why KV Cache Needs Dedicated Infrastructure

GQA and MQA: Attention, KV Cache, and Faster LLM Inference

CXL and the Search for a New Memory Tier in AI Inference

Beyond CXL: The Case for a KV-Aware Memory ASIC

Retimers: The Hidden Signal Infrastructure of the AI Chip Era

Part 2: How Memory Timings Are Actually Configured in Modern Systems

How Memory Timings Are Actually Configured in Modern Systems

The Rise of AI Kernel Engineers

How Chips Actually Work: From Logic Gates to GPUs, ASICs, TPUs, FPGAs, Accelerators, and the Brain

Zhenwu M890: Why Alibaba Bet on Memory Fabrics, Not Just FLOPS

First Silicon Is Not the Finish Line: Linux Board Bring-Up for x86, Arm, and Accelerators

HBM, DDR, SODIMM and the AI Memory Wall

The Quadrillion-Token Era Has Arrived

Disaggregated Memory — The Architecture Reshaping AI Infrastructure

Why Co-Packaged Optics Is Crucial in Semiconductors

Pretraining Scaling: The Engine Behind Modern AI

GhostKV: Attention Without Storing Full KV

The Memory Wall Is Moving

The SRAM Insight: How I Built a Patent Around AI's Oldest Memory Habit

SRAM as the Specialist: Heterogeneous Inference and the Decode Bottleneck

MFU, BF16, FP8 and AI Numeric Formats: The Complete Guide

The Ideal PagedAttention Stack: Hardware + Software for Long-Context Inference

Four Things Nobody Has Written About FlashDecode

KV Cache, Transformer Memory, and Why TurboQuant Matters

The Compiler Becomes the Memory Scheduler

The Next AI Bottleneck Isn't FLOPs. It's Attention Memory.

Coherent Fabrics: The Memory Highways Behind Agentic AI

The Memory Wall Is the New Compute Wall

Host RAM vs HBM for Inference: What Really Lives Where

The Memory Wall — How AI Infrastructure Is Being Rewired From the Ground Up

KV Cache Is a Memory System

Why AI Inference Needs a Weight Delivery Architecture

Why LLM Prefill and Decode Suffer from Cache Misses, Branch Stalls, and Pipeline Bubbles on CPUs

/dev/mem_hint: A Kernel Control Plane for AI Memory Systems

Teaching Computers to Remember Smarter

HBM Is Not a Cache: What High-Bandwidth Memory Actually Does in AI Systems

MoE Is a Networking Problem Wearing a Model Costume

Power Is Becoming a Scheduling Constraint

Inference Is a Queueing System: Why AI Serving Lives or Dies by Arrival Curves, Batching, and Tail Latency

The Scheduler Is the Product: Why AI Infrastructure Moats Are Becoming Policy Engines

The Real Cost of a Miss: Unifying Cache Misses, KV Misses, Fabric Misses, and Remote Fetch in AI Systems

Rambus: The Hidden Bottleneck Bet in the AI Memory Supercycle

DeepSeek V4: The Model That Refuses to Stop Shocking the World

Intel 18A & 14A: The Nodes That Could Rewrite Chip History

ARM Data Center CPU Report: Comprehensive Technical Analysis

Rambus SOCAMM2: The Modular LPDDR Memory Layer for AI Servers

High-Power Laser Arrays for Co-Packaged Optics: The Hidden Power Plant of AI Interconnects

The Substrate Layer: What Sits Between the Chip and Everything Else

The Brains Behind AI Storage, v2: Smart Controllers, Linux Drivers, and the Full Path from read() to GPU

AI Storage Primer: From NAND Physics to GPUDirect and the Memory Wall

Vera, Venice, AGI, Clearwater: The Coming Wave of AI CPUs

The CPU Is Back: Why AI Broke the GPU-Only Illusion

The AI Power Delivery Stack: How Modern Accelerators Are Really Powered, and What Each Vendor Actually Does

The 800V DC Era Is Here — and It's Rewriting Who Builds AI Power

The Next-Generation AI Chip: Inside the Logic Bridge, Realistic Floorplans, and the Rise of On-Package NUMA

The Next-Generation AI Chip: From Flat HBM to an On-Package NUMA Fabric

NVLink Switch Is Not NVLink: The Scale-Up Fabric Architecture Nobody Fully Explains

800G Ethernet vs. InfiniBand: The AI Scale-Out Decision Nobody Documents Honestly

CXL Is Three Protocols in a Trenchcoat: What .io, .mem, and .cache Actually Do

NCCL Internals: The Collective Communications Layer Nobody Reads — But Everyone Depends On

The Cost of Testing AI Chips Is Exploding

The Hidden Backbone of AI Hardware: How Chips Are Tested Before They Power the World

The Compiler Is the New Kernel: Why MLIR/Triton/XLA Are the Most Underrated Layer in AI Infrastructure

Fat-Tree Is Lying to You: Network Topology as a First-Class Inference Constraint

The Inference Unit Economics Time Bomb: Why $/Token Will Collapse (and What Survives)

Co-Packaged Optics (CPO): The End of Pluggable Transceivers — A Ground-Up Guide

The Machines That Build the Machines Pt. 3 — Beyond Lithography

Inside ASML: The Product Line That Prints the Future (and Why EUV Took 30 Years)