
LLM Architecture Survey: From Transformers to Modern Variants


A technical survey of large language model architectures covering backbone design, attention mechanisms, sparsity patterns, mixture-of-experts, normalization strategies, loss modifications, position encodings, and systems-level considerations for training and inference at scale.

Keywords: Pre-norm, GQA, SwiGLU, RoPE, MoE, FlashAttention, Z-loss

1. Introduction

The transformer architecture introduced by Vaswani et al. [1] remains the dominant paradigm for large language models. However, nearly every component has been re-examined and modified to improve training stability, inference efficiency, and scaling behavior. Modern LLMs deviate significantly from the original encoder-decoder design.

This survey tracks the evolution from the vanilla transformer to current decoder-only dense and mixture-of-experts models. We focus on architectural choices that meaningfully impact model quality, throughput, and memory at the 7B to 500B+ parameter scale. The goal is to systematize the design space and connect implementation details to first-principles trade-offs.

Scope: We cover PaLM [2], LLaMA 2 [3], Mistral [12], Command A [13], and related open models. We omit multimodal and retrieval-augmented extensions. Scaling laws from Kaplan et al. [10] provide the theoretical backdrop.

2. Transformer Backbone

All modern LLMs use a decoder-only architecture with causal masking. The encoder-decoder structure was dropped early, both for simplicity and because decoder-only models trained with next-token prediction were observed to scale smoothly to zero-shot and few-shot tasks.

2.1 Standard Decoder Block

A single transformer block composes attention and feed-forward networks around residual connections. The exact ordering has major implications for gradient flow and training stability.

2.2 Pre-norm vs Post-norm Gradient Flow

The original transformer used post-norm, but this becomes unstable beyond ~12 layers without careful learning rate warmup. Pre-norm was adopted by GPT-2 and is now universal. The difference is critical for scaling to 100+ layers.
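
The difference is easiest to see in code. Below is a minimal NumPy sketch, with RMSNorm (see section 6) standing in for the normalization and `attn`/`ffn` passed as callables; these names and the function shapes are illustrative assumptions, not any particular model's implementation:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Bias-free normalization (section 6); any norm works for this comparison.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, attn, ffn):
    # Pre-norm: the norm sits *inside* each residual branch, so the skip
    # path is a pure identity and gradients reach early layers unscaled.
    x = x + attn(rms_norm(x))
    return x + ffn(rms_norm(x))

def post_norm_block(x, attn, ffn):
    # Post-norm (original transformer): the norm sits *on* the residual
    # stream, rescaling it and its gradients at every layer.
    x = rms_norm(x + attn(x))
    return rms_norm(x + ffn(x))
```

With zeroed-out branches, the pre-norm block is exactly the identity, which is why stacking 100+ of them stays trainable: the residual stream is never rescaled.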

3. Attention Revisited

Self-attention remains O(N²D) in time and memory, where N is sequence length and D is model dimension. For N=32k, attention dominates compute and memory. Optimizations target both the activation footprint and the arithmetic intensity.

3.1 MHA, MQA, and GQA

Multi-Head Attention (MHA) uses H independent queries, keys, and values. During autoregressive decoding, the KV-cache stores 2NHD_k elements per layer, where D_k = D/H. For LLaMA-65B with H=64, D_k=128, N=2048, and 80 layers, this is ~5.4GB per sequence in FP16; a batch of 24 sequences already exceeds 128GB.

Multi-Query Attention (MQA) [4] shares a single K and V across all heads, reducing the KV-cache to 2ND_k. This cuts cache memory by a factor of H but degrades quality. Grouped-Query Attention (GQA) [5] interpolates by using G key-value heads for H query heads, with G < H. LLaMA 2 70B uses G=8, H=64.

Method   Q Heads   KV Heads   KV-Cache Size   Quality
MHA      H         H          2NHD_k          Baseline
GQA      H         G          2NGD_k          ≈ MHA
MQA      H         1          2ND_k           -0.5 to -1.5%
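
These cache sizes are simple arithmetic. A small calculator, using LLaMA-2-70B-like shapes (80 layers, 64 query heads, head dimension 128) as illustrative assumptions:

```python
def kv_cache_bytes(n_tokens, n_layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V: 2 tensors of shape [n_tokens, kv_heads, head_dim] per layer.
    return 2 * n_tokens * kv_heads * head_dim * n_layers * bytes_per_elem

# Per-sequence cache at 4096 tokens, FP16 (2 bytes per element).
mha = kv_cache_bytes(4096, 80, kv_heads=64, head_dim=128)  # full MHA
gqa = kv_cache_bytes(4096, 80, kv_heads=8,  head_dim=128)  # GQA, G=8
mqa = kv_cache_bytes(4096, 80, kv_heads=1,  head_dim=128)  # single KV head
print(mha / 2**30, gqa / 2**30)  # → 10.0 1.25 (GiB per sequence)
```

The cache scales linearly with batch size and context length, which is why the KV-heads column of the table above dominates serving economics.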

3.2 KV-Cache Arithmetic Intensity

Inference has two phases: prefill processes the prompt, and decode generates tokens one-by-one. Their computational profiles differ drastically. Prefill is compute-bound due to large batch matrix multiplies. Decode is memory-bound because it loads the entire KV-cache to compute a single vector.
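
The imbalance can be quantified with back-of-envelope arithmetic intensity (FLOPs per byte of KV-cache read) for a single decode step. The shapes below are illustrative assumptions, not a specific model:

```python
def decode_arithmetic_intensity(n_ctx, kv_heads, q_heads, head_dim,
                                bytes_per_elem=2):
    # One decode step, one layer: each query head attends over n_ctx cached
    # positions. FLOPs ≈ 2·q_heads·n_ctx·head_dim for Q·Kᵀ plus the same
    # again for the weighted sum over V.
    flops = 4 * q_heads * head_dim * n_ctx
    # Bytes moved: the entire KV-cache for this layer is read once.
    bytes_read = 2 * n_ctx * kv_heads * head_dim * bytes_per_elem
    return flops / bytes_read

print(decode_arithmetic_intensity(8192, 64, 64, 128))  # MHA: 1.0 FLOP/byte
print(decode_arithmetic_intensity(8192, 8, 64, 128))   # GQA G=8: 8.0
```

At ~1 FLOP per byte, decode sits far below the hundreds of FLOPs/byte modern accelerators need to stay busy, which is exactly why GQA's H/G reduction in bytes read translates directly into decode throughput.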

4. Sparsity & Long Context

Full attention becomes prohibitive beyond 8k tokens. Sparse patterns reduce complexity while preserving long-range dependencies. The key insight is that most attention weights are near-zero; we only need to compute entries likely to be large.

4.1 Sliding Window + Full Attention

Mistral 7B [12] and Command A [13] use alternating sparsity: most layers use sliding window attention with window size w=1024 to 4096, while every k-th layer uses full attention. This gives O(Nw) complexity per sliding-window layer, and the receptive field grows by w per layer, so information still propagates across the full context through depth.
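
A sliding-window mask is a few lines of NumPy; this sketch combines the causal constraint with a window of size w:

```python
import numpy as np

def sliding_window_mask(n, w):
    # True where query i may attend to key j: causal (j <= i)
    # and within the window (i - j < w).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < w)

mask = sliding_window_mask(6, 3)
# Row 5 attends only to positions 3, 4, 5.
```

Stacking L such layers lets position i indirectly see back roughly L·w positions, which is the receptive-field growth noted above.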

5. Mixture of Experts

MoE replaces the dense FFN with E expert networks, routing each token to K experts. This increases parameter count without increasing FLOPs: a 1.6T parameter MoE with K=2 active experts costs similar FLOPs to a 100B dense model.

The routing function is typically top_k(G·x) where G is a learned gating matrix. Load balancing loss prevents expert collapse. Recent models like Mixtral 8x7B use E=8, K=2.

y = Σ_{i∈TopK(Gx)} softmax(Gx)_i · Expert_i(x)
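
The routing equation above can be sketched in NumPy. Here `gate_w` and the `experts` list are placeholders for learned parameters, and the gate weights are renormalized over the selected experts only (as Mixtral does); this is a sketch, not any model's actual implementation:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_w, experts, k=2):
    # x: [tokens, d]; gate_w: [d, n_experts]; experts: list of callables.
    logits = x @ gate_w                          # router scores G·x
    topk = np.argsort(logits, axis=-1)[:, -k:]   # k highest-scoring experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        # Gate weights renormalized over the selected experts only.
        w = softmax(logits[t, sel])
        for weight, e_idx in zip(w, sel):
            out[t] += weight * experts[e_idx](x[t])
    return out
```

The load-balancing loss mentioned above would be an additional term computed from the statistics of `logits`; it is omitted here.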

6. Normalization & Residuals

RMSNorm [6] replaces LayerNorm by removing mean-centering: RMSNorm(x) = (x / RMS(x)) · γ, where RMS(x) = √((1/d) Σᵢ xᵢ²). It saves ~7% compute and is now standard in LLaMA, PaLM, and others.

SwiGLU replaces the ReLU FFN. Instead of max(0, xW₁)W₂, it computes (Swish(xW₁) ⊙ xW₂)W₃. This increases parameters 1.5x for the same hidden size but improves quality per parameter [7].
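
The gated FFN fits in a few lines of NumPy; this sketch takes the three weight matrices as plain arguments (names W₁, W₂, W₃ follow the formula above, and biases are omitted as in most modern LLMs):

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w1, w2, w3):
    # (Swish(x W1) ⊙ x W2) W3: a gated FFN with three projections
    # instead of the two in the ReLU FFN max(0, x W1) W2.
    return (swish(x @ w1) * (x @ w2)) @ w3
```

Because of the third matrix, implementations that want to match the parameter count of a ReLU FFN with hidden size 4D typically shrink the SwiGLU hidden size to about 8D/3, as LLaMA does.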

7. Losses & Stability

Large logits in the output softmax cause numerical overflow and gradient spikes. Two techniques stabilize training: z-loss and query-key normalization.

7.1 Z-Loss Intuition

Z-loss adds a penalty on the log-partition function: L_z = α · (log Z)² where Z = Σ exp(logits_i). This encourages logits to stay small without changing the softmax output. PaLM uses α=10⁻⁴ [2, 15].
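
A minimal NumPy sketch of the penalty, using the standard max-subtraction trick so that log Z itself is computed without overflow:

```python
import numpy as np

def z_loss(logits, alpha=1e-4):
    # Penalize (log Z)^2 per row, where Z = sum_i exp(logits_i).
    # Subtracting the row max keeps the exponentials bounded.
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return alpha * (log_z ** 2).mean()
```

When the logits already form a normalized distribution in log space (Z = 1), the penalty is zero; it only pushes back when the partition function drifts away from 1.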

QK-Normalization applies LayerNorm to queries and keys before the dot product: Attn = softmax(LN(Q)LN(K)ᵀ/√d). This bounds attention scores and prevents entropy collapse at scale.
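
A minimal sketch of the bounding effect, using an RMS-style normalization as a stand-in for the LayerNorm (mean-centering and the learned scale are omitted): after normalization each row has L2 norm ≈ √d, so by Cauchy-Schwarz every pre-softmax score is at most √d.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    # Stand-in for LayerNorm on the head dimension.
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def qk_norm_scores(q, k, d):
    # Each normalized row has L2 norm ≈ √d, so |q·k| ≤ d and the
    # scaled scores QKᵀ/√d are bounded by √d, however large the
    # raw projections grow during training.
    return (rms_normalize(q) @ rms_normalize(k).T) / np.sqrt(d)
```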

8. Positional Encodings

RoPE (Rotary Position Embedding) [8] encodes position by rotating query and key vectors by an angle proportional to their position. For 2D subspace [x_i, x_{i+1}], rotation by mθ_i where θ_i = base^{-2i/d}:

[x'_i, x'_{i+1}] = [x_i cos(mθ_i) - x_{i+1} sin(mθ_i), x_i sin(mθ_i) + x_{i+1} cos(mθ_i)]
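
The rotation above, applied to every even/odd pair of a vector, is a short NumPy function. This sketch uses the interleaved-pair layout from the formula; real implementations often use a half-split layout instead:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each pair (x_{2i}, x_{2i+1}) by angle pos·θ_i,
    # with θ_i = base^(-2i/d). x: [d], d even.
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * c - x2 * s
    out[1::2] = x1 * s + x2 * c
    return out
```

The property that makes this a relative encoding: the dot product rope(q, m) · rope(k, n) depends only on m − n, since each 2D pair contributes a term that is a function of the angle difference (m − n)θ_i.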

RoPE enables length extrapolation: a model trained on 2k tokens can run on 8k without retraining by scaling base or using NTK-aware interpolation [14]. ALiBi uses additive bias instead [9].

9. Systems View: Training and Inference

Architecture and systems are co-designed. FlashAttention [11] computes attention in SRAM tiles with an online softmax, never materializing the full N×N score matrix; HBM accesses drop from Θ(Nd + N²) to Θ(N²d²/M), where M is the SRAM size. This makes attention compute-bound instead of memory-bound.
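
The core trick is the online softmax recurrence: stream over K/V tiles while keeping a running max, a running normalizer, and an unnormalized accumulator. A single-query NumPy sketch (real kernels tile over queries too and fuse the backward pass; this is an illustration, not the FlashAttention kernel):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=128):
    # Running max m, normalizer l, and accumulator acc let us process
    # K/V one tile at a time without ever holding all N scores at once.
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q / np.sqrt(d)  # scores for this tile
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                   # 0.0 on the first tile
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + tile]
        m = m_new
    return acc / l
```

Rescaling the old accumulator by exp(m − m_new) before adding the new tile is what makes the streaming result exactly equal to the full softmax-weighted sum.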

Activation checkpointing saves memory by recomputing activations during backward. Combined with FSDP or 3D parallelism, 70B models train on 2k GPUs with high MFU.

For inference, paged attention manages the KV-cache in non-contiguous blocks, enabling larger batch sizes and longer contexts without fragmentation. GQA reduces this cache by a factor of H/G, as shown in Figure 3.

10. Modern Recipes: Putting It Together

A representative 70B model combines these innovations:

Component   Choice                      Rationale
Backbone    Decoder-only, pre-norm      Stability at 80 layers
Attention   GQA, H=64, G=8              8x KV-cache reduction
FFN         SwiGLU, 1.5x expansion      Quality per FLOP
Norm        RMSNorm                     ~7% faster than LayerNorm
Position    RoPE, base=10,000           Length extrapolation
Sparsity    Alternating SW=4k + full    128k context at 4x cost
Stability   Z-loss, QK-norm             Prevent logit explosion

This configuration trains stably at 8192 token sequences, runs inference at 32k with sliding window, and achieves >50% MFU on H100s with FlashAttention-2 and FSDP.

11. References

  1. Vaswani, A., et al. "Attention is all you need." NeurIPS 2017.
  2. Chowdhery, A., et al. "PaLM: Scaling language modeling with pathways." arXiv:2204.02311, 2022.
  3. Touvron, H., et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv:2307.09288, 2023.
  4. Shazeer, N. "Fast transformer decoding: One write-head is all you need." arXiv:1911.02150, 2019.
  5. Ainslie, J., et al. "GQA: Training generalized multi-query transformer models from multi-head checkpoints." arXiv:2305.13245, 2023.
  6. Zhang, B. & Sennrich, R. "Root mean square layer normalization." NeurIPS 2019.
  7. Shazeer, N. "GLU variants improve transformer." arXiv:2002.05202, 2020.
  8. Su, J., et al. "RoFormer: Enhanced transformer with rotary position embedding." arXiv:2104.09864, 2021.
  9. Press, O., et al. "Train short, test long: Attention with linear biases enables input length extrapolation." ICLR 2022.
  10. Kaplan, J., et al. "Scaling laws for neural language models." arXiv:2001.08361, 2020.
  11. Dao, T., et al. "FlashAttention: Fast and memory-efficient exact attention with IO-awareness." NeurIPS 2022.
  12. Jiang, A. Q., et al. "Mistral 7B." arXiv:2310.06825, 2023.
  13. Cohere. "Command A technical report." 2024.
  14. Sun, Y., et al. "A length-extrapolatable transformer." ACL 2023.
  15. Develin, M. "Softmax temperature and z-loss." 2014. [Technique used in PaLM]

LLM Architecture Survey — Living Technical Document. Updated April 2026. This document synthesizes public technical reports and papers. All diagrams are original SVG illustrations for educational purposes.