Blog post: Pretraining Scaling in Large Language Models
Pretraining Scaling: The Engine Behind Modern AI
How bigger models, larger datasets, and vast compute budgets kept producing smarter systems, why that was economically rational, and where the classical scaling story now starts to bend.
That is the historical hinge. Frontier labs did not spend billions because bigger models sometimes worked. They spent billions because, across many orders of magnitude, the returns were smooth enough to extrapolate from small runs to giant ones.
Every time a new generation of AI model arrives and astonishes the world, the same question surfaces: how? The honest answer, more often than not, is scale. Pretraining scaling is the most consequential empirical idea in modern AI, and it remains one of the least intuitively understood.
This post unpacks what pretraining scaling means, where the idea came from, the mathematics that govern it, why transformers proved especially scalable, and why the economics of scaling mattered almost as much as the science.
What is pretraining?
Before we can talk about scaling, we need to understand what is being scaled. Pretraining is the first, longest, and usually most expensive phase of training a large language model.
During pretraining, a model is exposed to an enormous corpus of text: web pages, books, code repositories, reference material, scientific papers, and increasingly multimodal data aligned into a token stream. The task is deceptively simple: predict the next token.
Given a sequence of prior tokens, what token comes next? Training nudges the model's parameters by gradient descent to reduce its prediction error over batches of text, and in the process the model gradually learns grammar, facts, latent concepts, code structure, translation patterns, and the statistical regularities that underlie reasoning-like behavior.
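To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. The four-token vocabulary, the logit values, and the function names are all made up for illustration; a real model produces logits over tens of thousands of tokens and computes this on accelerators.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy (in nats) assigned to the true next token."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical 4-token vocabulary, two prediction positions.
targets = [0, 2]
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]

loss_uniform = next_token_loss(uniform_logits, targets)      # a pure guess: ln(4)
loss_confident = next_token_loss(confident_logits, targets)  # lower, since mass sits on the right token

print(loss_uniform, loss_confident)
```

This averaged cross-entropy is the "loss" that the scaling curves later in this post track.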
"The pretraining objective is humble. The emergent capabilities it produces are anything but."
After pretraining, models are typically fine-tuned with supervised instruction data, preference optimization, safety methods, or reinforcement learning. Those later stages matter enormously for usability, but the base capabilities are largely set by pretraining.
The three classical axes of scale
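Pretraining scaling originally meant systematically increasing three tightly coupled quantities: the number of model parameters (N), the number of training tokens (D), and the training compute budget (C, measured in FLOPs).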
These axes are not independent. For a decoder-only transformer, a useful first approximation is C ≈ 6ND. That means scaling is always an allocation problem. If compute is fixed, should you spend it on a larger model, more data, or some balance of both?
The 6ND rule is pedagogically useful because it captures the dominant dense-matrix-multiplication cost of training. In practice, attention cost, sequence length, activation recomputation, optimizer overhead, communication, and systems choices all matter too.
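As a rough illustration of the allocation problem, the sketch below applies C ≈ 6ND directly. The function names and the example run (1B parameters, 20B tokens) are hypothetical, and the estimate ignores every overhead listed above.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute via C ≈ 6ND (dense matmul cost only)."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(compute_flops, n_params):
    """Tokens affordable for a given model size at a fixed compute budget."""
    return compute_flops / (6.0 * n_params)

# Hypothetical run: 1B parameters trained on 20B tokens.
c = training_flops(1e9, 20e9)
print(f"{c:.2e} FLOPs")  # 1.20e+20 FLOPs

# At the same budget, doubling the model size halves the affordable tokens.
d_small = tokens_for_budget(c, 1e9)
d_large = tokens_for_budget(c, 2e9)
print(d_small / d_large)
```

The second half is the trade-off in miniature: at fixed C, every extra parameter is paid for in tokens.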
Why transformers unlocked scaling
A missing but important bridge in many scaling explainers is that scaling laws are not architecture-agnostic magic. Transformers proved unusually compatible with industrial scale.
Self-attention is also highly parallelizable during training, unlike earlier recurrent architectures that processed tokens more sequentially. Tokenization helped too: once language, code, images, and audio could all be converted into a common token stream, web-scale corpora became tractable. The result was an architecture that was not just expressive, but operationally scalable.
That is why the modern scaling story is really a story about transformer-era systems engineering as much as model theory.
The scaling laws: from Kaplan to Chinchilla
The 2020 Kaplan et al. result
In 2020, Jared Kaplan and colleagues at OpenAI published one of the defining papers of the field. Their central finding was startling in its regularity: language-model loss appeared to follow smooth power laws with respect to parameters, data, and compute.
$L(N) = (N_c / N)^{\alpha_N}$: loss vs. parameters, with ample data
$L(D) = (D_c / D)^{\alpha_D}$: loss vs. tokens, holding model size fixed
$L(C) = (C_c / C)^{\alpha_C}$: loss vs. compute budget
What made this so consequential was not merely that larger models did better. Researchers observed an almost eerie smoothness over multiple orders of magnitude, from tiny transformers with fewer than a thousand non-embedding parameters up through billion-parameter systems. The curves looked extrapolatable.
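"Extrapolatable" has a simple operational meaning: a power law is a straight line on a log-log plot, so a handful of small runs can be fit by linear regression and projected forward. The sketch below shows the idea on noise-free synthetic data. The exponent, the constant, and the run sizes are invented for illustration, not values from the paper, and real measurements would be noisy.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares fit of log(loss) = alpha*log(x_c) - alpha*log(x).

    Returns (alpha, x_c) for the form loss = (x_c / x) ** alpha.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope                        # the slope on the log-log plot is -alpha
    x_c = math.exp(intercept / alpha)     # the intercept is alpha * log(x_c)
    return alpha, x_c

# Synthetic "small runs" generated from an assumed law (illustrative values only).
TRUE_ALPHA, TRUE_XC = 0.08, 1e13
small_runs = [1e6, 1e7, 1e8, 1e9]
observed = [(TRUE_XC / x) ** TRUE_ALPHA for x in small_runs]

alpha, x_c = fit_power_law(small_runs, observed)

# Project the fitted curve two orders of magnitude past the largest run.
predicted = (x_c / 1e11) ** alpha
print(alpha, predicted)
```

The fit only ever sees runs up to 1e9, yet the projected loss at 1e11 matches the generating law. That is the property that made small experiments useful for planning large ones.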
The Chinchilla correction
The original Kaplan-era reading encouraged labs to emphasize parameter count aggressively, even if that meant relatively fewer tokens per parameter. That was a reasonable interpretation at the time, and it shaped the design of large models such as GPT-3.
Then came DeepMind's 2022 Chinchilla paper. Hoffmann and colleagues re-examined the tradeoff and concluded that many large models were simply undertrained.
For compute-optimal training, model size and training tokens should grow together much more evenly than the field had assumed.
The practical rule of thumb was unforgettable: for a model with N parameters, compute-optimal training uses roughly 20N tokens. Not exact. Not universal. But directionally transformative.
$N_{\text{optimal}} \propto C^{0.5}$: optimal model size grows with the square root of compute
$D_{\text{optimal}} \propto C^{0.5}$: optimal token count does too
A concrete worked example
Suppose a lab has a training budget of 10^23 FLOPs, squarely in frontier territory. Using the rough transformer cost model and the Chinchilla token ratio:
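$D \approx 20N$
$C = 6 \times N \times 20N = 120N^2$
$N^2 = C / 120 = 10^{23} / 120 \approx 8.3 \times 10^{20}$
$N \approx 2.9 \times 10^{10}$, or about 29B parameters
$D \approx 20N \approx 5.8 \times 10^{11}$, or about 580B tokens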
That is the key intuition. With a fixed compute budget, a roughly 29B model trained on roughly 580B tokens can beat a much larger but relatively undertrained model. Bigger is not automatically better. Better-balanced is often better.
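The same arithmetic can be wrapped in a small helper. This is a sketch that assumes only the two rules of thumb used above, C ≈ 6ND and D ≈ 20N; the function name and the `tokens_per_param` parameter are illustrative, not from any library.

```python
import math

def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * (k * N) for N, then set D = k * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(1e23)
print(f"{n:.1e} parameters, {d:.1e} tokens")

# Square-root scaling: 100x the compute buys 10x the parameters and 10x the tokens.
n_big, d_big = compute_optimal(1e25)
print(n_big / n, d_big / d)
```

Changing `tokens_per_param` shows how sensitive the split is to the ratio you believe in, which is one reason the 20N figure should be read as a rule of thumb rather than a constant.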
Many production models are intentionally trained beyond pure compute-optimality. A training run costs far more than serving any single request, but at internet scale the cumulative cost of serving tokens can come to dominate design decisions. A smaller overtrained model can be more profitable than a larger perfectly compute-optimal one.
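A back-of-the-envelope sketch makes the trade-off visible. It rests on two assumptions that are not established in this post: inference costs roughly 2N FLOPs per generated token, and the smaller, longer-trained model reaches quality comparable to the larger one. The 15B configuration is entirely hypothetical.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (about 6ND) plus inference (assumed about 2N per served token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

def break_even_served_tokens(big, small):
    """Served-token volume above which the smaller model is cheaper overall."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    savings_per_served_token = 2.0 * (n_big - n_small)
    return extra_training / savings_per_served_token

big = (29e9, 580e9)      # the compute-optimal configuration from the worked example
small = (15e9, 2000e9)   # hypothetical overtrained model, ASSUMED to match its quality

break_even = break_even_served_tokens(big, small)
print(f"break-even at about {break_even:.1e} served tokens")

# Past the break-even point, the smaller model has the lower lifetime cost.
heavy_traffic = 10 * break_even
print(lifetime_flops(*small, heavy_traffic) < lifetime_flops(*big, heavy_traffic))
```

With these made-up numbers the crossover lands at a few trillion served tokens. The exact figure matters less than the shape of the argument: extra training is a one-time cost, while inference savings accrue on every token served.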
Why scaling was economically rational
This is the part that deserves more attention than it usually gets. Scaling was not only scientifically interesting. It was investable.
If returns had been chaotic, stepwise, or wildly irreproducible, no serious organization would have built the hardware, datacenter, and networking stack required for frontier training. But the power laws were smooth enough that small experiments could forecast large runs. That changed the economics of belief.
The deepest historical fact is not just that bigger models improved. It is that the improvement curve was smooth enough to industrialize.
Historical milestones: scaling in action
Why does scaling work?
Here the story becomes less settled. We have robust empirical evidence that scaling works. We do not yet have a fully satisfying theory for why the behavior is so smooth and so durable.
Capacity and representation
Larger models can encode more structure with less destructive compression. They can devote capacity to rare patterns, long-range dependencies, and subtle distinctions that small models are forced to blur together.
Optimization at scale
Bigger models are not just larger lookup tables. In practice they often optimize into smoother, more reusable internal representations. More compute means more parameter updates; more data means a better estimate of the world-distribution; both help shape the model into a richer predictive engine.
Compression and intelligence
Next-token prediction is also a compression problem. A model that predicts text well has discovered latent regularities in language, knowledge, and behavior. In that sense, lower loss is evidence of a more informative world model. That is one reason scaling laws feel deeper than a benchmark trick.
Emergence, with appropriate caution
Some capabilities appear to arrive abruptly rather than smoothly. But this remains an active debate. Some researchers argue that apparent emergence is partly an artifact of thresholded or nonlinear evaluation metrics; others believe there are genuine capability phase transitions in the underlying models. The careful statement is not that emergence is settled, but that the phenomenon is important enough to remain contested.
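The metric-artifact side of that debate is easy to illustrate with a toy calculation. Suppose per-token accuracy improves smoothly with scale, but the benchmark only awards credit when an entire multi-token answer is right. The accuracy values, the 10-token answer length, and the independence of errors across tokens are all simplifying assumptions made for this sketch.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical smooth improvement in per-token accuracy across model scales.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
exact = [exact_match_rate(p, 10) for p in per_token]

for p, e in zip(per_token, exact):
    print(f"per-token {p:.2f} -> exact-match {e:.3f}")
```

The underlying quantity climbs steadily, while the all-or-nothing score sits near zero for most of the range and then rises steeply. This does not show that every emergent ability is an artifact; it only shows how a thresholded metric can manufacture an apparent jump.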
Modern scaling is no longer only about N, D, and C
The classical picture is still foundational, but frontier systems now scale along additional dimensions that interact with pretraining.
Current limits and open questions
The bottom line
Pretraining scaling is not a hack or a one-off curiosity. It is a profound empirical regularity: loss improves predictably as parameters, data, and compute increase, at least over the ranges that transformed modern AI. Kaplan gave the field its first clean formalization. Chinchilla corrected its intuition about how to spend compute. The frontier since then has been about folding economics, systems design, and new scaling dimensions into the same story.
What makes the phenomenon remarkable is not only that it works, but that it works predictably enough to plan around. Labs fit small curves, extrapolate, and spend accordingly. The fact that those extrapolations are reliable enough to bet billions of dollars on is itself one of the strangest and most consequential facts about the present technological era.
Further reading: Kaplan et al. (2020), "Scaling Laws for Neural Language Models"; Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models"; Wei et al. (2022), "Emergent Abilities of Large Language Models".