Blog post: Pretraining Scaling in Large Language Models

Deep Dive · AI Research

Pretraining Scaling: The Engine Behind Modern AI

How bigger models, larger datasets, and vast compute budgets kept producing smarter systems, why that was economically rational, and where the classical scaling story now starts to bend.

May 2026 ~17 min read Foundations · LLM Research
The Core Surprise
Progress was not merely large. It was predictable.

That is the historical hinge. Frontier labs did not spend billions because bigger models sometimes worked. They spent billions because, across many orders of magnitude, the returns were smooth enough to extrapolate from small runs to giant ones.

6ND
Back-of-the-envelope transformer training cost: roughly C ≈ 6ND, ignoring lower-order attention, optimizer, and systems overheads.
20N
Chinchilla-era rule of thumb: compute-optimal training uses about twenty tokens per parameter.
Not equal
Training-optimal models and deployment-optimal models are often different objects once inference economics enter the picture.

Every time a new generation of AI model arrives and astonishes the world, the same question surfaces: how? The honest answer, more often than not, is scale. Pretraining scaling is the most consequential empirical idea in modern AI, and it remains one of the least intuitively understood.

This post unpacks what pretraining scaling means, where the idea came from, the mathematics that govern it, why transformers proved especially scalable, and why the economics of scaling mattered almost as much as the science.


What is pretraining?

Before we can talk about scaling, we need to understand what is being scaled. Pretraining is the first, longest, and usually most expensive phase of training a large language model.

During pretraining, a model is exposed to an enormous corpus of text: web pages, books, code repositories, reference material, scientific papers, and increasingly multimodal data aligned into a token stream. The task is deceptively simple: predict the next token.

Given a sequence of prior tokens, what token comes next? The model adjusts its parameters after each prediction, gradually learning grammar, facts, latent concepts, code structure, translation patterns, and the statistical regularities that underlie reasoning-like behavior.
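
To make the objective concrete, here is a minimal sketch of the loss for a single next-token prediction. The four-word vocabulary and the scores are made up for illustration; a real model produces scores over tens of thousands of tokens, and training averages this quantity over an enormous number of positions.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy in nats: negative log-probability of the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and hypothetical model scores for one context.
vocab = ["mat", "dog", "moon", "the"]
logits = [3.0, 1.0, 0.5, -1.0]

loss_if_mat = next_token_loss(logits, vocab.index("mat"))
loss_if_moon = next_token_loss(logits, vocab.index("moon"))
print(f"loss when the true token is 'mat':  {loss_if_mat:.3f}")
print(f"loss when the true token is 'moon': {loss_if_moon:.3f}")
```

The loss is low when the model put most of its probability on the token that actually came next and high when it did not. Pretraining is the process of nudging parameters so that this number falls on average.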

"The pretraining objective is humble. The emergent capabilities it produces are anything but."

After pretraining, models are typically fine-tuned with supervised instruction data, preference optimization, safety methods, or reinforcement learning. Those later stages matter enormously for usability, but the base capabilities are largely set by pretraining.


The three classical axes of scale

Pretraining scaling originally meant systematically increasing three tightly coupled quantities.

Axis 1
N
Model parameters: the size and representational capacity of the neural network.
Axis 2
D
Dataset tokens: how much text or multimodal tokenized data the model sees during training.
Axis 3
C
Compute budget: the total training work, usually expressed in floating-point operations.

These axes are not independent. For a decoder-only transformer, a useful first approximation is C ≈ 6ND. That means scaling is always an allocation problem. If compute is fixed, should you spend it on a larger model, more data, or some balance of both?

Approximation, Not Identity

The 6ND rule is pedagogically useful because it captures the dominant dense-matrix-multiplication cost of training. In practice, attention cost, sequence length, activation recomputation, optimizer overhead, communication, and systems choices all matter too.
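
As a rough illustration of the allocation problem, the sketch below applies C ≈ 6ND to the Chinchilla figures quoted in this post's timeline (70B parameters, 1.4T tokens) and then shows how N and D trade off at a fixed budget. The function name is an assumption made for this example, and the result inherits every caveat of the approximation.

```python
def training_flops(n_params, n_tokens):
    """Approximate dense-transformer training cost: C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

# Chinchilla's headline configuration as quoted in this post.
chinchilla_flops = training_flops(70e9, 1.4e12)
print(f"Chinchilla-scale estimate: {chinchilla_flops:.2e} FLOPs")

# With compute fixed, doubling N means halving D to stay on budget.
budget = chinchilla_flops
for n_params in (35e9, 70e9, 140e9):
    n_tokens = budget / (6.0 * n_params)
    print(f"N = {n_params:.2e} params -> D = {n_tokens:.2e} tokens")
```

Every row in the loop costs the same under this model. Which row reaches the lowest loss is exactly the question the scaling laws try to answer.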


Why transformers unlocked scaling

Many scaling explainers skip an important bridge: scaling laws are not architecture-agnostic magic. Transformers proved unusually compatible with industrial scale.

Hardware Fit
Dense matrix multiplies map extremely well onto GPUs and specialized accelerators. That gave researchers an architecture whose most expensive operations aligned with the hardware trend line.
Optimization Stability
Residual connections, layer normalization, Adam-style optimizers, and later refinements made it practical to train deep models without optimization blowing up.

Self-attention is also highly parallelizable during training, unlike earlier recurrent architectures that processed tokens more sequentially. Tokenization helped too: once language, code, images, and audio could all be converted into a common token stream, web-scale corpora became tractable. The result was an architecture that was not just expressive, but operationally scalable.

That is why the modern scaling story is really a story about transformer-era systems engineering as much as model theory.


The scaling laws: from Kaplan to Chinchilla

The 2020 Kaplan et al. result

In 2020, Jared Kaplan and colleagues at OpenAI published one of the defining papers of the field. Their central finding was startling in its regularity: language-model loss appeared to follow smooth power laws with respect to parameters, data, and compute.

L(N) = (N_c / N)^(α_N)    loss vs. parameters, holding data fixed
L(D) = (D_c / D)^(α_D)    loss vs. tokens, holding model size fixed
L(C) = (C_c / C)^(α_C)    loss vs. compute budget

What made this so consequential was not merely that larger models did better. Researchers observed an almost eerie smoothness over multiple orders of magnitude, from tiny transformers with fewer than a thousand non-embedding parameters up through billion-parameter systems. The curves looked extrapolatable.
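
One way to get a feel for what a power law implies is to evaluate it. The sketch below assumes a compute exponent of about 0.05, the order of magnitude cited in scaling-law discussions, and a scale constant C_C invented purely so the numbers are readable. Neither value is a fitted result.

```python
def power_law_loss(x, x_c, alpha):
    """Kaplan-style form L(x) = (x_c / x) ** alpha, for x standing in for N, D, or C."""
    return (x_c / x) ** alpha

def loss_ratio_per_decade(alpha):
    """Factor by which loss shrinks when x grows 10x; it does not depend on x_c."""
    return 10.0 ** (-alpha)

ALPHA_C = 0.05  # illustrative order-of-magnitude exponent, not a fitted value
C_C = 1.0e30    # hypothetical scale constant chosen for readability

for exponent in range(20, 25):
    compute = 10.0 ** exponent
    loss = power_law_loss(compute, C_C, ALPHA_C)
    print(f"C = 1e{exponent} FLOPs -> loss = {loss:.3f}")

shrink = loss_ratio_per_decade(ALPHA_C)
print(f"each 10x of compute multiplies loss by {shrink:.3f}")
```

The point to notice is that the improvement per decade is a constant ratio. On a log-log plot that is a straight line, which is what made the curves look extrapolatable.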

Schematic Frontier Curves
Illustrative scatter plus fit lines: smooth trends, different training regimes, and a visible shift from Kaplan-style undertraining toward Chinchilla-era compute optimality.
Figure axes: log10(training compute in FLOPs) on the horizontal axis, validation loss on the vertical axis. Series shown: frontier fit, Kaplan-era undertrained runs, Chinchilla-style compute-optimal runs, and overtrained deployment-efficient models.

The Chinchilla correction

The original Kaplan-era reading encouraged labs to emphasize parameter count aggressively, even if that meant relatively fewer tokens per parameter. That was a reasonable interpretation at the time, and it shaped the design of large models such as GPT-3.

Then came DeepMind's 2022 Chinchilla paper. Hoffmann and colleagues re-examined the tradeoff and concluded that many large models were simply undertrained.

For compute-optimal training, model size and training tokens should grow together much more evenly than the field had assumed.

The practical rule of thumb was unforgettable: for a model with N parameters, compute-optimal training uses roughly 20N tokens. Not exact. Not universal. But directionally transformative.

D_optimal ≈ 20N    rough Chinchilla token rule
N_optimal ∝ C^0.5    optimal model size grows with the square root of compute
D_optimal ∝ C^0.5    optimal token count does too

A concrete worked example

Suppose a lab has a training budget of 10^23 FLOPs, squarely in frontier territory. Using the rough transformer cost model and the Chinchilla token ratio:

C = 6 × N × D
D ≈ 20N

C = 6 × N × 20N = 120N^2

N^2 = C / 120 = 10^23 / 120 ≈ 8.3 × 10^20

N ≈ 2.9 × 10^10  =  29B parameters
D ≈ 20N  =  580B tokens

That is the key intuition. With a fixed compute budget, a roughly 29B model trained on roughly 580B tokens can beat a much larger but relatively undertrained model. Bigger is not automatically better. Better-balanced is often better.
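
The same arithmetic can be wrapped in a small helper so other budgets are easy to try. This is a sketch under the two assumptions used above (C ≈ 6ND and D ≈ 20N); the function and constant names are made up for this example.

```python
import math

TOKENS_PER_PARAM = 20.0        # rough Chinchilla heuristic
FLOPS_PER_PARAM_TOKEN = 6.0    # from C ≈ 6ND

def compute_optimal(budget_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = r * N, giving N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n_params, n_tokens = compute_optimal(budget)
    print(f"C = {budget:.0e} FLOPs -> N = {n_params:.2e} params, D = {n_tokens:.2e} tokens")
```

Because N depends on the square root of C, a 100x larger budget buys only a 10x larger model, with the other 10x going to data.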

Deployment Reality

Many production models are intentionally trained beyond pure compute-optimality. A training run costs far more than any single inference pass, but at internet scale the cumulative cost of serving tokens often dominates design decisions. A smaller overtrained model can be more profitable than a larger perfectly compute-optimal one.
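
A back-of-the-envelope sketch shows why serving volume matters. It assumes a forward pass costs about 2 FLOPs per parameter per token, the usual companion to the 6ND training estimate; that figure is not stated in this post. The smaller model and the served-token volume are hypothetical, and the comparison covers compute only, not quality.

```python
def training_flops(n_params, n_tokens):
    """Training cost under C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def inference_flops(n_params, served_tokens):
    """Assumed forward-pass cost: about 2 FLOPs per parameter per served token."""
    return 2.0 * n_params * served_tokens

def break_even_served_tokens(n_tokens_trained):
    """Served tokens at which inference compute equals training compute: 6ND = 2NT, so T = 3D."""
    return 3.0 * n_tokens_trained

def lifetime_flops(n_params, n_tokens, served_tokens):
    return training_flops(n_params, n_tokens) + inference_flops(n_params, served_tokens)

SERVED = 1e13  # hypothetical lifetime serving volume: 10T tokens

optimal = lifetime_flops(29e9, 580e9, SERVED)      # the worked example above
overtrained = lifetime_flops(10e9, 2e12, SERVED)   # hypothetical smaller model, more tokens

print(f"break-even for the 29B model: {break_even_served_tokens(580e9):.2e} served tokens")
print(f"29B / 580B tokens lifetime compute: {optimal:.2e} FLOPs")
print(f"10B / 2T tokens lifetime compute:   {overtrained:.2e} FLOPs")
```

Under these assumptions the smaller model costs more to train but less over its lifetime. Whether it is the better product depends on whether overtraining closes the quality gap, which this sketch does not model.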


Why scaling was economically rational

This is the part that deserves more attention than it usually gets. Scaling was not only scientifically interesting. It was investable.

If returns had been chaotic, stepwise, or wildly irreproducible, no serious organization would have built the hardware, datacenter, and networking stack required for frontier training. But the power laws were smooth enough that small experiments could forecast large runs. That changed the economics of belief.

Predictable returns
Each extra order of magnitude in compute looked like it bought a fairly consistent reduction in loss.
Extrapolation discipline
Teams could fit curves on smaller runs, estimate payoff, then justify larger capex with something better than hope.
Infrastructure co-design
When returns are forecastable, cluster design, networking, and software optimization become strategic levers rather than speculative bets.
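
The extrapolation step can be sketched in a few lines: fit a straight line in log-log space to small runs, then read off a forecast for a much larger one. The "small runs" below are synthetic, generated from a known power law with a little deterministic wiggle, so this shows the mechanics only and is not a reproduction of any lab's procedure.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line in log-log space: log y = intercept + slope * log x."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Evaluate the fitted power law at a new x."""
    return math.exp(intercept + slope * math.log(x))

# Synthetic small runs: true law is (1e30 / C) ** 0.05, perturbed by up to 1%.
small_compute = [1e17, 1e18, 1e19, 1e20]
wiggle = [1.01, 0.99, 1.005, 0.995]
small_loss = [(1e30 / c) ** 0.05 * w for c, w in zip(small_compute, wiggle)]

slope, intercept = fit_power_law(small_compute, small_loss)
forecast = predict(1e23, slope, intercept)
true_value = (1e30 / 1e23) ** 0.05

print(f"fitted exponent: {-slope:.4f}")
print(f"forecast loss at 1e23 FLOPs: {forecast:.3f} (generating law gives {true_value:.3f})")
```

Even with noisy inputs, a fit on runs three orders of magnitude smaller lands close to the generating law. Real curves are messier, but this is the shape of the argument that justified the capex.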

The deepest historical fact is not just that bigger models improved. It is that the improvement curve was smooth enough to industrialize.


Historical milestones: scaling in action

2018
GPT-1 · 117M parameters
OpenAI showed that unsupervised pretraining could create broadly useful language representations. The scaling hypothesis existed, but it was still a hypothesis.
2019
GPT-2 · 1.5B parameters
A roughly 13x jump in parameter count produced writing quality surprising enough to trigger staged release debates. The model-family trend line looked real.
2020
GPT-3 · 175B parameters
Another immense jump. In-context learning became impossible to ignore, and the Kaplan paper gave practitioners a formal language for what they were observing empirically.
2022
Chinchilla · 70B parameters, 1.4T tokens
DeepMind demonstrated that a smaller, better-trained model could outperform much larger undertrained peers. The field's notion of optimality changed almost overnight.
2023 to 2025
Llama, Claude, Gemini, Mistral and beyond
The post-Chinchilla era brought overtraining for deployment efficiency, multimodal pretraining, synthetic-data pipelines, and a shift from pure model size toward full-system optimization.

Why does scaling work?

Here the story becomes less settled. We have robust empirical evidence that scaling works. We do not yet have a fully satisfying theory for why the behavior is so smooth and so durable.

Capacity and representation

Larger models can encode more structure with less destructive compression. They can devote capacity to rare patterns, long-range dependencies, and subtle distinctions that small models are forced to blur together.

Optimization at scale

Bigger models are not just larger lookup tables. In practice they often optimize into smoother, more reusable internal representations. More compute means more parameter updates; more data means a better estimate of the world-distribution; both help shape the model into a richer predictive engine.

Compression and intelligence

Next-token prediction is also a compression problem. A model that predicts text well has discovered latent regularities in language, knowledge, and behavior. In that sense, lower loss is evidence of a more informative world model. That is one reason scaling laws feel deeper than a benchmark trick.
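
The link between loss and compression is a unit conversion. Assuming the loss is an average cross-entropy in nats per token, dividing by ln 2 gives bits per token, and exponentiating gives perplexity. The loss values below are illustrative, not measurements.

```python
import math

def nats_to_bits(loss_nats):
    """Average bits an ideal entropy coder would need per token at this loss."""
    return loss_nats / math.log(2)

def perplexity(loss_nats):
    """Effective number of equally likely next tokens the model is choosing among."""
    return math.exp(loss_nats)

for loss in (3.2, 2.7, 2.2):  # illustrative values only
    bits = nats_to_bits(loss)
    ppl = perplexity(loss)
    print(f"loss {loss:.1f} nats -> {bits:.2f} bits/token, perplexity {ppl:.1f}")
```

A drop in loss that looks small on a chart means the model needs fewer bits to encode the same text, which is only possible if it has captured more of the structure that generated the text.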

Emergence, with appropriate caution

Some capabilities appear to arrive abruptly rather than smoothly. But this remains an active debate. Some researchers argue that apparent emergence is partly an artifact of thresholded or nonlinear evaluation metrics; others believe there are genuine capability phase transitions in the underlying models. The careful statement is not that emergence is settled, but that the phenomenon is important enough to remain contested.
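
The metric-artifact side of the debate can be illustrated with a toy calculation. Suppose per-token accuracy improves smoothly, and a benchmark only awards credit when every token of a ten-token answer is right. The answer length, the accuracy values, and the independence of token errors are all simplifying assumptions made for this sketch.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is correct, assuming independent token errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# Hypothetical smooth improvement in per-token accuracy across model scales.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_accuracies:
    score = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact-match score {score:.4f}")
```

The underlying skill rises steadily, yet the exact-match score sits near zero for most of the range and then climbs quickly. That does not show all emergence is an artifact; it shows why the choice of metric has to be examined before calling a jump a phase transition.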


Modern scaling is no longer only about N, D, and C

The classical picture is still foundational, but frontier systems now scale along additional dimensions that interact with pretraining.

Dimension 1
Context length
Longer contexts change both capability and serving cost. Retrieval, memory behavior, and attention efficiency all become first-order concerns.
Dimension 2
Synthetic data
As high-quality human text becomes scarcer, self-generated and curated synthetic corpora increasingly determine whether scaling can continue cleanly.
Dimension 3
Inference-time compute
Reasoning traces, search, verification, reranking, and tool use allow capability to grow at deployment time rather than only during pretraining.
Dimension 4
Multimodality
Text is no longer the only substrate. Images, audio, video, and action trajectories widen both the data pipeline and the notion of what "scale" means.
Dimension 5
Serving economics
Memory bandwidth, latency, batchability, and accelerator efficiency now determine which model is truly optimal in the market, not just in the training lab.
Dimension 6
Agentic scaffolding
A model paired with tools, retrieval, code execution, and memory can exhibit a different scaling profile than the base model alone.
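
To see why context length becomes a first-order cost, here is a rough sketch of how the attention term grows relative to the weight term in a forward pass. It uses the approximations from the Kaplan et al. paper for a standard dense transformer: about 12 × n_layer × d_model^2 non-embedding parameters, and per-token forward FLOPs of about 2N plus 2 × n_layer × n_ctx × d_model. The layer count and width below are hypothetical.

```python
def nonembedding_params(n_layer, d_model):
    """Approximate parameter count, assuming d_attn = d_model and d_ff = 4 * d_model."""
    return 12 * n_layer * d_model ** 2

def forward_flops_per_token(n_layer, d_model, n_ctx):
    """Return (weight term, attention term) of the per-token forward cost."""
    weight_term = 2 * nonembedding_params(n_layer, d_model)
    attention_term = 2 * n_layer * n_ctx * d_model
    return weight_term, attention_term

def attention_share(n_layer, d_model, n_ctx):
    """Fraction of per-token forward FLOPs spent in the context-dependent term."""
    weight_term, attention_term = forward_flops_per_token(n_layer, d_model, n_ctx)
    return attention_term / (weight_term + attention_term)

N_LAYER, D_MODEL = 48, 6144  # hypothetical model shape

for n_ctx in (2048, 8192, 32768, 131072):
    share = attention_share(N_LAYER, D_MODEL, n_ctx)
    print(f"context {n_ctx:>6} tokens -> attention share {share:.1%}")
```

Under these approximations the two terms are equal when n_ctx reaches 12 × d_model. Below that, the 6ND style of accounting is a fair summary; well above it, attention is no longer a lower-order overhead.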

Current limits and open questions

Are we hitting a data wall? High-quality public text is finite. The next phase may depend on synthetic data quality, private data partnerships, and multimodal corpora rather than raw web scraping.
Will the power laws bend? The smooth exponents observed so far may not continue forever, especially for specific capabilities rather than aggregate loss.
What is the right objective now? Minimizing pretraining loss is no longer the only game. Alignment, reasoning reliability, tool competence, and latency all matter.
How far can test-time scaling go? If inference-time search and verification continue improving, the center of gravity may shift from ever-larger pretraining runs toward smarter deployment loops.

The bottom line

Pretraining scaling is not a hack or a one-off curiosity. It is a profound empirical regularity: loss improves predictably as parameters, data, and compute increase, at least over the ranges that transformed modern AI. Kaplan gave the field its first clean formalization. Chinchilla corrected its intuition about how to spend compute. The frontier since then has been about folding economics, systems design, and new scaling dimensions into the same story.

What makes the phenomenon remarkable is not only that it works, but that it works predictably enough to plan around. Labs fit small curves, extrapolate, and spend accordingly. The fact that those extrapolations are reliable enough to bet billions of dollars on is itself one of the strangest and most consequential facts about the present technological era.

10^6x
Approximate growth in frontier training compute from the late 2010s into the mid-2020s
~0.05
Order-of-magnitude intuition for the compute-loss exponent often cited in scaling-law discussions
20x
The memorable Chinchilla token-per-parameter heuristic

Further reading: Kaplan et al. (2020), "Scaling Laws for Neural Language Models"; Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models"; Wei et al. (2022), "Emergent Abilities of Large Language Models".