Every time you type a message to an AI — asking it to write a poem, debug your code, or explain quantum physics — something remarkable happens before the model even begins to "think." Your words are sliced apart. Not into letters. Not into whole words. Into something in between: tokens.
Tokens are the fundamental unit of language that large language models (LLMs) like GPT-4, Claude, and Gemini operate on. They are to AI what atoms are to chemistry — the smallest pieces the system actually works with. Understanding tokens unlocks a deeper understanding of why AI behaves the way it does: why it sometimes loses track of long conversations, why API pricing is structured the way it is, and why feeding it a 50-page PDF costs more than a quick question.
This guide starts from first principles and works up to the details that matter most for people who build with or alongside AI every day.
· · ·
Words, Letters, and the Space In Between
The most natural way to chop up language would be by words. Every English dictionary does this; your brain does this instinctively. But words are a surprisingly messy unit for machines. "Run", "running", "runner", and "runs" are all variations of the same concept — a word-based system would have to store and learn each separately. Languages with rich conjugations (like Turkish or Finnish) would explode into unmanageable vocabularies.
The other extreme — splitting text into individual letters — solves the vocabulary problem but creates a different one. To understand the word "banana," the model would have to maintain context across six separate units: b, a, n, a, n, a. That's six times the computation, and relationships between letters are far more abstract than relationships between meaningful word fragments.
"Tokens are the compromise: large enough to carry meaning, small enough to keep the vocabulary finite and manageable."
The solution used by most modern LLMs is a method called Byte-Pair Encoding (BPE), or a variant of it. BPE starts with individual characters (or bytes), then iteratively merges the most frequently occurring adjacent pair into a new "symbol." Each merge adds one entry to the vocabulary, so after tens of thousands of merges over an enormous text corpus the result is a vocabulary of typically tens of thousands to a few hundred thousand tokens, covering common words, frequent word parts, punctuation patterns, and individual characters as a fallback for rare sequences.
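The merge loop described above can be sketched in a few lines. This is a toy illustration, not any production tokeniser: the function name `train_bpe` is hypothetical, it starts from characters rather than bytes, and it ignores spaces and pre-tokenisation rules that real implementations apply.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one new symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(["run", "running", "runner", "runs"], num_merges=2)
print(merges)  # [('r', 'u'), ('ru', 'n')]
```

After only two merges, "run" has become a single symbol shared by all four words, which is the effect the word-based approach could not achieve.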
What Does a Token Actually Look Like?
Here is the sentence "Tokens are the atoms of AI" as a language model actually sees it, with each bracketed piece being one token:
[Tokens][ are][ the][ atoms][ of][ AI]
6 tokens. Each has a unique integer ID in the model's vocabulary. Notice the leading spaces are part of the token itself.
Several things jump out immediately. First, notice that the leading space before "are" is included as part of the token — not as a standalone unit. This is because BPE is trained on raw bytes and learns that a word following a space is a very common pattern worth encoding together. Second, every token maps to a specific integer ID in a lookup table (the "vocabulary"). The model never actually processes text — it processes streams of integers, each one representing a token.
Now look at a stranger example:
[Ser][end][ip][it][ously]
5 tokens. Uncommon words are split at high-frequency subword boundaries the tokeniser learned from training data.
The word "serendipitously" is rare enough that no single token exists for it. Instead, the tokeniser carves it into common sub-pieces: Ser, end, ip, it, ously. The model can still process and generate this word perfectly — it just requires slightly more "budget" than a common word like "the" (which is always a single token).
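To make the text-to-integers step concrete, here is a minimal sketch of encoding and decoding against a lookup table. The vocabulary and its IDs are invented for illustration, and the greedy longest-match strategy is a simplification: real BPE tokenisers replay their learned merges in order and fall back to bytes for unknown characters.

```python
# Hypothetical vocabulary: the pieces come from the example above, the IDs are made up.
TOY_VOCAB = {"Ser": 0, "end": 1, "ip": 2, "it": 3, "ously": 4, " the": 5}


def encode(text, vocab):
    """Split text into the longest known pieces, left to right, and return their IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # A real tokeniser would fall back to single bytes here.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def decode(ids, vocab):
    """Map IDs back to their pieces and join them into text."""
    pieces = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(pieces[token_id] for token_id in ids)


ids = encode("Serendipitously", TOY_VOCAB)
print(ids)                     # [0, 1, 2, 3, 4]
print(decode(ids, TOY_VOCAB))  # Serendipitously
```

The list of integers is all the model ever receives; the text only reappears when the output IDs are decoded.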
· · ·
The Golden Rule: ~¾ of a Word Per Token
As a practical rule of thumb that holds remarkably well across English text: one token ≈ four characters, or about ¾ of a word. Put differently, 100 tokens is roughly 75 words, or about a short paragraph.
These ratios shift for non-English languages and for technical content. Chinese and Japanese text, for instance, often works out to only 1–2 characters per token: each character takes several bytes in UTF-8, and tokeniser vocabularies are usually trained on data dominated by English, so fewer long merges are learned for these scripts. Code is particularly token-hungry: variable names like getUserProfileDataFromCache may be split into five or six tokens, and punctuation-heavy syntax adds up fast.
Think of tokens as the currency your AI uses to read and write. English prose spends about one token per ¾ of a word. A 500-word email costs roughly 670 tokens. A 10-page PDF (~2,500 words) costs around 3,300 tokens. A large codebase or legal document can run into the hundreds of thousands.
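The rule of thumb is easy to turn into a quick estimator. The function names below are hypothetical, and the result is only a ballpark for English prose; an exact count requires running the model's own tokeniser.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose
CHARS_PER_TOKEN = 4     # same rule, expressed in characters


def estimate_tokens_from_words(word_count):
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_text(text):
    """Rough token estimate from raw text length (at least 1 token)."""
    return max(1, round(len(text) / CHARS_PER_TOKEN))


print(estimate_tokens_from_words(75))     # 100
print(estimate_tokens_from_words(500))    # 667
print(estimate_tokens_from_words(2_500))  # 3333
```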
Why Tokens Are the Unit of Measurement (Not Words)
The reason tokens, rather than words, sentences, or characters, became the standard unit comes down to how neural networks process information. Each token is mapped to a fixed-size vector of numbers, and at each generation step the model uses the vectors for all tokens so far to predict a probability distribution over all possible next tokens. The word "step" here is literal: the model generates tokens one at a time in sequence, and everything about its architecture (its layers, its attention mechanism, its output head) is built around this per-token operation.
This is why every meaningful quantity in the LLM world is denominated in tokens:
| Concept | What it means in tokens | Typical range |
|---|---|---|
| Context window | The maximum number of tokens the model can "see" at once (input + output combined) | 4K – 2M tokens |
| Prompt / input tokens | Everything you send to the model in a request | Tens to hundreds of thousands |
| Completion / output tokens | Everything the model generates back to you | Usually 1 – ~8,000 |
| Pricing unit | APIs charge per 1,000 or 1,000,000 tokens processed | $0.25 – $15 / 1M tokens |
| Rate limit | Requests-per-minute and tokens-per-minute quotas | Varies by plan and model |
The Context Window: A Model's Working Memory
If tokens are atoms, the context window is the laboratory bench. It represents the total number of tokens the model can actively process at one time — both what you send in and what it sends back. Think of it as the model's working memory: everything outside this window is, from the model's perspective, as if it never happened.
Early GPT models such as GPT-3 had context windows of just 2,048 tokens: roughly 1,500 words, or about six pages at 250 words per page. Current frontier models range from 128K tokens (roughly 96,000 words, the length of a full novel) to 1M tokens (Gemini 1.5) and beyond. This matters enormously in practice: a short context window means the model may "forget" the beginning of a long conversation, lose track of earlier instructions, or be unable to process a lengthy document in a single request.
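The "forgetting" is often just the application dropping the oldest messages so the conversation still fits. Here is a minimal sketch of that idea, assuming a hypothetical helper `fit_to_window` and the rough four-characters-per-token estimate; a real application would count tokens with the model's actual tokeniser.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token, at least 1."""
    return max(1, len(text) // 4)


def fit_to_window(messages, max_tokens, reserve_for_output):
    """Keep the most recent messages that fit, leaving room for the reply."""
    budget = max_tokens - reserve_for_output
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first that does not fit.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(fit_to_window(history, max_tokens=30, reserve_for_output=8)))  # 2
```

The oldest message is silently dropped once the budget is exhausted, which from the model's side is indistinguishable from it never having been sent.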
"A model cannot selectively 'remember' things outside its context window. If the tokens aren't in the window, they simply don't exist for the model."
The way around context limits — when they do apply — is a technique called retrieval-augmented generation (RAG), where only the most relevant chunks of a large corpus are retrieved and inserted into the context window as needed. But that is a story for another day.
· · ·
Tokens and Pricing: The Economics of Language
For anyone building with AI APIs, understanding tokens has direct financial implications. Cloud AI providers charge separately for input tokens (the prompt you send) and output tokens (the response generated). Output tokens are almost always more expensive, typically 3–5× the input price, because the input can be processed in parallel in a single pass while each output token requires its own sequential pass through the model.
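The arithmetic behind a bill is simple enough to sketch. The prices below are placeholders chosen only to show a 5× output premium; they are not any provider's actual rates, and `request_cost` is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost in dollars of one request, with input and output priced separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Assumed prices: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(2_000, 500, input_price_per_million=3.0,
                    output_price_per_million=15.0)
print(f"${cost:.4f}")  # $0.0135
```

In this example the 500 output tokens cost more than the 2,000 input tokens, which is why response length is often the first thing worth tuning.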
This pricing structure creates clear incentives for developers: keep system prompts concise, summarise earlier conversation history rather than replaying it verbatim, and avoid generating verbose outputs when a terse one will do. A well-designed prompt that achieves its goal in 500 tokens beats a sprawling one that uses 2,000, both on cost and often on quality.
Many AI APIs now offer prompt caching, where frequently used prompt prefixes (like a long system instruction) are stored and reused across requests at a dramatically reduced per-token cost — sometimes as low as 10% of the original price. If you have a large, stable system prompt, enabling caching can cut your bill substantially.
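A rough comparison shows the scale of the saving. This sketch assumes the first request pays full price for the prefix and every later request pays 10% of it; real providers differ in the details (some add a surcharge for writing to the cache, and cached entries expire), so treat the function and its numbers as illustrative.

```python
def input_cost(prefix_tokens, fresh_tokens, requests,
               price_per_million, cached_fraction=None):
    """Total input cost in dollars across `requests` calls sharing one prompt prefix.

    cached_fraction=None means no caching; 0.10 means cached prefix
    tokens are billed at 10% of the normal input price.
    """
    fresh = requests * fresh_tokens * price_per_million
    if cached_fraction is None:
        prefix = requests * prefix_tokens * price_per_million
    else:
        # First call pays full price; later calls pay the cached rate.
        prefix = prefix_tokens * price_per_million * (
            1 + (requests - 1) * cached_fraction
        )
    return (prefix + fresh) / 1_000_000


# Assumed workload: 10,000-token system prompt, 100 fresh tokens, 100 requests, $3 per 1M.
without = input_cost(10_000, 100, 100, 3.0)
with_cache = input_cost(10_000, 100, 100, 3.0, cached_fraction=0.10)
print(f"${without:.2f} vs ${with_cache:.2f}")
```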
Surprising Token Behaviours Worth Knowing
Tokens produce some counterintuitive behaviours that trip up even experienced AI users:
1. Spelling is hard. "Count the letter e in 'serendipitous'" is tricky because the model sees tokens, not letters. It must reason about character-level structure from a token-level representation, like reading Morse code when you only have the words.
2. Arithmetic can be error-prone. The number "1,234,567" may be tokenised as several separate tokens. The model has no built-in abacus; it must learn numerical patterns from data, which is why it sometimes makes arithmetic mistakes and why code-execution tools help enormously.
3. Rare words cost more. An obscure proper noun or technical term may use 3–5 tokens where a common synonym uses just one. Prompts using simpler vocabulary can sometimes be more token-efficient without losing meaning.
4. Tokenisation is language-specific. The same number of characters in English vs. a morphologically rich language can produce vastly different token counts. A Spanish sentence typically uses slightly more tokens than its English translation; Turkish and Finnish even more so.
5. Whitespace and formatting matter. Markdown headers, bullet points, extra newlines, and indentation all consume tokens. A heavily formatted document can use 20–30% more tokens than plain prose conveying the same information.
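The first two items are exactly the kind of task that becomes trivial once the work happens at the character level in ordinary code, which is why giving a model a code-execution tool helps. A small sketch, with hypothetical helper names:

```python
def count_letter(word, letter):
    """Count occurrences of a letter by looking at characters, not tokens."""
    return word.lower().count(letter.lower())


def parse_number(text):
    """Turn a comma-grouped number such as "1,234,567" into an exact integer."""
    return int(text.replace(",", ""))


print(count_letter("serendipitous", "e"))                  # 2
print(parse_number("1,234,567") + parse_number("7,654,321"))  # 8888888
```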
How the Model Uses Tokens to Generate Text
Generation runs in the opposite direction from tokenisation: token IDs come out of the model and are decoded back into text. Once the model has processed all your input tokens, it produces output one token at a time. At each step, it outputs a probability distribution over every token in its vocabulary (all ~50,000+ of them) and samples from that distribution. The sampled token is appended to the context, and the process repeats. This continues until the model produces a special "end of text" token, or until a length limit is hit.
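The loop itself is short. The sketch below swaps the neural network for a hand-written lookup table (`TOY_MODEL`, entirely made up) that conditions only on the previous token, whereas a real LLM conditions on the whole context. The sample, append, repeat structure is the part that carries over.

```python
import random

END = "<end>"  # stand-in for the special end-of-text token

# Hypothetical "model": next-token probabilities given only the previous token.
TOY_MODEL = {
    "<start>": {"Tokens": 1.0},
    "Tokens": {" are": 0.9, " matter": 0.1},
    " are": {" atoms": 0.7, " units": 0.3},
    " atoms": {END: 1.0},
    " units": {END: 1.0},
    " matter": {END: 1.0},
}


def generate(model, max_tokens=10, seed=0):
    """Sample one token at a time until END or the length limit is reached."""
    rng = random.Random(seed)
    context = ["<start>"]
    for _ in range(max_tokens):
        distribution = model[context[-1]]
        tokens = list(distribution)
        weights = list(distribution.values())
        next_token = rng.choices(tokens, weights=weights, k=1)[0]
        if next_token == END:
            break
        context.append(next_token)  # the sampled token becomes part of the context
    return "".join(context[1:])


print(generate(TOY_MODEL))
```

Setting `max_tokens=1` cuts the output off after the first token, the same way a length limit truncates a real response mid-thought.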
The "temperature" setting you may have seen in API calls controls the sharpness of this distribution. A temperature of 0 always selects the single highest-probability token (deterministic, repetitive). A high temperature of 1.5 flattens the distribution, producing wilder, more creative — sometimes incoherent — outputs. Most production use cases land between 0.2 and 1.0.
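Temperature is usually applied by dividing the model's raw scores (logits) by the temperature before converting them to probabilities. The sketch below uses three made-up logits and treats a temperature of 0 as a special case that puts all probability on the top-scoring token; the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability on the single highest-scoring token.
        top = logits.index(max(logits))
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0, 0.2, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Low temperatures push almost all of the probability onto the top token, while higher ones spread it across the alternatives.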
· · ·
The Bigger Picture
Tokens are, at root, an engineering compromise — a practical solution to the challenge of turning the infinite complexity of human language into discrete units that a neural network can process efficiently. They are not "words" and they are not "letters." They are something new: statistical units optimised for the patterns that actually appear in large bodies of text.
Understanding them does not require a PhD in machine learning. It requires only the insight that when you type to an AI, you are not typing to something that reads the way you do. You are sending a stream of numerical codes into a system that has learned, from unimaginable amounts of text, what tends to follow what. The miracle is not just that this works — it's how well it works.
Every response you've ever received from a language model was, at its core, a very sophisticated sequence of token predictions. Now you know what a token is. The rest is statistics.
Key Takeaways
- A token is a subword unit (larger than a letter, usually smaller than a word) used by LLMs as their basic unit of text.
- In English, roughly 1 token ≈ 4 characters ≈ ¾ of a word. 1,000 tokens ≈ 750 words.
- The context window is the total number of tokens a model can process at once: its working memory.
- API pricing, rate limits, and model behaviour all flow directly from token counts.
- Tokenisation shapes AI behaviour in subtle ways: spelling tasks, arithmetic, rare words, and formatting all interact with how text gets split.