How AI breaks down language into tokens
Now that we've seen why older language models kept hitting walls — why rule-based systems couldn't generalize, why RNNs struggled to hold long contexts — we're almost ready to talk about what transformers actually do differently. But before we get there, we need to answer a more fundamental question: what actually goes into a language model in the first place?
Before any sophisticated processing can happen, the text you give the model has to be converted into something it can actually work with. That something is called a token, and understanding tokens is absolutely critical to understanding both why LLMs are so capable and why they sometimes fail in bizarrely specific ways.
Consider this small experiment: Open ChatGPT and ask it, "How many times does the letter 'r' appear in the word 'strawberry'?" A distressing number of people have discovered, sometimes to great amusement and sometimes to genuine frustration, that a model capable of writing sonnets, debugging Python, and explaining quantum entanglement will occasionally fumble this task. This isn't a fluke, and it's not a sign of incompetence. It's a direct, logical consequence of one foundational design decision that sits before any of the sophisticated processing even begins: ChatGPT doesn't read letters. It doesn't even read words. It reads tokens. Everything downstream — the context window, the model's surprising linguistic range, the hallucinations, the occasional letter-counting failures — traces back to this one choice about how input gets prepared.
Consider the alternatives, and you'll see why tokens exist at all. You could tokenize at the character level — split everything into individual a's, b's, c's — but then a single sentence becomes hundreds of items for the model to process. The context window shrinks instantly. Or you could tokenize at the word level, which seems natural but creates a nightmare: you'd need a lookup table with millions of rows, which is computationally enormous and also completely unable to handle any word it hasn't seen before. A word like "GPT-4o" or a user's unusual surname simply wouldn't exist in the vocabulary. The model would be helpless.
Tokens are the elegant solution to both problems. They live in between characters and words — chunked at a level that keeps context tractable, keeps vocabulary manageable, and handles novel text gracefully.
What a Token Actually Is
A token is a chunk of text. That's the plain-English definition, and while it sounds underwhelming, the interesting part is how those chunks are determined.
As IBM's explanation of LLMs describes, tokens are "smaller units such as words, subwords or characters" — a deliberately vague definition because the boundaries of tokens aren't fixed by any grammatical rule. They're learned from data.
In practice, common short words tend to become their own tokens. "the," "is," "of," "and" — these are used so frequently that it would be wasteful to break them up further. Longer or rarer words tend to get split into pieces. "tokenization" might become ["token", "ization"]. A highly specialized word like "cholangiocarcinoma" might become five or six fragments. And emojis, punctuation, and even individual letters can each become their own token when the situation calls for it.
The best way to get an intuitive feel for this is to play with OpenAI's tokenizer tool, which lets you type any text and see exactly how it gets split. A few minutes with that tool will teach you more about tokenization than several paragraphs of explanation.
Here are some examples that reveal the underlying logic:
- "Hello" → 1 token (common word)
- "Hello!" → 2 tokens ("Hello" + "!")
- "unbelievable" → 3 tokens ("un" + "believ" + "able")
- "ChatGPT" → 2 tokens ("Chat" + "GPT")
- "tokenize" → 3 tokens (the exact splits vary by model)
The important thing to internalize is this: token boundaries are rarely where you expect them to be. "Tokenize" and "token" are not guaranteed to split at the same place just because they share a root. "New York" is not one token just because it's one concept. The tokenizer doesn't know anything about meaning. It knows about frequency.
Byte Pair Encoding: How the Tokenizer Learns Its Splits
The algorithm behind modern tokenization — the one used by GPT models — is called Byte Pair Encoding, or BPE. The name sounds technical, but the core idea is elegantly simple: start from the smallest possible units and repeatedly merge the most common pairs.
BPE was originally a data compression algorithm, invented in the 1990s, and its application to NLP tokenization was a genuine insight — one of those moments where borrowing from a completely different field produces something unexpectedly powerful. The version used by GPT-2, GPT-3, and GPT-4 is specifically called Byte-Level BPE — and the "byte" part is not incidental. It's central to why modern models are so robust.
Here's the key insight: instead of starting from characters (which vary wildly across languages and scripts), Byte-Level BPE starts from raw bytes. Every piece of text you can possibly type — in any language, any script, any emoji — can be represented as a sequence of bytes in the UTF-8 encoding standard. And here's the elegant part: there are exactly 256 possible byte values. That's it. So the base vocabulary starts with exactly 256 items, one for each possible byte.
This is a remarkably clean foundation. You don't need to decide in advance which languages to support. You don't need a special "unknown word" token for text you've never seen before. Every string of text that has ever existed or will ever exist can be represented as some sequence drawn from these 256 starting elements. An emoji, a Chinese character, a newly coined slang term, your unusual surname — all of it breaks down cleanly into bytes.
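This byte-level foundation is easy to verify directly. A minimal sketch in Python, showing that any string in any script decomposes into UTF-8 bytes drawn from the same 256 possible values:

```python
# Any string, in any script, decomposes into UTF-8 bytes,
# and every byte is one of exactly 256 possible values (0-255).
samples = ["hello", "héllo", "你好", "مرحبا", "🙂"]

for text in samples:
    raw = text.encode("utf-8")  # the byte sequence a byte-level tokenizer starts from
    assert all(0 <= b <= 255 for b in raw)
    print(text, "->", len(raw), "bytes:", list(raw))
```

There is no input for which `encode("utf-8")` fails, which is exactly why a byte-level tokenizer never needs an "unknown" token.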
From there, the algorithm builds up:
graph TD
A[Start: large text corpus] --> B[Convert all text to raw UTF-8 bytes - 256 base tokens]
B --> C[Count all adjacent byte pairs]
C --> D[Merge the most frequent pair into a new token]
D --> E[Update the vocabulary with the new merged token]
E --> F{Vocabulary size reached?}
F -- No --> C
F -- Yes --> G[Final tokenizer vocabulary]
Start with a huge corpus of text — billions of words scraped from the internet, books, code repositories. Represent everything as bytes. Now count every pair of adjacent bytes across the entire corpus. Which pair appears most often? In English text, it's likely the bytes corresponding to "t" followed by "h" — since "the," "that," "this," "they," and hundreds of other common words all contain "th." Merge that pair into a single new token: the byte sequence for "th." Vocabulary: 257 items.
Repeat. Count again. Now "th" might pair frequently with "e" to form "the" — the most common word in English. Merge again. Vocabulary: 258 items.
Keep going. After tens of thousands of merges, you've built a vocabulary that contains the most common bytes, byte pairs, syllables, subwords, and complete words in the training corpus. Each merge encodes a real statistical regularity — the fact that certain byte sequences appear together constantly enough to be worth treating as a single unit.
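The merge loop described above can be sketched in a few lines. This is a toy illustration on a tiny corpus using characters as stand-ins for bytes, not OpenAI's implementation (which operates on raw bytes over a web-scale corpus):

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as a list of single characters (stand-ins for bytes).
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Fuse that pair everywhere it occurs.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

corpus = ["the", "the", "the", "them", "then", "this", "that"]
print(train_bpe(corpus, 3))
```

On this corpus the first merge is ("t", "h") and the second is ("th", "e"), mirroring the "th" → "the" progression described above.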
The process stops when you hit a predetermined vocabulary size. For GPT-3 and GPT-4, that target is around 50,000 to 100,000 tokens. GPT-3's encoding (r50k_base, available via OpenAI's tiktoken library) has a vocabulary of 50,257 tokens: 256 base bytes, plus 50,000 learned merges, plus one special end-of-text token. GPT-4's encoding (cl100k_base) expanded this to roughly 100,000. The exact number is a design choice that involves tradeoffs — a larger vocabulary means fewer tokens per sentence (more efficient) but requires more embedding parameters to represent each possible token.
What Byte-Level BPE produces is not just a vocabulary — it's also a set of learned rules for how to apply that vocabulary. When the tokenizer encounters new text at inference time, it applies those merge rules in order, greedily combining bytes into the largest matching token it can find. The result is that common words become single tokens, rare words get broken into recognizable subword fragments, and truly novel strings get broken down as far as individual bytes if necessary — but never fail entirely.
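The "largest matching token" behavior can be sketched with a greedy longest-match encoder. This is a simplification — real BPE tokenizers replay the learned merge rules in training order rather than matching against the vocabulary directly — but the effect is the same: big tokens for common strings, graceful fallback for everything else:

```python
def encode(text: str, vocab: set[str]) -> list[str]:
    """Greedily match the longest vocabulary entry at each position.
    Falls back to single characters (stand-ins for bytes), so no input
    can ever fail to tokenize."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible match first, shrinking until one fits.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"token", "ization", "un", "believ", "able"}
print(encode("tokenization", vocab))  # -> ['token', 'ization']
print(encode("unbelievable", vocab))  # -> ['un', 'believ', 'able']
print(encode("xyz", vocab))           # -> ['x', 'y', 'z']  (fallback to base units)
```

Note the last case: "xyz" is not in the vocabulary and shares no fragments with it, yet encoding still succeeds by dropping down to the base units — the character-level analogue of byte fallback.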
The Vocabulary: A Fixed Menu of Possibilities
Once training is complete, the tokenizer's vocabulary is frozen. It's a fixed list of roughly 50,000–100,000 items — call it the model's "alphabet," though that word doesn't quite capture how much richer this alphabet is than the 26 letters we usually mean.
Every piece of text that enters the model — every prompt you've ever typed into ChatGPT — gets translated into a sequence of integers drawn from this fixed menu. "The cat sat on the mat" doesn't enter the model as letters or words. It enters as something like [464, 3797, 3332, 319, 262, 2603] — a series of index numbers pointing to specific entries in the vocabulary.
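The translation from token strings to integer IDs is nothing more than a table lookup. A toy sketch — the vocabulary fragment below is made up to mirror the illustrative IDs above, not pulled from a real tokenizer:

```python
# Hypothetical vocabulary fragment: token string -> integer ID.
# Note the leading spaces: " cat" and "cat" would be different entries.
vocab = {"The": 464, " cat": 3797, " sat": 3332, " on": 319, " the": 262, " mat": 2603}

def to_ids(tokens: list[str]) -> list[int]:
    """Map each token string to its integer ID in the frozen vocabulary."""
    return [vocab[t] for t in tokens]

ids = to_ids(["The", " cat", " sat", " on", " the", " mat"])
print(ids)  # -> [464, 3797, 3332, 319, 262, 2603]
```

From this point on, the model only ever sees the integers on the last line.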
The model never sees the actual characters. From the moment tokenization happens, the model lives entirely in the world of integer IDs and the vector embeddings associated with them. (More on embeddings in the next section — that's where things get genuinely magical.)
This has several important consequences:
The vocabulary is multilingual by design. Because Byte-Level BPE starts from raw bytes rather than language-specific characters, it handles every script automatically. Chinese characters, Arabic text, Russian Cyrillic, emoji — all of it is just sequences of bytes, and the merge algorithm learns to group whatever co-occurs frequently. The tokenizer didn't "decide" to support multiple languages; multilingual support emerges from the byte-level foundation and the multilingual training data.
But not all languages are equal. Because the training data contains vastly more English than any other language, English text is tokenized much more efficiently. An English word like "hello" is likely one token. Its equivalent in a less-represented language might be three or four tokens. This has real consequences: as researchers have noted, the model effectively has a smaller "working memory" when operating in those languages, because each thought requires more tokens to express, eating up more of the context window. A prompt in Thai or Swahili that would be 100 tokens in English might be 300 tokens in the model's representation — which means the model can hold less context for the same conversation length.
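Part of this cost is visible even before any merges are learned: non-Latin scripts simply need more UTF-8 bytes per character, so the raw material is longer to begin with. A quick demonstration (the bulk of the inefficiency, though, comes from English dominating the learned merges):

```python
# Bytes per character before any merging: English ASCII is 1 byte per
# character in UTF-8, while Thai characters are 3 bytes each.
for text in ["hello", "สวัสดี"]:
    print(f"{text!r}: {len(text)} chars, {len(text.encode('utf-8'))} bytes")
# 'hello': 5 chars, 5 bytes
# 'สวัสดี': 6 chars, 18 bytes
```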
Why Tokenization Creates Surprising Behaviors
Here's where this becomes more than an engineering detail and starts explaining real things about how ChatGPT behaves in the wild.
The letter-counting problem. When you ask ChatGPT how many times "r" appears in "strawberry," the model is not working with the characters s-t-r-a-w-b-e-r-r-y. It's working with tokens — probably something like "st" + "raw" + "berry" (the exact split depends on the tokenizer). The individual letters are not discrete units the model processes. It's a bit like asking someone who reads only in syllable chunks to count individual letters — they can probably do it by reasoning about what they know, but they're working against their native grain.
This doesn't mean the model is fundamentally incapable of letter reasoning. With careful prompting — "spell out each letter first, then count" — the model often gets it right, because forcing the model to generate the letters sequentially gives each letter its own token to reason about. But the naive version of the question is surprisingly hard for exactly the reason you'd expect given how tokens work.
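The mismatch between the two views can be made concrete. A toy sketch, using one plausible (not guaranteed) split of "strawberry":

```python
word = "strawberry"
tokens = ["st", "raw", "berry"]  # one plausible split; real splits vary by model

# Character-level view: trivial for a program that sees letters.
assert word.count("r") == 3

# Token-level view: the model receives opaque integer IDs, not strings.
# Mimicking that with hashes makes the point - no 'r' is visible anywhere.
ids = [hash(t) for t in tokens]
print(ids)  # three large integers carrying no character information

# The "spell it out first" prompt trick works because emitting one letter
# per step gives each character its own token to reason about.
spelled = list(word)
assert spelled.count("r") == 3
```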
The strange arithmetic of words. To you, "unbelievable" and "unbelievably" are nearly the same word, but the tokenizer may split them into quite different sequences, with boundaries in different places. The model can learn that they're related — it sees them in similar contexts — but it has to learn this from data rather than from the obvious fact that one is just the other with the final "-e" swapped for a "-y". Characters within a token are not separately accessible to the model's processing. The token is the atom.
The efficient coding problem. Because common words are single tokens and rare words are many tokens, the model's computation is not uniformly distributed across text. Processing 500 tokens of common English conversation is very different from processing 500 tokens of dense legal jargon or highly technical code. The same token budget buys different amounts of "meaning" in different domains.
The whitespace surprise. One thing that trips up people when first exploring tokenization: spaces are often included in tokens. " hello" (space before "hello") and "hello" (no space) are different tokens. "Hello" at the start of a sentence might be tokenized differently than "hello" in the middle of one. The tokenizer is not respecting any linguistic boundary — it's just following statistical patterns about what byte sequences appear together.
Tokens and the Context Window
One of the most practical implications of tokenization for everyday users is the context window — the limit on how much text the model can "see" at once. You'll hear that GPT-4 Turbo has a 128,000-token context window, or that some models support up to a million tokens. These limits are measured in tokens, not words or characters.
A rough rule of thumb: one token is approximately three-quarters of a word in English. So 100,000 tokens is roughly 75,000 words — about the length of a short novel. But this rule of thumb breaks down quickly outside of clean English text. Code is often more token-efficient than prose (many keywords are single tokens). Dense academic writing with unusual vocabulary is less efficient. And non-English text is often substantially less efficient, for the reasons described above.
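The rule of thumb turns into quick budget arithmetic. A sketch — the 0.75 words-per-token ratio is the rough English-only heuristic from the text, not a tokenizer-derived constant:

```python
def words_that_fit(context_tokens: int, reserve_for_reply: int = 0) -> int:
    """Roughly how many English words fit in a context window,
    using the ~0.75 words-per-token rule of thumb."""
    usable = context_tokens - reserve_for_reply
    return int(usable * 0.75)

print(words_that_fit(100_000))         # -> 75000, about a short novel
print(words_that_fit(128_000, 4_000))  # -> 93000, leaving room for the reply
```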
IBM's LLM overview notes that after text is split into tokens, each token is mapped to a vector of numbers called an embedding — and it's the sequence of these embeddings that gets passed through the model. Every position in the context window costs computation. Every token must be processed. This is why longer context windows require more compute and more memory — not because the text is longer, but because there are more tokens, each of which participates in calculations across the entire sequence.
When a user asks why their conversation suddenly seemed to "forget" earlier context, the answer is almost always tokens. They've exceeded the context window. The model has no memory of what came before because that text literally doesn't exist in its current input — it was dropped to make room for newer tokens.
The Vocabulary Gap and Rare Words
Byte-Level BPE handles words it's never seen before not merely gracefully but completely — because at the byte level, there is no such thing as an unknown input. Every possible string of text, no matter how strange, can be decomposed into bytes, and the tokenizer has a token for every possible byte. The model will never encounter a string and simply fail to process it.
For somewhat unusual words, the merge rules will typically find recognizable subword fragments. "Cromulence" might become ["Crom", "ul", "ence"] or some similar decomposition. The fragment "ence" appears in "science," "experience," "sentence," "influence" — all of which relate to abstract nouns. So even without having ever seen "cromulence," the model has some statistical basis for treating it like a noun of a certain kind. Whether this represents "understanding" in any meaningful sense is a philosophical question we'll take up later in the course — but it does mean the model degrades gracefully rather than failing completely when it encounters unfamiliar text.
The same logic applies to names. "ChatGPT" probably becomes "Chat" + "GPT" (or something similar), because neither the full string nor the capitalization pattern was common enough in the training corpus to merit its own merged token. Each fragment carries its own associations, and the model stitches those associations together into something approximating an understanding of the whole.
Why Tokenization Is an Underrated Concept
Most explainers of LLMs skip tokenization quickly, treating it as mere plumbing. But if you've followed this section carefully, you can already see that tokenization is load-bearing. It determines:
- What information the model can and can't easily access about its own inputs
- How efficiently different languages and domains use the context window
- Why certain simple tasks (letter counting, reverse spelling) require special handling
- How the model handles novel words, names, and emoji without ever breaking
- The exact nature of the vocabulary — the set of "atoms" from which all language is constructed
graph LR
A[Raw text input] --> B[Convert to UTF-8 bytes]
B --> C[Apply BPE merge rules]
C --> D[Token IDs - sequence of integers]
D --> E[Embedding lookup]
E --> F[Model processes token vectors]
F --> G[Output: probability over next token]
G --> H[Sample next token ID]
H --> I[Decode back to text]
The tokenizer sits at the very front door of the model. Every insight the model has, every poem it writes, every bug it fixes — all of that emerges from processing a sequence of token IDs. Understanding what those IDs are and how they were determined is the foundation on which everything else is built.
There's something philosophically interesting here too, if you're willing to squint at it. We take it for granted that words are the natural unit of language. But are they? Poets play with syllables. Linguists study morphemes. Children learning to read learn phonemes. The "word" is a convenient fiction — one that happens to work well for human communication but isn't obviously the right atom for a learning algorithm. Byte-Level BPE, by starting from the smallest possible units and merging upward based on raw statistical frequency, effectively re-discovers the structure of language from scratch — and it finds something that doesn't quite match what linguists would have predicted. That's worth sitting with.
In the next section, we'll take the sequence of token IDs that emerges from tokenization and ask the next question: once you have a list of integers, how does the model encode meaning into them? The answer — word vectors and embeddings — is where things get genuinely beautiful.