How ChatGPT's Context Window Works
The scaling story we've just traced — from GPT-1's 117 million parameters to GPT-4's unknown vastness — reveals something crucial about how these models work: massive capability emerges from massive training. But emergence has limits, and one of those limits is right in front of you every time you use ChatGPT. The model has no memory of you. Not of your name, your project, the conversation you had yesterday, or the brilliant prompt you spent an hour crafting last Tuesday. Each time you open a new chat, you're talking to something that has never met you before — a mind that springs into existence with your first message and dissolves the moment you close the window.
And yet, within a single conversation, ChatGPT can track a complex argument across dozens of exchanges, refer back to something you mentioned ten messages ago, and maintain a consistent persona throughout. How does it hold all of that together without any persistent memory? The answer is the context window — one of the most practically important, and most frequently misunderstood, concepts in all of AI. Understanding it doesn't just satisfy curiosity. It explains why ChatGPT loses track of long conversations, why you sometimes need to repeat yourself, why the model occasionally contradicts something it said earlier, and why the apparent "memory" in some AI products is more engineering workaround than genuine remembering. Once you see how it works, a lot of the quirks of interacting with these models snap into focus.
Everything at Once, All the Time
Here's the mechanism, and it's genuinely surprising. When you send a message to ChatGPT, the model doesn't read your message and then consult some internal database of your previous exchanges. There is no database. Instead, the entire conversation — every single word you've typed and every single word the model has generated — gets assembled into one long sequence of tokens and fed into the model simultaneously. The model processes all of it at once, right now, in order to predict what should come next. What the model is fundamentally doing at every step is asking: given everything I've seen so far, what token should come next? That "everything I've seen so far" is the context window. It's not a memory system — it's the model's entire perceptual field, the complete universe of information it's working with at any given moment.
Now, there's a wrinkle here worth understanding. The correspondence between tokens and human-readable text isn't 1:1. As covered earlier in this course, a token is roughly three-quarters of a word on average — common English words are often a single token, but unusual words, proper nouns, and non-English text can be split into multiple tokens. The word "unbelievable" might be one token; the word "Schokoladeneis" (German for chocolate ice cream) might be four.
What this means practically: a context window of 4,096 tokens fits roughly 3,000 words of typical English text. A 128,000-token context window — the size available in some current models — can hold somewhere around 96,000 words, or roughly the length of a short novel.
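The arithmetic behind those estimates is simple enough to sketch. This is a rough back-of-the-envelope conversion only; the 0.75 words-per-token ratio is an average for English text, not a guarantee for any particular input:

```python
import math

WORDS_PER_TOKEN = 0.75  # rough average for typical English text

def tokens_to_words(num_tokens: int) -> int:
    """Approximate English words that fit in a given token budget."""
    return int(num_tokens * WORDS_PER_TOKEN)

def words_to_tokens(num_words: int) -> int:
    """Approximate tokens consumed by a given word count (rounded up,
    since underestimating token usage is the costlier mistake)."""
    return math.ceil(num_words / WORDS_PER_TOKEN)

print(tokens_to_words(4_096))    # → 3072, roughly "3,000 words"
print(tokens_to_words(128_000))  # → 96000, roughly a short novel
```

For real token counts you would use the model's actual tokenizer rather than a fixed ratio, since code, non-English text, and unusual vocabulary can consume tokens much faster than this average suggests.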
Why tokens rather than words? Because the model doesn't actually work in words — it works in tokens. The attention mechanism at the heart of the transformer architecture processes each token as a unit, computing relationships between every token and every other token in the sequence. The context window limit is, at its core, the limit on how many of these units the model can process in one forward pass.
Why the Limit Exists: Attention's Quadratic Problem
The attention mechanism — the key innovation that makes transformers so powerful — is also why context windows have hard limits. Here's where things get interesting.
In attention, every token in the sequence must "look at" every other token to determine which ones are relevant to predicting the next token. If your context has 1,000 tokens, that's roughly 1,000 × 1,000 = one million attention computations. If it has 4,000 tokens, that's 16 million. If it has 100,000 tokens, that's 10 billion. The computational cost grows quadratically — doubling the context length quadruples the computation required.
```mermaid
graph TD
    A["Context Length: 1,000 tokens"] --> B["~1M attention computations"]
    C["Context Length: 4,000 tokens"] --> D["~16M attention computations"]
    E["Context Length: 32,000 tokens"] --> F["~1B attention computations"]
    G["Context Length: 128,000 tokens"] --> H["~16B attention computations"]
    B --> I["Fast, cheap"]
    D --> J["Manageable"]
    F --> K["Expensive"]
    H --> L["Very expensive, requires specialized hardware"]
```
This isn't a temporary engineering limitation that a clever programmer can optimize away. It's a fundamental property of the attention mechanism as it's currently designed. Researchers are actively working on attention variants that scale more efficiently — sparse attention, linear attention, and other approximations — but as of now, extending context length still comes with significant computational cost. This is why longer-context models are more expensive to run, and why providers of large language models typically price their larger-context API tiers at a higher per-token rate.
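The quadratic growth is easy to verify in a few lines, counting one computation per token pair as the figures above do:

```python
def attention_pairs(context_len: int) -> int:
    """Number of token-pair computations in one pass of a single
    attention layer: every token attends to every token, so n * n."""
    return context_len * context_len

for n in [1_000, 4_000, 32_000, 128_000]:
    print(f"{n:>7} tokens -> {attention_pairs(n):,} pair computations")

# Doubling the context length quadruples the work:
assert attention_pairs(8_000) == 4 * attention_pairs(4_000)
```

This counts only pairwise interactions in a single layer; a real model multiplies this by the number of layers and attention heads, which is why the constant factors matter as much as the asymptotics in practice.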
The practical takeaway: context length is genuinely a design tradeoff, not just an arbitrary cap. Making the window larger makes the model more capable in some ways and more expensive in others.
What Gets Put in the Context Window
Here's something users often don't realize: your message is never the only thing in the context window. In a typical ChatGPT conversation, the context contains at least three distinct components.
```mermaid
graph TD
    A[Full Context Window] --> B[System Prompt]
    A --> C[Conversation History]
    A --> D[Your Latest Message]
    B --> E["Model's persona, rules, capabilities<br/>(set by OpenAI or application developer)"]
    C --> F["All previous user messages +<br/>all previous assistant responses"]
    D --> G["What you just typed"]
```
The system prompt sits at the very beginning. It's a set of instructions, invisible to you, that was written by OpenAI (or, if you're using ChatGPT embedded in a third-party product, by that product's developers). The system prompt tells the model who it is, how it should behave, what it should and shouldn't do, and sometimes what information it should have about the context you're using it in. A customer service chatbot built on ChatGPT might have a system prompt like: "You are a helpful assistant for Acme Corporation. You only answer questions about our products. Never discuss competitors. Always maintain a professional tone." That prompt consumes tokens in the context window — sometimes quite a few of them.
The conversation history takes up the bulk of the context in a long conversation. Every message you've sent and every response the model has generated is preserved and prepended to each new query. When ChatGPT seems to "remember" something you said fifteen messages ago, it's not because it stored it somewhere — it's because that message is literally still present in the context, and the model can attend to it while generating its response.
Your current message is, of course, also there. But now you can see that it's one part of a much larger input.
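In the chat-style APIs these three components are literally concatenated into one message list on every request. A minimal sketch, using the widely documented role-based message format (the prompt and message texts here are invented for illustration):

```python
def build_context(system_prompt: str, history: list[dict], user_message: str) -> list[dict]:
    """Assemble the full input the model sees on one request:
    system prompt first, then the entire conversation history,
    then the newest user message."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": user_message}]
    )

history = [
    {"role": "user", "content": "My deadline is Friday."},
    {"role": "assistant", "content": "Noted — Friday it is."},
]
context = build_context("You are a helpful assistant.", history, "What should I prioritize?")
print([m["role"] for m in context])  # → ['system', 'user', 'assistant', 'user']
```

Note that the entire `history` list is resent with every turn — there is no state held on the model's side between requests.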
This architecture has a beautiful simplicity to it. The model doesn't need a special "memory" module or a separate retrieval system. It just needs everything in the context, and it can attend to all of it simultaneously. But it also means the context window is a genuinely shared and finite resource, constantly being consumed by all three components.
When the Conversation Fills Up
So what happens when a conversation goes on long enough that it would exceed the context window?
Something has to give. And what gives is the beginning of the conversation.
Different implementations handle this slightly differently, but the most common approach is simple truncation: the oldest messages in the conversation history are dropped to make room for the newest ones. From that point on, the model simply cannot see those early exchanges. They're gone. It will write as if they never existed, because from its perspective, they didn't.
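A minimal sketch of that oldest-first truncation policy — this uses a crude word count as a stand-in for a real tokenizer, and real products vary in the details:

```python
def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the context fits.

    The system prompt (assumed to be the first message) is always kept;
    conversation history gives way oldest-first.
    """
    def count(msg: dict) -> int:
        # Crude stand-in for a real tokenizer's token count.
        return len(msg["content"].split())

    system, history = messages[0], messages[1:]
    while history and count(system) + sum(count(m) for m in history) > max_tokens:
        history.pop(0)  # the beginning of the conversation is dropped first
    return [system] + history
```

After truncation, the dropped messages are simply absent from the model's input — which is exactly why it writes as if they never existed.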
This creates some genuinely strange behaviors. The model might agree with something you said early in a long conversation, then contradict itself much later — not because it changed its mind, but because the earlier message literally no longer exists in its field of view. It might ask you something you already answered. It might lose track of a constraint you established at the beginning ("always respond in French," for example) if that instruction appeared only in early messages that have since been dropped.
The practical lesson here is counterintuitive: in a very long conversation with ChatGPT, the most important information to preserve is actually the most recent context, not the oldest. The beginning of your conversation may be invisible to the model by the time you're deep into it.
If you've ever noticed ChatGPT become less coherent or more forgetful as a conversation stretches on, this is almost certainly why. It's not the model degrading — it's the model losing access to context that was originally informing its responses.
Why ChatGPT Doesn't Remember Last Week
This brings us to the question that probably every ChatGPT user has wondered about: why doesn't the model remember your conversation from yesterday? Or the one from last month, where you explained your entire project in detail?
The answer should be clear now: each conversation starts with a fresh context window. The previous conversation isn't in the context, so the model doesn't know it happened. It's not that OpenAI chose not to implement memory — it's that "memory across conversations" requires something fundamentally different from what the base model does. It requires either expanding the context (which gets expensive fast), or storing information externally and retrieving it (which is an engineering system layered on top of the model), or fine-tuning the model on your data (which is expensive and raises privacy concerns).
The "Memory" feature that OpenAI has added to ChatGPT is, in fact, exactly the second approach: an external storage system. When memory is enabled, important facts from your conversations — your name, your preferences, your ongoing projects — are extracted, stored in a database outside the model, and then selectively injected into the context window at the start of future conversations. This is a clever engineering workaround, but it's worth understanding what it actually is. The model is not "remembering" in any deep sense. Someone (or something) is looking up relevant notes and slipping them into the context before you start talking.
This is why memory features can feel both impressive and oddly clunky. They're impressive when the retrieval works well — it really does feel like the AI knows you. They're clunky when the retrieval gets it wrong, surfaces irrelevant information, or misses something important. The limitation isn't in the model; it's in the retrieval system that decides what to inject into your context.
The Invisible Shaping Force: System Prompts
System prompts deserve a closer look, because they're doing more work than most users realize.
In every ChatGPT conversation, before you type a single word, the model has already been given a set of instructions. These instructions can range from a few lines ("be helpful, harmless, and honest") to thousands of tokens of detailed guidance. They establish the model's persona, its constraints, its knowledge of who you are and why you're talking to it, and its rules of engagement.
If you've ever used a ChatGPT-powered product that has a specific name ("Hi, I'm Aria from TechCorp!"), that personality exists in the system prompt. If you've noticed that ChatGPT refuses certain topics in some applications but not others, those refusals often come from system prompt instructions. If you're a developer building with the API, the system prompt is your primary tool for customizing the model's behavior without retraining it.
The system prompt is also one of the reasons why context windows fill up faster than you'd expect. A detailed system prompt might consume 500 to 2,000 tokens — tokens that are unavailable for conversation history. In applications with very elaborate system prompts, users might find conversations becoming incoherent earlier than the nominal context window size would suggest.
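The budget arithmetic behind that effect is straightforward. A sketch, assuming (as many applications do) that some tokens are also reserved for the model's reply:

```python
def history_budget(context_window: int, system_prompt_tokens: int,
                   reserved_for_reply: int = 1_000) -> int:
    """Tokens left for conversation history once the system prompt
    and space reserved for the model's response are subtracted."""
    return context_window - system_prompt_tokens - reserved_for_reply

# A 4,096-token window with a 2,000-token system prompt leaves
# little room for actual conversation:
print(history_budget(4_096, 2_000))  # → 1096
print(history_budget(4_096, 200))    # → 2896
```

With only ~1,100 tokens of history surviving each turn, truncation starts almost immediately — the conversation degrades long before the nominal 4,096-token window would suggest.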
Here's a lesson that production builders have learned the hard way: if you're building something on top of a language model and finding that conversations degrade faster than expected, audit your system prompt first. Bloated system prompts are among the most common hidden culprits behind context window problems in real applications.
How Longer Context Windows Changed the Game
Early transformer models had context windows of 512 tokens. GPT-2 pushed this to 1,024. GPT-3 went to 2,048, then 4,096. GPT-4 launched with 8,192 tokens, and some versions could handle 32,768. By the time of GPT-4 Turbo, OpenAI had pushed this to 128,000. Anthropic's Claude 3 models offer up to 200,000 tokens.
These numbers aren't just bragging rights. Each jump in context size genuinely unlocked new capabilities:
- 4K tokens: Can handle most short conversations and brief documents
- 8K tokens: Can analyze a typical academic paper or short story in one pass
- 32K tokens: Can process an entire corporate report, codebase section, or book chapter
- 128K tokens: Can hold an entire novel, a lengthy legal document, or dozens of research papers simultaneously
The jump from 4K to 128K is particularly transformative for document-heavy use cases. A lawyer who wants to ask questions about a 200-page contract, a researcher who wants to synthesize a stack of papers, a developer who wants the model to understand an entire codebase — none of these use cases were practically possible with early context sizes. They become feasible with 128K tokens.
But the computational cost equation remains unforgiving. Running a model with 128K tokens in context costs vastly more per query than running the same model with 4K tokens. This is why these larger context tiers are priced higher in the API, and why getting the most from a large context window still requires thoughtful management of what you put in it.
Practical Implications: Working With the Context Window
Understanding the context window changes how you should interact with ChatGPT in practice. A few patterns that follow directly from the mechanics:
Front-load the important stuff. If there are constraints, personas, or information that must persist through the whole conversation, establish them clearly at the beginning — and consider reminding the model of them periodically in very long conversations. Don't assume that something you said twenty exchanges ago is still visible to the model.
Summarize and restart for very long projects. If you're working on a long document or complex problem and you feel the conversation getting unwieldy, it's often better to start a new conversation and provide a summary of what you've established so far. This gives you a fresh context with the key information cleanly loaded, rather than a degraded context filled with early exchanges that may already be getting truncated.
Think of your prompt as screen real estate. Every token you put in the context is a token that can't be used for something else. Long, redundant preambles, unnecessary repetition, or verbose explanations of things the model already knows are all consuming context space you might need later.
Be explicit when referring back. If you want the model to use something from earlier in the conversation, say so directly. "Earlier, I mentioned my deadline is Friday — with that in mind, what should I prioritize?" This isn't just good communication style; it's insurance against the possibility that the relevant context has been truncated.
Understand what "memory" features actually are. If you're relying on ChatGPT's memory feature, know that it's a retrieval system, not genuine recall. For information that's truly critical — your project requirements, your style guidelines, your key constraints — consider explicitly reinjecting it into each conversation rather than hoping the retrieval system surfaces it.
The Context Window as Design Philosophy
There's something deeper worth reflecting on here. The context window isn't just a technical limitation — it's a fundamental design choice that reflects a particular theory of what language understanding is.
The attention mechanism treats language understanding as a function of context. Meaning doesn't exist in tokens in isolation; it emerges from the relationships between tokens, all of them present simultaneously. This is, interestingly, not entirely unlike how human comprehension works. You understand the word "bank" in "river bank" differently than in "savings bank" because the surrounding context disambiguates it.
The transformer architecture just formalizes this intuition and scales it to enormous sequences. Everything the model knows about what you're asking is encoded in the relationships between the tokens in front of it, right now. There is no separate "understanding" happening in the background, no persistent representation of you as a person, no model of your goals that persists between sessions.
This is why the pattern-completion framing from the course's central thesis is so apt: ChatGPT isn't consulting a database or retrieving stored knowledge about you. It's completing a pattern, and that pattern is the context window — everything it can see, all at once, right now. Understanding this doesn't diminish what the model does. If anything, it makes the performance more remarkable. Generating coherent, helpful, contextually aware responses from nothing more than "the text so far" — and doing so across context windows of 128,000 tokens — is a genuinely extraordinary engineering achievement.
But it also makes the limits predictable. When something goes wrong — when ChatGPT forgets what you said, contradicts itself, or seems to lose the thread — the context window is almost always part of the explanation. Once you understand the mechanism, the failures make sense. And that's actually the most useful thing you can know.