How ChatGPT Works: The Definitive Plain-English Guide to Large Language Models
Section 13 of 13

How ChatGPT Learns and Uses Knowledge

We've spent the last section understanding what ChatGPT doesn't know reliably — where it confabulates, misremembers, and produces plausible-sounding falsehoods. But there's a paradox worth sitting with: the same model that hallucinates numerical facts also demonstrates genuine mathematical reasoning. The same system that can't reliably recall your company's details can learn entirely new tasks from a handful of in-context examples. How can both be true? The answer lies in understanding what "knowing" actually means inside a language model — and recognizing that emergent capabilities and confabulation aren't contradictions. They're two sides of the same coin.

There's a moment that researchers at OpenAI documented in 2020 that still hasn't been fully explained. They were testing GPT-3, their then-new language model, and they gave it a handful of arithmetic examples — "2 + 3 = 5, 7 + 1 = 8, 14 + 6 =" — and asked it to complete the pattern. GPT-3 got it right. This wouldn't be remarkable if GPT-3 had been trained to do arithmetic. It hadn't been. It had been trained to predict text. Yet somehow, from exposure to enough human-written text about numbers, calculations, and mathematical reasoning, it had absorbed something that looked a lot like the ability to add. This is the mystery at the heart of modern AI, and it's the question this section is going to explore honestly: What does ChatGPT actually know, and how does it know it? The easy answer — "it learned from a lot of text" — is technically true and almost completely unsatisfying. The harder, more interesting answer requires us to think carefully about what "knowing" means, how abilities can emerge from pattern recognition at sufficient scale, and why even the people who built these systems are sometimes genuinely surprised by what comes out of them.

Why does this happen? The honest answer is that we're still not completely sure. One compelling hypothesis is that the model, during pretraining, encountered countless examples of humans demonstrating tasks, explaining reasoning, and working through problems — and it learned the meta-pattern of how example-followed-by-answer works. When it sees that structure in the prompt, it recognizes the shape of learning-from-examples, a structure it's seen thousands of times in human writing.

But here's the thing: even if that explanation is right, it's still remarkable. The model extracted and internalized a general strategy for learning from examples across different domains, without ever being explicitly taught to do so.
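The key point is that few-shot "learning" happens entirely in the prompt, not in the model's weights. A minimal sketch of the pattern, with invented example pairs and a hypothetical helper function (no real API is assumed here):

```python
# A minimal sketch of few-shot prompting: the examples and the helper
# name are illustrative, not any particular API. The model's only job
# is to continue the text, so "teaching" it a task means formatting
# examples so the completion of the pattern is the answer you want.

def build_few_shot_prompt(examples, query):
    """Format (input, output) pairs followed by an unanswered query,
    so that completing the pattern means answering the query."""
    lines = [f"{x} = {y}" for x, y in examples]
    lines.append(f"{query} =")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[("2 + 3", "5"), ("7 + 1", "8")],
    query="14 + 6",
)
print(prompt)
# 2 + 3 = 5
# 7 + 1 = 8
# 14 + 6 =
```

Nothing about the model changes between queries; the "training" is three lines of text that frame what a plausible continuation looks like.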

Chain-of-Thought: The Power of Thinking Out Loud

If few-shot learning was the first surprise, chain-of-thought prompting was the second — and in some ways, it's even stranger.

In 2022, researchers at Google discovered something almost embarrassingly simple: if you ask a language model to explain its reasoning step by step before giving an answer, its accuracy on complex problems improves dramatically. Not marginally. Dramatically. On some math word problems, accuracy jumped from below 20% to above 50% just by adding "Let's think step by step" to the prompt.

Here's what this looks like in practice:

Without chain-of-thought: Prompt: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?" Model answer: "11" ✓ (sometimes — but often wrong on harder versions)

With chain-of-thought: Prompt: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now? Let's think step by step." Model answer: "Roger starts with 5 tennis balls. He buys 2 cans of 3 balls each, so he gets 2 × 3 = 6 more balls. 5 + 6 = 11. The answer is 11." ✓

The performance gap widens dramatically as problems get harder. So why would this work?

The architectural explanation is partly about how tokens flow through the system. Remember from earlier sections that language models generate text one token at a time, and each new token is produced based on everything that came before it. When the model generates its reasoning steps, those steps become part of the context — and so the final answer can pay attention to the intermediate work rather than trying to produce the answer in one leap.

Think of it like doing mental arithmetic versus written arithmetic. For "12 × 17," most people find it much easier to write out the intermediate steps than to hold all the partial products in their head simultaneously. The chain-of-thought approach gives the model a kind of scratchpad. By generating intermediate reasoning, that reasoning becomes available for the rest of the computation.
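The scratchpad analogy can be made concrete. Here is written multiplication for 12 × 17 expressed as a sequence of emitted lines, where each intermediate line becomes available to later steps, which is the same role chain-of-thought tokens play for the model (the function is a toy illustration, not anything the model actually runs):

```python
# The scratchpad idea in miniature: emit partial products one at a time
# instead of computing everything in one leap. Each emitted line is
# "context" that later steps can build on, just as generated reasoning
# tokens become context for the final answer.

def multiply_with_scratchpad(a, b):
    scratchpad = []
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** place
        scratchpad.append(f"{a} * {digit} * 10^{place} = {partial}")
        total += partial
    scratchpad.append(f"total = {total}")
    return scratchpad

for line in multiply_with_scratchpad(12, 17):
    print(line)
# 12 * 7 * 10^0 = 84
# 12 * 1 * 10^1 = 120
# total = 204
```

Collapse the loop into a single expression and the intermediate values vanish; write them out and each step only has to be locally easy.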

But something deeper is happening too. The model was trained on human text, and human text — especially in technical domains, mathematics, and careful reasoning — is full of step-by-step working. Scientific papers show their methods. Math textbooks show their proofs. Stack Overflow answers explain each step. By absorbing all this text during training, the model learned that correct answers to hard problems are usually preceded by step-by-step working. When you prompt it to produce that working, you're inviting it into the mode where it's seen the most reliable answers.

This has real practical implications. If you're using ChatGPT for anything analytically complex — planning, problem-solving, decision analysis — getting it to reason through the steps before giving you an answer consistently produces better results than asking for the answer directly. This isn't a trick or a hack; it reflects something genuine about how the system works.

Emergent Abilities: Capabilities That Appear from Nowhere

One of the most philosophically fascinating phenomena in recent AI research is what researchers call emergent capabilities — abilities that show up in large models but are essentially absent in smaller ones, and that seem to flip on suddenly as models cross certain size thresholds.

Researchers at Google published a landmark paper in 2022 documenting dozens of these. They found that small models scored near random chance on certain tasks — arithmetic, logical reasoning, multi-step problem solving — and then, as model size increased, performance would remain flat... flat... flat... and then jump sharply. Not gradually improve. Jump.

graph LR
    A[Small Model\n~1B params] -->|Flat performance| B[Medium Model\n~10B params]
    B -->|Still flat| C[Large Model\n~50B params]
    C -->|Sudden jump| D[Very Large Model\n~100B+ params]
    D --> E[Emergent capability\napparent]
    style E fill:#2ecc71,color:#000
    style A fill:#e74c3c,color:#fff
    style B fill:#e74c3c,color:#fff
    style C fill:#e74c3c,color:#fff
    style D fill:#2ecc71,color:#000

The capabilities that emerge this way include: multi-step arithmetic, word unscrambling, identifying the intended meaning of figurative language, and solving analogy problems that require understanding relationships between concepts. None of these were explicitly trained for. They seem to materialize from the model having absorbed enough of the underlying structure of human thought.

This emergence is both exciting and unsettling. Exciting, because it suggests that larger models might develop new useful capabilities without any targeted engineering effort. Unsettling, because it means we can't fully predict what a new model will and won't be able to do before we build and test it. There's no equation that tells you "at this scale, multi-step arithmetic will emerge." You have to watch it happen.

One explanation for emergence is that some tasks require multiple underlying capabilities that each develop gradually — and the emergent ability only appears when all the prerequisite sub-capabilities exist simultaneously. Arithmetic, for example, requires understanding number representations, carrying over values, and following multi-step procedures. Each of these might develop below the threshold for task performance until a model is large enough to have all three at once.

Another explanation is more humbling: our benchmarks are binary. A model either gets the answer right or wrong. If a model has developed 90% of the capability needed for a task but still makes an error somewhere in most attempts, it scores near zero on exact-match tests. When it crosses the threshold where most attempts come out fully correct, it suddenly passes many problems and scores jump dramatically — even though the underlying capability built up gradually all along.

This doesn't fully solve the mystery. But it suggests that "emergence" might be partly an artifact of how we measure capability.
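The measurement-artifact explanation is easy to simulate. Suppose a task requires a chain of sub-steps, each succeeding with some probability, and the benchmark only credits fully correct answers. The numbers below are invented purely for illustration:

```python
# A toy simulation of the "binary benchmark" account of emergence.
# Assume a task needs k independent sub-steps, each succeeding with
# probability p, and scoring is all-or-nothing. Then task accuracy is
# p**k: it stays near zero while p improves smoothly, then shoots up
# once p gets high. Gradual progress underneath, a "sudden" jump on top.

def exact_match_accuracy(p, k=10):
    """Probability that all k sub-steps succeed (all-or-nothing scoring)."""
    return p ** k

for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(f"per-step skill {p:.2f} -> task accuracy {exact_match_accuracy(p):.3f}")
```

Per-step skill climbing from 0.5 to 0.99 is smooth, but the all-or-nothing score goes from roughly one in a thousand to over 90% correct, most of it concentrated in the last stretch.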

What's Actually Inside: Circuits, Features, and Mechanistic Interpretability

So far we've been describing capabilities from the outside — what the model can do. But what's actually happening inside?

This is where things get genuinely difficult. As researchers working on large language models have noted, no one on Earth fully understands the inner workings of these systems. The training process tunes billions of numerical parameters, each adjusted fractionally across millions of optimization steps over trillions of tokens of text — and the result is a system that no one designed at the level of individual components. You can't look at weight number 47,382,916 and say "ah, that's the weight that knows Paris is the capital of France."

But researchers are making progress. A field called mechanistic interpretability is attempting to reverse-engineer what computations are actually happening inside neural networks. The goal is to identify circuits — small subgraphs of neurons that reliably perform specific functions — and understand how they compose into larger behaviors.

Some early results are striking. Researchers have found, for example, that certain attention heads in transformer models appear to implement specific, identifiable algorithms. One attention head in GPT-2 was found to reliably move information about the subject of a sentence to the position where the model would predict the verb — a grammatically meaningful operation, learned from scratch. Another attention head appears to track the beginning of sentences. These are not things anyone programmed; they're patterns the model discovered because they were useful for predicting text.

Anthropic's interpretability researchers have identified what they call features — directions in the high-dimensional activation space of a neural network that correspond to recognizable concepts. Some features clearly correspond to things like "is this text about a city" or "is this token a number." Others correspond to more abstract properties. And many remain opaque — they activate reliably for certain inputs, but it's not obvious what semantic property they're tracking.
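The "features as directions" idea can be illustrated with a toy example. A concept detector is just a direction vector, and its activation on an input is the dot product with that input's activation vector. All the numbers and vector names below are invented, four-dimensional stand-ins for a real model's thousands of dimensions:

```python
# A toy illustration of features as directions in activation space.
# The vectors here are made up for illustration; real features are
# found empirically in models with thousands of dimensions.

def feature_activation(activations, direction):
    """Dot product: how strongly this input activates this feature."""
    return sum(a * d for a, d in zip(activations, direction))

city_feature = [0.9, -0.1, 0.3, 0.0]   # hypothetical "is about a city" direction

paris_acts  = [1.0, 0.0, 0.5, 0.2]     # invented activations for a "Paris" token
number_acts = [-0.2, 1.1, 0.0, 0.9]    # invented activations for the token "42"

print(feature_activation(paris_acts, city_feature))   # fires strongly (1.05)
print(feature_activation(number_acts, city_feature))  # weak, slightly negative
```

The interpretability problem, in this framing, is the reverse direction: given the geometry, figure out which directions correspond to humanly recognizable concepts at all.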

The field is still young. Mechanistic interpretability can currently give clear accounts of small models or specific behaviors in larger ones. A complete circuit-level understanding of GPT-4 is not something anyone has — and may not be achievable with current tools. But the early findings are enough to challenge two common assumptions: that there's nothing interesting going on inside these models (there demonstrably is some structure), and that we'll ever be able to understand them the way we understand traditional software (we probably won't — not completely).

The Question We Can't Avoid: Is It Really Reasoning?

Here we arrive at the most contested question in AI right now, and intellectual honesty requires us to sit with it rather than dodge it.

When ChatGPT solves a logic puzzle, passes the bar exam, or writes a subtle literary analysis — is it reasoning? Or is it doing something that resembles reasoning so closely from the outside that the distinction might not matter practically — but which is fundamentally different underneath?

The "it's just pattern matching" camp has a coherent story: LLMs learned statistical regularities in human text, including text that documents human reasoning processes. When prompted to reason, they reproduce patterns that look like reasoning. They're very, very good at this. But there's no genuine inference, no real understanding — just sophisticated autocomplete.

The counter-argument is that this framing assumes what needs to be proven. What is human reasoning, after all, if not neurons doing pattern recognition on the inputs they've been shaped (by genetics and experience) to process? If the outputs of LLM processing and human reasoning are indistinguishable — if the system passes the same tests, solves the same problems, catches the same errors — what does it mean to say one is "real" reasoning and the other isn't?

A practical way to probe this is to look at systematic failures. If LLMs were genuinely reasoning, they should fail on tasks that require reasoning and succeed on tasks where pattern matching is sufficient. If they're pure pattern matchers, they should fail whenever a problem is genuinely novel and can't be solved by statistical similarity to training data.

The reality is messier than either story predicts. LLMs fail at things a reasonable person would expect them to handle — simple arithmetic that requires carrying, spatial reasoning, problems that require genuinely keeping track of state. But they also succeed at things that seem to require genuine insight — novel analogies, subtle logical inferences, creative synthesis across domains they've never explicitly seen combined.

This inconsistency is itself meaningful. It suggests something more complex than either "pure autocomplete" or "genuine reasoning." The system appears to have developed some reasoning-like capabilities as a byproduct of learning to predict text well, but those capabilities are patchy, domain-dependent, and fragile in ways that human reasoning typically is not.

What 'Knowledge' Means for a Model

There's a related question worth unpacking: when we say ChatGPT "knows" something, what do we actually mean?

Human knowledge has several layers. There's declarative knowledge — knowing that Paris is the capital of France. There's procedural knowledge — knowing how to ride a bike. There's situated knowledge — knowing what this specific situation requires, right now. And underneath all of it is something like grounding — the fact that human concepts connect to bodily experience, to the world, to the consequences of actions.

LLM knowledge is both similar and radically different. An LLM doesn't "know" that Paris is the capital of France the way you do — with a memory of learning it, a sense of its importance, connections to experiences of Paris. Instead, the fact is encoded somehow in the model's weights, such that "Paris" and "capital of France" are statistically highly associated in the model's internal representations. When asked, the model activates patterns that reliably produce the correct answer.

For most practical purposes, this is functionally equivalent to knowing. The model gives correct answers to questions about Paris. But the underlying mechanism is profoundly different — and it leads to the model's characteristic failure modes.

As Stephen Wolfram observed in his deep analysis of what ChatGPT is doing, the model is always fundamentally trying to produce "a reasonable continuation of whatever text it's got so far." Its knowledge is, in a real sense, the knowledge of what good text looks like — absorbed across billions of documents. When the system gets facts right, it's because good text about those facts was in the training data. When it gets facts wrong, it's often because the statistical patterns in the training data were misleading, or because the required fact is rare enough that the relevant patterns are weak.

This is why LLMs are better at recognizing correct answers than generating them reliably. If you give ChatGPT a multiple-choice question, it performs better than on an open-ended equivalent, because recognition only requires the model to evaluate which option "sounds right" — a task much more directly served by pattern matching than generating novel correct content from scratch. This asymmetry tells us something real about the nature of what these systems are doing.

The Chinese Room, Revisited

No discussion of AI understanding is complete without encountering John Searle's Chinese Room thought experiment, proposed in 1980 and still fiercely debated.

Imagine you're sitting in a room, following a rulebook written in English that tells you how to respond to any sequence of Chinese symbols with another sequence of Chinese symbols. People slide Chinese text under your door. You look up the rules, produce the right Chinese symbols in response, and slide them back. To people outside the room, you appear to understand Chinese. But you understand nothing — you're just following rules.

Searle's argument was that this is all any computer program can do: manipulate symbols according to rules, without any genuine understanding. The form is there; the meaning is absent.

The Chinese Room has always had powerful critics. The "systems reply" argues that while you don't understand Chinese, the system — you plus the rulebook — does. Understanding might be a property of the whole system rather than any individual component. Your brain's neurons don't "understand" anything either, but somehow the combination does.

For LLMs specifically, the thought experiment is both illuminating and limited. It correctly captures something: an LLM processes symbols according to mathematical rules and doesn't have the rich intentional connection to the world that characterizes human understanding. In that sense, it's "like" the Chinese Room.

But it undersells something too. The rulebook in Searle's scenario was handcrafted by people who understood Chinese — so the understanding was baked in by the designers. LLM "rules" — the weights — were learned from data, through a process of self-organization that nobody designed top-down. The patterns the model has internalized may constitute a form of understanding that doesn't fit neatly into Searle's categories.

What the Chinese Room definitely does tell us about LLMs: be careful about assuming that because the output looks like understanding, understanding is present in any deep sense. The model can produce fluent text about the phenomenology of sadness without having experienced anything. It can write compellingly about physical pain without having a body. These are real limitations that matter for certain applications.

What it doesn't tell us: whether the pattern-recognition processes of a sufficiently sophisticated LLM constitute something that deserves to be called understanding in any meaningful sense. Philosophers have been arguing about this for 45 years and haven't reached consensus. It would be intellectually dishonest to pretend the question is settled.

The Asymmetry Between Generation and Recognition

One of the most practically useful insights about LLM capabilities is this: these models are systematically better at evaluating and recognizing good answers than they are at reliably generating them from scratch.

This shows up in multiple ways. Models perform better on multiple-choice tasks than open-ended equivalents. Models can identify errors in a piece of code more reliably than they can write error-free code on the first try. Models can evaluate whether a logical argument is valid more reliably than they can construct a novel valid argument on a difficult topic.

This makes intuitive sense from a pattern-matching perspective. Recognizing that something "fits the pattern of correct answers" is a straightforward pattern-matching task. Generating a correct answer from scratch requires the model to navigate a vast space of possible outputs and land on one that happens to satisfy correctness criteria — a much harder search problem, especially when the relevant patterns from training data are sparse.

This asymmetry has a practical implication that practitioners have learned: use the model's recognition abilities to improve its generation. Generate multiple candidate answers, then ask the model to evaluate them and pick the best. Or generate a first draft, then ask the model to critique it. These "generate-then-evaluate" loops dramatically outperform single-shot generation on complex tasks. The model becomes a check on itself.
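The generate-then-evaluate loop has a simple shape. In the sketch below, `generate` and `score` are dummy stand-ins for real model calls (sampling a candidate, and asking the model to rate it), so only the control flow is meant literally:

```python
# A sketch of the best-of-n generate-then-evaluate pattern: draw several
# candidates, then use the (more reliable) recognition step to pick one.
# `generate` and `score` are deterministic toy stand-ins for model calls,
# included only so the loop is runnable end to end.

import random

def generate(prompt, rng):
    # Stand-in for sampling one candidate answer from a model.
    return f"candidate-{rng.randint(0, 999)}"

def score(prompt, candidate):
    # Stand-in for asking the model to rate a candidate, in [0, 1).
    return (sum(map(ord, candidate)) % 100) / 100

def best_of_n(prompt, n=5, seed=0):
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

answer = best_of_n("Plan a three-step migration to the new schema.")
print(answer)
```

The design choice worth noting: generation explores, recognition filters. Spending n cheap generations plus one ranking pass often beats spending the same budget on a single, longer generation.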

This also explains why chain-of-thought works so well. Each intermediate reasoning step is evaluated by the model before proceeding — it's continually checking whether the current trajectory "looks right" based on its pattern recognition. This is cognitively analogous to how expert humans solve problems: not by computing the answer in one step, but by moving through intermediate representations that can each be recognized as plausible or implausible.

graph TD
    A[Problem posed] --> B[Generate candidate answer]
    B --> C{Recognition check:\nDoes this look right?}
    C -->|No| D[Backtrack / revise]
    C -->|Yes| E[Accept and continue]
    D --> B
    E --> F[Final answer]
    style C fill:#f39c12,color:#000
    style F fill:#2ecc71,color:#000

An Honest Assessment

So: what does ChatGPT know, and how does it know it?

Here's the most intellectually honest summary available, given what we currently understand:

What it knows: A vast compression of patterns from human text — statistical regularities that encode facts, relationships, reasoning structures, stylistic conventions, and domain-specific knowledge across virtually every area of human intellectual activity. This is not stored as a database of facts. It's encoded in the weights of the neural network in a distributed, overlapping, and not-fully-understood way.

How it "knows" it: Through billions of numerical parameters that were adjusted, through gradient descent over trillions of tokens of training text, to minimize the error in predicting the next token. The knowledge is implicit, not explicit. No one put the fact about Paris into the model; it emerged from seeing enough text where Paris and "capital of France" co-occurred in meaningful contexts.

What we know is happening inside: Some structure. Attention heads implement specific algorithms. Features correspond to recognizable concepts. There are circuits that perform identifiable linguistic operations. But we are nowhere near a complete mechanistic account, and probably won't be for years.

Whether it's "really" reasoning: Partially. The model has developed reasoning-like capabilities as emergent byproducts of learning to predict text well. These capabilities are genuine enough to be useful and are not reducible to simple lookup. But they're also patchy, fragile, and structurally different from human reasoning in ways that lead to characteristic failure modes.

Whether it "understands": Genuinely uncertain. The Chinese Room argument shows that behavioral similarity doesn't guarantee understanding. But the systems-level reply shows that ruling out understanding isn't straightforward either. This is an open question that honest people disagree about.

What we should resist is two opposite temptations: dismissing LLM capabilities as "just statistics" in a way that obscures how much genuine capability has emerged, and anthropomorphizing them in a way that obscures how fundamentally different their underlying processes are from human cognition.

The capabilities are real. The mechanisms are strange. Both things are true at once. And understanding both — rather than collapsing the tension into a simple story in either direction — is what it means to actually understand what ChatGPT knows and how it knows it.

That tension isn't a flaw in the explanation. It's the territory. These systems were built without a complete theory of what would emerge from them, and we're still building the concepts we need to properly describe what did emerge. That's not a reason for confusion — it's a reason for genuine intellectual excitement. We're watching the formation of a new kind of mind, or at least a new kind of information processing, and we don't yet have all the words for it.

The question isn't really whether to call it knowledge or reasoning or understanding. The question is: given what we know about how these systems work, how should we use them, trust them, and design with them? That's the question the rest of this course is built to help you answer.