How ChatGPT Works: The Definitive Plain-English Guide to Large Language Models
Section 10 of 13

GPT Models and How They Scale in Size and Ability

Scaling Up: The GPT Family and the Laws of AI Progress

We've now seen how the training pipeline—pretraining, supervised fine-tuning, and reinforcement learning from human feedback—shapes raw statistical models into genuinely helpful systems. But understanding how we train models is only half the story. The other half is understanding what happens when we train bigger models, with more data and more compute. And this is where things get truly surprising.

There's a thought experiment that illuminates everything about the last few years of AI progress. Imagine you're a researcher in 2017. You understand transformers. You understand pretraining. You have a reasonable language model that can finish sentences. Someone asks: "What happens if you make it ten times bigger? A hundred times bigger? A thousand times bigger?" Honestly? You'd probably shrug. Maybe it gets a little better. Maybe you hit diminishing returns. Maybe your compute cluster melts and your career goes with it.

What actually happened was one of the most surprising empirical discoveries in machine learning history: making models bigger, with more data and more compute, didn't just make them incrementally better — it made them qualitatively different. Capabilities that simply didn't exist at smaller scales appeared once you crossed certain thresholds. New behaviors emerged. New things became possible. This discovery, and what it tells us about where AI is headed, is what we're digging into here.

GPT-2: The Moment Things Got Weird

The story starts with GPT-2. OpenAI released it in February 2019, and here's the thing that made people nervous: it was too good at generating coherent text.

They'd give the model the opening of a news story, and it would continue with alarming coherence. The concern wasn't theoretical — it was practical. "Malicious use" was the official framing: specifically, that the model could generate fake news, synthetic disinformation, convincing impersonations. OpenAI's announcement cited concerns about malicious applications; press coverage distilled this into "too dangerous to release." Either way, the company withheld the full model and instead staged a gradual rollout, releasing progressively larger versions over several months.

The AI research community had mixed reactions. Some agreed the concerns were legitimate. Others argued it was theatrical — that the model wasn't nearly as dangerous as claimed, and that withholding it mostly hurt legitimate researchers. When the full model finally shipped in November 2019, the feared disinformation apocalypse didn't materialize. By then, the conversation had moved on to what was coming next: GPT-3.

But GPT-2 left something important behind: a conceptual shift. Previous models could complete sentences plausibly. GPT-2 could complete paragraphs — not just grammatically correct, but topically coherent, maintaining context and narrative across hundreds of tokens. Something fundamental had changed, not just improved.

GPT-3: When Everything Got Strange

GPT-3, published in May 2020, was a genuine shock. Not because of what it was, but because of what it could do.

The scale jump was enormous. 175 billion parameters, trained on roughly 300 billion tokens drawn from a filtered version of Common Crawl (some 45 terabytes of raw text before filtering), plus books and Wikipedia. The compute required was staggering — millions of dollars in cloud costs, placing it entirely beyond the reach of individual researchers. But honestly, the scale wasn't what made people sit up and pay attention. It was what emerged from it.

Few-Shot Learning: The Trick Nobody Built In

Before GPT-3, if you wanted a language model to do something specific — translate sentences, answer questions, classify sentiment — you had to fine-tune it. You'd collect labeled examples, train the model further on them, and hope you'd created something useful.

GPT-3 introduced something qualitatively different: in-context learning, or few-shot prompting. You could give GPT-3 a handful of examples of a task directly in the prompt, and it would figure out the pattern and apply it — without any gradient updates, without any fine-tuning, without any additional training whatsoever.

Show it three examples of English-to-French translation, then give it an English sentence, and it translates. Show it two sentiment classification examples, and it classifies new inputs. This ability — to learn from context rather than from training — was never explicitly programmed. It just... emerged.
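The mechanics are easy to see if you write out what the model actually receives. Here is a minimal sketch of constructing a few-shot prompt; the task format and example phrases are invented for illustration, and no model is called — this only builds the text that would be sent:

```python
# Few-shot "learning" is just text placed in the prompt.
# No gradients, no fine-tuning: the model's weights never change.

def build_few_shot_prompt(examples, query):
    """Format (input, output) example pairs, then pose a new query
    in the same pattern so the model can infer and continue it."""
    lines = []
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model completes from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("Hello.", "Bonjour."),
     ("Thank you.", "Merci."),
     ("Good night.", "Bonne nuit.")],
    "See you tomorrow.",
)
print(prompt)
```

Everything the model "learns" about the task lives in that string; delete the examples and the capability disappears with them.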

graph TD
    A[GPT-1: 117M params<br/>BookCorpus] --> B[GPT-2: 1.5B params<br/>WebText]
    B --> C[GPT-3: 175B params<br/>Common Crawl + Books]
    C --> D[InstructGPT: 175B<br/>+ RLHF alignment]
    D --> E[ChatGPT<br/>Deployed Product]
    E --> F[GPT-4: Multimodal<br/>Details undisclosed]
    
    style A fill:#d4e8ff
    style B fill:#b3d4ff
    style C fill:#80b3ff
    style D fill:#4d8fff
    style E fill:#1a6bff
    style F fill:#003dbf

The paper describing GPT-3 — "Language Models are Few-Shot Learners" — became one of the most cited AI papers ever written. The central demonstration was almost too clean: a single, general-purpose language model, given only a few examples in its prompt, could match or approach specialized fine-tuned models across dozens of different tasks.

This was theoretically unexpected. The model had been trained purely to predict the next token. Nothing in that training objective explicitly rewarded learning to do tasks from examples. Yet here it was, doing exactly that. The capability had emerged from the training process in a way nobody fully understands.

Scaling Laws: The Empirical Discovery That Changed Everything

While GPT-3 was being trained, researchers at OpenAI were wrestling with a more fundamental question: how does performance scale with size? Is there a predictable relationship, or is everything just chaos?

In early 2020, Jared Kaplan and colleagues published "Scaling Laws for Neural Language Models." What they found was remarkably clean. Model performance — measured as cross-entropy loss on held-out text — improved as a smooth power law function of three things:

  • Number of parameters (model size)
  • Dataset size (how much training data)
  • Compute budget (total computation used in training)

These relationships held across many orders of magnitude. A model ten times larger wasn't just slightly better — the improvement was predictable, consistent, and substantial. Loss curves followed power laws of the form L ∝ N^(-α), where N is the number of parameters and α is a consistent exponent.

What does this mean in practice? It means AI progress, at least in this regime, has something unusually rare in science and engineering: predictability. You could, in principle, look at how a small model trains and project forward to estimate how a model 100x larger would perform. You could decide in advance how to allocate a compute budget between bigger models and more data.
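The "project forward" idea can be made concrete with a toy fit of the power law L(N) = a·N^(−α) through two hypothetical small-model measurements. The loss values, model sizes, and resulting exponent below are invented for illustration; real scaling-law fits use many training runs, not two points:

```python
import math

# Fit L(N) = a * N**(-alpha) through two (params, loss) measurements,
# then extrapolate to a much larger model. All numbers are made up.

def fit_power_law(n1, loss1, n2, loss2):
    """Solve for (a, alpha) given two points on the loss curve."""
    alpha = math.log(loss1 / loss2) / math.log(n2 / n1)
    a = loss1 * n1 ** alpha
    return a, alpha

# Two hypothetical small-model runs: 100M params at loss 4.0,
# 1B params at loss 3.5.
a, alpha = fit_power_law(1e8, 4.0, 1e9, 3.5)

# Project the loss of a model 100x larger than the bigger run:
projected = a * (1e11) ** (-alpha)
print(f"alpha = {alpha:.3f}, projected loss at 1e11 params = {projected:.2f}")
```

This is the core of how labs budget frontier runs: measure the curve cheaply at small scale, then trust the power law to hold at large scale.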

The Chinchilla paper (published by DeepMind in 2022) added a crucial refinement: the original scaling laws had been somewhat size-biased. Kaplan et al.'s analysis suggested you should scale model size faster than data. But DeepMind's analysis of optimal compute allocation suggested something different — for a given compute budget, you should scale model size and data in roughly equal proportion. Train a smaller model on more data.

The result was Chinchilla: a 70-billion parameter model trained on 1.4 trillion tokens that outperformed the much larger Gopher (280B parameters) on nearly every benchmark. GPT-3, it turned out, was substantially undertrained relative to its size. Not bad, just not compute-optimal.

This matters more than it might seem. It means GPT-3's 175 billion parameters weren't some magic number. It was an architecture-plus-training-data combination that could have been more efficient. The field had been racing to build the biggest possible models when it should have been more carefully calibrating the ratio of model size to training data.
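The Chinchilla finding reduces to a back-of-envelope calculator. Assuming the common approximation that training cost is C ≈ 6·N·D FLOPs and the paper's rough compute-optimal ratio of about 20 training tokens per parameter (both are simplifications of the full analysis), the optimal split for any budget follows directly:

```python
import math

# Back-of-envelope compute-optimal allocation, using two rules of thumb
# from the Chinchilla analysis: training cost C = 6*N*D FLOPs, and a
# compute-optimal ratio of roughly 20 tokens per parameter.

TOKENS_PER_PARAM = 20  # approximate Chinchilla-optimal ratio

def compute_optimal(flops_budget):
    """Split a FLOPs budget into model size N and token count D,
    solving C = 6 * N * D with the constraint D = 20 * N."""
    n = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    d = TOKENS_PER_PARAM * n
    return n, d

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens.
c = 6 * 70e9 * 1.4e12
n, d = compute_optimal(c)
print(f"N = {n/1e9:.0f}B params, D = {d/1e12:.1f}T tokens")
```

Run the same budget through the pre-Chinchilla intuition (scale N faster than D) and you get a bigger, worse model — which is roughly what GPT-3 was.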

Emergent Abilities: The Phenomenon That Surprised Everyone

Scaling laws describe a smooth, predictable relationship between scale and loss. But loss on next-token prediction doesn't directly measure what most people care about: can the model actually do useful things?

When researchers started charting the relationship between model scale and performance on specific tasks, they found something that scaling laws alone couldn't explain: emergence.

Some capabilities showed smooth, gradual improvement with scale — a larger model is somewhat better at grammar, somewhat better at factual recall, somewhat better at translation. But other capabilities showed a strikingly different pattern: essentially zero performance at smaller scales, then rapid improvement past a certain threshold, as if a switch had been flipped.

A 2022 paper by Wei et al. catalogued this across dozens of tasks. Multi-step arithmetic. Answering questions about physics. Following complex logical chains. Code generation. All showed the same characteristic profile: flat, flat, flat — then suddenly, a jump.

The term "emergent abilities" borrows from complex systems science, where emergence describes phenomena that arise from the interaction of simpler components and can't be predicted from examining those components alone. The analogy is apt, though it's worth being honest about how philosophically slippery the term can be.

graph LR
    A[Small Model<br/>1B params] -->|Task performance| B{Capability<br/>Threshold}
    B -->|Below threshold| C[Near-zero<br/>performance]
    B -->|Above threshold| D[Sudden capability<br/>appears]
    E[Large Model<br/>100B+ params] --> D
    
    style C fill:#ffcccc
    style D fill:#ccffcc

Here's a concrete example. Three-digit multiplication — multiply 347 × 829 — is something a human can do with effort. Smaller language models essentially fail at this. They might produce plausible-looking numbers, but they're almost always wrong. Once models pass roughly 50-100 billion parameters with sufficient training, accuracy begins rising sharply. The model hasn't been directly trained to multiply numbers. It's learned to manipulate symbolic representations in ways that happen to implement multiplication as a special case of pattern completion.

This is simultaneously impressive and philosophically bewildering. The capability wasn't designed in. It emerged. What other capabilities might emerge at the next order of magnitude?

InstructGPT: The Missing Piece

By mid-2020, GPT-3 was available via API. User feedback was instructive: the model was impressively capable but genuinely difficult to work with.

The core problem was alignment between what users wanted and what the model did. GPT-3 was trained to continue text, not to follow instructions. If you wanted it to answer a question, you often needed to phrase it as if it were the beginning of a document that would contain the answer. Asking "What is the capital of France?" might confuse it; framing it as "Question: What is the capital of France? Answer:" worked much better.

GPT-3 also produced outputs that were sometimes biased, harmful, or simply strange — because the pretraining data contained plenty of text that was biased, harmful, and strange, and the model had no particular reason to prefer the helpful output.

The solution was InstructGPT, published in early 2022. We've covered the mechanics of RLHF in an earlier section, but what matters here is what it accomplished at scale: by combining GPT-3's vast learned representations with RLHF fine-tuning on human preferences, OpenAI created a model that was substantially better at following instructions, less likely to produce harmful outputs, and genuinely more useful — despite being smaller than the base GPT-3 model.

Sit with that for a moment. The InstructGPT paper's central finding was that a 1.3 billion parameter InstructGPT model was preferred by human raters over a 175 billion parameter GPT-3 model on most tasks. Not because 1.3B is inherently better than 175B, but because the 175B model's capabilities were being wasted — its raw power was enormous, but it was pointed in the wrong direction.

This has become one of the most practically important insights in the field: raw model size is not the same thing as deployed usefulness. A smaller, well-aligned model can outperform a larger, poorly-aligned one on almost every task a user actually wants to solve.

GPT-4: Power, Multimodality, and Deliberate Mystery

GPT-4 was released in March 2023, and immediately distinguished itself in two ways: its capabilities, and OpenAI's unusual silence about its architecture.

On capabilities: GPT-4 scored in the 90th percentile on the bar exam, compared to GPT-3.5's 10th percentile. It scored 1410 on the SAT (compared to roughly 1260 for GPT-3.5). It performed at a passing level on medical licensing exam questions. Across hundreds of academic and professional benchmarks, it substantially outperformed GPT-3.5. More importantly, it was markedly better at complex, multi-step reasoning — the kind where a small error early in a chain of logic cascades into a completely wrong answer.

GPT-4 also introduced multimodality: the ability to accept images as inputs, not just text. This wasn't merely a convenience. It represented a meaningful architectural expansion, allowing the model to reason about visual information using the same transformer machinery that handles text.

On architecture: OpenAI published a technical report for GPT-4, but what it omits is notable. Citing competitive concerns, they declined to disclose the model's parameter count, architecture details, training data, or hardware. This was a departure from the more open approach of GPT-1 through GPT-3, and it generated substantial criticism from the research community.

What we can say with reasonable confidence: GPT-4 is larger than GPT-3 (though by how much is unclear), was trained on more data, and benefited from substantially more sophisticated RLHF fine-tuning. There is credible speculation, including claims by some researchers, that GPT-4 uses a "mixture of experts" architecture — where different subnetworks specialize in different types of inputs — rather than a single monolithic transformer. OpenAI has neither confirmed nor denied this.

What Parameters Actually Count: The Fine-Tuning Revolution

The success of InstructGPT pointed toward something broader that has reshaped how the field thinks about model development. Parameters matter — but which parameters, trained how, matters more than the raw count.

Consider what happened after GPT-3 and LLaMA (Meta's language model, released to researchers in early 2023 with weights ranging from 7B to 65B parameters; the weights leaked publicly within days). Within weeks, researchers were demonstrating that fine-tuned versions of the smaller models could match GPT-3 on many tasks. Stanford's Alpaca project fine-tuned the 7B-parameter LLaMA model for under $600 in compute costs and produced something that behaved surprisingly like an early version of ChatGPT in side-by-side comparisons.

Why? Because GPT-3, despite its 175 billion parameters, was a base model — optimized for next-token prediction, not for following instructions helpfully. A 7B model with good instruction-following fine-tuning could outperform it on the tasks users actually wanted.

This created a new vocabulary in the field:

  • Base models: Large pretrained models, typically proprietary and expensive to produce
  • Fine-tuned models: Adapted versions of base models, trained on specific tasks or in specific styles
  • PEFT (Parameter-Efficient Fine-Tuning): Techniques like LoRA that update only a small fraction of parameters during fine-tuning, making adaptation cheap

The practical implication is significant: the "size of the model" that a user interacts with isn't necessarily the size that matters. A well-designed 7B model, instruction-tuned on high-quality data, might outperform a 70B base model on tasks involving following complex instructions, maintaining persona, or avoiding harmful outputs. This democratization of capable fine-tuned models has been one of the defining dynamics of the open-source AI ecosystem since 2023.
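To see why parameter-efficient methods like LoRA are so cheap, it helps to count parameters. LoRA leaves the original d_out × d_in weight matrix W frozen and learns two small low-rank factors B (d_out × r) and A (r × d_in), using W + B·A at inference. A sketch with illustrative dimensions (the 4096-wide projection and rank 8 are assumptions for the example, not any specific model's values):

```python
# Count trainable parameters for a LoRA adapter versus full fine-tuning
# of one weight matrix. The frozen matrix W stays untouched; only the
# low-rank factors B (d_out x r) and A (r x d_in) are trained.

def lora_param_fraction(d_out, d_in, rank):
    full = d_out * d_in            # params updated by full fine-tuning
    lora = rank * (d_out + d_in)   # params in the B and A factors
    return lora / full

# One hypothetical 4096 x 4096 attention projection, rank-8 adapter:
frac = lora_param_fraction(4096, 4096, 8)
print(f"LoRA trains {frac:.2%} of this matrix's parameters")
```

Training well under one percent of the weights is what puts fine-tuning a capable model within reach of a single consumer GPU.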

The Economics of Frontier Scale: Why Only a Few Players Can Play

Training GPT-3 cost an estimated $4-12 million in compute alone. GPT-4 likely cost tens to hundreds of millions. These numbers are probably conservative for whatever comes next. The economics of frontier model training have created what is, functionally, a barrier to entry that only a handful of organizations can clear.

Consider what's required:

Compute: Training a frontier model requires thousands of specialized AI accelerators (NVIDIA H100s run at roughly $30,000-40,000 per chip). Running thousands of these chips for months requires data centers built specifically to handle the power density and cooling.

Data: The highest-quality training data is increasingly scarce, contested, and expensive. Common Crawl is free but noisy; high-quality licensed content requires negotiations with publishers, news organizations, and platform companies.

Engineering talent: The specialized engineers who can optimize training runs at this scale are among the most highly compensated in technology.

Organizational infrastructure: Safely deploying a frontier model requires trust and safety teams, red-teaming infrastructure, API management at scale, and ongoing monitoring.
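The compute line item alone can be sketched with a hedged back-of-envelope. Every constant below is an assumption (a V100-class accelerator of the GPT-3 era at ~125 TFLOP/s peak, ~30% sustained utilization, ~$2 per GPU-hour), so treat the output as an order-of-magnitude illustration, not an actual figure from any training run:

```python
# Rough training-cost estimate: FLOPs needed (C = 6*N*D), divided by
# sustained accelerator throughput, priced per GPU-hour. All constants
# are assumptions chosen to be plausible for the GPT-3 era.

PEAK_FLOPS = 125e12          # assumed accelerator peak, FLOP/s
UTILIZATION = 0.30           # assumed sustained fraction of peak
DOLLARS_PER_GPU_HOUR = 2.0   # assumed cloud price

def training_cost(params, tokens):
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (PEAK_FLOPS * UTILIZATION)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours, gpu_hours * DOLLARS_PER_GPU_HOUR

# GPT-3-scale numbers: 175B parameters, ~300B training tokens.
hours, dollars = training_cost(175e9, 300e9)
print(f"~{hours:,.0f} GPU-hours, ~${dollars/1e6:.1f}M in compute")
```

Under these assumptions the result lands inside the $4-12 million range cited above; swap in frontier-scale parameter and token counts and the estimate climbs into nine figures quickly.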

The result is that, as of 2024, the organizations capable of training and deploying frontier models can be counted on two hands: OpenAI, Google DeepMind, Anthropic, Meta AI, Mistral, xAI, Cohere, and a few others. Even among these, there's a significant split between those training at the very frontier (OpenAI, Google, Anthropic) and those working at somewhat smaller scales.

This concentration has significant implications — for competition, for AI safety oversight, and for the direction of research. When the cost of entry into frontier AI is effectively nine figures, the diversity of approaches, values, and organizational cultures involved in building the most capable systems is sharply limited.

Will Scaling Keep Working?

This is the question that divides the field more sharply than almost any other.

The optimist case, often associated with OpenAI and sometimes called the "scaling hypothesis," holds that the improvements from GPT-1 to GPT-4 reflect a fundamental truth about the relationship between compute and capability — and that continuing to scale will continue to produce qualitatively better systems. The fact that emergent capabilities appeared at each major jump is taken as evidence that we haven't reached a ceiling.

The skeptic case notes several potential limitations:

Data walls: Training on the entire internet only works until you've used the entire internet. As models train on more data, the "new" data available for additional training shrinks. Synthetic data generation may help, but there are reasons to doubt it can fully substitute for the diversity of human-generated content.

Diminishing returns: The power laws in scaling research show smooth improvement, but they've primarily been measured in regimes well below current frontier scale. It's possible — perhaps likely — that the exponents change at higher scales, leading to slower improvements than the historical trend suggests.

The reasoning gap: GPT-4 is dramatically better than GPT-3 at some tasks, but there remain classes of problems — multi-step deductive reasoning, novel mathematical proofs, tasks requiring genuine world models — where even the largest models fail in ways that seem fundamentally different from what more scale would fix.

Efficiency gains vs. raw scale: Models like Chinchilla demonstrated that better-trained smaller models can outperform carelessly-trained larger ones. Many researchers now believe the near-term gains will come more from training efficiency, architecture improvements, and fine-tuning quality than from raw parameter count.

The honest answer is that no one knows. The history of AI is littered with researchers predicting a ceiling, getting proved wrong, predicting another ceiling, getting proved wrong again. But "previous predictions of a ceiling were wrong" isn't proof that no ceiling exists. We're running a live experiment.

The Emergent Capabilities Debate: A Philosophical Aside

One important qualifier on "emergent abilities" deserves attention. A 2023 paper by Schaeffer et al. argued that emergence is partly an artifact of measurement. When capabilities are measured with non-linear metrics (accuracy, which jumps from "wrong" to "right"), they appear to emerge suddenly. When measured with continuous metrics (log-probability of the correct answer, which changes smoothly), the same capabilities show gradual improvement. The "sudden appearance" might be a story we're telling about a smooth underlying reality.

This doesn't fully deflate the emergence concept — there are still cases where the capability is genuinely discrete (either the model can do three-digit multiplication reliably or it can't), and where the jump is real regardless of the metric. But it's a good reminder that our conceptual categories for understanding AI development are themselves still being worked out.
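The measurement argument is easy to reproduce numerically. Suppose per-token accuracy p improves smoothly with scale, and a task counts as solved only when all k tokens of the answer are correct, so exact-match accuracy is p^k. The mapping from scale to p below is invented purely for illustration:

```python
# A smooth underlying metric (per-token accuracy) can look like a
# sudden emergent jump when scored with an all-or-nothing metric.
# The scale-to-accuracy curve here is hand-invented for illustration.

K = 30  # answer length in tokens; exact match needs all 30 right

rows = []
for log_params in range(7, 13):  # models from 1e7 to 1e12 parameters
    # Smooth, gradual improvement in per-token accuracy with scale:
    p = 1 - 0.5 * 10 ** (-0.25 * (log_params - 7))
    rows.append((log_params, p, p ** K))
    print(f"1e{log_params} params: per-token p = {p:.3f}, "
          f"exact match = {p ** K:.4f}")
```

The per-token metric climbs steadily across every scale, while exact match sits near zero for several orders of magnitude and then rises sharply: the "flat, flat, jump" profile, generated from a perfectly smooth underlying improvement.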

From Parameters to Understanding: The Central Lesson

The story of GPT-1 through GPT-4 is, at one level, a story about numbers getting bigger. 117 million, 1.5 billion, 175 billion — and then the undisclosed but presumably enormous number behind GPT-4.

But the more important story is about what those numbers mean and what they don't. Large language models are extraordinarily sophisticated pattern-completion engines, and scaling has made them radically more sophisticated. But scale alone doesn't explain ChatGPT. The InstructGPT work demonstrated that raw capability and deployed usefulness are different things — and that the fine-tuning and alignment work that turns a base model into a product is at least as important as the pretraining scale that enables it.

The GPT family's trajectory also illustrates a broader point: none of these systems were designed to have the capabilities they demonstrated. Few-shot learning wasn't engineered into GPT-3 — it emerged. The ability to write code, to understand analogical reasoning, to maintain conversational context across long exchanges — these were consequences of training a massive pattern-completion engine on an enormous corpus of human language. The capabilities follow from the training objective, once you apply it at sufficient scale.

What's next? That's a question the entire field is trying to answer in real time, with hundreds of billions of dollars of compute investment and a genuine uncertainty about what the next threshold will reveal.