Everything You Think You Know About p-values Is Wrong
You now understand what a p-value actually measures and where the 0.05 threshold came from. You know what it can tell you (whether your data would be surprising if nothing were happening) and what it cannot (whether the null hypothesis is true or whether the effect is real). You know, in other words, more than most working scientists.
But here's the problem: knowing what something means and knowing what people think it means are two very different things. The p-value is probably the most widely used and most widely misunderstood concept in all of science. A landmark 2016 paper by Greenland and colleagues catalogued 25 distinct misconceptions about p-values, confidence intervals, and statistical power — misconceptions that appear not just in introductory textbooks but in published scientific literature and, yes, in the textbooks themselves. The researchers publishing in peer-reviewed journals don't understand p-values the way you now do.
This section walks through why that gap exists. We're going to work through the most damaging misconceptions one by one, explain how they arose from an unresolved historical argument buried inside the framework itself, and show you what to actually look at when you encounter a p-value in the wild. By the end, you won't just know what a p-value isn't — you'll understand why the confusion exists and why it matters enormously for how you evaluate any scientific claim.
Misconception 1: "p < 0.05 means there's a 95% chance the result is real"
This is the most common misinterpretation, and it feels so right that even after you learn it's wrong, you'll catch yourself drifting back to it.
The reasoning seems bulletproof: p = 0.05 means 5% probability, so 1 - 0.05 = 95% probability that the result is true. Clean. Satisfying. Wrong.
Here's the trap: the p-value tells you the probability of your data given that the null hypothesis is true. What researchers want to know — what journalists report, what policy-makers act on — is the probability that the hypothesis is true given the data. These are not the same thing. Not even close.
This is not a technicality. It's a fundamental reversal of the logical direction, and it matters more than you might think.
Consider an analogy. You learn that 99% of people who are struck by lightning were outdoors at the time. So you can compute: "Probability of being outdoors given struck by lightning = 0.99." But does that tell you anything useful about whether you should worry while standing outside right now? Not really. The vast majority of people who are outdoors are never struck by lightning. The direction of the conditional probability is everything.
The p-value answers: "How likely is this data if there's no effect?" It does not answer: "How likely is it that there's a real effect?" To answer that second question — the one that actually matters — you need information the p-value doesn't contain. Specifically, you need some baseline probability that the effect is real in the first place. That's what Bayesian statistics (which we'll cover later) is built to handle.
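To see concretely why that baseline matters, here is a minimal sketch of Bayes' rule applied to the question researchers actually care about. The numbers (10% of tested hypotheses real, 80% power) are illustrative assumptions, not data from any study:

```python
def prob_real_given_significant(prior, power, alpha):
    """Bayes' rule: P(effect is real | p < alpha).

    prior: assumed baseline probability that a tested effect is real
    power: P(p < alpha | effect is real)
    alpha: P(p < alpha | no effect), i.e. the false-positive rate
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# If only 10% of tested hypotheses are real, a "significant" result
# is real only about 64% of the time -- nowhere near 95%.
print(f"{prob_real_given_significant(prior=0.10, power=0.80, alpha=0.05):.2f}")
```

Notice that the answer depends on the prior, which is exactly the ingredient the p-value does not contain.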
Warning: When a study reports p = 0.03 and a headline screams "Scientists prove X causes Y with 97% certainty," someone has made exactly this error. The p-value says nothing about certainty in the direction the headline implies.
Misconception 2: "The p-value measures the probability the null hypothesis is true"
Closely related to Misconception 1, but subtly different — worth untangling carefully.
A p-value of 0.04 does not mean there's a 4% chance the null hypothesis is true. It means: if the null hypothesis were true, there would be a 4% chance you'd see data this extreme or more extreme. The null hypothesis is true or it isn't. It doesn't have a probability in the frequentist framework that p-values inhabit. In that framework, probability refers to the long-run frequency of outcomes, not the truth of fixed hypotheses.
As Greenland and colleagues explain, this confusion is so pervasive it appears in published papers across medicine, psychology, and ecology. The authors note that correct interpretation "requires an attention to detail which seems to tax the patience of working scientists," and this led to "an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so."
Here's the irony: the interpretation researchers actually want — "what's the probability this hypothesis is true?" — is a perfectly coherent question. It's just not what frequentist statistics answer. You'd need Bayesian inference for that. The problem is that researchers were handed a tool that doesn't answer their real question, decided to interpret it as if it did anyway, and the practice calcified.
Misconception 3: "A non-significant result means no effect exists"
This one trips up everyone, including experienced researchers.
Imagine a drug trial finds p = 0.21. A lot of people — including a lot of researchers — would conclude: "There's no effect." Logically? That doesn't hold up. A non-significant result means the data were insufficient to reject the null hypothesis at your chosen threshold. That's not remotely the same as proving the null hypothesis is correct.
Think about the mechanics. A study might fail to reach significance because the effect is genuinely small. Or because the sample size was too small to detect it. Or because the measurement was noisy. Or just bad luck with sampling variation. A non-significant p-value can't distinguish between any of these scenarios. It's an ambiguous signal, not a clear answer.
This distinction matters enormously in real life. A pharmaceutical company runs a small trial of a drug, gets p = 0.15, and concludes "no effect." Does that mean the drug doesn't work? Maybe. Or maybe the trial was underpowered — a larger study would have found significance. We cannot tell from the p-value alone.
The correct phrasing after a null result is: "We found no statistically significant evidence of an effect." Notice the phrase. It's a statement about what the evidence shows, not what the world contains. It's a famously difficult distinction to maintain, especially when "the study found no effect" sounds so much cleaner than "the study found ambiguous evidence."
Remember: Absence of evidence is not evidence of absence — unless the study was well-powered enough that it would have caught an effect if one existed. Always check the sample size and power before interpreting null results.
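A rough power calculation makes the point vivid. This is a sketch using the normal approximation for a two-sided, two-sample test; the effect size (Cohen's d = 0.5, a moderate effect) and sample sizes are illustrative:

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def approx_power(effect_size_d, n_per_group):
    """Approximate power of a two-sided two-sample test at alpha = 0.05."""
    z_crit = 1.96  # critical value for alpha = 0.05, two-sided
    noncentrality = effect_size_d * math.sqrt(n_per_group / 2)
    return normal_cdf(noncentrality - z_crit)

# A genuinely moderate effect (d = 0.5) in trials of different sizes
for n in (20, 64, 200):
    print(f"n per group = {n:3d}: power = {approx_power(0.5, n):.2f}")
```

With 20 patients per group, a real, moderate effect is detected only about a third of the time, so a non-significant result from such a trial tells you almost nothing.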
Misconception 4: "Statistical significance means the effect is large or important"
This one has probably caused more misallocated healthcare resources, more misguided policy decisions, and more confused patients than any other p-value misconception.
Here's the fundamental confusion: statistical significance is a statement about signal-to-noise ratio, not about the size of the signal. A study with a large enough sample size can produce p < 0.0001 for an effect that is utterly trivial in the real world.
Picture this concrete example: a study of 100,000 people finds that people who eat a particular brand of yogurt are, on average, half a pound lighter than non-yogurt-eaters. With 100,000 subjects, a difference that small can comfortably clear the significance threshold. It is also clinically meaningless. Half a pound is not an effect anyone should care about.
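You can check this with nothing more than the standard normal distribution. The sketch below uses illustrative numbers (a half-pound mean difference against a body-weight standard deviation of 30 lb) and shows the p-value for the same difference as only the sample size changes:

```python
import math

def z_test_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in group means (normal approximation)."""
    se = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# The same half-pound difference (SD 30 lb); only the sample size changes
for n in (1_000, 100_000, 1_000_000):
    z, p = z_test_p_value(0.5, 30.0, n)
    print(f"n per group = {n:>9,}: z = {z:5.2f}, p = {p:.3g}")
```

The identical real-world difference goes from nowhere near significant to overwhelmingly significant purely because n grew. Nothing about the effect itself changed.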
The flip side is equally true: a study might find an enormous, clinically meaningful effect that just misses the significance threshold because the sample was small. The drug works — dramatically — but the trial had only 40 patients and p = 0.07, so it gets shelved as "no evidence of effectiveness."
This is precisely why statisticians and the American Statistical Association itself have pushed hard for reporting effect sizes alongside p-values. We'll dive into effect sizes properly in Section 10, but here's the core idea: significance tells you whether you can detect an effect; effect size tells you whether you should care about it. These are completely different questions.
graph TD
A[Observed Data] --> B[Calculate p-value]
B --> C{p < 0.05?}
C -->|Yes: "Significant"| D[We can detect this effect]
C -->|No: "Not significant"| E[We cannot reliably detect this effect]
D --> F{Effect size?}
E --> G{Effect size?}
F -->|Large| H[Important finding]
F -->|Tiny| I[Detectable but meaningless]
G -->|Large| J[Real effect, underpowered study]
G -->|Tiny| K[Probably nothing to see]
The Historical Accident That Broke Statistics
Here's where this gets genuinely fascinating — and genuinely infuriating. The confusions above didn't arise from random sloppiness. They arose from a historical accident: two incompatible statistical frameworks were stitched together into a single hybrid practice without anyone admitting they were fundamentally incompatible. Understand this history and everything about modern statistics will suddenly make more sense.
Fisher's Approach: Measuring Evidence
Ronald Fisher, the brilliant and famously cantankerous British statistician, developed the p-value in the 1920s as a tool for inductive reasoning. His idea was elegant: assume the null hypothesis is true, compute how probable your data would be under that assumption, and use that probability as a measure of the evidence against the null hypothesis.
For Fisher, the p-value was continuous evidence. A p-value of 0.001 represented stronger evidence against the null than a p-value of 0.049. The "significance" threshold of 0.05 was not a sacred line in the sand — it was a rough rule of thumb, a guideline for whether the evidence was worth taking seriously. Fisher was explicit about this: a single study achieving p < 0.05 shouldn't be treated as definitive. Results needed to be replicated, combined across studies, evaluated in context. He explicitly opposed the idea of making a binary decision based on a single threshold.
Fisher's framework had no fixed alternative hypothesis. It was about learning from data, updating your sense of whether the null hypothesis was reasonable, accumulating evidence across studies over time. It was fundamentally about epistemology — how you come to know things.
Neyman-Pearson: Decision Theory
Enter Jerzy Neyman and Egon Pearson (son of Karl Pearson), who developed an entirely different framework in the late 1920s and 1930s. They weren't interested in measuring evidence. They cared about making decisions — specifically, about developing procedures that would make errors at acceptable rates in the long run.
Their framework required you to specify not just a null hypothesis but an explicit alternative hypothesis. It introduced Type I error (falsely rejecting a true null, controlled by α) and Type II error (failing to reject a false null, controlled by power). The significance threshold — say, α = 0.05 — had a specific meaning in their system: if you follow this procedure repeatedly over many experiments, you'll falsely reject true null hypotheses at most 5% of the time.
Notice the crucial difference from Fisher. In Neyman-Pearson, you're not computing evidence for any particular study. You're adopting a decision procedure with known long-run error properties. The p-value isn't evidence about this study — it's a criterion for making a binary decision. And crucially, the individual p-value of a specific experiment tells you very little; what matters is the procedure itself, applied consistently over many repeated experiments.
Fisher and Neyman absolutely loathed each other's frameworks. Fisher called Neyman's approach "childish" and "horrifying [for] intellectual freedom in the west." Neyman called some of Fisher's work "worse than useless." The arguments were vicious, personal, and unresolved when both men died.
Remember: Fisher cared about the strength of evidence in a single study. Neyman-Pearson cared about the long-run error rates of a decision procedure. These goals are genuinely different, and they lead to different conclusions about what statistical results mean.
The Franken-Framework
As statistician Jim Frost explains, textbook publishers and instructors, facing two rival schools and needing a single coherent course to teach, did what any pragmatic institution would do: they merged them. The resulting hybrid — calculate a p-value, compare it to α = 0.05, reject or fail to reject — looks like a unified method. It isn't.
The merge created deep, unresolved inconsistencies. The p-value came from Fisher's continuous evidence framework. The significance threshold (α) came from Neyman-Pearson's decision procedure framework. These two numbers don't mean the same thing and were never designed to be compared directly. Comparing a p-value to α and concluding the result is "significant" is a category error — it's like judging a movie by comparing the ticket price to the Rotten Tomatoes score.
The consequence is that researchers operate in a confusing middle ground:
- They use Fisherian p-values as if they were evidence
- They use Neyman-Pearson α thresholds as if they were decision rules
- They often end up doing neither properly
The result is that "statistical significance" has become a cultural ritual rather than a coherent epistemic practice. Researchers know they need p < 0.05 to publish. That's what they aim for. Whether the result is meaningful, replicable, or practically important becomes secondary — which leads directly to the next problem.
The Binary Trap: Significant vs. Not Significant
Once you understand that p-values are continuous measures of evidence forced into a binary framework, the damage becomes obvious.
Consider what happens to a result with p = 0.049 versus p = 0.051. In any sensible world, these are essentially identical outcomes — data that produced almost exactly the same statistical signal. In the world of significance testing, however, one is "significant" and publishable, the other is "not significant" and likely ends up in a file drawer. The evidence is nearly identical; the professional consequences are opposite.
This binary creates warped incentives throughout science. Researchers don't aim for the most informative result — they aim for the magical threshold. And because they're human beings operating under career pressure, they often find ways to get there, consciously or not. (Section 9 covers the specific mechanisms like p-hacking and optional stopping.) For now, just notice: any system that treats fundamentally similar evidence as categorically different will produce distorted results over time.
There's also the problem of what gets published in the first place. Studies with p < 0.05 get published. Studies with p > 0.05 often don't — either because journals reject them or because researchers don't bother submitting them. This is the "file drawer problem": the scientific literature contains a systematically biased sample of all research conducted, filtered specifically for positive findings.
If you run 20 studies of pure noise and publish the one that randomly lands at p = 0.04, you've created a published "finding" with zero relationship to reality. The other 19 sit in drawers, unseen. Anyone reading the literature sees only your one positive result and reasonably concludes there's an effect.
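The arithmetic behind that scenario is simple: under the null hypothesis, each study independently has a 5% chance of landing below 0.05, so across 20 attempts the chance of at least one false "finding" is:

```python
# Probability that at least one of 20 pure-noise studies reaches p < 0.05
alpha = 0.05
n_studies = 20
p_false_hit = 1 - (1 - alpha) ** n_studies
print(f"{p_false_hit:.2f}")  # about 0.64 -- better odds than a coin flip
```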
Tip: When evaluating a research finding, always ask: "How many studies on this topic exist, and is this one an outlier?" A single significant result in a field with many null results should make you more skeptical, not less.
The ASA Speaks — Twice
In 2016, the American Statistical Association issued a formal statement on p-values, the first time in its history it had taken an official position on a specific matter of statistical practice. The statement was blunt. It listed six principles, and the most important was direct: "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result." The ASA acknowledged that widespread misuse of p-values was causing real harm to science, and it urged researchers to "not conclude 'there is no difference' or 'there is no association' just because a p-value was larger than a threshold."
Then, in 2019, the ASA went further. The journal The American Statistician published a special issue with a striking editorial title: "Statistical Inference in the 21st Century: A World Beyond p < 0.05." The editorial, signed by ASA leadership, called for abandoning the term "statistical significance" entirely. Not reforming it. Abandoning it.
Forty-three papers appeared in that issue, contributed by statisticians across the discipline. The consensus was clear: the binary significant/not-significant framework has caused enough damage that it needs to go. Replace it with effect sizes, confidence intervals, and honest uncertainty. Think carefully about prior probabilities. Replicate before claiming discovery.
This is not a fringe position. This is the professional statistical community's official assessment of where things went wrong.
What to Look at Instead
If p-values are so problematic, what should you actually use? The honest answer is that you should use p-values correctly — as one piece of evidence among several — while also looking at:
Effect sizes tell you how large an effect is in practical terms. A drug might significantly reduce blood pressure (p = 0.001) but only by 1 mmHg — an effect so small it's clinically irrelevant. Effect size metrics like Cohen's d or odds ratios translate statistical results into interpretable magnitudes. (Section 10 covers these in depth.)
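As a quick sketch of the idea (the blood-pressure numbers here are illustrative, not from any real trial), Cohen's d simply divides the mean difference by the pooled standard deviation:

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d using a pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# The blood-pressure example: a 1 mmHg drop against SD ~ 15 mmHg
d = cohens_d(140.0, 139.0, 15.0, 15.0, 5000, 5000)
print(f"Cohen's d = {d:.3f}")
# d is about 0.07: negligible by Cohen's conventions
# (0.2 is "small", 0.5 "medium", 0.8 "large"), whatever the p-value says.
```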
Confidence intervals show you the range of effect sizes consistent with the data, not just a single-number summary. A confidence interval that just barely excludes zero looks very different from one that excludes zero by a wide margin — but both might produce p = 0.04. Intervals give you the continuous information that binary significance destroys. (Section 8 covers confidence intervals in detail.)
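A small sketch (with hypothetical estimates and standard errors) shows how two results with essentially the same p-value can carry very different information about the plausible size of the effect:

```python
import math

def ci_and_p(estimate, se):
    """95% confidence interval and two-sided p-value (normal approximation)."""
    lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
    p = math.erfc(abs(estimate / se) / math.sqrt(2))
    return (lo, hi), p

# Same z ratio, hence essentially the same p -- but very different intervals
for est, se in [(2.0, 0.97), (20.0, 9.7)]:
    (lo, hi), p = ci_and_p(est, se)
    print(f"estimate {est:5.1f}: 95% CI [{lo:6.2f}, {hi:6.2f}], p = {p:.3f}")
```

Both results are "significant at p < 0.05," but the first pins the effect between roughly 0.1 and 3.9 while the second is consistent with anything from about 1 to 39. The p-value alone cannot distinguish them.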
Pre-registration means researchers publicly declare their hypotheses, methods, and analysis plans before collecting data. This prevents post-hoc adjustment of hypotheses to match results — one of the main mechanisms by which p-hacking occurs.
Replication is the most powerful reality check. One p < 0.05 result proves nothing. The same finding replicated in three independent samples with appropriate effect sizes is genuinely informative. The field of nutrition research has been plagued by single-study findings that evaporate under replication; the psychology replication crisis made this problem headline news.
graph TD
A[Research Finding] --> B[Check p-value]
A --> C[Check effect size]
A --> D[Check confidence interval]
A --> E[Check sample size and power]
A --> F[Check for replication]
B --> G{p < 0.05?}
G -->|Yes| H[Some evidence against null]
G -->|No| I[Insufficient evidence, not disproven]
C --> J[Is the effect practically meaningful?]
D --> K[What range of effects are consistent with data?]
E --> L[Was the study powered to detect this size effect?]
F --> M[Has this held up in independent samples?]
H & I & J & K & L & M --> N[Integrated judgment about the claim]
The key insight is that no single statistic carries the full story. Greenland and colleagues are explicit: "Statistical tests should never constitute the sole input to inferences or decisions about associations or effects." Evidence is multi-dimensional. Any framework that compresses it into a single binary (significant/not significant) is discarding information you genuinely need.
The Reader's Takeaway
Let's be straightforward about something. The p-value isn't going away. It appears in virtually every piece of empirical research you'll encounter. Banning it outright would create as many problems as it solves — the field is genuinely still working out what should replace it.
What you need is not to reject p-values but to see through them clearly. When you read that a study found a "statistically significant" result:
- Remember that significance says nothing about the size of the effect
- Remember that p < 0.05 does not mean "95% chance this is real"
- Remember that non-significant does not mean "no effect"
- Remember that the threshold of 0.05 is a convention, not a law of nature
- Remember that this framework is a hybrid of two incompatible approaches, born of a historical accident, and that the p-value is routinely asked to do a job it was never designed for
And always, always ask what the effect size was, whether the study has been replicated, and whether the sample size was large enough to detect effects that actually matter.
This is not statistical pedantry. This is how you avoid being misled by science journalism, by marketing that cites studies, by policy claims dressed up in statistical authority. The p-value, properly understood, is a useful but limited tool. It was never meant to be the whole story — and the moment you stop treating it as such, your ability to evaluate evidence improves dramatically.
That is, in the end, what statistical literacy is for.