The use of a p value of 0.05 as the threshold for statistical significance is one of the most enduring and contentious conventions in scientific research. Often misunderstood as a bright line between truth and falsehood, the number is better read as a measure of how surprising the data would be if there were no effect. A p value of 0.05 indicates that, assuming the null hypothesis is true, there is a 5% probability of observing a result as extreme as, or more extreme than, the one obtained in the study. In practice, this threshold determines whether a finding is labeled "significant," influencing which hypotheses advance, which theories gain traction, and which interventions are funded or implemented.
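As a rough illustration of the definition above, the sketch below computes a two-sided p value for a z statistic. It assumes the test statistic follows a standard normal distribution under the null hypothesis; the function name `two_sided_p_value` is made up for this example, and real studies often need a different test.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Probability, under the null, of a statistic at least as extreme as z.

    Assumption: the statistic is standard normal when the null is true.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# A z statistic near 1.96 sits right at the conventional 0.05 threshold.
for z in (1.0, 1.96, 3.0):
    print(f"z = {z:.2f} -> p = {two_sided_p_value(z):.4f}")
```

Under this assumption a statistic of 1.96 gives a p value of about 0.05, which is why that value is so often quoted as the cutoff.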
The Historical Roots of 0.05
The adoption of 0.05 was not the result of a rigorous mathematical proof but rather a pragmatic convention established in the early 20th century. Sir Ronald Fisher, a prominent statistician, popularized this threshold in his 1925 work, suggesting that values below 0.05 warranted further investigation, while values above 0.10 could be safely disregarded. Fisher framed it as a tool for researchers to quickly decide whether to "pursue a phenomenon" rather than a definitive rule for publishing discoveries. This informal benchmark gradually solidified into a rigid standard, particularly in the biomedical and social sciences, where regulatory bodies and journals began to treat it as a non-negotiable requirement for credibility.
Understanding What "Significant" Really Means
The Misinterpretation of Probability
A widespread error is interpreting a "significant" result as a 95% probability that the alternative hypothesis is true. In reality, the p value is a statement about the data, not about the hypothesis. A p value of 0.05 does not mean there is a 95% chance that the observed effect is real; it means that, if there were no effect, data at least as extreme as those observed would occur 5% of the time. This distinction is crucial. The probability that a significant finding is true depends on the study's power, the plausibility of the effect, and the prevalence of true effects in the field, factors the p value alone cannot capture.
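To make the dependence on power and prevalence concrete, here is a minimal sketch of the share of "significant" results that reflect a real effect. It assumes a simplified model in which each tested hypothesis is either truly null or truly non-null, every study has the same power, and there is no bias; the name `positive_predictive_value` and the input values are illustrative, not taken from any study.

```python
def positive_predictive_value(prior: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of significant results that are true effects.

    prior: assumed share of tested hypotheses where the effect is real.
    power: assumed probability of detecting a real effect.
    alpha: significance threshold (false positive rate under the null).
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Same threshold and power, very different meaning of "significant".
for prior in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(prior, power=0.8)
    print(f"prior = {prior:.2f} -> share of significant results that are real = {ppv:.2f}")
```

With these hypothetical inputs, a field where only 1 in 10 tested effects is real would see roughly 64% of its significant findings reflect true effects, well short of the 95% that the common misreading suggests.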
Type I and Type II Errors
Strictly speaking, it is the significance level (alpha), not the p value itself, that controls the Type I error rate: the risk of a false positive, that is, rejecting a null hypothesis that is actually true. By setting alpha at 0.05 and rejecting the null only when the p value falls below it, researchers accept a 5% risk of claiming an effect exists when it does not. However, this focus on false positives often overshadows Type II errors, or false negatives, where a real effect is missed because of insufficient sample size or high variability. A fixation on achieving significance can lead to underpowered studies that fail to detect meaningful effects, contributing to a distorted scientific record.
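The trade-off between the two error types can be sketched with a power calculation. The example below assumes a two-sided, one-sample z-test with a known standard deviation, which is a simplification chosen for brevity; `z_test_power` is a hypothetical helper, and the effect size is expressed in standard deviation units.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting the null in a two-sided one-sample z-test.

    effect_size: true mean shift in standard deviation units (assumed known SD).
    n: sample size.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n)
    # Rejection happens in either tail of the shifted sampling distribution.
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)


# With no true effect, the rejection rate is the Type I error rate (alpha).
print(f"Type I error rate: {z_test_power(0.0, n=32):.3f}")

# With a real effect, 1 - power is the Type II error rate.
for n in (10, 32, 100):
    power = z_test_power(0.5, n)
    print(f"n = {n:3d} -> power = {power:.2f}, Type II error rate = {1.0 - power:.2f}")
```

In this setup alpha stays fixed at 0.05 regardless of sample size, while the Type II error rate depends heavily on it: under these assumptions a half-standard-deviation effect is missed more often than not at n = 10.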
The Replication Crisis and Its Implications
The pervasive use of the 0.05 threshold has been implicated in the replication crisis affecting numerous scientific disciplines. When journals prioritize "significant" findings, researchers face pressure to engage in practices like p-hacking: adjusting data collection or analysis until the desired p value is achieved. Because every additional analysis is another chance for a false positive, a single statistically significant result is less reliable than a cluster of consistent findings. The expectation of a magic number creates an incentive structure that favors flashy, confirmatory outcomes over robust, incremental science, ultimately eroding trust in research.
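One way to see why repeated analyses undermine the threshold is to compute the chance of at least one false positive across several tests. This sketch assumes the tests are independent and that every null hypothesis is true, which is an idealization; analyses run on the same dataset are usually correlated, so the exact figures would differ. The function name is illustrative.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests is
    significant at level alpha when no real effect exists in any of them."""
    return 1.0 - (1.0 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    risk = chance_of_any_false_positive(num_tests)
    print(f"{num_tests:2d} tests -> chance of at least one p < 0.05: {risk:.2f}")
```

Under these assumptions, trying 20 different outcomes or subgroups gives roughly a 64% chance that at least one crosses the 0.05 line by chance alone.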
Moving Beyond the Binary
Embracing Effect Sizes and Confidence Intervals
Modern statistical practice encourages a shift from binary "significant/non-significant" thinking to a more nuanced evaluation of evidence. The effect size, which quantifies the magnitude of a phenomenon, is often more informative than the significance label. A "significant" result with a trivial effect size (e.g., a 0.1% improvement) may be statistically detectable in a large sample but scientifically meaningless. Confidence intervals provide a richer alternative, presenting a range of plausible values for the effect rather than a single probability, offering a clearer picture of uncertainty and precision.
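The contrast between a significance label and an effect size can be sketched from summary statistics. The example below assumes two equally sized groups with a common, known standard deviation and uses a large-sample normal approximation; the numbers are invented for illustration, and `summarize_difference` is a hypothetical helper rather than a standard routine.

```python
from math import sqrt
from statistics import NormalDist


def summarize_difference(mean_a: float, mean_b: float, sd: float,
                         n_per_group: int, confidence: float = 0.95) -> dict:
    """Effect size, confidence interval, and p value for a difference in means.

    Assumptions: equal group sizes, shared standard deviation sd,
    and samples large enough for the normal approximation.
    """
    diff = mean_b - mean_a
    cohens_d = diff / sd                      # standardized effect size
    std_error = sd * sqrt(2.0 / n_per_group)  # standard error of the difference
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    ci = (diff - z_crit * std_error, diff + z_crit * std_error)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(diff / std_error)))
    return {"difference": diff, "cohens_d": cohens_d, "ci": ci, "p_value": p_value}


# Invented example: a tiny shift measured in a very large sample.
result = summarize_difference(mean_a=100.0, mean_b=100.1, sd=10.0, n_per_group=1_000_000)
print(f"difference = {result['difference']:.2f}")
print(f"Cohen's d  = {result['cohens_d']:.3f}")
print(f"95% CI     = ({result['ci'][0]:.3f}, {result['ci'][1]:.3f})")
print(f"significant at 0.05: {result['p_value'] < 0.05}")
```

In this invented case the result is comfortably "significant," yet the standardized effect is about 0.01 and the interval shows the difference is at most a small fraction of one unit, which is the information a reader needs to judge practical importance.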