R Soc Open Sci. 2017 Dec 6;4(12):171085.

doi: 10.1098/rsos.171085. eCollection 2017 Dec.

The reproducibility of research and the misinterpretation of p-values

David Colquhoun¹

PMID: 29308247
PMCID: PMC5750014
DOI: 10.1098/rsos.171085

David Colquhoun. R Soc Open Sci. 2017.

Author

David Colquhoun¹

Affiliation

¹ Department of Neuroscience, Physiology and Pharmacology, University College London, London, UK.

Erratum in

Correction to 'The reproducibility of research and the misinterpretation of p-values'.
Colquhoun D. R Soc Open Sci. 2018 Mar 7;5(3):180100. doi: 10.1098/rsos.180100. eCollection 2018 Mar. PMID: 29658963. Free PMC article.

Abstract

We wish to answer this question: If you observe a 'significant' p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3 : 1. This is far weaker evidence than the odds of 19 to 1 that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. If you observe p = 0.001 in a well-powered experiment, it gives a likelihood ratio of almost 100 : 1 odds on there being a real effect. That would usually be regarded as conclusive. But the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. And, in this case, if you wanted to achieve a false positive risk of 5% you would need to observe p = 0.00045. It is recommended that the terms 'significant' and 'non-significant' should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value. Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as 'statistically significant'. This practice must contribute to the lack of reproducibility in some areas of science. This is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.
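
The arithmetic behind the figures quoted in the abstract can be sketched in a few lines of Python. This is a minimal sketch, assuming that the false positive risk follows from Bayes' theorem in odds form (posterior odds on a real effect = likelihood ratio × prior odds). The function names are illustrative and are not taken from the paper's scripts, and the likelihood ratios used are the rounded values quoted above (about 3 for p = 0.05, about 100 for p = 0.001).

```python
def false_positive_risk(likelihood_ratio, prior):
    """P(H0 | data), given the likelihood ratio (H1 over H0) and the prior P(H1)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return 1.0 / (1.0 + posterior_odds)


def prior_needed(likelihood_ratio, target_fpr):
    """Prior P(H1) at which the false positive risk equals target_fpr."""
    required_posterior_odds = (1.0 - target_fpr) / target_fpr
    prior_odds = required_posterior_odds / likelihood_ratio
    return prior_odds / (1.0 + prior_odds)


if __name__ == "__main__":
    # p = 0.001 case: likelihood ratio of about 100, prior of 0.1
    print(round(false_positive_risk(100, 0.1), 3))
    # p = 0.05 case: likelihood ratio rounded to 3 (an approximation)
    print(round(prior_needed(3, 0.05), 3))
```

With a likelihood ratio of 100 and a prior of 0.1 the risk comes out at about 0.083, in line with the 8% quoted. With the ratio rounded to 3, the prior needed for a 5% risk is about 0.86; the 87% in the abstract presumably reflects the unrounded likelihood ratio.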

Keywords: false positive risk; null hypothesis tests; reproducibility; significance tests; statistics.

Conflict of interest statement

I declare I have no competing interests.

Figures

**Figure 1.**
Definitions for a NHST. A Student's t-test is used to analyse the difference between the means of two groups of n = 16 observations. The t value, therefore, has 2(n − 1) = 30 d.f. The blue line represents the distribution of Student's t under the null hypothesis (H₀): the true difference between means is zero. The green line shows the non-central distribution of Student's t under the alternative hypothesis (H₁): the true difference between means is 1 (1 s.d.). The critical value of t for 30 d.f. and p = 0.05 is 2.04, so, for a two-sided test, any value of t above 2.04, or below −2.04, would be deemed ‘significant’. These values are represented by the red areas. When the alternative hypothesis is true (green line), the probability that the value of t is below the critical level (2.04) is 22% (gold shaded): these represent false negative results. Consequently, the area under the green curve above t = 2.04 (shaded yellow) is the probability that a ‘significant’ result will be found when there is in fact a real effect (H₁ is true): this is the power of the test, in this case 78%. The ordinates marked y₀ (= 0.0526) and y₁ (= 0.290) are used to calculate likelihood ratios, as in §5.
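
The central-t quantities in this caption can be checked with the Python standard library alone. The sketch below evaluates the Student's t density at the critical value, integrates it numerically to confirm that t = 2.04 corresponds to a two-sided p of about 0.05 with 30 d.f., and forms a likelihood ratio as y₁/(2y₀). Several things here are assumptions: Simpson's rule is used only to keep the example self-contained and is not the paper's method, the y₁/(2y₀) form is an assumed reading of how §5 combines the two ordinates for a two-sided test, and y₁ = 0.290 is copied from the caption rather than computed, because the non-central t density is not available in the standard library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, 1 - P(|T| <= |t|), by Simpson's rule on [0, |t|]."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2.0 * total * h / 3.0  # symmetric about zero
    return 1.0 - central_area


DF = 30        # 2(n - 1) with n = 16 per group
T_CRIT = 2.04  # critical value quoted in the caption
Y1 = 0.290     # ordinate under H1, copied from the caption (not computed)

y0 = t_pdf(T_CRIT, DF)
p = two_sided_p(T_CRIT, DF)
likelihood_ratio = Y1 / (2.0 * y0)  # assumed two-sided form

print(f"y0 = {y0:.4f}, p = {p:.4f}, LR = {likelihood_ratio:.2f}")
```

The density at t = 2.04 evaluates to roughly 0.053 and the ratio to between 2.7 and 2.8, which agrees with the odds of about 3 : 1 given in the abstract.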

**Figure 2.**
Plots of false positive risk (FPR) against p-value, for two different ways of calculating FPR. The continuous blue line shows the p-equals interpretation and the dashed blue line shows the p-less-than interpretation. These curves are calculated for a well-powered experiment with a sample size of n = 16. This gives power = 0.78, for p = 0.05 in our example (true effect = 1 s.d.). (a,b) Prior probability of a real effect = 0.1. (c,d) Prior probability of a real effect = 0.5. The dashed red line shows a unit slope: this shows the relationship that would hold if the FPR were the same as the p-value. The graphs in the right-hand column are the same as those in the left-hand column, but in the form of a log–log plot. Graphs produced by Plot-FPR-versus-Pval.R (see the electronic supplementary material).
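
Under the p-less-than interpretation the false positive risk depends only on the threshold, the power and the prior, so the points at p = 0.05 can be sketched directly from the numbers in this caption. The function name is illustrative, and the formula is the usual Bayes calculation for "all tests with p below the threshold" rather than code taken from Plot-FPR-versus-Pval.R. The p-equals curve needs the ratio of densities at the observed p-value and is not reproduced here.

```python
def fpr_p_less_than(alpha, power, prior):
    """FPR among all tests with p < alpha: false positives / all positives."""
    false_positives = alpha * (1.0 - prior)  # H0 true, test comes out 'significant'
    true_positives = power * prior           # H1 true, test comes out 'significant'
    return false_positives / (false_positives + true_positives)


for prior in (0.1, 0.5):
    risk = fpr_p_less_than(alpha=0.05, power=0.78, prior=prior)
    print(f"prior = {prior}: FPR = {risk:.3f}")
```

With power 0.78 this gives a risk of about 0.37 for a prior of 0.1 and about 0.06 for a prior of 0.5. Both lie above 0.05, so both points sit above the dashed unit-slope line.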

**Figure 3.**
The false positive risk plotted against the prior probability for a test that comes out with a p-value just below 0.05. The points for prior probabilities greater than 0.5 are red because it is essentially never legitimate to assume a prior bigger than 0.5. The calculations are done with a sample size of 16, giving power = 0.78 for p = 0.0475. The square symbols were found by simulation of 100 000 tests and looking only at tests that give p-values between 0.045 and 0.05. The fraction of these tests for which the null hypothesis is true is the false positive risk. The continuous line is the theoretical calculation of the same thing: the numbers were calculated with origin-graph.R and transferred to Origin to make the plot.
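
The simulation described in this caption can be sketched with the Python standard library. This is not the author's script: the function names are hypothetical, the default number of tests is reduced from 100 000 to 20 000 to keep the run short, and the window of p-values is converted once into a window of |t| values by bisection on a numerically integrated t distribution (an assumption made here to avoid a statistics library). Each simulated test draws two groups of n = 16 normal observations with unit s.d., the second group shifted by 1 s.d. when the effect is real.

```python
import math
import random


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=400):
    """Two-sided p-value by Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0


def t_for_p(p_target, df):
    """|t| whose two-sided p-value equals p_target, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > p_target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_statistic(a, b):
    """Equal-variance two-sample t statistic for two samples of equal size."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (2 * (n - 1))
    return (mean_b - mean_a) / math.sqrt(pooled_var * 2 / n)


def simulate_fpr(prior, n=16, effect=1.0, tests=20000, seed=1):
    """Fraction of tests with 0.045 <= p <= 0.05 for which H0 was true."""
    rng = random.Random(seed)
    df = 2 * (n - 1)
    t_lo, t_hi = t_for_p(0.05, df), t_for_p(0.045, df)
    null_hits = real_hits = 0
    for _ in range(tests):
        real = rng.random() < prior           # is there a real effect?
        shift = effect if real else 0.0
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(shift, 1.0) for _ in range(n)]
        if t_lo <= abs(t_statistic(a, b)) <= t_hi:
            if real:
                real_hits += 1
            else:
                null_hits += 1
    hits = null_hits + real_hits
    return null_hits / hits if hits else float("nan")


if __name__ == "__main__":
    print(simulate_fpr(prior=0.5))
```

Only a small fraction of the simulated tests fall inside the p-value window, so the estimate is noisy at this sample count; raising `tests` towards the 100 000 used for the figure tightens it at the cost of run time.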

**Figure 4.**
The calculated false positive risk plotted against the observed p-value. The plots are for three different sample sizes: n = 4 (red), n = 8 (green) and n = 16 (blue). (a,b) Prior probability of a real effect = 0.1. (c,d) Prior probability of a real effect = 0.5. The dashed red line shows a unit slope: this shows the relationship that would hold if the FPR were the same as the p-value. The graphs in the right-hand column are the same as those in the left-hand column, but in the form of a log–log plot. Graphs produced by Plot-FPR-versus-Pval.R (see the electronic supplementary material).

**Figure 5.**
Web calculator [12] for the case where we observe a p-value of 0.001 and the prior probability of a real effect is 0.1 (http://fpr-calc.ucl.ac.uk/).

**Figure 6.**
Web calculator [12] calculation of the prior probability that would be needed to achieve a false positive risk of 5% when we observe p = 0.05 (http://fpr-calc.ucl.ac.uk/).

References

1. Bakan D. 1966. The test of significance in psychological research. Psychol. Bull. 66, 423–437. (doi:10.1037/h0020412)
2. Colquhoun D. 2014. An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. open sci. 1, 140216. (doi:10.1098/rsos.140216)
3. Berger JO, Sellke T. 1987. Testing a point null hypothesis: the irreconcilability of p-values and evidence. J. Am. Stat. Assoc. 82, 112–122. (doi:10.2307/2289131)
4. Berger JO, Delampady M. 1987. Testing precise hypotheses. Stat. Sci. 2, 317–352. (doi:10.1214/ss/1177013238)
5. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafo MR. 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376. (doi:10.1038/nrn3475)

Associated data

figshare/10.6084/m9.figshare.c.3936958
