R Soc Open Sci. 2017 Dec 6;4(12):171085.
doi: 10.1098/rsos.171085. eCollection 2017 Dec.

The reproducibility of research and the misinterpretation of p-values

David Colquhoun. R Soc Open Sci.

Erratum in

Abstract

We wish to answer this question: If you observe a 'significant' p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3 : 1. This is far weaker evidence than the odds of 19 to 1 that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume that you were 87% sure that there was a real effect before the experiment was done. If you observe p = 0.001 in a well-powered experiment, it gives a likelihood ratio of almost 100 : 1 odds on there being a real effect. That would usually be regarded as conclusive. But the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. And, in this case, if you wanted to achieve a false positive risk of 5% you would need to observe p = 0.00045. It is recommended that the terms 'significant' and 'non-significant' should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value. Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as 'statistically significant'. This practice must contribute to the lack of reproducibility in some areas of science. This is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.
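The arithmetic linking the likelihood ratio, the prior probability and the false positive risk can be illustrated with a short sketch. It assumes the false positive risk is the posterior probability of the null hypothesis obtained from Bayes' rule in odds form (posterior odds = likelihood ratio × prior odds). The function names `false_positive_risk` and `prior_needed` are hypothetical, chosen for this illustration, and the likelihood ratios passed in are the rounded values quoted in the abstract ("about 3 : 1" and "almost 100 : 1"), not exact calculations.

```python
def false_positive_risk(likelihood_ratio: float, prior: float) -> float:
    """Posterior probability of H0, given the likelihood ratio in favour of
    a real effect (H1 versus H0) and the prior probability of a real effect.

    Assumption of this sketch: Bayes' rule in odds form,
    posterior odds on H1 = likelihood ratio * prior odds on H1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return 1.0 / (1.0 + posterior_odds)


def prior_needed(likelihood_ratio: float, target_fpr: float) -> float:
    """Prior probability of a real effect needed to reach a target
    false positive risk, for a given likelihood ratio (inverse of the above)."""
    if not 0.0 < target_fpr < 1.0:
        raise ValueError("target_fpr must lie strictly between 0 and 1")
    posterior_odds = (1.0 - target_fpr) / target_fpr
    prior_odds = posterior_odds / likelihood_ratio
    return prior_odds / (1.0 + prior_odds)


if __name__ == "__main__":
    # p = 0.001, likelihood ratio rounded to 100, prior probability 0.1
    print(false_positive_risk(100.0, 0.1))
    # p = 0.05, likelihood ratio rounded to 3, target false positive risk 5%
    print(prior_needed(3.0, 0.05))
```

With these rounded inputs the first call gives a false positive risk of roughly 8%, and the second gives a required prior of roughly 0.86, close to the 87% quoted in the abstract; the small gap comes from rounding the likelihood ratio up to 3.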

Keywords: false positive risk; null hypothesis tests; reproducibility; significance tests; statistics.


Conflict of interest statement

I declare I have no competing interests.

Figures

Figure 1.
Definitions for a NHST. A Student's t-test is used to analyse the difference between the means of two groups of n = 16 observations. The t value, therefore, has 2(n − 1) = 30 d.f. The blue line represents the distribution of Student's t under the null hypothesis (H0): the true difference between means is zero. The green line shows the non-central distribution of Student's t under the alternative hypothesis (H1): the true difference between means is 1 (1 s.d.). The critical value of t for 30 d.f. and p = 0.05 is 2.04, so, for a two-sided test, any value of t above 2.04, or below –2.04, would be deemed ‘significant’. These values are represented by the red areas. When the alternative hypothesis is true (green line), the probability that the value of t is below the critical level (2.04) is 22% (gold shaded): these represent false negative results. Consequently, the area under the green curve above t = 2.04 (shaded yellow) is the probability that a ‘significant’ result will be found when there is in fact a real effect (H1 is true): this is the power of the test, in this case 78%. The ordinates marked y0 (= 0.0526) and y1 (= 0.290) are used to calculate likelihood ratios, as in §5.
Figure 2.
Plots of false positive risk (FPR) against p-value, for two different ways of calculating FPR. The continuous blue line shows the p-equals interpretation and the dashed blue line shows the p-less-than interpretation. These curves are calculated for a well-powered experiment with a sample size of n = 16. This gives power = 0.78, for p = 0.05 in our example (true effect = 1 s.d.). (a,b) Prior probability of a real effect = 0.1. (c,d) Prior probability of a real effect = 0.5. The dashed red line shows a unit slope: this shows the relationship that would hold if the FPR were the same as the p-value. The graphs in the right-hand column are the same as those in the left-hand column, but in the form of a log–log plot. Graphs produced by Plot-FPR-versus-Pval.R (see the electronic supplementary material).
Figure 3.
The false positive risk plotted against the prior probability for a test that comes out with a p-value just below 0.05. The points for prior probabilities greater than 0.5 are red because it is essentially never legitimate to assume a prior bigger than 0.5. The calculations are done with a sample size of 16, giving power = 0.78 for p = 0.0475. The square symbols were found by simulation of 100 000 tests and looking only at tests that give p-values between 0.045 and 0.05. The fraction of these tests for which the null hypothesis is true is the false positive risk. The continuous line is the theoretical calculation of the same thing: the numbers were calculated with origin-graph.R and transferred to Origin to make the plot.
Figure 4.
The calculated false positive risk plotted against the observed p-value. The plots are for three different sample sizes: n = 4 (red), n = 8 (green) and n = 16 (blue). (a,b) Prior probability of a real effect = 0.1. (c,d) Prior probability of a real effect = 0.5. The dashed red line shows a unit slope: this shows the relationship that would hold if the FPR were the same as the p-value. The graphs in the right-hand column are the same as those in the left-hand column, but in the form of a log–log plot. Graphs produced by Plot-FPR-versus-Pval.R (see the electronic supplementary material).
Figure 5.
Web calculator [12] for the case where we observe a p-value of 0.001 and the prior probability of a real effect is 0.1 (http://fpr-calc.ucl.ac.uk/).
Figure 6.
Web calculator [12] calculation of the prior probability that would be needed to achieve a false positive risk of 5% when we observe p = 0.05 (http://fpr-calc.ucl.ac.uk/).
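The ordinates described in the Figure 1 caption can be checked numerically. The sketch below evaluates the density of Student's t under H0 at the critical value using the standard closed-form density, then forms a likelihood ratio from it. Two things here are assumptions of this sketch and not statements from the captions: the helper name `student_t_pdf` is hypothetical, and the two-sided likelihood ratio is taken to be y1 / (2 · y0), with the factor of 2 accounting for both tails under H0. The value of y1 is copied from the caption, since the non-central density is not computed here.

```python
import math


def student_t_pdf(t: float, df: int) -> float:
    """Density of the central Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


n = 16                 # observations per group (Figure 1)
df = 2 * (n - 1)       # 30 degrees of freedom
t_crit = 2.04          # critical value for p = 0.05, as quoted in the caption

y0 = student_t_pdf(t_crit, df)   # ordinate under H0 at the critical value
y1 = 0.290                       # ordinate under H1, copied from the caption

# Assumption: two-sided likelihood ratio in favour of H1 is y1 / (2 * y0).
likelihood_ratio = y1 / (2 * y0)

if __name__ == "__main__":
    print(f"y0 = {y0:.4f}")
    print(f"likelihood ratio = {likelihood_ratio:.2f}")
```

Under these assumptions the ratio comes out a little under 3, in line with the "about 3 : 1" odds quoted in the abstract for an observed p = 0.05.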

References

    1. Bakan D. 1966. The test of significance in psychological research. Psychol. Bull. 66, 423–437. (doi:10.1037/h0020412)
    2. Colquhoun D. 2014. An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. open sci. 1, 140216. (doi:10.1098/rsos.140216)
    3. Berger JO, Sellke T. 1987. Testing a point null hypothesis—the irreconcilability of p-values and evidence. J. Am. Stat. Assoc. 82, 112–122. (doi:10.2307/2289131)
    4. Berger JO, Delampady M. 1987. Testing precise hypotheses. Stat. Sci. 2, 317–352. (doi:10.1214/ss/1177013238)
    5. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafo MR. 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376. (doi:10.1038/nrn3475)
