Josh L. Morgan

Washington University in St. Louis, Department of Ophthalmology and Visual Sciences, Neuroscience, Biology and Biomedical Science.

jlmorgan@wustl.edu

ABSTRACT

The overwhelming majority of statistical analyses performed in cell biology are null hypothesis significance tests (NHSTs) that test the hypothesis that there was no effect. The most common example is performing a t-test to test the null hypothesis that there is no difference between a control group and an experimental group. This no-effect hypothesis is not realistic in experimental biology because there are multiple sources of guaranteed effect (biological, experimental, and statistical) in any biology experiment. With a large enough sample size, all tests for no-effect will result in rejecting the null hypothesis. Our understanding of a given question, therefore, is not impacted by the result of a test for no-effect.

Biologically meaningful significance testing is possible if the null hypothesis is defined by an understanding of trivial vs. biologically important effect sizes. However, most interpretation of cell biological data would be better served by a careful analysis of confidence intervals. Confidence intervals are easy to calculate, depend on fewer assumptions than p-values, can be compared across studies, and can be used to build models and test predictions.

MAIN TEXT

Cell biologists collect measurements that can be used to inspire, refine, or distinguish between models of the world. We have statistical models that can then be used to extract parameters of interest from these measurements and to estimate how much our limited sample size can tell us about the larger population. Accurately reporting our results, therefore, means transparently reporting two classes of information: magnitude (mean, median, correlation coefficient, etc.) and uncertainty (standard error, confidence interval).

Null hypothesis significance testing (NHST) promised a quantitatively rigorous way to combine these classes of information within the framework of hypothesis-driven science. However, statisticians and researchers have been pointing out for a hundred years that NHSTs, as commonly used, do not support the claims being made 1–8. Despite these protests, a warped version of NHST became cell biology’s gold standard statistical analysis, exploding in popularity in the 1980s 8. This version of NHST is notable in that it makes no claims about the effect size (magnitude) being tested. How did statistics in cell biology become non-quantitative?

Historic error

In 1925, Ronald A. Fisher published “Statistical Methods for Research Workers,” in which he described an approach to rejecting, or nullifying, a hypothesis by calculating how often the hypothesized effect size would produce an effect as large as the one observed, given some number of samples 9. He provided tables for calculating p ≤ 0.05 and p ≤ 0.01 as practical cut-offs for claiming that the difference between observed data and the predicted results of the null hypothesis was statistically significant. Note that “null” in this case does not refer to the size of a hypothesized effect, but to the procedure of rejecting, or nullifying, a particular hypothesis.

The problem with Fisher’s approach is that you never get to accept a hypothesis, just avoid an unlikely null hypothesis. In 1933, Neyman and Pearson explained how you could use the same statistical approach to make a rational choice between one hypothesis and another 10. If you could quantify the real-world costs of making the wrong decisions (picking A when B is true or vice versa), then you could choose an optimal statistical cutoff for acting on one hypothesis or the other 2,10,11. Given an infinite number of tests, you could make sure that the type of mistake that cost twenty times more than the other occurred twenty times less often (p < 0.05).

The standard biological application of NHST that evolved over the subsequent decades was a logically inconsistent mashup of Fisher, Neyman, and Pearson 2. In the confusion of binarizing questions and quantifying error costs, the importance of effect size was subsumed by the importance of the p-value. Ultimately, effect size was entirely dropped from the process of defining the null hypothesis, and the simpler test for effect or no-effect became the near universal convention 11. In Jacob Cohen’s words, the null hypothesis became the nil hypothesis 12. To reiterate, there is nothing in the mathematics of statistical analysis that requires the hypothesized difference between the groups being tested to be zero. A biologist could decide that, given common errors in measurement and the types of effect they are interested in, they will use a t-test to see if the experimental group mean is at least 20% larger than the control group mean. But rather than factoring in potential sources of error and biologically relevant minimum effect sizes, the much easier question is asked: effect or no-effect.

The most familiar instantiation of no-effect testing is the biologist who wishes to compare a control group to an experimental group. They will typically use a t-test to test the null hypothesis that the difference between the means of the two groups is zero. The test considers sample size and variance and produces a threshold difference that, under the no-effect model, would be exceeded by chance only 5% of the time. If the observed difference is greater than the threshold, then the researcher rejects the null hypothesis (p < .05).
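As a concrete sketch of this procedure, the following MATLAB snippet runs the conventional no-effect test on simulated data. The group means, spread, and sample sizes are arbitrary illustrative choices, and ttest2 assumes the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical illustration of the conventional no-effect t-test.
% Data are simulated; ttest2 requires the Statistics and Machine Learning Toolbox.
rng(1);                                   % reproducible simulated data
control      = 100 + 15*randn(10,1);      % 10 control measurements
experimental = 112 + 15*randn(10,1);      % 10 experimental measurements

% Null hypothesis as conventionally used: the population means are identical.
[h, p] = ttest2(experimental, control);

fprintf('difference in means = %.1f, p = %.3f, reject null = %d\n', ...
        mean(experimental) - mean(control), p, h);
% h and p say only whether the observed difference cleared the 5 percent
% false-positive threshold of a zero-effect model; they say nothing about
% whether the difference is biologically meaningful.
```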

While it seems like the researcher has quantified their data, nothing has actually happened. The result is either p > .05 (we don’t know the answer) or p < .05 (we already knew the answer). The reason we already knew the answer is that the test has rejected a null hypothesis that was never possible. That is not to say that there must be a biologically important difference between every experimental and every control group. The no-effect null hypothesis is not “there is no important difference between the control and experimental group”. The no-effect null hypothesis is “the control and experimental populations are identical”. That is, our result is no different from having drawn all samples from the same population.

We can reject the no-effect null hypothesis without doing the experiment.

Why is positing the no-effect hypothesis a fundamental problem? First, in a highly interconnected network like a living organism, the proposition that one component is perfectly independent of another component is trivially false. Everything can be assumed to be directly or indirectly connected. Detecting the connection might require extremely sensitive equipment and many samples, but the question is never “Is there a connection?” The meaningful question is always “How strong is the connection?”.

The second problem with the no-effect hypothesis is that all experiments can be assumed to have some non-zero sampling bias 13. For instance, it is now recognized that circadian rhythms have detectable effects on most cellular processes. How much of the published biological literature has strong controls for time of day? The imperfections in experiments don’t have to be systematic to undermine the no-effect hypothesis; they only need to exist.

Finally, no statistical model can be assumed to perfectly represent the biology. Using a statistical test (t-test) that assumes a normal distribution to compare groups with non-normal distributions is the most familiar example of a mismatch between a statistical model and the data. Using resampling-based statistics such as bootstrapping goes a long way towards minimizing assumptions about the biology, but even these methods are not guaranteed to converge on an unbiased prediction of potential results. No statistical model can perfectly predict the distribution of results that would be expected if there was no experimental effect. What, then, does it mean if the real data doesn’t fit a modeled no-effect distribution? Maybe there is an effect, maybe the model is imperfect. The problem is not with using statistical models to describe potential sample distributions. Asking how much bigger an observed effect size is than what would be predicted from a given model is a great way to quantify data. The problem is with treating all detectable deviations from these models as evidence for a given hypothesis.

The upshot of these three sources of guaranteed experimental effects is that p-values for ALL experiments will become infinitesimally small as sample sizes approach infinity. Imagine you have a dial that increases the sample size in all publications. As you turn the dial up, asterisks begin appearing over every plot and bar graph. You can keep turning the knob until each bar has a string of asterisks that run off the page and the conclusion to every finding is that the result was extremely statistically significant. The biology hasn’t changed. The questions haven’t changed. But increasing the sample size has guaranteed the same answer to every question. The conventional statistical approach depends on experiments not being particularly sensitive.
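The following MATLAB sketch simulates turning that dial: it assumes a fixed, biologically trivial true difference (1 unit on a baseline of 100) and simply increases the number of samples per group. The specific numbers are arbitrary, and ttest2 again assumes the Statistics and Machine Learning Toolbox.

```matlab
% A minimal simulation of the "sample size dial": a trivially small true
% difference becomes "highly significant" once enough samples are collected.
% All numbers are illustrative.
rng(2);
baseline   = 100;
trueEffect = 1;      % a difference far too small to matter biologically
sigma      = 15;     % within-group standard deviation

for n = [10 100 1000 10000 100000]
    control      = baseline              + sigma*randn(n,1);
    experimental = baseline + trueEffect + sigma*randn(n,1);
    [~, p] = ttest2(experimental, control);
    fprintf('n = %6d per group, p = %.2g\n', n, p);
end
% The estimated effect stays near 1 unit throughout; only the p-value shrinks.
```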

NHST in practice

NHST is promoted as being more rigorous than the “merely descriptive” practice of estimating effect sizes. This critique makes some sense when we consider what goes into truly rigorous significance testing. The first step is to formulate a quantitative hypothesis that distinguishes between a biologically relevant effect size and a trivial effect size (not a test for no-effect). This hypothesis requires prior knowledge about the relevant biology, the measures being obtained, the meaning of potential effect sizes, and potential sources of sampling error. A statistical model must be chosen whose assumptions are appropriate for the system. A statistical significance threshold must be chosen that is grounded in the costs of false positives and false negatives. This threshold should also reflect the context of an experimental program that might be performing many such experiments. The experiment, particularly the sample size, must be designed with the minimum relevant effect size, criteria for significance, and population variance in mind. Ideally, the experiment, sample sizes, criteria for rejecting the null hypothesis, and statistical methods should be publicly registered prior to data collection 14. Finally, the results of any individual experiment should be analyzed as a modification, and not a replacement, of existing evidence.
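As a sketch of what the sample-size step can look like, the MATLAB snippet below uses a standard normal-approximation power formula; the minimum relevant effect size, standard deviation, and error rates are illustrative assumptions that a real study would have to justify.

```matlab
% A minimal sketch of sample-size planning from a minimum biologically
% relevant effect size, using a normal-approximation power formula.
% All numbers are illustrative assumptions.
minEffect = 20;     % smallest group difference considered biologically relevant
sigma     = 30;     % anticipated within-group standard deviation
zAlpha    = 1.96;   % two-sided false positive rate of 0.05
zBeta     = 0.84;   % power of 0.80 (false negative rate of 0.20)

% Approximate samples needed per group to detect minEffect:
nPerGroup = 2 * ((zAlpha + zBeta) * sigma / minEffect)^2;
fprintf('plan for about %d samples per group\n', ceil(nPerGroup));
% Here, 2*((1.96+0.84)*30/20)^2 is about 35, so roughly 36 samples per group.
```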

Rigorous NHST is, therefore, best suited for guiding decision making about a system that has already been well characterized 8. At the other end of the spectrum, it makes no sense for biologists to predict effect sizes and preregister experiments every time they take measurements from half a dozen cells. The problem is that we still want to share the data. We are taught that publishing means p < 0.05, so we perform the simplest possible NHST. We do a t-test (or maybe a non-parametric test) to reject the hypothesis that the mean of the experimental group is identical to the mean of the control group.

How has biology made progress when the explicit goal of most experiments is to reject a hypothesis that could never be true? First, for many biologists, p-values are performative. They have some understanding of the problems with the tests they are doing and base their conclusions on other types of analysis. Images, histograms, scatterplots, and error bars provide meaningful information about biology and are usually published in parallel with p-values. The second reason the system can kind of work is that testing the no-effect hypothesis can be a crude rule-of-thumb for effect size. If the sample sizes are small and measures are noisy, then a small p-value means there was probably a big effect. The question most biologists are really asking with p-values is: “Given that I only checked a few examples of group A and B, was the effect large enough (relative to variance) that I got a p-value less than 0.05?”

Stories supported by no-effect testing have lost the plot.

When a claim is made in cell biology, it is typically supported by a set of interdependent tests for no-effect. These arguments often take the form of a series of binary decision points 13, such as the following formulation:

1) In a given disease, Measure X is higher than it is in controls (p < .05)

2) In a given disease model, Measure X is also higher than the same measure in control animals (p < .05)

3) In disease model animals receiving a drug, Measure X was not significantly different from that in non-disease model animals (p > .05).

4) The drug is, therefore, a promising treatment for the disease. 

This argument demonstrates several common errors that are well understood to be statistically incorrect. A lack of statistical significance is incorrectly interpreted as evidence for similarity. P-values are compared between experiments even though effect size is ambiguously mixed with sample size and variance. Because the conclusion depends on multiple NHSTs, the overall probability of a false positive is higher than any individual p-value suggests. However, even if the tests for no-effect were reported with perfect statistical rigor, they would not constitute evidence for the argument being made.
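A rough illustration of the multiple-test point: even if the three tests above were independent (an oversimplification), running each at a 0.05 threshold gives a much larger than 5% chance that at least one is a false positive.

```matlab
% Minimal illustration of the multiple-test problem, assuming (for
% illustration only) that the tests are independent.
alpha  = 0.05;
nTests = 3;
familywiseError = 1 - (1 - alpha)^nTests;
fprintf('chance of at least one false positive in %d tests: %.2f\n', ...
        nTests, familywiseError);   % prints 0.14
```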

The analysis that the above argument requires is relatively simple. The researcher needs to establish that a difference in disease state corresponds to a given effect size in Measure X. They then need to show that the drug treatment changes Measure X by a comparable degree. Testing for no-effect does not answer either of these basic questions. It is possible that the effects detected in such tests could be trivially small or very different from one another. It is also possible that the researcher based their beliefs on a scatter plot that provided overwhelming evidence for the drug’s efficacy. Inserting tests for no-effect between the analysis of effect sizes and claims about biology obscures the difference between experiments that reject an impossible null hypothesis and experiments that cure a disease.

Why no-effect testing is imploding.

Combining effect-size-free statistics with Big Data is a disaster. The number of experiments that can be performed, the number of samples that can be analyzed, and the number of measures obtained from each sample have dramatically increased in the last decade. These potential improvements compound many of the problems with testing for no-effect:

Sample size: The utility of testing for no-effect as a rule-of-thumb for effect size evaporates with big data. Large sample sizes cause trivial effects to drive p-values below threshold. When the no-effect hypothesis is being tested, “statistically significant” means even less now than it did 20 years ago. 

Low prior probability: Automated data collection makes it cost effective for a field to test a large number of possible factors (genes, cells, conditions) that have a low prior probability of being important for the phenomena of interest. Under these conditions, a false positive result can be more likely than a true positive 15 (see the sketch after this list). It is possible to adjust statistical criteria to compensate for multiple tests. However, the random walk down a path of trivial connections won’t be reversed if the goal is to test for no-effect. To replicate the previous positive result, you just need more n’s.

Meaningless measures: Data analysis software makes it easy to generate many different highly derived measures from the same data. Interpreting the result in terms of statistical significance makes it possible to claim an effect was observed even when there is no clear biological interpretation of the measure. By contrast, reporting that there was, at least, a 22% increase in measure X invites the question: “Is a 22% increase in X important?”

Opportunity cost: To the extent that researchers pay attention to tests for no-effect, they are turning their attention away from the meat of what their data is telling them. Larger datasets should be making it possible to estimate effect sizes with high precision. Increasing computational power should make it possible to translate these effect sizes into more realistic models of biological systems. Testing the no-effect hypothesis takes massive datasets collected by million-dollar machines and compresses the results down to a form that is no longer biologically meaningful.
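The sketch referenced under "Low prior probability" above is a minimal MATLAB version of the standard positive predictive value calculation (the logic behind reference 15); the false positive rate, power, and prior probability are illustrative assumptions.

```matlab
% A minimal sketch of why a "significant" result can be more likely false
% than true when the prior probability is low. All numbers are illustrative.
alpha = 0.05;    % false positive rate of each test
power = 0.80;    % chance of detecting a factor that really matters
prior = 0.01;    % fraction of tested factors that really matter

truePositives  = power * prior;
falsePositives = alpha * (1 - prior);
ppv = truePositives / (truePositives + falsePositives);
fprintf('fraction of significant results that are real: %.2f\n', ppv);
% With these numbers, only ~14 percent of positive results reflect real effects.
```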

Confidence intervals should fill the niche of biology’s default statistic.

Part of the solution to the crisis in biological statistics is that, when NHST is performed, it should be performed rigorously to test hypotheses about biologically meaningful effect sizes. However, rigorous NHST makes sense for only a subset of research publications. Even in those publications, it usually makes sense to perform only one or two NHSTs to test the main conclusions of the paper. If NHST is not an appropriate statistical workhorse, what should we be talking about in lab meetings, seminars, and the subpanels leading up to the main conclusion of a paper?

Biological quantification should transparently report two classes of information: magnitude and uncertainty. Reporting the magnitude means focusing on an effect size such as a mean, a difference in means, a regression coefficient, or some other biologically interpretable measure. Reporting uncertainty means considering the sample size relative to a model of the variance. Standard error of the mean is the most familiar measure of uncertainty. While significance testing incorporates both magnitude and uncertainty, it does so in a way that leaves critical information unreported or ambiguous.

In contrast, confidence intervals are a relatively lossless way to combine magnitude and uncertainty (example: CI95 = 20% to 40%). As such, confidence intervals are the most commonly proposed p-value alternative 12,16,17. Calculating a confidence interval is, in fact, part of the process of significance testing with p-values. Given this dependency, the difference between reporting p-values and reporting the confidence intervals from which they are derived can seem mathematically trivial 18. However, the practical effect of treating a confidence interval versus a p-value as the endpoint of data analysis is critical.

Consider a biologist who is trying to understand why a particular strain of genetically blind mice has fewer axons (30% fewer) in the optic nerve than are found in normal mice. They want to know if lack of visual experience during development could account for the difference, so they raise some genetically normal mice in the dark (experimental) or light (control) and count the axons in their optic nerves. The results could be reported as:

1)       Result: There were fewer axons in the optic nerves of dark reared mice than in the control group (p < 0.05). [implicit two-tailed test for no-effect]

Conclusion: Light deprivation probably affects axon number.

2)       Result: There were 33% ± 4% (standard error) fewer axons in the dark reared mouse optic nerve than in the control group (p < 0.001, n = 10, 10). [implicit test for no-effect]

Conclusion: Light deprivation probably decreases axon number to a similar degree as observed in the genetically blind mice.

3)       Result: There were 26% to 40% (CI95, n = 10, 10) fewer axons in the dark reared mouse optic nerve than in the control group.

Conclusion: The estimated effect of light deprivation on optic nerve axon number is sufficient to explain most or all of the difference between normal and genetically blind mice.

Version 1 does not answer the original question because the quantification provided is consistent with the following possible explanations: 1) One in a million mice will have one extra axon. 2) Access to food, water, social interaction, temperature, etc. was slightly different between the experimental and control groups. 3) The groups are the same, but axon number is not normally distributed. In contrast, version 3 tells us directly that most of the effect size we are interested in (30% fewer axons in genetically blind mice) could be explained by our manipulation (dark rearing). For most experiments, the number in the confidence interval that is closer to zero is of particular importance because it estimates the smallest effect size that is consistent with the data. This lower bound is, therefore, THE critical number for making a conservative argument about the importance of a given manipulation. Tests for no-effect usually just ask if this lower bound excludes zero. We can ask much more useful questions by comparing this lower bound to predefined meaningful effect sizes. In the example above, the lower bound (26%) can account for most of the effect size (30%) the researcher was trying to explain.

Version 2 of the results is more traditional than version 3 and, mathematically, contains the same information as version 3. However, version 2 is prone to misinterpretation. It is common for the p-value to be incorrectly interpreted as evidence that the point value (mean = 33%) is accurate. More importantly, the number that is most critical to most arguments, the lower bound of the confidence interval, must be calculated by the reader and is unlikely to make an appearance in the conclusions.

Confidence intervals are limited by some of the same assumptions as p-values 19 and there have been arguments for throwing out both in favor of rigorous reporting of effect sizes and experimental design 20. However, NHST for no-effect became ubiquitous because there is a genuine niche for a default statistical analysis that every biologist can calculate and interpret. Swapping out p-values for confidence intervals means that data interpretation will start with a clear statement of effect size and uncertainty about that effect size.

Calculating confidence intervals.

There are a variety of parametric and nonparametric methods for calculating confidence intervals, and many are built into common statistical software packages. For replacing most t-tests, the simplest solution is to use standard error to calculate a range of likely differences between groups. The standard error of each group is first used to calculate a combined standard error (cSE = sqrt(SE1^2 + SE2^2)). Prior to the adoption of statistical software packages, biologists would have been familiar with calculating this error as the first step in performing a t-test. The 95% confidence interval is simply the mean difference plus or minus 1.96 times the combined standard error. This estimate works best for large samples that are normally distributed, but it is robust enough across a range of conditions to be treated as a useful default statistic. Non-parametric confidence intervals can be calculated using a variety of resampling techniques 19,21–26. To experiment with how different kinds of confidence intervals interact with different kinds of data, Matlab (Mathworks) code is available for download (https://github.com/MorganLabShare/betterThanChance).
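As a minimal sketch of this default calculation, the MATLAB snippet below applies the combined standard error interval, and then a percentile bootstrap as a nonparametric check, to simulated axon counts loosely shaped like the dark rearing example. The counts, group sizes, and the 1.96 multiplier are the illustrative assumptions described above; only base MATLAB functions are used.

```matlab
% A minimal sketch of the default confidence interval described above,
% applied to simulated (not real) axon counts.
rng(3);
lightReared = 50000 + 3000*randn(10,1);   % simulated control axon counts
darkReared  = 34000 + 3000*randn(10,1);   % simulated dark-reared axon counts

% 1) 95% confidence interval from the combined standard error
meanDiff = mean(lightReared) - mean(darkReared);
se1 = std(lightReared) / sqrt(numel(lightReared));
se2 = std(darkReared)  / sqrt(numel(darkReared));
cSE = sqrt(se1^2 + se2^2);                  % combined standard error
ci95 = meanDiff + [-1.96 1.96] * cSE;       % normal-approximation CI
fprintf('difference = %.0f axons, CI95 = [%.0f, %.0f]\n', meanDiff, ci95);

% 2) Nonparametric check: percentile bootstrap of the same difference
nBoot = 10000;
bootDiff = zeros(nBoot, 1);
for k = 1:nBoot
    rl = lightReared(randi(numel(lightReared), numel(lightReared), 1));
    rd = darkReared(randi(numel(darkReared), numel(darkReared), 1));
    bootDiff(k) = mean(rl) - mean(rd);      % difference for one resample
end
bootSorted = sort(bootDiff);
bootCI95 = bootSorted(round([0.025 0.975] * nBoot));
fprintf('bootstrap CI95 = [%.0f, %.0f]\n', bootCI95);
```

The bootstrap section simply resamples each group with replacement and recomputes the difference, so it makes fewer assumptions about the underlying distributions than the normal-approximation interval.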

How do you change the culture of biological quantification?

Testing for no-effect is the lowest possible statistical bar for interpreting results as publication worthy. Not using p-values can put individual researchers at a competitive disadvantage in the race to publish results 27. How do you get researchers to adopt practices that will result in fewer publications?

Movements to reform significance testing and emphasize effect size have played out in other fields with vigorous debate and some success 28. The debate in psychology has been particularly well documented 12,29,30. The absurdity of testing for no-effect was perhaps most elegantly pointed out in “The Earth Is Round (p < .05)” 12. Recommendations emerging from the debates over NHST include treating p-values as a small part of a more comprehensive analysis 31,32, integrating p-values with Bayesian analysis 33, and eliminating significance testing altogether 34,35. The resulting ability to take statistical analysis of effect sizes seriously was key to the field of psychology detecting and understanding its crisis in reproducibility 36–39. It is not clear that cell biology is paying enough attention to effect sizes to register whether it has a similar crisis in reproducibility.

The numerous systematic problems with how p-values are applied in biology can make it tempting to ask journals to ban them. However, the root problem in cell biology is not NHST. The problem is the convenient but unambiguously incorrect belief that quantitative rigor means performing a statistical test for effect or no-effect. If cell biologists are going to collect quantitative data (measurements), then we need to ask quantitative questions and expect quantitative answers. These questions and answers don’t need to be mathematically sophisticated; they just need to retain the quantitative meaning of the measurement. Even when the take-home message of an experiment is a binary claim that “gene X is important for disease Y”, the analysis that gets us there must be grounded in an understanding of what effect size constitutes “important”.

What should we be doing to improve quantification in cell biology? Researchers should recognize that testing the no-effect hypothesis does not increase their understanding of the system they are studying. Journal editors should recognize that rejecting the no-effect hypothesis is not a biologically interesting subject to read about. Reviewers should recognize that arguments based on rejecting the no-effect hypothesis are not convincing. When we encounter a claim that a biological conclusion is supported by rejecting the no-effect hypothesis, we can respectfully disagree. We should be comfortable asking: What effect size was being tested for? Why is that effect size important? How do the results relate to that effect size? This shift in focus does not mean getting rid of p-values. It does mean that those using p-values will have to use them with enough statistical sophistication to explicitly define and test a non-zero null hypothesis. Those currently engaging in ritual null hypothesis testing are likely to find that reporting confidence intervals is easier and more informative.

What happens to biology if we stop testing for “no-effect”?

One implication of no longer testing for “no-effect” is that we will have to accept that much of what we currently publish in cell biology should be considered exploratory. That is, we are collecting measurements from cells before we have plausible competing hypotheses that can be distinguished by effect size. Often that’s fine. If we have never seen how a part of the brain is wired, it is reasonable (I hope) to map the organization of that part of the brain without having a quantitative model of what the results should be. In such cases, publishing hypothesis-free confidence intervals makes sense.

No longer testing for no-effect also means that we will have to face the uncertainty in our data. The implicit, and sometimes explicit, interpretation of p-values is that any experiment will tell you either that two groups are the same or that they are different. When we take the confidence intervals seriously, we will often have to conclude that we don’t have enough data to understand what is going on. I have heard the argument that we can’t get rid of p-values because it would take too many mice to make the error bars small enough to reliably report effect size. Big error bars don’t mean you need a different test. Big error bars mean you don’t know the answer. Hopefully, we can adapt to publishing fewer papers with higher quality results.

CODE AVAILABILITY

Matlab (Mathworks) code that can be used to experiment with different types of distributions and confidence intervals is available at https://github.com/MorganLabShare/betterThanChance.

ACKNOWLEDGEMENTS

Thanks to Hrvoje Šikić, Tim Holy, Phil Williams, Daniel Kerschensteiner, and Julie Hodges for reading the manuscript and providing feedback. This work was supported by an unrestricted grant to the Department of Ophthalmology and Visual Sciences from Research to Prevent Blindness, by a Research to Prevent Blindness Career Development Award, and by the NIH (EY029313).

REFERENCES

1.   Rozeboom, W.W. (1960). The fallacy of the null-hypothesis significance test. Psychol. Bull. 57, 416–428.

2.   Goodman, S.N. (1993). p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am. J. Epidemiol. 137, 485–496; discussion 497-501.

3.   Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philos. Sci. 34, 103–115.

4.   Berkson, J. (1938). Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test. J. Am. Stat. Assoc. 33, 526–536.

5.   Hogben, L. (1957). Statistical Theory: The Relationship of Probability, Credibility and Error: An Examination of the Contemporary Crisis in Statistical Theory from a Behaviourist Viewpoint (Norton).

6.   Bakan, D. (1966). The test of significance in psychological research. Psychol. Bull. 66, 423–437.

7.   Sterling, T.D. (1959). Publication Decisions and their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa. J. Am. Stat. Assoc. 54, 30–34.

8.   Calin-Jageman, R.J. (2022). Better Inference in Neuroscience: Test Less, Estimate More. J. Neurosci. 42, 8427–8431.

9.   Fisher, R.A. (1925). Statistical Methods for Research Workers (Oliver and Boyd).

10. Neyman, J., and Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A 231, 289–338.

11. Gigerenzer, G. (2004). Mindless statistics. J. Socio Econ. 33, 587–606.

12. Cohen, J. (1994). The earth is round (p < .05). Am. Psychol. 49, 997–1003.

13. McShane, B.B., Gal, D., Gelman, A., Robert, C., and Tackett, J.L. (2019). Abandon Statistical Significance. Am. Stat. 73, 235–245.

14. Szucs, D., and Ioannidis, J.P.A. (2017). When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment. Front. Hum. Neurosci. 11, 390.

15. Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLoS Med. 2, e124. 10.1371/journal.pmed.0020124.

16. Thompson, B. (2014). The Use of Statistical Significance Tests in Research. J. Exp. Educ. 61, 361–377.

17. Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philos. Trans. R. Soc. Lond. 236, 333–380.

18. Krantz, D.H. (1999). The Null Hypothesis Testing Controversy in Psychology. J. Am. Stat. Assoc. 94, 1372–1381.

19. Fieberg, J.R., Vitense, K., and Johnson, D.H. (2020). Resampling-based methods for biologists. PeerJ 8, e9089.

20. Trafimow, D., and Marks, M. (2015). Editorial. Basic Appl. Soc. Psych. 37, 1–2.

21. Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 7, 1–26.

22. Jung, K., Lee, J., Gupta, V., and Cho, G. (2019). Comparison of Bootstrap Confidence Interval Methods for GSCA Using a Monte Carlo Simulation. Front. Psychol. 10, 2215.

23. Shi, S.G. (1992). Accurate and efficient double-bootstrap confidence limit method. Comput. Stat. Data Anal. 13, 21–32.

24. Letson, D., and McCullough, B.D. (1998). Better Confidence Intervals: The Double Bootstrap with No Pivot. Am. J. Agric. Econ. 80, 552–559.

25. Puth, M.-T., Neuhäuser, M., and Ruxton, G.D. (2015). On the variety of methods for calculating confidence intervals by bootstrapping. J. Anim. Ecol. 84, 892–897.

26. DiCiccio, T.J., Martin, M.A., and Young, G.A. (1992). Fast and accurate approximate double bootstrap confidence intervals. Biometrika 79, 285–295.

27. Smaldino, P.E., and McElreath, R. (2016). The natural selection of bad science. R Soc Open Sci 3, 160384.

28. Morrison, D.E., and Henkel, R.E., eds. (1970). The Significance Test Controversy (Aldine).

29. Harlow, L.L., Mulaik, S.A., and Steiger, J.H. (2013). What if there were no significance tests?

30. Fidler, F., Thomason, N., Cumming, G., Finch, S., and Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: statistical reform lessons from medicine. Psychol. Sci. 15, 119–126.

31. Bonovas, S., and Piovani, D. (2023). On p-Values and Statistical Significance. J. Clin. Med. 12. 10.3390/jcm12030900.

32. Wasserstein, R.L., and Lazar, N.A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. Am. Stat. 70, 129–133.

33. Held, L., and Ott, M. (2018). On P-Values and Bayes Factors. Annu. Rev. Stat. Appl. 5. 10.1146/annurev-statistics-031017-100307.

34. Schmidt, F.L., and Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In What If There Were No Significance Tests? (Lawrence Erlbaum), 37–64.

35. Amrhein, V., and Greenland, S. (2018). Remove, rather than redefine, statistical significance. Nat Hum Behav 2, 4.

36. Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349, aac4716.

37. Van Noorden, R. (2023). Medicine is plagued by untrustworthy clinical trials. How many studies are faked or flawed? Nature Publishing Group UK. 10.1038/d41586-023-02299-w.

38. Errington, T.M., Mathur, M., Soderberg, C.K., Denis, A., Perfito, N., Iorns, E., and Nosek, B.A. (2021). Investigating the replicability of preclinical cancer biology. Elife 10. 10.7554/eLife.71601.

39. Ioannidis, J.P.A. (2005). Contradicted and initially stronger effects in highly cited clinical research. JAMA 294, 218–228.