Josh L. Morgan

Washington University in St. Louis, Department of Ophthalmology and Visual Sciences, Neuroscience, Biology and Biomedical Science.

**Abstract**

Significance testing with p-values is still widely treated as the gold standard for quantification in cell biology. However, the null hypothesis being tested, that measures are statistically independent, is not realistic in experimental biology. All parameters within a living system can be assumed to be, at least, indirectly connected and all measurements can be assumed to have some non-zero bias. Basing conclusions on accepting or rejecting the zero-effect hypothesis has, therefore, generated a literature full of meaningless tests of significance. This problem could be solved by basing conclusions on confidence intervals for effect size. Unfortunately, the wide acceptance, low bar, trivial reproducibility, and easy manipulation of testing for zero-effect mean that researchers who quantify effect sizes are at a competitive disadvantage. Collective agreement that testing for zero-effect is not meaningful will be required for mainstream cell biology to become a quantitative science.

**What do we want from statistics?**

As cell biologists, we usually start out with the question “Is X important for Y?” We then obtain a dataset in which we can measure the relationship between X and Y. We then want to figure out if we have obtained a large enough sample size for us to make reliable claims about the relationship of X and Y outside of our dataset. The science of statistics provides mathematical frameworks for describing the reliability of measures given the size of a data set and a model of how values in the larger population are likely to behave. This process of estimating population parameters from sample parameters is straightforward statistical inference.

But researchers want more from statistics. We want to know, given the data, what is the probability that X is important for Y? What number do I use to say that the data supports the hypothesis driving my science? The convenient simplifications and misinterpretations used to force statistical inference into a one-size-fits-all hypothesis test have led to a culture of quantification in cell biology where statistics often do more harm than good.

**Historic Errors**

In 1925 Ronald A. Fisher published “Statistical Methods for Research Workers” in which he described an approach for rejecting, or nullifying, a hypothesis by calculating how often it would be likely to produce an observed effect size (Fisher 1925). He provided tables for calculating p ≤ 0.05 (5%) and p ≤ 0.01 (1%) as practical cut-offs for claiming that the difference between observed data and likely results of the null hypothesis was statistically significant. P ≤ 0.05 was adopted as a standard criterion by the research community, partly for the simplicity of the look-up table. Over time, the practice of using a fixed threshold to reject a null hypothesis was inappropriately mixed with the strategy of hypothesis testing Neyman and Pearson developed for optimizing the costs and benefits in binary decision-making (Goodman 1993; Neyman and Pearson 1933; Gigerenzer 2004). This mixing allowed researchers to claim support for one hypothesis by providing statistically significant evidence against a second hypothesis. Despite statisticians and researchers pointing out that p-values, even under the best circumstances, are of limited utility and that p-values, as commonly used, did not support the claims being made (Rozeboom 1960; Goodman 1993; Meehl 1967; Berkson 1938; Hogben 1957; Bakan 1966; Sterling 1959), rejecting the null hypothesis became the twentieth century’s statistical gold standard for hypothesis driven science.

Within this statistical confusion, an additional convention developed that removed most of the remaining meaning from standard cell biological quantification. The null hypothesis became interpreted as the ‘nil hypothesis’ (Cohen 1994). The nil hypothesis, or zero-effect hypothesis, is some version of “parameter X is independent of parameter Y”. That hypothesis is tested using a model that randomizes the relationship between X and Y and calculates the frequency distribution of results expected for a given sample size. A small p-value means a difference as extreme as the observed difference would rarely be produced if the relationship between X and Y is random. The logic is then inverted (without justification (Lytsy, Hartman, and Pingel 2022)) to support the conclusion that the relationship between X and Y is “probably not random”.

The litany of statistical mistakes involved in this procedure is rendered somewhat moot by the fact that every possible conclusion is true; either “we don’t know (p>.05)” or “probably not random (p≤.05)”. In a highly interconnected system like a living organism, the proposition that one parameter is perfectly independent of another parameter is trivially false. Everything can be assumed to be directly or indirectly connected. Furthermore, all experiments can be assumed to have some non-zero sampling bias (McShane et al. 2019). With unlimited sample sizes, we can then assume a test of any two parameters would pass any arbitrarily low threshold for p-value significance. The interesting biological question is not whether two parameters are statistically independent, but how much influence one parameter has on another parameter. Without information about the magnitude of influence between parameters, it is not possible to build a model of how a system works. Without reporting effect sizes, it is not possible to meaningfully replicate results. Without prediction and replication, biology is not science.

**Why p-value science kind of worked.**

There are special conditions where biologists use p-values correctly. First, there is nothing intrinsic to null hypothesis testing that requires the null hypothesis to be that there is no interaction (zero-effect hypothesis). Testing the null hypothesis that effect size is zero is simply the near-universal convention (Gigerenzer 2004). It is possible to formulate a real hypothesis which distinguishes between a biologically relevant effect size and trivial effect size. Second, if the ultimate goal of a study is to make a binary decision, then the costs of making a mistake can be used to define an optimal rejection criterion (Neyman and Pearson 1933). Testing the hypothesis then requires committing to several decisions prior to data collection and these decisions require first understanding the variance of the population being sampled. The sample size needs to be large enough relative to the variance that effect sizes consistent with the hypothesis can be resolved from trivial effect sizes. Ideally, the experiment, sample sizes, criteria for rejecting the null hypothesis, and statistical methods should be publicly registered prior to data collection (Szucs and Ioannidis 2017). Finally, the results of any individual experiment should be interpreted as a modification, and not a replacement, of existing evidence.

The above conditions are what might be expected of a particularly well designed large clinical trial planned, run, and analyzed with the help of statisticians. Most basic cell biology does not fit this model. For instance, I have never been to a talk in my own field of cellular neuroscience where the above criteria have been met. Instead, p-values are presented to prove that group A is different from group B and are calculated using however many samples could be collected in the available time.

Biology makes progress despite p-values being touted as the final word in quantification for two reasons. First, images, histograms, scatterplots, and error bars provide meaningful information about biology and are usually published in parallel with p-values. The second reason is that p-values work as a crude rule of thumb for effect size as long as sample sizes are small and measures are noisy. The question most biologists are really asking with a p-value is: “Given that I only checked a few examples of group A and a few examples of group B, was there an effect large enough (relative to variance) that I got a p-value less than 0.05?” Much of the time, a small p-value will indicate a respectable effect size.

**Why p-value cell biology is imploding.**

The use of hypothesis testing with p-values has been a problem since its adoption, but the difference between what science could tell us and what p-values tell us is rapidly increasing. While some fields have adapted by incorporating more rigorous statistical standards, mainstream cell biology seems to be heading off a cliff.

Big data: The utility of testing for zero-effect as a rule-of-thumb for effect size evaporates with big data. Large sample sizes cause trivial connections between parameters to pass p-value thresholds. Even reducing the noise in data sets increases the sensitivity to trivial effect sizes, thereby making p-values less informative. Automated data collection, more precise tools, and automated analysis should be great news for science, but the technical sophistication of data collection is counterproductive if the results are boiled down to “statistically significant”. “Statistically significant” now means even less than it did 20 years ago.
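The effect of sample size on a zero-effect test can be illustrated with a few lines of arithmetic. The sketch below is a hypothetical illustration, not an analysis from this paper: it assumes two groups with a known standard deviation of 1, a true difference of only 0.02 standard deviations (a value chosen for illustration), and an observed difference that happens to equal the true difference. The function name `expected_p_value` is my own.

```python
import math

def expected_p_value(effect_sd_units, n_per_group):
    """Two-sided z-test p-value for a difference in means, assuming both
    groups have a known SD of 1 and the observed difference equals the
    true difference (an idealization used only for illustration)."""
    se_diff = math.sqrt(2.0 / n_per_group)      # combined standard error
    z = effect_sd_units / se_diff               # difference in SE units
    return math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail area

tiny_effect = 0.02  # hypothetical: 2% of one standard deviation
for n in (10, 1_000, 100_000, 1_000_000):
    print(f"n per group = {n:>9,}  p = {expected_p_value(tiny_effect, n):.3g}")
```

Under these assumptions the same trivial effect is far from the 0.05 threshold with 10 samples per group and far below it with 100,000 samples per group. The p-value tracks the sample size, not the biological importance of the difference.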

One test to answer every question: The convention of zero-effect testing has become so embedded in the culture of cell biology that there seems to be a lack of awareness that statistics can do anything else. To the extent there is an awareness of other approaches, these are often viewed as “descriptive statistics” that are not critical to the business of hypothesis driven science.

P-hacking: If a student is taught that there are only two experimental results: “Not statistically significant” (you don’t know anything) or P < 0.05 (ready to be published), then they will keep doing experiments until P < 0.05. Under the right conditions, this process of ignoring experiments until they pass threshold can produce a literature with more false positives than true positives (Ioannidis 2005).
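A simplified version of this arithmetic can be sketched as follows. The sketch assumes that every attempt is an independent experiment in which the zero-effect hypothesis is actually true, and that only the first attempt to reach P < 0.05 is reported; real p-hacking is rarely this tidy. The function name is hypothetical.

```python
def chance_of_a_false_positive(n_attempts, alpha=0.05):
    """Probability that at least one of n independent experiments passes
    the alpha threshold when there is no effect at all."""
    return 1.0 - (1.0 - alpha) ** n_attempts

for attempts in (1, 5, 14, 20):
    rate = chance_of_a_false_positive(attempts)
    print(f"{attempts:>2} attempts: {rate:.1%} chance of reaching P < 0.05")
```

With these assumptions, a result that does not exist has a better than even chance of reaching P < 0.05 by the fourteenth attempt.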

Story telling: Results relying on p-values are often presented as a series of binary decision points (McShane et al. 2019). The following formulation is common: “1) Measure X is bigger in disease than control (P < 0.05). 2) Manipulation B is a model for Disease A because its X is also bigger than control (P < 0.05). 3) Treatment of Model B with substance C was successful because its X was not different from control (P > 0.05).” The p-values presented in this formulation do not support the claims being made. The most egregious (and common) error is using the lack of statistical significance as an argument for similarity between groups. However, it is also unclear if X is a useful measure of the disease or if the model is more like the disease than like the control. Without considering sample sizes and effect sizes, storytelling with p-values might as well be telling a story about coin flips.

Data blindness: One of the strangest phenomena related to the use of p-values is when researchers show data that is clearly in conflict with the p-value story they are telling (McShane and Gal 2016). They will claim an important difference when error bars are overlapping or say there was no difference between two groups when the error bars allow for a ten-fold difference. In these cases, finding the correct answer does not require sophisticated analysis. It just requires looking at the data and not the p-value.

Meaningless measures: Data analysis software makes it easy to generate many highly derived variations of the same data. P-values make it possible to base a conclusion on these measures in the absence of a clear biological interpretation of what is being measured. If, instead, a conclusion is based on observing that a manipulation increases measure X by 22% – 28%, then it is natural to ask if a 22% increase in X is important. How does 22% compare to natural variation? How does 22% compare to the difference between healthy and control? What exactly is X? Biologists consistently fail to ask these questions because they are satisfied that the small p-value tells them the difference was ‘statistically significant’.

The combined effects of cell biologists basing their conclusions on statistical significance is that quantification in the modern cell biology literature has been sapped of meaning when it should have been supercharged by technological advances.

**Quantification you can base a theory on**

Convincing biologists to change the way we think about our data is a daunting job. That job is made simpler by the fact that most hypothesis testing in experimental biology asks the same statistical question, “Is group A different from group B?” We can radically improve biological quantification by instead asking, “How different is group A from group B?”

One of the simplest fixes for cell biology’s statistical problems is to replace p-values with a confidence interval for a range of potential differences between groups (dCI) (Cohen 1994; Thompson 2014; Neyman 1937). Confidence intervals combine a quantification of magnitude with a quantification of uncertainty. They make useful quanta for scientific data reporting as all conclusions should be assembled from measures that consider both magnitude and uncertainty. To a statistician, the difference between reporting p-values and reporting the confidence intervals from which they are derived can seem mathematically trivial (Krantz 1999). However, the practical effect of p-values being framed as the endpoint of data analysis is that researchers base their conclusions on “statistical significance” and not the range of potential effect sizes.

Many biologists assume both magnitude and confidence are addressed if we compare two groups using means, standard error, and p-values. It is true that standard error can be interpreted as a confidence interval describing each group. Adding 1.96x error bars (~95% confidence interval) instead of 1x standard error bars would help. It is also true that with the means, sample sizes, and p-values comparing the groups, it is possible to back-calculate a confidence interval for the difference between the means of the groups. But that cryptic confidence interval is not what will be used to draw conclusions. The results will typically be referred to as statistically significant or insignificant without reference to effect size. If effect size is mentioned, it will likely be in the form of the wildly incorrect interpretation that p < 0.05 supports the claim that the observed difference in means is accurate. When someone reports “manipulation X increases measure Y by a factor of 3.5 (P<0.05)”, you can expect that they will discuss the implications of the result in terms of the point difference (3.5x) and not the range of potential differences (?-? x). Directly reporting confidence intervals prevents this misinterpretation: “We found that manipulation X increases measure Y by a factor of 1.2 to 10.6.”

In most analysis of biological data, and certainly in the public consumption of biological research, thinking in terms of confidence intervals is useful and thinking in terms of p-values is not. A researcher or reader can make predictions with a confidence interval. A reader can estimate whether the effect size observed in one study is sufficient to explain the effect size in another study. A reader can compare studies to see if they observed similar effect sizes. Fundamentally, researchers and readers can try to address whether the effect size observed in an experiment is sufficient to explain the phenomena being studied. Consideration of p-values adds little to this analysis besides a long and well documented history of misinterpretation.

One of the core benefits to researchers of thinking in terms of confidence intervals is that they provide useful information outside of the rigorous conditions required to justify the claims of statistical hypothesis testing. In most day-to-day experimental biology, researchers collect data without knowing the variance of the population and without knowing the prior probability of a hypothesis being true or false. A dCI provides a sense of both effect size and the precision with which they can estimate the effect size that is useful both in the initial encounter with new data types and in the final formulation of biological models.

Confidence intervals suffer from some of the same problems as p-values and there have been arguments for throwing out both in favor of rigorous reporting of effect sizes and experimental design (Trafimow and Marks 2015). A principal criticism of the confidence interval is that an arbitrary threshold value, whether 95% or 99.9%, is misinterpreted as a probability that the range contains the true value. That number does not take into account other sources of evidence, prior probabilities, the reliability of the model, the extent to which the assumptions of the model fit the data, or the philosophical subtleties of making claims about probability from estimates of hypothetical frequency. There is also the problem that reporting a 95% interval is an arbitrary cutoff and would require a case by case cost benefit justification of thresholds (Neyman and Pearson 1933) for rational interpretation. However, the practical requirement that we start working with quantitation that embodies both effect size and uncertainty about effect size is urgent. Even if confidence intervals become the new zombie statistic, cell biology will be in a radically better state of quantitative meaning compared to our current system of rejecting the zero-effect hypothesis.

**Getting used to confidence intervals.**

There are a variety of ways to calculate the difference range between two groups. The simplest dCI is the confidence interval derived from the combined standard error (cSE) of two groups (cSE = sqrt(SE1^2 + SE2^2)). Prior to the adoption of statistical software packages, biologists would have been familiar with calculating this error as the first step in performing a t-test. The 95% confidence interval is simply the mean difference plus or minus 1.96 times the combined standard error. This estimate works best for large samples that are normally distributed, but it is robust enough across a range of conditions to be treated as a useful default statistic.
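The calculation above can be sketched in a few lines of Python using only the standard library. The function name and the example values are hypothetical, and the sketch is not the code from the repository accompanying this paper.

```python
import math
import statistics

def dci_from_standard_error(group_a, group_b, z=1.96):
    """Confidence interval for mean(group_b) - mean(group_a) using the
    combined standard error, cSE = sqrt(SE1^2 + SE2^2)."""
    se_a = statistics.stdev(group_a) / math.sqrt(len(group_a))
    se_b = statistics.stdev(group_b) / math.sqrt(len(group_b))
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    return diff - z * combined_se, diff + z * combined_se

# Made-up measurements, for illustration only.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [3.0, 4.0, 5.0, 6.0, 7.0]
low, high = dci_from_standard_error(control, treated)
print(f"difference in means: {low:.2f} to {high:.2f}")
```

For these made-up values each group has a standard error of about 0.71, the combined standard error is 1.0, and the interval runs from 0.04 to 3.96 around an observed difference of 2. With samples this small, a multiplier taken from the t-distribution would give a somewhat wider interval than 1.96.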

A more general solution for calculating confidence intervals is to take advantage of the set of techniques that use bootstrap resampling (Efron 1979; Fieberg, Vitense, and Johnson 2020). The simplest version of the bootstrap method is to recreate a sample group thousands of times by randomly selecting from the same pool of sampled values. Any test values (mean, median, standard deviation, etc.) obtained from the resampled data can be collected and sorted to obtain a confidence interval for that value. This approach can readily be extended to recreating two sample groups and measuring differences between them. More sophisticated versions of this approach, such as bias correction, double-bootstrap, and pivot functions (Jung et al. 2019; Shi 1992; Letson and McCullough 1998; Puth, Neuhäuser, and Ruxton 2015; DiCiccio, Martin, and Young 1992), can be used to improve the performance of the test. Most statistics packages include functions for calculating a bootstrap confidence interval.
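A minimal sketch of the simplest (percentile) version of this bootstrap, extended to the difference between two groups, is shown below. It uses only the Python standard library; the function name, the number of resamples, and the example values are assumptions made for illustration, and none of the more sophisticated corrections cited above are applied.

```python
import random
import statistics

def bootstrap_dci(group_a, group_b, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Recreate each group by sampling its own values with replacement.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    tail = (1.0 - confidence) / 2.0
    low_index = int(tail * n_resamples)
    high_index = int((1.0 - tail) * n_resamples) - 1
    return diffs[low_index], diffs[high_index]

# Made-up measurements, for illustration only.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [3.0, 4.0, 5.0, 6.0, 7.0]
low, high = bootstrap_dci(control, treated)
print(f"bootstrap difference in means: {low:.2f} to {high:.2f}")
```

Replacing `statistics.fmean` with `statistics.median` or another summary gives an interval for that statistic instead. With very small groups, the percentile bootstrap tends to produce intervals that are too narrow, which is one reason the corrected versions exist.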

For an intuitive understanding of the relationship between confidence intervals and possible differences between groups, we can extend the familiar framework of p-value testing. The zero-effect hypothesis is tested by estimating the distribution of results produced by randomly sampling modeled populations with the same mean (**Fig 1a**). The same estimates can be generated for randomly sampling from many different model populations that have a range of differences between their means (**Fig 1b**). For each of thousands of resamplings, we can keep track of how often each difference model produces a test statistic similar to the experimentally observed test statistic (**Fig 1b,c**). We can then select the range of modeled population differences responsible for 95% of the matches with our observed difference (**Fig 1c**). For normally distributed populations, this simulation approach produces essentially the same 95% confidence interval as using the difference in means plus or minus 1.96 times the combined standard error (**Fig 2a**).

**Figure 1:** Visualization of using p-values vs. confidence intervals to quantify the difference in means between two groups. **a.** Distribution of expected differences between means predicted from sampling normal distributions with the same mean. Blue line indicates the zero-effect null hypothesis being tested. Red tails highlight 5% of the most extreme results being defined as statistically significant. Black line indicates observed difference between groups occurring within the alpha range of the distribution. **b.** Distributions of expected differences between means for a range of modeled population differences. Circles correspond to the number of times the color matched curve is expected to produce a result that matches the observed difference in means. **c.** Colored bars indicate the number of times each modeled difference produces a result matching the observed result. Smooth curve indicates results from a finer sampling of potential differences. Red arrows indicate boundaries of potential differences that include 95% of the expected results (shaded curve) that matched the observed result (central black line).
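The simulation illustrated in Figure 1 can be sketched as follows. This is a simplified, hypothetical version written for this explanation and is not the code from the repository accompanying this paper. It assumes normally distributed populations with a known, shared standard deviation, a fixed grid of candidate differences, and a match tolerance of 0.1; all names and values are chosen for illustration.

```python
import random
import statistics

def simulated_dci(observed_diff, sd, n_a, n_b, candidate_diffs,
                  n_sims=1000, tolerance=0.1, confidence=0.95, seed=0):
    """Count how often each modeled population difference reproduces the
    observed difference, then return the range of modeled differences
    that accounts for the central `confidence` fraction of matches."""
    rng = random.Random(seed)
    match_counts = []
    for true_diff in candidate_diffs:
        matches = 0
        for _ in range(n_sims):
            sample_a = [rng.gauss(0.0, sd) for _ in range(n_a)]
            sample_b = [rng.gauss(true_diff, sd) for _ in range(n_b)]
            sim_diff = statistics.fmean(sample_b) - statistics.fmean(sample_a)
            if abs(sim_diff - observed_diff) <= tolerance:
                matches += 1
        match_counts.append(matches)

    # Walk up the grid and find where the cumulative matches cross each tail.
    total = sum(match_counts)
    tail = (1.0 - confidence) / 2.0
    lower = upper = None
    running = 0
    for true_diff, count in zip(candidate_diffs, match_counts):
        running += count
        if lower is None and running >= tail * total:
            lower = true_diff
        if upper is None and running >= (1.0 - tail) * total:
            upper = true_diff
    return lower, upper, match_counts

# Hypothetical experiment: observed difference of 2.0, SD of 1.5, n = 10 per group.
grid = [round(-1.0 + 0.1 * i, 1) for i in range(61)]  # modeled differences, -1 to 5
lower, upper, counts = simulated_dci(2.0, 1.5, 10, 10, grid)
print(f"simulated 95% range of differences: {lower} to {upper}")
```

For comparison, the combined standard error for this hypothetical experiment is 1.5 × sqrt(2/10) ≈ 0.67, so the standard-error formula gives 2.0 ± 1.31, or roughly 0.69 to 3.31. The simulated range should land close to that interval, limited by the 0.1 grid spacing and the number of simulations.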

Applying models whose assumptions don’t match the data or failing to report how a confidence interval was generated can lead to systematic overestimations of the reliability of confidence intervals (Fieberg, Vitense, and Johnson 2020). The code for experimenting with parametric and resampling versions of the above dCI simulation and for comparing the results to standard error and bootstrap methods of confidence interval generation is available for download (https://github.com/MorganLabShare/betterThanChance, **Fig 2b,c**). Non-experts, especially, are encouraged to experiment with the reliability of different confidence interval estimations for different data types.

**Figure 2:** Application for experimenting with confidence intervals for differences between two groups. **a.** Top plot shows parametric (red) and resampling (blue) based confidence interval estimation relative to real population differences (black line) for experiment in which sample size for Group A and Group B is 10. Bottom plot repeats the analysis for sample size of 100 for Group A and B. **b.** Window for testing confidence interval generation against a variety of simulated populations. Ranges of sample sizes, means, and standard deviations can be tested. Resulting plots compare population values to four methods of estimating confidence intervals. **c.** Main window provides fields for entering or randomly generating two sample groups. Samples are displayed as scatter points with the error bars indicating the requested confidence interval (from standard error). Standard confidence intervals and a bootstrap confidence interval are calculated for sample groups. Zero-effect p-values can be generated for comparison. Simulations that generate confidence intervals from a range of modeled population differences can be run on the right side. The distribution of models that produce differences consistent with the observed difference is plotted in black. ‘Test accuracy’ button opens window shown in panel **b.**

**Should we get rid of p-values entirely?**

Movements to replace p-values and significance testing with effect size and confidence intervals have already played out in other fields (Morrison 1970) with the debate in psychology being particularly well documented (Harlow, Mulaik, and Steiger 2013; Cohen 1994; Fidler et al. 2004). There is now a rich literature critiquing the problems with conventional statistical hypothesis testing and suggesting reforms. The recommendations include using smaller p-values as thresholds (Benjamin et al. 2018), using continuous p-values (Amrhein and Greenland 2018), treating p-values as a small part of a more comprehensive analysis (Bonovas and Piovani 2023; Wasserstein and Lazar 2016), integrating p-values with Bayesian analysis (Held and Ott 2018), and eliminating significance testing altogether (Schmidt and Hunter 1997; Amrhein and Greenland 2018).

In cell biology, I believe the most destructive aspect of our current statistical practices is the implicit and explicit belief that the point of doing experiments is to reject the zero-effect hypothesis. The simplest reform is, therefore, for us to learn to identify and disregard statistical arguments based on rejecting the zero-effect hypothesis. This change does not require getting rid of p-values. It does mean that those using p-values will have to use them with enough statistical sophistication to explicitly define and test a non-zero null hypothesis. Those currently engaging in zombie null hypothesis testing are likely to find that reporting confidence intervals is easier and more informative.

Experimental biology data is hard-won and messy. P-values were embraced because they were one-size-fits-all, easy to calculate, and easy to interpret (incorrectly). Proper use of p-values requires careful consideration of the implications of sample size, effect size, the costs of type I error and type II error, unreported tests, and *a priori* probability. Alternatively, we could replace reporting one number (p-value) with reporting two numbers (dCI) and get back to work.

**Admitting we don’t know**

One of the more profound consequences of no longer performing zero-effect testing is that we will have to face the uncertainty in our data. The big-error-bar results reported as “statistically significant” will now have to be reported as “maybe different, maybe pretty much the same”. The big-error-bar ‘N.S.’ (not significant) results for which it was previously implied that there was no difference between groups will have to be reported as “we don’t know”. I have heard the argument that we can’t get rid of p-values because it would take too many mice to make the error bars small enough to reliably report effect size. Biologists need to admit that big error bars don’t mean you need a different test. Big error bars mean you don’t know the answer.

Much of the storytelling biologists currently do using p-values will collapse when differences between groups are properly quantified. If we take four samples using a noisy measurement of a messy parameter, we are going to be uncertain how different it is from our other N = 4 measurement. Again, the problem is not with using confidence intervals. We never had enough information to support the stories being told.

**Reform**

One hundred years after Fisher published his tables for P = 0.05, it is time for cell biologists to adopt a new default statistic. Computers have given us the power to readily perform far more useful analysis than testing for ‘statistical significance’. Computers have also made it easier to collect massive datasets with huge Ns and huge numbers of parameters. If we apply the current thinking behind zero-effect testing to big data, we will be doomed to a meaningless random walk from one P < 0.05 to the next.

Testing the zero-effect hypothesis is too easy to go away by itself. It is easy to calculate a p-value and it is easy to achieve a low p-value if we can just collect more samples. There is, therefore, a strong disincentive for individual researchers to stop using a widely accepted p-value. Reporting effect sizes might show that a statistically significant relationship is biologically negligible, is inconsistent with a model, can’t be replicated, or covers a range of possible values so wide that no useful conclusions can be drawn. Even when a confidence interval rigorously demonstrates that some relationship is important, the analysis might be disparaged as merely descriptive if it lacks an accompanying p-value to reject a potentially nonsensical null hypothesis. Not using p-values in the current culture of cell biology puts individual researchers at a competitive disadvantage in the race to publish results (Smaldino and McElreath 2016).

Change will require that the biological community acts as a group to understand and discourage testing the zero-effect hypothesis. Journals, reviewers, and audience members need to learn to disregard arguments based on these tests. Some journals have already rejected p-values (Trafimow and Marks 2015; Gill 2018), others warn against them (Loftus 1993) or their misuse (for instance Nature directs authors to (Gelman and Stern 2006)). I recommend (https://sites.wustl.edu/morganlab/reject-p-values-in-2025/) that the biology community organize to ask life science journals to implement a policy that:

1) By default, authors should base conclusions on effect size confidence intervals.

2) Basing arguments on p-values requires stating the effect size being tested for and must meet basic criteria for statistical hypothesis testing.

3) Statistical rejection of the zero-effect hypothesis will be treated as biologically trivial.

4) When possible, data should be reported with visual depictions of data distributions such as scatterplots and histograms.

It is not the role of journals to tell researchers how to analyze our data. However, journals routinely reject papers because the hypothesis being tested is not interesting. Journal editors and their readers should be aware that there is no weaker or less interesting claim than statistically rejecting the possibility that two biological parameters are perfectly independent. Given the time, money, and technology devoted to biological research, the public deserves a more interesting and useful answer to its questions than “probably not random”. We can all agree that biology is probably not random.

**References**

Amrhein, Valentin, and Sander Greenland. 2018. “Remove, Rather than Redefine, Statistical Significance.” *Nature Human Behaviour*.

Bakan, D. 1966. “The Test of Significance in Psychological Research.” *Psychological Bulletin* 66 (6): 423–37.

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E-J Wagenmakers, Richard Berk, Kenneth A. Bollen, et al. 2018. “Redefine Statistical Significance.” *Nature Human Behaviour* 2 (1): 6–10.

Berkson, Joseph. 1938. “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test.” *Journal of the American Statistical Association* 33 (203): 526–36.

Bonovas, Stefanos, and Daniele Piovani. 2023. “On P-Values and Statistical Significance.” *Journal of Clinical Medicine* 12 (3): 900. https://doi.org/10.3390/jcm12030900.

Cohen, Jacob. 1994. “The Earth Is Round (p < .05).” *The American Psychologist* 49 (12): 997–1003.

DiCiccio, Thomas J., Michael A. Martin, and G. Alastair Young. 1992. “Fast and Accurate Approximate Double Bootstrap Confidence Intervals.” *Biometrika* 79 (2): 285–95.

Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” *The Annals of Statistics* 7 (1): 1–26.

Fidler, Fiona, Neil Thomason, Geoff Cumming, Sue Finch, and Joanna Leeman. 2004. “Editors Can Lead Researchers to Confidence Intervals, but Can’t Make Them Think: Statistical Reform Lessons from Medicine.” *Psychological Science* 15 (2): 119–26.

Fieberg, John R., Kelsey Vitense, and Douglas H. Johnson. 2020. “Resampling-Based Methods for Biologists.” *PeerJ* 8 (May): e9089.

Fisher, Sir Ronald Aylmer. 1925. *Statistical Methods for Research Workers*. Oliver and Boyd.

Gelman, Andrew, and Hal Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” *The American Statistician* 60 (4): 328–31.

Gigerenzer, Gerd. 2004. “Mindless Statistics.” *The Journal of Socio-Economics* 33 (5): 587–606.

Gill, Jeff. 2018. “Comments from the New Editor.” *Political Analysis: An Annual Publication of the Methodology Section of the American Political Science Association* 26 (1): 1–2.

Goodman, S. N. 1993. “P Values, Hypothesis Tests, and Likelihood: Implications for Epidemiology of a Neglected Historical Debate.” *American Journal of Epidemiology* 137 (5): 485–96; discussion 497–501.

Harlow, L. L., S. A. Mulaik, and J. H. Steiger, eds. 2013. *What If There Were No Significance Tests?* https://www.taylorfrancis.com/books/mono/10.4324/9781315827353/significance-tests-lisa-harlow-stanley-mulaik-james-steiger.

Held, Leonhard, and Manuela Ott. 2018. “On P-Values and Bayes Factors.” *Annual Review of Statistics and Its Application* 5. https://doi.org/10.1146/annurev-statistics-031017-100307.

Hogben, Lancelot. 1957. *Statistical Theory: The Relationship of Probability, Credibility and Error: An Examination of the Contemporary Crisis in Statistical Theory from a Behaviourist Viewpoint*. Norton.

Ioannidis, John P. A. 2005. “Why Most Published Research Findings Are False.” *PLoS Medicine* 2 (8): e124. https://doi.org/10.1371/journal.pmed.0020124.

Jung, Kwanghee, Jaehoon Lee, Vibhuti Gupta, and Gyeongcheol Cho. 2019. “Comparison of Bootstrap Confidence Interval Methods for GSCA Using a Monte Carlo Simulation.” *Frontiers in Psychology* 10 (October): 2215.

Krantz, David H. 1999. “The Null Hypothesis Testing Controversy in Psychology.” *Journal of the American Statistical Association* 94 (448): 1372–81.

Letson, David, and B. D. McCullough. 1998. “Better Confidence Intervals: The Double Bootstrap with No Pivot.” *American Journal of Agricultural Economics* 80 (3): 552–59.

Loftus, Geoffrey R. 1993. “Editorial Comment.” *Memory & Cognition* 21 (1): 1–3.

Lytsy, Per, Mikael Hartman, and Ronnie Pingel. 2022. “Misinterpretations of P-Values and Statistical Tests Persists among Researchers and Professionals Working with Statistics and Epidemiology.” *Upsala Journal of Medical Sciences* 127 (August). https://doi.org/10.48101/ujms.v127.8760.

McShane, Blakeley B., and David Gal. 2016. “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.” *Management Science* 62 (6): 1707–18.

McShane, Blakeley B., David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett. 2019. “Abandon Statistical Significance.” *The American Statistician* 73 (sup1): 235–45.

Meehl, Paul E. 1967. “Theory-Testing in Psychology and Physics: A Methodological Paradox.” *Philosophy of Science* 34 (2): 103–15.

Morrison, D. E., and R. E. Henkel, eds. 1970. *The Significance Test Controversy*. Chicago: Aldine.

Neyman, J. 1937. “Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability.” *Philosophical Transactions of the Royal Society of London* 236 (767): 333–80.

Neyman, J., and E. S. Pearson. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” *Philosophical Transactions of the Royal Society of London* 231A: 289–338.

Norton, Edward C., Bryan E. Dowd, and Matthew L. Maciejewski. 2018. “Odds Ratios-Current Best Practice and Use.” *JAMA: The Journal of the American Medical Association* 320 (1): 84–85.

Puth, Marie-Therese, Markus Neuhäuser, and Graeme D. Ruxton. 2015. “On the Variety of Methods for Calculating Confidence Intervals by Bootstrapping.” *The Journal of Animal Ecology* 84 (4): 892–97.

Rozeboom, W. W. 1960. “The Fallacy of the Null-Hypothesis Significance Test.” *Psychological Bulletin* 57 (September): 416–28.

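Schmidt, F. L., and J. E. Hunter. 1997. “Eight Common but False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data.” In *What If There Were No Significance Tests?*, edited by L. L. Harlow, S. A. Mulaik, and J. H. Steiger, 37–64.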

Shi, Sheng G. 1992. “Accurate and Efficient Double-Bootstrap Confidence Limit Method.” *Computational Statistics & Data Analysis* 13 (1): 21–32.

Smaldino, Paul E., and Richard McElreath. 2016. “The Natural Selection of Bad Science.” *Royal Society Open Science* 3 (9): 160384.

Sterling, Theodore D. 1959. “Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa.” *Journal of the American Statistical Association* 54 (285): 30–34.

Szucs, Denes, and John P. A. Ioannidis. 2017. “When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment.” *Frontiers in Human Neuroscience* 11 (August): 390.

Thompson, Bruce. 2014. “The Use of Statistical Significance Tests in Research.” *Journal of Experimental Education* 61 (4): 361–77.

Trafimow, David, and Michael Marks. 2015. “Editorial.” *Basic and Applied Social Psychology* 37 (1): 1–2.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on P-Values: Context, Process, and Purpose.” *The American Statistician* 70 (2): 129–33.