Advances in high-throughput biology and computer science are driving an exponential increase in the number of hypothesis tests in genomics and other scientific disciplines. Studies using current genotyping platforms frequently include a million or more tests. In addition to the monetary cost, this increase imposes a statistical cost owing to the multiple testing corrections needed to avoid large numbers of false-positive results. To safeguard against the resulting loss of power, some have suggested sample sizes on the order of tens of thousands that can be impractical for many diseases or may lower the quality of phenotypic measurements. This study examines the relationship between the number of tests on the one hand and power, detectable effect size or required sample size on the other. We show that once the number of tests is large, power can be maintained at a constant level, with comparatively small increases in the effect size or sample size. For example, at the 0.05 significance level, a 13% increase in sample size is needed to maintain 80% power for ten million tests compared with one million tests, whereas a 70% increase in sample size is needed for 10 tests compared with a single test. Relative costs are less when measured by increases in the detectable effect size. We provide an interactive Excel calculator to compute power, effect size or sample size when comparing study designs or genome platforms involving different numbers of hypothesis tests. The results are reassuring in an era of extreme multiple testing.

The numbers of hypothesis tests in science and genomics, in particular, are increasing at an ever-expanding rate. Total studies and hypothesis tests per study have both increased exponentially since the 1920s, when the conventional 0.05 significance level was first adopted.1,2 Simultaneously, as technological advances have provided the means to easily measure, store and manipulate huge quantities of data, the practice of testing a single a priori hypothesis has become more difficult to justify. With the almost inevitable large numbers of hypothesis tests in a single experiment comes the well-recognized need to use some type of statistical correction for multiple testing to avoid generating ever-increasing numbers of false-positive results.3,4,5 The more stringent level of evidence required necessarily reduces the power to identify a true-positive finding. As a consequence, some experts now advocate larger and larger sample sizes for genomic studies6,7 that are impractical for many diseases and, even when practical, may require broader, more heterogeneous phenotype definitions and less costly, more imprecise phenotypic measurements to accomplish.

Current genotype microarray technology is now approaching a capacity of five million single-nucleotide polymorphisms (SNPs) per study.8 As whole genome sequencing becomes widely available, the number of tests will continue to increase. Investigation of a variety of phenotype definitions, genetic models and subsets of individuals also increases the total number of hypothesis tests actually underlying any reported finding.9 Other fields (for example, neuroimaging or data mining of medical records) face a similar burden of multiplicity of tests.10,11 Cross-disciplinary studies, such as genomic analyses of neuroimaging or other high-dimensional phenotypes, will only further expand the problem.

There are reasonable arguments as to when it is appropriate to statistically adjust for multiple tests and how best to do so.12,13,14 The universe of tests subject to adjustment, the level of stringency, the underlying assumptions and the statistical methods are all subjects of debate. Whatever adjustment is used, extensive multiple testing inevitably leads to some loss of power or the need for a compensatory increase in the detectable effect size and/or sample size.
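The 13% and 70% figures quoted in the abstract can be checked with a short calculation. The sketch below (a minimal Python illustration, not the interactive Excel calculator provided with the study; the function names are illustrative) assumes a Bonferroni correction and a two-sided z-test, under which the required sample size at a fixed effect size is proportional to the squared sum of the critical-value and power quantiles, while the detectable effect size at a fixed sample size is proportional to that sum itself.

```python
# Sketch: relative cost of more hypothesis tests under a Bonferroni
# correction, assuming a two-sided z-test where the required sample size
# is proportional to (z_critical + z_power)^2 for a fixed effect size.
from scipy.stats import norm

def z_sum(alpha: float, power: float, n_tests: int) -> float:
    """Sum of the critical and power quantiles for one test at a
    Bonferroni-adjusted two-sided significance level alpha / n_tests."""
    alpha_per_test = alpha / n_tests
    return norm.isf(alpha_per_test / 2) + norm.ppf(power)

def sample_size_ratio(alpha: float, power: float, m1: int, m2: int) -> float:
    """Factor by which the sample size must grow to keep the same power and
    effect size when moving from m1 to m2 tests (n scales with z_sum squared)."""
    return (z_sum(alpha, power, m2) / z_sum(alpha, power, m1)) ** 2

def effect_size_ratio(alpha: float, power: float, m1: int, m2: int) -> float:
    """Factor by which the detectable effect size must grow at a fixed
    sample size (the effect size scales with z_sum, not its square)."""
    return z_sum(alpha, power, m2) / z_sum(alpha, power, m1)

if __name__ == "__main__":
    alpha, power = 0.05, 0.80
    print(sample_size_ratio(alpha, power, 1, 10))          # ~1.70: 10 tests vs 1
    print(sample_size_ratio(alpha, power, 10**6, 10**7))   # ~1.13: 10 million vs 1 million tests
    print(effect_size_ratio(alpha, power, 1, 10))          # ~1.30
    print(effect_size_ratio(alpha, power, 10**6, 10**7))   # ~1.06
```

Under these assumptions, the script gives sample-size ratios of roughly 1.70 (10 tests versus a single test) and 1.13 (ten million versus one million tests), matching the 70% and 13% figures above, and smaller ratios of roughly 1.30 and 1.06 for the detectable effect size, consistent with the observation that relative costs are lower when measured on that scale.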