"Recognize that any frequentist statistical test has a random chance of indicating significance when it is not really present. Running multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one "significant" result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading."

Professionalism Guideline 8, Ethical Guidelines for Statistical Practice, American Statistical Association, 1997

Performing multiple inference without adjusting the Type I error rate accordingly is a common error in research using statistics.

The Problem

Recall that if you perform a hypothesis test using a certain significance level (we will use 0.05 for illustration), and if you obtain a p-value less than 0.05, then there are three possibilities:

- The model assumptions for the hypothesis test are not satisfied in the context of your data.
- The null hypothesis is false.
- Your sample happens to be one of the 5% of samples satisfying the appropriate model conditions for which the hypothesis test gives you a Type I error.

Joint Type I error rate: The probability that a randomly chosen sample (of the given size, satisfying the appropriate model assumptions) will give a Type I error for at least one of the hypothesis tests performed.

The joint Type I error rate is also known as the overall Type I error rate, the joint significance level, the simultaneous Type I error rate, the family-wise error rate (FWER), the experiment-wise error rate, etc. The acronym FWER is becoming more and more common, so it will be used in the sequel, often along with another name for the concept as well.

An especially serious form of neglect of the problem of multiple inference is the one alluded to in the quote above: trying several tests and reporting just one significant test, without disclosing how many tests were performed or correcting the significance level to take into account the multiple inference.

The problem of multiple inference also occurs for confidence intervals. In this case, we need to focus on the confidence level. Recall that a 95% confidence interval is an interval obtained by using a procedure which, for 95% of all suitably random samples of the given size from the random variable and population of interest, produces an interval containing the parameter we are estimating (assuming the model assumptions are satisfied). In other words, the procedure does what we want (i.e., gives an interval containing the true value of the parameter) for 95% of suitable samples.

If we are using confidence intervals to estimate two parameters, there is no reason to believe that the 95% of samples for which the procedure "works" for one parameter will be the same as the 95% of samples for which the procedure "works" for the other parameter.

If we are calculating confidence intervals for more than one parameter, we can talk about the joint (or overall, simultaneous, family-wise, or experiment-wise) confidence level. For example, a group of confidence intervals (for different parameters) has an overall 95% confidence level (or 95% family-wise confidence level, etc.) if the intervals are calculated using a procedure which, for 95% of all suitably random samples of the given size from the population of interest, produces for each parameter an interval containing that parameter (assuming the model assumptions are satisfied).
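The joint confidence level can be illustrated with a short simulation. The sketch below (pure Python, with made-up normally distributed data and a known standard deviation, purely for illustration) estimates how often two individual 95% intervals both cover their parameters at once:

```python
import random

random.seed(1)
z = 1.959964          # two-sided 95% critical value for the normal
n, trials = 25, 2000  # sample size per interval, simulation runs
both_cover = 0

for _ in range(trials):
    covers = []
    for true_mean in (0.0, 0.0):   # two parameters, each truly 0
        sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
        m = sum(sample) / n
        half_width = z / n ** 0.5  # known sigma = 1 for simplicity
        covers.append(abs(m - true_mean) <= half_width)
    both_cover += all(covers)

joint_coverage = both_cover / trials
print(joint_coverage)  # near 0.95**2 = 0.9025, not 0.95
```

With independent samples the joint coverage is about 0.95² ≈ 0.90, which is why each individual interval must be widened (e.g., to the 97.5% level each, via the Bonferroni method below) to achieve a 95% joint confidence level.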

What to do about it

Unfortunately, there is no simple formula to cover all cases: depending on the context, the samples giving Type I errors for two tests might be the same, they might have no overlap, or they might be somewhere in between. Various techniques for bounding the FWER (joint Type I error rate) or otherwise dealing with the problem of multiple inference have been devised for various special circumstances.

Bonferroni method:

Fairly basic probability calculations can show that if the sum of the Type I error rates for different tests is less than α, then the overall Type I error rate (FWER) for the combined tests will be at most α.

- So, for example, if you are performing five hypothesis tests and would like an FWER (overall significance level) of at most 0.05, then using significance level 0.01 for each test will give an FWER (overall significance level) of at most 0.05.
- Similarly, if you are finding confidence intervals for five parameters and want an overall confidence level of 95%, using the 99% confidence level for each confidence interval will give you overall confidence level at least 95%. (Think of confidence level as 1 - α.)
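In code, the Bonferroni adjustment is a one-line division. The sketch below uses five made-up p-values (illustrative numbers only, not from any real study):

```python
# Hypothetical p-values from five pre-planned tests
p_values = [0.003, 0.020, 0.041, 0.112, 0.340]
alpha_family = 0.05  # desired bound on the FWER

# Bonferroni: run each test at alpha_family / (number of tests)
per_test_alpha = alpha_family / len(p_values)  # 0.05 / 5 = 0.01

rejected = [p <= per_test_alpha for p in p_values]
print(rejected)  # only the first test survives the adjustment
```

Note that the second and third p-values are below 0.05 and so would look "significant" in isolation, but neither survives the adjustment.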

The Bonferroni method is also useful for apportioning the overall Type I error between different types of inference -- in particular, between pre-planned inference (the inference planned as part of the design of the study) and "data-snooping" inferences (inferences based on looking at the data and noticing other things of interest). For example, to achieve an overall Type I error rate of .05, one might apportion an overall significance level of .04 to the pre-planned inference and .01 to data-snooping. However, this apportioning should be done before analyzing the data.

Whichever method is used, it is important to make the calculations based on the number of tests that have been done, not just the number that are reported. (See Data Snooping for more discussion.)

False discovery rate:

An alternative to bounding Type I error was introduced by Benjamini and Hochberg in 1995.

The False Discovery Rate (FDR) of a group of tests is the expected value^5 of the ratio of falsely rejected hypotheses to all rejected hypotheses.

Note that the family-wise error rate (FWER) focuses on the possibility of making any error among all the inferences performed, whereas the false discovery rate (FDR) tells you what proportion of the rejected null hypotheses are, on average, really true. Bounding the FDR rather than the FWER may be a more reasonable choice when many inferences are performed, especially if there is little expectation of harm from falsely rejecting a null hypothesis. Thus it is increasingly being adopted in areas such as micro-array gene expression experiments and neuro-imaging. As with the FWER, there are various methods of actually bounding the false discovery rate.

Efron has used the phrase "false discovery rate" in a slightly different way in his development of empirical Bayes methods for dealing with multiple inference.

Multiple inference in regression:

Not accounting for multiple inference in regression is a common mistake. There are at least three types of situations in which this often occurs:

1. Many stepwise variable selection methods involve multiple inference. (The methods described in item (3) below offer an alternative in some cases.)

2. An analysis may involve inference for more than one regression coefficient. Often a good way to handle this is by using confidence regions. These are generalizations of confidence intervals to more than one dimension. For example, in simple linear regression, the joint confidence region for the intercept β₀ and the slope β₁ is an ellipse.

3. An analysis may consider confidence intervals for conditional means at more than one value of the predictors. Many standard software packages will allow the user to plot confidence bands easily. These typically show confidence intervals for conditional means that are calculated individually. However, when considering more than one confidence interval, one needs instead simultaneous confidence bands, which account for multiple inference. These are less well known. A discussion of several methods may be found in W. Liu (2011), Simultaneous Inference in Regression, CRC Press. Liu also has Matlab® programs for calculating the confidence bands available from his website (click on the link to the book).
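One classical simultaneous band (among the methods Liu surveys) is the Working-Hotelling band, which replaces the pointwise t critical value with a larger multiplier derived from the F distribution. The sketch below (simulated data, assuming numpy and scipy are available) contrasts the two multipliers for a simple linear regression:

```python
# Pointwise vs. simultaneous (Working-Hotelling) 95% band multipliers
# in simple linear regression, on made-up data
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # simulated data

n = x.size
xbar = x.mean()
Sxx = ((x - xbar) ** 2).sum()
b1 = ((x - xbar) * (y - y.mean())).sum() / Sxx  # slope estimate
b0 = y.mean() - b1 * xbar                       # intercept estimate
resid = y - (b0 + b1 * x)
s = np.sqrt((resid ** 2).sum() / (n - 2))       # residual std. error

# Standard error of the estimated conditional mean at each x
se_mean = s * np.sqrt(1.0 / n + (x - xbar) ** 2 / Sxx)

# Pointwise band: a t critical value applied at each x separately
t_crit = stats.t.ppf(0.975, df=n - 2)
# Working-Hotelling band: valid simultaneously for all x
w_crit = np.sqrt(2 * stats.f.ppf(0.95, dfn=2, dfd=n - 2))

print(t_crit, w_crit)  # the simultaneous multiplier is the larger one
```

The simultaneous band is ŷ(x) ± w_crit · se_mean, wider than the pointwise band ŷ(x) ± t_crit · se_mean by the constant ratio w_crit / t_crit at every x.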

Subtleties and controversies

Bounding the overall Type I error rate (FWER) will reduce the power of the tests, compared to using individual Type I error rates. Some researchers use this as an argument against multiple inference procedures. The counterargument is the argument for multiple inference procedures to begin with: Neglecting them will produce excessive numbers of false findings, so that the "power" as calculated from single tests is misleading.

Bounding the False Discovery Rate (FDR) will usually give higher power than bounding the overall Type I error rate (FWER).

Consequently, it is important to consider the particular circumstances, as in considering both Type I and Type II errors in deciding significance levels. In particular, it is important to consider the consequences of each type of error in the context of the particular research. Examples:

- A research lab is using hypothesis tests to screen genes for possible candidates that may contribute to certain diseases. Each gene identified as a possible candidate will undergo further testing. If the results of the initial screening are not to be published except in conjunction with the results of the secondary testing, and if the secondary screening is inexpensive enough that many second level tests can be run, then the researchers could reasonably decide to ignore overall Type I error in the initial screening tests, since there would be no harm or excessive expense in having a high Type I error rate. However, if the secondary tests are expensive, the researchers would reasonably decide to bound either family-wise Type I error rate or False Discovery Rate.
- Consider a variation of the situation in the previous example: The researchers are using hypothesis tests to screen genes as before, but plan to publish the results of the screening without doing secondary testing of the candidates identified. In this situation, ethical considerations would warrant bounding either the FWER or the FDR -- and taking pains to emphasize in the published report that these results are just a preliminary screening for possible candidates, and that these preliminary findings need to be confirmed by further testing.

Notes:

1. A. M. Strasak et al. (The Use of Statistics in Medical Research, The American Statistician, February 1, 2007, 61(1): 47-55) report that, in an examination of 31 papers from the New England Journal of Medicine and 22 from Nature Medicine (all papers from 2004), 10 (32.3%) of those from NEJM and 6 (27.3%) from Nature Medicine were "Missing discussion of the problem of multiple significance testing if occurred."

These two journals are considered the top journals (by impact factor) in clinical science and in research and experimental medicine, respectively.

2. For a simulation illustrating this, see Jerry Dallal's demo. This simulates the results of 100 independent hypothesis tests, each at the 0.05 significance level. Click the "test/clear" button to see the results of one set of 100 tests (that is, for one sample of data). Click the button two more times (first to clear and then to run another simulation) to see the results of another set of 100 tests (i.e., for another sample of data). Notice as you continue that (i) which tests give Type I errors (i.e., are statistically significant at the 0.05 level) varies from sample to sample, and (ii) which samples give Type I errors for a given test varies from test to test. (To see the latter point, it may help to focus on just the first column.)
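The headline number behind the demo can also be computed directly: if all 100 null hypotheses are true and the tests are independent, the chance of at least one spurious "significant" result is

```python
# Chance of at least one Type I error among 100 independent tests,
# each run at significance level 0.05, when every null is true
p_at_least_one = 1 - 0.95 ** 100
print(round(p_at_least_one, 3))  # about 0.994
```

so a run of the demo with no false positives at all is quite rare.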

3. Chapters 3 and 4 of B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge (or see his Stats 329 notes) contain a summary of various attempts to deal with multiple inference.

Nichols, T. and S. Hayasaka (2003), Controlling the familywise error rate in functional neuroimaging: a comparative review, Statistical Methods in Medical Research 12: 419-446 (accessible at http://www.fil.ion.ucl.ac.uk/spm/doc/papers/NicholsHayasaka.pdf) gives a survey of Bonferroni-type methods and two other approaches (random field and permutation tests) to bounding FWER, focusing on applications in neuroimaging. They discuss model assumptions for each approach and present results of simulations to help users decide which method to use. The Mindhive webpage P threshold FAQ (accessible at http://mindhive.mit.edu/node/90 or http://mindhive.mit.edu/book/export/html/90; note: links from both pages seem to be broken) gives a less technical summary of the multiple-comparison problem, with summaries of some of Nichols and Hayasaka's discussion.

See also
Hochberg, Y. and Tamhane, A. (1987), Multiple Comparison Procedures, Wiley

Miller, R.G. (1981) Simultaneous Statistical Inference 2nd Ed., Springer

P. H. Westfall and S. S. Young (1993), Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley

B. Phipson and G. K. Smyth (2010), Permutation P-values Should Never Be Zero: Calculating Exact P-values when Permutations are Randomly Drawn, Statistical Applications in Genetics and Molecular Biology, Vol. 9, Iss. 1, Article 39, DOI: 10.2202/1544-6155.1585

F. Betz, T. Hothorn, P. Westfall (2010), Multiple Comparisons Using R, CRC Press

S. Dudoit and M. J. van der Laan (2008), Multiple Testing Procedures with Application to Genomics, Springer


4. Y. Benjamini and Y. Hochberg (1995), Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 57 No. 1, 289 - 300.

5. "Expected value" is another term for the mean of a distribution. Here, the distribution is the sampling distribution of the ratio of falsely rejected hypotheses to all rejected hypotheses.

6. See, for example:

Y. Benjamini and D. Yekutieli (2005), False Discovery Rate-Adjusted Multiple Confidence Intervals for Selected Parameters, Journal of the American Statistical Association, March 1, 2005, 100(469): 71-81.

Y. Benjamini and D. Yekutieli (2001), The Control of the False Discovery Rate in Multiple Testing under Dependency, The Annals of Statistics, Vol. 29, No. 4, 1165-1186.

Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 57 No. 1, 289 - 300


7. B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge, or see his Stats 329 notes.

This page last revised 1/21/2013