COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them




Problems with Stepwise Model Selection Procedures

"... perhaps the most serious source of error lies in letting statistical procedures make decisions for you."

"Don't be too quick to turn on the computer. Bypassing the brain to compute by reflex is a sure recipe for disaster."

Good and Hardin, Common Errors in Statistics (and How to Avoid Them), p. 3, p. 152

Various algorithms have been developed to aid in model selection. Many of them are "automatic," in the sense that they have a "stopping rule" (which the researcher may be able to set or change from a default value) based on criteria such as the value of a t-statistic or an F-statistic. Others might be better termed "semi-automatic," in the sense that they automatically list various candidate models, together with values of measures that can help evaluate them.

Caution: Different regression software packages may use the same name (e.g., "Forward Selection" or "Backward Elimination") to designate different algorithms. Be sure to read the documentation to find out just what the algorithm does in the software you are using -- in particular, whether it has a stopping rule or is of the "semi-automatic" variety.

Cook and Weisberg1 (p. 280) comment,

"We do not recommend such stopping rules for routine use since they can reject perfectly reasonable submodels from further consideration. Stepwise procedures are easy to explain, inexpensive to compute, and widely used. The comparative simplicity of the results from stepwise regression with model selection rules appeals to many analysts. But, such algorithmic model selection methods must be used with caution."

They give an example (pp. 280 - 281) illustrating how stepwise regression algorithms will generally yield models that make the retained terms look more important than they really are, and whose R2 values may be misleadingly large.

Ryan2 (pp. 269 - 273 and 284 - 286) elaborates on these points. One underlying problem with methods based on t- or F-statistics is that they effectively ignore problems of multiple inference.
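This inflated-significance problem is easy to demonstrate by simulation. The sketch below (illustrative only, not from the original text) runs a greedy forward selection with a partial F-test stopping rule on data in which the response is pure noise, unrelated to all 20 candidate predictors. Because the stopping rule tests only the single best of many candidates at each step -- ignoring the multiple comparisons -- it can still admit terms and report a nontrivial R2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))        # 20 candidate predictors
y = rng.normal(size=n)             # response: pure noise, unrelated to X

def fit_rss(cols):
    """Residual sum of squares of an OLS fit on the given columns (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

def forward_select(alpha=0.05):
    """Greedy forward selection with a partial F-test stopping rule."""
    selected, remaining = [], list(range(p))
    rss_cur = fit_rss(selected)
    while remaining:
        # Pick the candidate that reduces RSS the most...
        cand = min(remaining, key=lambda j: fit_rss(selected + [j]))
        rss_new = fit_rss(selected + [cand])
        df2 = n - len(selected) - 2            # residual df after adding the term
        F = (rss_cur - rss_new) / (rss_new / df2)
        # ...but judge it by a single-term p-value, as if it were the
        # only comparison made -- this is the multiple-inference problem.
        if stats.f.sf(F, 1, df2) > alpha:
            break
        selected.append(cand)
        remaining.remove(cand)
        rss_cur = rss_new
    tss = ((y - y.mean()) ** 2).sum()
    return selected, 1.0 - rss_cur / tss

selected, r2 = forward_select()
print(f"selected {len(selected)} of {p} pure-noise predictors; R^2 = {r2:.2f}")
```

Re-running with different seeds shows how often "significant" terms are found even though the true model contains none.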

Alternatives to Stepwise Selection Methods

Alternatives include comparing candidate models using criteria such as Mallows' Cp statistic3, Akaike's Information Criterion (AIC) and related information-theoretic methods4, or methods of simultaneous inference5.

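As an illustration of one such alternative, here is a minimal sketch (simulated data; not from the original text) of the information-criterion approach discussed in note 4: score every candidate submodel by AIC and compare them all, rather than committing to a single greedy path. For a Gaussian linear model with k estimated parameters, AIC = n log(RSS/n) + 2k, and smaller is better.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + rng.normal(size=n)          # simulated: only x1 truly matters
X = np.column_stack([x1, x2, x3])

def aic(cols):
    """AIC of a Gaussian linear model on the given columns (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = ((y - A @ beta) ** 2).sum()
    k = A.shape[1] + 1                      # regression coefficients + error variance
    return n * np.log(rss / n) + 2 * k

# Enumerate and score all 8 submodels of the 3 candidate predictors.
models = [cols for r in range(4) for cols in combinations(range(3), r)]
ranked = sorted(models, key=aic)
for cols in ranked[:3]:
    print(cols, round(aic(cols), 1))
```

With a strong simulated signal in x1, submodels containing x1 score far better than those without it; whether a noise variable sneaks into the top-ranked model varies from sample to sample, which is why Burnham and Anderson emphasize comparing the full set of candidates rather than reporting only one "winner."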
Notes:
1. R.D. Cook and S. Weisberg (1999), Applied Regression Including Computing and Graphics, Wiley
2. T. Ryan (2009), Modern Regression Methods, Wiley
3. Mallows' Cp statistic is discussed in, e.g., Cook and Weisberg (pp. 272 - 280), Ryan (pp. 273 - 277 and 279 - 283), and R. Berk (2004), Regression Analysis: A Constructive Critique, Sage (pp. 130 - 135); see also Lecture Notes on Selecting Terms
4. K. P. Burnham and D. R. Anderson (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed., Springer, has an extensive discussion of Akaike's Information Criterion and related methods. For some common mistakes in using AIC, see pp. 63, 66, 108, 119
5. W. Liu (2011), Simultaneous Inference in Regression, CRC Press. Liu also has Matlab® programs for implementing these procedures available from his website.

Last updated January 20, 2012