# Assuming linearity is preserved when variables are dropped

One common mistake in using "variable selection" methods is to assume that if one or more variables are dropped, then the appropriate model using the remaining variables can be obtained simply by deleting the dropped variables from the "full model" (i.e., the model with all the explanatory variables).  This assumption is in general false.

Example: Suppose the true model is E(Y|X1, X2) = 1 + 2X1 + 3X2, so that a linear model fits when Y is regressed on both X1 and X2. If, in addition, E(X2|X1) = log(X1), then taking the conditional expectation given X1 gives E(Y|X1) = E(E(Y|X1, X2)|X1) = 1 + 2X1 + 3E(X2|X1) = 1 + 2X1 + 3log(X1). Thus a linear model does not fit when Y is regressed on X1 alone (and, in particular, the model E(Y|X1) = 1 + 2X1 is incorrect).
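The example can be checked numerically. The following is a minimal simulation sketch, not part of the original example: the uniform distribution for X1, the noise levels, the sample size, and the helper names `simple_ols` and `mse` are all assumptions chosen for illustration. It compares three candidate models for Y given X1 alone: the full model with X2 simply deleted, a straight line refitted on X1, and the conditional mean derived in the example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20000
# Assumed for illustration: X1 uniform on [1, 10], so log(X1) is defined.
x1 = [random.uniform(1.0, 10.0) for _ in range(n)]
# Assumed for illustration: X2 = log(X1) + noise, so E(X2|X1) = log(X1).
x2 = [math.log(a) + random.gauss(0.0, 0.5) for a in x1]
# True model from the example: E(Y|X1, X2) = 1 + 2*X1 + 3*X2.
y = [1.0 + 2.0 * a + 3.0 * b + random.gauss(0.0, 1.0) for a, b in zip(x1, x2)]


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, t):
    """Return (intercept, slope) of the least-squares line for t on x."""
    mx, mt = mean(x), mean(t)
    sxx = sum((a - mx) ** 2 for a in x)
    sxt = sum((a - mx) * (b - mt) for a, b in zip(x, t))
    slope = sxt / sxx
    return mt - slope * mx, slope


def mse(pred, t):
    """Mean squared error between predictions and observed values."""
    return mean([(p - q) ** 2 for p, q in zip(pred, t)])


# Straight line refitted on X1 alone.
intercept, slope = simple_ols(x1, y)

# (a) Full model with the X2 term deleted: 1 + 2*X1.
mse_deleted = mse([1.0 + 2.0 * a for a in x1], y)
# (b) Straight line refitted on X1 alone.
mse_refit = mse([intercept + slope * a for a in x1], y)
# (c) Conditional mean from the example: 1 + 2*X1 + 3*log(X1).
mse_true = mse([1.0 + 2.0 * a + 3.0 * math.log(a) for a in x1], y)

print("refitted line: intercept", round(intercept, 3), "slope", round(slope, 3))
print("MSE, X2 term deleted:     ", round(mse_deleted, 3))
print("MSE, refitted line:       ", round(mse_refit, 3))
print("MSE, 1 + 2*X1 + 3*log(X1):", round(mse_true, 3))
```

Under these assumptions the refitted slope is no longer 2, because X1 partly stands in for the omitted log(X1) term, and neither straight line matches the fit of the nonlinear conditional mean.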

One method that sometimes works to get around this problem is to transform the variables so that they have a multivariate normal distribution, and then work with the transformed variables. This ensures that the conditional mean is a linear function of the transformed explanatory variables, no matter which subset of explanatory variables is chosen. Such a transformation is sometimes possible with some variant of a Box-Cox transformation procedure. See, e.g., pp. 236 and 324-329 of Cook and Weisberg's text [1] for more details.
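As a rough illustration of the Box-Cox idea, the sketch below chooses a power transformation for a single positive variable by maximizing the normal profile log-likelihood over a grid of lambda values. It is a simplified, univariate stand-in for the multivariate procedure that Cook and Weisberg describe: making each variable marginally close to normal does not by itself guarantee joint normality. The log-normal test data, the grid, and the function names `box_cox` and `profile_loglik` are assumptions made for illustration.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed predictor: log-normal data, for which the log
# transformation (lambda = 0) is the one that produces normality.
x = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]


def box_cox(values, lam):
    """Box-Cox power transform; requires strictly positive values."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_loglik(values, lam):
    """Normal profile log-likelihood of lambda, up to an additive constant."""
    z = box_cox(values, lam)
    n = len(z)
    mz = sum(z) / n
    var = sum((v - mz) ** 2 for v in z) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


# Grid search over lambda from -2 to 2 in steps of 0.05.
grid = [i / 20.0 for i in range(-40, 41)]
best_lam = max(grid, key=lambda lam: profile_loglik(x, lam))

x_transformed = box_cox(x, best_lam)
print("selected lambda:", best_lam)
```

In practice a library routine such as `scipy.stats.boxcox` performs the univariate version of this search; the hand-rolled version here only keeps the sketch free of third-party dependencies.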

1. Cook and Weisberg (1999) Applied Regression Including Computing and Graphics, Wiley.