### COMMON MISTEAKS MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

# Assuming linearity is preserved when variables are dropped

One common mistake in using "variable selection"
methods is to assume that if one or more variables are dropped, then
the appropriate model using the remaining variables can be obtained
simply by deleting the dropped variables from the "full model" (i.e.,
the model with all the explanatory variables). This assumption is in general false.

Example:
Suppose the true model is E(Y | X_{1}, X_{2}) = 1 + 2X_{1} + 3X_{2},
so that a linear model fits when Y is regressed on
both X_{1} and X_{2}.
If, in addition, E(X_{2} | X_{1}) = log(X_{1}),
then taking the conditional expectation given X_{1} alone yields
E(Y | X_{1}) = 1 + 2X_{1} + 3E(X_{2} | X_{1})
= 1 + 2X_{1} + 3log(X_{1}).
Thus a linear model does not
fit when Y is regressed on X_{1}
alone; in particular, the model E(Y | X_{1})
= 1 + 2X_{1} is
incorrect.
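The example can be checked numerically. The following is a minimal simulation sketch, not part of the original discussion: the distribution of X_{1} (uniform on 1 to 20), the noise standard deviations, the sample size, and the helper names `solve_linear_system`, `ols`, and `simulate` are all assumptions chosen for illustration. It generates data with E(X_{2} | X_{1}) = log(X_{1}) and E(Y | X_{1}, X_{2}) = 1 + 2X_{1} + 3X_{2}, then compares three least-squares fits.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def simulate(n=20000, seed=0):
    """Draw (X1, X2, Y) satisfying the two conditional-mean assumptions."""
    rng = random.Random(seed)
    # Assumed marginal for X1; it must be positive so log(X1) is defined.
    x1 = [rng.uniform(1.0, 20.0) for _ in range(n)]
    # E(X2 | X1) = log(X1)
    x2 = [math.log(v) + rng.gauss(0.0, 0.5) for v in x1]
    # E(Y | X1, X2) = 1 + 2 X1 + 3 X2
    y = [1 + 2 * a + 3 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
    return x1, x2, y


x1, x2, y = simulate()
ones = [1.0] * len(y)
log_x1 = [math.log(v) for v in x1]

full_fit = ols([ones, x1, x2], y)         # Y on X1 and X2
naive_fit = ols([ones, x1], y)            # Y on X1 alone, straight line
correct_fit = ols([ones, x1, log_x1], y)  # Y on X1 and log(X1)

print("Y ~ X1 + X2      :", [round(c, 2) for c in full_fit])
print("Y ~ X1           :", [round(c, 2) for c in naive_fit])
print("Y ~ X1 + log(X1) :", [round(c, 2) for c in correct_fit])
```

Under these assumptions, the first and third fits should land close to the coefficients (1, 2, 3), while the straight-line fit on X_{1} alone should give an intercept and slope noticeably different from 1 and 2, because it absorbs part of the omitted log(X_{1}) term.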

One method that sometimes works to get around this problem is to
transform the variables to have a multivariate normal distribution,
then work with the transformed variables. This will ensure that the
conditional means are a linear function of the transformed explanatory
variables, no matter which subset of explanatory variables is chosen.
Such a transformation is sometimes possible with some variant of a
Box-Cox transformation procedure. See, e.g., pp. 236 and 324-329 of Cook
and Weisberg's text^{1} for more details.
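As a rough illustration of the Box-Cox idea, the sketch below estimates the power parameter for a single positive variable by maximizing the normal-theory profile log-likelihood over a grid. This is a simplified, one-variable version and an assumption on our part: it is not the multivariate procedure described in the reference, and making each variable look normal on its own does not guarantee joint normality. The function names, the grid, and the simulated lognormal data are hypothetical choices for the example.

```python
import math
import random


def box_cox(values, lam):
    """Box-Cox power transform of positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Normal-theory profile log-likelihood of lambda, up to a constant."""
    n = len(values)
    z = box_cox(values, lam)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def estimate_lambda(values, grid=None):
    """Pick the grid value of lambda with the largest profile log-likelihood."""
    if grid is None:
        grid = [i / 20.0 for i in range(-40, 41)]  # -2.0 to 2.0 in steps of 0.05
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumed test data: a right-skewed lognormal sample, for which the log
# transform (lambda = 0) is the normalizing choice.
rng = random.Random(1)
x = [math.exp(rng.gauss(1.0, 0.5)) for _ in range(2000)]

lam_hat = estimate_lambda(x)
skew_before = skewness(x)
skew_after = skewness(box_cox(x, lam_hat))

print("estimated lambda:", lam_hat)
print("skewness before :", round(skew_before, 3))
print("skewness after  :", round(skew_after, 3))
```

For lognormal data the estimated lambda should fall near 0 and the skewness should shrink toward 0 after transformation. In practice, each candidate subset model should still be checked for linearity after transforming, since this sketch addresses only marginal distributions.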

1. Cook and Weisberg (1999) Applied Regression Including Computing and
Graphics, Wiley.