Using Plots to Check Model Assumptions

Overall Cautions:

1. Unfortunately, these methods are typically better at telling you when a model assumption does not hold than at confirming that it does.

2. Different techniques have different model assumptions, so additional model checking plots may be needed; be sure to consult a good reference for the particular technique you are considering using.

General Rule of Thumb: First check any independence assumptions, then any equal-variance assumption, then any distributional assumption (e.g., normality) on the variables.

Rationale: Techniques are usually least robust to departures from independence [1] and most robust to departures from normality [2, 3].

Suggestions and Guidelines for Checking Specific Model Assumptions

Checking for Independence

Independence assumptions are usually formulated in terms of error terms rather than in terms of the outcome variables. For example, in simple linear regression, the model equation is
Y = α + βx + ε,
where Y is the outcome (response) variable and ε denotes the error term (also a random variable). It is the error terms that are assumed to be independent [4], not the values of the response variable.

We do not know the values of the error terms ε, so we can only plot the residuals eᵢ (defined as the observed value yᵢ minus the fitted value ŷᵢ given by the model), which approximate the error terms.

Rule of Thumb: To check independence, plot residuals against any time variables present (e.g., order of observation), any spatial variables present, and any variables used in the technique (e.g., factors, regressors). A pattern that is not random suggests lack of independence.

Rationale: Dependence on time or spatial variables is a common source of lack of independence, but the other plots might also detect lack of independence.

Comments:

1. Because time or spatial correlations are so frequent, it is important when making observations to record any time or spatial variables that could conceivably influence results. This not only allows you to make the residual plots to detect possible lack of independence, but also allows you to change to a technique incorporating additional time or spatial variables if lack of independence is detected in these plots.

2. Since the residuals sum to zero, they are not themselves independent, so these plots give only a rough approximation.
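
To illustrate the rule of thumb above, here is a minimal sketch in Python (using numpy, statsmodels, and matplotlib; the data and variable names are invented for the example): fit a regression and plot the residuals against the order of observation.

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)           # hypothetical predictor
    y = 2 + 0.5 * x + rng.normal(size=100)     # hypothetical response

    # Fit a simple linear regression and extract the residuals.
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Plot residuals against order of observation; a non-random pattern
    # (e.g., drift or cycles) suggests lack of independence over time.
    plt.scatter(range(len(y)), fit.resid)
    plt.axhline(0, color="gray")
    plt.xlabel("Order of observation")
    plt.ylabel("Residual")
    plt.show()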


Checking for Equal Variance 

Plot residuals against fitted values (in most cases, these are the estimated conditional means, according to the model), since it is not uncommon for conditional variances to depend on conditional means, especially to increase as conditional means increase. (This would show up as a funnel or megaphone shape to the residual plot.)
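
A minimal sketch of such a plot in Python (invented data, statsmodels and matplotlib as in the sketch above):

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=100)
    # Noise whose spread grows with x, to mimic unequal variance.
    y = 2 + 0.5 * x + rng.normal(size=100) * (0.2 + 0.3 * x)

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Residuals vs. fitted values; a funnel or megaphone shape suggests
    # that the conditional variance increases with the conditional mean.
    plt.scatter(fit.fittedvalues, fit.resid)
    plt.axhline(0, color="gray")
    plt.xlabel("Fitted value")
    plt.ylabel("Residual")
    plt.show()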

Caution: Hypothesis tests for equality of variance are often not reliable, since they also have model assumptions and are typically not robust to departures from these assumptions.

Checking for Normality or Other Distribution

Caution: A histogram (whether of outcome values or of residuals) is not a good way to check for normality, since histograms of the same data using different bin widths (class widths) and/or different cut-points between the bins may look quite different.
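
This sensitivity to binning is easy to demonstrate; a small sketch in Python (invented data):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    data = rng.normal(size=60)   # one sample from a normal distribution

    # The same data, histogrammed three ways, can look quite different:
    # few bins, many bins, and custom cut-points between bins.
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, bins in zip(axes, [4, 12, np.arange(-3.25, 3.3, 0.5)]):
        ax.hist(data, bins=bins)
    plt.show()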

Instead, use a probability plot (also known as a quantile plot or Q-Q plot); most statistical software has a function for producing these.
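
For example, a minimal sketch in Python using scipy's probplot function (statsmodels' qqplot would also work; the residuals here are simulated for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(3)
    residuals = rng.normal(size=100)   # in practice, use your model's residuals

    # Points falling near the reference line are consistent with normality.
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()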

Caution: Probability plots for small data sets are often misleading; it is very hard to tell whether or not a small data set comes from a particular distribution.

Checking for Linearity

When considering a simple linear regression model, it is important to check the linearity assumption -- i.e., that the conditional means of the response variable are a linear function of the predictor variable. Graphing the response variable vs. the predictor can often give a good idea of whether or not this is true. However, one or both of the following refinements may be needed:

1. Plot residuals (instead of the response) vs. the predictor. A non-random pattern suggests that a simple linear model is not appropriate; you may need to transform the response or predictor, or add a quadratic or higher-order term to the model.

2. Use a scatterplot smoother such as lowess (also known as loess) to give a visual estimate of the conditional mean (see the sketch below). Such smoothers are available in many regression software packages. Caution: You may need to choose the value of a smoothness parameter; making it too large will oversmooth, while making it too small will not smooth enough.
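
A minimal lowess sketch in Python using statsmodels (data invented; frac is the smoothness parameter mentioned in the caution above):

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(4)
    x = np.sort(rng.uniform(0, 10, size=100))
    y = np.sin(x) + rng.normal(scale=0.3, size=100)   # a curved relationship

    # lowess returns sorted x-values paired with smoothed estimates of
    # the conditional mean; frac controls the degree of smoothing.
    smoothed = lowess(y, x, frac=0.3)

    plt.scatter(x, y, alpha=0.5)
    plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
    plt.show()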
 
When considering a linear regression with just two predictor terms, plotting the response (or residuals) against the two predictors in a three-dimensional graph can help gauge the suitability of a linear model, especially if your software allows you to rotate the graph.
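
A minimal sketch of such a three-dimensional plot in matplotlib (invented data; interactive backends allow rotating the axes):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    x1 = rng.uniform(0, 5, size=100)
    x2 = rng.uniform(0, 5, size=100)
    y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=100)

    # Response plotted against both predictors; rotating the view can help
    # judge whether the points lie roughly in a plane.
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(x1, x2, y)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_zlabel("y")
    plt.show()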

Caution: It is not possible to gauge from scatterplots whether a linear model in more than two predictors is suitable. One way to address this problem is to try to transform the predictors to approximate multivariate normality [5]. This will ensure not only that a linear model is appropriate for all the (transformed) predictors together, but also that a linear model remains appropriate when some transformed predictors are dropped from the model [6].
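
As a rough sketch only: scipy's one-variable Box-Cox transformation applied to each positive predictor separately. This is a simplification; Cook and Weisberg describe choosing the transformation parameters jointly to target multivariate normality.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    X = rng.lognormal(size=(100, 3))   # skewed, strictly positive predictors

    # Box-Cox each column toward (univariate) normality; stats.boxcox
    # returns the transformed data and the estimated lambda.
    X_transformed = np.column_stack(
        [stats.boxcox(X[:, j])[0] for j in range(X.shape[1])]
    )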



1. Some techniques may merely require uncorrelated errors rather than independent errors, but the model-checking plots needed are the same.

2. Robustness to departures from normality is related to the Central Limit Theorem, since most estimators are linear combinations of the observations, and hence approximately normal if the number of observations is large.

3. In this context, "robustness" can be formulated in terms of the effect of a departure from a model assumption on the Type I error rate. See van Belle (2008), Statistical Rules of Thumb, pp. 173–177, and the references given there for more detail.

4. In some formulations of regression, the error terms are only assumed to be uncorrelated, not necessarily independent.

5. See Cook and Weisberg (1999), Applied Regression Including Computing and Graphics, pp. 324–329, for one way to do this.

6. If a linear model fits with all predictors included, it does not follow that a linear model will still fit when some predictors are dropped. For example, suppose E(Y | X1, X2) = 1 + 2X1 + 3X2 (so that a linear model fits when Y is regressed on both X1 and X2), but E(X2 | X1) = log(X1). Then, by iterated expectations, E(Y | X1) = E(E(Y | X1, X2) | X1) = 1 + 2X1 + 3log(X1), so a linear model does not fit when Y is regressed on X1 alone.
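
A small simulation sketch (Python with statsmodels; all names invented) illustrating the calculation: when E(X2 | X1) = log(X1), regressing Y on X1 and log(X1) recovers a clearly nonzero log coefficient, confirming that a straight line in X1 alone is inadequate.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    x1 = rng.uniform(1, 10, size=500)
    x2 = np.log(x1) + rng.normal(scale=0.2, size=500)   # E(X2 | X1) = log(X1)
    y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=500)      # E(Y | X1, X2) linear

    # Regressing Y on X1 and log(X1) recovers coefficients near (1, 2, 3),
    # showing that the log term is needed once X2 is dropped.
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, np.log(x1)]))).fit()
    print(fit.params)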