COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Using Plots to Check Model Assumptions
1. Unfortunately, these methods are typically better at telling you when a model assumption does not fit than at confirming that it does.
General Rule of Thumb:
First check any independence assumptions, then any equal-variance assumption, then any assumption on the distribution (e.g., normal) of the error terms.
2. Different techniques have different model assumptions, so additional model-checking plots may be needed; be sure to consult a good reference for the particular technique you are considering using.
Techniques are usually least robust to departures from independence [1] and most robust to departures from normality [2, 3].
Guidelines for Checking Specific Model Assumptions
Checking for Independence
Independence assumptions are
usually formulated in terms of error terms rather than in terms of the
outcome variables. For example, in simple linear regression, the model is
Y = α + βx + ε,
where Y is the outcome (response) variable and ε is the error term (also a random variable). It is the error terms that are assumed to be independent [4], not the values of the response variable.
We do not know the values of the error terms ε,
so we can only
plot the residuals ei (defined as the observed value yi
minus the fitted value, according to the model),
which approximate the error terms.
Rule of Thumb:
To check independence, plot residuals against any time variables
present (e.g., order of observation), any spatial variables
present, and any variables used in the technique (e.g., factors,
regressors). A pattern that is not random suggests lack of independence.
Dependence on time or spatial variables is a common source of lack of independence, but the other plots might also detect it.
1. Because time or
spatial correlations are so frequent, it is
important when making observations to record
any time or spatial variables that could conceivably influence results.
This not only allows you to make the residual plots to detect
possible lack of independence, but also allows you to change to a
technique incorporating additional time or spatial variables if lack of
independence is detected in these plots.
2. Since the residuals sum to zero, they are not independent, so a residual plot is really only a rough approximation to a plot of the errors.
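As a concrete sketch of this rule of thumb (the data are simulated and all variable names are illustrative, not from any particular study), the following fits a simple linear regression in which a hidden time drift contaminates the errors, then examines the residuals against observation order:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a time trend hidden in the errors (lack of independence):
n = 100
order = np.arange(n)            # order of observation (a "time" variable)
x = rng.uniform(0, 10, n)
drift = 0.05 * order            # systematic time effect contaminating the errors
y = 2.0 + 1.5 * x + drift + rng.normal(0, 0.5, n)

# Fit a simple linear regression of y on x and form residuals e_i = y_i - fitted_i.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# The residuals sum to (essentially) zero, so they are not independent;
# the residual plot is only a rough approximation to a plot of the errors.
print("sum of residuals:", residuals.sum())

# Plotting residuals against the time variable (e.g., with matplotlib:
# plt.scatter(order, residuals)) would show an upward drift.  A numeric
# companion to the visual check: correlation of residuals with order.
r_time = np.corrcoef(order, residuals)[0, 1]
print("correlation of residuals with observation order:", r_time)
```

Here a clearly non-random pattern (the drift) survives into the residuals, which is exactly what the plot against the time variable is meant to catch.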
Checking for Equal Variance
Plot residuals against fitted
values (in most cases, these are the estimated conditional means,
according to the model), since it is not uncommon for
conditional variances to depend on conditional means, especially to
increase as conditional means increase. (This would show up as a funnel
or megaphone shape to the residual plot.)
Caution: Hypothesis tests for equality of variance are often not reliable, since they also have model assumptions and are typically not robust to departures from those assumptions.
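A minimal simulated sketch of this check (the data and names are invented for illustration): the conditional standard deviation grows with the conditional mean, and the residual-versus-fitted relationship reveals it numerically as well as visually:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data where the conditional sd grows with the conditional mean --
# the classic "funnel" or "megaphone" situation.
n = 300
x = rng.uniform(1, 10, n)
mean = 1.0 + 2.0 * x
y = mean + rng.normal(0, 0.3 * mean)   # sd proportional to the mean

# Fit by least squares; plotting residuals against fitted values
# (plt.scatter(fitted, residuals)) would show the funnel shape.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Numeric companion to the visual check: when variance is unequal, the
# spread of the residuals (|residual|) increases with the fitted value.
r_spread = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print("correlation of |residual| with fitted value:", r_spread)
```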
Checking for Normality or Other Distributional Assumptions
A histogram (whether of outcome values or of residuals) is not a good way to check for normality, since histograms of the same data but using different bin sizes (class-widths) and/or different cut-points between the bins may look quite different.
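A quick sketch of this bin-size sensitivity, using simulated normal data (the particular counts depend on the random seed):

```python
import numpy as np

rng = np.random.default_rng(2)

# The same 50 draws from a normal distribution, summarized with two
# different bin choices.  The two histograms can suggest quite different
# shapes, which is why a histogram is a poor normality check.
data = rng.normal(loc=0.0, scale=1.0, size=50)

counts_coarse, edges_coarse = np.histogram(data, bins=5)
counts_fine, edges_fine = np.histogram(data, bins=15)

print("5 bins: ", counts_coarse)
print("15 bins:", counts_fine)
```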
Instead, use a probability plot (also known as a quantile plot or Q-Q plot). Most statistical software has a function for producing these.
Caution: Probability plots for
small data sets are often misleading; it is very hard to tell whether
or not a small data set comes from a particular distribution.
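One software route can be sketched as follows, assuming SciPy is available (`scipy.stats.probplot` computes the quantile pairs and a straight-line summary; the data here are simulated):

```python
import numpy as np
from scipy import stats   # assumes SciPy is available

rng = np.random.default_rng(3)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

# probplot returns the theoretical normal quantiles paired with the ordered
# sample, plus a least-squares line (slope, intercept, r) through the points.
# Passing plot=plt.gca() would draw the Q-Q plot; here we inspect the numbers.
(theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)

# For genuinely normal data the points hug a straight line, so r is near 1;
# the slope estimates the standard deviation and the intercept the mean.
print("r =", r, " slope ~ sd =", slope, " intercept ~ mean =", intercept)
```

With a sample this large the line fit is convincing; as the caution above notes, for small samples the plot is much harder to read.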
Checking for Linearity
When considering a simple
linear regression model, it is important to check the linearity
assumption -- i.e., that the conditional means of the response variable
are a linear function of the predictor variable. Graphing the response
variable vs. the predictor can often give a good idea of whether or not this is true. However, one or both of the following refinements may be needed:
1. Plot residuals (instead of
response) vs. predictor. A non-random pattern suggests that a simple
linear model is not appropriate; you may need to transform the response
or predictor, or add a quadratic or higher-order term to the model.
2. Use a scatterplot smoother such as lowess (also known as loess) to
give a visual estimation of the conditional mean. Such smoothers are
available in many regression software packages. Caution: You may need to
choose a value of a smoothness parameter. Making it too large will
oversmooth; making it too small will not smooth enough.
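Refinement 1 can be sketched with simulated data (the names and constants are illustrative): the true conditional mean is quadratic, and the residuals from the straight-line fit retain an obvious pattern:

```python
import numpy as np

rng = np.random.default_rng(4)

# Data whose conditional mean is actually quadratic in x.
n = 200
x = rng.uniform(-3, 3, n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, n)

# Fit the simple linear model anyway; plotting residuals vs. x
# (plt.scatter(x, residuals)) would show a clear U-shaped, non-random
# pattern -- the signature of a missed nonlinearity.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Numeric companion: the residuals from the linear fit still track x**2,
# suggesting that a quadratic term should be added to the model.
r_curve = np.corrcoef(x**2, residuals)[0, 1]
print("correlation of residuals with x**2:", r_curve)
```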
For a linear regression with just two terms, plotting the response (or residuals) against the two terms (in a three-dimensional graph) can help gauge the suitability of a linear model, especially if your software allows you to rotate the graph.
It is not possible to gauge from scatterplots whether a linear model in more than two predictors is suitable. One way to address this problem is to try to transform the predictors to approximate multivariate normality. [5] This will ensure not only that a linear model is appropriate for all the (transformed) predictors together, but also that a linear model remains appropriate when some transformed predictors are dropped from the model. [6]
1. Some techniques may merely require uncorrelated errors rather than
independent errors, but the model-checking plots needed are the same.
2. Robustness to departures from normality is related to the Central Limit Theorem,
since most estimators are linear combinations of the observations, and
hence approximately normal if the number of observations is large.
3. In this context, "robustness" can be formulated in terms of the effect of the departure from a model assumption on the Type I error rate. See van Belle (2008), Statistical Rules of Thumb, pp. 173–177, and the references given there for more detail.
4. In some formulations of regression, the error terms are only assumed
to be uncorrelated, not necessarily independent.
5. See Cook and Weisberg (1999), Applied Regression Including Computing and Graphics, pp. 324–329, for one way to do this.
6. If a linear model fits with all predictors included, it is not true that a linear model will still fit when some predictors are dropped. For example, suppose E(Y|X1, X2) = 1 + 2X1 + 3X2 (so that a linear model fits when Y is regressed on both X1 and X2), but E(X2|X1) = log(X1). Then it can be calculated that E(Y|X1) = 1 + 2X1 + 3log(X1), which says that a linear model does not fit when Y is regressed on X1 alone.
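The calculation in footnote 6 can be checked numerically with a simulation sketch (the noise levels and sample size are arbitrary choices made for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate the setup of footnote 6: E(Y | X1, X2) = 1 + 2*X1 + 3*X2 with
# E(X2 | X1) = log(X1).  By iterated expectations,
# E(Y | X1) = 1 + 2*X1 + 3*log(X1), which is not linear in X1.
n = 200_000
x1 = rng.uniform(1.0, 2.0, n)
x2 = np.log(x1) + rng.normal(0, 0.5, n)        # E(X2 | X1) = log(X1)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(0, 0.5, n)

# Compare the empirical conditional mean of Y, within narrow bins of X1,
# to the claimed formula 1 + 2*x + 3*log(x).
bins = np.linspace(1.0, 2.0, 11)
centers = 0.5 * (bins[:-1] + bins[1:])
which = np.digitize(x1, bins) - 1
empirical = np.array([y[which == k].mean() for k in range(10)])
claimed = 1.0 + 2.0 * centers + 3.0 * np.log(centers)
max_gap = np.abs(empirical - claimed).max()
print("max gap between empirical and claimed conditional mean:", max_gap)
```

The empirical conditional means track 1 + 2x + 3log(x), not a straight line, confirming the footnote's point.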