With four parameters I can fit an elephant,
and with five I can make him wiggle his trunk.
-- John von Neumann

If we are given n distinct x values and corresponding
y values for
each, it is possible to find a curve going exactly through all n
resulting points (x,y); this can be done by setting up a system of
equations
and solving simultaneously. But this is not what regression methods
typically are designed to do. Most
regression methods (e.g., least squares) estimate conditional
means of the response variable given the explanatory
variables. They are not expected to go through all the data
points.

For example, with one explanatory variable X (e.g., height) and response variable Y (e.g., weight), if we fix a value x of X, we have a conditional distribution of Y given X = x (e.g., the conditional distribution of weight for people with height x). This conditional distribution has an expected value (population mean), which we will denote E(Y|X = x) (e.g., the mean weight of people with height x). This is the conditional mean of Y given X = x. It depends on x -- in other words, E(Y|X = x) is a mathematical function of x.
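The idea of a conditional mean can be made concrete with a small computation. In the sketch below, the (height, weight) pairs are invented purely for illustration; an empirical estimate of E(Y|X = x) is just the average weight within each height group:

```python
from collections import defaultdict

# Invented (hypothetical) data: (height in cm, weight in kg) pairs.
data = [(160, 55), (160, 61), (170, 65), (170, 71), (170, 68), (180, 80)]

# Empirical estimate of E(Y | X = x): average weight within each height group.
groups = defaultdict(list)
for height, weight in data:
    groups[height].append(weight)

cond_mean = {h: sum(ws) / len(ws) for h, ws in groups.items()}
print(cond_mean)
```

Note how the result is a function of x: each height gets its own mean weight.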

In least squares regression, one of the model assumptions is that the conditional mean function E(Y|X = x) has a specified form.^{1}

Example: To illustrate, I have used simulated data: five points sampled from a joint distribution where the conditional mean E(Y|X = x) is known to be x^{2}. Fitting a quadratic regression curve takes

E(Y|X = x) = α + βx + γx^{2}

as one of the model assumptions.^{2}

The graph below shows:

- the five data points in red (one at the left is mostly hidden by the green curve)
- the curve y = x^{2} of conditional means (black)
- the graph of the calculated regression equation (in green).

Note: In a real world example, we would not know the conditional mean function (black curve) -- and in most problems, would not even know in advance whether it is linear, quadratic, or something else. Thus, part of the problem of finding an appropriate regression curve is figuring out what kind of function it should be.
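A simulation of this kind can be sketched in a few lines. The seed, x values, and noise level below are my own assumptions, not the ones used for the graph:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# Five x values; the true conditional mean is E(Y|X=x) = x^2.
x = np.linspace(-1, 1, 5)
y = x**2 + rng.normal(scale=0.1, size=x.size)  # noise level is an assumption

# Quadratic least squares fit: E(Y|X=x) = alpha + beta*x + gamma*x^2.
# np.polyfit returns coefficients from highest degree down.
gamma, beta, alpha = np.polyfit(x, y, deg=2)

print(f"fitted: E(Y|X=x) = {alpha:.3f} + {beta:.3f} x + {gamma:.3f} x^2")
```

With only mild noise, the fitted coefficients land near the true values (α = 0, β = 0, γ = 1), but they do not match them exactly -- and the fitted curve does not pass through the data points.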

Continuing with this example, if we (naively) try to get a "good fit" by trying a quartic (fourth degree) regression curve (that is, using a model assumption of the form E(Y|X = x) = α + βx + γx^{2} + δx^{3} + ηx^{4}), the resulting least squares curve passes exactly through all five data points -- yet it is a much poorer estimate of the true conditional mean function than the quadratic fit. The extra parameters end up fitting the noise in the data rather than the conditional means.

If we had instead tried to fit a cubic (third degree) regression curve (that is, using a model assumption of the form E(Y|X = x) = α + βx + γx^{2} + δx^{3}), we would still have overfit the data, although less severely: the unnecessary cubic term is estimated from the noise rather than from the true conditional mean function.
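The quartic case can be checked numerically: with five parameters and five distinct x values, the least squares curve interpolates the data exactly, so its residuals say nothing about fit quality. This sketch uses assumed simulated data, not the points behind the graphs above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 5)
y = x**2 + rng.normal(scale=0.1, size=x.size)  # true mean x^2 plus assumed noise

coef4 = np.polyfit(x, y, deg=4)  # quartic: 5 parameters for 5 points
fitted = np.polyval(coef4, x)

# The quartic interpolates: residuals are (numerically) zero, so a perfect
# "fit" to the sample tells us nothing about the conditional mean function.
max_residual = np.max(np.abs(y - fitted))
print(f"largest quartic residual: {max_residual:.2e}")
```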

How can overfitting be avoided?

As with most things in statistics, there are no hard and fast rules that guarantee success. However, here are some guidelines. They apply to many other types of statistical models (e.g., multilinear, mixed models, general linear models, hierarchical models) as well as least squares regression.

1. Validate your model (for the mean function, or whatever else you are modeling) if at all possible; see Good and Hardin^{3}. Options include:

i. Independent validation
(e.g., wait till the future and see if predictions are accurate)

This of course is not always
possible.

ii. Split the sample. Use one
part for model-building, the other for validation. (See item II(c) of Data Snooping for more discussion.)

iii. Resampling methods.

Details on these methods are beyond the scope of this website; see Chapter 13 of Good and Hardin^{3}, and the further references
provided there, for more information.
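As a concrete illustration of the split-sample idea in (ii), the sketch below fits polynomials of several degrees to one half of an assumed simulated sample and scores them on the held-out half; the sample size, seed, and candidate degrees are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed simulated data: true conditional mean is x^2.
x = rng.uniform(-1, 1, size=40)
y = x**2 + rng.normal(scale=0.1, size=x.size)

train, test = slice(0, 20), slice(20, 40)  # simple half-and-half split

results = {}
for deg in (1, 2, 4, 8):
    coef = np.polyfit(x[train], y[train], deg=deg)
    # Mean squared error on the validation half, not the fitting half.
    results[deg] = np.mean((y[test] - np.polyval(coef, x[test])) ** 2)
    print(f"degree {deg}: held-out MSE = {results[deg]:.4f}")
```

Picking the degree with the smallest held-out error is the simplest version of this strategy; resampling methods such as cross-validation (item iii) average over many such splits.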


2. Gather plenty of (ideally, well-sampled) data.

If you are gathering data
(especially through an experiment), be
sure to consult the literature on optimal design to plan the data
collection to get the tightest possible
estimates from the least amount of data.^{4}

Unfortunately, not much is known about sample sizes needed for good modeling. Ryan^{5} quotes Draper and Smith^{6} as suggesting that the number of observations should be at least ten times the number of terms. Good and Hardin^{3} (p. 183) offer the following conjecture:

"If m points are
required to determine a univariate regression line with sufficient
precision, then it will take at least m^{n} observations
and perhaps n!m^{n} observations to
appropriately characterize and evaluate a regression model with n variables."
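To get a feel for the growth this conjecture implies, take m = 10 and n = 3 (illustrative numbers only, not values from Good and Hardin):

```python
from math import factorial

m, n = 10, 3  # illustrative values only

lower = m ** n                 # "at least m^n observations"
upper = factorial(n) * m ** n  # "perhaps n! m^n observations"

print(f"m={m}, n={n}: between {lower} and {upper} observations")
# → between 1000 and 6000
```

The required sample size grows exponentially in the number of variables -- one reason overfitting is so hard to avoid in high-dimensional models.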

3. Pay particular attention to transparency and avoiding overinterpretation in reporting your results. For example, be sure to state carefully what assumptions you made, what decisions you made, your basis for making these decisions, and what validation procedures you used. Provide (in supplementary online material if necessary) enough detail so that another researcher could replicate your methods.

Notes:

1. Least squares is the most common form of regression. Other types of regression may or may not have this as one of their model assumptions.

2. There are other ways of expressing this model assumption, for example,

y = α + βx + γx^{2} + ε,

or

y_{i} = α + βx_{i} + γx_{i}^{2} + ε_{i}

3. P. I. Good and J. W. Hardin (2006), Common Errors in Statistics (And How to Avoid Them), Wiley.

4. For regression, the values of the explanatory variable (x values, in the above example) do not usually need to be randomly sampled; choosing them carefully can minimize variances and thus give tighter estimates.

5. T. Ryan (2009), Modern Regression Methods, Wiley, p. 20.

6. N. Draper and H. Smith (1998), Applied Regression Analysis, Wiley.

Last updated June 13, 2014