COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them



Examples of Checking Model Assumptions Using Well-established Facts or Theorems

Note: Here, "well established" means well established by empirical evidence and/or sound mathematical reasoning. This is not the same as "well-accepted," since sometimes things may be well-accepted without sound evidence.

1. Using laws of physics

Hooke's Law says that when a weight that is not too large (below what is called the "elastic limit") is placed on the end of a spring, the length of the (stretched) spring is approximately a linear function of the weight. This tells us that if we do an experiment with a spring by putting various weights (below the elastic limit) on it and measuring the length of the spring, we are justified in using a linear model,

    Length = A×Weight + B
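As a rough illustration of fitting this linear model, the sketch below simulates a spring experiment and estimates A and B by least squares. The spring constants, the noise level, and the weights are all made-up values chosen for the example, not measurements from a real spring.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spring (assumed values, for illustration only):
# A_true = stretch per unit weight, B_true = unloaded length.
A_true, B_true = 0.5, 10.0

# Ten weights, all assumed to lie below the elastic limit.
weights = np.linspace(1.0, 10.0, 10)

# Simulated measured lengths: the linear law plus small measurement error.
lengths = A_true * weights + B_true + rng.normal(0.0, 0.05, size=weights.size)

# Least-squares fit of Length = A*Weight + B.
# np.polyfit returns coefficients from highest degree to lowest.
A_hat, B_hat = np.polyfit(weights, lengths, deg=1)

# Residuals should show no pattern if the linear model is appropriate.
residuals = lengths - (A_hat * weights + B_hat)

print(f"estimated A = {A_hat:.3f}, estimated B = {B_hat:.3f}")
print(f"largest absolute residual = {np.max(np.abs(residuals)):.3f}")
```

With real data, a clear curve in the residuals at the largest weights would suggest that some weights exceeded the elastic limit, so the linear model no longer applies there.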

2. Using the Central Limit Theorem

The Central Limit Theorem [1] says that for most distributions, a linear combination (e.g., the sum or the mean) of a large enough number of independent random variables is approximately normal. Thus, if the random variable in question is the sum of many independent random variables, then it is usually [2] safe to assume that it is approximately normal.
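The statement above can be checked informally by simulation. The sketch below sums 50 independent exponential random variables (a strongly skewed distribution) and compares the skewness of a single variable with that of the sum. The choice of distribution, the number of terms, and the helper function `skewness` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

n_terms = 50        # number of independent variables in each sum (assumed)
n_samples = 20_000  # number of simulated sums

# A single exponential variable is strongly right-skewed.
single = rng.exponential(1.0, size=n_samples)

# Each row holds n_terms independent exponentials; sum across each row.
sums = rng.exponential(1.0, size=(n_samples, n_terms)).sum(axis=1)

print(f"skewness of one variable: {skewness(single):.2f}")
print(f"skewness of the sum:      {skewness(sums):.2f}")
```

A normal distribution has skewness 0, so a skewness much closer to 0 for the sum than for a single variable is consistent with the sum being approximately normal. A histogram or normal quantile plot of `sums` would give a fuller check.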

For example, adult human heights (at least if we restrict to one sex [3]) are the sum of many heights: the heights of the ankles, lower legs, upper legs, pelvis, many vertebrae, and head. Empirical evidence suggests that these heights vary roughly independently (e.g., the ratio of height of lower leg to that of upper leg varies considerably). Thus it is plausible by the Central Limit Theorem that human heights are approximately normal. This in fact is supported by empirical evidence.

The Central Limit Theorem can also be used to reason that some distributions are approximately lognormal -- that is, that the logarithm of the random variable is approximately normal. For example, the distribution of a pollutant might be determined by successive independent dilutions of an original emission. In mathematical terms, the amount of pollution (call this random variable Y) in a given small region is the product of independent random variables. Thus log Y is the sum of independent random variables. If the number of successive dilutions is large enough, the reasoning above shows that log Y is approximately normal, and hence that Y is approximately lognormal. [4, 5]
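The dilution argument can be sketched in a small simulation. Here each dilution multiplies the pollutant amount by an independent factor drawn uniformly between 0.5 and 1.0. The size of the original emission, the distribution of the dilution factors, and the number of dilutions are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

n_dilutions = 40    # successive independent dilutions (assumed)
n_regions = 20_000  # number of simulated small regions
emission = 100.0    # original emission, arbitrary units (assumed)

# Hypothetical dilution factors, independent and uniform on (0.5, 1.0).
factors = rng.uniform(0.5, 1.0, size=(n_regions, n_dilutions))

# Y is the product of the independent factors times the original emission,
# so log Y is a constant plus a sum of independent random variables.
Y = emission * factors.prod(axis=1)
log_Y = np.log(Y)

print(f"skewness of Y:     {skewness(Y):.2f}")
print(f"skewness of log Y: {skewness(log_Y):.2f}")
```

In this sketch Y itself is strongly right-skewed while log Y has skewness near 0, which is what one would expect if Y is approximately lognormal.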



1. Actually, there are several versions of the Central Limit Theorem, essentially concerning different types of distributions. The paraphrase given here is good enough for most practical purposes. See also the Rice Virtual Lab in Statistics' Sampling Distribution Simulation, which can be used to show how the version of the Central Limit Theorem for means works for various distributions.
2. Notable exceptions arise when the random variables being summed have "heavy tails" (also called leptokurtic), are strongly bimodal, or are very strongly skewed, especially if the number of terms in the sum is not large.
3. If we consider both sexes, then we lose independence, since the average height for males is higher than the average height for females. However, since the average height for males is not that much higher than the average height for females, it turns out that the overall distribution of heights for all adult humans is not far from normal -- the mode is a little off to one side, and the top is slightly wider than for a normal distribution.
4. In practice, one would usually work with log Y, using a technique that requires approximate normality.
5. For more about lognormal distributions, see Ott (1995), Environmental Statistics and Data Analysis; van Belle (2008), Statistical Rules of Thumb, pp. 88-90; and the Life is Lognormal website and further references given there.

Updated Sept. 25, 2011