COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them



 Summary Statistics for Skewed Distributions

Measure of Center

When we report the mean of a variable, we are presumably trying to capture what happens "on average," or perhaps "typically." The mean is very appropriate for this purpose when the distribution is symmetrical, and especially when it is "mound-shaped," such as a normal distribution. For a symmetrical distribution, the mean is in the middle; if the distribution is also mound-shaped, then values near the mean are typical.

But if a distribution is skewed, then the mean is usually not in the middle.

Example: The mean of the ten numbers 1, 1, 1, 2, 2, 3, 5, 8, 12, 17  is 52/10 = 5.2. Seven of the ten numbers are less than the mean, with only three of the ten numbers greater than the mean.

A better measure of the center for this distribution would be the median, which in this case is (2+3)/2 = 2.5. Five of the numbers are less than 2.5, and five are greater.
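For readers who like to verify such arithmetic with code, here is a minimal sketch in Python (standard library only; the data are just the ten numbers above):

from statistics import mean, median

data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]
print(mean(data))                         # 5.2 -- pulled toward the long right tail
print(median(data))                       # 2.5 -- the average of the 5th and 6th values
print(sum(x < mean(data) for x in data))  # 7 of the 10 values fall below the mean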


Notice that in this example, the mean is greater than the median. This is common for a distribution that is skewed to the right (that is, bunched up toward the left and with a "tail" stretching toward the right).

Similarly, a distribution that is skewed to the left (bunched up toward the right with a "tail" stretching toward the left) typically has a mean smaller than its median. (See http://www.amstat.org/publications/jse/v13n2/vonhippel.html for discussion of exceptions.)

(Note that for a symmetrical distribution, such as a normal distribution, the mean and median are the same.)

For a practical example (one I have often given my students):
Suppose a friend is considering moving to Austin and asks you what houses here typically cost. Would you tell her the mean or the median house price? Housing prices (in Austin, at least -- think of all those Dellionaires) are skewed to the right. Unless your friend is rich, the median housing price would be more useful than the mean housing price (which would be larger than the median, thanks to the Dellionaires' expensive houses).

In fact, many distributions that occur in practical situations are skewed, not symmetric. (For some examples, see the Life is Lognormal! website.)

Implications for Applying Statistical Techniques

How do we work with skewed distributions when so many statistical techniques give information about the mean? First, note that most of these techniques assume that the random variable in question has a distribution that is normal. Many of these techniques are somewhat "robust" to departures from normality -- that is, they still give pretty accurate results if the random variable has a distribution that is not too far from normal. But many common statistical techniques are not valid for strongly skewed distributions. Two possible alternatives are:

I. Taking logarithms of the original variable.

Fortunately, many of the skewed random variables that arise in applications are  lognormal. That means that the logarithm of the random variable is normal, and hence most common statistical techniques can be applied to the logarithm of the original variable. (With robust techniques, approximately lognormal distributions can also be handled by taking logarithms.) However, doing this may require some care in interpretation. There are three common routes to interpretation when dealing with logs of variables.
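First, though, a quick simulated sketch (assuming NumPy and SciPy are available; the sample is simulated, not real data) showing that the logarithm of a lognormal sample does look approximately normal:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # right-skewed original variable
y = np.log(x)                                     # the logs should look normal

print(stats.skew(x))            # clearly positive: right-skewed
print(stats.skew(y))            # near 0: roughly symmetric
print(stats.shapiro(y).pvalue)  # a normality test applied to the logs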

1. In many fields, it is common to work with the log of the original outcome variable, rather than the original variable. Thus one might do a hypothesis test for equality of the means of the logs of the variables. A difference in the means of the logs will tell you that the original distributions are different, which in some applications may answer the question of interest. 
2. For situations that require interpretation in terms of the original variable, we can often exploit the fact that the logarithm transformation and its inverse, the exponential transformation, preserve order. This implies that they take the median of a variable to the median of the transformed variable. So if a variable X is lognormal and we take its logarithm, Y = log X [1], we get a normal distribution, whose mean is the same as its median. If we back-transform (by exponentiating -- so X = exp(Y) [1]), the median of Y goes to the median of X. Thus statements about means of the log-transformed variable Y give us statements about medians of the original variable X. (Note that in this situation, the original variable X is skewed, so we probably should be talking about its median rather than its mean anyhow.) We can also back-transform a confidence interval for the mean of Y to get a confidence interval for the median of X. (Typically, a confidence interval for the mean of Y will be symmetric about the estimated mean of Y, but the confidence interval for the median of X obtained by back-transforming will not be symmetric about the estimated median.) A code sketch of routes 1 and 2 follows this list.

3. In some situations, we can use properties of logs to say useful things when we back-transform. For example, if we regress Y = log10(X) on U and get Y = a + bU + error, then we can say that increasing U by one unit increases the median of X by a factor of 10^b. [2]
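Here is the promised sketch of routes 1 and 2, again with simulated (hypothetical) lognormal samples and assuming NumPy/SciPy: a t-test on the logs, followed by back-transforming a confidence interval for the mean of the logs into one for the median of the original variable:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.lognormal(mean=1.0, sigma=0.5, size=200)
x2 = rng.lognormal(mean=1.3, sigma=0.5, size=200)
y1, y2 = np.log(x1), np.log(x2)

# Route 1: test equality of the means of the logs.
print(stats.ttest_ind(y1, y2).pvalue)

# Route 2: a t-based CI for mean(Y), back-transformed to a CI for median(X).
ci_y = stats.t.interval(0.95, len(y1) - 1, loc=y1.mean(), scale=stats.sem(y1))
print(np.exp(ci_y))  # note: not symmetric about exp(y1.mean())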

Note: Not all skewed distributions are close enough to lognormal to be handled using a log transformation.  Sometimes other transformations (e.g., square roots) can yield a distribution that is close enough to normal to apply standard techniques. However, interpretation will depend on the transformation used.
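For instance (a quick simulated sketch, assuming NumPy and SciPy), a square-root transform often tames mildly right-skewed count data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
counts = rng.poisson(lam=3.0, size=1000)  # simulated right-skewed counts
print(stats.skew(counts))                 # positive
print(stats.skew(np.sqrt(counts)))        # closer to 0 after the transform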

II.  Quantile Regression Techniques

Standard regression estimates the mean of the conditional distribution (conditioned on the values of the predictors) of the response variable. For example, in simple linear regression, with one predictor X and response variable Y, we calculate an equation y = a + bx that tells us that when X takes on the value x, the mean of Y is approximately a + bx. [3] Quantile regression is a method for estimating conditional quantiles [4], including the median. For more on quantile regression, see http://www.econ.uiuc.edu/~roger/research/rq/rq.html.
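For a flavor of what this looks like in software, here is a minimal sketch using the QuantReg model in the statsmodels Python package (the data are simulated/hypothetical; treat this as a sketch rather than a recipe):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=300)})
df["y"] = np.exp(0.2 * df["x"] + rng.normal(0, 0.5, size=300))  # skewed response

median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)  # estimates the conditional median
print(median_fit.params)
# Other quantiles work the same way, e.g. .fit(q=0.25) for the first quartile.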

Measures of Spread

For a normal distribution, the standard deviation is a very appropriate measure of variability (or spread) of the distribution. (Indeed, if you know a distribution is normal, then knowing its mean and standard deviation tells you exactly which normal distribution you have.) But for skewed distributions, the standard deviation gives no information on the asymmetry. It is better to use the first and third quartiles [4], since these will give some sense of the asymmetry of the distribution.
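As a quick illustration (simulated data, assuming NumPy), the two halves of the interquartile range are unequal for skewed data, which the standard deviation alone cannot show:

import numpy as np

rng = np.random.default_rng(4)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed sample

q1, med, q3 = np.percentile(x, [25, 50, 75])
print(x.std())             # a single number; silent about which tail is long
print(med - q1, q3 - med)  # unequal gaps expose the right skew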


Notes:
1. We could use logs base e, base 10, or even base 2. If we use log base b, then "exp" will be the function "raise b to that power."

2. The mean of Y when U = u is E(Y | U = u) = a + bu, and the mean of Y when U = u + 1 is E(Y | U = u + 1) = a + b(u + 1). Since Y is normal for each fixed value of U, its conditional mean equals its conditional median, and exponentiating (which here means raising 10 to the power, since we are working with log base 10) takes the median of Y to the median of X. This gives
median(X | U = u) = 10^(a + bu)
median(X | U = u + 1) = 10^(a + b(u + 1)) = 10^b · 10^(a + bu), which by the previous line is 10^b · median(X | U = u),
so
median(X | U = u + 1)/median(X | U = u) = 10^b.
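A quick numeric check of this calculation (simulated data, assuming NumPy; the coefficients a = 0.3 and b = 0.25 are made up for the illustration):

import numpy as np

rng = np.random.default_rng(5)
n = 200_000
u = rng.integers(0, 2, size=n).astype(float)     # U takes the values 0 and 1
y = 0.3 + 0.25 * u + rng.normal(0, 0.2, size=n)  # Y = a + bU + error
x = 10.0 ** y                                    # X, with Y = log10(X)

ratio = np.median(x[u == 1]) / np.median(x[u == 0])
print(ratio, 10 ** 0.25)  # both approximately 1.78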

3. If we are trying to predict what Y is when X = x, our best estimate is also a + bx, but the estimate isn't as good for Y as it is for the mean of Y. This leads to a common mistake: using the confidence interval (which is appropriate for the conditional mean of Y when X = x) to express our degree of uncertainty (or margin of error) when we are predicting Y (not the conditional mean of Y) when X = x. If we use a + bx to predict Y, then we need to use the prediction interval, which is typically much wider than the confidence interval. In other words, we have more uncertainty when predicting Y than when predicting its mean -- which  makes sense if you stop to think about it.
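The distinction is easy to see in software. A sketch with statsmodels (simulated data; get_prediction and summary_frame are the statsmodels calls that report both intervals):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
df["y"] = 2.0 + 0.5 * df["x"] + rng.normal(0, 1.0, size=100)

fit = smf.ols("y ~ x", df).fit()
pred = fit.get_prediction(pd.DataFrame({"x": [5.0]})).summary_frame(alpha=0.05)
print(pred[["mean_ci_lower", "mean_ci_upper"]])  # CI for the conditional mean at x = 5
print(pred[["obs_ci_lower", "obs_ci_upper"]])    # prediction interval: noticeably wider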

4. A quantile (also known as a percentile) of a distribution is the number that separates the values of the distribution into a specified lower fraction and the corresponding upper fraction. The median is the quantile corresponding to the fraction 1/2. As in the example above, half of the values are above the median, and half below. We could similarly talk about the first quartile (one quarter of the values below and three quarters above), the third quartile (three quarters of the distribution below and one quarter above), the second quartile (just another name for the median), the first quintile (one fifth below and four fifths above), etc.


Last updated October 12, 2016