Summary Statistics for Skewed Distributions

If a distribution is skewed, however, the mean is usually not in the middle of the distribution.

Example: The mean of the ten numbers 1, 1, 1, 2, 2, 3, 5, 8, 12, 17 is 52/10 = 5.2. Seven of the ten numbers are less than the mean, and only three are greater than it.

A better measure of the center for this distribution would be the median, which in this case is (2+3)/2 = 2.5. Five of the numbers are less than 2.5, and five are greater.
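The calculation can be checked in a few lines of Python (a sketch using only the standard library):

```python
from statistics import mean, median

data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]

print(mean(data))    # 5.2 -- pulled up by the large values in the tail
print(median(data))  # 2.5 -- average of the 5th and 6th sorted values
print(sum(x < mean(data) for x in data))  # 7 of the 10 values fall below the mean
```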

Notice that in this example, the mean is greater than the median. This is common for a distribution that is skewed to the right (that is, bunched up toward the left and with a "tail" stretching toward the right).

Similarly, a distribution that is skewed to the left (bunched up toward the right with a "tail" stretching toward the left) typically has a mean smaller than its median. (See http://www.amstat.org/publications/jse/v13n2/vonhippel.html for discussion of exceptions.)

(Note that for a symmetrical distribution, such as a normal distribution, the mean and median are the same.)

For a practical example (one I have often given my students):

Suppose a friend is considering moving to Austin and asks you what houses here typically cost. Would you tell her the mean or the median house price? Housing prices (in Austin, at least -- think of all those Dellionaires) are skewed to the right. Unless your friend is rich, the median housing price would be more useful than the mean housing price (which would be larger than the median, thanks to the Dellionaires' expensive houses).

In fact, many distributions that occur in practical situations are skewed, not symmetric. (For some examples, see the Life is Lognormal! website.)
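The mean-versus-median gap for a right-skewed distribution can be seen in a small simulation sketch (standard library only; the lognormal parameters here are arbitrary illustrative choices):

```python
import random
from statistics import mean, median

random.seed(0)  # for reproducibility

# Draw from a lognormal distribution: exp(Z) with Z ~ Normal(0, 1).
# This distribution is skewed to the right.
sample = [random.lognormvariate(0, 1) for _ in range(10_000)]

print(mean(sample))    # close to exp(1/2), about 1.65
print(median(sample))  # close to exp(0) = 1
# As expected for right skew, the sample mean exceeds the sample median.
```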

1. In many fields, it is common to work with the log of the original outcome variable, rather than the original variable. Thus one might do a hypothesis test for equality of the means of the logs of the variables. A difference in the means of the logs will tell you that the original distributions are different, which in some applications may answer the question of interest.
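As an illustration of working on the log scale, the following sketch (standard library only; the two samples are made up for the example) computes Welch's two-sample t statistic for the logs of two skewed samples. In practice the p-value would come from a t distribution, e.g. via a statistics package:

```python
import math
import random
from statistics import mean, variance

random.seed(1)

# Two hypothetical lognormal samples with different medians:
# exp(0) = 1 and exp(0.5), about 1.65, on the original scale.
group_a = [random.lognormvariate(0.0, 1) for _ in range(200)]
group_b = [random.lognormvariate(0.5, 1) for _ in range(200)]

# Work with the logs, which are normally distributed.
log_a = [math.log(x) for x in group_a]
log_b = [math.log(x) for x in group_b]

# Welch's t statistic (unequal variances allowed).
se = math.sqrt(variance(log_a) / len(log_a) + variance(log_b) / len(log_b))
t = (mean(log_b) - mean(log_a)) / se
print(t)  # a large |t| suggests the log-means (hence the medians) differ
```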

2. For situations that require interpretation in terms of the original variable, we can often exploit the fact that the logarithm transformation and its inverse, the exponential transformation, preserve order. This implies that they take the median of one variable to the median of another. So if a variable X is lognormal and we take its logarithm, Y = log X (Note 1), we get a normal distribution, whose mean is the same as its median. If we back-transform (by exponentiating -- so X = exp(Y); see Note 1), the median of Y goes to the median of X. Thus statements about means for the log-transformed variable Y give us statements about medians for the original variable X. (Note that in this situation, the original variable X is skewed, so we probably should be talking about its median rather than its mean anyhow.) We can also back-transform a confidence interval for the mean of Y to get a confidence interval for the median of X. (Typically, a confidence interval for the mean of Y will be symmetric about the estimated mean of Y, but the confidence interval for the median of X that is obtained by back-transforming will not be symmetric about the estimated median of the original variable.)
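Back-transforming an interval can be sketched as follows (standard library only; a z-based interval is used instead of a t-based one for simplicity, and the data are simulated):

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(2)

x = [random.lognormvariate(1, 0.8) for _ in range(100)]  # skewed data
y = [math.log(v) for v in x]                             # approximately normal

# 95% z-interval for the mean of Y (the log-scale mean).
z = NormalDist().inv_cdf(0.975)
half_width = z * stdev(y) / math.sqrt(len(y))
lo, hi = mean(y) - half_width, mean(y) + half_width

# Back-transform: a confidence interval for the MEDIAN of X.
lo_x, hi_x = math.exp(lo), math.exp(hi)
est = math.exp(mean(y))  # estimated median of X

# The log-scale interval is symmetric about mean(y), but the
# back-transformed interval is not symmetric about exp(mean(y)).
print(lo_x, est, hi_x)
print(hi_x - est, est - lo_x)  # the upper arm is wider than the lower arm
```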

3. In some situations, we can use properties of logs to say useful things when we back-transform. For example, if we regress Y = log_{10} X on U and get Y = a + bU + error, then we can say that increasing U by one unit multiplies the median of X by a factor of 10^{b} (Note 2).
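A quick numeric check of this interpretation (the coefficients a and b here are hypothetical, chosen only for illustration):

```python
# Hypothetical fitted coefficients for Y = log10(X) = a + b*U.
a, b = 2.0, 0.3

def median_x(u):
    # Back-transformed median of X at a given U (per item 2 above,
    # exponentiating the conditional mean/median of Y).
    return 10 ** (a + b * u)

ratio = median_x(5) / median_x(4)
print(ratio)      # equals 10**b, whatever value of u we start from
print(10 ** 0.3)  # about 1.995, i.e. roughly doubling per unit of U
```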


Notes:

1. We could use logs base e, base 10, or even base 2. If we use log base b, then "exp" will be the function "raise b to that power."

2. The mean of Y when U = u is E(Y|U = u) = a + bu, and the mean of Y when U = u+1 is E(Y|U = u + 1) = a + b(u + 1). Since Y is normal, these conditional means are also the conditional medians of Y, so exponentiating (which here means raising ten to the power, since we are working with log base 10) takes them to the conditional medians of X:

median(X | U = u) = 10^{a + bu}

median(X | U = u + 1) = 10^{a + b(u+1)} = 10^{b}(10^{a + bu}) = 10^{b} median(X | U = u),

so median(X | U = u + 1)/median(X | U = u) = 10^{b}.

3. If we are trying to predict what Y is when U = u, our best estimate is also a + bu, but the estimate isn't as good for Y as it is for the mean of Y. This leads to a common mistake: using the confidence interval (which is appropriate for the conditional mean of Y when U = u) to express our degree of uncertainty (or margin of error) when we are predicting Y (not the conditional mean of Y) when U = u. If we use a + bu to predict Y, then we need to use the prediction interval, which is typically much wider than the confidence interval. In other words, we have more uncertainty when predicting Y than when predicting its mean
-- which makes sense if you stop to think about it.

4. A quantile (also known as a percentile) of a distribution is the number that separates the values of the distribution into a specified lower fraction and the corresponding upper fraction. The median is the quantile corresponding to the fraction 1/2. As in the example above, half of the values are above the median, and half below. We could similarly talk about the first quartile (one quarter of the values below and three quarters above), the third quartile (three quarters of the distribution below and one quarter above), the second quartile (just another name for the median), the first quintile (one fifth below and four fifths above), etc.
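Quantiles of the earlier ten-number example can be computed directly (a sketch using Python's standard library; note that statistics.quantiles uses the "exclusive" interpolation method by default, and other methods give slightly different cut points):

```python
from statistics import quantiles

data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]

# Cut points dividing the data into 4 equal-probability groups:
# first quartile, median (second quartile), third quartile.
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)  # the middle cut point, 2.5, matches the median above

# Quintile cut points (5 equal-probability groups):
print(quantiles(data, n=5))
```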

Last updated October 12, 2016