
COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them


Misunderstandings Involving Conditional Probabilities

The basic idea of conditional probabilities.

A conditional probability is a probability with some condition imposed. In practice, many probabilities we encounter are conditional probabilities, although the condition is not always made explicit.

For example, the phrase "the probability of dying of a heart attack in the next five years" is, in practice, typically ambiguous. Is the writer talking about the world-wide probability, over both sexes and all ages? Or is there an implicit assumption that the probability in question just refers to adults living now in one particular country?

Being clear on the condition is important; for example, we would expect that the probability of dying of a heart attack in the next five years is much lower for men under 25 years of age than for men over 65 years of age.

Misunderstandings arising from ignoring the condition

Many research studies involving people study a fairly restricted group. Thus they result in conditional probabilities with a fairly restricted condition. Unfortunately, all too often this restriction is not emphasized enough. For example, a study of a cholesterol-lowering medication might be restricted to men between the ages of 45 and 65 who have previously had a heart attack. If a physician decides, on the basis of this study, to prescribe the medication to a woman who is 70 years old and has no previous record of heart attacks, the physician is extrapolating; the applicability of the study to this quite different group of people is questionable.

Terminology and notation for conditional probabilities

A conditional probability is often expressed using the word "given" to describe the condition. For example, "the probability of dying of a heart attack in the next five years for men under 25 years of age" would be expressed as "the probability of dying of a heart attack in the next five years given that the person is a man under 25."

The notation P( ) is often used to express a probability of something. To express a conditional probability, we use a vertical bar to stand for "given". For example,

    P(dying of a heart attack in the next five years | male under 25 years of age)

stands for "the probability of dying of a heart attack in the next five years for men under 25 years of age," and is read "the probability of dying of a heart attack in the next five years given male under 25 years of age" (which, admittedly, is not very good English).

Confusion of reverse conditional probabilities

One common misunderstanding is confusing a conditional probability with the reverse conditional probability -- that is, with the conditional probability that reverses the roles of the event (e.g., "dying of a heart attack in the next five years") and the condition (e.g., "male under 25 years of age"). For example, "the probability of dying of a heart attack in the next five years for men under 25 years of age" is talking about something quite different from "the probability of being a male under 25 years of age if one dies of a heart attack in the next five years." This mistake is sometimes called confusion of the inverse.
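To see how far apart the two probabilities can be, here is a minimal Python sketch. The counts are invented purely for illustration (they are not real mortality data), and the helper name `conditional` is just a convenience introduced here.

```python
# Hypothetical counts for one reference population (NOT real data).
young_men = 100_000        # men under 25 years of age
deaths = 10_000            # people who die of a heart attack in the next five years
young_men_deaths = 10      # people who are in BOTH groups


def conditional(joint_count, condition_count):
    """Estimate P(event | condition) as count(event and condition) / count(condition)."""
    return joint_count / condition_count


# P(dies of a heart attack | male under 25): divide by the number of young men.
p_death_given_young_man = conditional(young_men_deaths, young_men)

# P(male under 25 | dies of a heart attack): divide by the number of deaths.
p_young_man_given_death = conditional(young_men_deaths, deaths)

print(f"P(death | male under 25) = {p_death_given_young_man}")
print(f"P(male under 25 | death) = {p_young_man_given_death}")
```

Both probabilities use the same count of people who satisfy both descriptions; they differ only in which group serves as the condition, and so in what that count is divided by.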

One situation where this type of confusion is very common is in connection with diagnostic tests for medical conditions.

Diagnostic tests typically have two outcomes, labeled "positive" and "negative." For an ideal test, the outcome is "positive" exactly when the patient has the disease being tested for, and "negative" exactly when the patient does not have the disease.  

Unfortunately, diagnostic tests are almost never perfect. Thus we talk about their sensitivity  and their specificity:

Sensitivity = the probability that a person tests positive if the disease is present
= P(tests positive | has the disease)

Specificity = the probability that a person tests negative if the disease is absent
= P(tests negative | disease absent)
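As a rough illustration of these two definitions, the following Python sketch estimates sensitivity and specificity from the results of a hypothetical validation study. The four counts and the function names are assumptions made up for this example.

```python
# Hypothetical validation study (invented counts, not from any real test).
true_positives = 90     # have the disease, test positive
false_negatives = 10    # have the disease, test negative
true_negatives = 950    # disease absent, test negative
false_positives = 50    # disease absent, test positive


def sensitivity(tp, fn):
    """P(tests positive | has the disease): condition on the diseased group."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """P(tests negative | disease absent): condition on the disease-free group."""
    return tn / (tn + fp)


sens = sensitivity(true_positives, false_negatives)
spec = specificity(true_negatives, false_positives)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Note that each quantity is computed within a single group: sensitivity looks only at people who have the disease, and specificity looks only at people who do not.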

Many people (physicians as well as patients) confuse the sensitivity P(tests positive | has the disease) with the reverse conditional probability P(has the disease | tests positive). This reverse conditional probability is called the positive predictive value, also denoted PPV:

PPV = Positive predictive value = the probability that someone has the disease if they test positive
= P(has the disease | tests positive).

In fact, the sensitivity and PPV can be very different. In particular, the sensitivity might be very high while the PPV is low. The PPV depends on the sensitivity, but also on the specificity and the prevalence rate of the disease:

Prevalence rate = the proportion of the population having the disease.

Note that the prevalence rate refers to a specific reference category -- in this case, a certain population. The prevalence rate will vary according to the population. For example, in most countries, the prevalence rate of HIV infection is greater for the population of intravenous drug users than for the population at large.

The way in which the PPV depends on the sensitivity, specificity, and prevalence rate is sufficiently involved to be counterintuitive for most people. In particular, a test can have what seem like high sensitivity and high specificity, yet have low PPV. For more about this relationship, see the Notes.
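The relationship follows from Bayes' theorem: PPV = (sensitivity x prevalence) / (sensitivity x prevalence + (1 - specificity) x (1 - prevalence)). The Python sketch below applies that formula to a hypothetical test with 99% sensitivity and 95% specificity; the test characteristics, the prevalence values, and the function name are all assumptions chosen for illustration.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(has the disease | tests positive), computed with Bayes' theorem."""
    # P(tests positive and has the disease)
    p_positive_and_diseased = sensitivity * prevalence
    # P(tests positive and disease absent): false positives among the disease-free
    p_positive_and_healthy = (1 - specificity) * (1 - prevalence)
    return p_positive_and_diseased / (p_positive_and_diseased + p_positive_and_healthy)


# Hypothetical test: 99% sensitivity, 95% specificity (illustrative values only).
for prevalence in (0.001, 0.01, 0.1, 0.5):
    ppv = positive_predictive_value(0.99, 0.95, prevalence)
    print(f"prevalence = {prevalence}: PPV = {ppv:.3f}")
```

With these assumed numbers, a prevalence of 1 in 1,000 gives a PPV of only about 2%: the few true positives are swamped by false positives from the much larger disease-free group. The same test applied to a population where half the people have the disease gives a PPV above 95%.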

Notes:
The following references give more information on the relationship between PPV,
sensitivity, specificity, and prevalence rate:

"Positive predictive value", Wikipedia, http://en.wikipedia.org/wiki/Positive_predictive_value , accessed November 8, 2009.
Gives the formula relating PPV, sensitivity, specificity, and prevalence rate, plus some examples and links to related discussions.

"Accuracy of Diagnostic Tests," RDTinfo, http://www.rapid-diagnostics.org/accuracy.htm, accessed November 10, 2009.
Gives the basic definitions and formulas, plus further references.

Gigerenzer, Gerd et al. (2007). "Helping doctors and patients make sense of health statistics," Psychological Science in the Public Interest, vol. 8, no. 2, pp. 53 - 96. Download from http://www.psychologicalscience.org/journals/index.cfm?journal=pspi&content=pspi/8_2
Discusses misunderstandings involving the positive predictive value as well as other confusions that affect medical care. Also discusses ways to explain the topics that can help improve understanding. A somewhat shortened version has appeared as "Knowing your chances: What health stats really mean," Scientific American Mind, April/May/June 2009, pp. 44 - 51.

Nugent, William (2004). "The role of prevalence rates, sensitivity, and specificity in assessment accuracy: Rolling the dice in social work process," Journal of Social Service Research, 31 (2), 51 - 75.
Discusses the question of the accuracy of diagnostic testing in the context of mental disorders, focusing its examples on major depressive disorder. Includes discussion of when it is better to try to improve sensitivity and when it is better to improve specificity. Although the explanations of the math are sometimes not the best, the article is very worthwhile in many ways. I have used it as the basis of a couple of assignments in a course I have taught for students in a master's program for secondary math teachers. (First assignment, second assignment)

Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford.
Worth consulting if you want even more than the Nugent and Swets et al. articles provide.

Swets, John A., Robyn Dawes, and John Monahan (2000). "Psychological Science Can Improve Diagnostic Decisions," Psychological Science in the Public Interest, 1(1). Download from http://www.psychologicalscience.org/journals/index.cfm?journal=pspi&content=pspi/1_1
A fairly comprehensive discussion of the question of improving diagnostic accuracy. Discusses several other applications (e.g., predicting violence, weather forecasting, aircraft cockpit warnings) as well as medical diagnostics.