Type I Error

Rejecting the null hypothesis when it is in fact true is called a Type I error.

Many people decide, before doing a hypothesis test, on a maximum p-value for which they will reject the null hypothesis. This value is often denoted α (alpha) and is also called the significance level.

When a hypothesis test results in a p-value that is less than the significance level, the result of the hypothesis test is called statistically significant.

Common mistake: Confusing statistical significance and practical significance.

Example: A large clinical trial is carried out to compare a new medical treatment with a standard one. The statistical analysis shows a statistically significant difference in lifespan when using the new treatment compared to the old one. But the increase in lifespan is at most three days, with average increase less than 24 hours, and with poor quality of life during the period of extended life. Most people would not consider the improvement practically significant.

Caution: The larger the sample size, the more likely a hypothesis test will detect a small difference. Thus it is especially important to consider practical significance when sample size is large.
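The caution above can be illustrated with a minimal sketch. It assumes a one-sided z-test of "µ = 0" against "µ > 0" with a known standard deviation, which is a simplification chosen for illustration; the function name and the numbers (an observed mean of 0.02 with standard deviation 1) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(observed_mean, sigma, n):
    """p-value for H0: mu = 0 vs Ha: mu > 0 in a z-test with known sigma (assumed setup)."""
    z = observed_mean / (sigma / sqrt(n))
    return 1.0 - NormalDist().cdf(z)


# The same tiny observed difference (hypothetical value 0.02) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p-value = {one_sided_p_value(0.02, 1.0, n):.4f}")
```

With 100 observations the difference is nowhere near statistically significant, while with 10,000 or more it is, even though the size of the difference (and hence its practical importance) has not changed.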

Connection between Type I error and significance level:

A significance level α corresponds to a certain value of the test statistic, say t_{α}, represented by the orange line in the picture of a sampling distribution below. (The picture illustrates a hypothesis test with alternate hypothesis "µ > 0".)

Since the shaded area indicated by the arrow is the p-value corresponding to t_{α}, that p-value (shaded area) is α.

To have a p-value less than α, a t-value for this test must be to the right of t_{α}.

So the probability of rejecting the null hypothesis when it is true is the probability that t > t_{α}, which we saw above is α.

In other words, the probability of Type I error is α.^{1}

Rephrasing using the definition of Type I error:

The significance level α is the probability of making the wrong decision when the null hypothesis is true.
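The claim that the probability of Type I error equals α can be checked by simulation. The following is a minimal sketch, not the method used on this page: it assumes a one-sided z-test with known standard deviation 1 (rather than a t-test), and the function name, sample size, number of trials, and seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(alpha=0.05, n=25, trials=20_000, seed=0):
    """Fraction of samples drawn under H0 (mu = 0) for which H0 is wrongly rejected."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # cut-off for the test statistic
    rejections = 0
    for _ in range(trials):
        # The null hypothesis is true here: the data really have mean 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # z statistic, sigma = 1 assumed known
        if z > z_alpha:
            rejections += 1
    return rejections / trials


print(type_i_error_rate(alpha=0.05))
```

The printed proportion should land close to 0.05, and choosing a different α should move it accordingly.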

Pros and Cons of Setting a Significance Level:

- Setting a significance level (before doing inference) has the advantage that the analyst is not tempted to choose a cut-off on the basis of what he or she hopes is true.
- It has the disadvantage that it neglects that some p-values might best be considered borderline. This is one reason^{2} why it is important to report p-values when reporting results of hypothesis tests. It is also good practice to include confidence intervals corresponding to the hypothesis test. (For example, if a hypothesis test for the difference of two means is performed, also give a confidence interval for the difference of those means. If the significance level for the hypothesis test is .05, then use confidence level 95% for the confidence interval.)
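The correspondence between a significance level of .05 and a 95% confidence interval can be sketched as follows. For simplicity this assumes a two-sided z-test for a single mean with known standard deviation, not the two-sample case mentioned above; the function name and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_and_interval(sample_mean, sigma, n, alpha=0.05, null_value=0.0):
    """Two-sided p-value and the matching (1 - alpha) confidence interval (assumed z setting)."""
    se = sigma / sqrt(n)
    z = (sample_mean - null_value) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * se
    return p_value, (sample_mean - half_width, sample_mean + half_width)


for mean in (0.3, 0.5):  # hypothetical sample means
    p, (low, high) = z_test_and_interval(mean, sigma=1.0, n=20)
    print(f"mean = {mean}: p = {p:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

In this setting the p-value is below .05 exactly when the 95% interval excludes the null value, and the interval additionally shows how large or small the effect could plausibly be.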

Type II Error

Not rejecting the null hypothesis when in fact the alternate hypothesis is true is called a Type II error. (The second example below provides a situation where the concept of Type II error is important.)

Note: "The alternate hypothesis" in the definition of Type II error may refer to the alternate hypothesis in a hypothesis test, or it may refer to a "specific" alternate hypothesis.

Example:
In a t-test for a population mean µ, with null hypothesis "µ = 0" and alternate hypothesis "µ > 0", we may talk about the Type II error relative to the general alternate hypothesis "µ > 0", or we may talk about the Type II error relative to the specific alternate hypothesis "µ > 1". Note that the specific alternate hypothesis is a special case of the general alternate hypothesis.

In practice, people often work with Type II error relative to a specific alternate hypothesis. In this situation, the probability of Type II error relative to the specific alternate hypothesis is often called β. In other words, β is the probability of making the wrong decision when the specific alternate hypothesis is true. (See the discussion of Power for related detail.)

Considering both types of error together:

The following table summarizes Type I and Type II errors:

| | Truth (for population studied): Null Hypothesis True | Truth (for population studied): Null Hypothesis False |
|---|---|---|
| Decision (based on sample): Reject Null Hypothesis | Type I Error | Correct Decision |
| Decision (based on sample): Fail to reject Null Hypothesis | Correct Decision | Type II Error |

The same table for the analogy of a criminal trial:

| | Truth: Not Guilty | Truth: Guilty |
|---|---|---|
| Verdict: Guilty | Type I Error -- Innocent person goes to jail (and maybe guilty person goes free) | Correct Decision |
| Verdict: Not Guilty | Correct Decision | Type II Error -- Guilty person goes free |

The following diagram illustrates the Type I error and the Type II error against the specific alternate hypothesis "µ = 1" in a hypothesis test for a population mean µ, with null hypothesis "µ = 0", alternate hypothesis "µ > 0", and significance level α = 0.05.

- The blue (leftmost) curve is the sampling distribution assuming the null hypothesis "µ = 0".
- The green (rightmost) curve is the sampling distribution assuming the specific alternate hypothesis "µ = 1".
- The vertical red line shows the cut-off for rejection of the null hypothesis: the null hypothesis is rejected for values of the test statistic to the right of the red line (and not rejected for values to the left of the red line).
- The area of the diagonally hatched region to the right of the red line and under the blue curve is the probability of Type I error (α).
- The area of the horizontally hatched region to the left of the red line and under the green curve is the probability of Type II error (β).
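The two areas described in the list above can be computed directly once the sampling distributions are specified. The sketch below assumes both curves are normal with a standard error of 0.5; that value, like the variable names, is a hypothetical choice and is not taken from the diagram.

```python
from statistics import NormalDist

STANDARD_ERROR = 0.5  # assumption: not stated in the text
ALPHA = 0.05

null_curve = NormalDist(mu=0.0, sigma=STANDARD_ERROR)  # "blue" curve, mu = 0
alt_curve = NormalDist(mu=1.0, sigma=STANDARD_ERROR)   # "green" curve, mu = 1

# Cut-off ("red line"): the point with area ALPHA to its right under the null curve.
cutoff = null_curve.inv_cdf(1.0 - ALPHA)

type_i_area = 1.0 - null_curve.cdf(cutoff)  # right of the cut-off, under the null curve
type_ii_area = alt_curve.cdf(cutoff)        # left of the cut-off, under the alternate curve

print(f"cut-off = {cutoff:.3f}")
print(f"alpha   = {type_i_area:.3f}")
print(f"beta    = {type_ii_area:.3f}")
```

Under these assumptions the Type I area reproduces α = 0.05 by construction, while β depends on how far apart the two curves are relative to the standard error.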

Deciding what significance level to use:

This should be done before analyzing the data -- preferably before gathering the data.^{5}

The choice of significance level should be based on the consequences of Type I and Type II errors.

- If the consequences of a Type I error are serious or expensive, then a very small significance level is appropriate.

Example 1: Two drugs are being compared for effectiveness in treating the same condition. Drug 1 is very affordable, but Drug 2 is extremely expensive. The null hypothesis is "both drugs are equally effective," and the alternate is "Drug 2 is more effective than Drug 1." In this situation, a Type I error would be deciding that Drug 2 is more effective, when in fact it is no better than Drug 1, but would cost the patient much more money. That would be undesirable from the patient's perspective, so a small significance level is warranted.

- If the consequences of a Type I error are not very serious (and especially if a Type II error has serious consequences), then a larger significance level is appropriate.

Example 2: Two drugs are known to be equally effective for a certain condition. They are also each equally affordable. However, there is some suspicion that Drug 2 causes a serious side effect in some patients, whereas Drug 1 has been used for decades with no reports of the side effect. The null hypothesis is "the incidence of the side effect in both drugs is the same", and the alternate is "the incidence of the side effect in Drug 2 is greater than that in Drug 1." Falsely rejecting the null hypothesis when it is in fact true (Type I error) would have no great consequences for the consumer, but a Type II error (i.e., failing to reject the null hypothesis when in fact the alternate is true, which would result in deciding that Drug 2 is no more harmful than Drug 1 when it is in fact more harmful) could have serious consequences from a public health standpoint. So setting a large significance level is appropriate.

See Sample size calculations to plan an experiment, GraphPad.com, for more examples.

- Sometimes there may be serious consequences of each alternative, so some compromises or weighing of priorities may be necessary. The trial analogy illustrates this well: Which is better or worse, imprisoning an innocent person or letting a guilty person go free?^{6} This is a value judgment; value judgments are often involved in deciding on significance levels. Trying to avoid the issue by always choosing the same significance level is itself a value judgment.
- Sometimes different stakeholders have different interests that compete (e.g., in the second example above, the developers of Drug 2 might prefer to have a smaller significance level).
- See http://core.ecu.edu/psyc/wuenschk/StatHelp/Type-I-II-Errors.htm for more discussion of the considerations involved in deciding what are reasonable levels for Type I and Type II errors.

- See the discussion of Power for more on deciding on a significance level.
- Similar considerations hold for setting confidence levels for confidence intervals.
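The trade-off behind these considerations can be made concrete: with the sample size held fixed, a smaller significance level gives a larger probability of Type II error against any given specific alternate hypothesis. The sketch below assumes a one-sided z-test of "µ = 0" against "µ > 0" with known standard deviation; the function name and the default values (specific alternative µ = 1, standard deviation 2, sample size 16) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, mu_alt=1.0, sigma=2.0, n=16):
    """beta at the specific alternative mu_alt for H0: mu = 0 vs Ha: mu > 0 (assumed z-test)."""
    se = sigma / sqrt(n)
    # Reject H0 when the sample mean exceeds this cut-off.
    cutoff = NormalDist(0.0, se).inv_cdf(1.0 - alpha)
    # beta: probability of landing below the cut-off when mu = mu_alt.
    return NormalDist(mu_alt, se).cdf(cutoff)


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}  beta = {type_ii_error(alpha):.3f}")
```

In a situation like Example 2, where a Type II error is the costly one, the larger values of α in this list are the ones that keep β down; in a situation like Example 1 the reasoning runs the other way.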

- This is an instance of the common mistake of expecting too much certainty.
- There is always a possibility of a Type I error; the sample in the study might have been one of the small percentage of samples giving an unusually extreme test statistic.
- This is why replicating experiments (i.e., repeating the experiment with another sample) is important. The more experiments that give the same result, the stronger the evidence.
- There is also the possibility that the sample is biased or the method of analysis was inappropriate; either of these could lead to a misleading result.

1. α is also called the bound on Type I error. Choosing a value α is sometimes called setting a bound on Type I error.

2. Another good reason for reporting p-values is that different people may have different standards of evidence; see the section "Deciding what significance level to use" on this page.

3. This could be more than just an analogy: Consider a situation where the verdict hinges on statistical evidence (e.g., a DNA test), and where rejecting the null hypothesis would result in a verdict of guilty, and not rejecting the null hypothesis would result in a verdict of not guilty.

4. This is consistent with the system of justice in the USA, in which a defendant is assumed innocent until proven guilty beyond a reasonable doubt; proving the defendant guilty beyond a reasonable doubt is analogous to providing evidence that would be very unusual if the null hypothesis is true.

5. There are (at least) two reasons why this is important. First, the significance level desired is one criterion in deciding on an appropriate sample size. (See Power for more information.) Second, if more than one hypothesis test is planned, additional considerations need to be taken into account. (See Multiple Inference for more information.)

6. The answer to this may well depend on the seriousness of the punishment and the seriousness of the crime. For example, if the punishment is death, a Type I error is extremely serious. Also, if a Type I error results in a criminal going free as well as an innocent person being punished, then it is more serious than a Type II error.

Last updated May 12, 2011