CS345, Machine Learning
Prof. Alvarez
Decision Tree Pruning based on Confidence Intervals (as in C4.5)
The basic entropy-based decision tree learning algorithm ID3
continues to grow a tree until the tree makes no errors over the
training data. This makes ID3 prone to overfitting. In order to
reduce overfitting, pruning is used.
Reduced-Error Pruning
One approach to pruning is to withhold a portion of the available
labeled data for validation. The validation set is not used during
training. Once training has been completed, testing is carried out
over the validation set. If the error rate of the original decision
tree over the validation set exceeds the error rate of a pruned
version of the tree (obtained by replacing a near-leaf node, together
with its child leaves, by a single leaf node), then the pruning
operation is carried out. Reduced-error pruning can reduce
overfitting, but it also reduces the amount of data available for
training.
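The sketch below illustrates the idea in Python. The Node class, its
field names, and the (x, y) example format are illustrative
assumptions, not the actual ID3/C4.5 data structures.

class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # attribute tested at this node
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # majority class at this node

    def predict(self, x):
        if not self.children:           # leaf: return stored class
            return self.label
        child = self.children.get(x[self.attribute])
        return child.predict(x) if child else self.label

def error_rate(node, data):
    """Fraction of labeled examples (x, y) that the subtree misclassifies."""
    return sum(node.predict(x) != y for x, y in data) / len(data)

def try_prune(node, validation):
    """Replace a near-leaf node by a leaf if validation error does not rise."""
    before = error_rate(node, validation)
    children, node.children = node.children, {}   # tentatively prune
    if error_rate(node, validation) <= before:
        return True                               # keep the pruned version
    node.children = children                      # pruning hurt: undo it
    return False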
Statistical Pruning
C4.5 uses a pruning technique based on statistical confidence estimates.
This technique has the advantage that it allows all of the available
labeled data to be used for training.
Confidence intervals
The heart of the statistical pruning technique is the calculation of
a confidence interval for the error rate. In brief, one starts from
an observed error rate f measured over the set of N training instances.
The observed error rate is analogous to the observed fraction of heads
in N tosses of a (usually loaded) coin. One wishes to estimate the
true error rate p that would be observed if one could test
over all possible data instances that will ever become available.
The true error rate p is analogous to the actual probability of
heads of the given coin. Note the difference between f and p:
f is the fraction of errors (or heads) that were observed in
one particular sample of size N, while p is the fraction of errors
(heads) that will be observed over the infinitely large set of all
instances, past, present, and future.
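A quick simulation makes the distinction concrete. Here we fix a true
error rate p for a loaded coin and watch how the observed fraction f
varies from one sample of size N to the next (the value p = 0.35 is an
arbitrary choice for illustration):

import random

random.seed(0)
p = 0.35   # true error rate (unknown in practice); arbitrary here
N = 10     # sample size

# Draw several samples of size N and record the observed rate f in each.
for trial in range(5):
    f = sum(random.random() < p for _ in range(N)) / N
    print(f"sample {trial + 1}: f = {f:.1f}")
# f fluctuates around p from sample to sample; the confidence interval
# below quantifies how far f is likely to stray from p.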
The confidence interval is calculated using the normal approximation
to the binomial distribution of the error count:
p = f ± z * sqrt( f*(1-f) / N )
Here, f, p, and N are as described above. z is a factor that
depends on the desired level of confidence. Values of z for
selected confidence levels appear below.
    z   | confidence
  ------+-----------
   0.67 |    50%
   1    |    68%
   1.64 |    90%
   1.96 |    95%
With the level of confidence given in this table, one can state
that the true value p will be found inside the corresponding
confidence interval calculated as described above. For example,
suppose that 3 errors are observed among a set of 10 instances.
We thus have
f = 0.3, N = 10, z = 1.64
The confidence interval for p is:
p = 0.3 ± 1.64*sqrt( 0.3*(1-0.3) / 10 )
To two decimal places, the confidence interval is:
p = 0.3 ± 0.24
Thus, we can be 90% certain that the true error rate of the
particular classifier that produced the observed error rate
of 3/10 lies between 0.06 and 0.54.
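The interval is easy to compute directly. Below is a small helper
written as a sketch; the function name is ours for illustration, not
part of C4.5.

from math import sqrt

def confidence_interval(f, N, z):
    """Normal-approximation confidence interval for the true error
    rate p, given observed error rate f over N instances and the z
    value for the chosen confidence level."""
    half_width = z * sqrt(f * (1 - f) / N)
    return f - half_width, f + half_width

# Reproduce the example above: 3 errors in 10 instances, 90% confidence.
low, high = confidence_interval(f=0.3, N=10, z=1.64)
print(f"p in [{low:.2f}, {high:.2f}]")   # p in [0.06, 0.54]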
Pruning based on confidence intervals
In order to decide whether to replace a near-leaf node and its
child leaves by a single leaf node, C4.5 compares the upper limits
of the error confidence intervals for the two trees. For the unpruned
tree, the upper error estimate is calculated as an average of the
child leaves' upper estimates, weighted by the number of training
instances reaching each leaf. Whichever tree has the lower estimated
upper limit on the error rate "wins" and is selected.
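In code, the comparison looks roughly as follows. The names
upper_estimate and should_prune are ours for illustration, and leaf
statistics are given as (errors, instances) pairs.

from math import sqrt

def upper_estimate(f, N, z=0.67):
    """Upper limit of the error-rate confidence interval.
    z = 0.67 gives the one-sided 75th-percentile bound used in the
    example below."""
    return f + z * sqrt(f * (1 - f) / N)

def should_prune(leaf_counts, z=0.67):
    """Decide whether to replace a near-leaf node by a single leaf.
    leaf_counts: one (errors, instances) pair per child leaf.
    Assumes each child leaf predicts the overall majority class, so
    the merged leaf's errors are just the child errors pooled."""
    total = sum(n for _, n in leaf_counts)
    # Weighted average of the children's upper error estimates,
    # weighted by the number of instances at each leaf.
    unpruned = sum(n * upper_estimate(e / n, n, z)
                   for e, n in leaf_counts) / total
    # Upper estimate for the single merged leaf.
    pruned = upper_estimate(sum(e for e, _ in leaf_counts) / total, total, z)
    return pruned <= unpruned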
Example from the labor negotiations dataset
The unpruned subtree is:
                  wage increase?
           <=2.5 /              \ > 2.5
                /                \
           work hours
       <=36 /        \ >36
           /          \
      health plan
    none / half | full \
        /       |       \
     4+ 2-    1+ 1-    4+ 2-
We target the health plan node near the bottom of the tree for pruning.
First we calculate the weighted average estimated upper error rate
for the unpruned tree, using z = 0.67 from the table above: a
two-sided 50% interval leaves 25% in each tail, so its upper limit
is a one-sided 75% confidence bound on the error rate.
for health plan = none leaf node:
f = 2/6, N=6 => upper p estimate = .46
for health plan = half leaf node:
f = 1/2, N=2 => upper p estimate = .74
for health plan = full leaf node:
f = 2/6, N=6 => upper p estimate = .46
weighted average upper error rate over the three leaves,
with weights 6, 2, and 6 out of 14:
(6(.46) + 2(.74) + 6(.46)) / 14 = .50
On the other hand, if the tree were pruned by replacing the health
plan node by a single leaf (9+, 5-), the confidence interval
calculation would be as follows:
f = 5/14, N=14 => upper p estimate = .44
Since the pruned tree yields the lower upper estimate for the error
rate (.44 < .50), the subtree is indeed pruned.
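Plugging the numbers from this example into the upper_estimate and
should_prune sketches above reproduces the calculation:

leaves = [(2, 6), (1, 2), (2, 6)]   # health plan = none, half, full

for errors, n in leaves:
    print(f"f = {errors}/{n}: upper estimate = "
          f"{upper_estimate(errors / n, n):.2f}")   # .46, .74, .46

total_errors = sum(e for e, _ in leaves)   # 5 errors out of
total = sum(n for _, n in leaves)          # 14 instances
print(f"pruned leaf: {upper_estimate(total_errors / total, total):.2f}")
# pruned leaf: 0.44

print(should_prune(leaves))   # True: .44 < .50, so prune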