CSCI 3346, Data Mining
Prof. Alvarez
Decision Tree Pruning based on Confidence Intervals (as in C4.5)
The basic class-entropy-based decision tree induction algorithm ID3
continues to grow a tree until it makes no errors over the set
of training data. This fact makes ID3 prone to overfitting.
In order to reduce overfitting, pruning is used. A commonly used pruning
operation involves replacing a non-leaf node of the tree by a leaf node
that corresponds to the most common class among instances that reach the node.
A criterion is needed in order to determine whether pruning a given node
in this way is likely to yield an improvement in generalization.
Reduced-Error Pruning
One approach to estimating generalization performance is to withhold a
portion of the available labeled data for validation.
The validation set is not used during training.
Once training has been completed, testing is carried out
over the validation set. If the error rate of the original decision
tree over the validation set exceeds the error rate of a pruned
version of the tree (obtained by replacing a near-leaf node and its
child leaves by a single leaf node), then the pruning operation
is carried out. Reduced-error pruning can reduce overfitting, but
it also reduces the amount of data available for training.
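To make the decision concrete, here is a minimal Python sketch of the reduced-error pruning comparison. The tree representation (nested dicts for internal nodes, class labels for leaves) and the function names are illustrative assumptions, not part of any particular implementation.

    # Sketch of reduced-error pruning under an assumed tree representation:
    # an internal node is {"attribute": ..., "children": {value: subtree}},
    # and a leaf is simply a class label.

    def classify(tree, instance):
        """Follow attribute tests down the tree and return the leaf's class label."""
        while isinstance(tree, dict):
            tree = tree["children"][instance[tree["attribute"]]]
        return tree

    def error_rate(tree, validation_set):
        """Fraction of (instance, label) pairs that the tree misclassifies."""
        wrong = sum(1 for x, y in validation_set if classify(tree, x) != y)
        return wrong / len(validation_set)

    def keep_pruned_version(original_tree, pruned_tree, validation_set):
        """Prune when the original tree's validation error exceeds the pruned tree's."""
        return error_rate(original_tree, validation_set) > error_rate(pruned_tree, validation_set)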
Statistical Pruning
C4.5 makes pruning decisions based on statistical confidence estimates
of the generalization error.
This technique has the advantage that it allows all of the available
labeled data to be used for training.
Confidence intervals
The heart of the statistical pruning technique is the calculation of
a confidence interval for the generalization error rate. In brief,
one starts from an observed error rate f measured over the set of
N training instances.
The observed error rate is analogous to the observed fraction of heads
in N tosses of a (usually loaded) coin. One wishes to estimate the
true error rate p that would be observed if one could test
over all possible data instances that will ever become available.
The true error rate p is analogous to the actual probability of
heads of the given coin. Note the difference between f and p:
f is the fraction of errors (or heads) that were observed in
one particular sample of size N, while p is the fraction of errors
(heads) that will be observed over the infinitely large set of all
instances, past, present, and future.
An approximate confidence interval can be calculated as follows:
p = f +- z*sqrt( f*(1-f) / N )
Here, f, p, and N are as described above. z is a factor that
depends on the desired level of confidence. Values of z for
selected confidence levels appear below.
    z    | confidence
   ------+-----------
    0.67 |    50%
    1    |    68%
    1.64 |    90%
    1.96 |    95%
With the level of confidence given in this table, one estimates
that the true value p will be found inside the corresponding
confidence interval calculated as described above. For example,
suppose that 3 errors are observed among a set of 10 instances.
We thus have f = 0.3 and N = 10; choosing a 90% confidence level gives z = 1.64.
The confidence interval for p is:
p = 0.3 +- 1.64*sqrt( 0.3*(1-0.3) / 10 )
To two decimal digits, the confidence interval is:
p = 0.3 +- 0.24
Thus, we can be 90% certain that the true error rate of the
particular classifier that led to the observed error rate
of 3/10 lies between 0.06 and 0.54.
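As a sanity check, the following short Python sketch computes this approximate interval and reproduces the numbers above; the function name is illustrative, not from any particular library.

    import math

    def error_confidence_interval(f, N, z):
        """Approximate interval f +- z*sqrt(f*(1-f)/N) for the true error rate p."""
        margin = z * math.sqrt(f * (1 - f) / N)
        return f - margin, f + margin

    low, high = error_confidence_interval(f=0.3, N=10, z=1.64)
    print(round(low, 2), round(high, 2))   # 0.06 0.54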
A more precise upper bound
The above approximation of the generalization error rate can be refined.
Considering the upper limit only, the following provides a better upper bound
on the generalization error:
p = (f + z^2/(2N) + z*sqrt( f*(1-f) / N + z^2/(4N^2) )) / (1 + z^2/N)
Unless the value of N is very small, this bound will be close to the
approximate value described above. For example, the more precise computation
yields an upper bound of 0.56 in the preceding example, as compared with 0.54.
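The same check can be made for the more precise bound. The sketch below simply transcribes the formula above and reproduces the 0.56 figure for the running example.

    import math

    def upper_error_bound(f, N, z):
        """More precise upper confidence bound on the true error rate p."""
        z2 = z * z
        numerator = f + z2 / (2 * N) + z * math.sqrt(f * (1 - f) / N + z2 / (4 * N * N))
        return numerator / (1 + z2 / N)

    print(round(upper_error_bound(0.3, 10, 1.64), 2))   # 0.56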
Pruning based on confidence intervals
In order to decide whether to replace a near-leaf node and its
child leaves by a single leaf node, C4.5 compares the upper limits
of the error confidence intervals for the two trees. For the unpruned
tree, the upper error estimate is calculated as a weighted average
over its child leaves. Whichever tree has a lower estimated upper
limit on the error rate "wins" and is selected.
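The comparison itself can be sketched in a few lines of Python. This is only an illustration of the rule just described, not C4.5's actual code: each child leaf is summarized by its class counts, and the helper uses the simpler approximate upper limit f + z*sqrt(f*(1-f)/N) (the more precise bound from the previous sketch could be substituted).

    import math
    from collections import Counter

    def approx_upper_bound(f, N, z):
        """Approximate upper confidence limit on the true error rate."""
        return f + z * math.sqrt(f * (1 - f) / N)

    def leaf_upper_estimate(class_counts, z, bound=approx_upper_bound):
        """Upper error estimate for a leaf that predicts its majority class."""
        n = sum(class_counts.values())
        errors = n - max(class_counts.values())
        return bound(errors / n, n, z)

    def should_prune(leaf_class_counts, z, bound=approx_upper_bound):
        """Compare the weighted average of the child leaves' upper error estimates
        (unpruned tree) with the merged leaf's upper estimate (pruned tree);
        prune when the merged leaf has the lower estimate."""
        total = sum(sum(c.values()) for c in leaf_class_counts)
        unpruned = sum(sum(c.values()) * leaf_upper_estimate(c, z, bound)
                       for c in leaf_class_counts) / total
        merged = sum(leaf_class_counts, Counter())
        pruned = leaf_upper_estimate(merged, z, bound)
        return pruned < unpruned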
Example from the labor negotiations dataset
The unpruned subtree is:
            wage increase?
      <=2.5 /            \ > 2.5
           /              \
      work hours
    <=36 /    \ >36
        /      \
   health plan
  none / half | full \
      /       |       \
   4+ 2-    1+ 1-    4+ 2-
We target the health plan node near the bottom of the tree for pruning.
First we calculate the weighted average of the estimated upper error
rates for the leaves of the unpruned tree, using the approximate upper
limit f + z*sqrt( f*(1-f) / N ) with z = 0.69 (the 75th percentile of
the standard normal distribution, i.e., a one-sided 25% confidence level):
for health plan = none leaf node:
f = 2/6, N = 6 => upper p estimate = .47
for health plan = half leaf node:
f = 1/2, N = 2 => upper p estimate = .74
for health plan = full leaf node:
f = 2/6, N = 6 => upper p estimate = .47
weighted average (with weights 6, 2, 6) over the three leaves: .51
On the other hand, if the tree were pruned by replacing the health plan
node and its leaves by a single leaf (9+, 5-), the calculation would be
as follows:
f = 5/14, N = 14 => upper p estimate = .45
Since the pruned tree results in a lower estimated upper limit on the
error rate (.45 versus .51), the health plan node and its leaves are
indeed replaced by a single leaf.
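Running the should_prune sketch given earlier on these counts reproduces the decision; this is hypothetical usage, and the class labels "+" and "-" are just placeholders.

    from collections import Counter

    leaves = [Counter({"+": 4, "-": 2}),   # health plan = none: f = 2/6
              Counter({"+": 1, "-": 1}),   # health plan = half: f = 1/2
              Counter({"+": 4, "-": 2})]   # health plan = full: f = 2/6
    print([round(leaf_upper_estimate(c, z=0.69), 2) for c in leaves])  # [0.47, 0.74, 0.47]
    print(should_prune(leaves, z=0.69))                                # True: prune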