CSCI 3346, Data Mining
Prof. Alvarez
Decision Tree Pruning based on Confidence Intervals (as in C4.5)
The basic class-entropy-based decision tree induction algorithm ID3
continues to grow a tree until it makes no errors over the set
of training data. This fact makes ID3 prone to overfitting.
In order to reduce overfitting, pruning is used. A commonly used pruning
operation involves replacing a non-leaf node of the tree by a leaf node
that corresponds to the most common class among instances that reach the node.
A criterion is needed in order to determine whether pruning a given node
in this way is likely to yield an improvement in generalization.
Reduced-Error Pruning
One approach to estimating generalization performance is to withhold a
portion of the available labeled data for validation.
The validation set is not used during training.
Once training has been completed, testing is carried out
over the validation set. If the error rate of the original decision
tree over the validation set exceeds the error rate of a pruned
version of the tree (obtained by replacing a near-leaf node and its
child leaves by a single leaf node), then the pruning operation
is carried out. Reduced-error pruning can reduce overfitting, but
it also reduces the amount of data available for training.
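To make the decision concrete, here is a minimal Python sketch of the reduced-error pruning comparison. The tree representation (nested dicts for internal nodes, class labels for leaves) and the function names are illustrative assumptions, not part of any particular implementation.

    # Sketch of reduced-error pruning under an assumed tree representation:
    # an internal node is {"attribute": ..., "children": {value: subtree}},
    # and a leaf is simply a class label.

    def classify(tree, instance):
        """Follow attribute tests down the tree and return the leaf's class label."""
        while isinstance(tree, dict):
            tree = tree["children"][instance[tree["attribute"]]]
        return tree

    def error_rate(tree, validation_set):
        """Fraction of (instance, label) pairs that the tree misclassifies."""
        wrong = sum(1 for x, y in validation_set if classify(tree, x) != y)
        return wrong / len(validation_set)

    def keep_pruned_version(original_tree, pruned_tree, validation_set):
        """Prune when the original tree's validation error exceeds the pruned tree's."""
        return error_rate(original_tree, validation_set) > error_rate(pruned_tree, validation_set)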
Statistical Pruning
C4.5 makes pruning decisions based on statistical confidence estimates
of the generalization error.
This technique has the advantage that it allows all of the available
labeled data to be used for training.
Confidence intervals
The heart of the statistical pruning technique is the calculation of
a confidence interval for the generalization error rate. In brief,
one starts from an observed error rate f measured over the set of
N training instances.
The observed error rate is analogous to the observed fraction of heads
in N tosses of a (usually loaded) coin. One wishes to estimate the
true error rate p that would be observed if one could test
over all possible data instances that will ever become available.
The true error rate p is analogous to the actual probability of
heads of the given coin. Note the difference between f and p:
f is the fraction of errors (or heads) that were observed in
one particular sample of size N, while p is the fraction of errors
(heads) that will be observed over the infinitely large set of all
instances, past, present, and future.
An approximate confidence interval can be calculated as follows:
p = f +- z*sqrt( f*(1-f) / N )
Here, f, p, and N are as described above. z is a factor that
depends on the desired level of confidence. Values of z for
selected confidence levels appear below.
    z    | confidence
   ------+-----------
    0.67 |    50%
    1    |    68%
    1.64 |    90%
    1.96 |    95%
With the level of confidence given in this table, one estimates
that the true value p will be found inside the corresponding
confidence interval calculated as described above. For example,
suppose that 3 errors are observed among a set of 10 instances.
We thus have f = 0.3 and N = 10; choosing a 90% confidence level gives z = 1.64.
The confidence interval for p is:
p = 0.3 +- 1.64*sqrt( 0.3*(1-0.3) / 10 )
To two decimal digits, the confidence interval is:
p = 0.3 +- 0.24
Thus, we can be 90% certain that the true error rate of the
particular classifier that led to the observed error rate
of 3/10 lies between 0.06 and 0.54.
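As a sanity check, the following short Python sketch computes this approximate interval and reproduces the numbers above; the function name is illustrative, not from any particular library.

    import math

    def error_confidence_interval(f, N, z):
        """Approximate interval f +- z*sqrt(f*(1-f)/N) for the true error rate p."""
        margin = z * math.sqrt(f * (1 - f) / N)
        return f - margin, f + margin

    low, high = error_confidence_interval(f=0.3, N=10, z=1.64)
    print(round(low, 2), round(high, 2))   # 0.06 0.54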
A more precise upper bound
The above approximation of the generalization error rate can be refined.
Considering the upper limit only, the following provides a better upper bound
on the generalization error:
p = (f + z^2/(2N) + z*sqrt( f*(1-f) / N + z^2/(4N^2) )) / (1 + z^2/N)
Unless the value of N is very small, this bound will be close to the
approximate value described above. For example, the more precise computation
yields an upper bound of 0.56 in the preceding example, as compared with 0.54.
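The same check can be made for the more precise bound. The sketch below simply transcribes the formula above and reproduces the 0.56 figure for the running example.

    import math

    def upper_error_bound(f, N, z):
        """More precise upper confidence bound on the true error rate p."""
        z2 = z * z
        numerator = f + z2 / (2 * N) + z * math.sqrt(f * (1 - f) / N + z2 / (4 * N * N))
        return numerator / (1 + z2 / N)

    print(round(upper_error_bound(0.3, 10, 1.64), 2))   # 0.56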
Pruning based on confidence intervals
In order to decide whether to replace a near-leaf node and its
child leaves by a single leaf node, C4.5 compares the upper limits
of the error confidence intervals for the two trees. For the unpruned
tree, the upper error estimate is calculated as a weighted average
over its child leaves. Whichever tree has a lower estimated upper
limit on the error rate "wins" and is selected.
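The comparison itself can be sketched in a few lines of Python. This is only an illustration of the rule just described, not C4.5's actual code: each child leaf is summarized by its class counts, and the helper uses the simpler approximate upper limit f + z*sqrt(f*(1-f)/N) (the more precise bound from the previous sketch could be substituted).

    import math
    from collections import Counter

    def approx_upper_bound(f, N, z):
        """Approximate upper confidence limit on the true error rate."""
        return f + z * math.sqrt(f * (1 - f) / N)

    def leaf_upper_estimate(class_counts, z, bound=approx_upper_bound):
        """Upper error estimate for a leaf that predicts its majority class."""
        n = sum(class_counts.values())
        errors = n - max(class_counts.values())
        return bound(errors / n, n, z)

    def should_prune(leaf_class_counts, z, bound=approx_upper_bound):
        """Compare the weighted average of the child leaves' upper error estimates
        (unpruned tree) with the merged leaf's upper estimate (pruned tree);
        prune when the merged leaf has the lower estimate."""
        total = sum(sum(c.values()) for c in leaf_class_counts)
        unpruned = sum(sum(c.values()) * leaf_upper_estimate(c, z, bound)
                       for c in leaf_class_counts) / total
        merged = sum(leaf_class_counts, Counter())
        pruned = leaf_upper_estimate(merged, z, bound)
        return pruned < unpruned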
Example from the labor negotiations dataset
The unpruned subtree is:
            wage increase?
      <=2.5 /            \ > 2.5
           /              \
      work hours
    <=36 /    \ >36
        /      \
   health plan
  none / half | full \
      /       |       \
   4+ 2-    1+ 1-    4+ 2-
We target the health plan node near the bottom of the tree for pruning.
First we calculate the weighted average of the estimated upper error
rates for the leaves of the unpruned tree, using the approximate upper
limit f + z*sqrt( f*(1-f) / N ) with z = 0.69 (the 75th percentile of
the standard normal distribution, i.e., a one-sided 25% confidence level):
for health plan = none leaf node:
f = 2/6, N = 6 => upper p estimate = .47
for health plan = half leaf node:
f = 1/2, N = 2 => upper p estimate = .74
for health plan = full leaf node:
f = 2/6, N = 6 => upper p estimate = .47
weighted average (with weights 6, 2, 6) over the three leaves: .51
On the other hand, if the tree were pruned by replacing the health plan
node and its leaves by a single leaf (9+, 5-), the calculation would be
as follows:
f = 5/14, N = 14 => upper p estimate = .45
Since the pruned tree results in a lower estimated upper limit on the
error rate (.45 versus .51), the health plan node and its leaves are
indeed replaced by a single leaf.
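Running the should_prune sketch given earlier on these counts reproduces the decision; this is hypothetical usage, and the class labels "+" and "-" are just placeholders.

    from collections import Counter

    leaves = [Counter({"+": 4, "-": 2}),   # health plan = none: f = 2/6
              Counter({"+": 1, "-": 1}),   # health plan = half: f = 1/2
              Counter({"+": 4, "-": 2})]   # health plan = full: f = 2/6
    print([round(leaf_upper_estimate(c, z=0.69), 2) for c in leaves])  # [0.47, 0.74, 0.47]
    print(should_prune(leaves, z=0.69))                                # True: prune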