CS 345, Machine Learning
Prof. Alvarez

Bagging (Bootstrap Aggregating)

Some machine learning techniques (e.g., decision trees, artificial neural networks) are sensitive to small variations in the training data. Bagging is a way of reducing the variance in the learned representation of a dataset for such techniques. Bagging relies on multiple bootstrap samples of the dataset: the same classification technique is trained separately on each bootstrap sample, and predictions for unseen instances are made by voting among the resulting models.

Bootstrap Sampling

Here's the "0.632 bootstrap":

Given a dataset D containing N training instances, generate a dataset D' by randomly selecting N instances from D with replacement, that is,
		initialize D' as an empty set
		repeat N times:
			randomly select an instance x from D
			add x to D'
	
Note that the selection is always made from the original D. This is what "sampling with replacement" means.
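The sampling loop above can be sketched in a few lines of Python (the function name is my own; any uniform random selection with replacement would do):

```python
import random

def bootstrap_sample(D):
    """Return a bootstrap sample D' of the dataset D.

    Each of the N draws selects uniformly at random from the
    ORIGINAL dataset D -- sampling with replacement -- so some
    instances appear multiple times and others not at all.
    """
    N = len(D)
    return [random.choice(D) for _ in range(N)]
```

For example, `bootstrap_sample(list(range(10)))` returns a list of 10 values drawn from 0..9, typically with repeats.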

Discussion

Will the bootstrap sample D' of D be the same as D? No: some instances of D will be repeated, and other instances will be left out of D' altogether. In fact, on each pass, a given instance is passed over with probability P = (N-1)/N = 1 - 1/N.
	=> P(given instance not chosen in any of the N passes) = (1 - 1/N)^N ≈ 1/e ≈ 0.368

	=> P(given instance is chosen at least once) = 1 - 0.368 = 0.632
The above may be used to evaluate a classifier: a 0.632 bootstrap sample is used for training, and the remaining 0.368N or so instances are used for testing.
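The 0.632 figure is easy to check empirically. The short simulation below (my own sketch, not part of the original handout) estimates the average fraction of distinct instances of D that make it into a bootstrap sample:

```python
import random

def inclusion_fraction(N, trials=2000):
    """Estimate the expected fraction of D's instances that appear
    at least once in a bootstrap sample of size N, averaged over
    many independent samples."""
    total = 0.0
    for _ in range(trials):
        # A bootstrap sample of indices; the set keeps distinct ones.
        chosen = {random.randrange(N) for _ in range(N)}
        total += len(chosen) / N
    return total / trials
```

For N = 100, `inclusion_fraction(100)` comes out close to 1 - (1 - 1/100)^100 ≈ 0.634, in line with the 0.632 approximation.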

Bagging pseudocode

The following pseudocode describes the bagging technique introduced by Breiman (1996).

Bagging (prediction algorithm A, dataset D, iterations T)

1) model generation

		for i = 1 to T:
			generate a bootstrap sample D(i) from D
			let M(i) be result of training A on D(i)

2) prediction for a given test instance x

		for i = 1 to T:
			let C(i) = output of M(i) on x
		
		return class that appears most often among C(1)..C(T)
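The pseudocode translates directly into Python. The sketch below uses a 1-nearest-neighbor learner as a stand-in base algorithm A (in practice a decision tree learner would be the more typical "unstable" choice); all names here are my own:

```python
import random
from collections import Counter

def nn_train(data):
    """Base learner A: 'training' a 1-nearest-neighbor classifier
    just stores the data; prediction finds the closest instance.
    Each instance is a (feature_tuple, label) pair."""
    def predict(x):
        nearest = min(data, key=lambda inst:
                      sum((a - b) ** 2 for a, b in zip(inst[0], x)))
        return nearest[1]
    return predict

def bagging_train(train, data, T):
    """Model generation: train one model M(i) on each of T
    bootstrap samples D(i) of the dataset."""
    N = len(data)
    return [train([random.choice(data) for _ in range(N)])
            for _ in range(T)]

def bagging_predict(models, x):
    """Prediction: return the class that appears most often
    among the T models' outputs on x (majority vote)."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

A small usage example: with six instances of class 'a' near 0 and six of class 'b' near 1, `bagging_train(nn_train, data, 15)` produces 15 models whose majority vote classifies points near 0 as 'a' and points near 1 as 'b'.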

Why Bagging Works: Bias-Variance Decomposition

The bias-variance decomposition was introduced by Geman et al. (1992). According to this idea, a classifier's error may be thought of as having two sources:

	the classifier algorithm's intrinsic "bias", i.e. its error rate when trained on an infinitely large training dataset

	the "variance" term associated with variations among training datasets of a given finite size

The classifier's expected error rate is the sum of these two components.

Breiman (1996) suggested that bagging works by reducing the variance component of the error. Experiments carried out by Bauer and Kohavi (1999) confirm this statement for "unstable" learning techniques such as C4.5-like decision tree learners. However, for "stable" learners such as naive Bayes, bagging can actually increase the variance, as the size of the training dataset used for each learner is effectively reduced by bootstrap sampling.