CS 345, Machine Learning
Prof. Alvarez

Bagging (Bootstrap Aggregating)

Some machine learning techniques (e.g., decision trees, artificial neural networks) are sensitive to small variations in the training data. Bagging is a way of reducing the variance in the learned representation of a dataset for such techniques. Bagging relies on multiple bootstrap samples of the dataset: the same classification technique is trained separately on each bootstrap sample, and predictions for unseen instances are made by voting among the resulting models.

Bootstrap Sampling

Here's the "0.632 bootstrap":

Given a dataset D containing N training instances, generate a dataset D' by randomly selecting N instances from D with replacement, that is,
		initialize D' as an empty set
		repeat N times:
			randomly select an instance x from D
			add x to D'
	
Note that the selection is always made from the original D. This is what "sampling with replacement" means.
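The sampling loop above can be sketched in a few lines of Python (the function name is my own; any uniform random selection with replacement would do):

```python
import random

def bootstrap_sample(D):
    """Return a bootstrap sample D' of the dataset D.

    Each of the N draws selects uniformly at random from the
    ORIGINAL dataset D -- sampling with replacement -- so some
    instances appear multiple times and others not at all.
    """
    N = len(D)
    return [random.choice(D) for _ in range(N)]
```

For example, `bootstrap_sample(list(range(10)))` returns a list of 10 values drawn from 0..9, typically with repeats.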

Discussion

Will the bootstrap sample D' of D be the same as D? No: some instances of D will be repeated, and other instances will be left out of D' altogether. In fact, on each pass, a given instance is passed over with probability P = (N-1)/N = 1 - 1/N.
	=> P(given instance not chosen in any of the N passes) = (1 - 1/N)^N ≈ 1/e ≈ 0.368

	=> P(given instance is chosen at least once) = 1 - 0.368 = 0.632
The above may be used to evaluate a classifier: a 0.632 bootstrap sample is used for training, and the remaining 0.368N or so instances are used for testing.
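The 0.632 figure is easy to check empirically. The short simulation below (my own sketch, not part of the original handout) estimates the average fraction of distinct instances of D that make it into a bootstrap sample:

```python
import random

def inclusion_fraction(N, trials=2000):
    """Estimate the expected fraction of D's instances that appear
    at least once in a bootstrap sample of size N, averaged over
    many independent samples."""
    total = 0.0
    for _ in range(trials):
        # A bootstrap sample of indices; the set keeps distinct ones.
        chosen = {random.randrange(N) for _ in range(N)}
        total += len(chosen) / N
    return total / trials
```

For N = 100, `inclusion_fraction(100)` comes out close to 1 - (1 - 1/100)^100 ≈ 0.634, in line with the 0.632 approximation.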

Bagging pseudocode

The following pseudocode describes the bagging technique introduced by Breiman (1996).

Bagging (prediction algorithm A, dataset D, iterations T)

1) model generation

		for i = 1 to T:
			generate a bootstrap sample D(i) from D
			let M(i) be result of training A on D(i)

2) prediction for a given test instance x

		for i = 1 to T:
			let C(i) = output of M(i) on x
		
		return class that appears most often among C(1)..C(T)
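The pseudocode translates directly into Python. The sketch below uses a 1-nearest-neighbor learner as a stand-in base algorithm A (in practice a decision tree learner would be the more typical "unstable" choice); all names here are my own:

```python
import random
from collections import Counter

def nn_train(data):
    """Base learner A: 'training' a 1-nearest-neighbor classifier
    just stores the data; prediction finds the closest instance.
    Each instance is a (feature_tuple, label) pair."""
    def predict(x):
        nearest = min(data, key=lambda inst:
                      sum((a - b) ** 2 for a, b in zip(inst[0], x)))
        return nearest[1]
    return predict

def bagging_train(train, data, T):
    """Model generation: train one model M(i) on each of T
    bootstrap samples D(i) of the dataset."""
    N = len(data)
    return [train([random.choice(data) for _ in range(N)])
            for _ in range(T)]

def bagging_predict(models, x):
    """Prediction: return the class that appears most often
    among the T models' outputs on x (majority vote)."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

A small usage example: with six instances of class 'a' near 0 and six of class 'b' near 1, `bagging_train(nn_train, data, 15)` produces 15 models whose majority vote classifies points near 0 as 'a' and points near 1 as 'b'.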

Why Bagging Works: Bias-Variance Decomposition

The bias-variance decomposition was introduced by Geman et al. (1992). According to this idea, a classifier's error may be thought of as having two sources:

	the classifier algorithm's intrinsic "bias", i.e. its error rate when trained on an infinitely large training dataset

	the "variance" term associated with variations among training datasets of a given finite size

The classifier's expected error rate is the sum of these two components.

Breiman (1996) suggested that bagging works by reducing the variance component of the error. Experiments carried out by Bauer and Kohavi (1999) confirm this statement for "unstable" learning techniques such as C4.5-like decision tree learners. However, for "stable" learners such as naive Bayes, bagging can actually increase the variance, as the size of the training dataset used for each learner is effectively reduced by bootstrap sampling.