This is the third in a series of posts about decision trees in the context of machine learning. In this post, we’ll explore bagging, a popular strategy to reduce model variance, and random forests, an algorithm that uses decision trees and bagging. Random forests are very easy to use and usually deliver good performance in real-world problems. If you haven’t already, consider reading the previous parts of this series.
Climbing trees series
- Climbing trees 1: what are decision trees?
- Climbing trees 2: implementing decision trees
- Climbing trees 3: from trees to forests (you are here)
If things become too complicated, try to read the provided references. I’ve drawn upon various sources instrumental to my understanding of decision trees, including books, documentation, articles, blog posts and lectures. Even if you understand everything, check the references: there is great content there.
The code snippets below sometimes present partial implementation for brevity or to avoid repetition. Complete code is available at the climbing trees repository.
The variance problem
In the first part of this series, we’ve seen that decision trees often suffer from high variance. Their restriction to axis-aligned splits and the greedy training procedure limit their ability to separate signal from noise. Moreover, we’ve seen that it may not be trivial to tune decision trees and achieve a good bias-variance balance for a few reasons. Most regularization strategies (such as setting a maximum tree depth) rely on integer values and are thus less finely tunable than the continuous hyperparameters of other machine learning algorithms. Decision trees are also unstable due to the greedy strategy: small changes in the training data can completely change the final model.
Models with high variance and low bias (such as fully-grown decision trees) achieve low (or even 0) training error but fail to generalize to new examples.
Ensembles
Ensemble methods combine multiple machine learning algorithms or multiple instances of the same algorithm to obtain better predictions. The general idea is that learning the correct hypothesis about the data is very hard, so combining multiple hypotheses may lead to a better one. There are different types of ensembles, the most famous among them being bagging and boosting. We’ll explore bagging. Boosting is a topic for a future post.
Aggregating hypotheses
What if we could build low-bias and high-variance models, then later reduce their variance without increasing their bias? There is a theoretical way to do this: we obtain many independent datasets, train many different models and average their results. As the number of models approaches infinity we converge towards the expected classifier for the true population distribution. Since variance is a characteristic of each model, we “average out” model variance without increasing bias.
More formally, consider we have access to the true population distribution $\mathcal{P}$. Each training set $\mathcal{L}$ contains many cases $(x, y)$ independently drawn from the distribution $\mathcal{P}$, where $X$ and $Y$ are random variables. The average error for a single model can be expressed as:

$$e = \mathbb{E}_{\mathcal{L}}\,\mathbb{E}_{X,Y}\left[\left(Y - \varphi(X, \mathcal{L})\right)^2\right]$$

The model trained on the training set $\mathcal{L}$ is notated as $\varphi(\cdot, \mathcal{L})$. Since we’re expressing the average error, we take the expectation over the distribution $\mathcal{P}$ and over the training sets $\mathcal{L}$. That is, $e$ measures the expected squared error over different test points ($X$, $Y$) and over different training sets $\mathcal{L}$.

Now, let’s consider an aggregated predictor $\varphi_A(x) = \mathbb{E}_{\mathcal{L}}\left[\varphi(x, \mathcal{L})\right]$, averaged over all possible training sets. Its error can be expressed as follows:

$$e_A = \mathbb{E}_{X,Y}\left[\left(Y - \varphi_A(X)\right)^2\right]$$

There is no need to consider the expectation over all training sets since the model is already averaged. Expanding the squared term in $e$:

$$e = \mathbb{E}_{X,Y}\,\mathbb{E}_{\mathcal{L}}\left[Y^2 - 2\,Y\,\varphi(X, \mathcal{L}) + \varphi(X, \mathcal{L})^2\right]$$

Using linearity of expectation:

$$e = \mathbb{E}_{X,Y}\left[Y^2\right] - 2\,\mathbb{E}_{X,Y}\left[Y\,\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})\right]\right] + \mathbb{E}_{X,Y}\,\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})^2\right]$$

The expected value of individual predictors equals the aggregated predictor:

$$\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})\right] = \varphi_A(X)$$

Using Jensen’s inequality applied to the last term, $\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})^2\right] \geq \left(\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})\right]\right)^2 = \varphi_A(X)^2$, we can conclude:

$$e \geq \mathbb{E}_{X,Y}\left[Y^2 - 2\,Y\,\varphi_A(X) + \varphi_A(X)^2\right] = \mathbb{E}_{X,Y}\left[\left(Y - \varphi_A(X)\right)^2\right] = e_A$$

Thus, the aggregated predictor never has higher mean-squared error than any individual predictor. The improvement we get from aggregating all possible models depends on how unequal the two terms are:

$$\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})^2\right] \quad \text{versus} \quad \left(\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})\right]\right)^2$$

Rearranging, we get:

$$\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})^2\right] - \left(\mathbb{E}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})\right]\right)^2 = \mathrm{Var}_{\mathcal{L}}\left[\varphi(X, \mathcal{L})\right]$$
This value, the variance of the predictor across training sets, is high when the training procedure is unstable, indicating that the model exhibits high variance. As a result, decision trees are particularly well-suited for model aggregation, whereas k-nearest neighbors and ordinary least squares (OLS) regression do not benefit as much from this approach.
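To make the inequality concrete, here is a small simulation (my own illustration, not part of the original derivation) in which the “model” is deliberately unstable and the aggregated predictor is a plain average of independently trained copies:

```python
# A small illustrative simulation: averaging predictors trained on independent
# datasets cannot increase mean-squared error, and helps most when the base
# predictor is unstable.
import numpy as np

rng = np.random.default_rng(0)
TRUE_VALUE = 2.0

def train_and_predict(n_samples: int = 5) -> float:
    """A deliberately unstable 'model': the mean of a tiny noisy sample."""
    sample = rng.normal(loc=TRUE_VALUE, scale=3.0, size=n_samples)
    return float(sample.mean())

# Expected squared error of a single predictor, estimated over many training sets
single = np.array([train_and_predict() for _ in range(10_000)])
print("single-model MSE:", np.mean((single - TRUE_VALUE) ** 2))  # ~ 3^2 / 5 = 1.8

# Error of an aggregated predictor: the average of 50 independently trained models
aggregated = np.array(
    [np.mean([train_and_predict() for _ in range(50)]) for _ in range(500)]
)
print("aggregated MSE:", np.mean((aggregated - TRUE_VALUE) ** 2))  # ~ 1.8 / 50
```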
Bagging
Unfortunately, we don’t have infinite datasets — it’s usually hard enough to obtain one. Dividing one set into many doesn’t help either because this increases bias due to lower sample size. The solution is to create many datasets out of the one we have by bootstrap aggregating or bagging.
We start with a single training set of $n$ examples and generate new sets by bootstrapping. That is, we randomly sample from the original set with replacement and generate many new sets with $n$ elements each. Some points will be represented more than once in a sampled set, while others will be absent. In effect, we treat our training set as if it were the population distribution and sample from it. This may sound like a gimmick¹, but empirically it works. In fact, bootstrapping is widely used in statistics.
Because we’re not considering all points in every sampled set, we do increase bias — the bootstrap is, after all, an approximation. However, the decrease in variance can be higher than the increase in bias. The instability conclusion still holds, so unstable procedures benefit the most from bagging. Stable procedures, on the other hand, tend to produce worse errors when bagged.
Bagging implementation
It’s quite straightforward to implement bagging. We can apply bagging to any estimator, so we’ll use the `Classifier` and `Regressor` interfaces we’ve defined previously. The bagging estimator has very few parameters: only the number of estimators and an optional random state seed. The base estimator is created through a callable, and this estimator may have other parameters of its own.
T = TypeVar("T", bound=Classifier | Regressor)
class BaseBagging(Generic[T], ABC):
"""Base class for Bagging implementations."""
def __init__(
self,
estimator_constructor: Callable[[], T],
n_estimators: int = 100,
random_state: int | None = None,
) -> None:
"""Initialize Bagging ensemble.
Args:
estimator_constructor: Function that returns a new estimator instance
n_estimators: Number of estimators in the ensemble
random_state: Random state for reproducibility
"""
self.estimator_constructor = estimator_constructor
self.n_estimators = n_estimators
self.random_state = random_state
self._estimators: list[T] = []
if random_state is not None:
np.random.seed(random_state)
def _build_estimators(
self,
X: pd.DataFrame,
y: np.ndarray,
sample_weights: np.ndarray | None,
) -> list[T]:
"""Build estimators sequentially."""
estimators = []
for _ in range(self.n_estimators):
model = self.estimator_constructor()
indices = np.random.choice(X.shape[0], size=X.shape[0], replace=True)
sample_weight = None if sample_weights is None else sample_weights[indices]
model.fit(X.iloc[indices], y[indices], sample_weight)
estimators.append(model)
return estimators
def fit(
self,
X: pd.DataFrame,
y: np.ndarray,
sample_weights: np.ndarray | None = None,
) -> None:
"""Fit the bagging ensemble."""
self._estimators = self._build_estimators(X, y, sample_weights)
Then, we only need to implement the `predict` methods for `BaggingClassifier` and `BaggingRegressor`. Here only the `BaggingClassifier` implementation is shown for brevity:
class BaggingClassifier(BaseBagging[Classifier], Classifier):
"""Bagging Classifier implementation."""
def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class probabilities for X."""
all_proba = np.array([model.predict_proba(X) for model in self._estimators])
return np.mean(all_proba, axis=0)
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict class labels for X."""
proba = self.predict_proba(X)
return (
(proba >= 0.5).astype(int)
if proba.shape[1] == 1
else np.argmax(proba, axis=1)
)
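A quick usage sketch (the toy data and the `max_depth=5` argument are illustrative; the `DecisionTreeClassifier` constructor is assumed to follow the implementation from the previous post):

```python
# Illustrative usage of BaggingClassifier with the decision trees from this series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.uniform(size=200), "x2": rng.uniform(size=200)})
y = (X["x1"] + X["x2"] > 1).to_numpy().astype(int)

bagging = BaggingClassifier(
    estimator_constructor=lambda: DecisionTreeClassifier(max_depth=5),
    n_estimators=50,
    random_state=42,
)
bagging.fit(X, y)
print((bagging.predict(X) == y).mean())  # training accuracy of the ensemble
```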
Random Forest
The core advantage of bagging is to reduce variance by averaging many models with relatively low bias. As we’ve seen before, decision trees are inherently unstable and can achieve low bias, making them ideal algorithms for bagging.
Random forests are a particular type of bagged trees with one very important modification: at each split search, only a random subset of all features is considered. The rest of the procedure is identical to bagging, that is:
- Sample datasets with replacement.
- For each sampled set, train a decision tree, but with a random subset of features chosen before each split.
- Average the predictions across all decision trees.
To understand why feature sampling is effective, let’s consider the statistical foundations.
Consider what happens when we average $B$ identically distributed random variables. If each variable has variance $\sigma^2$ and they are completely independent, their average has variance $\sigma^2 / B$. This means that as we increase the number of trees in our ensemble, we steadily reduce the variance of our predictions.
However, the independence assumption is critical. When the variables are positively correlated with some pairwise correlation $\rho$, the variance of their average becomes:

$$\rho\,\sigma^2 + \frac{1 - \rho}{B}\,\sigma^2$$
The first term remains constant regardless of how many trees we add. Even with an infinite number of trees, we cannot eliminate this component of variance. This is where random feature sampling becomes crucial.
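Plugging illustrative numbers into this expression makes the floor explicit:

```python
# Variance of an average of B identically distributed predictors with variance
# sigma^2 and pairwise correlation rho (illustrative numbers).
sigma2 = 1.0
for rho in (0.0, 0.3, 0.7):
    for B in (10, 100, 10_000):
        variance = rho * sigma2 + (1 - rho) / B * sigma2
        print(f"rho={rho:.1f}  B={B:>6}  variance of the average={variance:.4f}")
# With rho > 0 the variance plateaus at rho * sigma2 no matter how many trees we
# add; only decorrelating the trees (lowering rho) moves that floor.
```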
De-correlating trees
Each decision tree in a random forest can be viewed as a random variable (with some distribution) generating predictions. Without feature sampling, trees grown on bootstrapped samples of the same dataset tend to have highly correlated predictions because:
- Strong predictors tend to be selected repeatedly across trees
- Similar tree structures emerge when the same dominant features are available to every tree
- The predictions follow similar patterns, especially in regions of the feature space dominated by these strong predictors
By restricting each split to consider only a random subset of features (typically $\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features), we force diversity in the tree structures. Even if some feature is strongly predictive, some splits will have to consider weaker but still informative features. This diversity directly reduces the correlation between tree predictions, which makes the ensemble’s variance reduction more effective as we add trees.
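For concreteness, with $p = 16$ features these defaults work out as follows (a small illustrative computation):

```python
import math

p = 16  # total number of features (illustrative)
print(math.ceil(math.sqrt(p)))  # 4 features per split (classification default, sqrt(p))
print(max(1, p // 3))           # 5 features per split (regression default, p / 3)
```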
Bias-variance trade-off
Individual trees might become slightly less accurate (higher bias) when restricted to fewer features, similar to how bagging may slightly increase bias due to reduced effective sample sizes. However, correlated trees tend to “fail together”, so averaging them provides less stabilization than if their errors were uncorrelated. The substantial decrease in correlation between trees usually produces a reduction in variance that outweighs the small increase in bias, so the feature sampling pays off. Bagging benefits from instability, and random forests inject even more of it into the already unstable procedure of growing decision trees.
Visualizing forests
Let’s visualize the decision boundary of a random forest. Note that the visualization has only two input features, therefore each split considers $\lceil\sqrt{2}\rceil = 2$ features (all of them). Hence, in this case bagging with decision trees and a random forest yield the same results.
Only the first three individual trees are shown, but the forest contains 100 decision trees. Notice how each decision tree can only make axis-aligned splits, which limits their ability to capture some patterns, while the combined forest produces a smoother boundary that better fits the data. Each individual tree is fit until no training error remains, yet the forest is able to reduce variance by averaging over the sampled datasets (bootstrap).
Pros and cons of random forests
Since random forests are still, at their core, decision trees, many of the pros and cons we’ve discussed previously remain valid. However, there are some important differences.
Random forests have very few hyperparameters, mainly the number of features to consider at each split and the number of trees. The latter can be set as large as one can afford, while the former has sensible defaults ($\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features). Thus, it’s possible to train a random forest without any hyperparameter tuning and still get good results. This, combined with the ability to handle numerical and categorical data with no need for scaling, makes random forests a great bang-for-the-buck algorithm.
Training random forests (and bagging algorithms, more generally) can be slow if the number of trees is high. Note that since each estimator is independent when bagging, training can be done in parallel. Also, boosting usually beats bagging, but we’ll talk about boosting in the future.
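One way to exploit that independence is to train the trees in separate processes. The sketch below is my own illustration and not how the companion repository does it; note that the classes above rely on the global `np.random.seed`, so the per-worker seeding used here is needed to keep results reproducible.

```python
# Sketch: training bagged trees in parallel processes (illustrative; assumes the
# DecisionTreeRegressor from this series, whose exact constructor may differ).
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def _fit_one_tree(args: tuple[pd.DataFrame, np.ndarray, int]):
    """Fit a single tree on a bootstrap sample drawn with a per-worker seed."""
    X, y, seed = args
    rng = np.random.default_rng(seed)
    indices = rng.choice(X.shape[0], size=X.shape[0], replace=True)
    tree = DecisionTreeRegressor(
        min_samples_leaf=2, max_features=max(1, X.shape[1] // 3)
    )
    tree.fit(X.iloc[indices], y[indices], None)
    return tree

def fit_forest_parallel(X: pd.DataFrame, y: np.ndarray, n_estimators: int = 100):
    """Train the trees of a bagged ensemble across processes."""
    args = [(X, y, seed) for seed in range(n_estimators)]
    with ProcessPoolExecutor() as pool:
        return list(pool.map(_fit_one_tree, args))
```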
One of the main advantages of decision trees is their simple interpretability — all decision paths can be visualized. This is lost when using ensemble methods such as bagging or boosting.
Out-of-bag error
By sampling with replacement when bootstrapping, it’s very likely that not all samples from the original training set end up in a given bootstrapped set. In the limit of large $n$ (the original dataset size), the probability of a particular sample not being picked is:

$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = e^{-1} \approx 0.37$$

Thus, approximately 63% of the original samples appear in each bootstrapped dataset, on average. This means that, for each sampled dataset, we effectively have two sets: one is the sampled data (in-the-bag), and the other contains all the data not chosen in the sampling process (out-of-bag).
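A quick simulation (my own sanity check, not from the repository) confirms these fractions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # size of the original training set (illustrative)

indices = rng.choice(n, size=n, replace=True)  # one bootstrapped dataset
in_bag = np.unique(indices).size / n

print(f"in-the-bag fraction:  {in_bag:.3f}")     # ~0.632
print(f"out-of-bag fraction: {1 - in_bag:.3f}")  # ~0.368, close to 1/e
```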
We can keep track of which samples are in-the-bag or out-of-bag for each tree in the forest. Then, we can make predictions on the original training dataset, but for each observation use only the trees for which that observation is out-of-bag. The out-of-bag (OOB) error is estimated using these predictions and closely matches the error estimate obtained by N-fold cross-validation. Random forests, and bagging more generally, can therefore forgo a separate validation set and train on the entire dataset, validating while training.
Each sample is out-of-bag for ~37% of the estimators and the OOB error stabilizes after some number of bagging iterations (decision trees trained). The OOB error can be monitored to determine when adding more trees would likely yield no further improvement.
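As a sketch of how OOB predictions could be computed for a regression forest: this assumes the fitting code is extended to also store each tree’s bootstrap indices; the `_bootstrap_indices` attribute below is hypothetical.

```python
import numpy as np
import pandas as pd

def oob_predict(forest, X: pd.DataFrame) -> np.ndarray:
    """For each training sample, average only the trees that did not see it
    during training (regression case). Assumes forest._bootstrap_indices holds
    the bootstrap indices used for each tree (a hypothetical attribute)."""
    n = X.shape[0]
    sums = np.zeros(n)
    counts = np.zeros(n)
    for tree, indices in zip(forest._estimators, forest._bootstrap_indices):
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[indices] = False  # anything drawn into the bag is not OOB
        if oob_mask.any():
            sums[oob_mask] += tree.predict(X.iloc[oob_mask])
            counts[oob_mask] += 1
    # Samples that were in-bag for every tree (rare) get no OOB prediction.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
```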
Overfitting
Random forests are often said to be impossible to overfit. This is true in the sense that increasing the number of trees will not overfit a random forest, as the averaged estimate converges with many trees. Nonetheless, this doesn’t mean that the bias-variance tradeoff is always optimal in random forests. We’ve seen that the correlation between trees limits variance reduction, but decreasing the number of sampled features too much to lower this correlation renders many splits useless by considering only uninformative features, hence increasing bias. Thus, there is always some correlation between trees, and bagging can only go so far to reduce variance.
Fully-grown decision trees may have a variance that is too high for bagging to compensate for. Therefore, limiting the size of each tree can and often does improve test error, especially in regression. For example, the diabetes toy dataset benefits from restricting the minimum node size rather than growing full-size trees.
Random forest implementation
Since bagging and random forests are so closely related, their implementations are also very similar.
First, we modify our previous CART implementation so that it takes an extra argument `max_features` that determines the maximum number of features considered at each split. If `max_features` is lower than the total number of features, we sample features at each split.
def _get_feature_indices(
n_features: int, max_features: int | float | None = None
) -> np.ndarray:
"""Get indices of features to consider for splitting.
Args:
n_features: Total number of features
max_features: If int, consider max_features features.
If float, consider max_features * n_features features.
If None, consider all features.
"""
if max_features is None:
return np.arange(n_features)
if isinstance(max_features, float):
if not 0.0 < max_features <= 1.0:
raise ValueError("max_features must be in (0, 1]")
max_features = int(max_features * n_features)
max_features = min(max_features, n_features)
return np.random.choice(n_features, size=max_features, replace=False)
def get_splitter(
criterion: Criterion,
max_depth: int = 0,
min_samples_leaf: int = 0,
min_criterion_reduction: float = 0,
max_features: int | float | None = None,
):
...
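A quick check of `_get_feature_indices` in isolation (the exact indices below are illustrative, since the selection is random):

```python
print(_get_feature_indices(10))       # [0 1 2 ... 9] -> all features
print(_get_feature_indices(10, 3))    # e.g. [7 2 9]
print(_get_feature_indices(10, 0.5))  # 5 randomly chosen feature indices
```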
Then, we implement the random forest:
T = TypeVar("T", DecisionTreeClassifier, DecisionTreeRegressor)
class BaseRandomForest(Generic[T], ABC):
"""Base class for Random Forest implementations."""
def __init__(
self,
n_estimators: int = 100,
max_depth: int = 0,
min_samples_leaf: int = 1,
min_criterion_reduction: float = 0,
max_features: int | float | None = None,
bootstrap: bool = True,
random_state: int | None = None,
) -> None:
"""Initialize Random Forest.
Args:
n_estimators: Number of trees in the forest
max_depth: Maximum depth of each tree
min_samples_leaf: Minimum samples required at leaf node
min_criterion_reduction: Minimum criterion reduction required for splitting
max_features: Number of features to consider for best split:
If int, consider max_features features
If float, consider max_features * n_features features
If None, consider sqrt(n_features) for classification
and n_features/3 for regression
bootstrap: Whether to use bootstrap samples
random_state: Random state for reproducibility
"""
self.n_estimators = n_estimators
self.max_depth = max_depth
self.min_samples_leaf = min_samples_leaf
self.min_criterion_reduction = min_criterion_reduction
self.max_features = max_features
self.bootstrap = bootstrap
self.random_state = random_state
self._estimators: list[T] = []
self.feature_importances_: np.ndarray | None = None
if random_state is not None:
np.random.seed(random_state)
@abstractmethod
def _make_estimator(self) -> T:
"""Create a new decision tree instance."""
pass
def _build_trees(
self,
X: pd.DataFrame,
y: np.ndarray,
sample_weights: np.ndarray | None,
) -> list[T]:
"""Build trees sequentially."""
trees = []
for _ in range(self.n_estimators):
tree = self._make_estimator()
if self.bootstrap:
indices = np.random.choice(X.shape[0], size=X.shape[0], replace=True)
sample_weight = (
None if sample_weights is None else sample_weights[indices]
)
tree.fit(X.iloc[indices], y[indices], sample_weight)
else:
tree.fit(X, y, sample_weights)
trees.append(tree)
return trees
def fit(
self,
X: pd.DataFrame,
y: np.ndarray,
sample_weights: np.ndarray | None = None,
) -> None:
"""Fit the random forest."""
self._n_features = X.shape[1]
self._estimators = self._build_trees(X, y, sample_weights)
Here only the `RandomForestRegressor` implementation is shown for brevity (classification is very similar):
class RandomForestRegressor(BaseRandomForest[DecisionTreeRegressor], Regressor):
"""Random Forest Regressor implementation."""
def _make_estimator(self) -> DecisionTreeRegressor:
max_features = self.max_features
if max_features is None:
max_features = max(1, int(self._n_features / 3))
return DecisionTreeRegressor(
max_depth=self.max_depth,
min_samples_leaf=self.min_samples_leaf,
min_criterion_reduction=self.min_criterion_reduction,
max_features=max_features,
)
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict regression target for X."""
all_predictions = np.array([tree.predict(X) for tree in self._estimators])
return np.mean(all_predictions, axis=0)
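A short usage sketch with toy data (the data and parameter values here are illustrative; the interface follows the class defined above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.uniform(size=300), "x2": rng.uniform(size=300)})
y = (
    np.sin(4 * X["x1"].to_numpy())
    + 0.5 * X["x2"].to_numpy()
    + rng.normal(scale=0.1, size=300)
)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, random_state=0)
forest.fit(X, y)
print(np.mean((forest.predict(X) - y) ** 2))  # training MSE of the forest
```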
Conclusion
We have explored the concepts of bagging and random forests, two algorithms that can improve on decision trees by addressing their high variance problem. Bagging is a fundamental method in ensemble learning, decreasing model variance via the aggregation of multiple low-bias and high-variance models trained on bootstrapped datasets while barely increasing bias. Decision trees are quite unstable, making them very suitable for bagging.
Random forests further build upon bagging by introducing randomness at the feature selection level to decorrelate trees. This amplifies the variance reduction achieved with bagging, making them a great off-the-shelf algorithm with few hyperparameters and very decent predictive power.
Random forests share many advantages with decision trees, such as requiring little data preparation, and the ability to handle both numerical and categorical data directly. In practice, their major drawback is falling behind boosting in most scenarios. If you’re going to train hundreds of decision trees, boosting can usually outperform random forests. We’ll explore boosting in the next part of this series.
References
- Ensemble learning - Wikipedia
- Breiman, L. Bagging Predictors. Machine Learning 24, 123–140 (1996).
- Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer New York.
- Machine Learning for Intelligent Systems - Cornell CS4780 (Fall 2018) by Kilian Weinberger
- Out-of-bag error - Wikipedia
Footnotes
1. It is believed that the name bootstrapping comes from the saying “pull oneself up by one’s bootstraps”, implying an attempt to do something impossible. ↩