
Climbing trees 3: from trees to forests


This is the third in a series of posts about decision trees in the context of machine learning. In this post, we’ll explore bagging, a popular strategy to reduce model variance, and random forests, an algorithm that uses decision trees and bagging. Random forests are very easy to use and usually deliver good performance in real-world problems. If you haven’t already, consider reading the previous parts of this series.



If things become too complicated, try to read the provided references. I’ve drawn upon various sources instrumental to my understanding of decision trees, including books, documentation, articles, blog posts and lectures. Even if you understand everything, check the references: there is great content there.

The code snippets below sometimes present partial implementations for brevity or to avoid repetition. Complete code is available at the climbing trees repository.

The variance problem

In the first part of this series, we’ve seen that decision trees often suffer from high variance. Their limitations, namely axis-aligned splits and the greedy approximation, limit their ability to separate signal from noise. Moreover, we’ve seen that it may not be trivial to tune decision trees and achieve a good bias-variance balance, for a few reasons. Most regularization strategies (such as setting a maximum tree depth) rely on integer values and are thus less tunable than the fractional counterparts found in other machine learning algorithms. Decision trees are also unstable due to the greedy strategy: small changes in the training data can completely change the final model.

Models with high variance and low bias (such as fully-grown decision trees) achieve low (or even zero) training error but fail to generalize to new examples.

Ensembles

Ensemble methods combine multiple machine learning algorithms or multiple instances of the same algorithm to obtain better predictions. The general idea is that learning the correct hypothesis about the data is very hard, so combining multiple hypotheses may lead to a better one. There are different types of ensembles, the most famous among them being bagging and boosting. We’ll explore bagging. Boosting is a topic for a future post.

Aggregating hypotheses

What if we could build low-bias and high-variance models, then later reduce their variance without increasing their bias? There is a theoretical way to do this: we obtain many independent datasets, train many different models and average their results. As the number of models approaches infinity, we converge towards the expected classifier for the true population distribution. Since variance is a characteristic of each model, we “average out” model variance without increasing bias.

More formally, consider that we have access to the true population distribution $\mathcal{P}$. Each training set $\mathcal{L}$ contains many $(x, y)$ cases independently drawn from the distribution $\mathcal{P}$, where $X$ and $Y$ are random variables. The average error for a single model can be expressed as:

$$e = \mathbb{E}_{\mathcal{L}} \mathbb{E}_{X,Y} (Y - \varphi(x;\mathcal{L}))^2$$

The model trained on the training set $\mathcal{L}$ is denoted $\varphi(x;\mathcal{L})$. Since we’re expressing the average error, we take the expectation over the distribution, $\mathbb{E}_{X,Y}$, and over the training sets, $\mathbb{E}_{\mathcal{L}}$. That is, $e$ measures the expected squared error over different test points ($\mathbb{E}_{X,Y}$) and over different training sets ($\mathbb{E}_{\mathcal{L}}$).

Now, let’s consider an aggregated predictor averaged over all possible training sets. Its error can be expressed as follows:

$$e_a = \mathbb{E}_{X,Y} (Y - \varphi_a(x;\mathcal{P}))^2$$

There is no need to consider the expectation over all training sets since the model is already averaged. Expanding the squared term in $e$:

$$e = \mathbb{E}_{\mathcal{L}} \mathbb{E}_{X,Y} [Y^2 - 2 Y \varphi(x;\mathcal{L}) + \varphi^2(x;\mathcal{L})]$$

Using linearity of expectation:

$$e = \mathbb{E}_{X,Y} Y^2 - 2 \mathbb{E}_{X,Y} Y \mathbb{E}_{\mathcal{L}} \varphi(x;\mathcal{L}) + \mathbb{E}_{X,Y} \mathbb{E}_{\mathcal{L}} \varphi^2(x;\mathcal{L})$$

The expected value of individual predictors equals the aggregated predictor:

$$\mathbb{E}_{\mathcal{L}} \varphi(x;\mathcal{L}) = \varphi_a(x;\mathcal{P})$$

Using Jensen’s inequality, $(\mathbb{E} Z)^2 \le \mathbb{E} Z^2$, applied to $\mathbb{E}_{\mathcal{L}} \varphi(x;\mathcal{L})$, we can conclude:

$$e = \mathbb{E}_{X,Y} Y^2 - 2 \mathbb{E}_{X,Y} Y \varphi_a(x;\mathcal{P}) + \mathbb{E}_{X,Y} \mathbb{E}_{\mathcal{L}} \varphi^2(x;\mathcal{L}) \ge \mathbb{E}_{X,Y} (Y - \varphi_a(x;\mathcal{P}))^2 = e_a$$

Thus, the aggregated predictor never has a higher mean squared error than the average error of the individual predictors. The improvement we get from aggregating all possible models depends on how unequal the two sides of the following inequality are:

$$[\mathbb{E}_{\mathcal{L}} \varphi(x;\mathcal{L})]^2 \le \mathbb{E}_{\mathcal{L}} \varphi^2(x;\mathcal{L})$$

Rearranging, we get:

$$\mathbb{E}_{\mathcal{L}} \varphi^2(x;\mathcal{L}) - [\mathbb{E}_{\mathcal{L}} \varphi(x;\mathcal{L})]^2 \ge 0$$
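This quantity is exactly the variance of the individual predictor $\varphi(x;\mathcal{L})$ across training sets:

$$\mathbb{E}_{\mathcal{L}} \varphi^2(x;\mathcal{L}) - [\mathbb{E}_{\mathcal{L}} \varphi(x;\mathcal{L})]^2 = \mathrm{Var}_{\mathcal{L}}[\varphi(x;\mathcal{L})]$$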

This value is high when the training procedure is unstable, indicating that the model exhibits high variance. As a result, decision trees are particularly well suited for model aggregation, whereas k-nearest neighbors and ordinary least squares (OLS) regression do not benefit as much from this approach.

Bagging

Unfortunately, we don’t have infinite datasets — it’s usually hard enough to obtain one. Dividing one set into many doesn’t help either because this increases bias due to lower sample size. The solution is to create many datasets out of the one we have by bootstrap aggregating, or bagging.

We start with a single training set $\mathcal{L} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ and generate new sets by bootstrapping. That is, we randomly sample from the original set with replacement and generate many new sets with $N$ elements each. Some points will be represented more than once in a sampled set, while others will be absent. In other words, we treat our training set as if it were the population distribution and sample from it. This may sound like a gimmick¹, but empirically it works. In fact, bootstrapping is widely used in statistics.
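As a quick aside (this snippet is illustrative and not part of the post’s implementation), a single bootstrap sample can be drawn with NumPy. On average, only about 63% of the original points appear in it, a fact we’ll return to when discussing out-of-bag error.

import numpy as np

rng = np.random.default_rng(42)
n = 1_000  # size of the original training set

# One bootstrap sample: draw n indices with replacement from the original set.
indices = rng.integers(0, n, size=n)

# Roughly 63% of the original points appear at least once; the rest are left out.
print(f"unique fraction: {np.unique(indices).size / n:.2f}")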

Because we’re not considering all points in every sampled set, we do increase bias — the bootstrap is, after all, an approximation. However, the decrease in variance can be higher than the increase in bias. The instability conclusion still holds, so unstable procedures benefit the most from bagging. Stable procedures, on the other hand, tend to produce worse errors when bagged.

Bagging implementation

It’s quite straightforward to implement bagging. We can apply bagging to any estimator, so we’ll use the Classifier and Regressor interfaces we’ve defined previously. The bagging estimator has very few parameters: only the number of estimators and a random state seed (optional). The estimator is created through a callable, and this estimator may have other parameters.

# Imports used by the snippets in this post; Classifier and Regressor are the
# interfaces defined in the previous parts of this series.
from abc import ABC, abstractmethod
from typing import Callable, Generic, TypeVar

import numpy as np
import pandas as pd

T = TypeVar("T", bound=Classifier | Regressor)


class BaseBagging(Generic[T], ABC):
    """Base class for Bagging implementations."""

    def __init__(
        self,
        estimator_constructor: Callable[[], T],
        n_estimators: int = 100,
        random_state: int | None = None,
    ) -> None:
        """Initialize Bagging ensemble.

        Args:
            estimator_constructor: Function that returns a new estimator instance
            n_estimators: Number of estimators in the ensemble
            random_state: Random state for reproducibility
        """
        self.estimator_constructor = estimator_constructor
        self.n_estimators = n_estimators
        self.random_state = random_state

        self._estimators: list[T] = []

        if random_state is not None:
            np.random.seed(random_state)

    def _build_estimators(
        self,
        X: pd.DataFrame,
        y: np.ndarray,
        sample_weights: np.ndarray | None,
    ) -> list[T]:
        """Build estimators sequentially."""
        estimators = []
        for _ in range(self.n_estimators):
            model = self.estimator_constructor()

            indices = np.random.choice(X.shape[0], size=X.shape[0], replace=True)
            sample_weight = None if sample_weights is None else sample_weights[indices]
            model.fit(X.iloc[indices], y[indices], sample_weight)

            estimators.append(model)

        return estimators

    def fit(
        self,
        X: pd.DataFrame,
        y: np.ndarray,
        sample_weights: np.ndarray | None = None,
    ) -> None:
        """Fit the bagging ensemble."""
        self._estimators = self._build_estimators(X, y, sample_weights)

Then, we only need to implement the predict methods for BaggingClassifier and BaggingRegressor. Here only the BaggingClassifier implementation is shown for brevity:

class BaggingClassifier(BaseBagging[Classifier], Classifier):
    """Bagging Classifier implementation."""

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class probabilities for X."""
        all_proba = np.array([model.predict_proba(X) for model in self._estimators])
        return np.mean(all_proba, axis=0)

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class labels for X."""
        proba = self.predict_proba(X)
        return (
            (proba >= 0.5).astype(int)
            if proba.shape[1] == 1
            else np.argmax(proba, axis=1)
        )
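For completeness, the regression counterpart only needs to average the predictions of its estimators. A minimal sketch, assuming the Regressor interface exposes the same fit and predict methods used above:

class BaggingRegressor(BaseBagging[Regressor], Regressor):
    """Bagging Regressor implementation (sketch)."""

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict regression targets for X by averaging over all estimators."""
        all_predictions = np.array([model.predict(X) for model in self._estimators])
        return np.mean(all_predictions, axis=0)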

Random Forest

The core advantage of bagging is to reduce variance by averaging many models with relatively low bias. As we’ve seen before, decision trees are inherently unstable and can achieve low bias, making them ideal algorithms for bagging.

Random forests are a particular type of bagged trees with one very important modification: at each split search, only a random subset of all features is considered. The rest of the procedure is identical to bagging: draw bootstrap samples from the training set, fit one tree per sample, and aggregate their predictions by averaging (or by averaging class probabilities in classification).

To understand why feature sampling is effective, let’s consider the statistical foundations.

Consider what happens when we average $N$ identically distributed random variables. If each variable has variance $\sigma^2$ and they are completely independent, their average has variance $\frac{1}{N}\sigma^2$. This means that as we increase the number of trees $N$ in our ensemble, we steadily reduce the variance of our predictions.

However, the independence assumption is critical. When the variables are positively correlated with some pairwise correlation $\rho$, the variance becomes:

$$\rho \sigma^2 + \frac{1 - \rho}{N} \sigma^2$$

The first term remains constant regardless of how many trees we add. Even with an infinite number of trees, we cannot eliminate this component of variance. This is where random feature sampling becomes crucial.
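A quick simulation (illustrative only, not part of the implementation) confirms the formula: with positively correlated variables, the variance of the average plateaus near $\rho\sigma^2$ instead of shrinking towards zero as $N$ grows.

import numpy as np

rng = np.random.default_rng(0)
N, sigma2, rho = 50, 1.0, 0.3

# Equicorrelated covariance: sigma^2 on the diagonal, rho * sigma^2 off-diagonal.
cov = sigma2 * (rho * np.ones((N, N)) + (1 - rho) * np.eye(N))
samples = rng.multivariate_normal(np.zeros(N), cov, size=200_000)

empirical = samples.mean(axis=1).var()
theoretical = rho * sigma2 + (1 - rho) / N * sigma2
print(f"empirical: {empirical:.3f}, theoretical: {theoretical:.3f}")  # both ~0.314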

De-correlating trees

Each decision tree in a random forest can be viewed as a random variable (with some distribution) generating predictions. Without feature sampling, trees grown on bootstrapped samples of the same dataset tend to have highly correlated predictions because:

  1. Strong predictors tend to be selected repeatedly across trees
  2. Similar tree structures emerge when the same dominant features are available to every tree
  3. The predictions follow similar patterns, especially in regions of the feature space dominated by these strong predictors

By restricting each split to consider only a random subset of features (typically $\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features), we force diversity in the tree structures. Even if some feature is strongly predictive, some splits will have to consider weaker but still informative features. This diversity directly reduces the correlation $\rho$ between tree predictions, which makes the ensemble’s variance reduction more effective as we add trees.

Bias-variance trade-off

Individual trees might become slightly less accurate (higher bias) when restricted to fewer features, similar to how bagging may slightly increase bias due to reduced sample sizes. Still, correlation means the trees “fail together”, so averaging them provides less stabilization than if their errors were uncorrelated. Therefore, the substantial decrease in correlation between trees usually leads to a larger reduction in variance, and the sampling pays off. Bagging benefits from instability, and random forests add even more of it to the already unstable procedure of decision trees.

Visualizing forests

Let’s visualize the decision boundary of a random forest. Note that the visualization has two input features, so each split considers $\sqrt{2}$ features rounded up, which is 2 (all features). Hence, in this case bagging with decision trees and a random forest yield the same results.

Only the first three individual trees are shown, but the forest contains 100 decision trees. Notice how each decision tree can only make axis-aligned splits, which limits their ability to capture some patterns, while the combined forest produces a smoother boundary that better fits the data. Each individual tree is fit until no training error remains, yet the forest is able to reduce variance by averaging over the bootstrapped datasets.

Pros and cons of random forests

Since random forests are still, at their core, decision trees, many of the pros and cons we’ve discussed previously remain valid. However, there are some important differences.

Random forests have very few hyperparameters, mainly the number of features to consider at each split and the number of trees. The latter can be set as large as one can afford, while the former has some sensible defaults ($\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features). Thus, it’s possible to train a random forest without any hyperparameter tuning and still get good results. This, combined with the ability to handle numerical and categorical data with no need for scaling, makes random forests a great bang-for-the-buck algorithm.

Training random forests (and bagging algorithms, more generally) can be slow if the number of trees is high. Note that since each estimator is independent when bagging, training can be done in parallel. Also, boosting usually beats bagging, but we’ll talk about boosting in the future.

One of the main advantages of decision trees is their simple interpretability — all decision paths can be visualized. This is lost when using ensemble methods such as bagging or boosting.

Out-of-bag error

By sampling with replacement when bootstrapping, it’s very likely that not all samples from the original training set end up in a given bootstrapped set. In the limit of large $N$ (the original dataset size), the probability that a given sample is never picked is:

$$\lim_{N \to \infty} \left(1 - \frac{1}{N} \right)^N = e^{-1} \approx 0.368$$
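The limit is approached quickly, as a simple check shows:

for N in (10, 100, 1_000, 10_000):
    print(N, round((1 - 1 / N) ** N, 4))  # 0.3487, 0.366, 0.3677, 0.3679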

Thus, approximately 63% of the original observations appear in each bootstrapped dataset, on average. This means that, for each sampled dataset, we effectively have two sets: one is the sampled data (in-the-bag), the other contains all the data not chosen in the sampling process (out-of-bag).

We can keep track of which samples are in-the-bag or out-of-bag for each tree in the forest. Then, we can make a prediction using the same original training dataset, but for each observation use only the trees where this observation is out-of-bag. The out-of-bag (OOB) error is estimated using this prediction and closely matches the error estimate obtained by N-fold cross-validation. Random forests, and bagging more generally, can therefore forgo a validation set and train on the entire dataset, validating while training.

Each sample is out-of-bag for ~37% of the estimators, and the OOB error stabilizes after some number of bagging iterations (decision trees trained). The OOB error can be monitored to determine when adding more trees would likely yield no further improvement.
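Here is a minimal sketch of how OOB predictions could be computed for the regression case. It assumes we also stored the bootstrap indices drawn for each estimator, which the implementations in this post don’t do, so both oob_predictions and bootstrap_indices are hypothetical additions.

import numpy as np
import pandas as pd


def oob_predictions(
    estimators: list, bootstrap_indices: list[np.ndarray], X: pd.DataFrame
) -> np.ndarray:
    """Average, for each observation, the predictions of estimators that did not see it."""
    n = X.shape[0]
    sums = np.zeros(n)
    counts = np.zeros(n)
    for model, indices in zip(estimators, bootstrap_indices):
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[indices] = False  # mask out in-the-bag observations
        if oob_mask.any():
            sums[oob_mask] += model.predict(X.iloc[oob_mask])
            counts[oob_mask] += 1
    # Observations that were in-the-bag for every estimator get no OOB prediction.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)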

Overfitting

Random forests are often referred to as impossible to overfit. This is true in the sense that increasing the number of trees will not overfit a random forest, as the averaged estimate converges with many trees. Nonetheless, this doesn’t mean that the bias-variance tradeoff is always optimal in random forests. We’ve seen that the correlation between trees limits variance reduction, but decreasing the number of sampled features too much to lower this correlation renders many splits useless by considering only uninformative features, hence increasing bias. Thus, there is always some correlation between trees, and bagging can only go so far to reduce variance.

Fully-grown decision trees may have a variance that is too high for bagging to compensate for. Therefore, limiting the size of each tree can and often does improve test error, especially in regression. For example, the diabetes toy dataset benefits more from restricting minimum node size than from limiting tree size.

[Figure: test error as a function of minimum node size, showing lower error with higher minimum node size.]

Random Forest Implementation

Since bagging and random forests are so closely related, their implementations are also very similar. First, we modify our previous CART implementation so that it takes an extra argument max_features that determines the maximum number of features considered in each split. If max_features is lower than the number of features, we sample features at each split.

def _get_feature_indices(
    n_features: int, max_features: int | float | None = None
) -> np.ndarray:
    """Get indices of features to consider for splitting.

    Args:
        n_features: Total number of features
        max_features: If int, consider max_features features.
                     If float, consider max_features * n_features features.
                     If None, consider all features.
    """
    if max_features is None:
        return np.arange(n_features)

    if isinstance(max_features, float):
        if not 0.0 < max_features <= 1.0:
            raise ValueError("max_features must be in (0, 1]")
        # Guard against rounding a small fraction down to zero features.
        max_features = max(1, int(max_features * n_features))

    max_features = min(max_features, n_features)
    return np.random.choice(n_features, size=max_features, replace=False)


def get_splitter(
    criterion: Criterion,
    max_depth: int = 0,
    min_samples_leaf: int = 0,
    min_criterion_reduction: float = 0,
    max_features: int | float | None = None,
):
    ...
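For instance, the helper above behaves as follows (the actual indices depend on the random state):

print(_get_feature_indices(10))       # all 10 feature indices, in order
print(_get_feature_indices(10, 3))    # 3 distinct indices chosen at random
print(_get_feature_indices(10, 0.5))  # int(0.5 * 10) = 5 distinct indices at random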

Then, we implement the random forest:

T = TypeVar("T", DecisionTreeClassifier, DecisionTreeRegressor)


class BaseRandomForest(Generic[T], ABC):
    """Base class for Random Forest implementations."""

    def __init__(
        self,
        n_estimators: int = 100,
        max_depth: int = 0,
        min_samples_leaf: int = 1,
        min_criterion_reduction: float = 0,
        max_features: int | float | None = None,
        bootstrap: bool = True,
        random_state: int | None = None,
    ) -> None:
        """Initialize Random Forest.

        Args:
            n_estimators: Number of trees in the forest
            max_depth: Maximum depth of each tree
            min_samples_leaf: Minimum samples required at leaf node
            min_criterion_reduction: Minimum criterion reduction required for splitting
            max_features: Number of features to consider for best split:
                         If int, consider max_features features
                         If float, consider max_features * n_features features
                         If None, consider sqrt(n_features) for classification
                         and n_features/3 for regression
            bootstrap: Whether to use bootstrap samples
            random_state: Random state for reproducibility
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.min_criterion_reduction = min_criterion_reduction
        self.max_features = max_features
        self.bootstrap = bootstrap

        self.random_state = random_state

        self._estimators: list[T] = []
        self.feature_importances_: np.ndarray | None = None

        if random_state is not None:
            np.random.seed(random_state)

    @abstractmethod
    def _make_estimator(self) -> T:
        """Create a new decision tree instance."""
        pass

    def _build_trees(
        self,
        X: pd.DataFrame,
        y: np.ndarray,
        sample_weights: np.ndarray | None,
    ) -> list[T]:
        """Build trees sequentially."""
        trees = []
        for _ in range(self.n_estimators):
            tree = self._make_estimator()

            if self.bootstrap:
                indices = np.random.choice(X.shape[0], size=X.shape[0], replace=True)
                sample_weight = (
                    None if sample_weights is None else sample_weights[indices]
                )
                tree.fit(X.iloc[indices], y[indices], sample_weight)
            else:
                tree.fit(X, y, sample_weights)

            trees.append(tree)

        return trees

    def fit(
        self,
        X: pd.DataFrame,
        y: np.ndarray,
        sample_weights: np.ndarray | None = None,
    ) -> None:
        """Fit the random forest."""
        self._n_features = X.shape[1]
        self._estimators = self._build_trees(X, y, sample_weights)

Here only the RandomForestRegressor implementation is shown for brevity (classification is very similar):

class RandomForestRegressor(BaseRandomForest[DecisionTreeRegressor], Regressor):
    """Random Forest Regressor implementation."""

    def _make_estimator(self) -> DecisionTreeRegressor:
        max_features = self.max_features
        if max_features is None:
            max_features = max(1, int(self._n_features / 3))

        return DecisionTreeRegressor(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            min_criterion_reduction=self.min_criterion_reduction,
            max_features=max_features,
        )

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict regression target for X."""
        all_predictions = np.array([tree.predict(X) for tree in self._estimators])
        return np.mean(all_predictions, axis=0)
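As a sketch, the classification counterpart mainly differs in the $\sqrt{p}$ default for max_features and in averaging class probabilities rather than predictions. This assumes DecisionTreeClassifier accepts the same constructor arguments and exposes predict_proba, as in the previous part of this series:

class RandomForestClassifier(BaseRandomForest[DecisionTreeClassifier], Classifier):
    """Random Forest Classifier implementation (sketch)."""

    def _make_estimator(self) -> DecisionTreeClassifier:
        max_features = self.max_features
        if max_features is None:
            # Default for classification: square root of the number of features.
            max_features = max(1, int(np.sqrt(self._n_features)))

        return DecisionTreeClassifier(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples_leaf,
            min_criterion_reduction=self.min_criterion_reduction,
            max_features=max_features,
        )

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class probabilities for X by averaging over all trees."""
        all_proba = np.array([tree.predict_proba(X) for tree in self._estimators])
        return np.mean(all_proba, axis=0)

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict class labels for X."""
        proba = self.predict_proba(X)
        return (
            (proba >= 0.5).astype(int)
            if proba.shape[1] == 1
            else np.argmax(proba, axis=1)
        )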

Conclusion

We have explored the concepts of bagging and random forests, two algorithms that can improve on decision trees by addressing their high variance problem. Bagging is a fundamental method in ensemble learning, decreasing model variance via the aggregation of multiple low-bias and high-variance models trained on bootstrapped datasets while barely increasing bias. Decision trees are quite unstable, making them very suitable for bagging.

Random forests further build upon bagging by introducing randomness at the feature selection level to decorrelate trees. This amplifies the variance reduction achieved with bagging, making them a great off-the-shelf algorithm with few hyperparameters and very decent predictive power.

Random forests share many advantages with decision trees, such as requiring little data preparation, and the ability to handle both numerical and categorical data directly. In practice, their major drawback is falling behind boosting in most scenarios. If you’re going to train hundreds of decision trees, boosting can usually outperform random forests. We’ll explore boosting in the next part of this series.

References

Footnotes

  1. It is believed that the name bootstrapping comes from the saying “pull oneself up by one’s bootstraps”, implying an attempt to do something impossible.


