
Climbing trees 1: what are decision trees?


This is the first in a se­ries of posts about de­ci­sion trees in the con­text of ma­chine learn­ing. The goal here is to pro­vide a foun­da­tional un­der­stand­ing of de­ci­sion trees and to im­ple­ment them.

De­ci­sion trees are not amaz­ing al­go­rithms by them­selves. They have lim­i­ta­tions that can re­sult in sub­op­ti­mal and even weird pre­dic­tions. And yet, they have be­come ex­tremely pop­u­lar. Some would even say they are the de facto go-to al­go­rithm for many ma­chine learn­ing do­mains. This is due to bag­ging and boost­ing, tech­niques that turned sub­par de­ci­sion trees into state-​of-​the-​art al­go­rithms. We’ll ex­plore them in the fu­ture.

First, we'll build an intuition for what decision trees are and define them mathematically. Then, we'll explore how decision trees are built. This will allow us to grasp their main characteristics, advantages and disadvantages. I will try to introduce complexity gradually, but I will assume you have some knowledge of mathematical notation, statistics and basic machine learning concepts.

If things be­come too com­pli­cated, try to read the pro­vided ref­er­ences. I’ve drawn upon var­i­ous sources in­stru­men­tal to my un­der­stand­ing of de­ci­sion trees, in­clud­ing books, doc­u­men­ta­tion, ar­ti­cles, blog posts and lec­tures. Even if you un­der­stand every­thing, check the ref­er­ences: there is great con­tent there.

What is a de­ci­sion tree?

Imag­ine you’re try­ing to de­cide whether to take an um­brella when leav­ing home. You might ask ques­tions like: “Are there clouds?”. If yes, you might then ask “What’s the hu­mid­ity level?”. Each ques­tion helps you nar­row down the de­ci­sion. This is how a de­ci­sion tree works.

Let’s sim­u­late this weather ex­am­ple:

Scatter plot of humidity and cloud coverage showing that it's more likely to rain when it's cloudy and humid

A decision tree can be thought of as making consecutive decisions by asking a series of questions about our data. A node that divides its region into two is called an internal node: it uses a certain feature (in our example, cloud cover or humidity) and a split value to divide its region, and each of the resulting regions can be further divided in the same way.

Leaf (or terminal) nodes don't ask any more questions, but rather provide a prediction for their region. In our example, a leaf might say "Rain" or "No rain". More precisely, it assigns a probability to each outcome.
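To make this structure concrete, here is a minimal sketch of how internal and leaf nodes could be represented. This is my own illustration rather than the implementation we'll build later; the split values loosely echo the example tree shown below, while the class probabilities are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes carry a feature index and a split value;
    # leaf nodes carry a prediction (here, class probabilities) instead.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[dict] = None


def predict(node: Node, x: list) -> dict:
    """Follow the questions from the root to a leaf and return its prediction."""
    if node.prediction is not None:        # leaf: no more questions to ask
        return node.prediction
    if x[node.feature] <= node.threshold:  # "is this feature below the split value?"
        return predict(node.left, x)
    return predict(node.right, x)


# A tiny, hand-built weather tree: split on cloud cover, then on humidity.
tree = Node(
    feature=0, threshold=45,  # cloud cover (%)
    left=Node(prediction={"No rain": 0.9, "Rain": 0.1}),
    right=Node(
        feature=1, threshold=59,  # humidity (%)
        left=Node(prediction={"No rain": 0.7, "Rain": 0.3}),
        right=Node(prediction={"No rain": 0.2, "Rain": 0.8}),
    ),
)

print(predict(tree, [80, 75]))  # cloudy and humid -> mostly "Rain"
```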

Let’s take a look at a de­ci­sion tree fit­ted to our weather ex­am­ple:

Graph of decision tree splits

This decision tree was kept intentionally small. From top to bottom, this graph represents all decision boundaries of our simple tree. Each internal node produces two branches, that is, two paths that can be followed. These branches recursively partition the feature space such that samples with the same labels or similar target values are grouped together. For instance, we predict that it'll rain if humidity is above 59% and cloud cover is above 45% (rightmost path in the graph) because most points (instances) in this region are of the "Rain" class.

The class shown is only relevant for leaf nodes, that is, those in the bottom row. The value property shows how many samples there are for each class in each region. A pure node has only instances of one class in its region. Since all nodes contain at least one instance of both classes (that is, "Rain" and "No Rain"), all nodes are impure. The objective during training is to reduce impurity as much as possible. If all leaf nodes are pure, then the tree has zero error on the training data set. This is how decision trees learn.

This vi­su­al­iza­tion of the tree is very easy to in­ter­pret. We can fol­low each path and clearly see why a pre­dic­tion was made, that is, the model is eas­ily ex­plain­able (the op­po­site of a black-​box model). This is one of the rea­sons why de­ci­sion trees are pop­u­lar. Sim­ple de­ci­sion trees can even be ap­plied in some prac­ti­cal set­tings (e.g. med­i­cine) with­out ma­chine as­sis­tance.

We can also vi­su­al­ize the de­ci­sion bound­aries of the tree by over­lay­ing them onto the scat­ter plot of our data:

Scatter plot of humidity and cloud coverage showing prediction regions from a decision tree

We can see the straight boundaries between regions. More formally, a decision tree is a hierarchical structure that recursively divides the feature space into cuboid regions. Since we have 2 features (2 dimensions) in our example, the cuboid regions are axis-aligned rectangles.

Types of trees

Broadly, there are two types of de­ci­sion trees: clas­si­fi­ca­tion and re­gres­sion trees. While they share the same fun­da­men­tal struc­ture and split­ting method­ol­ogy, they dif­fer in their out­put and how they make pre­dic­tions.

Clas­si­fi­ca­tion trees are de­signed to pre­dict cat­e­gor­i­cal out­comes — they as­sign input data to pre­de­fined classes or cat­e­gories. At each leaf node, the tree pre­dicts the most com­mon class among the train­ing sam­ples that reached that node. Our weather ex­am­ple in­volves a clas­si­fi­ca­tion tree and the leaves pre­dict whether it will rain or not. The pre­dic­tion is made by count­ing the pro­por­tion of train­ing sam­ples of each class at the leaf node and se­lect­ing the ma­jor­ity class. Think of it as the tree ask­ing a se­ries of yes/no ques­tions about the input fea­tures until it can make an ed­u­cated guess about which cat­e­gory the input be­longs to.

Re­gres­sion trees, on the other hand, pre­dict con­tin­u­ous nu­mer­i­cal val­ues rather than cat­e­gories. In­stead of pre­dict­ing a class at each leaf node, re­gres­sion trees typ­i­cally out­put the av­er­age value of all train­ing sam­ples that reached that node. For in­stance, a re­gres­sion tree might pre­dict a house’s price based on fea­tures like square footage, num­ber of bed­rooms, and lo­ca­tion. Each split in the tree tries to group to­gether sim­i­lar nu­mer­i­cal val­ues. When a new ex­am­ple comes in, the tree can guide it to a leaf node con­tain­ing train­ing ex­am­ples with sim­i­lar tar­get val­ues and use their av­er­age as the pre­dic­tion.
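As a small sketch of this difference (the helper names and sample values are mine, not from any library), the two kinds of leaves compute their values from the training samples that reach them like this:

```python
from collections import Counter


def classification_leaf_value(labels: list) -> dict:
    """Class proportions among the samples that reached the leaf;
    the predicted class is the majority one."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}


def regression_leaf_value(targets: list) -> float:
    """Average target value of the samples that reached the leaf."""
    return sum(targets) / len(targets)


print(classification_leaf_value(["Rain", "Rain", "No rain"]))  # ≈ {'Rain': 0.67, 'No rain': 0.33}
print(regression_leaf_value([210_000, 250_000, 230_000]))      # 230000.0
```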

De­ci­sion tree al­go­rithms

De­ci­sion trees have been widely ex­plored and there is a wide range of dif­fer­ent de­ci­sion tree al­go­rithms. The most in­flu­en­tial ones are ID3 (It­er­a­tive Di­chotomiser 3), C4.5 and CART (Clas­si­fi­ca­tion and Re­gres­sion Trees).

ID3 is one of the earliest decision tree algorithms and supports only classification with categorical features. Nodes can be split into more than two branches depending on the number of categorical levels of each feature.

C4.5 ex­tends the ID3 al­go­rithm to sup­port nu­mer­i­cal fea­tures and miss­ing val­ues. CART is sim­i­lar to C4.5 but also adds sup­port for re­gres­sion.

The CART ap­proach only per­forms bi­nary splits, re­sult­ing in taller trees when com­pared to ID3 and C4.5. Al­though the prin­ci­ples dis­cussed here apply to most de­ci­sion tree al­go­rithms, I am fo­cused on CART — which is also the al­go­rithm we’ll im­ple­ment.

Math­e­mat­i­cal de­f­i­n­i­tion

Math­e­mat­i­cally, a de­ci­sion tree can be de­scribed as:

$$f(x) = \sum_{m=1}^{M} c_m \mathbb{I}(x \in R_m)$$

Where $x$ are the input features, $R_m$ is the $m$-th region and $c_m$ is the value of this region. $\mathbb{I}(x \in R_m)$ is 1 if $x$ is contained in the $m$-th region, 0 otherwise. The values ($c$) are typically a constant (regression) or a vector of probabilities (classification). The combination of regions and values defines the decision tree.
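Read as code, the definition says: find the single region that contains $x$ and return that region's value. Below is a minimal sketch of this idea; the dictionary-of-bounds representation of a region is my own simplification for illustration.

```python
INF = float("inf")

# Each region is an axis-aligned box, {feature name: (low, high)}, paired with its value c_m.
regions = [
    ({"humidity": (59, INF), "cloud": (45, INF)}, "Rain"),
    # ... the remaining regions partition the rest of the feature space
]


def f(x: dict, regions: list):
    """Evaluate the piecewise-constant tree function: sum over regions of c_m * I(x in R_m)."""
    for bounds, value in regions:
        if all(low < x[name] <= high for name, (low, high) in bounds.items()):
            return value  # the indicator is 1 for exactly one region
    raise ValueError("x falls outside every listed region")


print(f({"humidity": 75, "cloud": 80}, regions))  # "Rain"
```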

The re­gions can­not as­sume ar­bi­trary bound­aries, though. They are al­ways par­al­lel to some axis and can only di­vide a pre­vi­ous re­gion. This can be a lim­i­ta­tion, but it greatly re­duces the com­pu­ta­tional com­plex­ity of con­struct­ing a de­ci­sion tree. In our weather ex­am­ple, it’s triv­ial to find the fol­low­ing de­ci­sion bound­ary using other meth­ods (I have used lo­gis­tic re­gres­sion):

Weather scatter plot with diagonal decision boundary

However, axis-parallel splits (single feature) are much easier to compute than oblique splits (multiple features). Finding the best split of a single feature involves sorting the data and evaluating splits. Since the latter is negligible compared to sorting, this operation has a time complexity of $O(n \log n)$, where $n$ is the number of data points. To find the best oblique split combining two features, however, we must first consider all possible $O(n^2)$ lines formed by pairs of points. For each line, you need to evaluate which side each point falls on: $O(n)$. This amounts to a total time complexity of $O(n^3)$. More generally, an oblique split has a time complexity of $O(n^{d+1})$, in which $d$ is the number of features. Therefore, we compromise on using only axis-parallel splits, which define cuboid regions.

Each region $R_m$ is defined by a path from the root to the $m$-th leaf of the tree. For instance, consider the path that defines the region $R = \{(\text{Humidity},\ \text{Cloud})\ |\ \text{Humidity} > 59\ \text{and}\ \text{Cloud} > 45\}$.

Path between the root node and a leaf node

This re­gion can be vi­su­al­ized on the scat­ter plot:

Defined region in scatter plot

Un­like lin­ear mod­els, de­ci­sion trees do not model the en­tire data dis­tri­b­u­tion. Rather, each re­gion has in­de­pen­dent pre­dicted val­ues. More for­mally, adding all in­de­pen­dent re­gions de­fines a piece­wise func­tion that can ap­prox­i­mate any pat­tern in the data, but it may strug­gle to rep­re­sent smooth or con­tin­u­ous func­tions prop­erly.

Bias-​variance trade­off

As all other ma­chine learn­ing al­go­rithms, trees are also haunted by the bias-​variance trade­off. If this con­cept is new to you, I highly rec­om­mend read­ing about it first (check the ref­er­ences), but come back later.

Bias-variance tradeoff refresher

In summary, bias refers to the error that a model makes due to oversimplification of relationships between features (underfitting). Variance measures how sensitive model predictions are to small fluctuations in the training set. High variance means that the model is capturing noise rather than true relationships (overfitting). Reducing bias tends to increase variance and vice versa. Finding an optimal bias-variance balance is crucial to achieve good prediction accuracy.

De­ci­sion tree vari­ance

Due to the non-linear and non-smooth nature of trees, they easily capture noise. Thus, deeper trees present more variance. If you fully grow a tree, it will partition the feature space until the error is zero [1]. The model will have effectively memorized the training set — including noise — which results in overfitting.

If we re­build our ex­am­ple tree until the error is 0 we get the fol­low­ing re­gions:

Scatter plot of humidity and cloud coverage showing prediction regions from a deep decision tree

It has perfect accuracy, but it has very unusual boundaries due to noise (high variance). This model doesn't generalize well, that is, it would score poorly on new data. There are different ways to limit tree variance, namely:

- limiting the maximum depth of the tree;
- requiring a minimum number of samples per leaf node;
- requiring a minimum number of samples to split a node.

The number of samples required to fill a tree doubles for each additional level (depth) of the tree. These limits ensure leaf nodes are not overly sparse. Another possibility is to fully grow the tree and later prune it to balance complexity and accuracy. A sketch of these limits in practice is shown below.
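As an illustration of these limits in practice, scikit-learn (assuming you have it installed) exposes them as hyperparameters; the synthetic dataset and parameter values below are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

limited = DecisionTreeClassifier(
    max_depth=3,           # limit the number of levels
    min_samples_leaf=10,   # keep leaves from becoming overly sparse
    min_samples_split=20,  # require enough samples before splitting a node
    random_state=0,
).fit(X, y)

# Alternative: grow the tree fully, then prune it (cost-complexity pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print(limited.get_depth(), limited.get_n_leaves())
print(pruned.get_depth(), pruned.get_n_leaves())
```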

Let’s build de­ci­sion trees with in­creas­ing depths on a toy dataset:

Interactive plot: decision boundaries and train/test errors for trees of increasing depth (adjustable with a depth slider).

On the left we can see the training set [2] with two features and two classes. We're building classification trees, whose decision boundary is overlaid onto the left plot. The plot on the right shows training and test error, as well as the test error of a logistic regression model as a baseline. The test set was sampled from the same distribution, but is different from the training set. Use the slider to compare depths. We can see that the best test error happens with a depth of 3. Increasing the depth further results in overfitting.

De­ci­sion tree bias

Even with a depth of three in the example above, the error is larger than the logistic regression baseline. This is an example where linear models outperform decision trees due to the additive structure of the data. The logistic regression equation in this case is:

$$\operatorname{logit}(p) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \varepsilon$$

Where $p$ is the probability of the outcome being of class 1. Ignoring constants and the noise ($\varepsilon$), the probability is determined by a linear combination of both features. The relationship between features is not hierarchical, therefore decision trees require multiple, sometimes redundant, splits to approximate this concept, leading to deep or overly complex trees.

There are other concepts that fall under the same category, such as:

- XOR relationships;
- parity checks;
- multiplexer problems.

These may seem overly specific, but they're used as benchmarks to test the capabilities of machine learning algorithms. Additive structure, XOR and parity involve global dependencies that require multiple inputs to be considered together. Decision trees split data based on one feature at a time, therefore they are inefficient at capturing these relationships. Multiplexer problems require conditional rules which are not hierarchical, making the tree very large to account for all combinations of selector-input pairs.

In more com­plex real-​world datasets, it’s quite likely that mul­ti­ple non-​hierarchical con­cepts are present, lead­ing to sub­op­ti­mal bias. If they have sub­par vari­ance and bias, de­ci­sion trees may look like a poor choice of al­go­rithm. In­deed, they rarely shine on their own, but rather as en­sem­bles (with bag­ging or boost­ing), which we’ll cover in the fu­ture.

The stair­case ef­fect

When decision trees attempt to model additive structure, they face a structural constraint that is common enough for me to name it the "staircase" effect [3]. This limitation manifests differently in classification and regression tasks, but stems from axis-aligned splits.

In clas­si­fi­ca­tion prob­lems, the stair­case ef­fect be­comes ap­par­ent when the op­ti­mal de­ci­sion bound­ary be­tween classes in­volves a lin­ear com­bi­na­tion of fea­tures (an oblique bound­ary). Con­sider a sim­ple case where the op­ti­mal bound­ary is a di­ag­o­nal line in a two-​dimensional fea­ture space. Be­cause de­ci­sion trees can only make splits par­al­lel to the fea­ture axes, they must ap­prox­i­mate this di­ag­o­nal bound­ary through a se­ries of rec­tan­gu­lar re­gions, cre­at­ing a stair-​like pat­tern.

Demonstration of the staircase effect in classification

In re­gres­sion set­tings, the stair­case ef­fect is even more promi­nent be­cause re­gres­sion trees must ap­prox­i­mate con­tin­u­ous func­tions using dis­crete, constant-​valued re­gions. When the un­der­ly­ing re­la­tion­ship is smooth — whether lin­ear, poly­no­mial, or any other con­tin­u­ous func­tion — the tree’s pre­dic­tion sur­face has a stair-​like struc­ture with abrupt changes be­tween ad­ja­cent re­gions. In­creas­ing tree depth ap­prox­i­mates the pre­dic­tion sur­face to the train­ing set, yet the sur­face does not nec­es­sar­ily be­come smooth as it cap­tures fea­ture noise.

Demonstration of the staircase effect in regression

Regression example with the staircase effect. There is only one feature ($x$) and the vertical axis is the outcome.
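A quick sketch of this behavior with scikit-learn (the sine-shaped target and the chosen depths are arbitrary assumptions): the fitted regression tree predicts a step function, and adding depth adds more steps rather than smoothness.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)  # smooth target plus noise

for depth in (2, 5, 12):
    model = DecisionTreeRegressor(max_depth=depth).fit(x, y)
    # The prediction is piecewise constant: count the distinct "steps".
    print(depth, len(np.unique(model.predict(x))))
```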

Ob­jec­tive func­tions

We need to de­fine ob­jec­tive func­tions to op­ti­mize for dur­ing train­ing. Clas­si­fi­ca­tion trees use ei­ther Gini im­pu­rity or en­tropy as ob­jec­tive func­tions. Re­gres­sion trees usu­ally use the squared loss.

In all examples, consider the data $\mathcal{D} = \{(x_1, y_1), ..., (x_n, y_n)\},\ y_i \in \{1, ..., c\}$, where $c$ is the number of classes.

Mis­clas­si­fi­ca­tion rate

$$MR(\mathcal{D}) = \frac{\sum_{i=1}^n \mathbb{I}(\hat{y}_i \neq y_i)}{n}$$

This is a very intuitive objective for classification. It measures the proportion of misclassified examples. The prediction $\hat{y}$ is the majority vote of the node. Our goal is to decrease leaf node impurity, which is aligned with the misclassification rate function.

Gini im­pu­rity

The Gini impurity [4] over a set of class probabilities $p$ is defined as:

$$G(\mathcal{D}) = \sum_{k=1}^{c} p_{k}(1 - p_{k}) = 1 - \sum_{k=1}^c p_{k}^2$$

It measures how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set. When the node is pure, one class has a probability of 1 and the rest 0, so the Gini impurity is also 0. The worst Gini impurity arises when the probabilities are uniform among classes, that is, a node with maximum Gini impurity is no better than a coin toss at classifying examples. It can be shown that the Gini impurity approaches its upper bound of 1 as the number of classes grows.

To evaluate the quality of a split ($\mathcal{S}$), we calculate the Gini impurity of each child node and take the weighted average.

$$G(\mathcal{D}|\mathcal{S}) = \frac{|\mathcal{D}_L|}{|\mathcal{D}|} G(\mathcal{D}_L) + \frac{|\mathcal{D}_R|}{|\mathcal{D}|} G(\mathcal{D}_R)$$

Where $\frac{|\mathcal{D}_L|}{|\mathcal{D}|}$ is the fraction of inputs in the left child node and $\frac{|\mathcal{D}_R|}{|\mathcal{D}|}$, in the right.
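Both formulas translate directly into code. Here is a minimal sketch; the helper names are my own.

```python
from collections import Counter


def gini(labels: list) -> float:
    """Gini impurity: 1 - sum over classes of p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left: list, right: list) -> float:
    """Weighted average Gini impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


print(gini(["Rain"] * 10))                   # 0.0 (pure node)
print(gini(["Rain"] * 5 + ["No rain"] * 5))  # 0.5 (worst case for two classes)
print(gini_split(["Rain"] * 10, ["Rain"] * 5 + ["No rain"] * 5))
```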

En­tropy

$$H(\mathcal{D}) = - \sum_{k=1}^{c} p_{k}\ \log_{2} p_{k}$$

The en­tropy cri­te­rion is a con­cept from in­for­ma­tion the­ory (Shan­non en­tropy). It mea­sures the av­er­age level of un­cer­tainty of a set of prob­a­bil­i­ties. In other words, it’s the ex­pected amount of in­for­ma­tion re­quired to de­scribe the po­ten­tial out­comes. When prob­a­bil­i­ties are uni­formly dis­trib­uted en­tropy is max­i­mum, just like Gini im­pu­rity. To un­der­stand the con­nec­tion be­tween in­for­ma­tion and prob­a­bil­i­ties, con­sider you have a set with two classes that are equally likely (let’s say red and blue). If you draw a sam­ple, you have the least pos­si­ble amount of cer­tainty about which color you’ve drawn. On the other hand, if red has a 95% prob­a­bil­ity, you can be pretty con­fi­dent about which color you’ll draw — this has much lower un­cer­tainty. When prob­a­bil­i­ties are uni­form you need extra in­for­ma­tion to ac­cu­rately con­vey the re­sult of a se­ries of draws.

The range of the entropy criterion is from 0 to $\log_{2}(c)$. We also take the weighted average of child nodes to evaluate the quality of a split:

$$H(\mathcal{D}|\mathcal{S}) = \frac{|\mathcal{D}_L|}{|\mathcal{D}|} H(\mathcal{D}_L) + \frac{|\mathcal{D}_R|}{|\mathcal{D}|} H(\mathcal{D}_R)$$

It’s also com­mon to see the ob­jec­tive func­tion ex­pressed in terms of in­for­ma­tion gain, which is the de­crease in en­tropy yielded by a node split.

$$IG(\mathcal{D}) = H(\mathcal{D}) - H(\mathcal{D}|\mathcal{S})$$

It rep­re­sents the amount of in­for­ma­tion gained about our tar­get vari­able given our split. In these terms, the ob­jec­tive is to max­i­mize in­for­ma­tion gain.
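A matching sketch of entropy and information gain (again, the helper names are mine):

```python
import math
from collections import Counter


def entropy(labels: list) -> float:
    """Shannon entropy: -sum over classes of p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent: list, left: list, right: list) -> float:
    """Decrease in entropy obtained by splitting the parent node into two children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children


parent = ["Rain"] * 10 + ["No rain"] * 10
print(entropy(parent))                                            # 1.0 (uniform, two classes)
print(information_gain(parent, ["Rain"] * 10, ["No rain"] * 10))  # 1.0 (a perfect split)
```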

Com­par­ing clas­si­fi­ca­tion ob­jec­tives

Although misclassification rate is intuitive and commonly used as a model metric, it's not used as an optimization objective. We can come up with splits ($\mathcal{S}$) that have the same misclassification rate, but one has a pure node while the other doesn't. For instance:

$$\mathcal{S} = \{A: 10, B: 0\},\ \{A: 10, B: 5\}$$

$$\mathcal{S}' = \{A: 10, B: 2\},\ \{A: 10, B: 3\}$$

Both splits mis­clas­sify 5 sam­ples of class B (same mis­clas­si­fi­ca­tion rate), how­ever the first has a pure node. From an op­ti­miza­tion per­spec­tive, we should favor pure nodes as they re­duce the num­ber of sub­se­quent splits. You can check that in­deed both the en­tropy and the Gini im­pu­rity of the sec­ond split are higher.

Checking that this claim holds

Mis­clas­si­fi­ca­tion rate:

$$MR(\mathcal{D}|\mathcal{S}) = \frac{10}{25} \cdot 0 + \frac{15}{25} \cdot \frac{5}{15} = 0.2$$

$$MR(\mathcal{D}|\mathcal{S}') = \frac{12}{25} \cdot \frac{2}{12} + \frac{13}{25} \cdot \frac{3}{13} = 0.2$$

$$MR(\mathcal{D}|\mathcal{S}) = MR(\mathcal{D}|\mathcal{S}')$$

Gini im­pu­rity:

$$G(\mathcal{D}|\mathcal{S}) = \frac{10}{25} \cdot 0 + \frac{15}{25} \cdot 0.444 = 0.266$$

$$G(\mathcal{D}|\mathcal{S}') = \frac{12}{25} \cdot 0.277 + \frac{13}{25} \cdot 0.355 = 0.318$$

$$G(\mathcal{D}|\mathcal{S}) < G(\mathcal{D}|\mathcal{S}')$$
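These numbers can also be reproduced in a few lines of Python (the snippet is self-contained, so the impurity helpers are repeated here):

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def misclassification(labels):
    n = len(labels)
    return 1 - Counter(labels).most_common(1)[0][1] / n


def weighted(metric, left, right):
    """Weighted average of a node metric over the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * metric(left) + len(right) / n * metric(right)


split = (["A"] * 10, ["A"] * 10 + ["B"] * 5)                    # S: one pure node
split_prime = (["A"] * 10 + ["B"] * 2, ["A"] * 10 + ["B"] * 3)  # S': no pure node

print(weighted(misclassification, *split), weighted(misclassification, *split_prime))  # 0.2 0.2
print(weighted(gini, *split), weighted(gini, *split_prime))                            # ≈0.267 ≈0.318
```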

For the same reason, a tree may get "stuck" when no split reduces the misclassification rate. Gini impurity and entropy consider probabilities, while misclassification rate considers only the majority vote and therefore is not sensitive enough.

More formally, both Gini impurity and entropy are strictly concave functions. If you take two points on the curve of a strictly concave function and draw a line between them, the line will always lie below the curve.

Plot of the entropy function with a line segment below the curve

Plot of entropy $H(p)$ over probabilities $p$ with two classes. The probability of the second class is $(1 - p)$.

We know from the data pro­cess­ing in­equal­ity that the av­er­age en­tropy can­not in­crease after a split. Ei­ther a split re­duces en­tropy in both child nodes — which re­sults in lower av­er­age en­tropy com­pared to the par­ent node — or one child node has higher en­tropy than the par­ent node and the other lower. Both child nodes can­not have higher en­tropy than the par­ent node be­cause this would mean that the split cre­ated new in­for­ma­tion (un­cer­tainty), vi­o­lat­ing the data pro­cess­ing in­equal­ity. Thus, we only need to show that the av­er­age en­tropy de­creases when one child node has higher en­tropy and the other lower com­pared to the par­ent node. The strict con­cav­ity of the en­tropy func­tion is enough to en­sure this.

Plot of entropy over probabilities showing labeled node points

The average child node entropy lies somewhere along the line between the two child nodes' points on the curve. The average probability must be the same as that of the parent node, so the average child node entropy always lies directly below the parent node entropy. Hence, there is almost [5] always a split which results in a lower average child node entropy.

Mis­clas­si­fi­ca­tion rate is not strictly con­cave. The av­er­age mis­clas­si­fi­ca­tion rate of child nodes is equal to that of the par­ent node when both child nodes lie on the same side of the func­tion, halt­ing the op­ti­miza­tion process.

Plot showing line segment over misclassification rate function

Squared loss

$$L(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^N (y_{i} - \bar{y})^2$$

The squared loss function is the primary optimization criterion for regression trees. In this equation, $\bar{y}$ represents the mean value of the target variable $y$ within a specific node. This value becomes the prediction ($c$) for all observations that fall into that node, following the mathematical definition of decision trees.

While this formula resembles the traditional mean squared error (MSE) used in many statistical applications, there is an important distinction. In standard regression models, MSE typically measures the difference between predictions ($\hat{y}$) and actual values, where $\hat{y}$ can be directly optimized through model parameters. However, in the context of decision trees, this squared loss function actually quantifies the within-node variance of the target variable.

This vari­ance in­ter­pre­ta­tion leads to an in­tu­itive un­der­stand­ing of the tree-​building process: the al­go­rithm seeks splits that max­i­mize the re­duc­tion in vari­ance be­tween the par­ent node and its chil­dren. This ap­proach, often termed vari­ance re­duc­tion, ef­fec­tively par­ti­tions the data into in­creas­ingly ho­mo­ge­neous sub­groups with re­spect to the tar­get vari­able. The op­ti­mal split is one that cre­ates child nodes with min­i­mal av­er­age vari­ance, thereby im­prov­ing the tree’s pre­dic­tive ac­cu­racy.
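A brief sketch of variance reduction for a candidate split; the toy target values and function names are made up for illustration.

```python
def node_variance(targets: list) -> float:
    """Squared loss of a node: mean squared deviation from the node mean."""
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)


def variance_reduction(parent: list, left: list, right: list) -> float:
    """How much the weighted child variance drops relative to the parent node."""
    n = len(parent)
    children = len(left) / n * node_variance(left) + len(right) / n * node_variance(right)
    return node_variance(parent) - children


prices = [100, 110, 120, 300, 310, 320]
# Splitting cheap from expensive houses removes most of the variance.
print(variance_reduction(prices, prices[:3], prices[3:]))  # ≈ 10000
```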

Build­ing a de­ci­sion tree

A sufficiently large tree will have one training sample per node, effectively memorizing the training set. Such a tree achieves 0 training error, but it is a pathological optimization case. Our goal is to build the smallest possible tree that achieves 0 training error. If limits such as maximum depth are enforced, we should find the tree that minimizes the objective while respecting such limits.

Un­for­tu­nately, the com­pu­ta­tional com­plex­ity grows com­bi­na­to­ri­ally and find­ing the op­ti­mal tree is NP-​complete. It would be nice to build trees be­fore the heat death of the uni­verse, so we pro­ceed with a greedy al­go­rithm. We make a se­ries of lo­cally op­ti­mal choices, hop­ing that it leads to a so­lu­tion some­what close to the global op­ti­mum.

Let's build a classification tree using the Gini impurity criterion. Starting with all of the data and the first variable, we test all possible split points ($N - 1$ splits) and choose the one with the lowest criterion. Then, we repeat this procedure with all other features. The split with the lowest criterion over all features is chosen. Finally, we repeat the procedure for the left and right child nodes recursively until the criterion is 0 in all nodes or until a tree size limit is reached.

Rep­re­sen­ta­tion of the op­ti­miza­tion process con­sid­er­ing a sin­gle fea­ture. We eval­u­ate all split points to find the one that min­i­mizes the ob­jec­tive func­tion. The same process must be done with the other fea­tures to choose the op­ti­mal split.
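To make the greedy search more tangible, here is a rough sketch of finding the best split with the Gini criterion. It is an illustration, not the CART implementation we'll build in the follow-up, and it lazily rebuilds the child lists for every candidate threshold; a real implementation would sweep the sorted values and update class counts incrementally, as discussed in the complexity section above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """For each feature, sort the data and evaluate every candidate threshold
    (N - 1 of them), keeping the split with the lowest weighted Gini impurity."""
    best_score, best_feature, best_threshold = float("inf"), None, None
    for j in range(len(X[0])):
        order = sorted(range(len(y)), key=lambda i: X[i][j])
        for a, b in zip(order, order[1:]):
            if X[a][j] == X[b][j]:
                continue  # equal consecutive values yield no usable threshold
            threshold = (X[a][j] + X[b][j]) / 2
            left = [y[i] for i in order if X[i][j] <= threshold]
            right = [y[i] for i in order if X[i][j] > threshold]
            score = len(left) / len(y) * gini(left) + len(right) / len(y) * gini(right)
            if score < best_score:
                best_score, best_feature, best_threshold = score, j, threshold
    return best_feature, best_threshold, best_score


# Made-up points: [humidity, cloud cover] and whether it rained.
X = [[10, 20], [20, 25], [70, 80], [80, 85]]
y = ["No rain", "No rain", "Rain", "Rain"]
print(best_split(X, y))  # (0, 45.0, 0.0): humidity <= 45 separates the classes perfectly
```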

This ap­proach, while com­pu­ta­tion­ally ef­fi­cient, has sig­nif­i­cant draw­backs. The pri­mary lim­i­ta­tion lies in the al­go­rithm’s in­abil­ity to con­sider long-​term con­se­quences of split de­ci­sions. When eval­u­at­ing po­ten­tial splits, the al­go­rithm op­ti­mizes only for im­me­di­ate gain, po­ten­tially over­look­ing splits that might en­able bet­ter down­stream de­ci­sions.

This my­opic op­ti­miza­tion strat­egy in­tro­duces sub­stan­tial model in­sta­bil­ity. The cu­mu­la­tive ef­fect of these lo­cally op­ti­mal but glob­ally sub­op­ti­mal de­ci­sions makes the re­sult­ing tree struc­ture highly sen­si­tive to vari­a­tions in the train­ing data (i.e. it’s an­other source of vari­ance). In ex­treme cases, the ad­di­tion or re­moval of a sin­gle train­ing ex­am­ple can prop­a­gate through the tree struc­ture, lead­ing to dra­mat­i­cally dif­fer­ent de­ci­sion paths.

More­over, this is the main rea­son why re­quir­ing a min­i­mum de­crease in loss to split the node is not a good idea. Splits that ap­pear mar­gin­ally ben­e­fi­cial in iso­la­tion may be es­sen­tial for ac­cess­ing valu­able par­ti­tion pat­terns deeper in the tree struc­ture. There­fore, the im­me­di­ate mag­ni­tude of im­prove­ment from a split may not re­li­ably in­di­cate its ul­ti­mate con­tri­bu­tion to model per­for­mance.

Char­ac­ter­is­tics of de­ci­sion trees

As­sump­tions

Every ma­chine learn­ing al­go­rithm makes as­sump­tions about the struc­ture of the data. Ex­plor­ing the en­tire space of pos­si­ble re­la­tion­ships be­tween fea­tures and the out­come is not fea­si­ble, there­fore each al­go­rithm cuts cor­ners in its way.

De­ci­sion trees as­sume that we can sep­a­rate the data into mean­ing­ful re­gions by re­peat­edly split­ting on in­di­vid­ual fea­tures. In other words, they as­sume that the op­ti­mal de­ci­sion bound­ary in the fea­ture space is axis-​aligned, hence fea­ture in­ter­ac­tions can only be cap­tured through hi­er­ar­chi­cal struc­ture. As with most ma­chine learn­ing al­go­rithms, these as­sump­tions are sel­dom strictly true. Nonethe­less, they may be close enough to re­al­ity to pro­duce good re­sults.

Decision trees, on the other hand, don't make other restrictive assumptions common to many machine learning algorithms: they do not assume any probability distribution nor any linear structure. They also do not rely on any distance metric, partially evading the curse of dimensionality [6]. Feature selection is baked into the decision tree algorithm, so they can ignore irrelevant dimensions. Still, as dimensionality grows and the data becomes sparse, it becomes harder to find meaningful splits without capturing noise.

Pros and cons

We can summarize the main advantages of decision trees as:

- they are easy to interpret and explain;
- they require little data preparation, since they are not influenced by the relative scale of features [7];
- they make no assumptions about the distribution of the data;
- they perform feature selection implicitly, ignoring irrelevant features.

They also have some major drawbacks, namely:

- they have high variance and easily overfit the training data;
- the greedy training procedure makes them unstable: small changes in the data can produce very different trees;
- they struggle to represent additive structure and smooth functions (the staircase effect), leading to suboptimal bias.

Ex­trap­o­la­tion

De­ci­sion trees do not ex­trap­o­late. I didn’t place this ei­ther as a pro or a con be­cause it might be both.

xkcd comic about extrapolation

The decision tree training procedure partitions the feature space based on examples. When a decision tree encounters a data point where one or more features fall outside the bounds of what it was trained on, it simply assigns the value of the nearest leaf node. Therefore, predictions remain constant beyond the boundaries of the training data.

This behavior makes the model robust to extreme feature values, hence it might be desirable in some settings. However, it's also a major limitation in domains where we expect trends to extend beyond our observed data range. Consider the task of predicting house prices based on constructed area. If the training data only includes houses up to 250 square meters, predictions plateau at the value learned for the largest observed houses, even though we would reasonably expect larger houses to be more expensive. This is an inherent inductive bias of decision trees.
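A short sketch of the house-price example with scikit-learn (the numbers are hypothetical): once the area exceeds anything seen during training, the predicted price stays flat.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: price grows with area, but training only covers 50-250 m².
area = np.arange(50, 251, 10).reshape(-1, 1)
price = 1000 * area.ravel()

model = DecisionTreeRegressor(max_depth=4).fit(area, price)

# Beyond the training range the prediction is the value of the last leaf.
print(model.predict([[240], [250], [400], [1000]]))
```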


We've defined decision trees, their objective functions and a general algorithm. In the next post of this series we'll implement classification and regression trees (CART) in Python.

Ref­er­ences

Footnotes

  1. The error cannot reach zero if and only if there are two or more points with exactly the same feature values but different target values.

  2. Each class was sam­pled from a Bi­vari­ate (2D) Gauss­ian dis­tri­b­u­tion with SD=1. One class is shifted by 1.5 units both up and to the right.

  3. I’ve seen the term “stair­case” being used in­for­mally (e.g. here) be­fore.

  4. Not to be con­fused with the Gini co­ef­fi­cient

  5. Here, again, this split doesn’t exist only if all points have the exact same fea­ture val­ues.

  6. As the number of dimensions increases, the data quickly becomes sparse. Standard distance metrics (e.g. Euclidean distance) become less meaningful and lose predictive power.

  7. De­ci­sion trees are not in­flu­enced by the rel­a­tive scale of fea­tures.


