
The basics of outlier detection


Outliers

The word outlier is frequently used in my field of research (basic biology and biomedicine). It refers to data observations that differ a lot from the others. Some very informative posts regarding outliers are already available (see here and here for examples). However, my experience is that biology researchers often don't receive adequate statistical education and rely on possibly inadequate heuristics to determine what is an outlier. This post is intended to explain the basics of outlier detection and removal and, more specifically, to highlight some common mistakes. Outliers may arise from experimental errors, human mistakes, flawed techniques (e.g., a batch of experiments done with low-quality reagent), corrupt data, or simply sampling probability.

Why do people remove outliers?

The question is valid: if we obtain data through standardized and reproducible procedures, why should we discard valuable data points? The answer is that standard statistical tests that rely on parametric assumptions are quite sensitive to outliers. This occurs mainly (but not exclusively) because the mean is very sensitive to extreme values (while the median, for example, is not), and the standard error, which is derived from the standard deviation, inherits that sensitivity. So, as parametric tests usually rely solely on means and variances to calculate the famous p-value, outliers often lead to weird results that do not seem plausible. For instance, let's consider the following data:

2.05, 3.27, 1.53, 3.82, 2.33

A one-sample t-test tests if the mean is significantly different from zero. The result is that, as one would expect, it is indeed:
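
For reference, the output below can be reproduced with R's one-sample t.test, whose default null hypothesis is a mean of zero:

t.test(c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235))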

##  One Sample t-test
##
## data:  c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235)
## t = 6.2481, df = 4, p-value = 0.003345
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.447266 3.762184
## sample estimates:
## mean of x
##  2.604725

However, let's now add a point that makes the data even farther from zero:

2.05, 3.27, 1.53, 3.82, 2.33, 20

##  One Sample t-test
##
## data:  c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235, 20)
## t = 1.8855, df = 5, p-value = 0.118
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -1.999914 13.007789
## sample estimates:
## mean of x
##  5.503938

Well, now there's no significant difference. The standard deviation of the sample increases a lot with the artificial outlier, and so does the standard error of the mean, which is used to calculate confidence intervals and p-values. Thus, back when parametric statistics were all that was available and sample sizes were very limited, the solution was to remove (or change the value of) the outlying values.
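
To see concretely how much a single added point inflates the spread, here is a quick check of the sample standard deviation with and without the artificial value (the numbers in the comments are rounded):

x <- c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235)
sd(x)          # about 0.93
sd(c(x, 20))   # about 7.15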

Now, at this point it's important to notice a few characteristics of outliers: by definition, outliers must be rare. If outliers are common (that is, more than a few percent of the observations at most), probably some data collection error has happened, or the distribution is very far from a normal one. Another possibility is that the outlier detection method in use flags too many of them.

Detecting outliers

Z-scores

Z-scores are simply a way to standardize your data through scaling. The z-score shows how many standard deviations a given point, well, deviates from the mean. One can set an arbitrary threshold and exclude points that are above the positive threshold value or below the negative one. Let's apply the z-score to our artificial data with a threshold value of 2:
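
For reference, the z-score of each observation $x_i$ is computed from the sample mean $\bar{x}$ and the sample standard deviation $s$, which is exactly what R's scale() does by default:

$$z_i = \frac{x_i - \bar{x}}{s}$$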

print(scale(c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235, 20))[, 1])
## [1] -0.4828334 -0.3112829 -0.5554757 -0.2345725 -0.4431524  2.0273168

This shows that our artificial outlier is indeed above the threshold and could be removed. However, many people consider 2 far too permissive a threshold and use 3 as a rule-of-thumb value. The outlier threshold will always be arbitrary and there's no single right answer, but using a low value leads to frequent outlier detection, and outliers must be rare.

So, to sum it up, the z-score method is quite effective if the distribution of the data is roughly normal. The smaller the sample size, the more influence extreme values have over the mean and the standard deviation. Thus, the z-score method may fail to detect extreme values in small samples.

IQR method

The interquartile range (IQR) method was created by the great statistician John Tukey and is embedded in the famous box plot. The idea is to determine the 25th and 75th percentiles, that is, the values that leave 25% and 75% of the data below them, respectively. The distance between them is the IQR. Below you can see a histogram of the famous height data from Sir Francis Galton in which those percentiles are marked with red lines. 50% of the data lies between the lines.

Histogram of a random variable with normal distribution and quantiles marked in red vertical lines

The box-and-whisker plot (or box plot) simply draws a box whose limits are the 25th and 75th percentiles, with the median (the 50th percentile) as a line in the middle. Then, whiskers of length 1.5 times the IQR are drawn in both directions, that is, 1.5 times the IQR below the 25th percentile and 1.5 times the IQR above the 75th percentile. Values outside this range were considered outliers by Tukey. Here is the box plot for the height data:

Box plot of height data in inches with two outliers

Some outliers do appear, but they are very few. The main advantage of the IQR method is that it's more robust to slightly skewed distributions. It can also detect outliers in smaller samples, as the median and IQR are much less influenced by extreme values than the mean and standard deviation, respectively. With z-scores, the presence of really extreme values can influence the mean and the standard deviation so much that the method fails to detect other, less extreme outliers, a phenomenon known as masking. In some cases, a factor of 2 or even 3 is used to multiply the IQR, detecting fewer outliers.

Using our previous artificial data, let's replace the outlier with a less extreme value of 8 and apply both detection methods:

print(scale(c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235, 8))[, 1])
## [1] -0.61670998 -0.09587028 -0.83725721  0.13702825 -0.49623548  1.90904470

Box plot with one outlier of value 8
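
For readers who want to compute the fences directly rather than read them off a box plot, here is a minimal sketch in base R using the 1.5 factor (note that boxplot() itself builds the box from Tukey's hinges, which can differ slightly from quantile()'s default percentiles):

iqr_data <- c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235, 8)
q <- quantile(iqr_data, c(0.25, 0.75))                   # 25th and 75th percentiles
fences <- c(q[1] - 1.5 * IQR(iqr_data), q[2] + 1.5 * IQR(iqr_data))
iqr_data[iqr_data < fences[1] | iqr_data > fences[2]]    # flags only the value 8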

The z-score method does not detect the extreme value as an outlier, while the IQR method does so. Let's increase the sample size and repeat the analysis. The new data will be:

2.05, 3.28, 1.53, 3.83, 2.34, 3.47, 0.32, 3.40, 2.75, 4.55, 1.32, 1.82, 2.19, 1.86, 2.65, 3.54, 2.98, 2.60, 4.57, 8

print(scale(c(2.0515010, 3.2781500, 1.5320820, 3.8266580, 2.3352350, 3.4745626, 0.3231792, 3.3983499, 2.7515991, 4.5479615, 1.3167715, 1.8196742, 2.1908817, 1.8590404, 2.6546580, 3.5424431, 2.9777304, 2.6038048, 4.5722174, 8))[, 1])
##  [1] -0.56445751  0.20373600 -0.88974560  0.54724118 -0.38676803
##  [6]  0.32674012 -1.64682548  0.27901164 -0.12601846  0.99896019
## [11] -1.02458460 -0.70963991 -0.47716983 -0.68498668 -0.18672819
## [16]  0.36925054  0.01559711 -0.21857520  1.01415054  3.16081217

Now the artificial value is identified as an outlier, although the average of the other points is roughly the same. That is because the standard deviation and the mean become less sensitive to a single extreme value as the sample size increases. The IQR method also identifies the artificial point as an outlier in this case (graph not shown). Note that the z-score can, by definition, never be greater than $(N-1)/\sqrt{N}$; with $N = 6$, for instance, the largest attainable value is $5/\sqrt{6} \approx 2.04$, so a threshold of 3 could never be reached, and that should be accounted for before the analysis.

MAD method

The last method that we'll cover is based on the median absolute deviation (MAD) and is often referred to as the robust z-score. It's essentially a z-score, except the median is used instead of the mean and the MAD instead of the standard deviation. The MAD is the median of the absolute distances between each point and the sample median; in other words, it plays the role of a standard deviation calculated with medians instead of averages. The new score, let's call it the M-score, is given by:

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}$$

where $\tilde{x}$ is the sample median and $x_i$ is each observation. Various thresholds have been suggested, ranging between 2 and 3. Let's apply this method to our artificial outlier of 8 with a threshold of 2.24, as suggested before [1].

m_data <- c(2.051501, 3.27815, 1.532082, 3.826658, 2.335235, 8)
print(0.6745*(m_data - median(m_data)) / mad(m_data))
## [1] -0.3870867  0.2416539 -0.6533241  0.5228013 -0.2416539  2.6619214
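
One caveat: R's mad() already multiplies the raw median absolute deviation by 1.4826 (roughly 1/0.6745) so that it estimates the standard deviation under normality. A version that follows the formula above literally would therefore use the unscaled MAD, which yields larger scores for the same data; a minimal sketch:

# modified z-score computed with the raw (unscaled) MAD, as in the formula above
print(0.6745 * (m_data - median(m_data)) / mad(m_data, constant = 1))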

The artificial outlier is indeed above the threshold. The M-score suffers even less from masking than the IQR and much less than the z-score. It's robust to extreme outliers even with small sample sizes.

The dangers of outlier removal

False positives (type I error)

When performing exploratory data analysis, all outlier detection methods listed above are valid and each one has its pros and cons. They are useful for detecting possible flaws in data collection but also for detecting novelty and new trends. However, when performing inferential analysis, type I error rates (false positives) should be accounted for. Here, we'll accept a 5% type I error rate, as usual. The graphic below shows the type I error rate for a two-sample Welch t-test drawing samples from a population of 10,000 normally distributed points (mean = 0, SD = 1). For each sample size, 10,000 t-tests are performed on independent samples.
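
A minimal sketch of one cell of such a simulation, for a single sample size and the z-score rule with threshold 2 (variable names are illustrative, not the exact code behind the figure):

set.seed(42)
population <- rnorm(10000)                        # mean = 0, SD = 1
remove_z <- function(x, k = 2) x[abs(scale(x)[, 1]) <= k]

n <- 10                                           # observations per group
p_values <- replicate(10000, {
  a <- remove_z(sample(population, n))
  b <- remove_z(sample(population, n))
  t.test(a, b)$p.value                            # Welch t-test is R's default
})
mean(p_values < 0.05)                             # observed type I error rate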

A line graph showing type I error rate over sample size for four methods: None, Z(2), IQR (1.5), and M (2.24).

It's quite alarming that, when using a threshold of $|Z| > 2$, the error rate goes up as the sample size increases. This shows that, although this is a very common threshold in published studies, it greatly inflates error rates and can be considered a form of p-hacking. All methods resulted in error inflation, especially with smaller sample sizes. Let's repeat this analysis using a population with skewed values:

population <- rgamma(10000, shape = 6)
hist(population)

A histogram of population data.

A line graph showing a trend of type I error rates over sample size.

The error inflation gets even worse when dealing with skewed distributions. Thus, caution should be taken before removing outliers with these methods if the distribution is heavily skewed.

Conclusions

Outlier detection and removal should be done with care and with a well-defined method. Data with distributions far from normal should not be subjected to the methods presented here. Removing outliers from small samples (e.g., fewer than 20 observations per group or condition) can inflate type I error rates substantially and should be avoided. Outlier removal must be decided without taking statistical significance into account, and the same method must be applied throughout the whole study (at least to similar data).

If we remove outliers, they must be rare (as a rule of thumb, they should account for less than 5% of the data, ideally less than 1%). The method used to remove them, as well as the number of observations removed, must be clearly stated. Publishing the original data with outliers is also strongly advisable. Domain knowledge is key to determining when outliers are most likely due to error rather than natural variability. Today, modern statistical techniques that are robust to extreme values exist and should be preferred whenever possible (for example, see the WRS2 package). Moreover, data with non-normal distributions should not be forced into a normal-like shape through outlier removal. The most important thing about outliers is to try to understand how they arise and to make efforts so that they don't appear in the first place, rendering their removal unnecessary in most cases. In my experience, most outliers arise from pre-analytical mistakes and small sample sizes. Thus, well-standardized techniques combined with suitable sample sizes can mitigate the issue in many cases.

Footnotes

  1. Iglewicz, Boris, and David C. Hoaglin. 1993. How to Detect and Handle Outliers. Milwaukee, Wis.: ASQC Quality Press.

