
What are NOT p-values?


Over the last 100 years, p-values have become increasingly common in many scientific fields, following their development by Ronald Fisher around 1920. Now, on the approximate 100th anniversary of the all-famous p-value, its use is being questioned due to irreproducible research, p-hacking practices and various misunderstandings about its true meaning.

Reporting only p-values is becoming less acceptable, and for good reason. Scientists should also report the full data distribution, confidence intervals and details about the statistical test used and its assumptions. Those who continue to report only that P < 0.05 will face more and more questions from journals and reviewers. All of this is good for science and for open science. Still, p-values are very useful and will not disappear, at least not for now.

But what ex­actly is a p-​value? Many sci­en­tists have never thought about this ques­tion ex­plic­itly. So, let’s de­fine the p-​value and then look at what it is not. Defin­ing what some­thing is not is a great way to re­move mis­con­cep­tions that may al­ready be lurk­ing in our heads.

The Amer­i­can Sta­tis­ti­cal As­so­ci­a­tion de­fines the p-​value as:

[…] The prob­a­bil­ity under a spec­i­fied sta­tis­ti­cal model that a sta­tis­ti­cal sum­mary of the data (e.g., the sam­ple mean dif­fer­ence be­tween two com­pared groups) would be equal to or more ex­treme than its ob­served value.
Amer­i­can Sta­tis­ti­cal As­so­ci­a­tion1

Most often, the specified statistical model is the null hypothesis, $H_0$: the hypothesis that the group means (or another statistical summary, such as the median) are not different, or that the correlation between two continuous variables is not different from 0. I would guess that 99% of the time we encounter p-values, it is under these circumstances.

So, the p-value is the probability that, given that $H_0$ is true (this is extremely important), the results would show a difference at least as large as the one observed.
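As a minimal sketch (assuming Python with NumPy and SciPy, and made-up data), this is the kind of quantity a standard test reports:

```python
# A minimal sketch: a two-sample t-test on simulated data.
# The p-value answers: "if the two groups truly had the same mean (H0),
# how often would we see a difference at least this extreme?"
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # simulated control group
group_b = rng.normal(loc=11.0, scale=2.0, size=30)  # simulated treated group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value means such a difference would be rare *if* H0 were true;
# it is not the probability that H0 (or H1) is true.
```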

Let’s look at what p-​values are not.

The prob­a­bil­ity that the re­sults are due to chance

This is probably the most common misconception regarding p-values, and it's easy to see why we fall into this statistical trap. Assuming that there is no difference between the two groups, all of the observed difference is due to chance. The problem here is that p-value calculations assume that every deviation from $H_0$ is due to chance. Thus, they cannot compute the probability of something they assume to be true.

P-values tell us the probability that the observed results would come up due to chance alone assuming $H_0$ to be true, and not the chance that the observed results are due to chance, precisely because we don't know if $H_0$ is true or not. Pause and think about the difference between these two statements. If the difference is not clear, it may become clearer with the next topic.

The probability that $H_1$ is true

This is a bit tricky to understand. $H_1$ is the alternative hypothesis, in contrast to $H_0$. Almost always, $H_1$ states that the groups are different or that an estimator is different from zero. However, p-values tell us nothing about $H_1$. They tell us:

$$P(\text{observed difference} \vert H_0)$$

That is, the probability of the observed difference given that $H_0$ is true. Since $H_0$ and $H_1$ are complementary hypotheses, $P(H_0) + P(H_1) = 1$. Thus:

$$P(H_1) = 1 - P(H_0)$$

However, we do not know $P(H_0)$, since we assume $H_0$ to be true. Let's use an example.

Let's say we are a patient getting a blood test for a disease called statisticosis. We know that, if you're ill, the test returns a positive result 99% of the time. Also, if you're not ill, it returns a negative result 98% of the time. 1% of the population is estimated to have statisticosis. Let's assume a population of 1 million people and build a table. First, there are 990,000 people (99% of the population) without the disease and 10,000 people (1%) with the disease. Of those who are ill, 9,900 (99% of the ill) will get a positive test result, while 19,800 people who are not ill will also get a positive result (2% of those who are not ill).

Test result      Ill: Yes     Ill: No
Positive         9,900        19,800
Negative         100          970,200

Here, $H_0$ is that we're not sick, while $H_1$ is that we are sick. The probability of getting a positive result given that we are not sick is $P(+ \vert H_0) = \frac{19800}{970200 + 19800} = 0.02$, or 2%. This is how we usually think about blood tests, and it's comparable to what p-values estimate. We assume $H_0$ (not sick) to be true and calculate the probability of an observation at least as extreme as the one observed (in this case, the probability of a positive result). This number tells us that, given that we're not sick, a positive result is unlikely (2%). However, the probability that one is not sick given a positive result is $P(H_0 \vert +) = \frac{19800}{19800 + 9900} = \frac{2}{3}$, or about 67%! In other words, if you were to receive a positive result, you would have a roughly 67% probability of not being ill (false positive) and a roughly 33% probability of actually being ill (true positive). It might seem like the test is useless in this case, but without the test, we would only know that our probability of being ill is 1% (the population prevalence). This example, of course, ignores symptoms and other diagnostic tools for the sake of simplicity.
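To make the arithmetic concrete, here is a small Python sketch (the variable names are mine) that reproduces the numbers above:

```python
# Sketch: reproduce the statisticosis numbers from the text.
population = 1_000_000
prevalence = 0.01          # 1% of people are ill
sensitivity = 0.99         # P(positive | ill)
specificity = 0.98         # P(negative | not ill)

ill = population * prevalence                  # 10,000
not_ill = population - ill                     # 990,000
true_pos = ill * sensitivity                   # 9,900
false_pos = not_ill * (1 - specificity)        # 19,800

# P(+ | H0): probability of a positive test given that we are NOT sick
p_pos_given_h0 = false_pos / not_ill                  # 0.02
# P(H0 | +): probability of NOT being sick given a positive test
p_h0_given_pos = false_pos / (false_pos + true_pos)   # ~0.67

print(f"P(+|H0) = {p_pos_given_h0:.2f}, P(H0|+) = {p_h0_given_pos:.2f}")
```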

How can these two probabilities be so different? The key here is the low prevalence of the disease. Even with a good test, there are many, many more people without the disease than with it, so a lot of false positives will occur.

The important thing here is to understand that $P(+ \vert H_0) \neq P(H_0 \vert +)$ and that these probabilities can be wildly different. This confusion is known as the prosecutor's fallacy.

P-values are comparable to $P(+ \vert H_0)$. We assume the null hypothesis, therefore we cannot calculate its probability nor the probability of $H_1$. P-values tell us nothing about the probability of $H_0$ or of $H_1$. There is no better estimate because we do not know the probability that $H_1$ is true before (a priori) our observations are collected. If this notion of a priori probabilities seems fuzzy, let's look at the next misconception.
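One way to make the missing ingredient explicit is Bayes' theorem, written here only as a reminder:

$$P(H_0 \vert \text{data}) = \frac{P(\text{data} \vert H_0)\, P(H_0)}{P(\text{data})}$$

The p-value speaks only to $P(\text{data} \vert H_0)$; without the a priori probability $P(H_0)$, we cannot obtain $P(H_0 \vert \text{data})$, and therefore not $P(H_1 \vert \text{data})$ either.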

Chance of error

This mistake arises because we often report that a threshold of P < 0.05 was used (or that $\alpha = 0.05$). By adopting a 5% threshold, we accept a 5% type-I error rate. That is, we will wrongly reject the null hypothesis in 5% of the cases where the null hypothesis is actually true. This does not mean an overall error rate of 5%, because it's impossible to know how many hypotheses are truly null in the first place. The proportion of true and false hypotheses being tested in a study, or even in a whole scientific field, would be the prior (a priori) probabilities we talked about. That would be like knowing how many people are ill, but it's impossible in the case of hypotheses. We only use p-values because it's impossible to know the proportion of true hypotheses being tested.

Thus, if we reject the null hypothesis when P < 0.05, we will wrongly do so in 5% of true null hypotheses. But we can't know how many true null hypotheses there are in a study. Therefore, we can't assume that 5% of the results will be wrong according to the p-value. It might be much higher.
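In symbols, the threshold fixes $\alpha = P(\text{reject } H_0 \vert H_0 \text{ true})$, while the quantity we would often like to know is $P(H_0 \text{ true} \vert \text{reject } H_0)$, the share of significant results that are false positives. Just as with the blood test, these two conditional probabilities can be very different.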

This is why a hypothesis must be well founded on previous scientific evidence of reasonable quality. The more hypotheses are incrementally built upon previous research, the bigger the chance that a new hypothesis will be true. Raising the proportion of true hypotheses among all those being tested is fundamental. A low true-hypothesis proportion is similar to a low-prevalence disease, and we've seen that when testing for rare events (be they hypotheses or diseases) we make many more mistakes, especially false positives! If the proportion of true hypotheses among all those being tested is too low, the majority of statistically significant results may be false positives. There is even a study that explored this phenomenon and gained a lot of media attention.
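A small simulation sketches this point; the numbers below (10% of tested hypotheses truly non-null, 80% power, $\alpha = 0.05$) are assumptions for illustration only:

```python
# Sketch: fraction of "significant" results that are false positives,
# assuming 10% of tested hypotheses are truly non-null, 80% power, alpha = 0.05.
n_tests = 10_000
prop_true_h1 = 0.10   # proportion of hypotheses where H1 is actually true
alpha = 0.05          # tolerated type-I error rate
power = 0.80          # P(significant | H1 true)

n_h1 = n_tests * prop_true_h1    # 1,000 truly non-null hypotheses
n_h0 = n_tests - n_h1            # 9,000 truly null hypotheses

true_positives = n_h1 * power    # 800 correctly significant results
false_positives = n_h0 * alpha   # 450 wrongly significant results

false_discovery = false_positives / (false_positives + true_positives)
print(f"Share of significant results that are false positives: {false_discovery:.0%}")
# ~36% here, and it grows as the proportion of true hypotheses shrinks.
```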

Therefore, a smaller p-value does not mean a smaller chance of type-I error. The tolerated type-I error rate comes from the selected threshold, not from individual p-values. And the overall error rate comes from the accepted type-I error rate combined with the sample size and the proportion of true hypotheses being tested in a study.

Ef­fect sizes

Smaller p-values do not mean that the difference is more significant or larger. They just mean that, assuming $H_0$ to be true, that result is less likely to arise by chance.

Many mea­sures of ef­fect size exist to mea­sure pre­cisely what the name sug­gests: the size of the ob­served ef­fect. This kind of sta­tis­ti­cal sum­mary is re­ally valu­able be­cause it tells us the mag­ni­tude of the ob­served dif­fer­ence, ac­count­ing for the ob­served vari­abil­ity.

An experiment with P = 0.00001 may have a Cohen's d of 0.05, while another with P = 0.002 may have d = 0.2. Using the common 5% threshold, both are statistically significant. However, as we've seen, smaller p-values do not indicate the chance of error, and, as we're seeing now, they do not indicate the effect size either. The latter experiment has a higher p-value, which could make us think its effect was smaller, but its effect size is larger than the former's (d = 0.2 vs. d = 0.05).

Effect sizes should be reported because, when the sample size is big or variability is low, very small changes may become statistically significant even though the effect is so small that it might as well be biologically irrelevant. Confidence intervals can also be calculated for effect sizes, which is another great way of visualizing magnitude and its associated uncertainty.
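As a rough sketch of this point (the data are simulated and the numbers are arbitrary), a huge sample can produce a minuscule p-value alongside a tiny Cohen's d:

```python
# Sketch: with a huge sample, a tiny effect can reach a tiny p-value.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    pooled_var = (np.var(a, ddof=1) + np.var(b, ddof=1)) / 2
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
a = rng.normal(10.00, 1.0, size=100_000)   # control
b = rng.normal(10.05, 1.0, size=100_000)   # tiny shift of 0.05 SD

t, p = stats.ttest_ind(a, b)
print(f"p = {p:.2e}, Cohen's d = {cohens_d(b, a):.3f}")
# The p-value is minuscule, yet d ~ 0.05: statistically "significant",
# but the magnitude of the effect may be biologically irrelevant.
```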

Con­clu­sions

After a few ex­am­ples of what a p-​value is not, let’s re­mem­ber what it is:

[…] The prob­a­bil­ity under a spec­i­fied sta­tis­ti­cal model that a sta­tis­ti­cal sum­mary of the data (e.g., the sam­ple mean dif­fer­ence be­tween two com­pared groups) would be equal to or more ex­treme than its ob­served value.
Amer­i­can Sta­tis­ti­cal As­so­ci­a­tion1

Maybe this definition makes more intuitive sense now. The point here is that p-values are very useful and will not go away soon. They should be used and are a valuable resource for good statistical reasoning. However, they have a very strict definition and purpose, which is often misunderstood by those who apply them in their daily work.

Un­der­stand­ing what p-​values in­di­cate re­minds us of the im­por­tance of well-​founded hy­poth­e­sis gen­er­a­tion, of mul­ti­ple lines of ev­i­dence to con­firm a re­sult, of ad­e­quate sam­ple sizes and, most of all, of good rea­son­ing and trans­parency when judg­ing new hy­pothe­ses.

Footnotes

  1. ASA Statement on Statistical Significance and P-Values

