
Survival analysis with Cox regression - heart failure data


Last time, we used decision trees, binarization and logistic regression to predict heart failure mortality in a public dataset. This approach yielded a very simple model, and that simplicity is crucial for generalization when dealing with noisy biological data and few observations.

The original data is described in this paper. As we've discussed previously, the time variable refers to follow-up time, that is, how long each patient was observed. When a patient who has not died reaches the end of their follow-up time, they are no longer observed, yielding a censored observation.
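As a tiny illustration (not part of the original analysis), the survival package encodes censoring directly in its Surv objects, marking censored follow-up times with a +:

library(survival)

# The second patient was still alive when follow-up ended at day 150,
# so their observation is censored:
Surv(time = c(100, 150), event = c(1, 0))
## [1] 100  150+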

Machine learning approaches applied to this dataset generally ignore the censoring problem, which can seriously distort the results. In our previous post, we used a gimmick of filtering observations to mitigate this issue.

Nonetheless, survival analysis is the most appropriate method to analyze this dataset, as it was used in the original study. So, let's build some survival curves and then a predictive model based on survival probability.

Let’s load the data just like last time.

library(tidyverse)
library(magrittr)
library(survival)
library(ggfortify)
library(gridExtra)
library(survminer)
library(knitr)
library(ranger)
library(riskRegression)

data <- read.csv("./heart_failure_clinical_records_dataset.csv")
factor_cols <- c("anaemia", "diabetes", "high_blood_pressure", "smoking")

data %<>% mutate_at(factor_cols, factor)
data$sex <- factor(data$sex, labels = c("female", "male"))

Using the survival package, we'll build a survival object and plot a survival curve:

surv_obj <- with(data, Surv(time, DEATH_EVENT))
surv_fit <- survfit(surv_obj ~ 1, data = data)

autoplot(surv_fit) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw()

Survival curve of patients. The y-axis shows the percentage of individuals who have survived at each point in time, while the x-axis represents time measured in days.

By the end of the study, only 55-60% of patients were still alive. Each + sign represents a censoring event, that is, the end of one patient's follow-up period.
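To read these probabilities off the fitted curve instead of eyeballing the plot, we can query the survfit object at chosen follow-up times (the times below are arbitrary):

summary(surv_fit, times = c(50, 100, 200, 250))

Let's plot survival curves split by different variables: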

p_anaemia <- autoplot(survfit(surv_obj ~ anaemia, data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Anaemia") + theme(plot.title = element_text(hjust = 0.5))

p_diabetes <- autoplot(survfit(surv_obj ~ diabetes, data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Diabetes") + theme(plot.title = element_text(hjust = 0.5))

p_pressure <- autoplot(survfit(surv_obj ~ high_blood_pressure, data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("High blood pressure") + theme(plot.title = element_text(hjust = 0.5))

p_sex <- autoplot(survfit(surv_obj ~ sex, data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Sex") + theme(plot.title = element_text(hjust = 0.5))

p_smoking <- autoplot(survfit(surv_obj ~ smoking, data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Smoking") + theme(plot.title = element_text(hjust = 0.5))


grid.arrange(p_anaemia, p_diabetes, p_pressure, p_sex, p_smoking, nrow = 3)

Survival curve of patients divided into groups for each categorical variable.

Survival seems to differ only by anaemia and high blood pressure status. We'll use the log-rank test to compare survival distributions:

survdiff(surv_obj ~ data$anaemia)
## Call:
## survdiff(formula = surv_obj ~ data$anaemia)
##
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## data$anaemia=0 170       50     57.9      1.07      2.73
## data$anaemia=1 129       46     38.1      1.63      2.73
##
##  Chisq= 2.7  on 1 degrees of freedom, p= 0.1
survdiff(surv_obj ~ data$high_blood_pressure)
## Call:
## survdiff(formula = surv_obj ~ data$high_blood_pressure)
##
##                              N Observed Expected (O-E)^2/E (O-E)^2/V
## data$high_blood_pressure=0 194       57     66.4      1.34      4.41
## data$high_blood_pressure=1 105       39     29.6      3.00      4.41
##
##  Chisq= 4.4  on 1 degrees of freedom, p= 0.04

Survival distributions only differ significantly according to high blood pressure status. However, we've only looked at categorical variables, and survival plots are not practical for visualizing the effect of continuous variables on survival. To explore these variables, we'll transform them into categorical ones by splitting them at their median value:

# Binning using the ntile function from dplyr:
# ntile(data$serum_creatinine, 2) generates a new factor variable with
# two levels, splitting the observations at the median

p_age <- autoplot(survfit(surv_obj ~ ntile(log(data$age), 2), data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Age") + theme(plot.title = element_text(hjust = 0.5))

p_cpk <- autoplot(survfit(surv_obj ~ ntile(log(data$creatinine_phosphokinase), 2), data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("CPK") + theme(plot.title = element_text(hjust = 0.5))

p_ef <- autoplot(survfit(surv_obj ~ ntile(log(data$ejection_fraction), 2), data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Ejection fraction") + theme(plot.title = element_text(hjust = 0.5))

p_platelets <- autoplot(survfit(surv_obj ~ ntile(log(data$platelets), 2), data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Platelets") + theme(plot.title = element_text(hjust = 0.5))

p_creatinine <- autoplot(survfit(surv_obj ~ ntile(log(data$serum_creatinine), 2), data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Serum creatinine") + theme(plot.title = element_text(hjust = 0.5))

p_sodium <- autoplot(survfit(surv_obj ~ ntile(log(data$serum_sodium), 2), data = data)) + ylab("Survival Probability") + xlab("Time (days)") + theme_bw() + theme(legend.position = "none") + ggtitle("Serum sodium") + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p_age, p_cpk, p_ef, p_platelets, p_creatinine, p_sodium, nrow = 3)

Survival curves for continuous variables split by their median value.

Here, the ejection fraction, serum creatinine, serum sodium and (perhaps) age quantiles appear to be associated with a difference in survival. We could (and in fact I did) apply log-rank tests here, but we'll use these variables without binning in a Cox regression. All these plots are exploratory: they let us grasp general tendencies and possible associations between variables. Still, they should be taken with a grain of salt, because variables may interact, and significant effects may be hidden when we split the survival curve by one variable at a time.
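For instance, a log-rank test on the median-split ejection fraction could be run like this (a sketch; output omitted):

survdiff(surv_obj ~ ntile(data$ejection_fraction, 2))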

Cox regression

This kind of model accepts categorical and continuous variables as predictors and estimates hazard ratios, that is, relative risks compared to a reference factor level or numerical value. It is widely used in survival analysis because it accepts censored survival data, a very peculiar type of input. Fitting survival data with ordinary linear models is troublesome, among other reasons, because of the high censoring frequency inherent to survival studies. Let's fit a simple model:

cox_m <- coxph(Surv(time, DEATH_EVENT) ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking, data = data, x = TRUE)

broom::tidy(cox_m) %>% kable()
term                       estimate  std.error  statistic   p.value
age                            0.05       0.01       4.98  0.000001
anaemia1                       0.46       0.22       2.12  0.033835
creatinine_phosphokinase       0.00       0.00       2.23  0.026048
diabetes1                      0.14       0.22       0.63  0.530745
ejection_fraction             -0.05       0.01      -4.67  0.000003
high_blood_pressure1           0.48       0.22       2.20  0.027769
platelets                     -0.00       0.00      -0.41  0.680638
serum_creatinine               0.32       0.07       4.58  0.000005
serum_sodium                  -0.04       0.02      -1.90  0.057531
sexmale                       -0.24       0.25      -0.94  0.345165
smoking1                       0.13       0.25       0.51  0.607828
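The estimates above are log hazard ratios. To read them as hazard ratios with 95% confidence intervals, we can exponentiate the coefficients (a quick sketch):

# Hazard ratios and 95% confidence intervals
exp(cbind(HR = coef(cox_m), confint(cox_m)))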

There are 6 significant predictors. Adding more variables will always reduce the error of a linear model; however, variables with little predictive power are most likely just adding noise, and the reduced error comes from modeling random error, that is, overfitting. Just like last time, we shall keep this model simple. This is even more important considering the nature of the dataset: biological data from humans, which is very noisy. There are only 96 death events, so including too many variables might quickly result in overfitting.

Thus, let's reduce the model by removing the variables that contribute very little to the full model above.

cox_m_reduced <- coxph(Surv(time, DEATH_EVENT) ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + serum_creatinine, data = data, x = TRUE)

broom::tidy(cox_m_reduced) %>% kable
term                      estimate  std.error  statistic   p.value
age                           0.04       0.01       4.93  0.000001
anaemia1                      0.39       0.21       1.85  0.06477
creatinine_phosphokinase      0.00       0.00       2.00  0.04623
ejection_fraction            -0.05       0.01      -5.15  0.000003
high_blood_pressure1          0.47       0.21       2.19  0.02839
serum_creatinine              0.35       0.07       5.32  0.000001
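Although the selection above was done by hand, it can be cross-checked with an automated stepwise search; a sketch using MASS (not necessarily how the reduction was done here):

# Backward stepwise selection by AIC, starting from the full model
MASS::stepAIC(cox_m, direction = "backward")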

So far so good, or is it? The Cox regression model, like all models, makes a few assumptions. We'll check for proportional hazards, influential observations and linearity in the covariates.

Cox regression models are also called proportional hazards models: they assume that the effect of a covariate is constant over time, that is, that residuals are independent of time. We can check this with the cox.zph function from the survival package and by graphically inspecting whether the Schoenfeld residuals are independent of time.

print(cox.zph(cox_m))
##                             chisq df    p
## age                      1.03e-01  1 0.75
## anaemia                  1.69e-02  1 0.90
## creatinine_phosphokinase 1.02e+00  1 0.31
## diabetes                 1.92e-01  1 0.66
## ejection_fraction        4.69e+00  1 0.03
## high_blood_pressure      8.23e-03  1 0.93
## platelets                5.69e-06  1 1.00
## serum_creatinine         1.52e+00  1 0.22
## serum_sodium             1.10e-01  1 0.74
## sex                      7.63e-02  1 0.78
## smoking                  4.79e-01  1 0.49
## GLOBAL                   1.17e+01 11 0.39

It seems that ejection fraction does not follow the proportional hazards assumption. Still, let's check the other assumptions first; we'll come back to this issue later. Using functions from the survminer package, let's check for influential observations and non-linearity in the covariates.

First, let's plot the estimated change in coefficients when each observation is removed from the model.

ggcoxdiagnostics(cox_m, type = 'dfbeta')

Scatter plots of changes to coefficients when each observation is removed

It seems that creatinine phosphokinase, serum creatinine and maybe serum sodium have a few very influential observations. Finally, continuous covariate effects are assumed to be linear. We can check this assumption by plotting Martingale residuals using the ggcoxfunctional function from the survminer package:

ggcoxfunctional(with(data, Surv(time, DEATH_EVENT)) ~ age + creatinine_phosphokinase + ejection_fraction + serum_creatinine + serum_sodium, data = data, ylim = c(-1, 1))

Martingale residuals of null Cox Model for each continuous variable

Now it's clear why there are very influential observations: as the model assumes linear covariate effects, extremely high creatinine phosphokinase and serum creatinine values become largely influential. In the previous post we used binning, so this issue was irrelevant there. Here, we can address the right-skewed distributions with a simple log transformation. The slightly left-skewed distribution of serum sodium might not be a problem. We'll see.
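To see why a log helps, we can compare the raw and log-transformed distributions (a quick check, not shown in the original analysis):

# Serum creatinine is heavily right-skewed; the log compresses the tail
par(mfrow = c(1, 2))
hist(data$serum_creatinine, main = "Raw", xlab = "Serum creatinine")
hist(log(data$serum_creatinine), main = "Log", xlab = "log(serum creatinine)")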

Let's fit a new model using some log-transformed variables:

cox_m_log <- coxph(Surv(time, DEATH_EVENT) ~ age + anaemia + log(creatinine_phosphokinase) + diabetes + ejection_fraction + high_blood_pressure + platelets + log(serum_creatinine) + serum_sodium + sex + smoking, data = data, x = TRUE)

broom::tidy(cox_m_log) %>% kable()
term                           estimate  std.error  statistic    p.value
age                                0.04       0.01       4.52  0.0000063
anaemia1                           0.47       0.21       2.20  0.0280328
log(creatinine_phosphokinase)      0.09       0.10       0.92  0.3595379
diabetes1                          0.14       0.22       0.64  0.5194190
ejection_fraction                 -0.04       0.01      -4.13  0.0000361
high_blood_pressure1               0.50       0.21       2.35  0.0185782
platelets                         -0.00       0.00      -0.39  0.6991078
log(serum_creatinine)              0.96       0.21       4.66  0.0000032
serum_sodium                      -0.03       0.02      -1.41  0.1587203
sexmale                           -0.26       0.25      -1.05  0.2936010
smoking1                           0.21       0.25       0.83  0.4086007

Interestingly, the log-transformed creatinine phosphokinase variable is no longer a significant predictor; it looks like the previously fitted coefficient was largely driven by extreme observations. Although omitted here, the variables that contributed the least were removed from the model in a stepwise manner, resulting in the following model:

cox_m_log <- coxph(Surv(time, DEATH_EVENT) ~ age + anaemia + ejection_fraction + high_blood_pressure + log(serum_creatinine), data = data, x = TRUE)

broom::tidy(cox_m_log) %>% kable()
term                   estimate  std.error  statistic    p.value
age                        0.04       0.01       4.39  0.0000112
anaemia1                   0.40       0.21       1.92  0.0553363
ejection_fraction         -0.04       0.01      -4.46  0.0000084
high_blood_pressure1       0.51       0.21       2.41  0.0161065
log(serum_creatinine)      1.00       0.19       5.25  0.0000002

The non-proportional hazards issue, however, persists:

cox.zph(cox_m_log)
##                         chisq df    p
## age                   0.17720  1 0.67
## anaemia               0.00755  1 0.93
## ejection_fraction     4.73124  1 0.03
## high_blood_pressure   0.02384  1 0.88
## log(serum_creatinine) 2.11537  1 0.15
## GLOBAL                6.65288  5 0.25
plot(cox.zph(cox_m_log)[3])
abline(c(0,0), col = 2)
abline(h = cox_m_log$coefficients[3], col = 3, lty = 2)

Beta coefficient for ejection fraction over time

In the plot above, the green dotted line marks the fitted coefficient for ejection fraction. Although ejection fraction is a very important predictor, its effect is not constant over time: early on it has a small negative effect on survival (close to 0 on the graph), but after about day 60 its effect is much more pronounced. This is why this predictor violates the proportional hazards assumption.

A possible solution is to fit different coefficients over different time intervals (as described in this excellent vignette). Visually, it seems that splitting the data at day 50 provides a good separation of the distinct ejection fraction effects.

data_split <- survSplit(Surv(time, DEATH_EVENT) ~ ., data = data, cut = 50, episode = "tgroup", id = "id")

cox_m_split <- coxph(Surv(tstart, time, DEATH_EVENT) ~ age + anaemia + ejection_fraction:strata(tgroup) + high_blood_pressure + log(serum_creatinine), data = data_split, x = TRUE)

broom::tidy(cox_m_split) %>% kable()
term                                       estimate  std.error  statistic    p.value
age                                            0.04       0.01       4.53  0.0000058
anaemia1                                       0.44       0.21       2.12  0.0341384
high_blood_pressure1                           0.54       0.21       2.54  0.0110057
log(serum_creatinine)                          0.97       0.19       5.13  0.0000003
ejection_fraction:strata(tgroup)tgroup=1      -0.02       0.01      -1.60  0.1105482
ejection_fraction:strata(tgroup)tgroup=2      -0.09       0.02      -4.91  0.0000009

Ejection fraction now only has a significant effect in tgroup = 2, that is, after day 50. This should resolve the non-proportional hazards issue. Let's check:

print(cox.zph(cox_m_split))
##                                    chisq df    p
## age                              0.02004  1 0.89
## anaemia                          0.15881  1 0.69
## high_blood_pressure              0.00366  1 0.95
## log(serum_creatinine)            1.35157  1 0.25
## ejection_fraction:strata(tgroup) 1.13453  2 0.57
## GLOBAL                           3.14512  6 0.79

Great! The assumption is no longer violated. Now we'll compare the models we've built so far. For this step, we'll split the data into training (70% of the original data) and test (30%) sets and refit the previous models using the training data.

set.seed(12)
train_size = floor(0.7 * nrow(data))
train_ind <- sample(seq_len(nrow(data)), size = train_size)

train_data <- data[train_ind, ]
test_data <- data[-train_ind, ]

cox_m_log <- coxph(Surv(time, DEATH_EVENT) ~ age + anaemia + log(creatinine_phosphokinase) + diabetes + ejection_fraction + high_blood_pressure + platelets + log(serum_creatinine) + serum_sodium + sex + smoking, data = train_data, x = TRUE)
cox_m_log_reduced <- coxph(Surv(time, DEATH_EVENT) ~ age + anaemia + ejection_fraction + high_blood_pressure + log(serum_creatinine), data = train_data, x = TRUE)

train_data_split <- survSplit(Surv(time, DEATH_EVENT) ~ ., data = train_data, cut = 50, episode = "tgroup", id = "id")

cox_m_split <- coxph(Surv(tstart, time, DEATH_EVENT) ~ age + anaemia + ejection_fraction:strata(tgroup) + high_blood_pressure + log(serum_creatinine), data = train_data_split, x = TRUE)


models = list(cox_m_log, cox_m_log_reduced, cox_m_split)
concordances = sapply(models, function(x) concordance(x)$concordance)
aics = sapply(models, AIC)

models_metrics = data.frame(Model = c('All variables', 'Selected variables', 'Split time, selected var.'), Concordance = concordances, AIC = aics)
models_metrics %>% kable()
Model                       Concordance    AIC
All variables                     0.746  606.5
Selected variables                0.740  599.1
Split time, selected var.         0.746  593.9

Our final model, which doesn't violate the core model assumptions, shows a better AIC than the previous ones and a concordance very similar to that of the model with all variables. In practice, however, this model predicts two separate survival curves, one for each tgroup (before and after day 50). It may therefore be great for interpreting the contribution of each variable, but it's not very practical for prediction.
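For the unstratified models, by contrast, prediction is straightforward: survfit can produce a whole survival curve for new data. A sketch for a single test-set patient (illustrative, not part of the model comparison):

# Predicted survival curve for the first test-set patient
pred_curve <- survfit(cox_m_log_reduced, newdata = test_data[1, ])
summary(pred_curve, times = c(50, 100, 200))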

Finally, we'll build a random survival forest model using the ranger function from the ranger package, though we won't go deep into this model. We'll limit tree depth (max.depth) and set a higher minimal node size (min.node.size) to avoid overfitting.

ranger_m <- ranger(Surv(time, DEATH_EVENT) ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking, data = train_data, x = TRUE, importance = 'permutation', min.node.size = 5, max.depth = 5)
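Since we requested permutation importance, we can at least peek at which variables the forest relies on most:

# Permutation variable importance, most important first
sort(ranger_m$variable.importance, decreasing = TRUE)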

The riskRegression package has a nice Score function that calculates Brier scores and the Index of Prediction Accuracy (IPA), defined as 1 minus the ratio of the model's Brier score to the Brier score of a null model. A positive IPA means a better probability prediction compared to the null model.

ipa_models <- Score(list("Cox - all variables" = cox_m_log, "Cox - few variables" = cox_m_log_reduced, "Forest" = ranger_m), data = test_data, formula = Surv(time, DEATH_EVENT) ~ 1, summary = 'ipa', metrics = 'brier', contrasts = FALSE)

print(ipa_models)
## Metric Brier:
##
## Results by model:
##
##                  model times Brier lower upper  IPA
## 1:          Null model   113  20.9  15.8  26.0  0.0
## 2: Cox - all variables   113  16.3  10.4  22.2 21.9
## 3: Cox - few variables   113  15.8  10.0  21.6 24.3
## 4:              Forest   113  16.4  11.7  21.2 21.3

##
## NOTE: Values are multiplied by 100 and given in %.
## NOTE: The lower Brier the better, the higher IPA the better.
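As a sanity check on the definition, the reduced Cox model's IPA can be recovered from the Brier scores in the table:

1 - 15.8 / 20.9  # ~0.24, matching the 24.3% IPA of 'Cox - few variables' up to rounding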

It seems our Cox regression with only a few variables slightly outperforms the other models (higher IPA). Random forests can be really useful for fitting high-dimensional data; in this case, however, a carefully fit Cox regression seems to be an excellent option.

