
The power of simple models: predicting heart failure mortality


Accurately predicting the risk of mortality in hospitals and health care services is really important. However, biology is noisy, and predictive models can easily incorporate noise instead of relevant information, performing poorly outside their original data. Here, I’ll analyze a dataset with a few common hospital-related variables and the death outcome in patients with heart failure.

First, let’s load the data and have a look at it. The whole project was done using R.

library(ggplot2)
library(reshape2)
library(dplyr)
library(gridExtra)
library(tree)
library(smbinning)
library(kableExtra)

data <- read.csv("./heart_failure_clinical_records_dataset.csv")
summary(data)
##       age           anaemia       creatinine_phosphokinase    diabetes
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500
##  1st Qu.:30.00     1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900
##  Median :38.00     Median :0.0000      Median :262000   Median :1.100
##  Mean   :38.08     Mean   :0.3512      Mean   :263358   Mean   :1.394
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :9.400
##   serum_sodium        sex            smoking            time
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.0
##  1st Qu.:134.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 73.0
##  Median :137.0   Median :1.0000   Median :0.0000   Median :115.0
##  Mean   :136.6   Mean   :0.6488   Mean   :0.3211   Mean   :130.3
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:203.0
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0
##   DEATH_EVENT
##  Min.   :0.0000
##  1st Qu.:0.0000
##  Median :0.0000
##  Mean   :0.3211
##  3rd Qu.:1.0000
##  Max.   :1.0000

Ok, so there are 12 variables, the outcome (DEATH_EVENT) and 299 observations. Let’s build a correlation matrix between all variables to roughly explore possible associations.

data$DEATH_EVENT <- as.factor(data$DEATH_EVENT)
data$anaemia <- as.factor(data$anaemia)
data$diabetes <- as.factor(data$diabetes)
data$high_blood_pressure <- as.factor(data$high_blood_pressure)
data$smoking <- as.factor(data$smoking)
data$sex <- factor(data$sex, labels = c("female", "male"))

corr_matrix <- round(cor(data.matrix(data), method = "spearman"),2)

# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(corr_matrix)
melted_matrix <- melt(upper_tri)

ggheatmap <- ggplot(melted_matrix, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Spearman\nCorrelation", na.value = "white") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 12, hjust = 1)) +
  coord_fixed() +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank()) +
  ggtitle("Correlation matrix")

show(ggheatmap)

Correlation matrix between all variables

Time is the variable with the highest correlation to the outcome. This is a good time to actually understand how the data was generated. The original study describes time as the follow-up period available (in days) for each patient. Thus, if the follow-up period is too short, a patient recorded as alive may in fact have died after it ended. Let’s have a look at the distribution of the follow-up time by outcome:

ggplot(data = data, aes(x = time, fill = DEATH_EVENT, color = DEATH_EVENT)) + geom_density(alpha = 0.5) + theme_minimal() + theme(axis.title = element_text(size=14), axis.text = element_text(size = 12)) + ylab("Density") + xlab("Follow-up time (days)")

Distribution of follow-up time split by death event (0 or 1)

Now it is easy to see why the correlation is strong. The group with the death outcome (represented as 1 in the dataset) shows much shorter follow-up times, presumably because most of these patients unfortunately die soon after their first hospital visit. Still, there are patients without the death outcome for whom little follow-up time is available. This is a problem because we can’t know whether these people really survived or whether they died after the available (short) follow-up period. Thus, the data should be filtered to remove short follow-up times in the non-death group. The boundary here is of course arbitrary: more stringent criteria provide more accurate data to begin with, but a smaller sample when the starting sample size is already quite limited. I’ve decided to use the 80th percentile of the follow-up time in the death group as a cutoff.

quantile(filter(data, DEATH_EVENT == 1)$time, 0.8)
## 80%
## 126

Let’s build a new density plot of the non-death group and shade the excluded region.

histogram <- ggplot(data = filter(data, DEATH_EVENT != TRUE), aes(x = time)) + geom_density(alpha = 0.5, fill = "red", color = "red") + theme_minimal() + theme(axis.title = element_text(size=14), axis.text = element_text(size = 12)) + geom_vline(xintercept = 126, alpha = 0.5, size = 1)
d <- ggplot_build(histogram)$data[[1]]

histogram <- histogram + geom_area(data = subset(d, x < 126), aes(x=x, y=y), fill="#8b0000", alpha = 0.5) + ylab("density")
show(histogram)

Distribution of follow-up time of the non-death group

filtered_data <- filter(data, DEATH_EVENT == 1 | time > 126)

It might seem a bit drastic, but including these observations can distort a predictive model quite a lot because of the way the data was generated. Also, the follow-up time cannot be used as a predictor because this variable is only available a posteriori, that is, after the follow-up period and therefore after most deaths occurred. From now on, let’s work with this new filtered dataset and without using the time variable as a predictor. The original study analyzes the data through survival curves, reinforcing this interpretation.

Now, with this new dataset, let’s look at the continuous variables, comparing the two outcomes.

p_sodium <- ggplot(data = filtered_data, aes(DEATH_EVENT, serum_sodium, fill = DEATH_EVENT)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + xlab("Death") + ylab("Serum Sodium") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))

p_creatinine <- ggplot(data = filtered_data, aes(DEATH_EVENT, serum_creatinine, fill = DEATH_EVENT)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + xlab("Death") + ylab("Serum Creatinine") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))

p_platelets <- ggplot(data = filtered_data, aes(DEATH_EVENT, platelets, fill = DEATH_EVENT)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + xlab("Death") + ylab("Platelets") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))

p_ejection <- ggplot(data = filtered_data, aes(DEATH_EVENT, ejection_fraction, fill = DEATH_EVENT)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + xlab("Death") + ylab("Ejection Fraction") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))

p_cpk <- ggplot(data = filtered_data, aes(DEATH_EVENT, creatinine_phosphokinase, fill = DEATH_EVENT)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + xlab("Death") + ylab("Creatinine Phosphokinase") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))

p_age <- ggplot(data = filtered_data, aes(DEATH_EVENT, age, fill = DEATH_EVENT)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + xlab("Death") + ylab("Age") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "black"))

grid.arrange(p_age, p_creatinine, p_cpk, p_ejection, p_platelets, p_sodium, nrow = 2)

Box plots of continuous variables by death outcome

We can see that age, serum creatinine, ejection fraction and serum sodium show noticeable differences in distribution between the two outcomes.

Building a predictive model: logistic regression

When building a predictive model, we always have to be cautious about its complexity and the signal-to-noise ratio in the original data. In this case (biomedical data from patients), there’s usually a lot of noise. This noise comes from biological variability (which is already pretty high) combined with measurement and human errors. So, biomedical data can be quite a needle-in-a-haystack situation: the signal is there, but surrounded by a lot of noise. Therefore, the predictive model should be as simple as possible. Trying to add too much complexity will easily cause overfitting and decrease the generalization capacity of the model. That is, it will work really well with this dataset, but poorly with others. Moreover, there are fewer than 300 observations, which also limits the complexity of the model.

Logistic regression is a type of generalized linear model that is quite useful for predicting categorical outcomes like ours. Since it’s a linear model, it can be kept simple and remains quite interpretable. Let’s build one with all predictors:

log_reg <- glm(data = filtered_data, DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + serum_creatinine + serum_sodium + sex + smoking, family = "binomial")

summary(log_reg)
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
##     diabetes + ejection_fraction + high_blood_pressure + serum_creatinine +
##     serum_sodium + sex + smoking, family = "binomial", data = filtered_data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6772  -0.8397  -0.4011   0.8884   2.4091
##
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)               2.9584921  5.1159428   0.578 0.563069
## age                       0.0541167  0.0147177   3.677 0.000236 ***
## anaemia1                  0.9280712  0.3518475   2.638 0.008347 **
## creatinine_phosphokinase  0.0003715  0.0001809   2.053 0.040046 *
## diabetes1                 0.0310352  0.3359719   0.092 0.926401
## ejection_fraction        -0.0602175  0.0166305  -3.621 0.000294 ***
## high_blood_pressure1      0.7145341  0.3591042   1.990 0.046616 *
## serum_creatinine          0.9299890  0.2865249   3.246 0.001171 **
## serum_sodium             -0.0468141  0.0373333  -1.254 0.209860
## sexmale                  -0.1612479  0.3962186  -0.407 0.684032
## smoking1                 -0.0055385  0.3925579  -0.014 0.988743
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 292.0  on 211  degrees of freedom
## Residual deviance: 223.4  on 201  degrees of freedom
## AIC: 245.4
##
## Number of Fisher Scoring iterations: 5
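Since the coefficients of a logistic regression are on the log-odds scale, one optional way to make them easier to read (a small extra step, not part of the original analysis) is to exponentiate them into odds ratios with Wald confidence intervals:

# Odds ratios and approximate 95% confidence intervals for the model above
exp(cbind(OR = coef(log_reg), confint.default(log_reg)))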

Building models tailored to their application

Six variables were significant predictors of mortality in the logistic regression. Although we could refine this model and achieve good accuracy, it’s not convenient for point-of-care use. Realistically, depending on a computer-powered algorithm to assess risk in a hospital setting is still not practical in most places. A better-suited model would give the prediction without the need for computers.

Decision trees fit this requirement. They split the data into branches, assigning cutoff values to each variable while trying to maximize the predictive power for the outcome. In other words, a tree generates a flowchart-like structure that humans can easily follow.

Since trees can easily overfit, let’s split the data into training and test subsets, grow a large tree, and then use cross-validation to prune it as short as possible while retaining good accuracy.

set.seed(42)

train <- sample(1:nrow(filtered_data), 148)
tree <- tree(DEATH_EVENT ~ . -time, filtered_data, subset = train)
cv_tree <- cv.tree(tree, FUN = prune.misclass)

plot(cv_tree)

Line plot of misclassification by size (number of terminal nodes) of the decision tree

With a size of just 4 terminal nodes, we achieve peak accuracy. This is what the tree looks like:

prune_tree <- prune.misclass(tree, best = 4)
plot(prune_tree)
text(prune_tree)

Visual representation of decision tree using serum creatinine, ejection fraction and age as predictors

Let’s use this model to predict the outcome in the test subset of data and draw a table with the results.

tree_pred = predict(prune_tree, filtered_data[-train, ], type = "class")
with(filtered_data[-train, ], table(tree_pred, DEATH_EVENT))
##          DEATH_EVENT
## tree_pred  0  1
##         0 24  6
##         1  8 26
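From this confusion matrix, the overall accuracy on the test subset can be computed directly (a quick check using the objects defined above):

# Proportion of correct predictions: (24 + 26) / 64 = 0.78125
mean(tree_pred == filtered_data[-train, "DEATH_EVENT"])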

Here, the model achieved 78.1% accuracy in the test subset. It’s quite good, but decision trees can vary drastically when the data changes just a tiny bit. This can be demonstrated by changing the random seed and selecting a new subset of training data. This is another possible tree:

set.seed(71)

train <- sample(1:nrow(filtered_data), 148)
tree <- tree(DEATH_EVENT ~ . -time, filtered_data, subset = train)
cv_tree <- cv.tree(tree, FUN = prune.misclass)
prune_tree <- prune.misclass(tree, best = 4)
plot(prune_tree)
text(prune_tree)

Visual representation of another valid decision tree with different split points

Making it simpler: building a score

The decision tree would already be reasonably accurate and easy to implement inside hospitals. Still, as the last example shows, the hierarchy between the predictors can change with just tiny variations in the data. With under 300 observations and noisy data, we risk building noise into the model by setting a hierarchy between predictors.

An alternative is using a scoring system. Health systems already make extensive use of them with notable success (e.g., the Apgar score or the NIH stroke scale). To build a score, we’ll use the smbinning package to divide continuous variables into categorical ones. These variables will then be used as predictors to assign score points related to the outcome. The package uses recursive partitioning, that is, it builds a decision-tree-like structure for each variable to find the splitting points that maximize its predictive power for the outcome. Let’s bin the continuous variables (not all of them will have significant binning points) and build a new logistic regression with them.

# smbinning requires a numeric 0/1 target, so convert DEATH_EVENT back from factor
num_filtered_data <- filtered_data
num_filtered_data$DEATH_EVENT <- as.numeric(num_filtered_data$DEATH_EVENT) - 1

# Select the continuous variables as candidates for binning
is_num_variables <- sapply(filtered_data, is.numeric)
num_variables <- colnames(filtered_data)[is_num_variables]

new_variables <- vector()

# For each numeric variable, look for statistically significant bins against DEATH_EVENT.
# smbinning() returns a character message (instead of a list) when no significant
# split is found, so those variables are skipped.
for (i in num_variables){
  binning = smbinning(num_filtered_data, "DEATH_EVENT", i, p = 0.1)
  if (!is.character(binning)){
    name <- paste(i, "bin", sep = "_")
    num_filtered_data <- smbinning.gen(num_filtered_data, binning, name)
    new_variables <- append(new_variables, name)
  }
}

scoring_vars <- c(colnames(filtered_data)[!is_num_variables][-5], new_variables[-5])
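To get a feel for what the binning does, we can inspect the split points chosen for a single variable (an illustrative check; ef_bins is a name introduced here and not used elsewhere):

# Cut points found by recursive partitioning for ejection fraction
ef_bins <- smbinning(num_filtered_data, "DEATH_EVENT", "ejection_fraction", p = 0.1)
ef_bins$cuts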

bin_reg <- glm(data = num_filtered_data, as.factor(DEATH_EVENT) ~ anaemia + diabetes + high_blood_pressure + sex + smoking + age_bin + ejection_fraction_bin + serum_creatinine_bin + serum_sodium_bin, family = "binomial")

summary(bin_reg)
## Call:
## glm(formula = as.factor(DEATH_EVENT) ~ anaemia + diabetes + high_blood_pressure +
##     sex + smoking + age_bin + ejection_fraction_bin + serum_creatinine_bin +
##     serum_sodium_bin, family = "binomial", data = num_filtered_data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.3225  -0.7347  -0.4139   0.7564   2.2443
##
## Coefficients:
##                               Estimate Std. Error z value   Pr(>|z|)
## (Intercept)                     0.2264     0.8607   0.263   0.792560
## anaemia1                        0.7040     0.3579   1.967   0.049156 *
## diabetes1                      -0.0442     0.3671  -0.120   0.904164
## high_blood_pressure1            0.6549     0.3775   1.735   0.082779 .
## sexmale                        -0.1711     0.4098  -0.418   0.676282
## smoking1                       -0.1955     0.4199  -0.466   0.641475
## age_bin02 > 70                  1.2665     0.4629   2.736   0.006226 **
## ejection_fraction_bin02 <= 30  -1.3169     0.7688  -1.713   0.086709 .
## ejection_fraction_bin03 > 30   -2.6608     0.7370  -3.611   0.000306 ***
## serum_creatinine_bin02 <= 1.8   1.0863     0.4649   2.337   0.019460 *
## serum_creatinine_bin03 > 1.8    3.0255     0.6606   4.580 0.00000466 ***
## serum_sodium_bin02 > 135       -0.1611     0.3861  -0.417   0.676574
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 292.00  on 211  degrees of freedom
## Residual deviance: 204.99  on 200  degrees of freedom
## AIC: 228.99
##
## Number of Fisher Scoring iterations: 5

We now rebuild the model using only significant predictors.

# Reorder the levels so the lowest-risk bin (ejection fraction > 30) is the reference level
num_filtered_data$ejection_fraction_bin <- factor(num_filtered_data$ejection_fraction_bin, levels = c("03 > 30", "02 <= 30", "01 <= 20"))

bin_reg <- glm(data = num_filtered_data, as.factor(DEATH_EVENT) ~ anaemia + age_bin + ejection_fraction_bin + serum_creatinine_bin, family = "binomial")

summary(bin_reg)
## Call:
## glm(formula = as.factor(DEATH_EVENT) ~ anaemia + age_bin + ejection_fraction_bin +
##     serum_creatinine_bin, family = "binomial", data = num_filtered_data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.0220  -0.7343  -0.3965   0.9449   2.2726
##
## Coefficients:
##                               Estimate Std. Error z value    Pr(>|z|)
## (Intercept)                    -2.5037     0.4741  -5.281 0.000000129 ***
## anaemia1                        0.6873     0.3479   1.976    0.048178 *
## age_bin02 > 70                  1.3331     0.4429   3.010    0.002616 **
## ejection_fraction_bin02 <= 30   1.3307     0.3942   3.376    0.000735 ***
## ejection_fraction_bin01 <= 20   2.6506     0.7040   3.765    0.000167 ***
## serum_creatinine_bin02 <= 1.8   1.0581     0.4545   2.328    0.019895 *
## serum_creatinine_bin03 > 1.8    3.0097     0.6298   4.779 0.000001763 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 292.00  on 211  degrees of freedom
## Residual deviance: 209.24  on 205  degrees of freedom
## AIC: 223.24
##
## Number of Fisher Scoring iterations: 5

The scaling function of the smbinning package assigns score points to each variable. I have toyed a bit with the parameters to come up with a convenient scale of points.

scaled_reg <- smbinning.scaling(bin_reg, pdo = 10, score = 100, odds = 95)
scaled_reg$logitscaled[[1]] %>% kbl() %>% kable_styling()

Characteristic            Attribute   Coefficient   Weight   WeightScaled   Points
(Intercept)                                  -2.50   -36.12           0.00        0
anaemia                   0                   0.00     0.00          -0.45        0
anaemia                   1                   0.69     9.92           9.46        9
age_bin                   01 <= 70            0.00     0.00          -0.45        0
age_bin                   02 > 70             1.33    19.23          18.78       19
ejection_fraction_bin     01 <= 20            2.65    38.24          37.78       38
ejection_fraction_bin     02 <= 30             1.33    19.20          18.74       19
ejection_fraction_bin     03 > 30             0.00     0.00          -0.45        0
serum_creatinine_bin      01 <= 0.9           0.00     0.00          -0.45        0
serum_creatinine_bin      02 <= 1.8           1.06    15.27          14.81       15
serum_creatinine_bin      03 > 1.8            3.01    43.42          42.97       43

Let’s divide the points by 10 and round them to a single digit to make the model even simpler.

points <- round(scaled_reg$logitscaled[[1]][ , "Points"]/10, 0)
names(points) <- scaled_reg$logitscaled[[1]][ , "Characteristic"]
points
##           (Intercept)              anaemia0              anaemia1
##                     0                     0                     1
##               age_bin               age_bin ejection_fraction_bin
##                     0                     2                     4
## ejection_fraction_bin ejection_fraction_bin  serum_creatinine_bin
##                     2                     0                     0
##  serum_creatinine_bin  serum_creatinine_bin
##                     2                     4

Lastly, I’ve chosen to simplify one last time by eliminating the anaemia variable. Then, all other points can be divided by 2. This creates a very simple scoring system ranging from 0 to 5 points, where a higher score means a higher mortality risk. Finally, we create a function to obtain this score from patient data and calculate some accuracy metrics using this scoring method.

get_score <- function(data){
  scores = vector()
  for (i in 1:nrow(data)){
    score = 0
    if (data[i, "age"] > 70){
      score = score + 1
    }
    # An ejection fraction <= 20 satisfies both conditions below, scoring 2 points in total
    if (data[i, "ejection_fraction"] <= 20){
      score = score + 1
    }
    if (data[i, "ejection_fraction"] <= 30){
      score = score + 1
    }
    # Likewise, serum creatinine > 1.8 satisfies both conditions below, scoring 2 points in total
    if (data[i, "serum_creatinine"] > 0.9){
      score = score + 1
    }
    if (data[i, "serum_creatinine"] > 1.8){
      score = score + 1
    }
    scores <- append(scores, score)
  }
  return(scores)
}

scores <- get_score(filtered_data)
num_filtered_data$score <- scores
smbinning.metrics(prediction = "score", actualclass = "DEATH_EVENT", dataset = num_filtered_data)
##   Overall Performance Metrics
##   --------------------------------------------------
##                     KS : 0.6038 (Awesome)
##                    AUC : 0.8222 (Good)
##
##   Classification Matrix
##   --------------------------------------------------
##            Cutoff (>=) : 2 (Optimal)
##    True Positives (TP) : 77
##   False Positives (FP) : 23
##   False Negatives (FN) : 19
##    True Negatives (TN) : 93
##    Total Positives (P) : 96
##    Total Negatives (N) : 116
##
##   Business/Performance Metrics
##   --------------------------------------------------
##       %Records>=Cutoff : 0.4717
##              Good Rate : 0.7700 (Vs 0.4528 Overall)
##               Bad Rate : 0.2300 (Vs 0.5472 Overall)
##         Accuracy (ACC) : 0.8019
##      Sensitivity (TPR) : 0.8021
##  False Neg. Rate (FNR) : 0.1979
##  False Pos. Rate (FPR) : 0.1983
##      Specificity (TNR) : 0.8017
##        Precision (PPV) : 0.7700
##   False Discovery Rate : 0.2300
##     False Omision Rate : 0.1696
##   Inv. Precision (NPV) : 0.8304
##
##   Note: 0 rows deleted due to missing data.

Variable                        Points
Age > 70                        +1
Ejection fraction <= 30%        +1
Ejection fraction <= 20%        +2
Serum creatinine > 0.9 mg/dL    +1
Serum creatinine > 1.8 mg/dL    +2
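As a quick usage example with made-up values, a hypothetical 75-year-old patient with an ejection fraction of 25% and a serum creatinine of 2.0 mg/dL would score 4 points:

# Hypothetical patient (made-up values): age 75, EF 25%, serum creatinine 2.0 mg/dL
new_patient <- data.frame(age = 75, ejection_fraction = 25, serum_creatinine = 2.0)
get_score(new_patient)  # 1 (age) + 1 (EF <= 30) + 2 (creatinine > 1.8) = 4 points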

The model achieved 80.19% accuracy, which is quite impressive considering its numeric simplicity. A score of 2 or more points is considered high mortality risk. This goes to show the importance of keeping it simple when building predictive models with noisy data and small datasets. In this case, we simplified to the extreme to improve generalization and avoid overfitting.

With just 3 easily available variables, this scoring model achieves 80.21% sensitivity and 80.17% specificity while also being so simple that it can be calculated mentally by healthcare workers. Still, it’s not immune to overfitting: the binning boundaries may have incorporated noise, and a larger dataset would help fine-tune them.
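For completeness, these headline numbers can be recovered directly from the classification matrix above (a small arithmetic check):

# Recompute accuracy, sensitivity and specificity from the confusion matrix above
TP <- 77; FP <- 23; FN <- 19; TN <- 93
c(accuracy    = (TP + TN) / (TP + FP + FN + TN),  # 0.8019
  sensitivity = TP / (TP + FN),                   # 0.8021
  specificity = TN / (TN + FP))                   # 0.8017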

