
Exploratory data analysis: the WHO suicide dataset


Warning: this post talks about suicide

The Data

Exploratory data analysis is essential for constructing hypotheses. Today we’ll explore the publicly available WHO Suicide Statistics database (version from Kaggle). It consists of a single CSV table with 43,776 rows and only 6 variables. We do not intend to speculate about the causes of suicide or to make any judgments. This analysis was done using R and R Markdown.

summary(who_suicide_statistics)
##    country               year          sex                age
##  Length:43776       Min.   :1979   Length:43776       Length:43776
##  Class :character   1st Qu.:1990   Class :character   Class :character
##  Mode  :character   Median :1999   Mode  :character   Mode  :character
##                     Mean   :1999
##                     3rd Qu.:2007
##                     Max.   :2016
##
##   suicides_no        population
##  Min.   :    0.0   Min.   :     259
##  1st Qu.:    1.0   1st Qu.:   85113
##  Median :   14.0   Median :  380655
##  Mean   :  193.3   Mean   : 1664091
##  3rd Qu.:   91.0   3rd Qu.: 1305698
##  Max.   :22338.0   Max.   :43805214
##  NA's   :2256      NA's   :5460

Clearly, we have a considerable number of missing values (2,256 in suicides_no and 5,460 in population). The data run from 1979 to 2016, which is still quite recent. The sex and country variables must be converted to categorical ones:

who_suicide_statistics$sex <- as.factor(who_suicide_statistics$sex)
who_suicide_statistics$country <- as.factor(who_suicide_statistics$country)

Next, the age vari­able should be an or­dered fac­tor:

# List all six age groups: any value missing from `levels` would silently become NA
who_suicide_statistics$age <- factor(who_suicide_statistics$age, levels = c("5-14 years", "15-24 years", "25-34 years", "35-54 years", "55-74 years", "75+ years"))
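A minimal sketch of how explicit factor levels behave, assuming the dataset uses the six WHO age bands shown in the toy vector (the vector itself is made up for illustration). It shows that a value left out of `levels` is silently turned into NA, and that adding `ordered = TRUE` (an optional addition, not part of the code above) makes comparisons between age groups meaningful.

```r
# Toy vector with the six assumed age bands (illustrative, not the real data)
age <- c("5-14 years", "35-54 years", "75+ years",
         "15-24 years", "25-34 years", "55-74 years")

age_levels <- c("5-14 years", "15-24 years", "25-34 years",
                "35-54 years", "55-74 years", "75+ years")

# ordered = TRUE makes comparisons such as `<` valid between levels
age_ordered <- factor(age, levels = age_levels, ordered = TRUE)

# Dropping one level from the list turns the matching values into NA
age_incomplete <- factor(age, levels = age_levels[-4])

sum(is.na(age_ordered))          # 0
sum(is.na(age_incomplete))       # 1 ("35-54 years" was not listed)
age_ordered[1] < age_ordered[3]  # TRUE: "5-14 years" sorts before "75+ years"
```

Checking `sum(is.na(...))` right after the conversion is a cheap way to catch a mistyped or missing level.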

Let’s take a look at our most important variable, the number of suicides:

Histogram of suicide number with highly skewed distribution

Clearly, the distribution is extremely skewed and zero-inflated, ranging from 0 to very high values. Let’s create a suicide rate variable (rate_suicide), defined as rate_suicide = suicides_no / population * 1000000 (suicides per million people), and look at its distribution:

total_suicide_rate <- who_suicide_statistics %>% group_by(country, year) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

hist(total_suicide_rate$rate_suicide, xlab = "Suicide rate", main = NA)

Histogram of suicide rate

Much less vari­ance, but still a very broad range. Let’s sum­ma­rize and plot some graphs to see the re­la­tion­ships be­tween vari­ables.

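# library() attaches one package per call; library(ggplot2, dplyr) would not attach dplyr
library(dplyr)
library(ggplot2)
library(broom)  # tidy()
library(knitr)  # kable()
## Warning: package 'ggplot2' was built under R version 4.0.5
total_suicide <- who_suicide_statistics %>% group_by(year, country) %>% summarise(total_suicide = sum(suicides_no, na.rm = TRUE), .groups = "drop_last")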
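The per-million rate used in this post pools the counts before dividing: it sums suicides and population over all age and sex groups of a country-year and only then takes the ratio. A small sketch with made-up numbers shows why this differs from averaging the per-group rates, and why a single missing count removes the whole country-year once `na.omit` is applied.

```r
# Made-up counts for one country-year, split into two age groups
toy <- data.frame(
  age         = c("15-24 years", "75+ years"),
  suicides_no = c(10, 30),
  population  = c(1000000, 500000)
)

# Pooled rate: sum first, then divide (the approach used in this post)
pooled_rate <- sum(toy$suicides_no) * 1e6 / sum(toy$population)

# Unweighted mean of the group rates ignores group sizes
group_rates   <- toy$suicides_no * 1e6 / toy$population
mean_of_rates <- mean(group_rates)

pooled_rate    # about 26.7 per million
mean_of_rates  # 35 per million

# Without na.rm = TRUE, one missing count makes the whole sum NA
sum(c(10, NA))  # NA
```

The pooled version weights each group by its population, which is usually what a national rate is meant to express.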

Line plot of suicide numbers per country over time

Line plot of suicide rate per country over time

Violin plot of suicide rate per gender

Men have higher suicide rates overall. Let’s see which countries have the highest and lowest suicide rates:

Top 10 countries and corresponding years with the highest suicide rates

| country | year | rate_suicide |
|---|---|---|
| Lithuania | 1996 | 510.1976 |
| Lithuania | 1995 | 500.1256 |
| Lithuania | 1994 | 499.8927 |
| Hungary | 1983 | 492.1207 |
| Lithuania | 2000 | 491.9875 |
| Hungary | 1981 | 491.5882 |
| Hungary | 1984 | 490.6624 |
| Hungary | 1980 | 486.1906 |
| Lithuania | 1997 | 485.0974 |
| Hungary | 1979 | 485.0378 |

Top 10 countries and corresponding years with the lowest positive suicide rates

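| country | year | rate_suicide |
|---|---|---|
| Egypt | 1980 | 0.4035020 |
| Jamaica | 2004 | 0.4057770 |
| Jamaica | 1991 | 0.4640727 |
| Jamaica | 1986 | 0.4872034 |
| Egypt | 2007 | 0.4927728 |
| Egypt | 1987 | 0.4942756 |
| Jamaica | 1982 | 0.5138543 |
| Egypt | 2002 | 0.5620709 |
| Egypt | 2015 | 0.6084794 |
| Egypt | 2008 | 0.6107135 |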

Top 20 coun­tries with the high­est sui­cide rates (2012-2016 av­er­age)

Now let’s take an av­er­age over the last five years of data and see again the highs and lows.

| country | rate_suicide |
|---|---|
| Lithuania | 335.3883 |
| Guyana | 305.1528 |
| Republic of Korea | 289.2143 |
| Suriname | 265.4565 |
| Slovenia | 217.4291 |
| Hungary | 212.4062 |
| Latvia | 209.5409 |
| Kazakhstan | 208.0467 |
| Japan | 207.0399 |
| Belarus | 204.4635 |
| Russian Federation | 203.9287 |
| Ukraine | 198.5692 |
| Uruguay | 186.5003 |
| Belgium | 182.3194 |
| Croatia | 179.5992 |
| Estonia | 178.9021 |
| Serbia | 169.9654 |
| Republic of Moldova | 168.2837 |
| Mongolia | 166.7801 |
| Poland | 166.0466 |

Top 20 coun­tries with the low­est pos­i­tive sui­cide rates (2012-2016 av­er­age)

| country | rate_suicide |
|---|---|
| Egypt | 1.596867 |
| Oman | 1.927792 |
| Antigua and Barbuda | 2.720674 |
| Grenada | 4.191730 |
| Bahrain | 9.113524 |
| Mayotte | 10.501900 |
| South Africa | 11.001666 |
| Bahamas | 14.440957 |
| Kuwait | 15.263111 |
| Brunei Darussalam | 15.960329 |
| Turkey | 22.758229 |
| Qatar | 23.989111 |
| Armenia | 24.627670 |
| Venezuela (Bolivarian Republic of) | 24.873804 |
| Turkmenistan | 26.545199 |
| Iran (Islamic Rep of) | 34.028634 |
| Guatemala | 34.051098 |
| Saint Vincent and Grenadines | 37.354314 |
| Panama | 37.454562 |
| Fiji | 40.871639 |

Democ­racy Index

Let’s see if there’s any relationship between suicide rates (2012-2016 average) and the Democracy Index (2015) calculated by The Economist Group. The Democracy Index data was manually curated so that its country names match those in the WHO dataset.

democracy <- read.csv(file = "democracy_index_2015.csv")

democracy_compare_data <- total_suicide_rate %>% filter(year >= 2012) %>% filter(country %in% as.character(unique(democracy$Country))) %>% group_by(country) %>% summarise(rate_suicide = mean(rate_suicide, na.rm = T)) %>% arrange(country)

democracy <- democracy %>% filter(Country %in% as.character(unique(democracy_compare_data$country))) %>% arrange(Country)

democracy_compare_data$overall_score <- democracy$Overall_score

ggplot(data = democracy_compare_data, aes(overall_score, rate_suicide)) + geom_point(size = 2, alpha = 0.75, colour = "dark blue") + theme_bw() + geom_smooth(formula = y ~ x, method = "loess", se = F) + xlab("Democracy score (overall)") + ylab("Suicide rate (per million people)")

Scatter plot between democracy score and suicide rate showing positive correlation
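The code above pairs the two tables by sorting both by country and copying one column across, which works only if both contain exactly the same countries in the same order. A more defensive alternative is a key-based join. The sketch below uses made-up values and hypothetical table names (`suicide_avg`, `democracy_toy`); only `inner_join()` and the column names `country`, `Country` and `Overall_score` mirror the post.

```r
library(dplyr)

# Made-up stand-ins for the two tables (values are illustrative only)
suicide_avg <- data.frame(
  country      = c("Country A", "Country B", "Country C"),
  rate_suicide = c(60, 210, 110),
  stringsAsFactors = FALSE
)
democracy_toy <- data.frame(
  Country       = c("Country C", "Country A", "Country D"),
  Overall_score = c(9.1, 6.9, 7.8),
  stringsAsFactors = FALSE
)

# Rows are matched by country name, so ordering and unmatched rows cannot misalign
joined <- inner_join(suicide_avg, democracy_toy, by = c("country" = "Country"))

joined
```

Only countries present in both tables survive the join, and each score sits next to the rate of the same country regardless of the original row order.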

tidy(cor.test(democracy$Overall_score, democracy_compare_data$rate_suicide, method = "pearson")) %>% kable()

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.3072375 | 2.833023 | 0.0058833 | 77 | 0.0924044 | 0.4947386 | Pearson’s product-moment correlation | two.sided |

tidy(cor.test(democracy$Overall_score, democracy_compare_data$rate_suicide, method = "spearman")) %>% kable()

| estimate | statistic | p.value | method | alternative |
|---|---|---|---|---|
| 0.3547168 | 53016.47 | 0.0013388 | Spearman’s rank correlation rho | two.sided |
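As a sanity check, the Pearson test statistic, p-value and confidence interval reported above can be reproduced by hand from the estimate and the degrees of freedom alone. This is a sketch of the standard formulas (a t statistic with n - 2 degrees of freedom and a Fisher z-transform for the interval); the variable names are just for illustration.

```r
r  <- 0.3072375  # Pearson estimate from the output above
df <- 77         # "parameter" column, i.e. n - 2

# t statistic and two-sided p-value
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transform
n  <- df + 2
ci <- tanh(atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

t_stat   # approximately 2.833
p_value  # approximately 0.0059
ci       # approximately 0.092 and 0.495
```

The wide interval is a reminder that, with 79 countries, the data are compatible with anything from a very weak to a moderate correlation.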

There’s a weak (r = 0.307) but statistically significant (p = 0.006) positive Pearson correlation between the Democracy Index and suicide rates. However, there are many confounding factors here: more democratic countries are in general richer and may report suicide statistics more accurately, and there are huge cultural differences between countries. Among the more democratic nations (overall score above 6), no clear correlation is visible:

democracy_compare_data %>% filter(overall_score > 6) %>% ggplot(aes(overall_score, rate_suicide)) + geom_point(size = 2, alpha = 0.75, colour = "dark blue") + theme_bw() + geom_smooth(formula = y ~ x, method = "loess", se = F) + xlab("Democracy score (overall)") + ylab("Suicide rate (per million people)")

Scatter plot between democracy score and suicide rate in highly democratic nations showing no clear correlation

Per capita gross domestic product (GDP) values for 2015, based on purchasing-power parity (PPP) and expressed in international dollars, were obtained from the International Monetary Fund (IMF).

gdppc <- read.csv("WEO_Data.xls", sep = "\t")
gdppc$X2015 <- as.numeric(as.character(gdppc$X2015))

gdp_compare_data <- total_suicide_rate %>% filter(year >= 2012) %>% filter(country %in% as.character(unique(gdppc$Country))) %>% group_by(country) %>% summarise(rate_suicide = mean(rate_suicide, na.rm = T)) %>% arrange(country)

gdppc <- gdppc %>% filter(Country %in% as.character(unique(gdp_compare_data$country))) %>% arrange(Country)

Histogram of PPP GDP per capita (2015)

Histogram of log of PPP GDP per capita (2015)

As the GDP vari­able is heav­ily skewed, it’s bet­ter to vi­su­al­ize it using its log trans­form:

Scatter plot between log of PPP GDP per capita and suicide rate with no clear correlation

tidy(cor.test(gdppc$X2015, gdp_compare_data$rate_suicide, method = "spearman")) %>% kable()

| estimate | statistic | p.value | method | alternative |
|---|---|---|---|---|
| 0.1861228 | 69440 | 0.0983024 | Spearman’s rank correlation rho | two.sided |

There is no apparent association between suicide rates and per capita GDP (Spearman’s rho = 0.19, p = 0.098).
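Note that the test uses the raw GDP values while the plot uses their logarithm. For Spearman’s rank correlation this makes no difference, because the log is a monotonic transform and leaves the ranks unchanged; Pearson’s correlation, in contrast, generally does change. A small sketch with simulated data (the numbers below are random and unrelated to the IMF or WHO figures):

```r
set.seed(1)

# Simulated right-skewed "GDP" values and an unrelated "rate" variable
gdp  <- rlnorm(80, meanlog = 9.5, sdlog = 1)
rate <- rnorm(80, mean = 100, sd = 50)

# Rank-based correlation is identical before and after the log transform
rho_raw <- cor(gdp, rate, method = "spearman")
rho_log <- cor(log(gdp), rate, method = "spearman")
all.equal(rho_raw, rho_log)  # TRUE

# Pearson correlation depends on the scale, so these two generally differ
pearson_raw <- cor(gdp, rate)
pearson_log <- cor(log(gdp), rate)
c(pearson_raw, pearson_log)
```

This is one reason a rank-based test is a reasonable choice for a heavily skewed variable such as GDP.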

Gen­der Ra­tios

female_rates <- who_suicide_statistics %>% filter(year >= 2012) %>% group_by(country, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit %>% arrange(country) %>% filter(sex == "female")

male_rates <- who_suicide_statistics %>% filter(year >= 2012) %>% group_by(country, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit %>% arrange(country) %>% filter(sex == "male")

gender_ratio <- data.frame(country = female_rates$country, ratio = male_rates$rate_suicide / female_rates$rate_suicide) %>% na.omit() %>% filter(is.finite(ratio))

hist(gender_ratio$ratio, main = NA, xlab = "Gender Ratio")

Histogram of suicide gender ratio
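The ratio above divides two separately filtered vectors, which assumes that the male and female tables list the same countries in the same order. If `na.omit` drops a country for one sex only, the rows no longer line up. A hedged alternative is to join the two tables by country before dividing; the sketch below uses made-up rates and hypothetical names (`rates`, `gender_ratio_toy`).

```r
library(dplyr)

# Made-up rates; "Country B" has no female figure
rates <- data.frame(
  country      = c("Country A", "Country A", "Country B", "Country C", "Country C"),
  sex          = c("female", "male", "male", "female", "male"),
  rate_suicide = c(50, 200, 120, 40, 100),
  stringsAsFactors = FALSE
)

female <- rates %>% filter(sex == "female") %>% select(country, female_rate = rate_suicide)
male   <- rates %>% filter(sex == "male")   %>% select(country, male_rate = rate_suicide)

# Join by country so each male rate is divided by the female rate of the same country
gender_ratio_toy <- inner_join(male, female, by = "country") %>%
  mutate(ratio = male_rate / female_rate) %>%
  filter(is.finite(ratio))

gender_ratio_toy
```

Here "Country B" is dropped by the join instead of shifting every later row by one position.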

gender_ratio_gdp <- gender_ratio %>% filter(country %in% as.character(unique(gdppc$Country)))
gdppc_gender <- gdppc %>% filter(Country %in% as.character(unique(gender_ratio_gdp$country)))

tidy(cor.test(gender_ratio_gdp$ratio, gdppc_gender$X2015)) %>% kable()

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.2276648 | -1.956149 | 0.0544387 | 70 | -0.4363206 | 0.0042267 | Pearson’s product-moment correlation | two.sided |

gender_ratio_dem <- gender_ratio %>% filter(country %in% as.character(unique(democracy$Country)))
democracy_gender <- democracy %>% filter(Country %in% as.character(unique(gender_ratio_dem$country)))

tidy(cor.test(gender_ratio_dem$ratio, democracy_gender$Overall_score)) %>% kable()

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -0.0795524 | -0.6724517 | 0.5034788 | 71 | -0.3040547 | 0.153321 | Pearson’s product-moment correlation | two.sided |

There does not seem to be any association between gender ratios and either the Democracy Index (r = -0.08, p = 0.50) or per capita GDP (r = -0.23, p = 0.054), although the latter is borderline.

Top 20 countries with the highest gender ratios (male-to-female), 2012-2016

head(gender_ratio %>% arrange(desc(ratio)), n = 20) %>% kable()
| country | ratio |
|---|---|
| Bahrain | 9.262603 |
| Poland | 6.992961 |
| Saint Lucia | 6.889143 |
| Seychelles | 6.841246 |
| Slovakia | 6.684513 |
| Panama | 6.439133 |
| Mongolia | 6.425235 |
| Puerto Rico | 6.290557 |
| Costa Rica | 6.000486 |
| Romania | 5.962328 |
| Republic of Moldova | 5.751732 |
| Belize | 5.586587 |
| Latvia | 5.480794 |
| Lithuania | 5.463001 |
| Russian Federation | 5.419557 |
| Cyprus | 5.224378 |
| Reunion | 5.115328 |
| Ukraine | 5.011528 |
| Malta | 4.980026 |
| Georgia | 4.875833 |

Top 20 countries with the lowest positive gender ratios (male-to-female), 2012-2016

head(gender_ratio %>% filter(ratio > 0) %>% arrange(ratio), n = 20) %>% kable()
| country | ratio |
|---|---|
| Kuwait | 1.472849 |
| Aruba | 1.903816 |
| Hong Kong SAR | 1.915214 |
| Uzbekistan | 2.076370 |
| Singapore | 2.079089 |
| Iran (Islamic Rep of) | 2.116816 |
| Fiji | 2.120162 |
| Netherlands | 2.211227 |
| Republic of Korea | 2.304564 |
| Norway | 2.332093 |
| Sweden | 2.360817 |
| Japan | 2.424445 |
| Virgin Islands (USA) | 2.484681 |
| Paraguay | 2.492538 |
| Turkmenistan | 2.578286 |
| Luxembourg | 2.584304 |
| Belgium | 2.614295 |
| Guatemala | 2.636253 |
| Saint Vincent and Grenadines | 2.699769 |
| New Zealand | 2.759721 |

Age

El­derly sui­cide is an in­creas­ingly trou­ble­some con­cern as the pop­u­la­tion grows older.

elderly_data <- who_suicide_statistics %>% filter(year >= 2012) %>% filter(age == "55-74 years" | age == "75+ years") %>% group_by(country) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population)) %>% na.omit %>% arrange(desc(rate_suicide))

Top 10 coun­tries with the high­est el­derly sui­cide rates (2012-2016)

head(elderly_data, n = 10) %>% kable()
| country | rate_suicide |
|---|---|
| Republic of Korea | 494.6145 |
| Lithuania | 391.9060 |
| Slovenia | 346.1632 |
| Hungary | 342.0108 |
| Guyana | 333.2012 |
| Suriname | 307.5963 |
| Serbia | 296.0792 |
| Croatia | 289.7782 |
| Cuba | 277.7552 |
| Uruguay | 261.7103 |

This, however, can be biased by a higher overall incidence of suicide in some countries. Thus, let’s calculate the proportion of total suicides that are elderly ones (55+ years).

total_elderly <- who_suicide_statistics %>% filter(year >= 2012) %>% filter(age == "55-74 years" | age == "75+ years") %>% group_by(country) %>% summarise(total_suicide = sum(suicides_no)) %>% na.omit

total_2012_16 <- total_suicide %>% filter(year >= 2012) %>% group_by(country) %>% summarise(total_suicide = sum(total_suicide, na.rm = T)) %>% filter(country %in% as.character(unique(total_elderly$country)))

elderly_proportion <- data.frame(country = total_elderly$country, proportion = total_elderly$total_suicide / total_2012_16$total_suicide)

elderly_proportion <- elderly_proportion[is.finite(elderly_proportion$proportion), ]
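The same proportion can also be computed in a single grouped summary, which keeps the elderly count and the total for each country in the same row and avoids dividing two separately built tables. A minimal sketch with made-up counts (the table `toy_counts` and its values are hypothetical):

```r
library(dplyr)

# Made-up counts for two countries and three age groups
toy_counts <- data.frame(
  country     = rep(c("Country A", "Country B"), each = 3),
  age         = rep(c("25-34 years", "55-74 years", "75+ years"), times = 2),
  suicides_no = c(50, 30, 20, 30, 5, 5),
  stringsAsFactors = FALSE
)

elderly_groups <- c("55-74 years", "75+ years")

elderly_share <- toy_counts %>%
  group_by(country) %>%
  summarise(
    elderly    = sum(suicides_no[age %in% elderly_groups]),  # 55+ only
    total      = sum(suicides_no),                           # all ages
    proportion = elderly / total
  )

elderly_share
```

In this toy example the shares are 0.5 for "Country A" and 0.25 for "Country B".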

Top 10 coun­tries with high­est el­derly sui­cide pro­por­tion (2012-2016)

head(elderly_proportion %>% arrange(desc(proportion)), n = 10) %>% kable()
| country | proportion |
|---|---|
| Antigua and Barbuda | 1.0000000 |
| Serbia | 0.6119500 |
| Portugal | 0.5895522 |
| Bulgaria | 0.5839448 |
| Croatia | 0.5570321 |
| Hungary | 0.5400160 |
| Germany | 0.5384444 |
| Austria | 0.5329861 |
| Slovenia | 0.5318396 |
| Cuba | 0.5177912 |

USA and Brazil: a case-​study

I’ve selected two countries for further analysis: Brazil and the USA, both very large countries with reliable data.

BR_data <- subset(who_suicide_statistics, country == "Brazil")

US_data <- subset(who_suicide_statistics, country == "United States of America")

Suicide rate over time in Brazil with a positive trend

Suicide rate over time in the USA, the number drops until the year 2000 and rises again

Plot of suicide rate per gender showing higher rates in males (Brazil)

Plot of suicide rate per gender showing higher rates in males (USA)

Gen­der dif­fer­ences can be cal­cu­lated over time:

Suicide rate per gender over time in Brazil

Suicide rate per gender over time in the USA

sex_US_data <- US_data %>% group_by(year, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

sex_BR_data <- BR_data %>% group_by(year, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

US_data_sexratio <- data.frame(year = subset(sex_US_data, sex == "male")$year, ratio = subset(sex_US_data, sex == "male")$rate_suicide / subset(sex_US_data, sex == "female")$rate_suicide, country = "US")

BR_data_sexratio <- data.frame(year = subset(sex_BR_data, sex == "male")$year, ratio = subset(sex_BR_data, sex == "male")$rate_suicide / subset(sex_BR_data, sex == "female")$rate_suicide, country = "BR")

data_sexratio <- rbind(US_data_sexratio, BR_data_sexratio)

Gender ratio of suicides in Brazil and USA

In Brazil, suicide rates for men have been steadily increasing since the 1980s, while rates for women have stayed roughly the same. In the US, however, suicide rates for men increased during the 1980s (with no matching increase in women’s rates), declined in the 2000s, and have been increasing since 2005-06. This recent increase is accompanied by a similar (but smaller) one in women’s rates. Thus, the men-to-women ratio increased over time in Brazil and decreased only after 2000 in the US. In 2015, 4 to 4.5 men ended their lives for each woman who did so, in both Brazil and the US.

age_data_usbr <- who_suicide_statistics %>% group_by(year, country, age) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

Suicide rates per age group in Brazil. Elderly people show higher rates

Suicide rates per age group in the USA. Elderly people show higher rates

Both countries show the highest suicide rates among the elderly. However, in both cases the gap between adults (25–34 years) and the elderly (55+ years) has been narrowing since the 2000s, which suggests that adult suicide is more likely now than it was in the 1990s.

age_gender_usbr <- who_suicide_statistics %>% group_by(sex, year, country, age) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

Suicide rate per gender and age group in Brazil

Suicide rate per gender and age group in the USA

Interestingly, the high elderly suicide rate appears to be accounted for by men alone: there’s practically no age gap among women. This suggests that the elevated elderly suicide rate is almost exclusively a male issue in these countries.

Con­clu­sion

This exploratory analysis is descriptive and is meant to inform about overall characteristics and trends in the global suicide reports provided by the WHO. Suicide is a complex social phenomenon and should not be interpreted simplistically. Still, the huge difference between genders in the age gap is of interest.

