Exploratory data analysis: the WHO suicide dataset

Warning: this post talks about suicide

The Data

Exploratory data analysis is essential to construct hypothesis. Today we’ll explore the publicly available WHO Suicide Statistics database (version from Kaggle). It consists of a single CSV table, with 43776 instances of merely 6 variables. We do not intend to speculate about suicide causes nor to make any judgments. This analysis was done using R and R markdown.

summary(who_suicide_statistics)

##    country               year          sex                age
##  Length:43776       Min.   :1979   Length:43776       Length:43776
##  Class :character   1st Qu.:1990   Class :character   Class :character
##  Mode  :character   Median :1999   Mode  :character   Mode  :character
##                     Mean   :1999
##                     3rd Qu.:2007
##                     Max.   :2016
##
##   suicides_no        population
##  Min.   :    0.0   Min.   :     259
##  1st Qu.:    1.0   1st Qu.:   85113
##  Median :   14.0   Median :  380655
##  Mean   :  193.3   Mean   : 1664091
##  3rd Qu.:   91.0   3rd Qu.: 1305698
##  Max.   :22338.0   Max.   :43805214
##  NA's   :2256      NA's   :5460

Clearly, we have a considerable amount of missing values, with data since 1979 to 2016, which is still quite recent. The sex and country variables must be converted to categorical ones:

who_suicide_statistics$sex <- as.factor(who_suicide_statistics$sex)
who_suicide_statistics$country <- as.factor(who_suicide_statistics$country)

Next, the age variable should be an ordered factor:

who_suicide_statistics$age <- factor(who_suicide_statistics$age, levels = c("5-14 years", "15-24 years", "25-34 years", "55-74 years", "75+ years"))

Let’s take a look at our most important variable –- suicide number:

Histogram of suicide number with highly skewed distribution

Clearly, the distribution is extremely skewed and zero-inflated, ranging from 0 to very high values. Let’s create a proportional suicide number variable (suicide_rate), defined by prop_suicide = suicides_no/population * 1000000 (per million people) and see its distribution:

total_suicide_rate <- who_suicide_statistics %>% group_by(country, year) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

hist(total_suicide_rate$rate_suicide, xlab = "Suicide rate", main = NA)

Histogram of suicide rate

Much less variance, but still a very broad range. Let’s summarize and plot some graphs to see the relationships between variables.

library(ggplot2, dplyr)

## Warning: package 'ggplot2' was built under R version 4.0.5

total_suicide <- who_suicide_statistics %>% group_by(year, country) %>% summarise(total_suicide = sum(suicides_no, na.rm = T), .groups = "drop_last")

Line plot of suicide numbers per country over time

Line plot of suicide rate per country over time

Violin plot of suicide rate per gender

Men have higher suicide rates overall Let’s see which countries have the most and least suicides:

Top 10 countries and correspondent years with the highest suicide rates

country	year	rate_suicide
Lithuania	1996	510.1976
Lithuania	1995	500.1256
Lithuania	1994	499.8927
Hungary	1983	492.1207
Lithuania	2000	491.9875
Hungary	1981	491.5882
Hungary	1984	490.6624
Hungary	1980	486.1906
Lithuania	1997	485.0974
Hungary	1979	485.0378

Top 10 countries and correspondent years with the lowest positive suicide rates

country	year	rate_suicide
Egypt	1980	0.4035020
Jamaica	2004	0.4057770
Jamaica	1991	0.4640727
Jamaica	1986	0.4872034
Egypt	2007	0.4927728
Egypt	1987	0.4942756
Jamaica	1982	0.5138543
Egypt	2002	0.5620709
Egypt	2015	0.6084794
Egypt	2008	0.6107135

Top 20 countries with the highest suicide rates (2012-2016 average)

Now let’s take an average over the last five years of data and see again the highs and lows.

country	rate_suicide
Lithuania	335.3883
Guyana	305.1528
Republic of Korea	289.2143
Suriname	265.4565
Slovenia	217.4291
Hungary	212.4062
Latvia	209.5409
Kazakhstan	208.0467
Japan	207.0399
Belarus	204.4635
Russian Federation	203.9287
Ukraine	198.5692
Uruguay	186.5003
Belgium	182.3194
Croatia	179.5992
Estonia	178.9021
Serbia	169.9654
Republic of Moldova	168.2837
Mongolia	166.7801
Poland	166.0466

Top 20 countries with the lowest positive suicide rates (2012-2016 average)

country	rate_suicide
Egypt	1.596867
Oman	1.927792
Antigua and Barbuda	2.720674
Grenada	4.191730
Bahrain	9.113524
Mayotte	10.501900
South Africa	11.001666
Bahamas	14.440957
Kuwait	15.263111
Brunei Darussalam	15.960329
Turkey	22.758229
Qatar	23.989111
Armenia	24.627670
Venezuela (Bolivarian Republic of)	24.873804
Turkmenistan	26.545199
Iran (Islamic Rep of)	34.028634
Guatemala	34.051098
Saint Vincent and Grenadines	37.354314
Panama	37.454562
Fiji	40.871639

Democracy Index

Let’s see if there’s any relationship between suicide rates (2012-2016) and Democracy Index (2015) calculated by The Economist group. The democracy index data was manually curated to correspond to country names present in the WHO dataset.

democracy <- read.csv(file = "democracy_index_2015.csv")

democracy_compare_data <- total_suicide_rate %>% filter(year >= 2012) %>% filter(country %in% as.character(unique(democracy$Country))) %>% group_by(country) %>% summarise(rate_suicide = mean(rate_suicide, na.rm = T)) %>% arrange(country)

democracy <- democracy %>% filter(Country %in% as.character(unique(democracy_compare_data$country))) %>% arrange(Country)

democracy_compare_data$overall_score <- democracy$Overall_score

ggplot(data = democracy_compare_data, aes(overall_score, rate_suicide)) + geom_point(size = 2, alpha = 0.75, colour = "dark blue") + theme_bw() + geom_smooth(formula = y ~ x, method = "loess", se = F) + xlab("Democracy score (overall)") + ylab("Suicide rate (per million people)")

Scatter plot between democracy score and suicide rate showing positive correlation

tidy(cor.test(democracy$Overall_score, democracy_compare_data$rate_suicide, method = "pearson")) %>% kable()

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
0.3072375	2.833023	0.0058833	77	0.0924044	0.4947386	Pearson’s product-moment correlation	two.sided

tidy(cor.test(democracy$Overall_score, democracy_compare_data$rate_suicide, method = "spearman")) %>% kable()

estimate	statistic	p.value	method	alternative
0.3547168	53016.47	0.0013388	Spearman’s rank correlation rho	two.sided

There’s a weak (R = 0.307) but significant positive Pearson correlation between the Democracy Index and suicide rates. However, there are many confounding factors here, as more democratic countries are in general richer and may report suicide statistics with better accuracy. Also, there are huge cultural differences between countries. Among highly democratic nations the correlation is near zero:

democracy_compare_data %>% filter(overall_score > 6) %>% ggplot(aes(overall_score, rate_suicide)) + geom_point(size = 2, alpha = 0.75, colour = "dark blue") + theme_bw() + geom_smooth(formula = y ~ x, method = "loess", se = F) + xlab("Democracy score (overall)") + ylab("Suicide rate (per million people)")

Scatter plot between democracy score and suicide rate in highly democratic nations showing no clear correlation

Gross domestic product based on purchasing-power-parity (PPP) per capita GDP values (2015) in international dollars were obtained from the International Monetary Fund (IMF).

gdppc <- read.csv("WEO_Data.xls", sep = "\t")
gdppc$X2015 <- as.numeric(as.character(gdppc$X2015))

gdp_compare_data <- total_suicide_rate %>% filter(year >= 2012) %>% filter(country %in% as.character(unique(gdppc$Country))) %>% group_by(country) %>% summarise(rate_suicide = mean(rate_suicide, na.rm = T)) %>% arrange(country)

gdppc <- gdppc %>% filter(Country %in% as.character(unique(gdp_compare_data$country))) %>% arrange(Country)

Histogram of PPP GDP per capita (2015) Histogram of log of PPP GDP per capita (2015)

As the GDP variable is heavily skewed, it’s better to visualize it using its log transform:

Scatter plot between log of PPP GDP per capita and suicide rate with no clear correlation

tidy(cor.test(gdppc$X2015, gdp_compare_data$rate_suicide, method = "spearman")) %>% kable()

estimate	statistic	p.value	method	alternative
0.1861228	69440	0.0983024	Spearman’s rank correlation rho	two.sided

There does not seem to exist an apparent association between suicide rates and per capita GDP income.

Gender Ratios

female_rates <- who_suicide_statistics %>% filter(year >= 2012) %>% group_by(country, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit %>% arrange(country) %>% filter(sex == "female")

male_rates <- who_suicide_statistics %>% filter(year >= 2012) %>% group_by(country, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit %>% arrange(country) %>% filter(sex == "male")

gender_ratio <- data.frame(country = female_rates$country, ratio = male_rates$rate_suicide / female_rates$rate_suicide) %>% na.omit() %>% filter(is.finite(ratio))

hist(gender_ratio$ratio, main = NA, xlab = "Gender Ratio")

Histogram of suicide gender ratio

gender_ratio_gdp <- gender_ratio %>% filter(country %in% as.character(unique(gdppc$Country)))
gdppc_gender <- gdppc %>% filter(Country %in% as.character(unique(gender_ratio_gdp$country)))

tidy(cor.test(gender_ratio_gdp$ratio, gdppc_gender$X2015)) %>% kable()

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.2276648	-1.956149	0.0544387	70	-0.4363206	0.0042267	Pearson’s product-moment correlation	two.sided

gender_ratio_dem <- gender_ratio %>% filter(country %in% as.character(unique(democracy$Country)))
democracy_gender <- democracy %>% filter(Country %in% as.character(unique(gender_ratio_dem$country)))

tidy(cor.test(gender_ratio_dem$ratio, democracy_gender$Overall_score)) %>% kable()

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.0795524	-0.6724517	0.5034788	71	-0.3040547	0.153321	Pearson’s product-moment correlation	two.sided

There does not seem to be any association between gender ratios and Democracy Index nor per capita GDP.

Top 10 countries with the highest gender ratios (male-to-female) 2012-2016

head(gender_ratio %>% arrange(desc(ratio)), n = 20) %>% kable()

country	ratio
Bahrain	9.262603
Poland	6.992961
Saint Lucia	6.889143
Seychelles	6.841246
Slovakia	6.684513
Panama	6.439133
Mongolia	6.425235
Puerto Rico	6.290557
Costa Rica	6.000486
Romania	5.962328
Republic of Moldova	5.751732
Belize	5.586587
Latvia	5.480794
Lithuania	5.463001
Russian Federation	5.419557
Cyprus	5.224378
Reunion	5.115328
Ukraine	5.011528
Malta	4.980026
Georgia	4.875833

Top 10 countries with lowest positive gender ratios (male-to-female) 2012-2016

head(gender_ratio %>% filter(ratio > 0) %>% arrange(ratio), n = 20) %>% kable()

country	ratio
Kuwait	1.472849
Aruba	1.903816
Hong Kong SAR	1.915214
Uzbekistan	2.076370
Singapore	2.079089
Iran (Islamic Rep of)	2.116816
Fiji	2.120162
Netherlands	2.211227
Republic of Korea	2.304564
Norway	2.332093
Sweden	2.360817
Japan	2.424445
Virgin Islands (USA)	2.484681
Paraguay	2.492538
Turkmenistan	2.578286
Luxembourg	2.584304
Belgium	2.614295
Guatemala	2.636253
Saint Vincent and Grenadines	2.699769
New Zealand	2.759721

Age

Elderly suicide is an increasingly troublesome concern as the population grows older.

elderly_data <- who_suicide_statistics %>% filter(year >= 2012) %>% filter(age == "55-74 years" | age == "75+ years") %>% group_by(country) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population)) %>% na.omit %>% arrange(desc(rate_suicide))

Top 10 countries with the highest elderly suicide rates (2012-2016)

head(elderly_data, n = 10) %>% kable()

country	rate_suicide
Republic of Korea	494.6145
Lithuania	391.9060
Slovenia	346.1632
Hungary	342.0108
Guyana	333.2012
Suriname	307.5963
Serbia	296.0792
Croatia	289.7782
Cuba	277.7552
Uruguay	261.7103

This, however, can be biased due to a higher overall higher incidence of suicides in some countries. Thus, let’s calculate the percentage of total suicides that are elderly ones (55+ years).

total_elderly <- who_suicide_statistics %>% filter(year >= 2012) %>% filter(age == "55-74 years" | age == "75+ years") %>% group_by(country) %>% summarise(total_suicide = sum(suicides_no)) %>% na.omit

total_2012_16 <- total_suicide %>% filter(year >= 2012) %>% group_by(country) %>% summarise(total_suicide = sum(total_suicide, na.rm = T)) %>% filter(country %in% as.character(unique(total_elderly$country)))

elderly_proportion <- data.frame(country = total_elderly$country, proportion = total_elderly$total_suicide / total_2012_16$total_suicide)

elderly_proportion <- elderly_proportion[is.finite(elderly_proportion$proportion), ]

Top 10 countries with highest elderly suicide proportion (2012-2016)

head(elderly_proportion %>% arrange(desc(proportion)), n = 10) %>% kable()

country	proportion
Antigua and Barbuda	1.0000000
Serbia	0.6119500
Portugal	0.5895522
Bulgaria	0.5839448
Croatia	0.5570321
Hungary	0.5400160
Germany	0.5384444
Austria	0.5329861
Slovenia	0.5318396
Cuba	0.5177912

USA and Brazil: a case-study

I’ve selected two countries for further analysis: Brazil and USA, both very big countries with reliable data.

BR_data <- subset(who_suicide_statistics, country == "Brazil")

US_data <- subset(who_suicide_statistics, country == "United States of America")

Suicide rate over time in Brazil with a positive trend Suicide rate over time in the USA, the number drops until the year 2000 and rises again

Plot of suicide rate per gender showing higher rates in males (Brazil)

Gender differences can be calculated over time:

Suicide rate per gender over time in Brazil

sex_US_data <- US_data %>% group_by(year, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

sex_BR_data <- BR_data %>% group_by(year, sex) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

US_data_sexratio <- data.frame(year = subset(sex_US_data, sex == "male")$year, ratio = subset(sex_US_data, sex == "male")$rate_suicide / subset(sex_US_data, sex == "female")$rate_suicide, country = "US")

BR_data_sexratio <- data.frame(year = subset(sex_BR_data, sex == "male")$year, ratio = subset(sex_BR_data, sex == "male")$rate_suicide / subset(sex_BR_data, sex == "female")$rate_suicide, country = "BR")

data_sexratio <- rbind(US_data_sexratio, BR_data_sexratio)

Gender ratio of suicides in Brazil and USA

In Brazil, suicide rates for men have been steadily increasing since the 1980s, while rates for women have stayed roughly the same. In the US, however, suicide rates for men increased during the 80s (not followed by an increase in women’s rates), decline in the 2000s and has been increasing since 2005-6. This increase is now followed by a similar (but smaller) one in women’s rates. Thus, the men-to-women ratio increased with time in Brazil and decreased only after 2000 in the US. In 2015, for each woman, 4-4.5 men have ended their lives in Brazil or in the US.

age_data_usbr <- who_suicide_statistics %>% group_by(year, country, age) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

Suicide rates per age group in Brazil. Elderly people show higher rates

Both countries present the highest suicide rates for the elderly. However, in both cases, the gap between adults (25–34 years) and elderly (55+ years) is getting narrower since the 2000s, which shows that adult suicide is more likely now than compared to the past (1990s).

age_gender_usbr <- who_suicide_statistics %>% group_by(sex, year, country, age) %>% summarise(rate_suicide = sum(suicides_no) * 1000000 / sum(population), .groups = "drop_last") %>% na.omit

Suicide rate per gender and age group in Brazil

Interestingly, the high elderly suicide rate is apparently accounted for by only male people. There’s practically no age gap among women. This suggests that elderly suicide is almost exclusively a male issue in these countries.

Conclusion

This exploratory analysis is descriptive and serves the purpose to inform about overall characteristics and trends in global suicide reports provided by the WHO. Suicide is a complex social phenomenon and should not be interpreted simplistically. Still, the huge difference between genders in the age gap is of interest.

Exploratory data analysis: the WHO suicide dataset

The Data

Top 10 coun­tries and cor­re­spon­dent years with the high­est sui­cide rates

Top 10 coun­tries and cor­re­spon­dent years with the low­est pos­i­tive sui­cide rates

Top 20 coun­tries with the high­est sui­cide rates (2012-2016 av­er­age)

Top 20 coun­tries with the low­est pos­i­tive sui­cide rates (2012-2016 av­er­age)

Democ­racy Index

Gen­der Ra­tios

Top 10 coun­tries with the high­est gen­der ra­tios (male-​to-​female) 2012-2016

Top 10 coun­tries with low­est pos­i­tive gen­der ra­tios (male-​to-​female) 2012-2016

Age

Top 10 coun­tries with the high­est el­derly sui­cide rates (2012-2016)

Top 10 coun­tries with high­est el­derly sui­cide pro­por­tion (2012-2016)

USA and Brazil: a case-​study

Con­clu­sion