Understanding Employee Attrition: Data-Driven Insights

This analysis focuses on employee attrition using data from the IBM HR Analytics Employee Attrition & Performance. It explores possible reasons why employees leave their jobs and the process used to identify these insights. The examples provided illustrate key factors influencing attrition, but in a real-world setting, a more comprehensive analysis would be necessary.

1 Peronal Factors

1.1 The Role of Age in Employee Attrition: Trends and Insights

What is the distribution of age?

To start my analysis I decided to look if there is a certain age group that seem to be leaving the organization more than others. First I had to look at the age distarbution to determine the size and range of my bins.

The following graph visualize the age distribution to identify natural groupings for analysis.

Then I added a column that divided the ages into the groups I wanted. In addition, the Attrition column given as 0 (retained) and 1 (quit), therefore I re-coded the data to a factor, which will help me in the analysis moving forward.

The following two graphs looks at age group and quit status. The graph to the left point out that more employees who belong to age group 26-35 quit. However, the biggest age group in the data is employees between the age of 26-35 so it is not a surprise that it will be more people quitting there than other age groups. Therefore, it is important to look at the age group percentage. The graph to the right point out another age group that we should pay attention to, employees between 18-25. We can see that with age the quit rate decrease and than increase again when people are passing 56 year of age. Although there are differences in age, this analysis in not enough to determine age is the reason for quitting. Next I am going to test other possibilities that might contribute to employee attrition.

2 Categorical Analysis Using Chi-Square

A Chi-Square analysis was conducted to examine the relationship between Quit Status and several categorical variables, including gender, department, job role, marital status, and overtime. Job role, marital status, and overtime showed significant differences (p < 0.05), while department and education field had p-values less than 0.10, suggesting potential trends. These variables will be further explored to assess their role in predicting employee turnover.

Variable	Chi-Square Test Statistic	P-Value
Chi-Square Test Results for Quit Status vs Other Variables
Gender	0.2	0.66
Department	4.6	0.10
Job Role	66.4	0.00
Education Field	9.8	0.08
Marital Status	37.6	0.00
Overtime	67.8	0.00

2.1 Digging Deeper - Categorical Variables

What Factors Influence Employee Attrition?

--- title: "Understanding Employee Attrition: Data-Driven Insights" author: "" date: "" format: html: html-table-processing: none page-layout: full include-after-body: abbrv_toc.html number-sections: true toc: true toc-depth: 2 toc-title: Contents css: custom.css filters: - lightbox lightbox: auto --- This analysis focuses on employee attrition using data from the [*IBM HR Analytics Employee Attrition & Performance*](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset). It explores possible reasons why employees leave their jobs and the process used to identify these insights. The examples provided illustrate key factors influencing attrition, but in a real-world setting, a more comprehensive analysis would be necessary. # Peronal Factors ## The Role of Age in Employee Attrition: Trends and Insights {toc-text="Age"} ```{r} #| output: false #| include: false #| warning: false library(tidyverse) library(janitor) library(ggpattern) library(here) library(patchwork) library(gt) library(scales) library(GGally) library(stringr) ibm_data <- read_csv(here("ibm_dataset","ibm_dataset.csv")) |> clean_names() ``` > What is the distribution of age? To start my analysis I decided to look if there is a certain age group that seem to be leaving the organization more than others. First I had to look at the age distarbution to determine the size and range of my bins. The following graph visualize the age distribution to identify natural groupings for analysis. ```{r} #| echo: false #| warning: false p1 <- ggplot(ibm_data, aes(x = age)) + geom_histogram(fill = "#0066CC", color = "#e9ecef", alpha = 0.9) + stat_bin(aes(label = after_stat(count)), geom = "text", color = "white", vjust = 1.5, size = 3) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Age Distribution of Employees") p1 ``` Then I added a column that divided the ages into the groups I wanted. In addition, the Attrition column given as 0 (retained) and 1 (quit), therefore I re-coded the data to a factor, which will help me in the analysis moving forward. ```{r} #| echo: false ibm_data <- ibm_data |> mutate( quit_status = factor(attrition, levels = c(0,1), labels = c("Retained","Quit")), over_time = factor(over_time, levels = c("Yes","No"), labels = c("Overtime","No Overtime")), age_group = factor(cut(age, breaks = c(18, 25, 35, 45, 55, 65), labels = c("18-25", "26-35", "36-45", "46-55", "56-65"), right = FALSE)) ) ``` The following two graphs looks at age group and quit status. The graph to the left point out that more employees who belong to age group 26-35 quit. However, the biggest age group in the data is employees between the age of 26-35 so it is not a surprise that it will be more people quitting there than other age groups. Therefore, it is important to look at the age group percentage. The graph to the right point out another age group that we should pay attention to, employees between 18-25. We can see that with age the quit rate decrease and than increase again when people are passing 56 year of age. Although there are differences in age, this analysis in not enough to determine age is the reason for quitting. Next I am going to test other possibilities that might contribute to employee attrition. ```{r} #| echo: false #| warning: false # Number of Employees Who Quit vs. Stayed Across Age Groups p2 <- ggplot(ibm_data, aes(x = age_group, fill = quit_status)) + geom_bar(position = "dodge", alpha = 0.9) + geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.5) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange"), labels = c("Retained", "Quit")) + labs( x = "Age Group", y = "Number of Employees", fill = "Quit Status", ) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Quit and Retention Percentages by Age Group (Stacked + percent) p3 <- ggplot(ibm_data, aes(x = age_group, fill = quit_status)) + geom_bar(position = "fill", alpha=0.9, show.legend = FALSE) + scale_y_continuous(labels = scales::percent) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange"), labels = c("Retained", "Quit")) + labs(x = "Age Group", y = "Percentage", fill = "Quit Status") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) (p2 | p3) + plot_annotation( title = "Number of Employees Who Quit vs. Retained", theme = theme(plot.title = element_text(hjust = 0.5)) ) + plot_layout(guides = "collect") & theme(legend.position = "top") ``` # Categorical Analysis Using Chi-Square A Chi-Square analysis was conducted to examine the relationship between Quit Status and several categorical variables, including gender, department, job role, marital status, and overtime. Job role, marital status, and overtime showed significant differences (p < 0.05), while department and education field had p-values less than 0.10, suggesting potential trends. These variables will be further explored to assess their role in predicting employee turnover. ```{r} #| echo: false #| warning: false #| message: false #| include: false # List of variables for which you want to create contingency tables with Quit_Status vars <- c("gender", "department", "job_role", "education_field", "marital_status", "over_time") # Capitalize the variable names for the output vars_capitalized <- c("Gender", "Department", "Job Role", "Education Field", "Marital Status", "Overtime") # Create the list of contingency tables with Quit_Status and each other variable contingency_list <- lapply(vars, function(var) { table(ibm_data[[var]], ibm_data$quit_status) }) ``` ```{r} #| echo: false #| warning: false #| message: false #| include: false # Perform Chi-Square test on each contingency table and store results chi_results <- lapply(contingency_list, function(cont_table) { test_result <- chisq.test(cont_table) c(Statistic = test_result$statistic, P_Value = test_result$p.value, Df = test_result$parameter) }) # Collate the results into a data frame chi_results_df <- do.call(rbind, chi_results) chi_results_df <- as.data.frame(chi_results_df) chi_results_df$Variable <- vars_capitalized ``` ```{r} #| echo: false #| warning: false #| message: false #| include: false # Format the results into a gt table with capitalized column labels and highlighted p-values chi_results_gt <- chi_results_df |> rename(chi_test_score = `Statistic.X-squared`, p_value = P_Value, df = Df.df) |> select(-df) |> relocate(Variable, 1) |> mutate(chi_test_score = round(chi_test_score, 1), p_value = round(p_value, 2)) ``` ```{r} #| echo: false #| warning: false #| message: false #| include: true gt(chi_results_gt) |> tab_header( title = "Chi-Square Test Results for Quit Status vs Other Variables" ) |> cols_label( Variable = "Variable", chi_test_score = "Chi-Square Test Statistic", p_value = "P-Value", ) |> # Highlight rows with p-value ≤ 0.05 tab_style( style = cell_fill(color = "#FFE4E1"), # Light red background locations = cells_body( rows = p_value <= 0.05 ) ) |> # Optionally bold the p-value in significant results tab_style( style = cell_text(weight = "bold"), locations = cells_body( rows = p_value <= 0.05, columns = vars(p_value) ) ) ``` ## Digging Deeper - Categorical Variables {toc-text="Categorical Var"} > What Factors Influence Employee Attrition? ```{r} #| echo: false #| warning: false #| message: false #| include: true #| fig-width: 15 #| fig-height: 17 # Education p4 <- ggplot(ibm_data, aes(x = education_field, fill = quit_status)) + geom_bar(position = "dodge", alpha = 0.9) + geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.2, size = 3.5) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs( x = "Education Field", y = "Number of Employees", fill = "Quit Status", ) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p5 <- ggplot(ibm_data, aes(x = education_field, fill = quit_status)) + geom_bar(position = "fill", alpha=0.9, show.legend = FALSE) + scale_y_continuous(labels = scales::percent) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs(x = "Job Role", y = "Percentage", fill = "Quit Status") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Job Role p6 <- ggplot(ibm_data, aes(x = job_role, fill = quit_status)) + geom_bar(position = "dodge", alpha = 0.9) + geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.2, size = 3.5) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs( x = "Job Role", y = "Number of Employees", fill = "Quit Status" ) + theme_minimal() + theme( panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.y = element_text(size = 10) ) + scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) p7 <- ggplot(ibm_data, aes(x = job_role, fill = quit_status)) + geom_bar(position = "fill", alpha = 0.9, show.legend = FALSE) + scale_y_continuous(labels = scales::percent) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs(x = "Education Field", y = "Percentage", fill = "Quit Status") + theme_minimal() + theme( panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.y = element_text(size = 8), plot.margin = margin(12, 12, 12, 12) ) + scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) # Marital Status p8 <- ggplot(ibm_data, aes(x = marital_status, fill = quit_status)) + geom_bar(position = "dodge", alpha = 0.9) + geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.2, size = 3.5) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs( x = "Marital Status", y = "Number of Employees", fill = "Quit Status", ) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p9 <- ggplot(ibm_data, aes(x = marital_status, fill = quit_status)) + geom_bar(position = "fill", alpha=0.9, show.legend = FALSE) + scale_y_continuous(labels = scales::percent) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs(x = "Marital Status", y = "Percentage", fill = "Quit Status") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Overtime p10 <- ggplot(ibm_data, aes(x = over_time, fill = quit_status)) + geom_bar(position = "dodge", alpha = 0.9) + geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.2, size = 3.5) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs( x = "Overtime", y = "Number of Employees", fill = "Quit Status", ) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p11 <- ggplot(ibm_data, aes(x = over_time, fill = quit_status)) + geom_bar(position = "fill", alpha=0.9, show.legend = FALSE) + scale_y_continuous(labels = scales::percent) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs(x = "Overtime", y = "Percentage", fill = "Quit Status") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Department p12 <- ggplot(ibm_data, aes(x = department, fill = quit_status)) + geom_bar(position = "dodge", alpha = 0.9) + geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.2, size = 3.5) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs( x = "Department", y = "Number of Employees", fill = "Quit Status", ) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) p13 <- ggplot(ibm_data, aes(x = department, fill = quit_status)) + geom_bar(position = "fill", alpha=0.9, show.legend = FALSE) + scale_y_continuous(labels = scales::percent) + scale_fill_manual(values = c("Retained" = "#0066CC", "Quit" = "orange")) + labs(x = "Department", y = "Percentage", fill = "Quit Status") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) (p6 | p7)/ (p10 | p11)/ (p8 | p9)/ (p4 | p5)/ (p12 | p13) + plot_annotation( title = "Number of Employees Who Quit vs. Retained", theme = theme(plot.title = element_text(hjust = 0.5)) ) + plot_layout(guides = "collect") & theme(legend.position = "top") ``` ```{r} #| echo: false #| warning: false #| message: false #| include: false cat_df <- ibm_data |> select(quit_status, job_role, over_time, department, education_field) correlation_label <- function(data, mapping, ...) { cor_val <- cor(as.numeric(data[[mapping$x]]), as.numeric(data[[mapping$y]]), use = "complete.obs") ggplot2::geom_text(aes(label = sprintf("%.2f", cor_val)), color = "black", size = 3, ...) } # Create the pairwise plot with customized upper triangle for correlation values ggpairs(cat_df, aes(color = quit_status, alpha = 0.5), upper = list(continuous = correlation_label), # Use correlation values in the upper triangle lower = list(continuous = "smooth"), # Use scatter plots in the lower triangle diag = list(continuous = "barDiag"), # Diagonal shows bar plots for categorical data discrete = list(combo = "facetdensity")) ```