Predicting attrition using HR analytics

Sivakumar V
6 min read · Jun 5, 2021

Problem Statement

Every year, companies hire a large number of employees and invest time and money in training them; they also run training programs for their existing employees to increase their effectiveness. But where does HR analytics fit into this? And is it only about improving employee performance?

HR Analytics

Human resource analytics (HR analytics) is an area in the field of analytics that refers to applying analytic processes to the human resource department of an organization in the hope of improving employee performance and therefore getting a better return on investment. HR analytics does not just deal with gathering data on employee efficiency. Instead, it aims to provide insight into each process by gathering data and then using it to make relevant decisions about how to improve these processes.

Attrition in HR

Attrition in human resources refers to the gradual loss of employees over time. In general, relatively high attrition is problematic for companies. HR professionals often assume a leadership role in designing company compensation programs, work culture, and motivation systems that help the organization retain top employees.

How does attrition affect companies, and how does HR analytics help in analyzing it? We will discuss the first question here; for the second, we will write the code and try to understand the process step by step.

Attrition affecting Companies

A major problem with high employee attrition is its cost to an organization. Job postings, hiring processes, paperwork, and new-hire training are some of the common expenses of losing employees and replacing them. Additionally, regular employee turnover prevents your organization from increasing its collective knowledge base and experience over time. This is especially concerning if your business is customer-facing, as customers often prefer to interact with familiar people, and errors and issues are more likely if you constantly have new workers.

Importing Libraries

# Data analysis and manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Machine learning utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import power_transform

# Model
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Data cleaning and wrangling
from scipy.stats import boxcox, zscore

Getting the Data

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
pd.set_option('display.max_columns', None)
df.head()

Data Analysis

[First rows of the HR attrition dataset, the output of df.head()]

Shape of the Dataset

df.shape
(1470, 35)

Observation of Dataset

  • The target variable is Attrition.
  • This is a supervised binary classification problem, so logistic regression is a natural baseline (see the class-balance check below).
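Before modeling, it is worth checking how imbalanced the target is; the test-set support later in the post (296 "No" vs. 43 "Yes") suggests roughly a 5:1 split. A quick check:

# Class counts of the target variable
df['Attrition'].value_counts()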

Checking DataType

df.dtypes
Age int64
Attrition object
BusinessTravel object
DailyRate int64
Department object
DistanceFromHome int64
Education int64
EducationField object
EmployeeCount int64
EmployeeNumber int64
EnvironmentSatisfaction int64
Gender object
HourlyRate int64
JobInvolvement int64
JobLevel int64
JobRole object
JobSatisfaction int64
MaritalStatus object
MonthlyIncome int64
MonthlyRate int64
NumCompaniesWorked int64
Over18 object
OverTime object
PercentSalaryHike int64
PerformanceRating int64
RelationshipSatisfaction int64
StandardHours int64
StockOptionLevel int64
TotalWorkingYears int64
TrainingTimesLastYear int64
WorkLifeBalance int64
YearsAtCompany int64
YearsInCurrentRole int64
YearsSinceLastPromotion int64
YearsWithCurrManager int64
dtype: object
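The dtype listing shows a mix of object and integer columns. A small helper sketch (not in the original post) to separate them for the encoding and scaling steps later:

# Split column names by dtype for later encoding and scaling
cat_cols = df.select_dtypes(include='object').columns.tolist()
num_cols = df.select_dtypes(include='number').columns.tolist()
print(len(cat_cols), "categorical columns:", cat_cols)
print(len(num_cols), "numeric columns")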

Checking Null values

df.isnull().sum()

Age                         0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64

EDA Concluding Remarks

df.describe()

Observation:

  • Outliers in the categorical/ordinal columns should not be considered.
  • For DistanceFromHome, MonthlyIncome, and NumCompaniesWorked, the gap between the mean and median indicates skewness in the data.
  • We need to check for outliers another way, e.g., with box plots.

Checking Outliers

# Draw a box plot for each column to spot outliers visually
df.plot(kind='box', subplots=True, layout=(4, 10))
plt.subplots_adjust(top=2, bottom=1.25, right=1.5)
plt.show()

Pre-Processing Pipeline

Univariate Analysis

# Plot the class counts of the target and of Department
sns.countplot(df.Attrition)
plt.show()
sns.countplot(df.Department)
plt.show()
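The z-score and skewness outputs below include object columns such as Attrition and BusinessTravel, which implies the categorical columns were label-encoded at this point, although the post does not show that step. A minimal sketch of what the encoding likely looked like:

from sklearn.preprocessing import LabelEncoder

# Convert every object column to integer codes so that
# zscore, skew, and power_transform can operate on the full frame
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])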

Removing outlier from Data

from scipy.stats import zscore

# Absolute z-score of every (now numeric) column
z = np.abs(zscore(df))
z

Filtering on these z-scores deletes every row: the constant columns (EmployeeCount, Over18, StandardHours) have zero variance, so their z-scores are NaN and the usual z < 3 condition fails everywhere. We therefore keep the outliers in the dataset, as the sketch below illustrates.
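For reference, this is the conventional filter that was rejected (a sketch using the z computed above):

# Keep only rows where every z-score is below 3
df_filtered = df[(z < 3).all(axis=1)]
print(df_filtered.shape)  # (0, 35): the NaN z-scores from the constant
                          # columns make the condition fail for every row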

Checking Skewness

df.skew()
Age 0.413286
Attrition 1.844366
BusinessTravel -1.439006
DailyRate -0.003519
Department 0.172231
DistanceFromHome 0.958118
Education -0.289681
EducationField 0.550371
EmployeeCount 0.000000
EmployeeNumber 0.016574
EnvironmentSatisfaction -0.321654
Gender -0.408665
HourlyRate -0.032311
JobInvolvement -0.498419
JobLevel 1.025401
JobRole -0.357270
JobSatisfaction -0.329672
MaritalStatus -0.152175
MonthlyIncome 1.369817
MonthlyRate 0.018578
NumCompaniesWorked 1.026471
Over18 0.000000
OverTime 0.964489
PercentSalaryHike 0.821128
PerformanceRating 1.921883
RelationshipSatisfaction -0.302828
StandardHours 0.000000
StockOptionLevel 0.968980
TotalWorkingYears 1.117172
TrainingTimesLastYear 0.553124
WorkLifeBalance -0.552480
YearsAtCompany 1.764529
YearsInCurrentRole 0.917363
YearsSinceLastPromotion 1.984290
YearsWithCurrManager 0.833451
dtype: float64

As a rule of thumb, skewness greater than 0.5 or less than -0.5 is considered a departure from a normal distribution. (EmployeeCount, Over18, and StandardHours show exactly zero skew because they are constant columns.)
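To list the columns that cross this threshold programmatically (a small sketch):

# Columns whose absolute skewness exceeds the 0.5 rule of thumb
skewed_cols = df.skew()[df.skew().abs() > 0.5]
print(skewed_cols.sort_values(ascending=False))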

Building Machine Learning Models

Splitting the data

x = df.drop('Attrition', axis=1)
y = df['Attrition']

Removing Skewness

# Power transform (Yeo-Johnson by default in sklearn) to reduce skewness
df_x = power_transform(x)
df_x = pd.DataFrame(df_x)
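A quick check that the transform actually reduced the skewness (a sketch, assuming df_x from above):

# Restore column names, then compare skewness before and after the transform
df_x.columns = x.columns
print(pd.DataFrame({'before': x.skew(), 'after': df_x.skew()}).round(2))

Note that the next step fits the scaler on the original x rather than on df_x, so the de-skewed features are never actually used downstream; scaling df_x instead would carry the transform through to the models.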

StandardScaler

from sklearn.preprocessing import StandardScaler

# Scale features to zero mean and unit variance
scaler = StandardScaler()
x = scaler.fit_transform(x)

Finding the random state that gives the best accuracy

maxAccu = 0
maxRS = 0
for i in range(1, 200):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.23, random_state=i)
    LR = LogisticRegression()
    LR.fit(x_train, y_train)
    pred = LR.predict(x_test)
    acc = accuracy_score(y_test, pred)
    if acc > maxAccu:
        maxAccu = acc
        maxRS = i
print(f"Best accuracy is {maxAccu} and best random state {maxRS}")

Best accuracy is 0.9144542772861357 and best random state 120

# Splitting test and training data with the best random state
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.23, random_state=120)

# Size of the training and testing datasets
print("X_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)

X_train (1131, 34)
y_train (1131,)
x_test (339, 34)
y_test (339,)


Logistic Regression Model

# Helper to print the accuracy score, confusion matrix, and classification report
def metrics_model(y_test, pred):
    print("Accuracy score is", accuracy_score(y_test, pred))
    print("." * 80)
    print("Confusion matrix value is:\n", confusion_matrix(y_test, pred))
    print("." * 80)
    print(classification_report(y_test, pred))

lg = LogisticRegression()
lg.fit(x_train, y_train)
pred = lg.predict(x_test)
metrics_model(y_test, pred)
Accuracy score is 0.9144542772861357
................................................................................
Confusion matrix value is:
 [[293   3]
 [ 26  17]]
................................................................................
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       296
           1       0.85      0.40      0.54        43

    accuracy                           0.91       339
   macro avg       0.88      0.69      0.75       339
weighted avg       0.91      0.91      0.90       339
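Accuracy looks strong, but recall on the minority class (the employees who actually leave) is only 0.40, and that is the class we care about in an attrition problem. One common adjustment, not used in the original post, is to reweight the classes:

# Hypothetical variant: weight errors on the rare class more heavily
lg_bal = LogisticRegression(class_weight='balanced', max_iter=1000)
lg_bal.fit(x_train, y_train)
metrics_model(y_test, lg_bal.predict(x_test))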

Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

# A regression tree on the 0/1 target; its predictions happen to be
# exactly 0 or 1 here, so the classification metrics below still apply
dcr = DecisionTreeRegressor()
dcr.fit(x_train, y_train)
prec = dcr.predict(x_test)
metrics_model(y_test, prec)
Accuracy score is 0.7964601769911505
................................................................................
Confusion matrix value is:
 [[253  43]
 [ 26  17]]
................................................................................
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       296
           1       0.28      0.40      0.33        43

    accuracy                           0.80       339
   macro avg       0.60      0.63      0.61       339
weighted avg       0.83      0.80      0.81       339

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Grid search over the split criterion and splitter strategy
# (these criterion names are from scikit-learn <1.0; newer versions
# use 'squared_error' / 'absolute_error' instead of 'mse' / 'mae')
params = {'criterion': ["mse", "friedman_mse", "mae"], 'splitter': ["best", "random"]}
dcrh = GridSearchCV(DecisionTreeRegressor(), params, cv=5)
dcrh.fit(x_train, y_train)
dcrh.best_params_

# Refit the tree with the best parameters found
dcr = DecisionTreeRegressor(criterion='mse', splitter='random')
dcr.fit(x_train, y_train)
prec = dcr.predict(x_test)
metrics_model(y_test, prec)
Accuracy score is 0.7817109144542773
................................................................................
Confusion matrix value is:
 [[247  49]
 [ 25  18]]
................................................................................
              precision    recall  f1-score   support

           0       0.91      0.83      0.87       296
           1       0.27      0.42      0.33        43

    accuracy                           0.78       339
   macro avg       0.59      0.63      0.60       339
weighted avg       0.83      0.78      0.80       339
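Since Attrition is a binary target, a DecisionTreeClassifier is arguably the more natural estimator than the regressor used above. A sketch of that variant (not in the original post):

from sklearn.tree import DecisionTreeClassifier

# Classification counterpart of the tree above; fixed seed for reproducibility
dtc = DecisionTreeClassifier(random_state=120)
dtc.fit(x_train, y_train)
metrics_model(y_test, dtc.predict(x_test))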

Concluding Remarks

Comparing the models above, logistic regression gives the best accuracy (about 91%, versus roughly 78-80% for the decision tree), so we keep it as the final model.

Metrics: AUC-ROC Curve

from sklearn.metrics import roc_curve, roc_auc_score, plot_roc_curve

# Predicted probability of the positive class from the logistic regression model
Y_pred_pb = lg.predict_proba(x_test)[:, 1]
fpr, tpr, threshold = roc_curve(y_test, Y_pred_pb)
roc_auc_score(y_test, Y_pred_pb)

# plot_roc_curve expects a classifier, so we plot the logistic regression model
plot_roc_curve(lg, x_test, y_test)

Saving the Model

We found the best machine learning model, and now we need to save it.

The joblib library is used to save the final logistic regression model object.

import joblib

# First method: dump the fitted model to disk
joblib.dump(lg, "HR_Analytics.obj")

# Loading the model back from the file
job = joblib.load("HR_Analytics.obj")
job
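A quick sanity check that the reloaded object still predicts (assuming x_test from earlier):

# The reloaded model should reproduce the original model's predictions
print((job.predict(x_test) == lg.predict(x_test)).all())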
