Predicting attrition using HR analytics

Sivakumar V
6 min read · Jun 5, 2021

Problem Statement

Every year, companies hire a large number of employees and invest time and money in training them; they also run training programs for their existing employees to increase their effectiveness. But where does HR analytics fit into this? And is it only about improving employee performance?

HR Analytics

Human resource analytics (HR analytics) is an area in the field of analytics that refers to applying analytic processes to the human resource department of an organization in the hope of improving employee performance and therefore getting a better return on investment. HR analytics does not just deal with gathering data on employee efficiency. Instead, it aims to provide insight into each process by gathering data and then using it to make relevant decisions about how to improve these processes.

Attrition in HR

Attrition in human resources refers to the gradual loss of employees over time. In general, relatively high attrition is problematic for companies. HR professionals often assume a leadership role in designing company compensation programs, work culture, and motivation systems that help the organization retain top employees.

How does attrition affect companies, and how does HR analytics help in analyzing it? We will discuss the first question here; for the second, we will write the code and try to understand the process step by step.

Attrition affecting Companies

A major problem with high employee attrition is its cost to an organization. Job postings, hiring processes, paperwork, and new-hire training are some of the common expenses of losing employees and replacing them. Additionally, regular employee turnover prevents your organization from increasing its collective knowledge base and experience over time. This is especially concerning if your business is customer-facing, as customers often prefer to interact with familiar people, and errors and issues are more likely if you constantly have new workers.

Importing Libraries

# Data analysis and manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Machine learning utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import power_transform

# Model
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Data cleaning and wrangling
from scipy.stats import boxcox, zscore

Getting the Data

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
pd.set_option('display.max_columns', None)
df.head()

Data Analysis

[First rows of the HR attrition dataset, the output of df.head()]

Shape of the Dataset

df.shape
(1470, 35)

Observation of Dataset

  • The target variable is Attrition.
  • This is a supervised binary classification problem, so logistic regression is a natural baseline (see the class-balance check below).
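Before modeling, it is worth checking how imbalanced the target is; the test-set support later in the post (296 "No" vs. 43 "Yes") suggests roughly a 5:1 split. A quick check:

# Class counts of the target variable
df['Attrition'].value_counts()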

Checking DataType

df.dtypes
Age int64
Attrition object
BusinessTravel object
DailyRate int64
Department object
DistanceFromHome int64
Education int64
EducationField object
EmployeeCount int64
EmployeeNumber int64
EnvironmentSatisfaction int64
Gender object
HourlyRate int64
JobInvolvement int64
JobLevel int64
JobRole object
JobSatisfaction int64
MaritalStatus object
MonthlyIncome int64
MonthlyRate int64
NumCompaniesWorked int64
Over18 object
OverTime object
PercentSalaryHike int64
PerformanceRating int64
RelationshipSatisfaction int64
StandardHours int64
StockOptionLevel int64
TotalWorkingYears int64
TrainingTimesLastYear int64
WorkLifeBalance int64
YearsAtCompany int64
YearsInCurrentRole int64
YearsSinceLastPromotion int64
YearsWithCurrManager int64
dtype: object
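The dtype listing shows a mix of object and integer columns. A small helper sketch (not in the original post) to separate them for the encoding and scaling steps later:

# Split column names by dtype for later encoding and scaling
cat_cols = df.select_dtypes(include='object').columns.tolist()
num_cols = df.select_dtypes(include='number').columns.tolist()
print(len(cat_cols), "categorical columns:", cat_cols)
print(len(num_cols), "numeric columns")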

Checking Null values

df.isnull().sum()

Age                         0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64

EDA Concluding Remarks

df.describe()

Observation:

  • Outliers in the categorical/ordinal columns should not be considered.
  • For DistanceFromHome, MonthlyIncome, and NumCompaniesWorked, the gap between the mean and median indicates skewness in the data.
  • We need to check for outliers another way, e.g., with box plots.

Checking Outliers

# Draw a box plot for each column to spot outliers visually
df.plot(kind='box', subplots=True, layout=(4, 10))
plt.subplots_adjust(top=2, bottom=1.25, right=1.5)
plt.show()

Pre-Processing Pipeline

Univariate Analysis

# Plot the class counts of the target and of Department
sns.countplot(df.Attrition)
plt.show()
sns.countplot(df.Department)
plt.show()
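The z-score and skewness outputs below include object columns such as Attrition and BusinessTravel, which implies the categorical columns were label-encoded at this point, although the post does not show that step. A minimal sketch of what the encoding likely looked like:

from sklearn.preprocessing import LabelEncoder

# Convert every object column to integer codes so that
# zscore, skew, and power_transform can operate on the full frame
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])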

Removing outlier from Data

from scipy.stats import zscore

# Absolute z-score of every (now numeric) column
z = np.abs(zscore(df))
z

Filtering on these z-scores deletes every row: the constant columns (EmployeeCount, Over18, StandardHours) have zero variance, so their z-scores are NaN and the usual z < 3 condition fails everywhere. We therefore keep the outliers in the dataset, as the sketch below illustrates.
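For reference, this is the conventional filter that was rejected (a sketch using the z computed above):

# Keep only rows where every z-score is below 3
df_filtered = df[(z < 3).all(axis=1)]
print(df_filtered.shape)  # (0, 35): the NaN z-scores from the constant
                          # columns make the condition fail for every row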

Checking Skewness

df.skew()
Age 0.413286
Attrition 1.844366
BusinessTravel -1.439006
DailyRate -0.003519
Department 0.172231
DistanceFromHome 0.958118
Education -0.289681
EducationField 0.550371
EmployeeCount 0.000000
EmployeeNumber 0.016574
EnvironmentSatisfaction -0.321654
Gender -0.408665
HourlyRate -0.032311
JobInvolvement -0.498419
JobLevel 1.025401
JobRole -0.357270
JobSatisfaction -0.329672
MaritalStatus -0.152175
MonthlyIncome 1.369817
MonthlyRate 0.018578
NumCompaniesWorked 1.026471
Over18 0.000000
OverTime 0.964489
PercentSalaryHike 0.821128
PerformanceRating 1.921883
RelationshipSatisfaction -0.302828
StandardHours 0.000000
StockOptionLevel 0.968980
TotalWorkingYears 1.117172
TrainingTimesLastYear 0.553124
WorkLifeBalance -0.552480
YearsAtCompany 1.764529
YearsInCurrentRole 0.917363
YearsSinceLastPromotion 1.984290
YearsWithCurrManager 0.833451
dtype: float64

As a rule of thumb, skewness greater than 0.5 or less than -0.5 is considered a departure from a normal distribution. (EmployeeCount, Over18, and StandardHours show exactly zero skew because they are constant columns.)
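To list the columns that cross this threshold programmatically (a small sketch):

# Columns whose absolute skewness exceeds the 0.5 rule of thumb
skewed_cols = df.skew()[df.skew().abs() > 0.5]
print(skewed_cols.sort_values(ascending=False))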

Building Machine Learning Models

Splitting the data

x = df.drop('Attrition', axis=1)
y = df['Attrition']

Removing Skewness

# Power transform (Yeo-Johnson by default in sklearn) to reduce skewness
df_x = power_transform(x)
df_x = pd.DataFrame(df_x)
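A quick check that the transform actually reduced the skewness (a sketch, assuming df_x from above):

# Restore column names, then compare skewness before and after the transform
df_x.columns = x.columns
print(pd.DataFrame({'before': x.skew(), 'after': df_x.skew()}).round(2))

Note that the next step fits the scaler on the original x rather than on df_x, so the de-skewed features are never actually used downstream; scaling df_x instead would carry the transform through to the models.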

StandardScaler

from sklearn.preprocessing import StandardScaler

# Scale features to zero mean and unit variance
scaler = StandardScaler()
x = scaler.fit_transform(x)

Finding the random state that gives the best accuracy

maxAccu = 0
maxRS = 0
for i in range(1, 200):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.23, random_state=i)
    LR = LogisticRegression()
    LR.fit(x_train, y_train)
    pred = LR.predict(x_test)
    acc = accuracy_score(y_test, pred)
    if acc > maxAccu:
        maxAccu = acc
        maxRS = i
print(f"Best accuracy is {maxAccu} and best random state {maxRS}")

Best accuracy is 0.9144542772861357 and best random state 120

# Splitting test and training data with the best random state
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.23, random_state=120)

# Size of the training and testing datasets
print("X_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)

X_train (1131, 34)
y_train (1131,)
x_test (339, 34)
y_test (339,)


Logistic Regression Model

# Helper to print the accuracy score, confusion matrix, and classification report
def metrics_model(y_test, pred):
    print("Accuracy score is", accuracy_score(y_test, pred))
    print("." * 80)
    print("Confusion matrix value is:\n", confusion_matrix(y_test, pred))
    print("." * 80)
    print(classification_report(y_test, pred))

lg = LogisticRegression()
lg.fit(x_train, y_train)
pred = lg.predict(x_test)
metrics_model(y_test, pred)
Accuracy score is 0.9144542772861357
................................................................................
Confusion matrix value is:
 [[293   3]
 [ 26  17]]
................................................................................
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       296
           1       0.85      0.40      0.54        43

    accuracy                           0.91       339
   macro avg       0.88      0.69      0.75       339
weighted avg       0.91      0.91      0.90       339
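Accuracy looks strong, but recall on the minority class (the employees who actually leave) is only 0.40, and that is the class we care about in an attrition problem. One common adjustment, not used in the original post, is to reweight the classes:

# Hypothetical variant: weight errors on the rare class more heavily
lg_bal = LogisticRegression(class_weight='balanced', max_iter=1000)
lg_bal.fit(x_train, y_train)
metrics_model(y_test, lg_bal.predict(x_test))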

Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

# A regression tree on the 0/1 target; its predictions happen to be
# exactly 0 or 1 here, so the classification metrics below still apply
dcr = DecisionTreeRegressor()
dcr.fit(x_train, y_train)
prec = dcr.predict(x_test)
metrics_model(y_test, prec)
Accuracy score is 0.7964601769911505
................................................................................
Confusion matrix value is:
 [[253  43]
 [ 26  17]]
................................................................................
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       296
           1       0.28      0.40      0.33        43

    accuracy                           0.80       339
   macro avg       0.60      0.63      0.61       339
weighted avg       0.83      0.80      0.81       339

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Grid search over the split criterion and splitter strategy
# (these criterion names are from scikit-learn <1.0; newer versions
# use 'squared_error' / 'absolute_error' instead of 'mse' / 'mae')
params = {'criterion': ["mse", "friedman_mse", "mae"], 'splitter': ["best", "random"]}
dcrh = GridSearchCV(DecisionTreeRegressor(), params, cv=5)
dcrh.fit(x_train, y_train)
dcrh.best_params_

# Refit the tree with the best parameters found
dcr = DecisionTreeRegressor(criterion='mse', splitter='random')
dcr.fit(x_train, y_train)
prec = dcr.predict(x_test)
metrics_model(y_test, prec)
Accuracy score is 0.7817109144542773
................................................................................
Confusion matrix value is:
 [[247  49]
 [ 25  18]]
................................................................................
              precision    recall  f1-score   support

           0       0.91      0.83      0.87       296
           1       0.27      0.42      0.33        43

    accuracy                           0.78       339
   macro avg       0.59      0.63      0.60       339
weighted avg       0.83      0.78      0.80       339
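Since Attrition is a binary target, a DecisionTreeClassifier is arguably the more natural estimator than the regressor used above. A sketch of that variant (not in the original post):

from sklearn.tree import DecisionTreeClassifier

# Classification counterpart of the tree above; fixed seed for reproducibility
dtc = DecisionTreeClassifier(random_state=120)
dtc.fit(x_train, y_train)
metrics_model(y_test, dtc.predict(x_test))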

Concluding Remarks

Comparing the models above, logistic regression gives the best accuracy (about 91%, versus roughly 78-80% for the decision tree), so we keep it as the final model.

Metrics: AUC-ROC Curve

from sklearn.metrics import roc_curve, roc_auc_score, plot_roc_curve

# Predicted probability of the positive class from the logistic regression model
Y_pred_pb = lg.predict_proba(x_test)[:, 1]
fpr, tpr, threshold = roc_curve(y_test, Y_pred_pb)
roc_auc_score(y_test, Y_pred_pb)

# plot_roc_curve expects a classifier, so we plot the logistic regression model
plot_roc_curve(lg, x_test, y_test)

Saving the Model

We found the best machine learning model, and now we need to save it.

The joblib library is used to save the final logistic regression model object.

import joblib

# First method: dump the fitted model to disk
joblib.dump(lg, "HR_Analytics.obj")

# Loading the model back from the file
job = joblib.load("HR_Analytics.obj")
job
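A quick sanity check that the reloaded object still predicts (assuming x_test from earlier):

# The reloaded model should reproduce the original model's predictions
print((job.predict(x_test) == lg.predict(x_test)).all())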
