Kaggle case: predicting employee turnover (artificial intelligence tells you the answer)
################## ============== Loading Packages ============== ##################
library(plyr) # dependency of Rmisc; when dplyr is also used, plyr must be loaded before dplyr
library(dplyr) # filter()
library(ggplot2) # ggplot()
library(DT) # datatable() Create an interactive data table
library(caret) # createDataPartition() stratified sampling function
library(rpart) # rpart()
library(e1071) # naiveBayes()
library(pROC) # roc()
library(Rmisc) # multiplot() combines several ggplots in one drawing area
################## ============== Import Data ============== ##################
hr <- read.csv("D:/R/天善智能/书豪å大案例/Employee turnover prediction/HR_comma_sep.csv") # adjust the path to your local copy of the dataset
str(hr) # View the basic data structure of the data
Descriptive analysis
################## ============== Descriptive Analysis ============== ##################
summary(hr) # Calculate the main descriptive statistics of the data
# some of the models used later require the target variable to be a factor, so convert it
hr$left <- factor(hr$left, levels = c('0', '1'))
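Before plotting, it helps to know how unbalanced the target variable is. A quick check of the class distribution (this step is an addition, not part of the original script):
# Counts and proportions of employees who stayed (0) and left (1)
table(hr$left)
prop.table(table(hr$left))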
## Explore how satisfaction, performance evaluation, average monthly hours, and tenure relate to turnover
# Draw a box plot of satisfaction with the company and whether or not to leave
box_sat <- ggplot(hr, aes(x = left, y = satisfaction_level, fill = left)) +
geom_boxplot() +
theme_bw() + # a ggplot theme
labs(x = 'left', y = 'satisfaction_level') # Set the horizontal and vertical coordinates
box_sat
Box plot of employee satisfaction by turnover status
Employees who left show lower satisfaction with the company, mostly concentrated around 0.4.
# Draw a performance assessment and a box line diagram of whether to leave
box_eva <- ggplot(hr, aes(x = left, y = last_evaluation, fill = left)) +
geom_boxplot() +
theme_bw() +
labs(x = 'left', y = 'last_evaluation')
box_eva
Box plot of the last performance evaluation by turnover status
Employees who left tend to have higher performance evaluations, mostly concentrated above 0.8.
# Draw a box plot of the average monthly working hours and whether or not to leave
box_mon <- ggplot(hr, aes(x = left, y = average_montly_hours, fill = left)) +
geom_boxplot() +
theme_bw() +
labs(x = 'left', y = 'average_montly_hours')
box_mon
Employees who left worked longer hours on average; most exceeded the overall average of about 200 hours per month.
# Draw a box plot of the employee's working years in the company and whether or not to leave
box_time <- ggplot(hr, aes(x = left, y = time_spend_company, fill = left)) +
geom_boxplot() +
theme_bw() +
labs(x = 'left', y = 'time_spend_company')
box_time
Employees who left had typically spent around 4 years at the company.
# Combine these plots in one drawing area; cols = 2 arranges the four plots in two columns (a 2 x 2 grid)
multiplot(box_sat, box_eva, box_mon, box_time, cols = 2)
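To put numbers behind the box plots, the group medians can be computed with dplyr; a minimal sketch (this summary is an addition to the original script):
# Median of each numeric variable by turnover status, complementing the box plots above
# dplyr:: prefix avoids any masking by plyr's summarise
hr %>%
  group_by(left) %>%
  dplyr::summarise(median_satisfaction = median(satisfaction_level),
                   median_evaluation = median(last_evaluation),
                   median_hours = median(average_montly_hours),
                   median_years = median(time_spend_company))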
## Explore how the number of projects, promotion within the last five years, and salary relate to turnover
# number_project must be a factor to draw the bar chart of project counts
hr$number_project <- factor(hr$number_project,
levels = c('2', '3', '4', '5', '6', '7'))
# Draw a percentage stacked bar chart of project count vs. turnover
bar_pro <- ggplot(hr, aes(x = number_project, fill = left)) +
geom_bar(position = 'fill') + # position = 'fill' is to draw a percentage stacked bar chart
theme_bw() +
labs(x = 'number_project', y = 'proportion')
bar_pro
Percentage stacked bar chart of project count by turnover status
The more projects an employee participates in, the higher the turnover rate (excluding the group with only 2 projects).
# Draw a percentage stacked bar chart of promotion within the last 5 years vs. turnover
bar_5years <- ggplot(hr, aes(x = as.factor(promotion_last_5years), fill = left)) +
geom_bar(position = 'fill') +
theme_bw() +
labs(x = 'promotion_last_5years', y = 'proportion')
bar_5years
Percentage stacked bar chart of promotion within the last 5 years by turnover status
Employees who were not promoted in the last five years show a markedly higher turnover rate.
# Draw a percentage stacked bar chart of salary level vs. turnover
bar_salary <- ggplot(hr, aes(x = salary, fill = left)) +
geom_bar(position = 'fill') +
theme_bw() +
labs(x = 'salary', y = 'proportion')
bar_salary
Percentage stacked bar chart of salary level by turnover status
The higher the salary, the lower the turnover rate.
# Combine these plots in one drawing area; cols = 3 arranges the three plots in one row of three columns
multiplot(bar_pro, bar_5years, bar_salary, cols = 3)
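To back the visual impression from the bar charts with numbers, the turnover rate per salary band can be computed directly; a minimal sketch (this aggregation is an addition, not in the original script):
# Share of employees with left == '1' within each salary level
hr %>%
  group_by(salary) %>%
  dplyr::summarise(turnover_rate = mean(left == '1'))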
Modeling and prediction: decision tree
################## ============== Extracting Excellent Employees ============== ##################
# filter() keeps only the employees worth retaining: high evaluation, long tenure, or many projects
# number_project was converted to a factor above, so convert it back to numeric for the comparison
hr_model <- filter(hr, last_evaluation >= 0.70 | time_spend_company >= 4
                   | as.numeric(as.character(number_project)) > 5)
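It is worth checking how many employees remain after this filter and how balanced the classes are in the subset; a quick check (an addition, not in the original script):
# Size and class balance of the modelling subset
dim(hr_model)
prop.table(table(hr_model$left))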
################## ============== Custom Cross-Validation ============== ##################
# Set up 5-fold cross-validation: method = 'cv' selects cross-validation, number = 5 sets the number of folds
train_control <- trainControl(method = 'cv', number = 5)
################## ============== Splitting the Data ============== ##################
set.seed(1234) # set a random seed so the split is reproducible
# Stratified 7:3 sampling on the target variable; p = 0.7 puts 70% of the rows in the training set
# list = FALSE returns a vector of row indices instead of a list
index <- createDataPartition(hr_model$left, p = 0.7, list = F)
traindata <- hr_model[index, ] # rows whose indices are in index form the training set
testdata <- hr_model[-index, ] # rest as a test set
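Because createDataPartition() stratifies on left, the class ratio should be nearly identical in both sets; a quick sanity check (an addition to the original script):
# The proportion of leavers should match closely between training and test sets
prop.table(table(traindata$left))
prop.table(table(testdata$left))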
################## ============== Decision Tree ============== ##################
# Use caret's train() to fit a decision tree on the training set with 5-fold cross-validation
# left ~ . models the target as a function of all other variables; trControl passes the resampling scheme
# method selects the algorithm ('rpart' = CART decision tree)
rpartmodel <- train(left ~ ., data = traindata,
trControl = train_control, method = 'rpart')
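To see which complexity parameter (cp) the 5-fold cross-validation selected for the final tree, the fitted train object can be inspected; a quick look (an addition, not in the original script):
print(rpartmodel) # cross-validated accuracy for each cp value tried
rpartmodel$bestTune # the cp value used to fit the final model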
# Predict on the test set; [-7] drops the 7th column (the target left) before prediction
pred_rpart <- predict(rpartmodel, testdata[-7])
# Build the confusion matrix by cross-tabulating the predictions against the actual labels
con_rpart <- table(pred_rpart, testdata$left)
con_rpart
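If you also want accuracy, sensitivity, and specificity in one report, caret's confusionMatrix() can be applied to the same predictions. A minimal sketch, assuming pred_rpart is still the factor returned by predict() (it is only converted to numeric later for the ROC curve):
# confusionMatrix() reports accuracy, kappa, sensitivity, specificity, and more
# positive = '1' declares the "left" class as the positive class
confusionMatrix(pred_rpart, testdata$left, positive = '1')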
Modeling and prediction: naive Bayes
################## ============== Naive Bayes ============== ##################
nbmodel <- train(left ~ ., data = traindata,
trControl = train_control, method = 'nb')
pred_nb <- predict(nbmodel, testdata[-7])
con_nb <- table(pred_nb, testdata$left)
con_nb
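From the two confusion matrices, overall accuracy can be computed directly for a first comparison of the models; a minimal sketch (this comparison is an addition to the original script):
# Accuracy = correctly classified samples / all samples
acc_rpart <- sum(diag(con_rpart)) / sum(con_rpart)
acc_nb <- sum(diag(con_nb)) / sum(con_nb)
c(rpart = acc_rpart, nb = acc_nb)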
Model evaluation + application
################## ============== Model Evaluation: ROC Curves ============== ##################
# When using the roc function, the predicted value must be numeric
pred_rpart <- as.numeric(as.character(pred_rpart))
pred_nb <- as.numeric(as.character(pred_nb))
roc_rpart <- roc(testdata$left, pred_rpart) # Get the information used in subsequent drawing
# The x axis of the ROC curve is the false positive rate: 1 - specificity
Specificity <- roc_rpart$specificities # specificity = true negative rate
Sensitivity <- roc_rpart$sensitivities # sensitivity = recall = true positive rate
# Draw the ROC curve
# data = NULL: we only supply x and y coordinates via aes(), no data frame is needed
p_rpart <- ggplot(data = NULL, aes(x = 1 - Specificity, y = Sensitivity)) +
geom_line(colour = 'red') + # draw the ROC curve
geom_abline() + # draw the diagonal reference line
annotate('text', x = 0.4, y = 0.5, label = paste('AUC=', # annotate() adds a text layer
# 3 is the number of decimal places kept by round()
round(roc_rpart$auc, 3))) + theme_bw() + # place the AUC value at (0.4, 0.5)
labs(x = '1 - Specificity', y = 'Sensitivity') # axis labels
p_rpart
Decision tree ROC curve
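Note that roc() is fed hard 0/1 class labels here, so the curve has only a single operating point. A smoother curve and a threshold-independent AUC can be obtained from the predicted probability of the positive class; a sketch, assuming the probability column returned by predict(type = 'prob') is named '1':
# ROC curve based on the predicted probability of leaving instead of hard labels
prob_rpart <- predict(rpartmodel, testdata[-7], type = 'prob')[, '1']
roc_rpart_prob <- roc(testdata$left, prob_rpart)
roc_rpart_prob$auc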
roc_nb <- roc(testdata$left, pred_nb)
Specificity <- roc_nb$specificities
Sensitivity <- roc_nb$sensitivities
p_nb <- ggplot(data = NULL, aes(x = 1- Specificity, y = Sensitivity)) +
geom_line(colour = 'red') + geom_abline() +
annotate('text', x = 0.4, y = 0.5, label = paste('AUC=',
round(roc_nb$auc, 3))) + theme_bw() +
labs(x = '1 - Specificity', y = 'Sensitivity')
p_nb
Naive Bayes ROC Curve
The AUC of the decision tree (0.93) is higher than the AUC of naive Bayes (0.839),
so we choose the decision tree model as the final prediction model.
################## ============== Application ============== ##################
# Use the decision tree model to predict class probabilities; type = 'prob' returns the probability of staying and the probability of leaving
pred_end <- predict(rpartmodel, testdata[-7], type = 'prob')
# Combine the predicted probabilities (rounded to 3 decimals) with the predicted class
data_end <- cbind(round(pred_end, 3), pred_rpart)
# Rename the columns of the results table
names(data_end) <- c('pred.0', 'pred.1', 'pred')
# Generate an interactive data table
datatable(data_end)
Finally, we generate an interactive table of the prediction results.
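If the table needs to be shared outside R, the same results can also be written to a CSV file; a minimal sketch (the file name is illustrative):
# Export the prediction results; the file name is just an example
write.csv(data_end, 'HR_turnover_predictions.csv', row.names = FALSE)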