September 19, 2021
Customer churn is an important metric for service companies such as telecommunications providers. Retaining customers is more cost-effective than acquiring new ones (Gallo, 2014), so a good predictive model can help organizations anticipate and prevent churn. In this project, five performance measures are calculated for a telecommunications company's customer churn logistic regression model: two use all variables, and three use the most significant predictors.
File Read and Object Structure Viewed of Data Frame
Preparation and Partitioning of Data in Training and Testing Sets
Logistic Model for Churn with Training Data
Evaluation of Logistic Model with Testing Data
Logistic Regression Accuracy
Confusion Matrix Using the table() Function
Performance Measures of Logistic Models Using the Most Significant Predictors
(a) Model Using Contract Predictor
(b) Model Using Paperless Billing Predictor
(c) Model Using Tenure Group Predictor
(d) Model Using All Three Predictors Collectively
The project reads the telecommunications CSV file with the read.csv() function, using getwd() to confirm the current working directory, and saves it in the telco data frame. The str() command displays the telco object’s structure, as seen in Figure 1. The output shows the number of observations and variables, along with each independent and dependent variable and its type. After the caret package is installed and loaded, the script prepares and partitions the data into training and testing sets, as seen in Figure 2. The intrain object holds the partition made on the dependent variable with a proportion of .7, so 70% of the observations go to the training set and the remainder to the testing set.
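The partitioning step described above can be sketched as follows. The project's own call is not shown in the text, so this is a minimal base-R illustration rather than the original script: telco_demo is a made-up stand-in for the telco data frame, and the stratified 70/30 split is an assumption about how the partition on the dependent variable works (caret's createDataPartition() performs a comparable split).

```r
# Hypothetical stand-in for the telco data frame (NOT the project's data)
set.seed(42)
n <- 1000
telco_demo <- data.frame(
  tenure   = sample(1:72, n, replace = TRUE),
  Contract = factor(sample(c("Month-to-month", "One year", "Two year"),
                           n, replace = TRUE)),
  Churn    = factor(sample(c("No", "Yes"), n, replace = TRUE,
                           prob = c(0.73, 0.27)))
)

# Inspect the structure: observations, variables, and variable types
str(telco_demo)

# Stratified split: draw 70% of the row indices within each Churn class
rows_by_class <- split(seq_len(n), telco_demo$Churn)
intrain <- sort(unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE))

training <- telco_demo[intrain, ]   # about 70% of the rows
testing  <- telco_demo[-intrain, ]  # the remaining rows

# Class proportions should be nearly identical in both sets
print(prop.table(table(training$Churn)))
print(prop.table(table(testing$Churn)))
```

Sampling within each class keeps the share of churners roughly the same in the training and testing sets, which matters when one class is much smaller than the other.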
Next, the logistic model for the churn variable is fit to the training data using the glm() function with the binomial family, as shown in Figure 3. A summary of the model identifies the most significant predictors of churn. Figure 4 evaluates the model's error rate on the testing data by recoding “yes” as 1 and generating predictions with the predict() function. Predicted values over 0.5 are classified as “yes” for churn, and those below 0.5 as “no.” The mean() function then gives the proportion of testing observations that were misclassified. Figure 5 calculates the logistic model accuracy from the misClasificError variable and prints it using the paste() function. Figure 6 displays the confusion matrix using the table() command with the fitted.results values from the prediction model. Finally, Figure 7 contains four panels showing performance measures for models built on the three most significant predictors: three models that each use one predictor on its own and a fourth that uses all three together. Each uses the same script as above, modified for the predictors in question.
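A minimal sketch of the fit, predict, and evaluate sequence described above is given here. It runs on simulated data, not the telco file: the demo data frame, its tenure and paperless columns, and the simulated churn probabilities are all hypothetical. Only the function calls (glm(), predict(), mean(), paste(), table()) and the names fitted.results and misClasificError follow the text.

```r
# Simulated data (hypothetical; stands in for the telco training/testing sets)
set.seed(1)
n <- 600
demo <- data.frame(
  tenure    = runif(n, 1, 72),
  paperless = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Assumed relationship: churn is less likely with longer tenure
p_churn <- plogis(0.5 - 0.05 * demo$tenure + 0.6 * (demo$paperless == "Yes"))
demo$Churn <- ifelse(runif(n) < p_churn, 1, 0)  # "yes" coded as 1

# Simple 70/30 split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
training  <- demo[train_idx, ]
testing   <- demo[-train_idx, ]

# Logistic model with the binomial family
model <- glm(Churn ~ tenure + paperless,
             family = binomial(link = "logit"), data = training)
print(summary(model))  # coefficient table shows predictor significance

# Predicted probabilities on the testing set, cut at 0.5
fitted.results <- predict(model, newdata = testing, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Share of testing rows whose prediction differs from the actual value
misClasificError <- mean(fitted.results != testing$Churn)
print(paste("Logistic Regression Accuracy", 1 - misClasificError))

# Confusion matrix: actual values in rows, predictions in columns
conf_mat <- table(Actual = testing$Churn, Predicted = fitted.results)
print(conf_mat)
```

A single-predictor variant of the kind shown in Figure 7 would change only the model formula, for example Churn ~ tenure in this sketch.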
Interpretation of the Results
The logistic model in Figure 3 identifies Contract, PaperlessBilling, and tenure_group as the most significant predictors. The evaluation using testing data in Figure 4 returns a value of 0.2011385, an error rate of about 20%, which indicates the model is approximately 80% accurate. The accuracy of the logistic model shown in Figure 5 confirms this with a value of 0.799. The confusion matrix in Figure 6 breaks down the errors. Of the 1,704 actual “0” or “no” responses (the customer did not churn), 290 were misclassified as “1” or “yes.” This gives a Class 1 error rate of about 17% (290/1704) and a specificity, the ability to correctly identify negative results, of about 83% (Berrier et al., 2018). Of the 404 “yes” responses, 134 were incorrectly classified as “no,” giving a Class 0 error rate of about 33% (134/404) and a sensitivity, the ability to correctly identify positive results, of about 67% (Berrier et al., 2018). With churn (“yes”) as the positive class, the model recognizes customers who stay more reliably than customers who leave, but at roughly 80% overall accuracy it is a reasonably good model for predicting churn.
Figure 7 reports performance measures for several models built on the three significant predictors. Figures 7(a) through 7(c) use the contract, paperless billing, and tenure variables independently. In all three, 560 of the 2,108 observations in the testing set were misclassified (recorded as “yes” when they should have been “no”). Because this count is taken over the whole testing set, it is an overall error rate of about 27% (560/2108) and an accuracy of about 73%, lower than the primary model. Collectively, the three significant predictors show a Class 1 error rate of approximately 20%, with a specificity of 80%, and a Class 0 error rate of about 41%, with a sensitivity of 59%. Overall, these are fair results, but the primary model with all variables considered is more accurate.
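The rates quoted for the primary model's confusion matrix and for the single-predictor models can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative and do not come from the project script.

```r
# Counts quoted in the text for the primary model (Figure 6)
no_total  <- 1704  # "no" responses
no_wrong  <- 290   # of those, classified as "yes"
yes_total <- 404   # "yes" responses
yes_wrong <- 134   # of those, classified as "no"

no_error  <- no_wrong / no_total    # about 0.170
yes_error <- yes_wrong / yes_total  # about 0.332

# Overall error rate and accuracy across all 2,108 testing observations
overall_error    <- (no_wrong + yes_wrong) / (no_total + yes_total)
overall_accuracy <- 1 - overall_error

# Single-predictor models (Figure 7(a-c)): 560 errors out of 2,108
single_error <- 560 / 2108          # about 0.266

print(round(c(no_error = no_error, yes_error = yes_error,
              overall_error = overall_error,
              overall_accuracy = overall_accuracy,
              single_error = single_error), 4))
```

The overall error rate computed this way reproduces the 0.2011385 reported in Figure 4, and its complement rounds to the 0.799 accuracy in Figure 5.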
In closing, the primary model for predicting churn using all variables appears to fit better than the models built on the individual and combined predictors, as it is about 80% accurate. The margin is small, however: the individual predictors are 73% accurate, while the collective model with all three predictors is 77% accurate. These predictors are good indicators of what may be driving churn, which could help the company determine the best ways to address it.
Berrier, J., Nestler, S., Pardoe, I., Sturdivant, R. X., & Watts, K. (2018). Fundamentals of Data Analytics R. Zyante Inc.
Gallo, A. (2014, October 29). The value of keeping the right customers. Harvard Business Review. Retrieved on September 14, 2022, from https://hbr.org/2014/10/the-value-of-keeping-the-right-customers