June 12, 2022
This project will analyze the Telco_Extra.xls dataset to perform descriptive statistics on several variables including age, years at current address, gender, level of education, income, marital status, region, customer category (custcat), and churn. Statistical evaluations will investigate customer churn rates to help telecommunications companies better retain future customers. Churn rate is a considerable problem for many organizations as it has a significant impact on revenue; therefore, companies need to develop a way to predict churn and identify the factors that increase churn rates (Ahmad et al., 2019). No personal direct identifiers are used in the dataset, and the 1,000 observations equal approximately half of the company’s client base. Income information was purchased by an outside vendor, and the telecommunications company would like to better understand this information to help forecast earnings of the main customer base.
Descriptive Statistics and Graphical Displays
Descriptive statistics help analysts visualize the raw data and present the information in a meaningful way. It includes quantitative summaries of features within the dataset, such as the center, spread, and maximum and minimum values (Sturdivant et al., 2016). Descriptive statistics include graphical displays such as histograms, box plots, and dot plots that show the raw data in a visualization. Summary statistics for the numeric variables are demonstrated in Table 1 below. Frequency tables are used to evaluate the character variables for gender, region, and marital status.
Table 1
Descriptive Statistics for Numeric Variables

The descriptive statistics in the above table define important values for the six numerically coded columns in the dataset. There is a right skew in all but the customer category. This indicates there are some customers who are over the maximum values in each variable. To start, age shows a mean of 41.68 years, and the oldest (maximum) age is 77. The mode of 33 indicates that is the most occurring age of the customer.
Figure 1
Distribution of Age Histogram

Years customers have lived at current address average at 11.55 years, with the longest showing 55 years. The minimum and maximum values of 1 and 55 reveal a large range, but the median is about 9 years. The histogram below shows the rightward skew with most residents staying in their homes for shorter periods.
Figure 2
Distribution of Years at Current Address

The mean income per household in thousands of dollars is reported to be 77.54, or $77,540. The maximum income is $1,668,000 per year. There is a significant skew to this data given the outliers, but most residents fall under $180,000 per year. The most common reported income (mode) is $25,000, with the median income landing at $47,000. The median is likely to be more reliable in this scenario given the outliers. It is important to consider if there are any missing values in this dataset as well. Any missing income can be because of nonresponse, dropouts, or information loss (Kwak & Kim, 2017). This should be thoroughly investigated given the telecommunication company’s desire to forecast future customer revenue based on income.
Figure 3
Distribution of Income

The mean education level is listed as 2.671, but this data requires further information. Available numbers are 1 through 5, and they mean the following: 1) Did not complete high school, 2) high school degree, 3) some college, 4) college degree, and 5) post-undergraduate degree. The mean is in between completed high school and some college, but the mode suggests most customers completed high school. With the rightward skewness listed in Table 1, Table 2 shows that only 6.6% of customers have completed a graduate level degree.
Table 2
Frequency Table of Education Level

Figure 4
Distribution of Education Level

Customer category is another statistic that is numbered for values. Available options are 1 through 4 and mean the following: 1) Basic service, 2) E-service, 3) Plus service, and 4) Total service. This is the only variable that presented a left skewness with a value of -0.032. The mean is 2.487, but the mode is clearer, so the Plus service is the most common value. Table 3 frequency table and Figure 5 distribution histogram show that service types are somewhat evenly distributed.
Table 3
Frequency Table for Service Options

Figure 5
Distribution of Customer Category

The mean churn is listed as 0.27, with a skewness of 1.02, indicating a rightward shift. Table 4 frequency table shows the percentage and number of customers that discontinued or did not discontinue services over the previous month. A value of zero means no, the service was not terminated, and one means yes, the service was terminated. This data suggest that the company realized a churn rate with 27.46% discontinuing service in the last month.
Table 4
Frequency Table for Churn Rate

Figure 6
Distribution of Churn Rate

Gender is divided into two parts, male using the value of zero and female using the value of 1. As the frequency table shows below, females make up the majority of the customer base at 51.7%. There is a concern for gender bias with slightly asymmetrical results, however. Research should incorporate multiple measures of gender to reflect a diversified culture with sharper models that shorten the gaps in gender inequality (Magliozzi et al., 2019). There is a risk of missing information from individuals who identify as transgender, non-binary or other gender; however, the frequency table below indicates no missing values in this sample dataset.
Table 5
Frequency Table for Gender

Figure 7
Distribution of Gender

Marital status provided two responses, married or unmarried. Like the gender issue, this presents a possibility for bias, as it does not include common law partners, widowed individuals, or those in a committed domestic partnership. However, the frequency table shows that the sample data is complete without missing values, and the responses were relatively evenly split, with 49.5% reporting themselves as married and 50.5% as unmarried. To note, marital status has a direct impact on income, so limiting this response to only two options significantly limits the ability to clearly understand income in these customers.
Table 6
Frequency Table for Marital Status

Figure 8
Distribution of Marital Status

Customer location is important to consider when looking at churn rates. The frequency table for the dataset indicates there are three zones that are relatively evenly distributed. It should be noted that information on the dataset states there are five zones; however, zones 4 and 5 are not listed here for unknown reasons.
Table 7
Frequency Table for Region

Figure 9
Distribution of Region

Churn as the Dependent Variable
Using churn as the dependent variable and omitting income, a multiple linear regression analysis was performed. The seven variables had to be divided in SAS. Age and address had many values, so they were added as continuous variables, while region, gender, marital status, education level, and customer category were added as classification variables.
Null Hypothesis
B1 = B2 = B3 = B4 = B5 = B6 = B7 = 0 stating there is no direct connection between last month’s churn and customer age, years at address, region, education, marital status, gender, and customer category.
Alternative hypothesis
At least one Bk ≠ 0 stating there is a correlation between at least one of these variables and the churn rate within the last month.
Table 8
Least Squares Summary, ANOVA, and Estimates for Churn as Dependent Variable



Figure 10
Fit Diagnostics for Churn

The data results indicate that there is a large residual as the predicted response is far from the observed response (Wicklin, 2021). The residuals follow a downward curve in the first two charts in Figure 10, indicating this may not be the best fit model. The two graphs in the far-right column indicate there are some outliers in the dataset that may be affecting readings (Wicklin, 2021). The two graphs in the lower left indicate a somewhat normal distribution, however.
Figure 11
Residual by Regressors


Residual by Regressor graphs indicate that as one ages and the longer one stays at their current address, the less likely they will be to discontinue services and become part of a churn statistic. Education appears to have an effect on churn rates as customers with higher education are less likely to discontinue services. Further analysis of the tables indicates the p-values for the entire model is lower than the confidence interval of 0.05; therefore, we can reject the null hypothesis that states there is no connection between the variables and churn rate in favor of the alternative hypothesis that some variables do affect the churn. However, R square of 11.53% would indicate weak correlations.
Income as the Dependent Variable
Using income as the dependent variable and omitting churn, a multiple linear regression analysis was performed. The other variables had to be divided in SAS. Age and address had many values, so they were added as continuous variables, while region, gender, marital status, education level, and customer category were added as classification variables.
Null Hypothesis
B1 = B2 = B3 = B4 = B5 = B6 = B7 = 0 stating there is no direct connection between income and customer age, years at address, region, education, marital status, gender, and customer category.
Alternative hypothesis
At least one Bk ≠ 0 stating there is a correlation between at least one of these variables and income.
Table 9
Least Squares Summary, ANOVA, and Estimates for Income as Dependent Variable



Figure 12
Fit Diagnostics for Income

In the above fit diagnostic, we can see a few patterns emerging. The middle graph in the middle row is the predicted versus observed response. With the majority of points falling relatively close to the line, this would appear to be a good fit. The first two graphs in the top row show a downward pattern, however, which indicates the residuals are following a curve and therefore not capturing everything in the response variable (Wicklin, 2021). There appear to be some outliers, as expected, but the data also appears to be somewhat normally distributed, as can be seen in the bottom two graphs in the first column.
Figure 13
Residual by Regressors for Income



The above graphs and tables for income as a dependent variable demonstrate positive relationships between some of these variables and income. The p-values of age, high school completion, and basic services in customer category all have values below the significance level of 0.05; therefore, we can reject the null hypothesis that no significant relationship exists between these variables and income. The R-square value of 15.11% indicate it is a weak relationship. Graphs indicate increased income as one ages, and lower income for lower education levels and customers who use basic services.
The telecommunications company wants to better understand customer churn and predict income for a customer so they need not spend any money purchasing income data from another outside vendor. This data indicates that higher education levels, increasing age, the longer they stay in their residence, and higher income levels result in lesser likelihood of the customer discontinuing services; however, further assessment is required as these are weak correlations.
References
Ahmad, A.K., Jafar, A. & Aljoumaa, K. (2019). Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6(28). https://doi.org/10.1186/s40537-019-0191-6
Kwak, S. K., & Kim, J. H. (2017). Statistical data preparation: management of missing values and outliers. Korean journal of anesthesiology, 70(4), 407–411. https://doi.org/10.4097/kjae.2017.70.4.407
Magliozzi, D., Saperstein, A., & Westbrook, L.E. (2016). Scaling Up: Representing Gender Diversity in Survey Research. Socius, 2. https://doi.org/10.1177/2378023116664352
Sturdivant, R., Pardoe, I., Berrier, I., & Watts, K. (2016). Statistics for Data Analytics. zyBook [online].