AML Customer Churn¶

In [1]:
#Tested on Python 3.11.2
%pip install pandas numpy seaborn matplotlib scikit-learn
Welcome, Kaver
Requirement already satisfied: pandas in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (2.3.0)
Requirement already satisfied: numpy in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (2.3.0)
Requirement already satisfied: seaborn in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (0.13.2)
Requirement already satisfied: matplotlib in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (3.10.3)
Requirement already satisfied: scikit-learn in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (1.7.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from pandas) (2025.2)
Requirement already satisfied: contourpy>=1.0.1 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (4.58.4)
Requirement already satisfied: kiwisolver>=1.3.1 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (25.0)
Requirement already satisfied: pillow>=8 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (11.2.1)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from matplotlib) (3.2.3)
Requirement already satisfied: scipy>=1.8.0 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from scikit-learn) (1.15.3)
Requirement already satisfied: joblib>=1.2.0 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from scikit-learn) (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from scikit-learn) (3.6.0)
Requirement already satisfied: six>=1.5 in /Users/Kaver/Development.nosync/Smihula/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Note: you may need to restart the kernel to use updated packages.

1. The Data¶

Customer Churn Dataset¶

The Customer Churn dataset is designed to analyze and predict customer behavior related to discontinuing a subscription. Understanding and predicting customer churn helps businesses identify patterns, develop retention strategies, and improve customer satisfaction, which directly impacts revenue and growth.

Dataset Features¶

  • CustomerID: unique identifier for each customer.
  • Age: customer's age, between 18 and 65.
  • Gender: customer's gender.
  • Tenure: number of months the customer has been with the company, between 1 and 60.
  • Usage Frequency: frequency of service usage, between 1 and 30 times.
  • Support Calls: number of support calls made by the customer, between 1 and 10.
  • Payment Delay: number of days payment is delayed, between 0 and 30.
  • Subscription Type: type of subscription plan (Premium, Standard, Basic).
  • Contract Length: duration of the subscription contract (Annual, Quarterly, Monthly).
  • Total Spend: total amount spent by the customer.
  • Last Interaction: days since the last interaction.
  • Churn: Customer has churned (1) or not (0).

Motivation¶

With my experience in creating and optimizing web-apps to reduce customer churn, analyzing this dataset will enhance my understanding of customer behaviors, leading to improved UI/UX strategies. This understanding will help me as a web developer to create better features, ultimately reducing churn rates in my current role.

Source¶

The dataset is sourced from Kaggle and can be found here: https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset.

Descriptive statistics¶

Load the data¶

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as pyplot
from sklearn.calibration import cross_val_predict
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score, train_test_split, learning_curve
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

filename = './data/customer_churn_dataset-final-master.csv'
data =  pd.read_csv(filename)
print(data.shape)
data.head()
Matplotlib is building the font cache; this may take a moment.
(64374, 12)
Out[ ]:
CustomerID Age Gender Tenure Usage Frequency Support Calls Payment Delay Subscription Type Contract Length Total Spend Last Interaction Churn
0 1 22 Female 25 14 4 27 Basic Monthly 598 9 1
1 2 41 Female 28 28 7 13 Standard Monthly 584 20 0
2 3 47 Male 27 10 2 29 Premium Annual 757 21 0
3 4 35 Male 9 12 5 17 Premium Quarterly 232 18 0
4 5 53 Female 58 24 9 2 Standard Annual 533 18 0
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64374 entries, 0 to 64373
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   CustomerID         64374 non-null  int64 
 1   Age                64374 non-null  int64 
 2   Gender             64374 non-null  object
 3   Tenure             64374 non-null  int64 
 4   Usage Frequency    64374 non-null  int64 
 5   Support Calls      64374 non-null  int64 
 6   Payment Delay      64374 non-null  int64 
 7   Subscription Type  64374 non-null  object
 8   Contract Length    64374 non-null  object
 9   Total Spend        64374 non-null  int64 
 10  Last Interaction   64374 non-null  int64 
 11  Churn              64374 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 5.9+ MB

The dataset contains a wide range of data points with 64,374 entries and 12 columns (11 features and 1 target). No attribute has missing values.

In the initial evaluation, the CustomerID attribute, which is a unique identifier, doesn't add value to our model for churn prediction and will create unnecessary complexity for our initial data analysis and visualization. Therefore, it can be dropped. Removing unnecessary information is crucial, as it serves as noise for our model and should be eliminated to improve model performance.

Additionally, there are no no missing values, so there will no need to drop or replace null values during the data preparation phase.

In [ ]:
data = data.drop(columns=['CustomerID'])

In order to follow best practices, it is good to create a train/test split and set the test set aside, so the training set is manipulated. For the exploration phase, a copy of the training set will be made to analyze the data without affecting the original training set.

In [5]:
# Split data set and create a copy of the training set for data exploration
train_set, test_set = train_test_split(data, test_size=0.3, random_state=7)

train_set_explore = train_set.copy()

Statistical Summary¶

In [6]:
# Statistical Summary
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)
train_set_explore.describe()
Out[6]:
Age Tenure Usage Frequency Support Calls Payment Delay Total Spend Last Interaction Churn
count 45061.000 45061.000 45061.000 45061.000 45061.000 45061.000 45061.000 45061.000
mean 41.959 31.989 15.059 5.409 17.119 540.359 15.536 0.474
std 13.924 17.090 8.818 3.112 8.855 261.213 8.639 0.499
min 18.000 1.000 1.000 0.000 0.000 100.000 1.000 0.000
25% 30.000 18.000 7.000 3.000 10.000 312.000 8.000 0.000
50% 42.000 33.000 15.000 6.000 19.000 534.000 16.000 0.000
75% 54.000 47.000 23.000 8.000 25.000 768.000 23.000 1.000
max 65.000 60.000 30.000 10.000 30.000 1000.000 30.000 1.000

Summary of key observations¶

  1. Customer Demographics and Behavior:
  • Support Calls: mean of 5.04 calls, indicating higher level of customer support interaction. It might be beneficial to examine the correlation between support calls and churn rate.
  • The standard deviation values suggest a diverse customer base in terms of age, tenure, and spending habits.
  1. Churn Distribution
  • Churn Rate: approximately 47.4% of customers have churned, indicating that almost half of the customers have discontinued their relationship with the company. Understanding and addressing this churn is critical for developing effective retention strategies.

Class Distribution¶

In [7]:
# Calculate class counts
class_counts = train_set_explore.groupby('Churn').size()

pyplot.figure(figsize=(3, 3))
class_counts.plot(kind='bar')
pyplot.title('Class Counts of Churn')
pyplot.xlabel('Churn')
pyplot.xticks(ticks=[0, 1], labels=['No Churn', 'Churn'], rotation=0)
pyplot.show()
No description has been provided for this image

The class values are slightly imbalanced, with a slightly higher count of "No Churn" compared to "Churn." However, the imbalance is not significant, so special handling, such as resampling techniques, is not required at this stage. However, it's important to note that imbalanced classes might increase the risk of overfitting towards the dominant class in some models such as Decision Tree classifier (CART).

It is necessary to distinguish between categorical and numerical data, as only numerical data can be used for correlation and skewness analysis.

In [8]:
# Extract numerical data
numerical_features = ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction', 'Churn']
numerical_data_explore = train_set_explore[numerical_features]

numerical_data_explore.head()
Out[8]:
Age Tenure Usage Frequency Support Calls Payment Delay Total Spend Last Interaction Churn
20878 53 46 20 9 9 700 15 0
55080 18 26 28 4 21 955 28 1
38481 44 49 9 6 0 437 16 0
28513 26 42 28 7 23 249 11 1
38838 29 17 17 3 8 876 12 0

Correlations between Attributes¶

In [9]:
# Correlation between numerical features
correlations = numerical_data_explore.corr(method='pearson')
correlations_rounded = correlations.round(3)
correlations_rounded
Out[9]:
Age Tenure Usage Frequency Support Calls Payment Delay Total Spend Last Interaction Churn
Age 1.000 -0.005 -0.040 0.008 -0.015 0.005 -0.001 0.064
Tenure -0.005 1.000 0.023 0.059 0.055 0.009 0.005 0.194
Usage Frequency -0.040 0.023 1.000 -0.012 0.029 0.007 -0.009 -0.118
Support Calls 0.008 0.059 -0.012 1.000 0.063 0.021 0.001 0.304
Payment Delay -0.015 0.055 0.029 0.063 1.000 -0.030 -0.011 0.557
Total Spend 0.005 0.009 0.007 0.021 -0.030 1.000 -0.003 -0.078
Last Interaction -0.001 0.005 -0.009 0.001 -0.011 -0.003 1.000 -0.006
Churn 0.064 0.194 -0.118 0.304 0.557 -0.078 -0.006 1.000
  • Payment Delay and Churn: stronger positive correlation (0.557) between payment delays and churn. This shows that customers with higher payment delays are significantly more likely to churn, which may indicate potential financial issues or dissatisfaction with the service.
  • Support Calls and Churn: moderate positive correlation (0.304) between support calls and churn, indicating that customers who make more support calls are more likely to churn. This suggests that customers who frequently contact support may be experiencing issues or dissatisfaction, leading them to leave.
  • Usage Frequency and Churn: weak negative correlation (-0.118), indicating that higher usage frequency is slightly associated with lower churn.

Monitoring these correlations carefully is essential to ensure they do not significantly influence the model training process. Careful feature selection and potential transformations will be necessary to manage these correlations effectively.

Skew of Univariate Distributions¶

In [10]:
# Skew of Univariate Distributions
skew = numerical_data_explore.skew()
skew
Out[10]:
Age                -3.763e-02
Tenure             -1.250e-01
Usage Frequency     4.295e-02
Support Calls      -2.001e-01
Payment Delay      -3.499e-01
Total Spend         5.001e-02
Last Interaction   -3.673e-04
Churn               1.050e-01
dtype: float64

The skew values are close to 0, indicating that the distributions of the numerical features are nearly symmetric. Normalizing these distributions will still enhance performance of some models such as K-Nearest Neighbors by ensuring that all features are on a similar scale.

Visualization - numerical features¶

Univariate Histograms¶

In [11]:
# Univariate Histograms
numerical_data_explore.hist(figsize=[10, 10])
pyplot.show()
No description has been provided for this image

Most features do not follow a Gaussian distribution. There is bias in support calls and payment delays towards higher.

Linear regression may not be ideal since it assumes a linear relationship and normally distributed (Gaussian) data, which is not present in our dataset. Non-parametric models that do not assume a specific relationship between variables, such as Decision Trees, Random Forests, or K-Nearest Neighbors (KNN), are preferred. These models are likely to perform better compared to linear regression models, as they can handle non-linear relationships and do not require the data to follow a Gaussian distribution.

No attributes are currently scaled, which will need to be addressed during the data processing phase. Many algorithms perform better after normalization, and it is essential to scale the features appropriately to improve model performance.

Density Plots¶

To observe the distribution of each attribute more clearly, density plot analysis will be used. This method will also confirm whether the features follow a normal Gaussian distribution.

In [12]:
# Univariate Density Plots
numerical_data_explore.plot(kind='density', subplots=True, layout=(3,3), sharex=False, figsize=[15, 15]) 
pyplot.show()
No description has been provided for this image

Box and Whisker Plots¶

In [13]:
# Box and Whisker Plots
numerical_data_explore.plot(kind='box', subplots=True, layout=(5,5), sharex=False, sharey=False,figsize=[15, 15]) 
pyplot.show()
No description has been provided for this image

Box and whisker plots are another useful method to review the distribution of each attribute. The absence of dots outside the whiskers indicates that there are no outliers or anomalies in the features.

Correlation Matrix Plot¶

In [14]:
# Correlation Matrix Plot
fig = pyplot.figure(figsize=[5, 5])
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)

ticks = np.arange(0, correlations.shape[0], 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(numerical_features, rotation=90)
ax.set_yticklabels(numerical_features)
pyplot.show()
No description has been provided for this image

The correlation matrix plot visually represents the relationship between different features in the dataset:

  • Positive Correlation: variables that increase together, such as payment delay and support calls with churn (indicating higher churn rates with increased payment delay and support calls).
  • Negative Correlation: variables that move in opposite directions, like total spend and churn (indicating higher spenders are less likely to churn).

Implications:

  • Feature Relationships: important for feature selection.
  • Model Selection: prefer non-parametric models like Decision Trees, Random Forests, or KNN.

Visualization - categorical features¶

For categorical variables such as Gender, Subscription Type, and Contract Length, bar plots will be used.

In [15]:
# Extract categorical data
categorical_features = ['Gender', 'Subscription Type', 'Contract Length']
categorical_data_explore = train_set_explore[categorical_features]

categorical_data_explore.head()
Out[15]:
Gender Subscription Type Contract Length
20878 Male Basic Annual
55080 Female Standard Monthly
38481 Male Premium Quarterly
28513 Male Basic Quarterly
38838 Female Standard Monthly
In [16]:
# Visualize the categorical features
fig, axes = pyplot.subplots(nrows=1, ncols=3, figsize=(10, 5))

for ax, feature in zip(axes, categorical_features):
    categorical_data_explore[feature].value_counts().plot(kind='bar', ax=ax, title=feature)
    ax.set_xlabel(feature)

pyplot.tight_layout()
pyplot.show()
No description has been provided for this image

Potential biases include a gender distribution skewed towards more females. Distribution of subscription type and contract length is nearly even, indicating no particular type dominates the dataset.

Summary from numerical and categorical feature analyses¶

The data analysis indicates that numerical features are not normally distributed, with some skewness observed in support calls and payment delays. Applying normalization will be an essential step in preprocessing. Log transformations can also be considered to improve skewness in support calls and payment delays if necessary after initial model evaluation. Given the distribution characteristics and correlations, non-parametric models like Decision Trees, Random Forests, and KNN are likely to perform better. The box and whisker plots show no dots, indicating the absence of outliers. Balanced categorical features and a slight gender imbalance do not require special handling.

Data preparation¶

A new copy of the training data is used again. During the final model evaluation, the pipeline will incorporate these data preparation steps.

In [17]:
train_set_prep = train_set.copy()
  1. Removing Unnecessary Columns, Handling Duplicate Rows, and Addressing Missing Values: this step is typically essential, but in this case, there are no missing values, so it does not need to be incorporated.
  1. Convert categorical features into numerical formats.
In [18]:
# Encode categorical features
encoder = OneHotEncoder(drop='first', sparse_output=False, dtype='int')
encoded_data = encoder.fit_transform(train_set_prep[categorical_features])

# Get feature names and convert encoded data to DataFrame
encoded_feature_names = encoder.get_feature_names_out(categorical_features)
encoded_df = pd.DataFrame(encoded_data, columns=encoded_feature_names)

# Drop original categorical columns and concatenate the encoded columns with numerical and target features
data_encoded = train_set_prep.drop(columns=categorical_features).reset_index(drop=True)
data_encoded = pd.concat([data_encoded, encoded_df], axis=1)

print("Columns after one-hot encoding:\n", data_encoded.columns.tolist())
data_encoded.head()
Columns after one-hot encoding:
 ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction', 'Churn', 'Gender_Male', 'Subscription Type_Premium', 'Subscription Type_Standard', 'Contract Length_Monthly', 'Contract Length_Quarterly']
Out[18]:
Age Tenure Usage Frequency Support Calls Payment Delay Total Spend Last Interaction Churn Gender_Male Subscription Type_Premium Subscription Type_Standard Contract Length_Monthly Contract Length_Quarterly
0 53 46 20 9 9 700 15 0 1 0 0 0 0
1 18 26 28 4 21 955 28 1 0 0 1 1 0
2 44 49 9 6 0 437 16 0 1 1 0 0 1
3 26 42 28 7 23 249 11 1 1 0 0 0 1
4 29 17 17 3 8 876 12 0 0 0 1 1 0

One-hot encoding transforms each category into a separate binary column, ensuring that categories are treated equally without implying any order. LabelEncoder was not used, as it assigns integer values to categories, which could mislead algorithms into interpreting them as ordered. One-hot encoding avoids this issue, making it more suitable for this dataset. Additionally, drop='first' was used to avoid multicollinearity by dropping the first category of each feature. This means the first category is inferred when all other categories are 0, preventing redundancy and ensuring the model interprets the data correctly.

  1. Rescaling the data - many machine learning algorithms benefit from having all attributes on the same scale.
In [19]:
# Separate input and output components and rescale the input features
X_prep = data_encoded.drop(columns=['Churn'])
Y_prep = data_encoded['Churn']

# Rescale the input features
scaler = MinMaxScaler(feature_range=(0, 1))
X_prep_rescaled = X_prep.copy()
X_prep_rescaled[X_prep.columns] = scaler.fit_transform(X_prep[X_prep.columns])

X_prep_rescaled.head()
Out[19]:
Age Tenure Usage Frequency Support Calls Payment Delay Total Spend Last Interaction Gender_Male Subscription Type_Premium Subscription Type_Standard Contract Length_Monthly Contract Length_Quarterly
0 0.745 0.763 0.655 0.9 0.300 0.667 0.483 1.0 0.0 0.0 0.0 0.0
1 0.000 0.424 0.931 0.4 0.700 0.950 0.931 0.0 0.0 1.0 1.0 0.0
2 0.553 0.814 0.276 0.6 0.000 0.374 0.517 1.0 1.0 0.0 0.0 1.0
3 0.170 0.695 0.931 0.7 0.767 0.166 0.345 1.0 0.0 0.0 0.0 1.0
4 0.234 0.271 0.552 0.3 0.267 0.862 0.379 0.0 0.0 1.0 1.0 0.0

After rescaling with MinMaxScaler, StandardScaler or any other standardization techniques will not be used. Standardization is typically useful for transforming attributes with Gaussian distributions and varying means and standard deviations into a standard Gaussian distribution. However, this step will be skipped as our data does not follow a Gaussian distribution.

Influence of Data transformation on ML models¶

In [20]:
# Influence of Data transformation on ML models
model = KNeighborsClassifier()

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Evaluate K-Nearest Neighbors classification on original data
scores_original = cross_val_score(model, X_prep, Y_prep, cv=kfold)
print("\nKNN Accuracy on original data: ", scores_original.mean(), scores_original.std())

# Evaluate K-Nearest Neighbors classification on normalized data
scores_normalized = cross_val_score(model, X_prep_rescaled, Y_prep, cv=kfold)
print("\nKNN Accuracy on normalized data: ", scores_normalized.mean(), scores_normalized.std())
KNN Accuracy on original data:  0.8092585967028061 0.004959270149748385

KNN Accuracy on normalized data:  0.9036860499389864 0.0051678042031202435

Normalization significantly improves KNN accuracy, increasing it from 0.809 to 0.903. This is because KNN relies on distance metrics, which work better when features are on a similar scale. This highlights the importance of selecting appropriate preprocessing techniques based on the specific needs of each model.

2. Constructing and Selecting Features¶

In the next step, feature selection will be performed to identify the most important features. This process helps reduce overfitting, improve accuracy, and decrease training time. Filter-based methods such as Univariate Selection and Feature Extraction with Univariate Statistical Tests (Chi-squared for classification), along with wrapper methods like Recursive Feature Elimination (RFE), will be used to select the features that contribute most to the target variable.

VarianceThreshold¶

In [21]:
#Feature Selection
# 1. VarianceThreshold

threshold_n = 0.95
sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
X_variance_threshold = sel.fit_transform(X_prep)
idx = np.where(sel.variances_ > threshold_n)[0]
selected_features_variance_threshold = X_prep.columns[sel.variances_ > threshold_n]

# Print selected features that have the strongest relationship with the target
print("Selected feature:")
print(idx)
print(selected_features_variance_threshold.tolist())
Selected feature:
[0 1 2 3 4 5 6]
['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction']

SelectKBest¶

In [22]:
from sklearn.feature_selection import SelectKBest, chi2

# 2. SelectKBest with chi-squared statistical test for non-negative features
selector_kbest = SelectKBest(score_func=chi2, k=6)

X_kbest = selector_kbest.fit_transform(X_prep_rescaled, Y_prep)
selected_features_kbest = X_prep_rescaled.columns[selector_kbest.get_support(indices=True)]

print("Top 6 features selected by SelectKBest for KNN:")
print(selected_features_kbest.tolist())
Top 6 features selected by SelectKBest for KNN:
['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male', 'Contract Length_Monthly']

Recursive Feature Elimination - RFE¶

Now, Recursive Feature Elimination (RFE) will be used to evaluate different numbers of features to determine the optimal set.

RFE will be run with a varying number of features, ranging from 4 to 9, and performance will be evaluated using cross-validation (cross_val_score). This approach will help identify the number of features that have the best performance based on the mean cross-validation score.

In [23]:
# 3. Recursive Feature Elimination (RFE)
# Util function to test RFE with different number of features
def evaluate_rfe(X, Y, num_features, model):
    rfe = RFE(model, n_features_to_select=num_features)
    rfe.fit(X, Y)
    selected_X = rfe.transform(X)
    scores = cross_val_score(model, selected_X, Y, cv=5)

    return scores.mean(), rfe

models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Decision Tree': DecisionTreeClassifier(criterion="gini", random_state=7),
}

for model_name, model in models.items():
    print(f"Testing {model_name}")
    for k in range(4, 10):
        score, selector = evaluate_rfe(X_prep_rescaled, Y_prep, k, model)
        selected_features = X_prep.columns[selector.get_support(indices=True)]
        print(f"RFE with {k} features: Mean score = {score:.2f}")
        print(f"Selected features: {selected_features.tolist()}")
Testing Logistic Regression
RFE with 4 features: Mean score = 0.81
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay']
RFE with 5 features: Mean score = 0.82
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 6 features: Mean score = 0.82
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 7 features: Mean score = 0.83
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male']
RFE with 8 features: Mean score = 0.83
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly']
RFE with 9 features: Mean score = 0.83
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly', 'Contract Length_Quarterly']
Testing Decision Tree
RFE with 4 features: Mean score = 0.85
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay']
RFE with 5 features: Mean score = 0.91
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 6 features: Mean score = 0.94
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 7 features: Mean score = 0.95
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male', 'Contract Length_Monthly']
RFE with 8 features: Mean score = 0.99
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly']
RFE with 9 features: Mean score = 1.00
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly', 'Contract Length_Quarterly']
  • Consistent Feature Importance: both Logistic Regression (LR) and Decision Tree (CART) models, when using RFE, select similar important features such as 'Age', 'Tenure', 'Usage Frequency', 'Support Calls','Payment Delay' and 'Gender_Male'. This indicates that these features are highly informative for predicting the target variable.
  • LR: shows a consistent mean score of 0.83 when using 7 to 9 features.
  • CART: Achieves perfect accuracy with 9 features, but this may indicate over-fitting as the model captures all details in the training data.

Visual representation of feature importance using CART¶

In [24]:
# 4. Feature importance scoring and visual representation using CART
cart_model = DecisionTreeClassifier(criterion="gini", random_state=7)
cart_model.fit(X_prep_rescaled, Y_prep)

feature_importance_df = pd.DataFrame(cart_model.feature_importances_, index=X_prep_rescaled.columns, columns=['Importance'])
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

pyplot.figure(figsize=(5, 5))
pyplot.barh(feature_importance_df.index, feature_importance_df['Importance'])
pyplot.xlabel('Mean Decrease in Impurity')
pyplot.ylabel('Feature')
pyplot.title('Feature Importance using CART')
pyplot.gca().invert_yaxis()
pyplot.show()
No description has been provided for this image

Conclusion¶

All feature selection methods (SelectKBest, Variance Threshold, and RFE) identified similar important features, confirming their significance for our predictive model.

Given the consistency across these methods, it can be concluded that these features are robust indicators for predicting the target variable. For the final model evaluation, RFE with Decision Trees will be used, selecting 6 or 7 features. This approach balances the model's complexity, helping to prevent overfitting while maintaining high accuracy.

Selected features: Payment Delay, Support Calls, Gender_Male, Usage Frequency, Tenure, Age.

3. Building ML algorithms¶

Model selection¶

Three machine learning algorithms were selected for our labeled classification task:

  • Logistic Regression (Supervised)
  • K-Nearest Neighbors (Supervised)
  • Decision Tree Classifier (Supervised)

Selection Strategies¶

  • Logistic Regression: good baseline, easy to implement and interpret. However, it assumes linear relationships and Gaussian data distribution which our dataset doesn't have.
  • K-Nearest Neighbors (KNN): it does not assume specific data distributions (non-parametric model) and benefits from feature normalization. It uses a distance metric to find the k most similar instances in the training data for a new instance and takes the mean outcome of the neighbors as the prediction.
  • Decision Tree Classifier: handles non-linear relationships, identifies feature importance, and works well with mixed data types. Decision-tree learners can create over-complex trees that don not generalise the data well so that's why they are prone to overfitting, manageable with techniques like pruning and feature selection. Pruning tree is implemented via updating the model parameters such as max_leaf_nodes, min_smaples and max_depth.

Now the final train_set will be used.

In [25]:
X_train = train_set.drop(columns=['Churn'])
Y_train = train_set['Churn']

X_test = test_set.drop(columns=['Churn'])
Y_test = test_set['Churn']

Create pipelines¶

In [26]:
# Create pipelines for numerical and categorical features
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

num_pipeline = Pipeline([
    ('scaler', MinMaxScaler())
])

cat_pipeline = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the pipelines into a full preprocessing pipeline
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)
])

Create util functions for model evaluation¶

In [27]:
def plot_confusion_matrix(matrix, title):
    pyplot.figure(figsize=(6, 4))
    sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Churn', 'Churn'], yticklabels=['No Churn', 'Churn'])
    pyplot.title(f'{title} Confusion Matrix')
    pyplot.ylabel('Actual')
    pyplot.xlabel('Predicted')
    pyplot.show()

def plot_learning_curve(model, title, X, Y, cv):
    train_sizes, train_scores, test_scores = learning_curve(model, X, Y, cv=cv, n_jobs=-1)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    pyplot.figure()
    pyplot.title(title)
    pyplot.xlabel("Training examples")
    pyplot.ylabel("Score")
    pyplot.grid()

    pyplot.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="blue")
    pyplot.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="green")
    pyplot.plot(train_sizes, train_scores_mean, 'o-', color="blue", label="Training score")
    pyplot.plot(train_sizes, test_scores_mean, 'o-', color="green", label="Cross-validation score")

    pyplot.legend(loc="best")
    pyplot.show()

def plot_roc_curve(model, X, Y, cv):
    predictions = cross_val_predict(model, X, Y, cv=cv, method='predict_proba')
    y_scores = predictions[:, 1]
    auc = roc_auc_score(Y, y_scores)
    print(f'ROC AUC: {auc:.3f}')
    fpr, tpr, thresholds = roc_curve(Y, y_scores)
    pyplot.plot([0, 1], [0, 1], linestyle='--')
    pyplot.plot(fpr, tpr, marker='.')
    pyplot.ylabel("Sensitivity")
    pyplot.xlabel("1-Specificity")
    pyplot.show()

Create base models for initial evaluation¶

In [28]:
# Define base models for initial evaluation
base_models = [
    ('LR', LogisticRegression(solver='liblinear')),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier())
]

# Evaluate base models
kfold = KFold(n_splits=5, random_state=7, shuffle=True)

for name, model in base_models:
    pipeline = Pipeline([
        ('preprocessor', full_pipeline),
        ('model', model)
    ])
   
    predictions = cross_val_predict(pipeline, X_train, Y_train, cv=kfold)
    
    accuracy = accuracy_score(Y_train, predictions)
    print(f"{name} Mean Accuracy: {accuracy:.4f}")
    
    report = classification_report(Y_train, predictions)
    print(f"{name} Classification Report:\n", report)

    matrix = confusion_matrix(Y_train, predictions)
    plot_confusion_matrix(matrix, name)
    
    plot_roc_curve(pipeline, X_train, Y_train, cv=kfold)
    plot_learning_curve(pipeline, f'Learning Curve ({name})', X_train, Y_train, cv=kfold)
LR Mean Accuracy: 0.8266
LR Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.83      0.83     23712
           1       0.81      0.83      0.82     21349

    accuracy                           0.83     45061
   macro avg       0.83      0.83      0.83     45061
weighted avg       0.83      0.83      0.83     45061

No description has been provided for this image
ROC AUC: 0.904
No description has been provided for this image
No description has been provided for this image
KNN Mean Accuracy: 0.9013
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.88      0.90     23712
           1       0.87      0.93      0.90     21349

    accuracy                           0.90     45061
   macro avg       0.90      0.90      0.90     45061
weighted avg       0.90      0.90      0.90     45061

No description has been provided for this image
ROC AUC: 0.963
No description has been provided for this image
No description has been provided for this image
CART Mean Accuracy: 0.9980
CART Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     23712
           1       1.00      1.00      1.00     21349

    accuracy                           1.00     45061
   macro avg       1.00      1.00      1.00     45061
weighted avg       1.00      1.00      1.00     45061

No description has been provided for this image
ROC AUC: 0.998
No description has been provided for this image
No description has been provided for this image

Observations:

  • LR: provides a good baseline with balanced performance across metrics. It shows a decent ROC AUC and stable learning curves.
  • KNN: demonstrates significant improvement in accuracy and ROC AUC after normalization, highlighting the importance of preprocessing.
  • CART: achieves near-perfect scores but shows signs of overfitting. It's crucial to prune or regularize the tree to prevent overfitting.

Model optimization - hyperparameter tuning and feature selection¶

After the initial model evaluation, the next step is to optimize our models through hyperparameter tuning and feature selection. Based on initial evaluation, models like K-Nearest Neighbors (KNN) and Decision Trees (CART) showed promising results but indicated overfitting as CART was showing almost perfect scores. To address this, feature selection and hyperparameter tuning will be incorporated in order to enhance performance and prevent overfitting. This will be part of our pipeline process that was already established in the previous step with normalization and category encoding.

Feature Selection Strategy:¶

Based on the initial analyses regarding feature selection in step 6, six features will be selected for the final models. By using FeatureUnion, SelectKBest and RFE are combined within the pipeline, ensuring relevant features are selected during cross-validation. This integration enhances the accuracy of the final models and prevents data leaks.

Hyperparameter Tuning:¶

RandomizedSearchCV will be used for hyperparameter tuning, suitable for large datasets, and it shows much faster execution times compared to more complex Grid search. This method uses a random search approach to parameter tuning that samples algorithm parameters from a random distribution for a fixed number of iterations and then models are constructed and evaluated for each combination of specified parameters.

Logistic Regression (LR):¶
  • solver: 'liblinear' is good for smaller datasets; 'sag' and 'saga' are faster for large ones.
  • penalty: both 'l1' and 'l2' were tested. C (Regularization Strength): Tested [0.01, 0.1, 1].
K-Nearest Neighbors (KNN):¶
  • n_neighbors: as the model showed sign of overfitting during the initial base evaluation, higher values [50, 75, 100, 125, 200, 250] will be used to find optimal neighbors.
  • weights: 'uniform' and 'distance' to assess performance improvement.
  • algorithm: default 'auto' was used as it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.
Decision Tree Classifier (CART):¶
  • criterion: Both 'gini', 'entropy' for split quality.
  • splitter: default 'best' will be used. This strategy chooses the best split at each node.
  • max_depth: default is 'none', where nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split. An array [1, 2, 3, 4, 5] will be tested to control tree depth. For example, a depth of 1 might underfit the data as it may not capture complex patterns but is less likely to overfit, which is the current problem. Depths of 3 or 4 balance complexity and the ability to capture patterns and are often considered good starting points. Depths of 10 or more create a complex decision boundary and are prone to overfitting, which will be avoided as overfitting is already an issue.
  • min_samples_split: controls the minimum number of samples required to split an internal node. Smaller samples create more complex trees and are more prone to overfitting. An array [80, 100, 120] will be used to ensure adequate samples for splits.
  • min_sample_leaf: the minimum number of samples required to split and internal node.
In [29]:
# Define feature extraction/selection methods
from sklearn.pipeline import FeatureUnion


features = []
features.append(('select_best', SelectKBest(score_func=chi2, k=6)))
features.append(('rfe', RFE(estimator=DecisionTreeClassifier(criterion="gini", random_state=7), n_features_to_select=6)))
feature_union = FeatureUnion(features)


# Model selection and optimization via hyperparameter tuning using randomized search
param_grid_lr = {
    'model__C': [0.01, 0.1, 1],
    'model__penalty': ['l1', 'l2'],
    'model__solver': ['liblinear', 'saga']
}

param_grid_knn = {
    'model__n_neighbors': [50, 75, 100, 125, 150, 200, 250],
    'model__weights': ['uniform', 'distance'],
    'model__algorithm': ['auto'],
}

param_grid_cart = {
    'model__criterion': ['gini', 'entropy'],
    'model__max_depth': [1, 2, 3, 4, 5],
    'model__min_samples_split': [80, 100, 120],
    'model__min_samples_leaf': [10, 20, 30, 40, 50],
}


models = [
    ('LR', LogisticRegression(), param_grid_lr),
    ('KNN', KNeighborsClassifier(), param_grid_knn),
    ('CART', DecisionTreeClassifier(), param_grid_cart)
]

# Perform hyperparameter tuning with RandomizedSearchCV
for name, model, param_dist in models:
    pipeline = Pipeline([
        ('preprocessor', full_pipeline),
        ('feature_union', feature_union),
        ('model', model)
    ])
    random_search = RandomizedSearchCV(estimator=pipeline, param_distributions=param_dist, n_iter=12, cv=kfold, scoring='accuracy', random_state=7)
    random_search.fit(X_train, Y_train)
    print(f"Best parameters for {name}: {random_search.best_params_}")
    print(f"Best cross-validation score for {name}: {random_search.best_score_:.3f}")
Best parameters for LR: {'model__solver': 'saga', 'model__penalty': 'l1', 'model__C': 0.01}
Best cross-validation score for LR: 0.824
Best parameters for KNN: {'model__weights': 'distance', 'model__n_neighbors': 50, 'model__algorithm': 'auto'}
Best cross-validation score for KNN: 0.928
Best parameters for CART: {'model__min_samples_split': 120, 'model__min_samples_leaf': 50, 'model__max_depth': 5, 'model__criterion': 'entropy'}
Best cross-validation score for CART: 0.956
  • The best parameters for LR were selected saga, with l1 penalty and c of 0.01.

  • The best parameters for KNN were selected as distance weights and n_neighbors set to 50. After further evaluation through trial and error, the model still showed overfitting in cross-validation scores, so the default weight of "uniform" was selected and n_neighbors was set to a higher value of 100.

  • The best parameters for CART were selected by RandomizedSearchCV as min_samples_split of 120, min_samples_leaf of 50, max_depth of 5, and criterion entropy. High focus was on addressing model overfitting, as this was the main concern after the initial base model evaluation. The model showed a good match between training scores and cross-validation scores. Max_depth was further decreased to 4, as with a depth of 5, there was a high discrepancy between false positives and false negatives.

4. Evaluating models and analyzing the results¶

Final model evaluation is done on unseen test data that was extracted using the train-test split method in a 70/30 ratio, one of the first steps before the initial analyses. As part of the baseline and final model evaluations, a confusion matrix with predictions on the x-axis and actual outcomes on the y-axis was used. Our binary classification problem involved True Positives, False Negatives, False Positives, and True Negatives. A second report, "Area Under ROC Curve," was generated to evaluate the model's ability to differentiate between positive and negative classes. Accuracy is measured by the area under the ROC curve, with anything above 0.9 representing an excellent score. Since this is a classification problem, a classification report was also generated to gain quick insight into the model's accuracy using precision, recall, f1-score, and support for each class. Finally, a learning curve was generated to show the model's performance on the training data in comparison with the unseen test data, represented by the cross-validation score.

In [30]:
final_models = [
    ('LR', LogisticRegression(solver='saga', penalty='l1', C=0.1)),
    ('KNN', KNeighborsClassifier(n_neighbors=200)),
    ('CART', DecisionTreeClassifier(
        max_depth=4,
        min_samples_split=120,
        min_samples_leaf=50,
        criterion='entropy',
    ))
]

for name, model in final_models:
    pipeline = Pipeline([
        ('preprocessor', full_pipeline),
        ('feature_union', feature_union),
        ('model', model)
    ])

    pipeline.fit(X_train, Y_train)
    Y_pred = pipeline.predict(X_test)
    
    print(f"\n{name} Test Accuracy: {accuracy_score(Y_test, Y_pred):.4f}")
    print(classification_report(Y_test, Y_pred))
    matrix = confusion_matrix(Y_test, Y_pred)

    plot_confusion_matrix(matrix, name)
    plot_roc_curve(pipeline, X_train, Y_train, cv=kfold)
    plot_learning_curve(pipeline, f'Learning Curve ({name})', X_train, Y_train, cv=kfold)
LR Test Accuracy: 0.8249
              precision    recall  f1-score   support

           0       0.84      0.82      0.83     10169
           1       0.81      0.83      0.82      9144

    accuracy                           0.82     19313
   macro avg       0.82      0.82      0.82     19313
weighted avg       0.83      0.82      0.82     19313

No description has been provided for this image
ROC AUC: 0.900
No description has been provided for this image
No description has been provided for this image
KNN Test Accuracy: 0.9198
              precision    recall  f1-score   support

           0       0.97      0.88      0.92     10169
           1       0.88      0.97      0.92      9144

    accuracy                           0.92     19313
   macro avg       0.92      0.92      0.92     19313
weighted avg       0.92      0.92      0.92     19313

No description has been provided for this image
ROC AUC: 0.972
No description has been provided for this image
No description has been provided for this image
CART Test Accuracy: 0.9443
              precision    recall  f1-score   support

           0       0.96      0.93      0.95     10169
           1       0.92      0.96      0.94      9144

    accuracy                           0.94     19313
   macro avg       0.94      0.95      0.94     19313
weighted avg       0.95      0.94      0.94     19313

No description has been provided for this image
ROC AUC: 0.986
No description has been provided for this image
No description has been provided for this image

The best-performing model is CART, which initially showed significant overfitting. However, after hyperparameter tuning and feature selection, a mean accuracy of 94.43% was achieved. Precision, which focuses on the accuracy of positive predictions and minimizes false positives, is higher for "No Churn." Recall, which shows the model's ability to capture all actual positives and minimizes false negatives, is higher for "Churn." The F1 score indicates balanced performance for both classes, demonstrating good overall accuracy. The confusion matrix shows some discrepancies between false positives and false negatives but shows much better results compared to KNN. The ROC AUC of 0.986 demonstrates an excellent ability to distinguish between classes. The learning curve shows that the training and cross-validation scores are very close, indicating minimal overfitting and good generalization. Based on these observations, CART, after hyperparameter tuning and feature selection, is recommended due to its balanced performance.

KNN model shows a good overall performance with a test accuracy of 91.98% and a ROC AUC of 0.972, the confusion matrix reveals an imbalance in false positives (1257) and false negatives (291). This suggests that the model is more prone to misclassifying 'Churn' instances. Despite applying different values of n_neighbors, this imbalance persisted.

LR model, when evaluated on the unseen test data, showed a test accuracy of approximately 82.48%. The cross-validation score is quite consistent with the training score and the confusion matrix reveals balance between false positives and false negatives. However, as expected from the initial analysis, the LR model has the lowest accuracy among the models tested. This aligns with our prediction since the data does not follow a Gaussian distribution, and models like KNN and CART, which do not assume a specific data distribution, were anticipated to perform better.

Final resolution: CART and KNN are recommended models.

Overall, the accuracy of KNN and CART is much better than LR, which is why these models were primarily selected. KNN is a simple and easy-to-implement algorithm that showed good results for our classification problem. It is also easier to explain and understand, as it predicts the label of the test dataset by looking at the labels of its closest neighbors in the feature space of the training dataset. Even non-technical persons can quickly grasp the concept with the important parameter of n_neighbors.

Decision trees, on the other hand, showed even better results than KNN, but they are more demanding in terms of computation power and hyperparameter tuning, with a higher risk of overfitting and bias towards dominant classes. CART is relatively easy to interpret even for non-technical stakeholders as well. Another benefit is that CART does not make any assumptions about the distribution and can handle both categorical and numerical features without requiring feature scaling, which simplifies the data preprocessing phase. In contrast, KNN needs to be normalized. Additionally, CART can be used for feature importance, whereas KNN does not have this option, and other selection techniques, such as SelectKBest, need to be used.