#Tested on Python 3.11.2
%pip install pandas numpy seaborn matplotlib scikit-learn

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as pyplot
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score, train_test_split, learning_curve
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

filename = './data/customer_churn_dataset-final-master.csv'
data =  pd.read_csv(filename)
print(data.shape)
data.head()

(64374, 12)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64374 entries, 0 to 64373
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   CustomerID         64374 non-null  int64 
 1   Age                64374 non-null  int64 
 2   Gender             64374 non-null  object
 3   Tenure             64374 non-null  int64 
 4   Usage Frequency    64374 non-null  int64 
 5   Support Calls      64374 non-null  int64 
 6   Payment Delay      64374 non-null  int64 
 7   Subscription Type  64374 non-null  object
 8   Contract Length    64374 non-null  object
 9   Total Spend        64374 non-null  int64 
 10  Last Interaction   64374 non-null  int64 
 11  Churn              64374 non-null  int64 
dtypes: int64(9), object(3)
memory usage: 5.9+ MB

data = data.drop(columns=['CustomerID'])

# Split data set and create a copy of the training set for data exploration
train_set, test_set = train_test_split(data, test_size=0.3, random_state=7)

train_set_explore = train_set.copy()
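The split above is random but not stratified, so the churn rate in the two partitions can drift slightly apart. A minimal sketch of a stratified alternative follows; the `toy` frame and its columns are a hypothetical stand-in, not the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the churn data (assumption, not the real dataset)
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "Tenure": rng.integers(1, 61, size=1000),
    "Churn": np.r_[np.zeros(700, dtype=int), np.ones(300, dtype=int)],
})

# stratify= keeps the class proportions of 'Churn' the same in both partitions
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, random_state=7, stratify=toy["Churn"]
)

train_rate = train_toy["Churn"].mean()
test_rate = test_toy["Churn"].mean()
print(f"churn rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

With roughly balanced classes, as in this notebook, the effect is small, but stratifying costs nothing and makes the train and test metrics easier to compare.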

# Statistical Summary
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)
train_set_explore.describe()

# Calculate class counts
class_counts = train_set_explore.groupby('Churn').size()

pyplot.figure(figsize=(3, 3))
class_counts.plot(kind='bar')
pyplot.title('Class Counts of Churn')
pyplot.xlabel('Churn')
pyplot.xticks(ticks=[0, 1], labels=['No Churn', 'Churn'], rotation=0)
pyplot.show()

# Extract numerical data
numerical_features = ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction', 'Churn']
numerical_data_explore = train_set_explore[numerical_features]

numerical_data_explore.head()

# Correlation between numerical features
correlations = numerical_data_explore.corr(method='pearson')
correlations_rounded = correlations.round(3)
correlations_rounded

# Skew of Univariate Distributions
skew = numerical_data_explore.skew()
skew

Age                -3.763e-02
Tenure             -1.250e-01
Usage Frequency     4.295e-02
Support Calls      -2.001e-01
Payment Delay      -3.499e-01
Total Spend         5.001e-02
Last Interaction   -3.673e-04
Churn               1.050e-01
dtype: float64

# Univariate Histograms
numerical_data_explore.hist(figsize=[10, 10])
pyplot.show()

# Univariate Density Plots
numerical_data_explore.plot(kind='density', subplots=True, layout=(3,3), sharex=False, figsize=[15, 15]) 
pyplot.show()

# Box and Whisker Plots
numerical_data_explore.plot(kind='box', subplots=True, layout=(5,5), sharex=False, sharey=False,figsize=[15, 15]) 
pyplot.show()

# Correlation Matrix Plot
fig = pyplot.figure(figsize=[5, 5])
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)

ticks = np.arange(0, correlations.shape[0], 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(numerical_features, rotation=90)
ax.set_yticklabels(numerical_features)
pyplot.show()

# Extract categorical data
categorical_features = ['Gender', 'Subscription Type', 'Contract Length']
categorical_data_explore = train_set_explore[categorical_features]

categorical_data_explore.head()

# Visualize the categorical features
fig, axes = pyplot.subplots(nrows=1, ncols=3, figsize=(10, 5))

for ax, feature in zip(axes, categorical_features):
    categorical_data_explore[feature].value_counts().plot(kind='bar', ax=ax, title=feature)
    ax.set_xlabel(feature)

pyplot.tight_layout()
pyplot.show()

train_set_prep = train_set.copy()

# Encode categorical features
encoder = OneHotEncoder(drop='first', sparse_output=False, dtype='int')
encoded_data = encoder.fit_transform(train_set_prep[categorical_features])

# Get feature names and convert encoded data to DataFrame
encoded_feature_names = encoder.get_feature_names_out(categorical_features)
encoded_df = pd.DataFrame(encoded_data, columns=encoded_feature_names)

# Drop original categorical columns and concatenate the encoded columns with numerical and target features
data_encoded = train_set_prep.drop(columns=categorical_features).reset_index(drop=True)
data_encoded = pd.concat([data_encoded, encoded_df], axis=1)

print("Columns after one-hot encoding:\n", data_encoded.columns.tolist())
data_encoded.head()

Columns after one-hot encoding:
 ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction', 'Churn', 'Gender_Male', 'Subscription Type_Premium', 'Subscription Type_Standard', 'Contract Length_Monthly', 'Contract Length_Quarterly']

# Separate input and output components and rescale the input features
X_prep = data_encoded.drop(columns=['Churn'])
Y_prep = data_encoded['Churn']

# Rescale the input features
scaler = MinMaxScaler(feature_range=(0, 1))
X_prep_rescaled = X_prep.copy()
X_prep_rescaled[X_prep.columns] = scaler.fit_transform(X_prep[X_prep.columns])

X_prep_rescaled.head()

# Influence of Data transformation on ML models
model = KNeighborsClassifier()

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Evaluate K-Nearest Neighbors classification on original data
scores_original = cross_val_score(model, X_prep, Y_prep, cv=kfold)
print("\nKNN Accuracy on original data: ", scores_original.mean(), scores_original.std())

# Evaluate K-Nearest Neighbors classification on normalized data
scores_normalized = cross_val_score(model, X_prep_rescaled, Y_prep, cv=kfold)
print("\nKNN Accuracy on normalized data: ", scores_normalized.mean(), scores_normalized.std())

KNN Accuracy on original data:  0.8092585967028061 0.004959270149748385

KNN Accuracy on normalized data:  0.9036860499389864 0.0051678042031202435
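In the comparison above the scaler is fitted on the whole training frame before cross-validation, so each validation fold has already influenced the min and max used for scaling. A hedged sketch of the usual remedy is to put the scaler and the classifier in one pipeline so the scaler is refitted inside every fold. The data here is synthetic (`make_classification`), and the inflated first column is an assumption made only to mimic a feature with a large range such as Total Spend.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption, not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=7
)
X_demo[:, 0] *= 1000.0  # give one feature a much larger scale than the rest

cv = KFold(n_splits=5, shuffle=True, random_state=7)

# The scaler is fitted on the training part of each fold only
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scores_pipeline = cross_val_score(scaled_knn, X_demo, y_demo, cv=cv)
print(scores_pipeline.mean(), scores_pipeline.std())
```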

#Feature Selection
# 1. VarianceThreshold

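threshold_n = 0.95
# The fitted selector keeps features whose variance exceeds 0.95 * (1 - 0.95) = 0.0475
sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
X_variance_threshold = sel.fit_transform(X_prep)
# Note: the listing below filters on variances_ > 0.95, a stricter cutoff than the fitted
# threshold, so it can differ from sel.get_support()
idx = np.where(sel.variances_ > threshold_n)[0]
selected_features_variance_threshold = X_prep.columns[sel.variances_ > threshold_n]

# Print the features whose variance exceeds the cutoff (VarianceThreshold does not use the target)
print("Selected feature:")
print(idx)
print(selected_features_variance_threshold.tolist())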

Selected feature:
[0 1 2 3 4 5 6]
['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Last Interaction']
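To read the selection straight from the fitted selector rather than from a separate comparison on `variances_`, `get_support()` can be used. The sketch below runs on a small hypothetical frame; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: one wide numeric column, one balanced flag, one rare flag
n = 40
toy_X = pd.DataFrame({
    "wide_numeric": np.arange(n) * 10,
    "balanced_flag": np.arange(n) % 2,                 # variance 0.25
    "rare_flag": (np.arange(n) == 0).astype(int),      # variance 0.025 * 0.975
})

p = 0.95
selector = VarianceThreshold(threshold=p * (1 - p))    # cutoff 0.0475
selector.fit(toy_X)

# get_support() returns the boolean mask the selector actually applies in transform()
kept = toy_X.columns[selector.get_support()].tolist()
print(kept)
```

Because the cutoff is on raw variance, the result depends on feature scale; applying the same selector to min-max scaled data would generally keep a different set.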

from sklearn.feature_selection import SelectKBest, chi2

# 2. SelectKBest with chi-squared statistical test for non-negative features
selector_kbest = SelectKBest(score_func=chi2, k=6)

X_kbest = selector_kbest.fit_transform(X_prep_rescaled, Y_prep)
selected_features_kbest = X_prep_rescaled.columns[selector_kbest.get_support(indices=True)]

print("Top 6 features selected by SelectKBest for KNN:")
print(selected_features_kbest.tolist())

Top 6 features selected by SelectKBest for KNN:
['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male', 'Contract Length_Monthly']
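The list above shows which features were kept but not how far apart their scores are. A small sketch, on synthetic data with hypothetical column names `f0` to `f5`, of how the chi-squared scores and p-values stored on the fitted selector can be ranked:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, rescaled to [0, 1] because chi2 requires non-negative inputs
X_raw, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=7
)
cols = [f"f{i}" for i in range(6)]
X_demo = pd.DataFrame(MinMaxScaler().fit_transform(X_raw), columns=cols)

kbest = SelectKBest(score_func=chi2, k=3).fit(X_demo, y_demo)

# scores_ and pvalues_ hold one entry per input feature, in column order
ranking = pd.DataFrame(
    {"chi2": kbest.scores_, "p_value": kbest.pvalues_}, index=cols
).sort_values("chi2", ascending=False)
print(ranking)

selected = X_demo.columns[kbest.get_support()].tolist()
print(selected)
```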

# 3. Recursive Feature Elimination (RFE)
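# Util function to test RFE with different numbers of features
# Note: RFE is fitted on all of X before cross-validation, so the selection step sees every fold
# and the reported scores may be optimistic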
def evaluate_rfe(X, Y, num_features, model):
    rfe = RFE(model, n_features_to_select=num_features)
    rfe.fit(X, Y)
    selected_X = rfe.transform(X)
    scores = cross_val_score(model, selected_X, Y, cv=5)

    return scores.mean(), rfe

models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Decision Tree': DecisionTreeClassifier(criterion="gini", random_state=7),
}

for model_name, model in models.items():
    print(f"Testing {model_name}")
    for k in range(4, 10):
        score, selector = evaluate_rfe(X_prep_rescaled, Y_prep, k, model)
        selected_features = X_prep.columns[selector.get_support(indices=True)]
        print(f"RFE with {k} features: Mean score = {score:.2f}")
        print(f"Selected features: {selected_features.tolist()}")

Testing Logistic Regression
RFE with 4 features: Mean score = 0.81
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay']
RFE with 5 features: Mean score = 0.82
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 6 features: Mean score = 0.82
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 7 features: Mean score = 0.83
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male']
RFE with 8 features: Mean score = 0.83
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly']
RFE with 9 features: Mean score = 0.83
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly', 'Contract Length_Quarterly']
Testing Decision Tree
RFE with 4 features: Mean score = 0.85
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay']
RFE with 5 features: Mean score = 0.91
Selected features: ['Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 6 features: Mean score = 0.94
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male']
RFE with 7 features: Mean score = 0.95
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Gender_Male', 'Contract Length_Monthly']
RFE with 8 features: Mean score = 0.99
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly']
RFE with 9 features: Mean score = 1.00
Selected features: ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay', 'Total Spend', 'Gender_Male', 'Contract Length_Monthly', 'Contract Length_Quarterly']
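In `evaluate_rfe` the features are selected on the full data and only the final model is cross-validated. A hedged sketch of an alternative is to wrap RFE and the model in one pipeline, so the elimination is repeated on the training part of each fold. The data is synthetic and the choice of four features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=7
)

# RFE is a pipeline step, so it is refitted inside every cross-validation fold
rfe_pipeline = Pipeline([
    ("rfe", RFE(DecisionTreeClassifier(random_state=7), n_features_to_select=4)),
    ("model", DecisionTreeClassifier(random_state=7)),
])

scores_rfe = cross_val_score(rfe_pipeline, X_demo, y_demo, cv=5)
print(f"Mean score with in-fold RFE: {scores_rfe.mean():.2f}")
```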

# 4. Feature importance scoring and visual representation using CART
cart_model = DecisionTreeClassifier(criterion="gini", random_state=7)
cart_model.fit(X_prep_rescaled, Y_prep)

feature_importance_df = pd.DataFrame(cart_model.feature_importances_, index=X_prep_rescaled.columns, columns=['Importance'])
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

pyplot.figure(figsize=(5, 5))
pyplot.barh(feature_importance_df.index, feature_importance_df['Importance'])
pyplot.xlabel('Mean Decrease in Impurity')
pyplot.ylabel('Feature')
pyplot.title('Feature Importance using CART')
pyplot.gca().invert_yaxis()
pyplot.show()

X_train = train_set.drop(columns=['Churn'])
Y_train = train_set['Churn']

X_test = test_set.drop(columns=['Churn'])
Y_test = test_set['Churn']

# Create pipelines for numerical and categorical features
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

num_pipeline = Pipeline([
    ('scaler', MinMaxScaler())
])

cat_pipeline = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the pipelines into a full preprocessing pipeline
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)
])

def plot_confusion_matrix(matrix, title):
    pyplot.figure(figsize=(6, 4))
    sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['No Churn', 'Churn'], yticklabels=['No Churn', 'Churn'])
    pyplot.title(f'{title} Confusion Matrix')
    pyplot.ylabel('Actual')
    pyplot.xlabel('Predicted')
    pyplot.show()

def plot_learning_curve(model, title, X, Y, cv):
    train_sizes, train_scores, test_scores = learning_curve(model, X, Y, cv=cv, n_jobs=-1)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    pyplot.figure()
    pyplot.title(title)
    pyplot.xlabel("Training examples")
    pyplot.ylabel("Score")
    pyplot.grid()

    pyplot.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="blue")
    pyplot.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="green")
    pyplot.plot(train_sizes, train_scores_mean, 'o-', color="blue", label="Training score")
    pyplot.plot(train_sizes, test_scores_mean, 'o-', color="green", label="Cross-validation score")

    pyplot.legend(loc="best")
    pyplot.show()

def plot_roc_curve(model, X, Y, cv):
    predictions = cross_val_predict(model, X, Y, cv=cv, method='predict_proba')
    y_scores = predictions[:, 1]
    auc = roc_auc_score(Y, y_scores)
    print(f'ROC AUC: {auc:.3f}')
    fpr, tpr, thresholds = roc_curve(Y, y_scores)
    pyplot.plot([0, 1], [0, 1], linestyle='--')
    pyplot.plot(fpr, tpr, marker='.')
    pyplot.ylabel("Sensitivity")
    pyplot.xlabel("1-Specificity")
    pyplot.show()

# Define base models for initial evaluation
base_models = [
    ('LR', LogisticRegression(solver='liblinear')),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier())
]

# Evaluate base models
kfold = KFold(n_splits=5, random_state=7, shuffle=True)

for name, model in base_models:
    pipeline = Pipeline([
        ('preprocessor', full_pipeline),
        ('model', model)
    ])
   
    predictions = cross_val_predict(pipeline, X_train, Y_train, cv=kfold)
    
    accuracy = accuracy_score(Y_train, predictions)
    print(f"{name} Mean Accuracy: {accuracy:.4f}")
    
    report = classification_report(Y_train, predictions)
    print(f"{name} Classification Report:\n", report)

    matrix = confusion_matrix(Y_train, predictions)
    plot_confusion_matrix(matrix, name)
    
    plot_roc_curve(pipeline, X_train, Y_train, cv=kfold)
    plot_learning_curve(pipeline, f'Learning Curve ({name})', X_train, Y_train, cv=kfold)

LR Mean Accuracy: 0.8266
LR Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.83      0.83     23712
           1       0.81      0.83      0.82     21349

    accuracy                           0.83     45061
   macro avg       0.83      0.83      0.83     45061
weighted avg       0.83      0.83      0.83     45061

ROC AUC: 0.904

KNN Mean Accuracy: 0.9013
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.88      0.90     23712
           1       0.87      0.93      0.90     21349

    accuracy                           0.90     45061
   macro avg       0.90      0.90      0.90     45061
weighted avg       0.90      0.90      0.90     45061

ROC AUC: 0.963

CART Mean Accuracy: 0.9980
CART Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     23712
           1       1.00      1.00      1.00     21349

    accuracy                           1.00     45061
   macro avg       1.00      1.00      1.00     45061
weighted avg       1.00      1.00      1.00     45061

ROC AUC: 0.998

# Define feature extraction/selection methods
from sklearn.pipeline import FeatureUnion


features = []
features.append(('select_best', SelectKBest(score_func=chi2, k=6)))
features.append(('rfe', RFE(estimator=DecisionTreeClassifier(criterion="gini", random_state=7), n_features_to_select=6)))
feature_union = FeatureUnion(features)


# Model selection and optimization via hyperparameter tuning using randomized search
param_grid_lr = {
    'model__C': [0.01, 0.1, 1],
    'model__penalty': ['l1', 'l2'],
    'model__solver': ['liblinear', 'saga']
}

param_grid_knn = {
    'model__n_neighbors': [50, 75, 100, 125, 150, 200, 250],
    'model__weights': ['uniform', 'distance'],
    'model__algorithm': ['auto'],
}

param_grid_cart = {
    'model__criterion': ['gini', 'entropy'],
    'model__max_depth': [1, 2, 3, 4, 5],
    'model__min_samples_split': [80, 100, 120],
    'model__min_samples_leaf': [10, 20, 30, 40, 50],
}


models = [
    ('LR', LogisticRegression(), param_grid_lr),
    ('KNN', KNeighborsClassifier(), param_grid_knn),
    ('CART', DecisionTreeClassifier(), param_grid_cart)
]

# Perform hyperparameter tuning with RandomizedSearchCV
for name, model, param_dist in models:
    pipeline = Pipeline([
        ('preprocessor', full_pipeline),
        ('feature_union', feature_union),
        ('model', model)
    ])
    random_search = RandomizedSearchCV(estimator=pipeline, param_distributions=param_dist, n_iter=12, cv=kfold, scoring='accuracy', random_state=7)
    random_search.fit(X_train, Y_train)
    print(f"Best parameters for {name}: {random_search.best_params_}")
    print(f"Best cross-validation score for {name}: {random_search.best_score_:.3f}")

Best parameters for LR: {'model__solver': 'saga', 'model__penalty': 'l1', 'model__C': 0.01}
Best cross-validation score for LR: 0.824
Best parameters for KNN: {'model__weights': 'distance', 'model__n_neighbors': 50, 'model__algorithm': 'auto'}
Best cross-validation score for KNN: 0.928
Best parameters for CART: {'model__min_samples_split': 120, 'model__min_samples_leaf': 50, 'model__max_depth': 5, 'model__criterion': 'entropy'}
Best cross-validation score for CART: 0.956
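Rather than copying the reported parameters by hand into new estimators, the refitted pipeline is available as `best_estimator_` on the search object (with the default `refit=True`). The sketch below illustrates this on synthetic data with a deliberately small, hypothetical parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=7)),
])
param_dist = {
    "model__max_depth": [1, 2, 3, 4, 5],
    "model__min_samples_leaf": [10, 20, 30],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_dist,
    n_iter=6,
    cv=KFold(n_splits=5, shuffle=True, random_state=7),
    scoring="accuracy",
    random_state=7,
)
search.fit(X_tr, y_tr)

# best_estimator_ is the whole pipeline refitted on X_tr with best_params_
best_pipe = search.best_estimator_
test_accuracy = best_pipe.score(X_te, y_te)
print(search.best_params_, f"test accuracy = {test_accuracy:.3f}")
```

Reusing `best_estimator_` avoids transcription slips between the search output and the final model definitions.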

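# Note: some values below (C, n_neighbors, weights, max_depth) differ from the best
# parameters printed by the random search above
final_models = [
    ('LR', LogisticRegression(solver='saga', penalty='l1', C=0.1)),
    ('KNN', KNeighborsClassifier(n_neighbors=200)),
    ('CART', DecisionTreeClassifier(
        max_depth=4,
        min_samples_split=120,
        min_samples_leaf=50,
        criterion='entropy',
    ))
]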

for name, model in final_models:
    pipeline = Pipeline([
        ('preprocessor', full_pipeline),
        ('feature_union', feature_union),
        ('model', model)
    ])

    pipeline.fit(X_train, Y_train)
    Y_pred = pipeline.predict(X_test)
    
    print(f"\n{name} Test Accuracy: {accuracy_score(Y_test, Y_pred):.4f}")
    print(classification_report(Y_test, Y_pred))
    matrix = confusion_matrix(Y_test, Y_pred)

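    plot_confusion_matrix(matrix, name)
    # Note: the ROC AUC and learning curve below are cross-validated on the training set,
    # not computed on the test set
    plot_roc_curve(pipeline, X_train, Y_train, cv=kfold)
    plot_learning_curve(pipeline, f'Learning Curve ({name})', X_train, Y_train, cv=kfold)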

LR Test Accuracy: 0.8249
              precision    recall  f1-score   support

           0       0.84      0.82      0.83     10169
           1       0.81      0.83      0.82      9144

    accuracy                           0.82     19313
   macro avg       0.82      0.82      0.82     19313
weighted avg       0.83      0.82      0.82     19313

ROC AUC: 0.900

KNN Test Accuracy: 0.9198
              precision    recall  f1-score   support

           0       0.97      0.88      0.92     10169
           1       0.88      0.97      0.92      9144

    accuracy                           0.92     19313
   macro avg       0.92      0.92      0.92     19313
weighted avg       0.92      0.92      0.92     19313

ROC AUC: 0.972

CART Test Accuracy: 0.9443
              precision    recall  f1-score   support

           0       0.96      0.93      0.95     10169
           1       0.92      0.96      0.94      9144

    accuracy                           0.94     19313
   macro avg       0.94      0.95      0.94     19313
weighted avg       0.95      0.94      0.94     19313

ROC AUC: 0.986
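The ROC AUC values printed above come from `plot_roc_curve`, which cross-validates on the training set. A minimal sketch of computing the same metric on held-out data follows; it uses synthetic data and a plain logistic regression as a stand-in for the fitted pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# Probability of the positive class for each held-out row
test_scores = clf.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_scores)
fpr, tpr, thresholds = roc_curve(y_te, test_scores)
print(f"Test ROC AUC: {test_auc:.3f}")
```

In the notebook's own loop the equivalent call would be `pipeline.predict_proba(X_test)[:, 1]` after `pipeline.fit(X_train, Y_train)`.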

	Age	Tenure	Usage Frequency	Support Calls	Payment Delay	Total Spend	Last Interaction	Churn
count	45061.000	45061.000	45061.000	45061.000	45061.000	45061.000	45061.000	45061.000
mean	41.959	31.989	15.059	5.409	17.119	540.359	15.536	0.474
std	13.924	17.090	8.818	3.112	8.855	261.213	8.639	0.499
min	18.000	1.000	1.000	0.000	0.000	100.000	1.000	0.000
25%	30.000	18.000	7.000	3.000	10.000	312.000	8.000	0.000
50%	42.000	33.000	15.000	6.000	19.000	534.000	16.000	0.000
75%	54.000	47.000	23.000	8.000	25.000	768.000	23.000	1.000
max	65.000	60.000	30.000	10.000	30.000	1000.000	30.000	1.000

	Age	Tenure	Usage Frequency	Support Calls	Payment Delay	Total Spend	Last Interaction	Churn
Age	1.000	-0.005	-0.040	0.008	-0.015	0.005	-0.001	0.064
Tenure	-0.005	1.000	0.023	0.059	0.055	0.009	0.005	0.194
Usage Frequency	-0.040	0.023	1.000	-0.012	0.029	0.007	-0.009	-0.118
Support Calls	0.008	0.059	-0.012	1.000	0.063	0.021	0.001	0.304
Payment Delay	-0.015	0.055	0.029	0.063	1.000	-0.030	-0.011	0.557
Total Spend	0.005	0.009	0.007	0.021	-0.030	1.000	-0.003	-0.078
Last Interaction	-0.001	0.005	-0.009	0.001	-0.011	-0.003	1.000	-0.006
Churn	0.064	0.194	-0.118	0.304	0.557	-0.078	-0.006	1.000

	CustomerID	Age	Gender	Tenure	Usage Frequency	Support Calls	Payment Delay	Subscription Type	Contract Length	Total Spend	Last Interaction	Churn
0	1	22	Female	25	14	4	27	Basic	Monthly	598	9	1
1	2	41	Female	28	28	7	13	Standard	Monthly	584	20	0
2	3	47	Male	27	10	2	29	Premium	Annual	757	21	0
3	4	35	Male	9	12	5	17	Premium	Quarterly	232	18	0
4	5	53	Female	58	24	9	2	Standard	Annual	533	18	0

	Age	Tenure	Usage Frequency	Support Calls	Payment Delay	Total Spend	Last Interaction	Churn
20878	53	46	20	9	9	700	15	0
55080	18	26	28	4	21	955	28	1
38481	44	49	9	6	0	437	16	0
28513	26	42	28	7	23	249	11	1
38838	29	17	17	3	8	876	12	0

	Age	Tenure	Usage Frequency	Support Calls	Payment Delay	Total Spend	Last Interaction	Churn	Gender_Male	Subscription Type_Premium	Subscription Type_Standard	Contract Length_Monthly	Contract Length_Quarterly
0	53	46	20	9	9	700	15	0	1	0	0	0	0
1	18	26	28	4	21	955	28	1	0	0	1	1	0
2	44	49	9	6	0	437	16	0	1	1	0	0	1
3	26	42	28	7	23	249	11	1	1	0	0	0	1
4	29	17	17	3	8	876	12	0	0	0	1	1	0

	Age	Tenure	Usage Frequency	Support Calls	Payment Delay	Total Spend	Last Interaction	Gender_Male	Subscription Type_Premium	Subscription Type_Standard	Contract Length_Monthly	Contract Length_Quarterly
0	0.745	0.763	0.655	0.9	0.300	0.667	0.483	1.0	0.0	0.0	0.0	0.0
1	0.000	0.424	0.931	0.4	0.700	0.950	0.931	0.0	0.0	1.0	1.0	0.0
2	0.553	0.814	0.276	0.6	0.000	0.374	0.517	1.0	1.0	0.0	0.0	1.0
3	0.170	0.695	0.931	0.7	0.767	0.166	0.345	1.0	0.0	0.0	0.0	1.0
4	0.234	0.271	0.552	0.3	0.267	0.862	0.379	0.0	0.0	1.0	1.0	0.0

AML Customer Churn

1. The Data

Customer Churn Dataset

Dataset Features

Motivation

Source

Descriptive statistics

Load the data

Statistical Summary

Summary of key observations

Class Distribution

Correlations between Attributes

Skew of Univariate Distributions

Visualization - numerical features

Univariate Histograms

Density Plots

Box and Whisker Plots

Correlation Matrix Plot

Visualization - categorical features

Summary from numerical and categorical feature analyses

Data preparation

Influence of Data transformation on ML models

2. Constructing and Selecting Features

VarianceThreshold

SelectKBest

Recursive Feature Elimination - RFE

Visual representation of feature importance using CART

Conclusion

3. Building ML algorithms

Model selection

Selection Strategies

Create pipelines

Create util functions for model evaluation

Create base models for initial evaluation

Model optimization - hyperparameter tuning and feature selection

Feature Selection Strategy:

Hyperparameter Tuning:

Logistic Regression (LR):

K-Nearest Neighbors (KNN):

Decision Tree Classifier (CART):

4. Evaluating models and analyzing the results
