Introduction
Customer churn is a major challenge for businesses that rely on long-term customer engagement. Losing customers not only reduces revenue but also increases the cost of acquiring new ones. To address this issue proactively, I built a churn prediction model that identifies users likely to stop transacting in the next three months.
This project aims to help businesses take preventive measures by targeting at-risk users with retention strategies. In this article, I’ll walk you through my approach — from data preparation to model building — and discuss actionable insights that businesses can use to improve customer retention.
Understanding Churn and Its Business Impact
Customer churn occurs when users stop engaging with a business over a specific period. For companies with subscription-based models, churn means users cancel their subscriptions. In transactional businesses, churn can be identified when users stop making purchases or transactions.
High churn rates can indicate dissatisfaction, poor customer service, or competitive market pressures. Predicting churn helps businesses:
- Retain high-value customers before they leave.
- Improve customer experience by identifying pain points.
- Increase revenue by maintaining a loyal customer base.
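To make the transactional definition concrete, here is a minimal sketch of how a churn label can be derived from each user's last completed transaction date, using a 90-day window that mirrors the three-month horizon in this project. The column names and dates below are purely illustrative.

from datetime import timedelta
import pandas as pd

# Hypothetical user activity data; the column name mirrors the field used later in this project.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_transaction_completed_at": pd.to_datetime(
        ["2024-06-01", "2024-01-15", "2024-05-20"]
    ),
})

reference_date = pd.Timestamp("2024-06-30")
cutoff = reference_date - timedelta(days=90)

# A user is labelled churned (1) if their last completed transaction is older
# than the 90-day window, otherwise retained (0).
users["churn"] = (users["last_transaction_completed_at"] < cutoff).astype(int)
print(users)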
Data Preprocessing & Feature Engineering
Before training the model, I prepared the data by:
- Handling missing values appropriately, ensuring no data inconsistencies.
- Creating a churn label based on the ‘last transaction completed’ date.
- Selecting features by identifying the most relevant attributes affecting churn.
- Engineering new features, including transaction frequency, recency, and engagement metrics.
from datetime import timedelta
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from imblearn.over_sampling import SMOTE
import xgboost as xgb
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
smote = SMOTE(random_state=42)
scaler = StandardScaler()
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)

# Define the females list (used later for gender imputation)
females = [
'miriam', 'chisom', 'grace', 'precious', 'chiamaka', 'ifeoma', 'stephanie',
'ugo', 'blessing', 'onyinye', 'sandra', 'ijeoma', 'chioma', 'chidiebere',
'abigail', 'elizabeth', 'naomi', 'nkiru', 'ifeyinwa', 'omowumi', 'glory', 'sarah',
'olajumoke', 'anthonia', 'fatima', 'magdalene', 'ibukun', 'christabel', 'oluchi',
'oluwanifemi', 'nana', 'aminat', 'eki', 'victory', 'favour', 'prisca', 'ayebapriye',
'deborah', 'chidinma', 'maryann', 'yetunde', 'faizah', 'ogochukwu', 'oreoluwa',
'chika', 'adebusola', 'funmilola', 'aishat', 'titilayo', 'emem', 'chinasa', 'melanie',
'joy', 'princess', 'omolara', 'boluwatife', 'oluwatoyin', 'amanda', 'nneoma', 'rebecca',
'cynthia', 'esther', 'ijeoma', 'omowumi', 'chioma', 'grace', 'ibukun', 'precious',
'christabel', 'blessing', 'abigail', 'sarah', 'joy', 'glory', 'omolara', 'oluwanifemi',
'princess', 'titilayo', 'aishat', 'chidiebere', 'onyinye', 'miriam', 'emem', 'funmilola',
'anuoluwa', 'melanie', 'adebusola', 'hafisat', 'genevieve', 'mami-zebah', 'modupeola',
'naomi', 'elohor', 'anita', 'eneyi', 'ogechi', 'doyinsolami', 'oluwasola', 'prudence',
'faridah', 'modestha', 'soseimiebi', 'clarence', 'chika', 'nkiru', 'sikemi', 'josephine',
'tracey', 'chinenye', 'chidinma', 'oluwatomi', 'valeria', 'vanessa', 'jamelia', 'bosola',
'chika', 'omonye', 'oyeronke', 'ifeoluwa', 'nofisat', 'folake', 'martha', 'philomena',
'adebukola', 'abiatha', 'olufunmilayo', 'christine', 'bisola', 'pamela', 'oluwatoyin',
'oluchi', 'margaret', 'folasade', 'ejiro', 'brendan', 'shukurat', 'uchechi',
'oluwatunmike', 'omotola', 'gbemisola', 'bunmi', 'patience', 'yetunde', 'motunrayo',
'yemisi', 'olufunke', 'monsurat', 'stephanie', 'natasa', 'kamilat', 'ugo', 'ogechukwu',
'rachael', 'abisade', 'adaora', 'janet', 'morioluwa', 'chiamaka', 'nkechi', 'kosisochukwuamaka',
'carmen', 'olubusayo', 'rebecca', 'maureen', 'happy', 'nene', 'elizabeth', 'damilola',
'chinasa', 'yemisi', 'adesuwa', 'beatrice', 'jumoke', 'kathryn', 'olajumoke', 'yetunde',
'desiree', 'maryann', 'soso', 'emmanuella', 'abigail', 'chinwendu', 'angela'
]

# Lowercase the `females` list for consistency
females = [name.lower() for name in females]


class Predict_Churn_Users:
    def __init__(self, model, scaler, encoder):
        self.model = model
        self.scaler = scaler
        self.encoder = encoder

    def Preprocess(self, training_data, test_data, reference_date):
        # Drop unnecessary columns
        columns_to_drop = [
            'email', 'kaoshi_email', 'username', 'middlename', 'firstname', 'lastname',
            'phone', 'avatar', 'mxuid', 'account_officer_id', 'referral_code',
            'full_name_shown', 'building_number', 'street_number', 'address', 'address2',
            'city', 'state', 'zipcode', 'country_abbreviation', 'timezone', 'hear_from',
            'delivery_methods', 'payment_methods', 'bank_options', 'currencies',
            'verification_started_at', 'unbanned_at', 'verified_at', 'last_logged_at',
            'tfa_required_for_login', 'two_factor_verified', 'email_verified_at',
            'name_verified_at', 'dob_verified_at', 'is_active', 'deleted_at',
            'transaction_post_completed_amount', 'transaction_match_completed_amount',
            'transaction_received_completed_amount'
        ]
        training_data.drop(columns=columns_to_drop, inplace=True)
        test_data.drop(columns=columns_to_drop, inplace=True)

        # Handle 'last_transaction_completed_at' and create the churn label:
        # users with no completed transaction in the last 90 days are flagged as churned
        three_months_ago = reference_date - timedelta(days=90)
        training_data['last_transaction_completed_at'] = pd.to_datetime(
            training_data['last_transaction_completed_at'], errors='coerce'
        ).dt.tz_localize(None)
        test_data['last_transaction_completed_at'] = pd.to_datetime(
            test_data['last_transaction_completed_at'], errors='coerce'
        ).dt.tz_localize(None)
        training_data['churn'] = training_data['last_transaction_completed_at'].apply(
            lambda x: 0 if x >= three_months_ago else 1
        )
        training_data.drop('last_transaction_completed_at', axis=1, inplace=True)
        test_data.drop('last_transaction_completed_at', axis=1, inplace=True)

        # Gender imputation based on first names
        def assign_gender(first_name):
            first_name = first_name.lower()
            return 'Female' if first_name in females else 'Male'

        training_data['first_name'] = training_data['full_name'].str.split(' ').str[0]
        test_data['first_name'] = test_data['full_name'].str.split(' ').str[0]

        training_data['gender'] = training_data.apply(
            lambda row: assign_gender(row['first_name']) if pd.isnull(row['gender']) else row['gender'], axis=1
        )
        test_data['gender'] = test_data.apply(
            lambda row: assign_gender(row['first_name']) if pd.isnull(row['gender']) else row['gender'], axis=1
        )

        # Fill missing inviter IDs
        training_data['inviter_id'].fillna('None', inplace=True)
        test_data['inviter_id'].fillna('None', inplace=True)

        # Drop temporary columns
        training_data.drop(columns=['first_name', 'identity_verified_at'], inplace=True)
        test_data.drop(columns=['first_name', 'identity_verified_at'], inplace=True)

        # Convert datetime columns
        for col in ['updated_at', 'first_transaction_completed_at']:
            training_data[col] = pd.to_datetime(training_data[col], errors='coerce')
            test_data[col] = pd.to_datetime(test_data[col], errors='coerce')

        return training_data, test_data

    def Feature_Engineering(self, training_data, test_data):
        # Ensure the key datetime columns are parsed
        for df in [training_data, test_data]:
            df['updated_at'] = pd.to_datetime(df['updated_at'], errors='coerce')
            df['first_transaction_completed_at'] = pd.to_datetime(df['first_transaction_completed_at'], errors='coerce')

        # Identify datetime columns dynamically (deduplicated so no column is processed twice)
        datetime_cols = training_data.select_dtypes(include=['datetime64']).columns.tolist()
        datetime_cols1 = ['updated_at', 'first_transaction_completed_at']
        datetime_cols = list(dict.fromkeys(datetime_cols + datetime_cols1))

        # Feature extraction for datetime columns
        for df in [training_data, test_data]:
            for col in datetime_cols:
                df[f"{col}_year"] = df[col].dt.year
                df[f"{col}_month"] = df[col].dt.month
                df[f"{col}_day"] = df[col].dt.day
                df[f"{col}_dayofweek"] = df[col].dt.dayofweek
                df[f"{col}_month_sin"] = np.sin(2 * np.pi * df[col].dt.month / 12)
                df[f"{col}_month_cos"] = np.cos(2 * np.pi * df[col].dt.month / 12)
            # Drop original datetime columns
            df.drop(columns=datetime_cols, inplace=True)

        # Separate target variable
        X = training_data.drop('churn', axis=1)
        y = training_data['churn']

        # Identify and encode object (categorical) columns
        obj_cols = X.select_dtypes(include=['object']).columns
        X[obj_cols] = self.encoder.fit_transform(X[obj_cols])
        test_data[obj_cols] = self.encoder.transform(test_data[obj_cols])

        # SMOTE to rebalance the churn classes
        X_train, y_train = smote.fit_resample(X, y)

        # Scale data
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(test_data)

        return X_train_scaled, X_test_scaled, y_train
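Here is a minimal sketch of how the class above might be wired together end to end. The CSV file names and the reference date are placeholders, not the actual data sources used in this project.

# Hypothetical usage of the pipeline above; file paths and dates are placeholders.
train_df = pd.read_csv("users_train.csv")
test_df = pd.read_csv("users_test.csv")

pipeline = Predict_Churn_Users(model=xgb_model, scaler=scaler, encoder=ordinal_encoder)

# The reference date anchors the 90-day churn window.
reference_date = pd.Timestamp("2024-12-31")

train_df, test_df = pipeline.Preprocess(train_df, test_df, reference_date)
X_train_scaled, X_test_scaled, y_train = pipeline.Feature_Engineering(train_df, test_df)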
Building the Churn Prediction Model
The model was trained with XGBoost and evaluated using the F1 score, precision, recall, and a confusion matrix. The preprocessing and feature-engineering steps are wrapped in a class so the code can be reused in future deployments.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# XGBoost modeling
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train_scaled, y_train)  # y_train is the SMOTE-resampled target from Feature_Engineering

# Prediction and evaluation for XGBoost
# (y_test holds the held-out churn labels for the test set)
xgb_preds = xgb_model.predict(X_test_scaled)
xgb_accuracy = accuracy_score(y_test, xgb_preds)
xgb_recall = recall_score(y_test, xgb_preds)
xgb_precision = precision_score(y_test, xgb_preds)
xgb_f1 = f1_score(y_test, xgb_preds)
conf_matrix = confusion_matrix(y_test, xgb_preds)

print(f"Model Accuracy: {xgb_accuracy:.4f}")
print(f"Model f1: {xgb_f1:.4f}")
print(f"Model Precision: {xgb_precision:.4f}")
print(f"Model recall: {xgb_recall:.4f}")

plt.title('Confusion Matrix for Churn Prediction Model', size=14)
sns.heatmap(conf_matrix, annot=True, cmap="YlGnBu")
plt.show()
Model Accuracy: 0.9414
Model f1: 0.9683
Model Precision: 0.9683
Model recall: 0.9683
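For readers who want to see how precision, recall, and F1 relate to the confusion matrix in the heatmap, here is a small worked sketch. The 2×2 counts are made up for illustration and are not the actual counts from this project.

import numpy as np

# Illustrative confusion matrix: rows = actual (0 = retained, 1 = churned),
# columns = predicted. These counts are invented for demonstration only.
cm = np.array([[850, 40],
               [ 45, 565]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of users flagged as churn, how many truly churned
recall = tp / (tp + fn)      # of users who churned, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}, accuracy={accuracy:.4f}")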
Actionable Recommendations for Customer Retention
Based on the model’s predictions, I recommend that customer service teams:
- Target at-risk users with personalized retention offers (see the sketch after this list).
- Engage customers proactively through emails, promotions, and support.
- Enhance user experience by addressing churn drivers (e.g., transaction friction, poor service).
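To make the first recommendation concrete, here is a hedged sketch of how the model's churn probabilities could be turned into a ranked list of at-risk users for the retention team. It assumes user IDs were set aside in a `test_user_ids` Series before the identifying columns were dropped during preprocessing; that variable and the 0.7 threshold are illustrative, not part of the pipeline above.

# Probability that each user in the test set will churn in the next three months
churn_proba = xgb_model.predict_proba(X_test_scaled)[:, 1]

# `test_user_ids` is a hypothetical Series of user IDs captured before preprocessing
at_risk = pd.DataFrame({
    "user_id": test_user_ids,
    "churn_probability": churn_proba,
})

# Users above a chosen risk threshold, ranked for the retention team
at_risk = (
    at_risk[at_risk["churn_probability"] >= 0.7]
    .sort_values("churn_probability", ascending=False)
)
print(at_risk.head(10))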
Challenges & Lessons Learned
Some challenges I encountered:
- Imbalanced data, as fewer users churned compared to retained ones.
- Feature selection trade-offs, balancing model performance and interpretability (see the feature-importance sketch after this list).
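One way to inspect that trade-off is to look at which engineered features the trained XGBoost model leans on most. This is a generic sketch using the model's built-in importances; `feature_names` is a hypothetical list of the column names in the order they were fed into the scaler (e.g. captured as `X.columns` inside Feature_Engineering).

# Feature importances from the trained XGBoost model
importances = pd.Series(xgb_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(15))

# Alternatively, XGBoost's built-in plot labels features as f0, f1, ... by position
xgb.plot_importance(xgb_model, max_num_features=15)
plt.show()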
Conclusion
This churn prediction model provides businesses with a proactive approach to customer retention. By leveraging data insights, companies can reduce churn, improve customer satisfaction, and drive long-term growth.