Kaoshi User Classification Using KMeans Clustering

Kaoshi
3 min read · Mar 3, 2025

Introduction

Understanding user behavior is essential for businesses looking to improve customer experience, enhance engagement, and maximize revenue. One effective way to achieve this is through user segmentation: grouping users based on shared characteristics. In this project, we use KMeans clustering to segment users on the Kaoshi platform, allowing for personalized marketing and targeted advertising.

This article outlines the full journey of this project, from data preprocessing to clustering implementation and insight generation.

Why User Segmentation?

User segmentation helps businesses:

  • Identify high-value customers for exclusive offers.
  • Target inactive users with re-engagement strategies.
  • Personalize marketing campaigns for different customer groups.
  • Optimize resource allocation by focusing on the right audience.

Since Kaoshi is a peer-to-peer finance platform, understanding user behavior can enhance fraud detection, improve customer retention, and increase transaction efficiency.

Project Overview

This project follows a three-phase approach:

  1. Data Cleaning & Preprocessing
  2. Modeling with KMeans Clustering
  3. Insights and Interpretation

Importing the Necessary Libraries

This project was implemented in Python. We start by importing the libraries it needs:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
import re

Data Cleaning & Preprocessing

Before we apply machine learning algorithms, we must ensure the dataset is clean and structured. The dataset contains various attributes such as user demographics, transaction history, and account activity.

Key Steps in Data Preprocessing

  • Handling missing values to avoid gaps in analysis.
  • Removing unnecessary columns such as usernames, email addresses, and avatars.
  • Converting date columns into numerical features (e.g., “Days Since Joined”).
  • Standardizing categorical data (e.g., encoding gender and country values).
  • Normalizing numerical features to ensure fair clustering.

After preprocessing, we have a structured dataset ready for clustering. The code below carries out these steps:

# Import data
data = pd.read_csv('Users.csv')
# Keep an untouched copy of the raw data
data_copy = data.copy()
# Choose columns to clean
cols = ['transaction_completed_count', 'transaction_post_completed_amount', 'transaction_received_completed_amount',
        'Post Completed', 'Match Completed', 'Received Completed', 'transaction_match_completed_amount']
# Keep the text before the first comma, strip letters and whitespace, then convert to numeric
for col in cols:
    data[col] = data[col].apply(
        lambda x: re.sub(r'[a-zA-Z\s]', '', str(x).split(',')[0])
    )
    data[col] = pd.to_numeric(data[col], errors='coerce')
# Select relevant columns
columns = [
    'transaction_completed_count',
    'transaction_post_completed_amount',
    'transaction_match_completed_amount',
    'transaction_received_completed_amount',
    'Post Completed', 'Match Completed', 'Received Completed',
    'is_identity_verified', 'is_bank_added', 'is_payment_methods_added',
    'is_delivery_methods_added', 'gender', 'country', 'joined_at', 'last_logged_at'
]
# .copy() avoids SettingWithCopyWarning when new columns are added below
data = data[columns].copy()
# Preprocessing
# Parse the date columns and make them timezone-naive
data['joined_at'] = pd.to_datetime(data['joined_at']).dt.tz_localize(None)
data['last_logged_at'] = pd.to_datetime(data['last_logged_at']).dt.tz_localize(None)
# Convert dates to durations in days, measured from today (also timezone-naive)
today = pd.Timestamp.today()
data['days_since_joined'] = (today - data['joined_at']).dt.days
data['days_since_last_login'] = (today - data['last_logged_at']).dt.days
# Drop original date columns
data = data.drop(['joined_at', 'last_logged_at'], axis=1)
# Fill missing gender and last-login values from the next row (backfill),
# then forward-fill any gaps left at the end of the table
fill_cols = ['gender', 'days_since_last_login']
data[fill_cols] = data[fill_cols].bfill().ffill()
# Fill missing values in all remaining columns with 0
remaining_cols = [x for x in data.columns if x not in fill_cols]
data[remaining_cols] = data[remaining_cols].fillna(0)
# Encode categorical variables (cast to str so each column holds a single type)
obj_cols = [x for x in data.columns if data[x].dtype == object]
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
data[obj_cols] = encoder.fit_transform(data[obj_cols].astype(str))
# Scale data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
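
To make the numeric cleaning step easier to follow, here is a minimal sketch that applies the same regular expression to a few made-up values. The sample strings and the helper name `clean_amount` are assumptions for illustration; the actual format of the values in Users.csv may differ.

```python
import re
import pandas as pd

# Hypothetical raw values; the real entries in Users.csv may look different.
raw = pd.Series(["1500 NGN", "250.75 USD, 100 GBP", " 42 ", None])

def clean_amount(value):
    """Keep the text before the first comma, then drop letters and whitespace."""
    return re.sub(r"[a-zA-Z\s]", "", str(value).split(",")[0])

# Strings that end up empty (such as a missing value) are coerced to NaN.
cleaned = pd.to_numeric(raw.apply(clean_amount), errors="coerce")
print(cleaned.tolist())
```

Note that only the part before the first comma survives, so a value that lists several amounts keeps just the first one, and a missing value ends up as NaN, which the later fill step replaces with 0.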

Modeling with KMeans Clustering

KMeans clustering is an unsupervised machine learning algorithm used to group data points into clusters based on similarity.
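
In scikit-learn's KMeans, "similarity" means Euclidean distance: each point is assigned to its nearest centroid, and the algorithm tries to minimize the sum of squared distances between points and their centroids, a quantity called inertia. The sketch below shows this on a handful of made-up 2-D points (not Kaoshi data) and recomputes the inertia by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data, made up for illustration: two well-separated groups of points.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],
])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# Look up the centroid that each point was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each point to its centroid.
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print("Labels:", model.labels_)
print("Manual inertia:", manual_inertia)
print("KMeans inertia:", model.inertia_)
```

The two inertia values should agree up to floating-point error. Because the distance is Euclidean, features on larger scales would dominate the result, which is why the features are standardized before clustering.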

Steps in Clustering

Determining the optimal number of clusters using the Elbow Method.

inertia = []
k_values = range(1, 10)
for k in k_values:  # Try different numbers of clusters
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)
# Plot inertia to find the elbow point
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method for Optimal Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

Training the KMeans algorithm to segment users.

Validating cluster formation to ensure interpretability.

Once clustering is complete, users are assigned to specific groups based on behavior.

# KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42) # Adjust `n_clusters` as needed
data['cluster'] = kmeans.fit_predict(scaled_data)
# View clustered data
data.groupby('cluster').mean()
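
The validation step is not shown in code above. One common option, offered here as an assumption rather than the method used in this project, is the silhouette score from scikit-learn. The sketch uses synthetic data from `make_blobs` as a stand-in for the scaled user features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled user features (not Kaoshi data).
features, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
scaled = StandardScaler().fit_transform(features)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 8):  # the score is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real data, the silhouette score can be read alongside the elbow plot: if both point to a similar number of clusters, that choice is easier to justify.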

Insights and Interpretation

The final phase involves analyzing the clusters to derive actionable business insights.

Understanding the Clusters

Each cluster represents a distinct user persona. For example:

  • Cluster 0: Low-activity users with minimal transactions.
  • Cluster 1: Regular users with verified accounts.
  • Cluster 2: High-value users making large transactions.
  • Cluster 3: Inactive but previously high-value users.
  • Cluster 4: Users who primarily receive transactions but do not initiate.
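
Persona descriptions like these are usually read off the cluster centroids. Since the model is fitted on standardized features, it can help to convert the centroids back to the original units first. The sketch below does this with `StandardScaler.inverse_transform` on a made-up two-column table; the data and the `n_users` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up feature table standing in for the preprocessed user data.
rng = np.random.default_rng(42)
users = pd.DataFrame({
    "transaction_completed_count": rng.poisson(5, size=200),
    "days_since_last_login": rng.integers(0, 365, size=200),
})

scaler = StandardScaler()
scaled = scaler.fit_transform(users)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)

# Convert centroids from standardized units back to the original units.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=users.columns,
)
# Add the number of users in each cluster for context.
centers["n_users"] = np.bincount(kmeans.labels_, minlength=3)
print(centers.round(1))
```

Each row then reads as an average user for that cluster, for example a typical transaction count and number of days since the last login, together with the cluster size.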

Conclusion

This project demonstrates how KMeans clustering can be applied to real-world user segmentation. By understanding Kaoshi users, the platform can create smarter marketing campaigns, improve customer engagement, and drive business growth.

Written by Kaoshi

We are a marketplace connecting Africans at home and abroad, to the financial services that enable them to meet their obligations, affordably and conveniently.
