Machine Learning for Fraud Detection in E-commerce

About this project

This project investigates how machine learning techniques can be applied to identify fraudulent behavior in online transactions. Using real-world e-commerce transaction data, multiple algorithms are implemented and evaluated to understand how different modeling strategies detect suspicious activity. The study combines supervised classification with unsupervised anomaly detection, so it covers both the prediction of labeled fraud and the discovery of unusual behavioral patterns.

The work is organized into three case studies. The first focuses on the Random Forest algorithm as a primary fraud detection model, including feature engineering, outlier handling, and recursive feature elimination to improve predictive capability. The second study compares classical supervised classifiers — Logistic Regression, Naive Bayes, and Decision Trees — analyzing their decision boundaries and performance differences when applied to imbalanced fraud datasets. The third study explores unsupervised techniques, particularly DBSCAN clustering and Z-score based anomaly detection, to identify potentially fraudulent transactions without relying on labeled outcomes.

Fraud detection can be framed as a binary classification problem: a model estimates the probability that a transaction with features x is fraudulent (y = 1) and flags the transaction when that probability exceeds a decision threshold τ:

$$ \hat{y} = \mathbb{1}\big(p(y=1 \mid x) > \tau\big) $$
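As a minimal sketch of this decision rule, the snippet below applies a threshold to a list of estimated fraud probabilities. The function name `classify`, the default threshold of 0.5, and the example scores are illustrative assumptions, not values from the project.

```python
def classify(probabilities, tau=0.5):
    """Apply y_hat = 1(p > tau) to each estimated fraud probability."""
    return [1 if p > tau else 0 for p in probabilities]


# Hypothetical model outputs, not taken from the project's data.
scores = [0.02, 0.40, 0.55, 0.91]

default_flags = classify(scores)            # tau = 0.5
lenient_flags = classify(scores, tau=0.3)   # a lower threshold flags more transactions
```

Lowering the threshold trades more false alarms for fewer missed fraud cases, which is often the relevant trade-off when fraud is rare.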

Supervised models learn this probability directly from labeled examples, while clustering methods instead identify abnormal patterns in feature space. For density-based clustering such as DBSCAN, transactions are grouped by neighborhood density, where the neighborhood of a transaction x contains every transaction within distance ε of it:

$$ N_\varepsilon(x) = \{\, x_j \mid \lVert x_j - x \rVert \le \varepsilon \,\} $$

Points that do not belong to any dense region are treated as anomalies and may correspond to fraud.
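The idea can be illustrated with a small pure-Python sketch of DBSCAN. It is a simplified teaching version, not the project's implementation. The helper names, the two-feature example points, and the parameter values (`eps=0.5`, `min_pts=3`) are assumptions chosen for illustration.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical, already-scaled two-feature transactions (not project data).
transactions = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2),   # dense group A
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (5.0, 5.2),   # dense group B
    (9.0, 0.5),                                       # isolated transaction
]
labels = dbscan(transactions, eps=0.5, min_pts=3)
anomalies = [i for i, label in enumerate(labels) if label == -1]
```

In this toy data the isolated transaction ends up with the noise label `-1`. In practice the result depends heavily on feature scaling and on the choice of ε and the minimum neighborhood size.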

The project highlights a practical insight in financial machine learning: fraudulent behavior often appears as rare or irregular patterns rather than as a cleanly separable class. As a result, combining classification models with anomaly detection can make a detection system more robust. The implementation demonstrates how different algorithms capture different aspects of fraud (statistical irregularity, rule-based separation, and behavioral similarity), which makes multi-method analysis valuable for real-world fraud prevention systems.
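One possible way to combine the two kinds of signal is sketched below: a transaction is flagged if either a classifier's probability exceeds a threshold or its amount is a Z-score outlier. The "either rule fires" combination, the function names, the thresholds, and the example values are all assumptions for illustration. The project may combine its models differently.

```python
import statistics


def z_score_flags(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation; must be non-zero
    return [abs(v - mean) / std > threshold for v in values]


def combined_flags(probabilities, values, tau=0.5, threshold=3.0):
    """Flag a transaction if the classifier rule or the Z-score rule fires."""
    outliers = z_score_flags(values, threshold)
    return [p > tau or is_outlier for p, is_outlier in zip(probabilities, outliers)]


# Hypothetical transaction amounts and classifier outputs (not project data).
amounts = [20, 25, 22, 24, 21, 23, 19, 26, 22, 24, 23, 500]
fraud_probs = [0.05, 0.10, 0.04, 0.80, 0.06, 0.03, 0.07, 0.05, 0.02, 0.08, 0.04, 0.30]

flags = combined_flags(fraud_probs, amounts)
flagged_indices = [i for i, flagged in enumerate(flags) if flagged]
```

Here the classifier alone catches the transaction at index 3, while the Z-score rule alone catches the unusually large amount at index 11, so each method contributes a case the other misses. Note that a Z-score threshold of 3 can only fire when the sample is large enough, since a single point's Z-score is bounded by the sample size.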