Logistic Regression

Detecting Rare Disease Cases in a Highly Imbalanced Dataset

About this analysis

In this project, I worked on a highly imbalanced binary classification problem where the goal was to predict whether a person has a disease or not. Since the positive class was very rare, standard accuracy was not useful as the main evaluation metric. A model could easily achieve high accuracy by predicting most cases as healthy, while still failing to identify actual disease cases. For that reason, the modeling process was designed around a recall-first objective.

The workflow started by loading the dataset, separating the input features from the target variable, and applying a stratified train-validation-test split so that the original class imbalance was preserved across each subset. Rather than changing the class distribution through oversampling or undersampling, the original skewed structure of the data was intentionally maintained. This allowed the imbalance problem to be handled at the modeling stage instead of by altering the dataset itself.
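
The stratified three-way split described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the 60/20/20 proportions, the helper name `stratified_three_way_split`, and the synthetic data are all assumptions, since the write-up does not state the split sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_three_way_split(X, y, val_size=0.2, test_size=0.2, seed=42):
    """Split into train/validation/test while keeping the class ratio in each part.

    The default proportions are assumed for illustration only.
    """
    # First carve off the test set, stratifying on the target.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Then split the remainder; rescale val_size so it refers to the full dataset.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Synthetic stand-in data: 5,000 rows with a 2% positive rate (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 4))
y_demo = np.zeros(5000, dtype=int)
y_demo[:100] = 1

train, val, test = stratified_three_way_split(X_demo, y_demo)
for name, (_, y_part) in zip(["train", "val", "test"], [train, val, test]):
    # Each subset should show roughly the same positive rate as the full data.
    print(name, len(y_part), round(float(y_part.mean()), 4))
```

No resampling is applied anywhere in this sketch, so every subset keeps the original skew.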

A preprocessing and modeling pipeline was then built using scikit-learn. Numerical variables were scaled, categorical variables were one-hot encoded, and Logistic Regression was used as the baseline model. Because false negatives were especially costly in this disease-prediction setting, the classifier was trained with class_weight='balanced', which weights each class inversely to its frequency so that errors on the minority class receive a stronger penalty during learning. Hyperparameter tuning was performed with GridSearchCV, with recall as the scoring metric so that model selection matched the actual goal of the project.
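
A pipeline of this shape might look like the sketch below. The column names (`age`, `bmi`, `smoker`), the grid over `C`, the three-fold cross-validation, and the synthetic training data are all hypothetical, since the write-up does not list the dataset's schema or the hyperparameters that were searched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real schema is not given in the write-up.
NUMERIC_COLS = ["age", "bmi"]
CATEGORICAL_COLS = ["smoker"]

def build_search(cv_splits=3):
    """Scaling + one-hot encoding + class-weighted logistic regression, tuned for recall."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLS),
    ])
    model = Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    # Assumed grid: only the regularization strength C is searched here.
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
    cv = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=0)
    return GridSearchCV(model, param_grid, scoring="recall", cv=cv)

# Synthetic training data with about 3% positives (illustrative only).
rng = np.random.default_rng(0)
n = 2000
X_train = pd.DataFrame({
    "age": rng.normal(50, 12, n),
    "bmi": rng.normal(27, 4, n),
    "smoker": rng.choice(["yes", "no"], n),
})
# Label the top 3% of a noisy age score as positive so there is some signal to learn.
score = X_train["age"].to_numpy() + rng.normal(0, 5, n)
y_train = (score > np.quantile(score, 0.97)).astype(int)

search = build_search()
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refit on each cross-validation fold, so no information from the held-out fold leaks into training.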

One of the most important parts of the workflow was threshold tuning. Instead of relying on the default probability cutoff of 0.5, predicted probabilities on the validation set were examined across multiple thresholds. For each threshold, recall, precision, and F1-score were calculated. This made it possible to treat classification not as a fixed output, but as a controllable decision process. Lower thresholds increased recall and helped catch more true disease cases, while higher thresholds reduced false positives but risked missing actual positives.
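
The threshold sweep can be illustrated with a few lines of NumPy. The helper name `sweep_thresholds` and the toy labels and probabilities are made up for the example; in the project the probabilities would come from the fitted model's predictions on the validation set.

```python
import numpy as np

def sweep_thresholds(y_true, proba, thresholds):
    """Return one dict per threshold with precision, recall, and F1."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    rows = []
    for t in thresholds:
        pred = proba >= t  # flag as positive when probability reaches the cutoff
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"threshold": float(t), "precision": precision,
                     "recall": recall, "f1": f1})
    return rows

# Toy validation labels and predicted probabilities (illustrative only).
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

table = sweep_thresholds(y_val, p_val, [0.3, 0.5, 0.8])
for row in table:
    print(row)
```

On this toy data, recall falls from 1.0 to 2/3 to 1/3 as the threshold rises from 0.3 to 0.5 to 0.8, while precision climbs from 3/7 to 2/3 to 1.0.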

A constrained threshold selection strategy was applied to reflect the recall-first objective. Only thresholds that achieved at least 0.90 recall on the validation set were retained, and among them the threshold with the highest F1-score, favoring higher precision, was selected. This created a more practical balance between sensitivity and over-prediction than blindly maximizing recall at all costs.
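
One way to express this selection rule in code is sketched below. The exact tie-breaking order (F1 first, then precision, then the higher threshold) is an assumption about how the rule was applied, and the candidate values are invented for illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def select_threshold(rows, min_recall=0.90):
    """Pick the best threshold among those meeting the recall floor.

    Assumed ordering: highest F1, then highest precision, then highest threshold.
    Returns None when no candidate reaches min_recall.
    """
    eligible = [r for r in rows if r["recall"] >= min_recall]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r["f1"], r["precision"], r["threshold"]))

# Invented candidates: (threshold, precision, recall), not the project's results.
candidates = [(0.2, 0.020, 1.00), (0.3, 0.030, 0.95),
              (0.4, 0.040, 0.91), (0.5, 0.080, 0.80)]
rows = [{"threshold": t, "precision": p, "recall": r, "f1": f1_score_from(p, r)}
        for t, p, r in candidates]

best = select_threshold(rows)
print(best["threshold"])  # 0.4
```

In this toy example the threshold 0.5 has the highest F1 overall, but it is excluded because its recall of 0.80 falls below the 0.90 floor, which is exactly the behavior a recall-first constraint is meant to enforce.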

The validation results showed that the model was highly effective at identifying disease cases. Out of 23 actual positive cases in the validation set, 21 were correctly detected and 2 were missed, producing a recall of 0.913. This indicates that the classifier succeeded in keeping false negatives low, which was the main priority of the project. However, this high recall came with a large number of false positives: precision fell to 0.037, meaning that only about 1 in 27 flagged cases was an actual disease case. This tradeoff is expected in highly imbalanced medical-style classification problems, especially when threshold selection is intentionally shifted toward sensitivity.
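
The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above (23 actual positives, 21 detected, precision 0.037); because the precision is rounded to three decimals, the implied false-positive count is an estimate rather than an exact value from the confusion matrix.

```python
# Figures reported for the validation set.
actual_positives = 23
true_positives = 21
precision_reported = 0.037  # rounded, so derived counts are approximate

false_negatives = actual_positives - true_positives
recall = true_positives / actual_positives

# precision = TP / (TP + FP), so TP + FP is roughly TP / precision.
flagged_est = true_positives / precision_reported
fp_est = flagged_est - true_positives

print(false_negatives)     # 2
print(round(recall, 3))    # 0.913
print(round(flagged_est))  # 568
print(round(fp_est))       # 547
```

In other words, catching 21 of the 23 disease cases at this threshold meant flagging on the order of 550 healthy people as well.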

From a broader performance perspective, the model achieved a ROC-AUC of approximately 0.896 and a PR-AUC of 0.247 on the validation set. In a dataset where the positive class rate is extremely low, PR-AUC provides a more meaningful picture than accuracy alone. A random ranking would score a PR-AUC roughly equal to the positive class rate, so 0.247 should be judged against that low baseline rather than against 1.0. The results suggest that the model is able to rank disease risk reasonably well, even though the final decision threshold strongly affects how aggressively positive cases are flagged.
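
Both ranking metrics are threshold-free and can be computed directly from predicted scores. The sketch below uses synthetic labels and scores, and it assumes PR-AUC is computed as average precision via scikit-learn's `average_precision_score`; the write-up does not say which PR-AUC estimator was used.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic validation set with a 2% positive rate (not the project's data).
rng = np.random.default_rng(0)
y_val = np.zeros(2000, dtype=int)
y_val[:40] = 1

# Scores that are higher on average for positives, mimicking a useful ranker.
scores = rng.normal(0.0, 1.0, size=2000) + 1.5 * y_val

roc_auc = roc_auc_score(y_val, scores)
pr_auc = average_precision_score(y_val, scores)  # assumed PR-AUC estimator
baseline = float(y_val.mean())  # PR-AUC expected from a random ranking

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC:  {pr_auc:.3f} (random baseline is about {baseline:.3f})")
```

Comparing PR-AUC with the prevalence baseline shows how much the model improves on random ranking, which a ROC-AUC near 0.9 alone can overstate when positives are rare.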

This project shows a practical end-to-end machine learning workflow for rare-event detection. The process included preserving the original imbalance, using stratified train-validation-test splitting, building a preprocessing and modeling pipeline, applying class-weighted learning, tuning hyperparameters with recall as the optimization target, selecting a threshold based on validation performance, and evaluating the model with a confusion matrix, recall, precision, F1-score, ROC-AUC, and PR-AUC. Overall, the project demonstrates how machine learning can be adapted to domain-specific priorities, especially in cases where missing a positive case is more costly than producing false alarms.