In this project, an end-to-end machine learning workflow was developed to predict whether an individual earns more than $50K per year using the Adult Census Income dataset. The objective was not limited to training a classifier, but extended to demonstrating a complete tabular machine learning pipeline covering data cleaning, exploratory analysis, preprocessing, model comparison, hyperparameter tuning, and final evaluation on unseen test data.
The dataset was first examined in terms of structure, data types, class distribution, and categorical value counts using pandas and NumPy. During the initial data audit, it was identified that some missing information was not represented as standard NaN values, but instead appeared as question marks in categorical columns such as workclass, occupation, and native_country. These placeholders were treated as missing values and were replaced or imputed according to the context of each feature. Rows that carried almost no signal, such as the extremely rare Holand-Netherlands entry in native_country (spelled as it appears in the dataset), were also removed, and less useful fields such as fnlwgt, which is a census sampling weight rather than a personal attribute, were excluded from the modeling process.
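The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows are made up, the raw column name `native.country` is assumed from the dotted naming of `education.num`, the helper `clean_adult` is hypothetical, and mode imputation is only one possible reading of "replaced or imputed according to the context of each feature".

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the raw file; column names are assumptions.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "?", "Private", "Private", "?"],
    "fnlwgt": [77516, 83311, 215646, 234721, 338409],
    "occupation": ["Adm-clerical", "?", "Handlers-cleaners", "Sales", "?"],
    "native.country": ["United-States", "United-States", "?",
                       "Holand-Netherlands", "Cuba"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

def clean_adult(df):
    """Hypothetical helper: mark '?' as missing, drop rare rows and fnlwgt."""
    out = df.replace("?", np.nan)
    # Remove the extremely rare category (rows with a missing country are kept).
    out = out[out["native.country"] != "Holand-Netherlands"]
    out = out.drop(columns=["fnlwgt"])
    # Assumption: fill each categorical gap with that column's most frequent value.
    for col in ["workclass", "occupation", "native.country"]:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out.reset_index(drop=True)

cleaned = clean_adult(raw)
print(cleaned)
```

An alternative to mode imputation would be to keep missingness as its own category (for example an "Unknown" label), which preserves the information that a value was not reported.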
A key preprocessing decision involved feature representation. Since education and education.num described the same underlying concept, the ordinal numeric version was retained while the text-based categorical duplicate was removed to reduce redundancy. Several columns were also renamed into cleaner snake_case format to improve readability and ensure smoother downstream use in Python and scikit-learn.
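A short sketch of this step, under the assumption that the raw columns use dotted names such as `education.num`: it first checks that every education label maps to a single numeric level, then drops the text column and renames the rest to snake_case. The sample rows and the variable `consistent_mapping` are illustrative only.

```python
import pandas as pd

# Illustrative rows; the education / education.num pairs follow the dataset's coding.
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "education.num": [13, 9, 13, 14],
    "marital.status": ["Never-married", "Divorced",
                       "Married-civ-spouse", "Never-married"],
    "hours.per.week": [40, 40, 45, 50],
})

# Before dropping the text column, confirm each label has exactly one numeric level.
consistent_mapping = bool(
    df.groupby("education")["education.num"].nunique().eq(1).all()
)

# Keep the ordinal numeric version and remove the redundant text version.
df = df.drop(columns=["education"])

# Convert dotted column names to snake_case.
df.columns = df.columns.str.replace(".", "_", regex=False)
print(consistent_mapping, list(df.columns))
```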
Following the cleaning stage, the target column, income, was separated from the predictor features, and a structured preprocessing pipeline was constructed. Categorical variables were transformed through one-hot encoding, while numerical variables were scaled to standardize their ranges. This preprocessing logic was integrated into a scikit-learn pipeline so that the encoder and scaler are fitted only on the training data and then applied unchanged to validation and test data, which keeps the workflow consistent and reduces the risk of data leakage.
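One way such a pipeline could be assembled with scikit-learn is shown below. It is a sketch under stated assumptions: the rows are made up, the column lists are illustrative, and Logistic Regression is used only as a placeholder final estimator.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample rows using snake_case column names.
data = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52],
    "education_num": [13, 13, 9, 7, 13, 14, 5, 9],
    "hours_per_week": [40, 13, 40, 40, 40, 40, 16, 45],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private",
                  "Private", "Private", "Private", "Self-emp-not-inc"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K",
               ">50K", ">50K", "<=50K", ">50K"],
})

# Separate the target from the predictors.
X = data.drop(columns="income")
y = (data["income"] == ">50K").astype(int)

numeric_cols = ["age", "education_num", "hours_per_week"]
categorical_cols = ["workclass"]

# Scale numeric columns and one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Bundling preprocessing with the estimator means fit() learns the scaling
# statistics and categories from the training data only.
model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
predictions = model.predict(X)
```

Setting `handle_unknown="ignore"` lets the encoder tolerate categories at prediction time that were absent during fitting, which is a common choice for sparse categorical features.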
To compare modeling approaches, four classification algorithms were trained and tuned: Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting. A train/validation/test split with stratification was used so that each subset preserved the original class proportions, and RandomizedSearchCV was applied to explore hyperparameter combinations efficiently. Model performance was compared using validation-set metrics, with F1-score selected as the primary criterion because it balances precision and recall for the minority >50K class, whereas accuracy alone can look strong on an imbalanced dataset even when many positive cases are missed.
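The split-and-search procedure can be illustrated with the sketch below. It runs on synthetic data from `make_classification` rather than the census data, and the split ratios, search ranges, `n_iter`, and `cv` values are assumptions chosen to keep the example fast, not the settings used in the project.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Stratified 60/20/20 split: carve off the test set first, then validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)

# Randomized search samples a fixed number of candidates from these ranges.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 60),
        "learning_rate": uniform(0.01, 0.29),  # samples from [0.01, 0.30]
        "max_depth": randint(2, 5),
    },
    n_iter=5,
    scoring="f1",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Compare candidates across model families on the validation set via F1.
val_f1 = f1_score(y_val, search.predict(X_val))
print(search.best_params_, round(val_f1, 3))
```

The test portion stays untouched until a single final model has been chosen, so the reported test metrics are not influenced by model selection.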
Among the tested models, Gradient Boosting produced the strongest results. After tuning, the final selected model was evaluated on the held-out test set, where it achieved an accuracy of 0.878, a precision of 0.792, a recall of 0.667, and an F1-score of 0.724, indicating solid generalization performance on unseen data. In practical terms, roughly 79% of the individuals predicted to earn more than $50K actually did, while about one third of the true high earners were missed, so recall on the higher-income class remains the main area for improvement.
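As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The variable names below are illustrative.

```python
# Reported test-set metrics.
precision = 0.792
recall = 0.667

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of actual >50K earners that the model failed to identify.
missed_share = 1 - recall

print(f"F1 = {f1:.3f}")                          # F1 = 0.724
print(f"Missed positives = {missed_share:.1%}")  # Missed positives = 33.3%
```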