This ML Penguin project used the unsimplified penguins_lter.csv dataset to build a structured multiclass classification pipeline for predicting penguin species. The target variable was species, and the model learned to classify penguins into three classes: Adelie, Chinstrap, and Gentoo. Instead of using the simplified version of the dataset, the original research-style dataset was chosen because it contained more realistic column names, extra biological variables, missing values, ID columns, and messy research metadata. This made the project more useful from a machine learning workflow perspective because it required proper inspection, cleaning, preprocessing, model tuning, and final evaluation.
The first stage focused on understanding the dataset before making any changes. The dataset originally contained research-based columns such as Sample Number, Region, Stage, Individual ID, Date Egg, Comments, body measurement variables, isotope variables, sex, island, clutch completion, and species. Before dropping anything, each column was inspected using data types, missing value counts, missing percentages, and unique value counts. This was important because columns should not be removed blindly. For example, Region and Stage were removed only after checking that they contained no useful variation. Sample Number and Individual ID were removed because they were identifier-like columns and would not generalize as meaningful biological predictors. The raw Comments column was also inspected first; although it contained some research notes, it had too many missing values and was too inconsistent as free text, so it was not kept as a direct feature.
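The inspection step described above can be sketched as a small helper. This is a minimal sketch, not the project's actual code: the function name inspect_columns, the toy frame, and its values are made up for illustration, and the toy column names only loosely mirror the raw file.

```python
import numpy as np
import pandas as pd

def inspect_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, missing percentage, and unique count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(dropna=True),
    })

# Tiny made-up frame standing in for the raw file (illustrative values only).
toy = pd.DataFrame({
    "Sample Number": [1, 2, 3, 4],
    "Region": ["Anvers"] * 4,
    "body_mass_g": [3750.0, np.nan, 3250.0, 3450.0],
    "Comments": [None, "Not enough blood for isotopes.", None, "Adult not sampled."],
})

summary = inspect_columns(toy)

# Columns with a single distinct value carry no variation for a model.
constant_cols = summary.index[summary["n_unique"] <= 1].tolist()

# Columns where every row is distinct are only *candidates* for identifiers;
# continuous measurements can also be all-unique, so this needs manual review.
all_unique_cols = summary.index[summary["n_unique"] == len(toy)].tolist()
```

On the toy frame, Region is flagged as constant and Sample Number as all-unique, which matches the kind of evidence the write-up cites before dropping those columns.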
The cleaning stage was handled carefully. The Sex column contained normal categories such as MALE and FEMALE, but also had missing values and one invalid "." value. The "." value was treated as missing, and missing sex values were filled with "Unknown" instead of forcing them into male or female. This was a better decision because the dataset was small, and randomly or manually assigning sex labels would have created fake certainty. The species names were also simplified from long scientific names into clean labels such as Adelie, Chinstrap, and Gentoo, which made the classification results easier to interpret.
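A minimal sketch of these two cleaning steps follows. The column names Sex and Species and the long species strings are assumptions about the raw file's layout, and taking the first word of the species name is just one simple way to shorten the labels.

```python
import numpy as np
import pandas as pd

# Made-up rows; the long species strings are assumed, not copied from the project.
toy = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", ".", np.nan],
    "Species": [
        "Adelie Penguin (Pygoscelis adeliae)",
        "Chinstrap penguin (Pygoscelis antarctica)",
        "Gentoo penguin (Pygoscelis papua)",
        "Adelie Penguin (Pygoscelis adeliae)",
    ],
})

# Treat the invalid "." as missing, then give all missing values an explicit category.
toy["Sex"] = toy["Sex"].replace(".", np.nan).fillna("Unknown")

# Keep only the first word of the long name: "Adelie", "Chinstrap", "Gentoo".
toy["Species"] = toy["Species"].str.split().str[0]
```

Keeping "Unknown" as its own category lets the one-hot encoder represent it later without guessing a sex for those rows.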
Missing numeric values were intentionally left unfilled until after the train-validation-test split. This order was important. If they had been filled using the whole dataset before splitting, the medians would have been calculated using information from the validation and test sets, which is data leakage. Instead, missing values in columns such as culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g, delta_15_n, and delta_13_c were handled later inside the preprocessing pipeline using median imputation. This means the imputer learned its medians only from the training data (during cross-validation, only from the training folds) each time a model was fitted, which is the correct workflow.
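The difference between the two orders can be seen on a toy example. This is a sketch with made-up body-mass values, not the project's data; it only shows that a median computed on all rows differs from one learned from training rows alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data: lighter birds in train, heavier birds in test.
X_train = np.array([[3000.0], [3200.0], [np.nan], [3400.0]])
X_test = np.array([[5000.0], [np.nan], [5400.0]])

# Correct order: the imputer learns its median from the training rows only.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
train_median = imputer.statistics_[0]

# The test rows are filled with the training median, never with their own.
X_test_filled = imputer.transform(X_test)

# Leaky order: the median is computed over train and test together.
leaky_median = np.nanmedian(np.vstack([X_train, X_test]))
```

Here the training-only median is 3200, while the median over all rows is 3400, so filling before the split would let test-set values influence what the model sees during training.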
Exploratory data analysis was also included before modeling. The project checked missing values, class distribution, numeric summaries, skewness, and outliers. Histograms and boxplots were used to understand the distributions of the numeric variables before scaling. The dataset did not show serious outlier problems overall. Some values looked naturally different across species, especially body mass and flipper length, but these differences are biologically meaningful rather than obvious data errors. Because of that, the project did not delete outliers. This was a good choice because the goal was species classification, and those measurement differences help the model distinguish species.
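A numeric companion to the boxplots might look like the sketch below. The 1.5 x IQR rule is an assumption (it is the usual boxplot whisker rule; the write-up does not say which outlier rule was used), and the values are made up.

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, boxplot-style)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return int(mask.sum())

# Made-up body masses: a lighter group and a heavier group, as with different species.
body_mass = pd.Series([3400, 3500, 3600, 3700, 3800, 4700, 5000, 5200, 5400, 5600])
natural_spread_outliers = iqr_outlier_count(body_mass)

# The same data with one implausible entry, such as a typing error.
with_error = pd.concat([body_mass, pd.Series([50000])], ignore_index=True)
error_outliers = iqr_outlier_count(with_error)

# Sample skewness as reported by pandas, for the numeric summary.
skew_value = body_mass.skew()
```

The two-group spread produces no flagged values, while the implausible entry is flagged, which is the distinction the paragraph draws between biological variation and data errors.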
Scaling was added as part of the preprocessing pipeline using StandardScaler. This was especially important for models such as Logistic Regression, SVM-style models, and distance-sensitive methods because the numeric features were measured on very different scales. For example, body_mass_g is in the thousands of grams, while culmen measurements are in the tens of millimeters, and isotope values have much smaller ranges. Standardization made the numeric variables comparable by subtracting each feature's mean and dividing by its standard deviation. Plots were also created before and after scaling to show what changed. The important point is that standardization is a linear transformation: it changes the location and scale of each variable but not the shape of its distribution. So the histograms after scaling had the same shapes as before, only with different x-axis values.
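The claim that scaling changes the axis but not the shape can be checked numerically. The sketch below uses synthetic right-skewed data (not the penguin measurements) and a small hand-written skewness helper, and compares the skewness before and after StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def skewness(x) -> float:
    """Population skewness: mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed values on a "thousands" scale (illustrative only).
rng = np.random.default_rng(0)
raw = rng.gamma(shape=4.0, scale=1000.0, size=300)

# StandardScaler expects a 2D array: one column, many rows.
scaled = StandardScaler().fit_transform(raw.reshape(-1, 1)).ravel()

skew_before = skewness(raw)
skew_after = skewness(scaled)
```

The scaled values have mean 0 and standard deviation 1, but the skewness is unchanged up to floating-point error, which is why the histograms keep their shape.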
After cleaning and EDA, the project defined the feature matrix and target variable. The selected features were island, clutch completion, culmen length, culmen depth, flipper length, body mass, sex, delta 15 N, and delta 13 C. The target was species. The data was split into training, validation, and test sets before any encoding, imputing, or scaling. This was one of the strongest parts of the pipeline because every preprocessing step was then fitted on training data only. The validation set was used to compare the tuned models, while the test set was kept untouched until the very end.
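A three-way split of this kind can be sketched with two calls to train_test_split. The 70/15/15 proportions, the stratification, and the random seed are assumptions, since the write-up does not state them, and the data here is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, first column doubles as a unique row id.
X = np.arange(200).reshape(100, 2)
y = np.array(["Adelie"] * 44 + ["Chinstrap"] * 20 + ["Gentoo"] * 36)

# Assumed proportions: 15% validation and 15% test, given as row counts.
n_holdout = int(round(0.15 * len(X)))

# First carve off the test set, stratified so all species are represented.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=n_holdout, stratify=y, random_state=42
)

# Then carve the validation set out of what remains.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=n_holdout, stratify=y_trainval, random_state=42
)
```

Passing the holdout size as a row count keeps the validation and test sets the same size. No imputer, encoder, or scaler has been fitted at this point, so nothing learned later can depend on the held-out rows.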
The preprocessing step used two separate pipelines: one for numeric columns and one for categorical columns. Numeric columns were handled with median imputation followed by StandardScaler. Categorical columns were handled with imputation and one-hot encoding. This separation was important because numeric and categorical variables need different preprocessing methods. The full preprocessing was wrapped inside a ColumnTransformer, and then connected to each model using a full scikit-learn pipeline. This made the workflow cleaner, safer, and reproducible.
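The two-branch preprocessing described above might be assembled as in this sketch. Only a subset of the columns is shown, the toy rows are made up, and the most_frequent strategy for categorical imputation and handle_unknown="ignore" are assumptions, since the write-up only says "imputation and one-hot encoding".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["culmen_length_mm", "body_mass_g"]
categorical_cols = ["island", "sex"]

# Numeric branch: median imputation, then standardization.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: imputation (strategy assumed), then one-hot encoding.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0.0 forces a dense array, which is easier to inspect here.
preprocessor = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0.0,
)

# Made-up training rows with one gap in each numeric column.
toy = pd.DataFrame({
    "culmen_length_mm": [39.1, np.nan, 46.5, 50.0],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5700.0],
    "island": ["Torgersen", "Dream", "Dream", "Biscoe"],
    "sex": ["MALE", "FEMALE", "Unknown", "MALE"],
})
Xt = preprocessor.fit_transform(toy)
```

The result has two scaled numeric columns plus three one-hot columns each for island and sex. Placing this preprocessor as the first step of a Pipeline with a classifier means it is refitted on the training portion of every cross-validation fold.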
Several classification models were trained and tuned: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, CatBoost, and LightGBM. Each model was tuned using RandomizedSearchCV with 5-fold cross-validation and 20 iterations, meaning 20 sampled hyperparameter settings per model. The scoring metric used for tuning was macro F1-score, which was a good choice because this was a multiclass classification problem and the classes were not perfectly balanced. Macro F1 is the unweighted mean of the per-class F1-scores, so each class counts equally and the largest class cannot dominate the evaluation.
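The tuning setup can be sketched for one model as follows. The search settings (20 iterations, 5 folds, macro F1) come from the text; the synthetic data, the search space for C, and the random seed are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the training set.
X, y = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

# Preprocessing lives inside the pipeline so it is refitted on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"model__C": loguniform(1e-3, 1e2)},  # assumed search space
    n_iter=20,            # 20 sampled settings
    cv=5,                 # 5-fold cross-validation
    scoring="f1_macro",   # unweighted mean of per-class F1
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_macro_f1 = search.best_score_
```

The same pattern would be repeated per model with a model-specific parameter space; the "model__" prefix routes each parameter to the pipeline step of that name.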
The validation results were very strong. Logistic Regression achieved a perfect validation score with 1.000 accuracy, 1.000 macro F1, 1.000 macro precision, and 1.000 macro recall. XGBoost, CatBoost, and LightGBM also achieved perfect validation metrics, while Random Forest and Gradient Boosting were only slightly lower. The Decision Tree also performed very well, but was slightly weaker than the best models. Logistic Regression, the simplest of the four models tied at the top, was selected as the best model. This is a useful result because it shows that a simpler model can match the boosted models and slightly outperform the other tree-based ones when the class separation is strong and the features are informative.
The final best model was then refitted on the combined training and validation data, and only then was it evaluated on the test set. This final test evaluation is the most important result because the test set was not used during model selection. On the test set, the final model achieved an accuracy of 0.9808, a macro F1-score of 0.9834, a macro precision of 0.9861, and a macro recall of 0.9815. These scores are slightly below the perfect validation scores but still very high, which shows that the model generalized well beyond the training and validation data.
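The refit-then-test step can be sketched as below. The data is synthetic, the split sizes and seed are assumptions, and a plain LogisticRegression stands in for the tuned pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data and an assumed train/validation/test split.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=45, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=45, stratify=y_trainval, random_state=42
)

# Stand-in for the model chosen on the validation set.
selected = LogisticRegression(max_iter=1000)

# Refit a fresh copy on training + validation rows combined.
final_model = clone(selected)
final_model.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))

# Touch the test set exactly once, after all selection decisions are made.
y_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_macro_f1 = f1_score(y_test, y_pred, average="macro")
```

Using clone gives an unfitted copy with the same hyperparameters, so nothing from earlier fits carries over into the final model.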
The classification report gives more detail. For Adelie penguins, the model achieved 0.96 precision, 1.00 recall, and 0.98 F1-score. This means every Adelie penguin in the test set was correctly found, but a small share of the penguins predicted as Adelie actually belonged to another species. For Chinstrap penguins, the model achieved perfect precision, recall, and F1-score, meaning it classified all Chinstrap examples correctly and never predicted Chinstrap for another species. For Gentoo penguins, the model achieved 1.00 precision, 0.94 recall, and 0.97 F1-score. This means every penguin predicted as Gentoo was correct, but some actual Gentoo penguins were missed. Because Chinstrap precision was perfect, the missed Gentoo cases must have been predicted as Adelie, which also explains the Adelie precision of 0.96. The reported figures are consistent with a single misclassified Gentoo penguin.
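The per-class and macro metrics can be reproduced from a confusion matrix. The counts below are hypothetical: the write-up does not give the test-set class sizes, and these were chosen only because they reproduce the reported scores with a single Gentoo penguin predicted as Adelie.

```python
import numpy as np

labels = ["Adelie", "Chinstrap", "Gentoo"]

# Rows = actual species, columns = predicted species.
# HYPOTHETICAL counts, not taken from the project.
cm = np.array([
    [23, 0, 0],   # all Adelie predicted as Adelie
    [0, 11, 0],   # all Chinstrap predicted as Chinstrap
    [1, 0, 17],   # one Gentoo predicted as Adelie
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # correct / all predicted as that class
recall = tp / cm.sum(axis=1)      # correct / all actual members of that class
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()
```

With these counts, Adelie precision is 23/24 (about 0.96), Gentoo recall is 17/18 (about 0.94), accuracy is 51/52, and the macro averages round to the reported 0.9861, 0.9815, and 0.9834.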
Overall, the model performed extremely well. The high scores are not surprising because penguin species are strongly separated by biological measurements such as culmen length, culmen depth, flipper length, body mass, and also by location-related variables such as island. However, the real value of this project is not only the high accuracy. The stronger achievement is the structure of the machine learning pipeline: inspecting before dropping columns, handling missing values correctly inside the pipeline, splitting data before preprocessing, scaling numeric variables properly, encoding categorical variables safely, tuning multiple models with cross-validation, comparing models on a validation set, and testing the final selected model only once at the end.