Forest Cover Type Classification with XGBoost

End-to-end seven-class classification pipeline built on the UCI Forest CoverType dataset (580K+ instances, 54 features), predicting forest cover type from cartographic and environmental features.

About this project

Dataset

The Forest CoverType classification project used the UCI Forest CoverType dataset to predict the dominant forest cover class from cartographic and environmental features. The dataset describes forest areas using variables such as elevation, aspect, slope, distances to hydrology, roadways and fire points, hillshade measurements at different times of the day, wilderness area indicators, and soil type indicators. The target variable, Cover_Type, represents one of seven forest cover classes. Because the goal was to predict a categorical class with seven possible outcomes, this was treated as a multiclass classification problem.

Before model training, the dataset was loaded and inspected using pandas. The shape of the data, column names, data types, first rows, and descriptive statistics were reviewed to understand its structure. Column names were converted to a consistent lowercase format with underscores, making them easier to use in Python code. The target column was identified as cover_type, while all remaining columns were treated as model features. The original target classes were labeled 1 to 7, so they were shifted to 0 to 6, because XGBoost expects multiclass labels to start from 0.
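The column cleaning and label shift described above can be sketched as follows. This is a minimal illustration on a three-row toy frame, not the project's actual code; the helper name clean_column_name and the toy values are assumptions made for the example.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Lowercase a column name and join its parts with underscores."""
    # Hypothetical helper: collapse any run of non-alphanumeric characters
    # into a single underscore, then lowercase the result.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# Toy stand-in for the loaded dataset (values are made up).
df = pd.DataFrame(
    {
        "Elevation": [2596, 2804, 3100],
        "Wilderness_Area4": [0, 1, 0],
        "Cover_Type": [1, 7, 3],
    }
)

df.columns = [clean_column_name(col) for col in df.columns]

# Shift labels from 1..7 to 0..6 so they start at 0.
df["cover_type"] = df["cover_type"] - 1

# Everything except the target is treated as a feature.
X = df.drop(columns="cover_type")
y = df["cover_type"]
```

When reading predictions back, adding 1 to a predicted label recovers the original 1 to 7 class code.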

The exploratory data analysis focused on understanding both the predictors and the target distribution. Missing values were checked across all columns, and no major missing-value issue was found. Duplicate rows were also checked to ensure that repeated records did not distort model training. The target distribution was examined to understand class imbalance. This step was important because some cover types had far more observations than others, meaning accuracy alone would not be enough to judge model quality. For this reason, macro-averaged metrics such as macro F1, macro precision, and macro recall were also used.
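The checks above, and the reason accuracy alone can mislead on imbalanced classes, can be illustrated with a small sketch. The toy frame and the 90/10 label split are invented for the example and do not come from the project data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy frame with one missing value and one duplicated row (made-up values).
df = pd.DataFrame(
    {
        "elevation": [2500, 2500, 3100, 2900],
        "slope": [10, 10, 5, None],
        "cover_type": [0, 0, 1, 1],
    }
)

missing_per_column = df.isna().sum()          # missing values per column
n_duplicates = int(df.duplicated().sum())     # rows identical to an earlier row
class_share = df["cover_type"].value_counts(normalize=True)

# Why macro metrics matter: a classifier that always predicts the
# majority class looks accurate but never finds the minority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f} macro_recall={macro_recall:.2f} macro_f1={macro_f1:.2f}")
```

In this contrived case accuracy is 0.90 while macro recall is only 0.50, because the minority class contributes a recall of zero and counts as much as the majority class in the average.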

The feature columns were separated into two main groups. Continuous numeric features included variables such as elevation, slope, aspect, hillshade values, and horizontal or vertical distances to environmental landmarks. The wilderness area and soil type variables were already represented as binary one-hot encoded columns, so no additional categorical encoding was necessary. A ColumnTransformer was used to scale the continuous variables with StandardScaler, while passing the already encoded binary variables through unchanged. Although XGBoost does not strictly require feature scaling because it is tree-based, including this preprocessing step made the pipeline more structured and reusable.
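A minimal sketch of this preprocessing step is shown below. The column names follow the cleaned naming used in this write-up, but the four-row frame and the way binary columns are selected by prefix are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy feature frame (made-up values) with two continuous and two binary columns.
X = pd.DataFrame(
    {
        "elevation": [2500.0, 2800.0, 3100.0, 3400.0],
        "slope": [5.0, 10.0, 15.0, 20.0],
        "wilderness_area4": [1, 0, 0, 1],
        "soil_type4": [0, 1, 0, 0],
    }
)

# Assumption: binary indicator columns can be recognised by their prefix.
binary_cols = [c for c in X.columns if c.startswith(("wilderness_area", "soil_type"))]
continuous_cols = [c for c in X.columns if c not in binary_cols]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), continuous_cols),  # standardise continuous features
        ("bin", "passthrough", binary_cols),         # leave 0/1 indicators untouched
    ]
)

# Output column order follows the transformer order: continuous first, then binary.
X_scaled = preprocessor.fit_transform(X)
```

In a real pipeline the transformer would be fitted on the training split only and then applied to the validation and test splits, so that scaling statistics do not leak across splits.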

The data was then split into training, validation, and test sets using stratified sampling. Stratification was important because the target classes were imbalanced, and each split needed to preserve approximately the same class proportions. The final split was 60% training, 20% validation, and 20% testing. The training set was used to fit the model, the validation set was used to compare baseline and tuned performance, and the test set was kept aside for the final unbiased evaluation.
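One common way to obtain a stratified 60/20/20 split with scikit-learn is two successive calls to train_test_split, sketched below. The function name split_60_20_20, the seed, and the synthetic labels are assumptions; the write-up does not state how the split was implemented.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_60_20_20(X, y, seed=42):
    """Stratified 60/20/20 split via two successive splits (illustrative helper)."""
    # First carve off 20% of the data as the held-out test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed
    )
    # 25% of the remaining 80% equals 20% of the full data.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic imbalanced labels: 70% / 20% / 10%.
y = np.array([0] * 700 + [1] * 200 + [2] * 100)
X = np.arange(len(y)).reshape(-1, 1)

train, val, test = split_60_20_20(X, y)
for name, (_, labels) in {"train": train, "val": val, "test": test}.items():
    print(name, len(labels), np.bincount(labels) / len(labels))
```

Each split should show roughly the same 70/20/10 class proportions as the full label vector, which is the property stratification is meant to preserve.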

A baseline XGBoost classifier was first trained using a standard multiclass configuration. The objective was set to multi:softprob, the number of classes was set to 7, and mlogloss was used as the evaluation metric. The baseline model achieved a validation accuracy of 0.8688 and a macro F1-score of 0.8520. This already showed that XGBoost was able to learn meaningful patterns from the dataset. However, the class-level results showed that performance was not equal across all classes. For example, class 4 had a much lower recall of 0.55, meaning many true examples of that class were missed by the baseline model. This is why macro metrics were useful: they made weaker class performance more visible.

After the baseline model, hyperparameter tuning was applied using RandomizedSearchCV. The search tested 20 different parameter combinations with 3-fold cross-validation, resulting in 60 total fits. The tuning process searched over important XGBoost parameters such as n_estimators, max_depth, learning_rate, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, and reg_lambda. The best cross-validation macro F1-score was 0.9208, well above the baseline's validation macro F1 of 0.8520. The two figures come from different evaluation sets, so the comparison is indicative rather than exact, but it pointed to a substantial gain from tuning.
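The search setup can be sketched as below. The parameter names, n_iter=20, and 3-fold cross-validation come from the description above; the candidate value lists, the f1_macro scoring string, the seed, and the helper name build_search are assumptions, since the actual search ranges are not given.

```python
from sklearn.model_selection import ParameterSampler, RandomizedSearchCV

# Hypothetical search space: the parameter names match the write-up,
# but the candidate values are illustrative only.
param_distributions = {
    "n_estimators": [300, 500, 700, 900],
    "max_depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.7, 0.8, 0.9, 1.0],
    "min_child_weight": [1, 3, 5],
    "gamma": [0, 0.1, 0.3],
    "reg_alpha": [0, 0.01, 0.1],
    "reg_lambda": [1.0, 1.5, 2.0],
}

N_ITER = 20    # sampled parameter combinations
CV_FOLDS = 3   # cross-validation folds per combination


def build_search(estimator):
    """Wrap an estimator in a randomized search scored by macro F1."""
    return RandomizedSearchCV(
        estimator=estimator,
        param_distributions=param_distributions,
        n_iter=N_ITER,
        cv=CV_FOLDS,
        scoring="f1_macro",
        random_state=42,
        n_jobs=-1,
    )


# RandomizedSearchCV draws its candidates with ParameterSampler, so the
# number of cross-validation fits can be counted without training anything.
candidates = list(ParameterSampler(param_distributions, n_iter=N_ITER, random_state=42))
total_fits = len(candidates) * CV_FOLDS
print(total_fits)  # 20 candidates x 3 folds = 60
```

A search would then be run with something like build_search(XGBClassifier(objective="multi:softprob")).fit(X_train, y_train), after which best_params_ and best_score_ hold the winning combination and its mean cross-validated macro F1.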

The best parameter combination used 700 estimators, a maximum depth of 10, a learning rate of 0.05, subsample=0.9, colsample_bytree=0.8, min_child_weight=3, gamma=0, reg_alpha=0.01, and reg_lambda=1.5. These values suggest that the tuned model benefited from a deeper and more expressive tree structure while still using regularization and sampling controls to reduce overfitting. The lower learning rate combined with a larger number of estimators allowed the model to learn more gradually and improve performance.

On the validation set, the tuned XGBoost model achieved an accuracy of 0.9509, a macro F1-score of 0.9324, macro precision of 0.9400, and macro recall of 0.9254. Compared with the baseline model, this was a strong improvement: accuracy increased from 0.8688 to 0.9509 (+0.0821), and macro F1 increased from 0.8520 to 0.9324 (+0.0804). Tuning therefore improved not only the overall share of correct predictions but also performance on the less frequent classes.

The validation classification report confirmed this improvement clearly. The tuned model performed very well on the two largest classes, class 0 and class 1, with F1-scores of 0.95 and 0.96. It also improved performance for smaller classes. Class 4, which was weak in the baseline model with an F1-score of 0.67, improved to 0.87 after tuning. This is one of the most important results because it shows that hyperparameter tuning helped the model become more balanced across classes, not just better on the majority classes.

After choosing the tuned model, the final model was trained on the combined training and validation data, then evaluated once on the held-out test set. This final evaluation produced an accuracy of 0.9550, macro F1-score of 0.9370, macro precision of 0.9474, and macro recall of 0.9274. The test performance was very close to the validation performance, which is a good sign. It suggests that the tuned model generalized well and did not simply overfit the validation set.
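The refit-then-test step can be sketched with a small helper. The function name refit_and_score is hypothetical, and the helper assumes pandas inputs and any scikit-learn compatible estimator; it is an illustration of the procedure, not the project's code.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def refit_and_score(estimator, X_train, y_train, X_val, y_val, X_test, y_test):
    """Refit on train + validation data, then score once on the test set."""
    # Combine the two splits that were used during model selection.
    X_full = pd.concat([X_train, X_val], axis=0)
    y_full = pd.concat([y_train, y_val], axis=0)

    # clone() gives an unfitted copy with the same hyperparameters.
    final_model = clone(estimator).fit(X_full, y_full)
    y_pred = final_model.predict(X_test)

    scores = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
        "macro_precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "macro_recall": recall_score(y_test, y_pred, average="macro", zero_division=0),
    }
    return final_model, scores
```

Passing an estimator configured with the tuned parameters reported earlier would reproduce the procedure described here. The key design point is that the test split is touched exactly once, after all model selection decisions have been made.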

The final test classification report showed consistently strong performance across the seven cover types. Class 0 achieved an F1-score of 0.95, class 1 achieved 0.96, class 2 achieved 0.96, class 3 achieved 0.89, class 4 achieved 0.90, class 5 achieved 0.93, and class 6 achieved 0.97. The model performed especially well on classes 1, 2, and 6. The more difficult classes were class 3 and class 4, but even there the F1-scores remained strong at 0.89 and 0.90.

The difference between weighted and macro averages is also important. The weighted F1-score was 0.95, while the macro F1-score was 0.94. Because the macro average weights all seven classes equally, the small gap between the two indicates that the strong overall score was not driven by the majority classes alone; the model also handled the smaller classes reasonably well. This makes the final result more reliable for an imbalanced multiclass dataset.
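The two averages can be recomputed from the per-class test F1-scores listed above. The per-class scores are the rounded values from this write-up; the class counts are the published counts for the full Covertype dataset (581,012 rows), used here as an assumed proxy for the test-set supports, which a stratified split should approximately preserve.

```python
# Rounded per-class F1-scores from the test classification report (classes 0..6).
per_class_f1 = [0.95, 0.96, 0.96, 0.89, 0.90, 0.93, 0.97]

# Assumption: full-dataset class counts stand in for the test-set supports.
class_counts = [211840, 283301, 35754, 2747, 9493, 17367, 20510]

# Macro average: every class counts equally.
macro_f1 = sum(per_class_f1) / len(per_class_f1)

# Weighted average: each class counts in proportion to its size.
total = sum(class_counts)
weighted_f1 = sum(f1 * n for f1, n in zip(per_class_f1, class_counts)) / total

print(f"macro F1    = {macro_f1:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}")
```

Under these assumptions the macro average comes out near 0.937 and the weighted average near 0.954, in line with the reported figures. The weighted value is higher because the two largest classes, which together make up about 85% of the rows, both score 0.95 or better.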

Feature importance analysis showed that wilderness_area4 was the most influential feature, with an importance score of 0.1502. Several soil type variables were also highly important, including soil_type4, soil_type39, soil_type22, and soil_type2. Elevation was also among the top features, which makes sense because forest cover types are strongly related to environmental and geographic conditions. The importance of wilderness area and soil type features suggests that ecological zone and soil characteristics played a major role in distinguishing forest cover classes.
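A ranking like the one above can be produced from any fitted model that exposes feature_importances_, which XGBoost's scikit-learn wrapper does. The sketch below assumes that attribute was the source of the scores, and it uses a scikit-learn decision tree as a lightweight stand-in so the example stays small; the helper name top_features and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # stand-in for the tuned XGBoost model


def top_features(model, feature_names, k=10):
    """Return the k largest feature importances as a labeled Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)


# Toy data: the label depends only on elevation; the second column is constant.
X = pd.DataFrame(
    {
        "elevation": [0, 1, 2, 3, 4, 5, 6, 7],
        "soil_type4": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
ranking = top_features(model, X.columns, k=2)
print(ranking)
```

Importance scores of this kind are normalized to sum to 1 across all features, so a single feature with 0.1502 accounts for roughly 15% of the total. They describe how the model used each feature, not a causal effect.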

Overall, this project built a complete machine learning pipeline for multiclass forest cover prediction. The workflow included data loading, column cleaning, EDA, missing-value and duplicate checks, target preparation, feature grouping, preprocessing with ColumnTransformer, stratified train-validation-test splitting, baseline XGBoost modeling, hyperparameter tuning with cross-validation, final test evaluation, confusion matrix analysis, and feature importance interpretation. The tuned XGBoost model achieved strong final test performance with 95.5% accuracy and 0.937 macro F1-score, showing that it was effective at predicting forest cover type across all seven classes.

The main success of the project was not only the high final accuracy, but the improvement from baseline to tuned model. The baseline model was already useful, but it struggled more with minority classes. After tuning, macro F1 and macro recall improved substantially, meaning the final model became more balanced and reliable across the full set of cover types. This makes the project a strong example of how structured preprocessing, careful validation, and hyperparameter tuning can improve a multiclass XGBoost classification pipeline.

Reference: Blackard, J. (1998). Covertype [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N.
