Decision Tree Regression

About this analysis

Predicting house prices is a common regression problem, and Decision Trees are a natural choice when you want to capture nonlinear relationships and interactions between features. Here’s how the workflow is structured in this project.


1. Loading and Splitting Data

The first step is loading the dataset and splitting it into training, validation, and test sets:

X_train, X_val, X_test

This ensures that the model learns from the training data, is tuned on the validation data, and is finally evaluated on the test data to measure generalization performance.

  • Training set: 7,000 samples
  • Validation set: 1,500 samples
  • Test set: 1,500 samples

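Holding out validation and test data gives reliable performance metrics, provided every preprocessing step (imputers, encoder) is fitted on the training set only. Fitting those steps on the full dataset would leak information from the held-out sets into training.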
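One way to obtain a 7,000 / 1,500 / 1,500 split is to call train_test_split twice. The sketch below is only an illustration: the synthetic DataFrame, the price column name, and the random_state values are assumptions, since the original loading code is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (10,000 rows); not the real dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "square_feet": rng.integers(600, 4000, size=10_000),
    "price": rng.normal(300_000, 50_000, size=10_000),  # hypothetical target column
})
X, y = df.drop(columns="price"), df["price"]

# Hold out 30% of the rows, then split that holdout evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 7000 1500 1500
```

Splitting before any imputation or encoding keeps the validation and test rows out of every fitted preprocessing step.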


2. Identifying Feature Types

Next, numeric and categorical features are separated:

numeric_cols = ['square_feet', 'num_bedrooms', 'num_bathrooms', 'lot_size', 'year_built']
categorical_cols = ['has_garage', 'neighborhood']

  • Numeric features (continuous measurements such as square_feet, or counts such as num_bedrooms) are used directly in splits.
  • Categorical features require encoding so the tree can handle them as numeric inputs.
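The later snippets refer to X_train_num and X_train_cat without showing where they come from. A plausible reading is that they are simple column selections, sketched here on a hypothetical two-row frame (the sample values are invented for illustration only).

```python
import pandas as pd

numeric_cols = ["square_feet", "num_bedrooms", "num_bathrooms", "lot_size", "year_built"]
categorical_cols = ["has_garage", "neighborhood"]

# Hypothetical two-row sample; the real dataset's values are not shown in this write-up.
X_train = pd.DataFrame({
    "square_feet": [1500, 2200],
    "num_bedrooms": [3, 4],
    "num_bathrooms": [2, 3],
    "lot_size": [5000, 7500],
    "year_built": [1995, 2010],
    "has_garage": ["yes", "no"],
    "neighborhood": ["north", "south"],
})

# Select each group of columns so they can be imputed and encoded separately.
X_train_num = X_train[numeric_cols]
X_train_cat = X_train[categorical_cols]

print(X_train_num.shape, X_train_cat.shape)  # (2, 5) (2, 2)
```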

3. Imputation and Encoding

Missing values are handled separately for numeric and categorical data:

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

Then, categorical variables are one-hot encoded:

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_cat_enc = encoder.fit_transform(X_train_cat)

  • Imputers make the dataset complete so the tree never sees missing values.
  • Encoding converts text categories into numeric columns, allowing the tree to split on them.
  • The encoder is fitted only on the training set to avoid leaking information from validation or test data.

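After encoding, the numeric and encoded categorical data are combined into a final feature matrix. Because scikit-learn transformers return NumPy arrays by default, both pieces need to be DataFrames sharing the same index for pd.concat to line them up: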

X_train_final = pd.concat([X_train_num, X_train_cat_enc], axis=1)

4. Model Definition and Hyperparameter Tuning

The DecisionTreeRegressor is instantiated:

dt = DecisionTreeRegressor(random_state=42)

A RandomizedSearchCV explores different hyperparameters efficiently:

param_dist = {
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    # "auto" is omitted: it was removed in scikit-learn 1.3 and, for regression trees, was equivalent to None
    "max_features": ["sqrt", "log2", None]
}

  • The search uses 10 candidates with 10-fold cross-validation.
  • It automatically selects the combination with the best mean cross-validated score on the held-out folds.
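A search with 10 candidates and 10 folds might be configured roughly as below. The synthetic data from make_regression and the scoring metric are assumptions made so the sketch runs on its own; the original scoring choice is not stated.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared feature matrix.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

param_dist = {
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,                              # 10 sampled candidates
    cv=10,                                  # 10-fold cross-validation
    scoring="neg_root_mean_squared_error",  # assumed metric, not stated in the write-up
    random_state=42,
)
search.fit(X, y)

# With the default refit=True, the best candidate is retrained on all the data passed to fit.
best_dt = search.best_estimator_
print(search.best_params_)
```

Each candidate is scored on the fold it was not trained on, and best_params_ reports the candidate with the highest mean score across the 10 folds.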

Best hyperparameters found by the search:

{'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': None, 'max_depth': 5}

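A shallow tree (max_depth=5) that needs at least 10 samples to split a node and keeps at least 2 samples in every leaf cannot partition the data too finely, which curbs overfitting.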


5. Training and Evaluation

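The tuned model is trained on the training set and then evaluated on the training, validation, and test sets. The training-set call looks like this, and the other two follow the same pattern: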

y_train_pred = best_dt.predict(X_train_final)
evaluate(y_train, y_train_pred, "TRAIN")

Metrics used:

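  • MAE: mean absolute error, the average size of the errors in price units
  • RMSE: root mean squared error, which penalizes large errors more heavily than MAE
  • R²: proportion of the variance in price explained by the model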
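The evaluate helper called above is not shown in this write-up. A minimal version consistent with the three metrics could look like the following; the function body and the four demo prices are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, label):
    """Print and return MAE, RMSE, and R² for one data split (hypothetical reconstruction)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the mean squared error
    r2 = r2_score(y_true, y_pred)
    print(f"{label}: MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Tiny made-up example: errors of 10k, 10k, 20k, and 0.
y_true = np.array([200_000, 250_000, 300_000, 350_000])
y_pred = np.array([210_000, 240_000, 320_000, 350_000])
metrics = evaluate(y_true, y_pred, "DEMO")  # DEMO: MAE=10,000  RMSE=12,247  R2=0.952
```

The single 20k miss pulls RMSE above MAE, which is the sense in which RMSE penalizes larger errors.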

Results indicate:

  • Training R² = 0.819 → good fit
  • Validation R² = 0.795, Test R² = 0.810 → stable, minimal overfitting
  • MAE/RMSE (~16–20k) → reasonable for house price scale

6. Feature Importance

Finally, feature importances are computed:

feature_importance = pd.DataFrame({
    "feature": X_train_final.columns,
    "importance": best_dt.feature_importances_
}).sort_values(by="importance", ascending=False)

Top contributors:

  • square_feet dominates (~93%)
  • num_bedrooms and some neighborhood indicators provide minor contributions
  • Remaining features have negligible influence in this tree configuration

This shows which features the Decision Tree relies on for its predictions.
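For context on how to read these numbers: scikit-learn's feature_importances_ are impurity-based and normalized to sum to 1, so a value near 0.93 means that feature accounts for about 93% of the total impurity reduction in the fitted tree. The sketch below illustrates this on synthetic data in which price is driven mostly by square_feet; the data-generating formula and the noise column are invented for the example and are not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: price depends strongly on square_feet, weakly on num_bedrooms.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "square_feet": rng.uniform(600, 4000, n),
    "num_bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(size=n),  # irrelevant feature
})
y = 150 * X["square_feet"] + 5_000 * X["num_bedrooms"] + rng.normal(0, 10_000, n)

tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=2, random_state=42).fit(X, y)

feature_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": tree.feature_importances_,
}).sort_values(by="importance", ascending=False)

print(feature_importance)
print("sum of importances:", feature_importance["importance"].sum())
```

A shallow tree tends to spend most of its few splits on the strongest feature, so weaker features can show near-zero importance even when they carry some signal.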


7. Key Takeaways

  • Workflow: Split → Impute → Encode → Tune → Train → Evaluate → Interpret
  • Decision Tree: captures nonlinear patterns and handles mixed numeric/categorical features once the categorical ones are encoded
  • Model behavior: generalizes well, stable validation/test metrics
  • Interpretation: feature importance points to house size (square_feet, ~93%) as the dominant driver of the tree's predictions, with num_bedrooms and neighborhood playing only minor roles