Ziraddin Gulumjanli

Predicting house prices is a common regression problem, and Decision Trees are a natural choice when you want to capture nonlinear relationships and interactions between features. Here’s how the workflow is structured in this project.

1. Loading and Splitting Data

The first step is loading the dataset and splitting it into training, validation, and test sets:

X_train, X_val, X_test

This ensures that the model learns from the training data, is tuned on the validation data, and is finally evaluated on the test data to measure generalization performance.

Training set: 7,000 samples
Validation set: 1,500 samples
Test set: 1,500 samples

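Keeping the validation and test sets separate from the training data is essential for reliable performance metrics, and it prevents data leakage as long as every preprocessing step is fitted on the training set only.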
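One way to get a 7,000 / 1,500 / 1,500 split is to call train_test_split twice. This is a minimal sketch on synthetic data, not the project's actual loading code: the arrays X and y, the y_* names, and the random_state values are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 10,000 rows, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
y = rng.normal(size=10_000)

# First hold out 3,000 rows, then split that holdout evenly into
# validation and test sets (integer test_size means a row count).
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=3000, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_holdout, y_holdout, test_size=1500, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing row counts rather than fractions keeps the set sizes exact and easy to read.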

2. Identifying Feature Types

Next, numeric and categorical features are separated:

numeric_cols = ['square_feet', 'num_bedrooms', 'num_bathrooms', 'lot_size', 'year_built']
categorical_cols = ['has_garage', 'neighborhood']

Numeric features are used directly in splits, with no scaling required.
Categorical features require encoding so the tree can handle them as numeric inputs.

3. Imputation and Encoding

Missing values are handled separately for numeric and categorical data:

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

Then, categorical variables are one-hot encoded:

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_cat_enc = encoder.fit_transform(X_train_cat)

Imputers make the dataset complete so the tree never sees missing values.
Encoding converts text categories into numeric columns, allowing the tree to split on them.
The encoder is fitted only on the training set to avoid leaking information from validation or test data.

After encoding, numeric and categorical data are combined into a final feature matrix:

X_train_final = pd.concat([X_train_num, X_train_cat_enc], axis=1)

4. Model Definition and Hyperparameter Tuning

The DecisionTreeRegressor is instantiated:

dt = DecisionTreeRegressor(random_state=42)

A RandomizedSearchCV explores different hyperparameters efficiently:

param_dist = {
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    # "auto" was removed in scikit-learn 1.3; for a regression tree it meant the same as None
    "max_features": ["sqrt", "log2", None]
}

The search uses 10 candidates with 10-fold cross-validation.
It selects the combination with the best mean R² across the held-out folds, then refits that configuration on the full training set.

Best hyperparameters found by the search:

{'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': None, 'max_depth': 5}

A depth limit of 5, together with the minimum sample counts per split and per leaf, keeps the tree from memorizing noise in the training data.
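A search like the one described in this section could be set up roughly as follows. This is a sketch under assumptions rather than the project's exact call: the data comes from make_regression, scoring="r2" and random_state=42 are spelled out for clarity, and "auto" is left out of max_features because recent scikit-learn releases reject it.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared feature matrix.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)

param_dist = {
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,        # 10 randomly sampled candidates
    cv=10,            # each scored with 10-fold cross-validation
    scoring="r2",     # mean R² over the held-out folds
    random_state=42,
)
search.fit(X, y)

# With the default refit=True, the best candidate is refitted on all of X, y.
best_dt = search.best_estimator_
print(search.best_params_)
print(round(search.best_score_, 3))
```

Only 10 of the 135 possible combinations are tried, which is why a randomized search is cheaper than an exhaustive grid search.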

5. Training and Evaluation

The tuned model is fitted on the training set and then evaluated on the training, validation, and test sets. For the training set:

y_train_pred = best_dt.predict(X_train_final)
evaluate(y_train, y_train_pred, "TRAIN")

Metrics used:

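MAE: mean absolute error, the average size of a prediction error in price units
RMSE: root mean squared error, which weights large errors more heavily than MAE
R²: proportion of the variance in price explained by the model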
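The evaluate helper called above is not shown in the article. The version below is a hypothetical reconstruction that computes the three metrics directly from their definitions with NumPy. The returned dictionary and the demo prices are assumptions for illustration.

```python
import numpy as np

def evaluate(y_true, y_pred, label):
    """Print and return MAE, RMSE and R² for one data split."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))             # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))      # squaring emphasizes large errors
    ss_res = np.sum(errors ** 2)              # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    print(f"{label}: MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Three made-up prices and predictions.
metrics = evaluate(
    [200_000, 300_000, 400_000],
    [210_000, 290_000, 430_000],
    "DEMO",
)
```

In the demo, the single 30k miss pulls RMSE above MAE, which shows how RMSE penalizes larger errors.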

Results indicate:

Training R² = 0.819 → good fit
Validation R² = 0.795, Test R² = 0.810 → stable, minimal overfitting
MAE/RMSE (~16–20k) → reasonable for house price scale

6. Feature Importance

Finally, feature importances are computed:

feature_importance = pd.DataFrame({
    "feature": X_train_final.columns,
    "importance": best_dt.feature_importances_
}).sort_values(by="importance", ascending=False)

Top contributors:

square_feet dominates (~93%)
num_bedrooms and some neighborhood indicators provide minor contributions
Remaining features have negligible influence in this tree configuration

This shows which features the Decision Tree relies on for its predictions.
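For intuition about what these numbers mean, here is a small self-contained sketch on synthetic data. The price formula, the noise column, and the X_demo, y_demo, and tree names are invented for illustration. In scikit-learn, feature_importances_ holds each feature's share of the total impurity reduction, normalized so the values sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame({
    "square_feet": rng.uniform(800, 4000, size=n),
    "num_bedrooms": rng.integers(1, 6, size=n),
    "noise": rng.normal(size=n),  # unrelated to price by construction
})
# Invented price rule in which size matters far more than bedroom count.
y_demo = (
    150 * X_demo["square_feet"]
    + 5000 * X_demo["num_bedrooms"]
    + rng.normal(0, 10_000, size=n)
)

tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_demo, y_demo)

feature_importance = pd.DataFrame({
    "feature": X_demo.columns,
    "importance": tree.feature_importances_,
}).sort_values(by="importance", ascending=False)

print(feature_importance)
```

A shallow tree spends nearly all of its splits on the strongest feature, so a highly concentrated ranking is typical when one feature explains most of the price variation.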

7. Key Takeaways

Workflow: Split → Impute → Encode → Tune → Train → Evaluate → Interpret
Decision Tree: captures nonlinear patterns and handles mixed numeric/categorical features once the categorical ones are encoded
Model behavior: generalizes well, stable validation/test metrics
Interpretation: feature importance highlights house size as the dominant price driver (~93%), with bedrooms and neighborhood contributing only marginally
