For this project, I built a Linear Regression model on a realistic housing dataset with both numeric and categorical features.
The dataset includes features like square_feet, num_bedrooms, num_bathrooms, lot_size, year_built, along with categorical variables has_garage and neighborhood. The target is price_k, representing house price in thousands. The workflow follows a clear step-by-step process: splitting the data, preprocessing, training, evaluation, and interpretation.
Data Preparation
I split the data into 70% training, 15% validation, and 15% test sets:
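from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
# Split the held-out 30% evenly into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)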
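As a quick sanity check on the two-stage split, the sketch below reproduces it on hypothetical random data (the array shapes and values are placeholders, not the housing dataset) and reports the fraction of rows that ends up in each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 5 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Stage 1: keep 70% for training, hold out 30%.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
# Stage 2: split the held-out 30% in half for validation and test.
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Fraction of the original rows in each split.
fractions = {
    name: len(part) / len(X)
    for name, part in [("train", X_train), ("val", X_val), ("test", X_test)]
}
print(fractions)
```

Taking half of the 30% hold-out is what produces the 15% / 15% validation and test shares.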
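Numeric features were imputed with the median, while categorical features were imputed with the most frequent category. The categorical features were then one-hot encoded, with the encoder fitted only on the training data so that no information from the validation or test sets leaks into preprocessing: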
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_cat_enc = encoder.fit_transform(X_train_cat)
X_val_cat_enc = encoder.transform(X_val_cat)
X_test_cat_enc = encoder.transform(X_test_cat)
Finally, numeric and encoded categorical features were combined to form the final matrices for training and evaluation.
Model Training
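A LinearRegression model was trained with minor hyperparameter tuning (fit_intercept and positive) using RandomizedSearchCV with 8-fold cross-validation. Because the search space has only four combinations and n_iter=4, the search evaluates every combination: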
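from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

lr = LinearRegression()
# Search space: whether to fit an intercept and whether to force non-negative coefficients
param_dist = {"fit_intercept": [True, False], "positive": [True, False]}
random_search = RandomizedSearchCV(
    estimator=lr,
    param_distributions=param_dist,
    n_iter=4,
    cv=8,
    scoring='r2',
    n_jobs=1,
    random_state=42
)
random_search.fit(X_train_final, y_train)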
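Once the search has been fitted, the selected settings and the refitted model can be read from the search object. The sketch below runs the same search on hypothetical synthetic data (the feature matrix, target, and true coefficients are made up for illustration) and inspects best_params_, best_score_, and best_estimator_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data: 3 features, a large intercept, a little noise.
rng = np.random.default_rng(0)
X_train_final = rng.normal(size=(200, 3))
y_train = 50.0 + X_train_final @ np.array([10.0, 5.0, 2.0]) + rng.normal(scale=1.0, size=200)

param_dist = {"fit_intercept": [True, False], "positive": [True, False]}
random_search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions=param_dist,
    n_iter=4,
    cv=8,
    scoring="r2",
    n_jobs=1,
    random_state=42,
)
random_search.fit(X_train_final, y_train)

# Best hyperparameter combination and its mean cross-validated R².
print(random_search.best_params_)
print(round(random_search.best_score_, 3))

# With the default refit=True, the best model is refitted on all training rows.
best_model = random_search.best_estimator_
print(best_model.coef_, best_model.intercept_)
```

The cv_results_ attribute holds the per-combination scores, which is useful for seeing how much fit_intercept and positive actually matter.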
Evaluation
The model performed strongly across all sets:
- Training set: MAE ≈ 7.9k, RMSE ≈ 9.9k, R² ≈ 0.953
- Validation set: MAE ≈ 7.8k, RMSE ≈ 9.9k, R² ≈ 0.952
- Test set: MAE ≈ 8.1k, RMSE ≈ 10.1k, R² ≈ 0.952
R² differs by only about 0.001 between the training, validation, and test sets, and MAE by at most 0.3k, which indicates the model generalizes well, with minimal overfitting. The errors are reasonable relative to typical house prices in this dataset.
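The three metrics reported above can be computed as in the following sketch. The helper name regression_report and the four example prices are hypothetical; RMSE is taken as the square root of mean_squared_error so the code does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R² for one data split (hypothetical helper)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Hypothetical prices in thousands; every prediction is off by exactly 10k.
y_true = np.array([300.0, 250.0, 410.0, 180.0])
y_pred = np.array([310.0, 240.0, 400.0, 190.0])

report = regression_report(y_true, y_pred)
print(report)
```

Calling the same helper on the training, validation, and test predictions gives directly comparable numbers for each split.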
Feature Interpretation
The coefficients highlight the contribution of each feature:
Coefficients:
neighborhood_A 30.46
neighborhood_B 20.24
has_garage_Yes 14.84
num_bedrooms 9.99
num_bathrooms 7.79
square_feet 0.08
lot_size 0.005
year_built 0.30
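- Houses in neighborhood A are predicted to be about 30k more expensive than the baseline, holding the other features fixed.
- Having a garage adds roughly 15k.
- Each additional bedroom contributes ~10k, and each additional bathroom ~8k.
- square_feet and lot_size contribute linearly per unit (about 0.08k and 0.005k respectively), while newer houses have a modest positive effect (about 0.3k per year of year_built).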
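To make the interpretation concrete, the sketch below uses the reported coefficients to estimate the predicted price gap between two hypothetical houses. The intercept is not reported, so only differences between houses can be computed this way, not absolute prices; the feature differences chosen here are made up for illustration.

```python
# Coefficients as reported above, in thousands (price_k) per unit of each feature.
coefficients = {
    "neighborhood_A": 30.46,
    "neighborhood_B": 20.24,
    "has_garage_Yes": 14.84,
    "num_bedrooms": 9.99,
    "num_bathrooms": 7.79,
    "square_feet": 0.08,
    "lot_size": 0.005,
    "year_built": 0.30,
}

# Hypothetical comparison: house 2 differs from house 1 by being in
# neighborhood A instead of the baseline, having a garage, one more
# bedroom, and 200 more square feet. All other features are equal.
feature_difference = {
    "neighborhood_A": 1,
    "has_garage_Yes": 1,
    "num_bedrooms": 1,
    "square_feet": 200,
}

# In a linear model the predicted difference is the coefficient-weighted sum.
predicted_gap_k = sum(coefficients[name] * diff for name, diff in feature_difference.items())
print(f"Predicted price gap: {predicted_gap_k:.2f}k")
```

Here the gap is 30.46 + 14.84 + 9.99 + 0.08 × 200 = 71.29k, which shows that a small per-unit coefficient such as square_feet can still contribute substantially over a realistic range.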
Overall, the Linear Regression model is interpretable and stable. The pipeline handles both numeric and categorical features, imputes missing values, and reaches an R² of about 0.95 on the training, validation, and test sets. The feature coefficients show what drives predicted house prices, which makes the model useful for analysis and decision-making.