Linear Regression

About this analysis

For this project, I built a Linear Regression model on a realistic housing dataset with both numeric and categorical features.

The dataset includes features like square_feet, num_bedrooms, num_bathrooms, lot_size, year_built, along with categorical variables has_garage and neighborhood. The target is price_k, representing house price in thousands. The workflow follows a clear step-by-step process: splitting the data, preprocessing, training, evaluation, and interpretation.

Data Preparation

I split the data into 70% training, 15% validation, and 15% test sets:

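from sklearn.model_selection import train_test_split
# 70% train; the remaining 30% is split evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)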
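As a quick check on the split arithmetic, here is a minimal sketch on synthetic data. X_demo and y_demo are made-up stand-ins, not the housing dataset; the point is only to confirm that a 0.3 split followed by a 0.5 split yields 70/15/15.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 3 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(size=1000)

# First split: 70% train, 30% held out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
# Second split: halve the held-out 30% into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
```

Fixing random_state in both calls keeps the three sets reproducible between runs.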

Numeric features were imputed with the median, while categorical features were imputed with the most frequent category. The categorical features were then one-hot encoded, with the encoder fitted only on the training data:

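from sklearn.preprocessing import OneHotEncoder
# Fit on the training split only; categories unseen in training encode as all zeros
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_cat_enc = encoder.fit_transform(X_train_cat)
X_val_cat_enc = encoder.transform(X_val_cat)
X_test_cat_enc = encoder.transform(X_test_cat)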
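The imputation step that precedes the encoding is not shown in code. A minimal sketch of how it could look with scikit-learn's SimpleImputer follows. The toy frames (train_df, val_df) and the variable names are hypothetical, and this is one reasonable way to do it rather than necessarily the exact code used here. As with the encoder, the imputers are fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the real training/validation splits.
train_df = pd.DataFrame({
    "square_feet": [1500.0, np.nan, 2100.0, 1800.0],
    "neighborhood": pd.Series(["A", "B", np.nan, "B"], dtype=object),
})
val_df = pd.DataFrame({
    "square_feet": [np.nan, 1700.0],
    "neighborhood": pd.Series([np.nan, "A"], dtype=object),
})

num_cols, cat_cols = ["square_feet"], ["neighborhood"]

# Median for numeric columns, most frequent category for categorical ones.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training split only...
train_num = num_imputer.fit_transform(train_df[num_cols])
train_cat = cat_imputer.fit_transform(train_df[cat_cols])
# ...then reuse them on the validation split to avoid leakage.
val_num = num_imputer.transform(val_df[num_cols])
val_cat = cat_imputer.transform(val_df[cat_cols])
```

In this toy example the missing validation values are filled with the training median and the training mode, not with statistics from the validation rows themselves.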

Finally, numeric and encoded categorical features were combined to form the final matrices for training and evaluation.

Model Training

A LinearRegression model was trained with minor hyperparameter tuning (fit_intercept and positive) using RandomizedSearchCV with 8-fold cross-validation:

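from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

lr = LinearRegression()
# Only 4 combinations exist, so n_iter=4 tries every one of them
param_dist = {"fit_intercept": [True, False], "positive": [True, False]}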
random_search = RandomizedSearchCV(
    estimator=lr,
    param_distributions=param_dist,
    n_iter=4,
    cv=8,
    scoring='r2',
    n_jobs=1,
    random_state=42
)
random_search.fit(X_train_final, y_train)
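Once the search is fitted, the chosen settings and the cross-validated score can be read from the search object. The sketch below uses synthetic data (X_demo, y_demo are made up) so that it runs on its own. On the real data the same attributes would be read from random_search.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical): an intercept of 50 and one
# negative coefficient, so fit_intercept=True / positive=False should win.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 50.0 + X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    estimator=LinearRegression(),
    param_distributions={"fit_intercept": [True, False], "positive": [True, False]},
    n_iter=4,
    cv=8,
    scoring="r2",
    random_state=42,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_      # winning hyperparameter combination
best_cv_r2 = search.best_score_        # mean R² across the 8 folds
best_model = search.best_estimator_    # refitted on all of X_demo
n_candidates = len(search.cv_results_["params"])
print(best_params, round(best_cv_r2, 4), n_candidates)
```

Because the grid holds only four combinations, the randomized search here evaluates all of them, which makes it equivalent to an exhaustive grid search.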

Evaluation

The model performed strongly across all sets:

  • Training set: MAE ≈ 7.9k, RMSE ≈ 9.9k, R² ≈ 0.953
  • Validation set: MAE ≈ 7.8k, RMSE ≈ 9.9k, R² ≈ 0.952
  • Test set: MAE ≈ 8.1k, RMSE ≈ 10.1k, R² ≈ 0.952

Across the three sets, R² differs by only about 0.001 and MAE by at most about 0.3k, which indicates the model generalizes well with minimal overfitting. The errors are reasonable relative to typical house prices in this dataset.
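The three metrics above can be computed with scikit-learn's metrics module. Below is a minimal sketch: regression_report is a hypothetical helper name, and the four prices are made-up numbers used only to show the calls.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R² for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    # RMSE is the square root of the mean squared error.
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}

# Made-up prices in thousands (hypothetical, not from the housing data).
y_true_demo = np.array([300.0, 250.0, 410.0, 180.0])
y_pred_demo = np.array([310.0, 240.0, 400.0, 190.0])

report = regression_report(y_true_demo, y_pred_demo)
print(report)
```

Calling the same helper on the training, validation, and test predictions gives a like-for-like comparison across the three sets.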

Feature Interpretation

The coefficients highlight the contribution of each feature:

Coefficients:
neighborhood_A    30.46
neighborhood_B    20.24
has_garage_Yes    14.84
num_bedrooms       9.99
num_bathrooms      7.79
square_feet        0.08
lot_size           0.005
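year_built         0.30

Because price_k is in thousands, each coefficient is the change in predicted price, in thousands, for a one-unit change in that feature with the others held fixed:

  • Houses in neighborhood A are predicted to be about 30k higher than the baseline, and houses in neighborhood B about 20k higher.
  • Having a garage adds roughly 15k.
  • Each bedroom contributes ~10k, and each bathroom ~8k.
  • Each additional unit of square_feet adds about 0.08k and each unit of lot_size about 0.005k. These per-unit values are small because the features span large ranges, not because they matter little.
  • Newer houses have a modest positive effect: about 0.3k per year of year_built.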
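To make the per-unit reading concrete, here is a small worked example using the coefficients listed above. The intercept is not reported, so the sketch only computes the difference in predicted price between two houses. predicted_price_difference is a hypothetical helper, and the comparison itself is made up.

```python
# Coefficients as reported above (change in price_k per unit of each feature).
coefficients = {
    "neighborhood_A": 30.46,
    "neighborhood_B": 20.24,
    "has_garage_Yes": 14.84,
    "num_bedrooms": 9.99,
    "num_bathrooms": 7.79,
    "square_feet": 0.08,
    "lot_size": 0.005,
    "year_built": 0.30,
}

def predicted_price_difference(changes):
    """Change in predicted price_k for the given per-feature changes."""
    return sum(coefficients[name] * delta for name, delta in changes.items())

# Hypothetical comparison: 200 more square feet, one more bedroom, a garage.
diff_k = predicted_price_difference(
    {"square_feet": 200, "num_bedrooms": 1, "has_garage_Yes": 1}
)
print(round(diff_k, 2))
```

In this comparison the 200 extra square feet account for 16k of the difference, more than the extra bedroom or the garage, even though square_feet has the smallest-looking coefficient of the three.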

Overall, the Linear Regression model is interpretable and stable. The pipeline handles numeric and categorical data, imputes missing values, and reaches an R² of about 0.95 on the training, validation, and test sets. The feature coefficients show what drives predicted house prices in this dataset, which makes the model useful for analysis and decision-making.