Hyperparameter tuning is a crucial step in machine learning: it selects the configuration of settings that control how a model learns. Unlike model parameters, which are learned from data during training, hyperparameters are set before training begins and can significantly affect performance. Common examples include the depth of a decision tree, the number of estimators in a random forest, the regularization strength in logistic regression, and the kernel type in an SVM. The goal is to find the combination of values that maximizes model performance on a chosen metric such as accuracy, F1-score, or AUC. Grid search exhaustively evaluates every combination of the specified hyperparameter values and uses cross-validation to estimate how well each combination generalizes to unseen data. Randomized search is a cheaper alternative that samples only a subset of the combinations, which is particularly useful for large search spaces.
1. Why Grid Search?
When training machine learning models, hyperparameters control model behavior (e.g., how deep a tree grows, or how strong regularization is). Choosing them arbitrarily can lead to poor performance. GridSearchCV in scikit-learn automates this by:
- Trying all combinations of specified hyperparameters.
- Performing cross-validation to estimate performance on unseen data.
- Returning the best parameter set according to a scoring metric.
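This replaces guesswork with a systematic, repeatable search. The trade-off is cost: the number of combinations multiplies with every hyperparameter you add to the grid.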
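To make the cost of "trying all combinations" concrete, here is a minimal sketch in plain Python that enumerates a grid and counts the model fits it implies. It uses the same grid as the random forest example in the next section; the variable names (candidates, n_fits) are just illustrative.

```python
from itertools import product

# Same grid as the random forest example in this article.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}
cv_folds = 5

# Every combination of values is one candidate;
# each candidate is fit once per cross-validation fold.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates)  # 243 (3 * 3 * 3 * 3 * 3)
print(n_fits)        # 1215
```

So a grid that looks small on paper already means more than a thousand model fits, plus one final refit of the best candidate when refit=True (the scikit-learn default).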
2. Basic Grid Search Structure
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 1. Define the model
rf = RandomForestClassifier(random_state=42)

# 2. Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],         # number of trees
    'max_depth': [None, 5, 10],              # maximum depth of each tree (None = no limit)
    'min_samples_split': [2, 5, 10],         # minimum samples to split a node
    'min_samples_leaf': [1, 2, 4],           # minimum samples per leaf
    'max_features': ['sqrt', 'log2', None]   # features considered per split
}

# 3. Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='accuracy',       # metric to optimize
    cv=5,                     # 5-fold cross-validation
    n_jobs=-1,                # use all CPU cores
    verbose=2,                # print progress messages
    return_train_score=True
)

# 4. Fit to training data (X_train and y_train are assumed to be defined already)
grid_search.fit(X_train, y_train)

# 5. Best hyperparameters and their mean cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)
```
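The snippet above assumes X_train and y_train already exist. For a version you can run end to end, here is a minimal sketch under some assumptions that are not part of the original example: the data is synthetic (make_classification), the grid is deliberately tiny so it finishes in seconds, and a held-out test set is added to check the tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A deliberately small grid so the example runs quickly.
small_grid = {'n_estimators': [25, 50], 'max_depth': [None, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    scoring='accuracy',
    cv=5,
)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is the best candidate
# retrained on all of X_train, so it can be scored on the held-out test set.
test_accuracy = search.best_estimator_.score(X_test, y_test)

print(search.best_params_)
print(f"CV accuracy:   {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the search matters: best_score_ is computed on data that influenced the choice of hyperparameters, so it tends to be a little optimistic.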
3. Explanation of Each Part
estimator=rf
- The ML model we want to train (here, RandomForestClassifier).
param_grid=param_grid
- A dictionary specifying all hyperparameters to explore. Each key is a parameter, and the value is a list of options.
scoring='accuracy'
- The metric used to evaluate performance. Other options: 'precision', 'recall', 'f1', 'roc_auc', or a custom scoring function.
cv=5
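- Number of folds in cross-validation. With 5 folds, each candidate is trained 5 times, each time on 80% of the training data and validated on the remaining 20%. For classifiers, an integer cv uses stratified folds, so each fold keeps roughly the same class proportions.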
n_jobs=-1
- Parallelization option.
- -1 → use all available CPU cores; 1 → single core; 2, 3, ... → use that many cores.
verbose=2
- Controls logging details during search.
- 0 = silent, 1 = minimal, 2 = detailed.
return_train_score=True
- Keeps training set scores in the results, useful to check overfitting.
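To show how return_train_score=True helps spot overfitting, here is a small sketch that reads the per-candidate scores from cv_results_. It is an illustration under assumptions: the data is synthetic, and a single decision tree is used instead of the random forest so that the effect of max_depth is easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 5, None]},
    scoring='accuracy',
    cv=5,
    return_train_score=True,  # adds mean_train_score to cv_results_
)
search.fit(X, y)

results = search.cv_results_
for params, train, valid in zip(
    results['params'], results['mean_train_score'], results['mean_test_score']
):
    # A large gap between train and validation scores suggests overfitting.
    gap = train - valid
    print(f"{params}: train={train:.3f}  validation={valid:.3f}  gap={gap:.3f}")
```

Note that scikit-learn calls the validation score mean_test_score, even though it comes from the cross-validation folds rather than from a separate test set.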
4. Algorithm-Specific Hyperparameters
Different algorithms use different sets of hyperparameters:
| Algorithm | Common Hyperparameters |
|---|---|
| Decision Tree | max_depth, min_samples_split, min_samples_leaf, max_features |
| Random Forest | All decision tree parameters + n_estimators, bootstrap |
| XGBoost / Gradient Boosting | learning_rate, n_estimators, max_depth, subsample, colsample_bytree (XGBoost) |
| Logistic Regression | C (inverse regularization strength), penalty (l1, l2, elasticnet), solver |
| SVM / SVR | C (inverse regularization strength), kernel (linear, rbf, poly), gamma, degree (for poly), coef0 |
| KNN | n_neighbors, weights (uniform, distance), metric (euclidean, manhattan) |
| Neural Networks (MLP) | hidden_layer_sizes, activation, solver, alpha (L2), learning_rate |
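Some hyperparameters in the table only matter for certain settings, such as degree for the poly kernel. One way to handle this, sketched below on synthetic data, is to pass param_grid as a list of dictionaries so each kernel gets its own sub-grid. The specific values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A list of dicts defines separate sub-grids, so kernel-specific
# parameters are only combined with the kernel they apply to.
svm_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]},
    {'kernel': ['poly'], 'C': [0.1, 1], 'degree': [2, 3]},
]

n_candidates = len(ParameterGrid(svm_grid))
print(n_candidates)  # 3 + 6 + 4 = 13

search = GridSearchCV(SVC(), svm_grid, scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_)
```

A single flat dictionary with all of these keys would also work, but it would waste fits on combinations such as a linear kernel with different degree values, which all behave identically.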
5. Tips & Common Mistakes
- Large grids → long runtime: Use RandomizedSearchCV to sample fewer combinations.
- Scaling features: SVM, Logistic Regression, and KNN are sensitive to feature scale and generally need scaling; trees do not.
- Class imbalance: Use class_weight='balanced' for models that support it.
- Use proper scoring: On imbalanced datasets, optimizing 'accuracy' can be misleading; prefer 'recall' or 'f1'.
- Seed for reproducibility: Always set random_state for consistent results.
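Several of these tips can be combined in one search. The sketch below is one possible setup, not the only one: it puts the scaler inside a Pipeline so scaling is re-fit on each training fold, tries class_weight='balanced', and optimizes 'f1' instead of accuracy. The data is synthetic and imbalanced, and the step names ('scaler', 'clf') are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (roughly 90% / 10%), for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),                # fit on each training fold only
    ('clf', LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__class_weight': [None, 'balanced'],
}

search = GridSearchCV(pipe, param_grid, scoring='f1', cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean F1: {search.best_score_:.3f}")
```

Scaling the whole dataset before the search would let information from the validation folds leak into training; wrapping the scaler in the pipeline avoids that.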
6. Randomized Search (Optional Shortcut)
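```python
from sklearn.model_selection import RandomizedSearchCV

# Reuses rf and param_grid from the grid search example above
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=20,            # only 20 random combinations
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=2,
    random_state=42       # makes the sampled combinations reproducible
)

# Fit exactly like GridSearchCV (X_train and y_train are assumed to be defined already)
random_search.fit(X_train, y_train)
```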
- Faster than grid search for very large grids.
- n_iter controls how many random parameter sets are tested.
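Randomized search is not limited to lists of values: param_distributions also accepts distributions to sample from. The following is a minimal, self-contained sketch assuming scipy is installed; the data is synthetic and the ranges are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Distributions are sampled from; lists are sampled uniformly.
param_distributions = {
    'n_estimators': randint(20, 101),       # integers from 20 to 100
    'max_depth': [None, 5, 10],
    'min_samples_split': randint(2, 11),    # integers from 2 to 10
    'max_features': ['sqrt', 'log2', None],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled candidates
    scoring='accuracy',
    cv=3,
    random_state=42,    # makes the sampling reproducible
)
random_search.fit(X, y)

print(random_search.best_params_)
print(len(random_search.cv_results_['params']))  # 10, one per sampled candidate
```

With distributions, the search can land on values (for example, 73 trees) that a hand-written grid would never include, while n_iter keeps the total cost fixed.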