
The Ocean’s Memory: Regression Prediction of Ocean Temperature Using Over 60 Years of CalCOFI Environmental Data

Built a leakage-safe regression pipeline on nearly one million CalCOFI oceanographic records, using selected environmental and spatial features to predict Pacific Ocean temperature across over 60 years of observations.

About this project

Scikit-learn API Design Paper

Dataset on Kaggle

This project uses the CalCOFI oceanographic dataset, one of the longest-running marine observation datasets in the world. The dataset contains oceanographic measurements collected from the California Current system, including water temperature, salinity, oxygen, nutrients, depth, location, and cruise/station information. The main goal was to build a regression pipeline that predicts ocean water temperature in degrees Celsius from environmental and spatial features.

The project used two main files: bottle.csv and cast.csv. The bottle.csv file contained depth-level ocean measurements, while cast.csv contained station-level metadata such as year, month, latitude, longitude, bottom depth, distance from coast, and cruise-related information. These two datasets were merged using the shared key Cst_Cnt, allowing each bottle-level observation to be connected with its corresponding time and location metadata.
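The merge described above can be sketched as follows. This is a minimal illustration on tiny made-up frames, not the project's actual loading code. Only Cst_Cnt, Depthm, T_degC, Lat_Dec, and Lon_Dec are taken from the text; the Year and Month column names and all values are assumptions.

```python
import pandas as pd

# Tiny synthetic stand-ins for bottle.csv and cast.csv (values are made up).
bottle = pd.DataFrame({
    "Cst_Cnt": [1, 1, 2, 3],
    "Depthm": [0, 10, 0, 5],
    "T_degC": [15.2, 14.8, 16.1, 13.9],
})
cast = pd.DataFrame({
    "Cst_Cnt": [1, 2, 3],
    "Year": [1949, 1950, 1950],    # assumed column name
    "Month": [3, 4, 7],            # assumed column name
    "Lat_Dec": [38.8, 37.9, 36.5],
    "Lon_Dec": [-124.2, -123.6, -122.9],
})

# Many bottle rows belong to one cast, so ask pandas to verify that
# Cst_Cnt is unique on the cast side; a duplicate key would raise an error
# instead of silently multiplying rows.
merged = bottle.merge(cast, on="Cst_Cnt", how="left", validate="many_to_one")
print(merged)
```

A left join keeps every bottle-level row, and the validate argument guards against accidental row duplication if the metadata file ever contains repeated keys.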

The target variable was:

T_degC

which was renamed as:

target_temperature_celsius

Several feature names were also renamed to make the notebook easier to understand. For example, Depthm became depth_meters, Salnty became salinity, O2ml_L became oxygen_ml_per_liter, PO4uM became phosphate_umol, NO3uM became nitrate_umol, and Lat_Dec / Lon_Dec became latitude and longitude. This made the project more readable while still preserving the scientific meaning of the variables.
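One way to apply these renames is a single mapping passed to pandas. The sketch below uses only the name pairs listed above; the empty example frame and the RENAME_MAP name are hypothetical.

```python
import pandas as pd

# Original CalCOFI column name -> readable name, as listed in the text.
RENAME_MAP = {
    "T_degC": "target_temperature_celsius",
    "Depthm": "depth_meters",
    "Salnty": "salinity",
    "O2ml_L": "oxygen_ml_per_liter",
    "PO4uM": "phosphate_umol",
    "NO3uM": "nitrate_umol",
    "Lat_Dec": "latitude",
    "Lon_Dec": "longitude",
}

# Hypothetical empty frame that only carries the column labels.
df = pd.DataFrame(columns=list(RENAME_MAP) + ["Cst_Cnt"])

# Columns missing from the mapping (here Cst_Cnt) are left unchanged.
renamed = df.rename(columns=RENAME_MAP)
print(list(renamed.columns))
```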

A major focus of this project was avoiding data leakage. Variables that directly report temperature or are derived from it, such as R_TEMP, R_POTEMP, STheta, and O2Sat, along with most R_ reported/processed columns, were excluded from the modeling features. Some of these columns are restatements of the temperature itself and others are calculated from temperature-related information, so including them would make the model appear stronger than it really is.
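A leakage filter along these lines could look like the sketch below. It is a simplification: the text says most R_ columns were excluded, while this helper drops every column with the R_ prefix. The helper name drop_leaky_columns and the example frame are hypothetical.

```python
import pandas as pd

# Columns named in the text as direct or derived temperature information.
LEAKY_COLUMNS = ["R_TEMP", "R_POTEMP", "STheta", "O2Sat"]

def drop_leaky_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the named leaky columns plus anything prefixed with 'R_'.

    Simplifying assumption: all R_ columns are removed, not just most.
    """
    r_columns = [c for c in df.columns if c.startswith("R_")]
    to_drop = sorted(set(LEAKY_COLUMNS + r_columns) & set(df.columns))
    return df.drop(columns=to_drop)

# Hypothetical frame carrying only column labels.
example = pd.DataFrame(
    columns=["Depthm", "Salnty", "R_TEMP", "R_SALINITY", "STheta", "O2Sat"]
)
kept = drop_leaky_columns(example)
print(list(kept.columns))
```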

The final feature set included raw environmental, chemical, spatial, and time-based predictors such as:

depth_meters
salinity
oxygen_ml_per_liter
phosphate_umol
silicate_umol
nitrite_umol
nitrate_umol
year
month
latitude
longitude
bottom_depth
distance_from_coast

Missing values were handled inside a scikit-learn pipeline using:

SimpleImputer(strategy="median", add_indicator=True)

This means missing numerical values were replaced with the median learned from the training fold only, and a binary missingness indicator column was added for each feature that contained missing values during fitting. Because the imputer lived inside the pipeline, it was refitted on the training portion of every cross-validation fold, which prevented statistics from validation or test data from leaking into training.
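The behaviour of the imputer inside a pipeline can be seen on a toy array. This is a minimal sketch with made-up numbers; pairing the imputer with Ridge here is only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Toy training data: column 0 has one missing value, column 1 has none.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0], [5.0, 40.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_new = np.array([[np.nan, 25.0]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("model", Ridge(alpha=1.0)),
])
pipeline.fit(X_train, y_train)

# The imputer reuses the training median of column 0 (median of 1, 3, 5)
# and appends an indicator column only for features that had missing
# values when it was fitted.
imputed_new = pipeline.named_steps["imputer"].transform(X_new)
print(imputed_new)
```

The new row is filled with the training median, not a statistic computed from the new data, which is the property that keeps validation and test rows out of the fitting step.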

Because the CalCOFI dataset is time-based, a normal random train-test split was not used as the final evaluation strategy. Instead, the data was split chronologically. Observations from 1949–2014 were used for training and validation, while observations from 2015–2016 were kept as the final untouched test set. This made the evaluation more realistic because the model was trained on older oceanographic observations and tested on newer unseen years.
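A chronological split of this kind can be sketched with a simple year mask. The frame below is synthetic, and the TRAIN_END_YEAR constant is a hypothetical name; only the 2014/2015 boundary comes from the text.

```python
import pandas as pd

# Synthetic rows in arbitrary order (values are made up).
df = pd.DataFrame({
    "year": [1949, 1980, 2014, 2015, 2016, 2000],
    "target_temperature_celsius": [15.0, 14.2, 16.3, 15.8, 17.1, 13.5],
})

TRAIN_END_YEAR = 2014  # last year in the training/validation period

# Everything up to and including 2014 is available for model selection;
# later years are set aside and not touched until the final evaluation.
train_val = df[df["year"] <= TRAIN_END_YEAR].sort_values("year")
test = df[df["year"] > TRAIN_END_YEAR].sort_values("year")
print(len(train_val), len(test))
```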

For validation, the project used:

TimeSeriesSplit(n_splits=5)

This allowed the model to be evaluated across multiple chronological folds. In each fold, earlier years were used for training and later years were used for validation. Because TimeSeriesSplit splits on row order rather than on a date column, this relies on the rows being sorted chronologically. The approach was much safer than random K-Fold cross-validation because nearby time periods and similar cruise conditions were not randomly mixed across train and validation sets.
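The expanding-window behaviour can be inspected directly by printing the fold boundaries. The sketch below uses a synthetic, already sorted year column; in real data, rows from one year may straddle a fold boundary, since the splitter only sees row positions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, chronologically sorted year column: 12 years, 2 rows each.
years = np.repeat(np.arange(1949, 1961), 2)

tscv = TimeSeriesSplit(n_splits=5)
folds = []
for train_idx, val_idx in tscv.split(years):
    # Record the last training year and the first validation year per fold.
    folds.append((int(years[train_idx].max()), int(years[val_idx].min())))

for last_train_year, first_val_year in folds:
    print(f"train up to {last_train_year}, validate from {first_val_year}")
```

Each later fold trains on a longer history, so the validation block always lies after the data the model was fitted on.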

The first model was a Ridge Regression baseline. Ridge was chosen because it is a regularized linear model and gives a clean, interpretable starting point. Its validation performance improved as more historical years were added to the training folds. During the Ridge tuning stage, several alpha values were tested:

0.01, 0.1, 1, 10, 50, 100, 500, 1000

However, Ridge performance barely changed across alpha values. The best Ridge models achieved around:

Mean validation MAE ≈ 1.566 °C
Mean validation RMSE ≈ 2.030 °C
Mean validation R² ≈ 0.724

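This showed that the linear model was useful as a baseline, but the insensitivity to alpha suggested that regularization strength was not the limiting factor: the relationship between ocean temperature and the environmental variables was too nonlinear for a linear model to capture.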
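An alpha sweep like this one can be sketched with cross_validate and the same time-ordered splitter. The data here is synthetic and linear, so the scores do not correspond to the project's results; the loop structure and variable names are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic regression data standing in for the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

alphas = [0.01, 0.1, 1, 10, 50, 100, 500, 1000]
results = {}
for alpha in alphas:
    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
        ("model", Ridge(alpha=alpha)),
    ])
    scores = cross_validate(
        pipe, X, y,
        cv=TimeSeriesSplit(n_splits=5),
        scoring="neg_mean_absolute_error",
    )
    # scikit-learn reports negated errors, so flip the sign back to MAE.
    results[alpha] = -scores["test_score"].mean()

best_alpha = min(results, key=results.get)
print(best_alpha, round(results[best_alpha], 4))
```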

After the Ridge baseline, several tree-based and ensemble models were trained and compared using the same time-series validation strategy. The models included:

DecisionTreeRegressor
RandomForestRegressor
HistGradientBoostingRegressor

For each model, both training and validation metrics were checked. This was important because training scores alone can be misleading. A model with very low training error but much worse validation error is likely overfitting. The comparison focused on MAE, RMSE, and R².
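Collecting training and validation metrics side by side can be done with return_train_score=True. The sketch below uses a deliberately deep decision tree on synthetic noisy data so the gap is visible; the data and the exact configuration are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data with noise (not CalCOFI data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=400)

scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}
cv_results = cross_validate(
    DecisionTreeRegressor(max_depth=15, random_state=42),
    X, y,
    cv=TimeSeriesSplit(n_splits=5),
    scoring=scoring,
    return_train_score=True,
)

# Error scorers are negated by scikit-learn; flip the sign for reporting.
train_mae = -cv_results["train_mae"].mean()
val_mae = -cv_results["test_mae"].mean()
gap = val_mae - train_mae
print(f"train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}  gap={gap:.3f}")
```

A large positive gap between validation and training error is the overfitting signal described above.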

The Decision Tree models performed much better than Ridge, but deeper trees showed signs of overfitting. For example, DecisionTree_depth_15 achieved strong validation performance, but single decision trees are less stable than ensemble models. Therefore, the Decision Tree was useful for comparison but was not selected as the final model.

The HistGradientBoostingRegressor models also performed strongly. The best practical HGB model was:

HistGradientBoostingRegressor(
    learning_rate=0.1,
    max_iter=200,
    max_depth=6,
    max_leaf_nodes=31,
    l2_regularization=0.0,
    random_state=42
)

It achieved approximately:

Mean train MAE ≈ 0.441 °C
Mean validation MAE ≈ 0.607 °C
Mean validation RMSE ≈ 0.819 °C
Mean validation R² ≈ 0.952

This was a major improvement over Ridge and showed that nonlinear ensemble methods captured the temperature patterns much better.

The Random Forest model performed the best overall. The first strong Random Forest model was:

RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=42
)

It achieved:

Mean train MAE ≈ 0.335 °C
Mean validation MAE ≈ 0.515 °C
Mean validation RMSE ≈ 0.799 °C
Mean validation R² ≈ 0.955

After that, the Random Forest was tuned further. The final tuned Random Forest used:

RandomForestRegressor(
    n_estimators=150,
    max_depth=18,
    min_samples_leaf=5,
    max_features="sqrt",
    n_jobs=-1,
    random_state=42
)

This model achieved the best validation performance:

Mean train MAE ≈ 0.366 °C
Mean validation MAE ≈ 0.479 °C
Mean validation RMSE ≈ 0.708 °C
Mean validation R² ≈ 0.966

The train-validation gap was acceptable: mean MAE rose from about 0.366 °C on the training folds to about 0.479 °C on the validation folds, a difference of roughly 0.11 °C. The model performed better on training data, as expected, but the difference was not extreme. This suggested that the model generalized well and was not severely overfitting.

After all model selection and hyperparameter tuning were completed using only the 1949–2014 training/validation period, the final tuned Random Forest model was trained on the full training-validation dataset and evaluated once on the untouched 2015–2016 test set.
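The final refit-and-evaluate step could be sketched as below. The hyperparameters are the tuned values reported above, but the data is synthetic, the two features are stand-ins, and the variable names are hypothetical, so the printed scores are unrelated to the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: temperature falls with depth (illustrative only).
rng = np.random.default_rng(42)
n_rows = 600
year = rng.integers(1949, 2017, size=n_rows)  # years 1949..2016
depth = rng.uniform(0, 500, size=n_rows)
salinity = rng.normal(33.5, 0.3, size=n_rows)
temperature = 18.0 - 0.02 * depth + rng.normal(scale=0.3, size=n_rows)
X = np.column_stack([depth, salinity])

# Chronological split: fit on 1949-2014, hold out 2015-2016.
train_val_mask = year <= 2014
X_train_val, y_train_val = X[train_val_mask], temperature[train_val_mask]
X_test, y_test = X[~train_val_mask], temperature[~train_val_mask]

final_model = Pipeline([
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("model", RandomForestRegressor(
        n_estimators=150, max_depth=18, min_samples_leaf=5,
        max_features="sqrt", n_jobs=-1, random_state=42,
    )),
])

# Fit once on the full training/validation period, then score the test
# period a single time.
final_model.fit(X_train_val, y_train_val)
y_pred = final_model.predict(X_test)

test_mae = mean_absolute_error(y_test, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
test_r2 = r2_score(y_test, y_pred)
print(f"MAE={test_mae:.3f}  RMSE={test_rmse:.3f}  R2={test_r2:.3f}")
```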

The final test results were:

MAE  = 0.433 °C
RMSE = 0.699 °C
R²   = 0.971

These results mean that the final model predicted ocean water temperature with a mean absolute error of about 0.43 °C on unseen later data. The R² score of 0.971 shows that the model explained about 97.1% of the variance in temperature over the held-out test period.
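For readers less familiar with these metrics, the following sketch computes all three from their standard definitions on four made-up temperatures. The numbers are chosen for easy hand-checking and have no connection to the project's data.

```python
import numpy as np

# Made-up observed and predicted temperatures in degrees Celsius.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 18.0])

errors = y_pred - y_true

# MAE: average size of the error, ignoring its sign.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large misses.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: one minus the ratio of residual variation to total variation
# around the mean of the observed values.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, rmse, r2)
```

RMSE is never smaller than MAE, and a wide gap between the two points to a few large errors rather than uniformly small ones.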