This work is distributed under the Creative Commons Attribution 4.0 License.
An interpretable machine learning approach to marine heatwave prediction in the South China Sea
Abstract. A primary challenge in applying machine learning to predict marine heatwaves (MHWs) in the South China Sea (SCS) is the limited availability of observational data for model training. To address this issue, this study explores the viability of leveraging multi-member ensemble simulations from the Coupled Model Intercomparison Project Phase 6 (CMIP6) to construct an extensive, physically consistent training dataset for various machine learning models. After training on multiple CMIP6 ensemble members, the models are evaluated for their ability to predict MHWs in the SCS. The results show that these machine learning methods can match the prediction performance of existing dynamical models, and in some cases outperform them. Furthermore, by incorporating machine learning interpretability techniques, the key physical processes underlying these predictions can be elucidated. In other words, the new method is not a traditional "black box" but an effective tool with a degree of physical transparency and scientific interpretability.
Status: open (until 23 Apr 2026)
- RC1: 'Comment on egusphere-2026-262', Anonymous Referee #1, 27 Feb 2026
- AC1: 'Reply on RC1', Peihao Yang, 14 Mar 2026
We thank the editor and reviewers for their handling and review of our manuscript. Our point-by-point responses to the reviewers' questions are as follows:
1. Firstly, short-term forecasting and seasonal/multi-year are fundamentally different. Generally, we cannot and do not need to predict the exact state of the climate 730 days in advance. I would suggest the authors study carefully the conventions of forecasting at different timescales, and to consider the target timescale most suited to their method.
Reply: Thank you for your comments. We agree that predictive capability and physical predictability differ substantially across time scales. Experiments with longer lead times (e.g., 90–730 days) are not intended to provide practical predictions of the exact future state. Instead, they are designed to illustrate how the predictive signal learned from CMIP6 gradually decays as the forecast horizon increases, thereby indicating the upper limit of the model’s usable predictability range.
2. Do the predictions presented represent an area-averaged time series? The SCS is a large area with many studies showing distinct MHW dynamics in the area. I am concerned that an area-averaged time series is not truly representative of individual events, nor relevant to impacts. Such an approach should at least cover sub-domains, if not individual grid cells on a common grid.
Reply: Thank you for your comments. In our study, the predictions are not based on an area-averaged time series. Instead, each grid cell within the South China Sea is treated as an independent prediction target, and the model uses the corresponding local predictor series for point-wise prediction. To avoid this confusion, we have revised Section 3.2 (Method) and clarified this point in the current version.
3. The forecast validation (Figure 2 and the corresponding analysis) is very unclear. This is unfortunate because it is the foundation of the paper. What does it mean to have a prediction accuracy of CMIP6 models or of ML models alone? Also, ERA5 is not used as a prediction.
Reply: Thank you for your comments. It should be clarified that ERA5 was not used as a prediction model; in the revised manuscript this has been corrected to "ECMWF forecast." In addition, the definition of accuracy has been clarified to indicate that the results for CMIP6 and related datasets represent event agreement under a unified labeling framework rather than prediction accuracy in the machine-learning sense.
4. Accuracy is not defined.
Reply: Thank you for your comments. The accuracy used in this study represents the consistency of daily event identification. Specifically, at each time step, the MHW state predicted by the model is compared with the observed record to determine whether they are consistent, and the results are then averaged over the entire time series. This metric reflects the degree to which different methods consistently identify the occurrence or absence of events on a daily basis, allowing for a fair comparison under the same label construction scheme and the same class imbalance conditions. It should be noted that this metric is only used to evaluate pointwise classification consistency and does not replace the probabilistic or intensity-based error measures commonly used in extreme event studies.
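The daily event-agreement metric described above can be sketched as follows; the function name and the toy 0/1 series are illustrative, not taken from the manuscript:

```python
# Sketch of the daily event-agreement "accuracy": the fraction of days on
# which the predicted and observed MHW states (1 = MHW day, 0 = no MHW)
# coincide, averaged over the whole time series.
def event_agreement(predicted, observed):
    """Fraction of days on which predicted and observed MHW states match."""
    assert len(predicted) == len(observed)
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)

pred = [0, 1, 1, 0, 0, 1, 0]
obs  = [0, 1, 0, 0, 0, 1, 1]
print(event_agreement(pred, obs))  # 5 of 7 days agree
```

As noted in the reply, this measures pointwise classification consistency only; it does not capture event intensity or probabilistic skill.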
5. A 7-day moving average is applied to target data. This removes variability that would be crucial for short-term forecasts, which are within the scope of this study. Please check the sensitivity of this approach.
Reply: Thank you for your comments. A typical MHW event is defined by a persistence of at least five consecutive days, meaning that its identification primarily relies on sustained warm anomalies rather than day-to-day fluctuations. The moving average suppresses high-frequency variability, making the labels more consistent with the persistence characteristic of MHW events and reducing spurious detections caused by transient noise. Because the definition of MHW itself is based on persistence, this processing step only stabilizes the labels and does not alter the event identification logic or affect the underlying climatic and physical interpretation.
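The two label-construction steps discussed above (smoothing, then persistence-based identification) can be sketched as below. The window handling, function names, and threshold value are assumptions for illustration; only the 7-day smoothing and the 5-consecutive-day persistence criterion come from the text:

```python
# Sketch of label construction: a 7-day centered moving average of SST
# anomalies, then MHW identification as runs of at least 5 consecutive
# days above a threshold (threshold value here is illustrative).
def moving_average(series, window=7):
    """Centered moving average, shrinking the window at the series edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def mhw_days(anomaly, threshold, min_duration=5):
    """Return a 0/1 label per day; 1 only inside runs of >= min_duration exceedances."""
    above = [a > threshold for a in anomaly]
    labels = [0] * len(above)
    i = 0
    while i < len(above):
        if above[i]:
            j = i
            while j < len(above) and above[j]:
                j += 1
            if j - i >= min_duration:       # keep only persistent warm spells
                for k in range(i, j):
                    labels[k] = 1
            i = j
        else:
            i += 1
    return labels
```

For example, a 5-day exceedance run is labeled as an MHW while an isolated 2-day spike is discarded, which is how the smoothing-plus-persistence logic suppresses transient noise.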
6. It is not clear why Random Forest is considered the best performing model and subsequently used for the explainability analysis.
Reply: Thank you for your comments. Among the machine learning methods evaluated, RF demonstrates stable predictive performance, with relatively small error fluctuations and higher prediction consistency. In addition, tree-based models possess inherent advantages in structural transparency, variable interpretability, and quantitative evaluation of feature contributions. RF relies on explicit tree-splitting structures and node impurity reduction metrics, which enable it to provide robust and consistent rankings of feature importance and make it highly compatible with tree-based SHAP approaches. Therefore, this study selects RF to interpret the influence mechanisms of key meteorological and oceanic factors, and selects three representative forecast lead times (7, 30, and 90 days) for in-depth interpretation.
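The "node impurity reduction" underlying RF feature-importance rankings can be illustrated with a minimal sketch of the Gini criterion at a single split; the toy labels are illustrative, not the study's data:

```python
# Illustrative Gini impurity reduction at one tree split. RF feature
# importance aggregates such reductions over all splits using each feature.
def gini(labels):
    """Gini impurity of a binary node (labels are 0/1)."""
    p = sum(labels) / len(labels)   # fraction of positive (e.g., MHW) labels
    return 2 * p * (1 - p)

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]          # a perfect split
print(impurity_reduction(parent, left, right))  # 0.5: all impurity removed
```

A feature whose splits consistently produce large reductions like this receives a high importance score, which is the quantity that tree-based SHAP decompositions refine to the per-sample level.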
7. Line 202-204: Provide References.
Reply: Thank you for your comments. We have added references to this section:
1)Wang, Z., Han, L., Ding, R., and Li, J.: Evaluation of the performance of CMIP5 and CMIP6 models in simulating the South Pacific Quadrupole–ENSO relationship, Atmospheric and Oceanic Science Letters, 14, 100057, 2021.
2)Yang, X. and Huang, P.: Improvements in the relationship between tropical precipitation and sea surface temperature from CMIP5 to CMIP6, Climate Dynamics, 60, 3319–3337, 2023.
Citation: https://doi.org/10.5194/egusphere-2026-262-AC1
Referee comment (RC1): This study applies machine and deep learning techniques to forecast marine heatwaves (MHWs) in the South China Sea. Training is performed by linking CMIP6 model MHWs to EOFs of atmospheric and oceanic predictors. Forecasts of a recent period are then made using ERA5 atmospheric reanalysis predictors and validated against satellite sea surface temperature.
The study suffers from fundamental gaps in the description and justification of the methods, and flaws in the forecast analysis. There is potential for an interesting work on using climate simulations to predict ocean extremes but major revisions and additions are first required.