This work is distributed under the Creative Commons Attribution 4.0 License.
Storm surge dynamics in the northern Adriatic Sea: comparing AI emulators with high-resolution numerical simulations
Abstract. Accurate storm surge forecasting is vital for protecting coastal regions, particularly in the northern Adriatic Sea where sea-level rise and increasingly severe storm events pose growing risks. Machine Learning (ML) approaches offer compelling speed and flexibility, yet their ability to emulate high-resolution dynamic models, especially for extreme surge events, has not been sufficiently assessed across methods and loss functions. In this study, a range of ML emulators, from Multivariate Linear Regression (MLR) to Long Short-Term Memory (LSTM) networks, is benchmarked against a high-resolution hydrodynamic model optimized for extreme surge representation. We also evaluate the impact of training loss functions, comparing the conventional Mean Squared Error (MSE) with the corrected Mean Absolute Deviation squared (MADc²), designed to better capture surge peaks. Results show that even simple models like MLR, when trained with MADc², achieve performance comparable to advanced neural networks while remaining orders of magnitude faster. These findings demonstrate that with appropriate training strategies, data-driven emulators can rival physics-based models in reproducing extremes. The MLR-MADc² configuration emerges as a practical balance between computational efficiency and accuracy, underscoring the potential of ML emulators for coastal forecasting and risk assessment.
Competing interests: Co-author Massimo Tondello is employed by the company HS Marine SrL. Co-author Michalis Vousdoukas is employed by the company MV Coastal and Climate Research Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationship that could be construed as a potential conflict of interest.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-5313', Anonymous Referee #1, 08 Dec 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-5313/egusphere-2025-5313-RC1-supplement.pdf
Citation: https://doi.org/10.5194/egusphere-2025-5313-RC1
AC1: 'Reply on RC1', Rodrigo Campos Caba, 09 Dec 2025
Dear reviewer,
We would like to sincerely thank you for the time and care dedicated to evaluating our manuscript. Your constructive comments highlight several important aspects that will help us improve the clarity, transparency, and contextualization of our work.
As this is an initial response within the open discussion, ahead of the full revised manuscript and detailed rebuttal, we take the opportunity to address the major scientific points raised, clarify aspects of the methodology that may have been misunderstood, and acknowledge the many legitimate suggestions that will strengthen the revised version.
Before addressing the major comments, we note that two of the critiques concern developments that occurred after our original submission date, while another relates to the interpretation of our methodological framework. We respectfully clarify these points below while also emphasizing that we greatly appreciate the reviewer’s careful assessment and the improvements their feedback enables.
POINT 1: NOVELTY AND TIMELINE OF EXTREME-FOCUSED LOSS FUNCTIONS
The reviewer notes that recent studies (Hermans et al., 2025; Longo et al., 2025) have also explored loss functions tailored to extremes, suggesting that our contribution may be less novel than stated. While we acknowledge these contributions, and thank you for mentioning them, we respectfully note that Hermans et al. (2025) was published on 21 November 2025, after our submission (18 November 2025), and was therefore not available during manuscript preparation. Longo et al. is still a preprint, as is our own manuscript. Although preprints may be accessible before formal publication, they have not yet undergone peer review and their content may still evolve. For this reason, preprints are typically not treated as part of the established scientific record at the time of submission and are therefore not used to retrospectively reassess the novelty of a contribution.

Our work was developed independently, and the fact that Hermans et al. (2025) later introduced a dense-loss strategy for extremes, while Longo et al. (preprint) explored quantile-based alternatives, reinforces rather than diminishes the novelty of our contribution. This convergence of ideas across multiple groups highlights a timely and emerging research direction: the community is increasingly recognizing the need for loss functions tailored to the tails of the distribution, precisely the motivation that led us to propose MADc². The independent appearance of related approaches underscores the relevance of our methodology and the importance of explicitly addressing extremes in data-driven storm surge modeling.
A central contribution of our work is the introduction of the MADc² loss function, which builds upon the MADc metric that we first introduced in Campos-Caba et al. (2024a) for the evaluation of extreme storm surge simulations. That earlier study demonstrated that MADc uniquely identifies the configuration of our high-resolution dynamic model, which was specifically designed with a coastal resolution of 50 m, forced by a dedicated atmospheric downscaling, and calibrated to maximize performance on extreme events. This provided us with a physically credible and exceptionally strong benchmark for evaluating ML emulators, which we consider a further novel contribution of the study.
Following the publication of that work, we extended MADc into a differentiable loss function (MADc²) and formally presented it as a learning objective for emulators in several workshops (Campos-Caba et al., 2024b; Mentaschi et al., 2025; Campos-Caba et al., 2025).
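While the exact definition of MADc² is given in the manuscript and the cited works, the general idea behind an extremes-aware training objective can be illustrated with a simple tail-weighted variant of MSE. The sketch below is purely hypothetical (the threshold quantile `q` and weight `w` are illustrative parameters, not the actual MADc² formulation):

```python
import numpy as np

def tail_weighted_loss(y_true, y_pred, q=0.99, w=10.0):
    """Illustrative tail-weighted squared loss (NOT the authors' MADc²).

    Squared errors at time steps where the observed surge exceeds its
    q-th quantile are up-weighted by a factor w, so the optimizer pays
    more attention to the peaks than plain MSE would.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    thr = np.quantile(y_true, q)               # extreme-event threshold
    weights = np.where(y_true >= thr, w, 1.0)  # up-weight the tail
    return float(np.mean(weights * (y_true - y_pred) ** 2))
```

With `w = 1` this reduces exactly to MSE; larger `w` shifts the optimum toward fitting the peaks, which is the behavior an extremes-focused loss is meant to encourage.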
In the revision, we will:
• Clarify the developmental timeline of MADc and MADc².
• Contextualize our contribution within this emerging research direction.

POINT 3: USE OF MED-MFC SEA SURFACE HEIGHT AS A PREDICTOR
We thank the reviewer for raising this important point. You express concern that using high-resolution Med-MFC sea surface height (SSH) as a predictor may be “circular” or may “defeat the purpose” of data-driven emulators. We respectfully clarify that this interpretation does not apply to our methodological framework, and that this point in fact underscores another contribution of our work. Our objective is statistical downscaling, not full model replacement.
Using coarse-resolution model output to produce refined local predictions is a long-established and foundational practice in coastal oceanography, forming the backbone of operational storm surge forecasting for decades (e.g., Flather, 2000; von Storch & Woth, 2008). Dynamical downscaling systems routinely use coarse ocean model SSH to force high-resolution coastal models (Trotta et al., 2016; Federico et al., 2017). Our ML emulator performs the same function statistically: refining operationally available basin-scale fields to resolve coastal processes that coarse models cannot capture.
Therefore, a further novel aspect of our approach is integrating ML emulation directly into operational forecasting workflows. Most ML storm surge studies predict from atmospheric variables alone. In contrast, we mirror operational downscaling chains: coarse SSH from Copernicus Med-MFC (which is freely and consistently available in near-real time) is statistically refined to tide-gauge scale. This positions our emulator as a computationally efficient component within existing systems, not a standalone replacement.
Far from circular, this approach is operationally advantageous: it leverages high-quality basin-scale output (already assimilating observations) and focuses ML on coastal refinement, where dynamical modeling is most expensive.
That said, your suggestion is valuable. In the revised manuscript, we will comment on this alternative approach and clarify the distinction between our statistical downscaling framework and time-series forecasting methods. We will also:
• More explicitly describe the downscaling framework and distinguish it from full surrogate modeling.
• Comment on the complementary experiment you propose (evaluating models using atmospheric predictors only).
• Clarify the strengths and limitations of using SSH-driven statistical downscaling for extreme-value prediction.

We also acknowledge that your proposed improvements in this section are legitimate and will meaningfully enhance the manuscript.
OTHER COMMENTS
We acknowledge the remaining points and thank the reviewer for highlighting them. In the revised manuscript, we will:
• Expand the discussion of PCA limitations and clarify that several encoding approaches exist but were not explored here.
• Strengthen Section 4 by situating our findings more clearly within the broader literature on data-driven storm surge modeling (including Hermans et al., Longo et al., Tiggeloven et al., Tadesse et al., and others).
• Provide full hyperparameter details for all neural network architectures (depth, units, learning rate, dropout rates, and optimization settings).
• Add the predictor spatial domains directly to Figure 1.
• Clarify the rationale for using the 99th percentile threshold and include a brief discussion of behavior above the 99.9th percentile.
• Acknowledge the limitation that only two locations are analyzed.
• Replace rainbow color maps with perceptually uniform alternatives.
• Make all code and data available in a public repository to guarantee full reproducibility.
• Add clarifications regarding the temporal split and the representativeness of extremes across the three-year testing period.

These are constructive suggestions, and we will incorporate them systematically into the revised manuscript.
Finally, we thank the reviewer again for their thoughtful and constructive feedback. We believe that addressing these points will substantially strengthen the manuscript, particularly in clarifying the novelty of our contribution, aligning the methodology with established downscaling practices, and enhancing transparency in our ML implementation. We are preparing a full revised version of the manuscript and a detailed rebuttal accordingly.
REFERENCES
Federico, I., Pinardi, N., Coppini, G., Oddo, P., Lecci, R., & Mossa, M. (2017). Coastal ocean forecasting with an unstructured grid model in the southern Adriatic and northern Ionian seas. Natural Hazards and Earth System Sciences, 17(1), 45–59. https://doi.org/10.5194/nhess-17-45-2017.
Flather, R. (2000). Existing operational oceanography. Coastal Engineering, 41, 13–40.
Hermans, T., Hammouda, C., Treu, S., Tiggeloven, T., Couasnon, A., Busecke, J., and van de Wal., R. (2025). Computing extreme storm surges in Europe using neural networks. Nat. Hazards Earth Syst. Sci., 25, 4593-4612. https://doi.org/10.5194/nhess-25-4593-2025.
Campos-Caba, R., Alessandri, J., Camus, P., Mazzino, A., Ferrari, F., Federico, I., Vousdoukas, M., Tondello, M., and Mentaschi, L. (2024a). Assessing storm surge model performance: what error indicators can measure the model’s skill? Ocean Sci., 20, 1513-1526. https://doi.org/10.5194/os-20-1513-2024.
Campos-Caba, R., Mentaschi, L., Pinardi, N., Alessandri, J., Camus, P., Tondello, M., Mazzino, A., and Ferrari, F. (2024b). Developments on a machine learning downscaling system for storm surge in the Northern Adriatic Sea. Fourth ESA-ECMWF workshop: Machine Learning for Earth system observation and prediction. Frascati, Italy. Available at: [https://www.ml4esop.esa.int/posters].
Campos-Caba, R., Alessandri, J., Camus, P., Mazzino, A., Ferrari, F., Federico, I., Vousdoukas, M., Tondello, M., Coppini, G., and Mentaschi, L. (2025). Enhancing storm surge downscaling: A comparative study of machine learning and dynamical modeling in the northern Adriatic Sea. 4th International Workshop on Waves, Storm Surges, and Coastal Hazards. Santander, Spain.
Longo, E., Ficchi, A., Verlaan, M., Muis, S., Castelletti, A. (preprint). A deep learning framework for extreme storm surge modelling under future climate scenarios. Manuscript submitted to Earth’s Future.
Mentaschi, L, Campos-Caba, R., Alessandri, J., Camus, P., Mazzino, A., Ferrari, F., Federico, I., Vousdoukas, M., Tondello, M., and Coppini, G. (2025). Storm surge prediction in the Northern Adriatic Sea: a comparison between Machine Learning and numerical modelling. European Geoscience Union, General Assembly 2025. Vienna, Austria. Abstract available at: [https://meetingorganizer.copernicus.org/EGU25/EGU25-17094.html].
Trotta, F., Fenu, E., Pinardi, N., Bruciaferri, D., Giacomelli, L., Federico, I., & Coppini, G. (2016). A structured and unstructured grid relocatable ocean platform for forecasting (SURF). Deep-Sea Research II, 133, 54–75. https://doi.org/10.1016/j.dsr2.2016.05.004.
Von Storch, H., and Woth, K. (2008). Storm surges: perspectives and options. Sustain Sci. 3:33-43. https://doi.org/10.1007/s11645-008-0044-2.
Citation: https://doi.org/10.5194/egusphere-2025-5313-AC1
RC2: 'Reply on AC1', Anonymous Referee #1, 10 Dec 2025
The initial response of the authors to my review is sensible, but I disagree with the authors’ stance on using preprints. Dismissing preprints because they ‘are not part of the established scientific record’ goes against the purpose of preprints, which is to accelerate science by making research available earlier. Even though they have not been peer-reviewed yet, preprints have their own DOI and are therefore traceable and citeable. The specific studies I referred to have been available as preprints for multiple months, so could have been included. Furthermore, in the past years, different ways to address data imbalance in regression problems in general have been investigated as well. With the comment in my review I did not mean to discredit the present study, but rather to point out how, in my view, the authors could make a bigger contribution to advancing the science in this regard with relatively little additional effort. I will leave it up to the editor to decide whether they find that additional effort necessary or not, but it is good to read that the authors at least plan to include the latest research in their introduction and discussion.
Citation: https://doi.org/10.5194/egusphere-2025-5313-RC2
RC3: 'Comment on egusphere-2025-5313', Anonymous Referee #2, 17 Feb 2026
The manuscript examines the use of the MADc² metric as an improved loss function in machine learning emulators, by demonstrating its performance against MSE-based loss functions in a few different multivariate regression and ML models. The study is timely and of interest in the respective field. The overall manuscript is of high quality, well-prepared and thoroughly organized. A few issues I would like to raise are the following points:
1. The temporal prediction strategy is not mentioned in the manuscript. It is not stated whether the rolling window approach or a one-shot strategy was applied, along with the related input window and prediction window sizes. This is important for understanding the predictive capability of the model in practice.
2. The hyperparameter tuning procedure is not presented. Ranges of the varied parameters along with the respective range evaluation metrics should be provided. Using MSE loss during hyperparameter tuning further enhances the arguments in favor of the MADc² - based models, however this choice and the respective limitations (as for instance, using MADc² loss for tuning might probably further improve the models) should be mentioned.
3. There is an inconsistency regarding the choice of the "best model". In L185, it is stated that "This slope was therefore used to identify the best-performing model run", with the slope also being mentioned as the criterion for choosing a "best model" in the violin plot captions. However, in L240, it is stated that: "each emulator architecture was trained and evaluated on the testing set 40 times. For each run and location, statistical metrics were computed separately and then averaged across Punta della Salute and Trieste, yielding one set of values per run". So in the evaluations in sections 3.3 and 3.4, what model was used? The one determined by best slope during validation, or the average per location?
4. In L107: "The first seven principal components (PCs) of each predictor were used independently, without merging into a multivariate series". What is the difference? Isn't the input a 2D array, with one dimension being the time and the other the different features? Some schematic on the feature dimensionality might improve the presentation. Also in L109, more information should be provided for the MLR performance tests (e.g., on the whole test set? Best model?)
5. In L145, the synthetic case study is unclear, and the caption of the respective Figure (S7) is probably wrong.
6. Around L260, it is mentioned that: "the MADc²-trained emulators exhibit performance comparable to SHYFEM-MPI, with a significant portion of their distributions outperforming the benchmark (Figure 2b)". This is quite an overstatement, as only a portion of the low tails are below the benchmark lines. In Figure 2, indices a-c are also missing in the graphic.
7. In Figure 2c, it is shown that all MADc² models appear to have greater variability than their MSE counterparts when evaluated with the MADc metric. An explanation to that would be interesting for interpreting the behavior of MADc² as a loss metric.
8. In L116, is this a temporal permutation (i.e., are the values in each time series shuffled in order)?
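For context, a common variant of the procedure this question refers to is temporal permutation importance: one feature's time series is shuffled while the others are left intact, and the resulting skill degradation is taken as that feature's importance. The sketch below is a generic illustration (a toy linear "model", not necessarily the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: three candidate features, only the first actually drives y.
n_t = 400
X = rng.standard_normal((n_t, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n_t)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)          # simple linear "model"

def mse(X_in):
    return float(np.mean((X_in @ coef - y) ** 2))

base = mse(X)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])              # shuffle feature j in time
    importance.append(mse(Xp) - base)                 # skill drop = importance
```

Shuffling the informative feature sharply degrades the fit, while shuffling the inert ones barely changes it, which is exactly the signal a permutation test is meant to extract.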
9. Around L285 and in Figure 4: Is this the improvement compared to the MSE models or the numerical model baseline? Probably the former, but it would be better to clarify. In the text, the discussion reads as quite cumbersome with all the percentages included. It would be better to keep only the take-home message of these statistics. In the figure caption, the term "variability" is a bit misleading. Probably the term "deviation" would be more appropriate (?). The authors could also state the formula to clarify.
10. In L415-L420: The PIT histogram could be explained a bit further, particularly the meaning of the x-axis, for less informed readers.
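For context, the PIT (Probability Integral Transform) evaluates the forecast CDF at each observed value; the x-axis of a PIT histogram is simply that CDF value in [0, 1], and a calibrated forecast yields a flat histogram. A minimal generic sketch (Gaussian toy forecasts, not tied to the manuscript's models):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

# Observations drawn from N(0, 1), and a calibrated Gaussian forecast
# with the correct mean and spread.
n = 2000
obs = rng.normal(0.0, 1.0, n)

def gauss_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, evaluated at a scalar x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# PIT value = forecast CDF evaluated at the observation.
pit = np.array([gauss_cdf(o) for o in obs])

# A calibrated forecast gives PIT values uniform on [0, 1]: flat histogram.
hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
```

A U-shaped histogram would instead indicate an underdispersive forecast, and a hump-shaped one an overdispersive forecast.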
11. A few figure captions include typos: in Figure 3, "bias" should be used instead of "RMSE". In Figure 11, "LSTMh" should be used instead of "MADc²". In Figure 12, (a), (b) is for Trieste and (c), (d) is for Punta della Salute.
12. The limitations of this study should be more thoroughly discussed. While this is a very interesting case study to benchmark the MADc² loss metric that the authors previously assessed (and its advantages over MSE) in a few ML architectures, it should be clarified that factors such as the limited number of stations, the use of only a few time segments for testing (and potentially the prediction horizon - not sure if applicable) that were used in this assessment limit the scope of this study from extracting further conclusions on the overall performance of ML emulators in real time scenarios.
Addressing the above, in my opinion, would improve the scientific quality of the manuscript and render it suitable to be published.
Citation: https://doi.org/10.5194/egusphere-2025-5313-RC3
RC4: 'Comment on egusphere-2025-5313', Anonymous Referee #3, 23 Feb 2026
General Comments:
Significant efforts are made to use best practices for machine learning and to evaluate the statistical significance of the results. This is the primary strength of the paper. Additionally, the application of ML for storm surge to the Adriatic Sea is relatively novel. The use of multiple training metrics and separate consideration of extreme surges is also a plus. The paper does not clearly explain the ML problem formulation. Additionally, there is little explanation of why storm surge prediction is important, and no clear indication of how the developed models would be used.
Specific Comments:
- The introduction talks about storm surge, which is caused by extreme weather events, but then the paper focuses on predicting water levels at tidal gauges. Storm surge specifically refers to the excess water height above the regular tide caused by an extreme weather event. This distinction needs to be made clear in the paper.
- Traditionally, storm surge forecasting focuses on regions subject to tropical cyclones, such as the North Atlantic, the Indian Ocean, or the Pacific Ocean. They are responsible for significant loss of life and property damage, which justifies significant forecasting efforts. What is the impact of storm surge in the Adriatic Sea specifically?
- To improve reproducibility, the full list of features should be enumerated in a table.
- It appears that separate models are trained for each prediction location, but that the predictor inputs to these models are identical. If so, have you considered including location specific predictors? Also, this should be clearly explained.
- The most important predictor is sea surface height, which is an output from a physics-based ocean circulation model. This means that in any application context, the physics-based model would need to be run first. Consequently, the models in this work function as correctors to an existing physics-driven model of water height. This can result in greater accuracy than the original model but not increased computational efficiency. The cost of running Med-MFC needs to be added to the ML model training/inference cost for purposes of computational performance comparison.
- When discussing the potential applications of these models, it needs to be made clear that they rely on the sea surface output of Med-MFC, which is the same physical quantity as what the models intend to predict (sea surface height). Using them in an ensemble forecast scenario will require multiple evaluations of Med-MFC. It may be worthwhile to train a model that relies only on wind and pressure fields, which is more suitable for integration into forecasting applications and has the expected performance advantages for a machine learning model (compared to physics-based ocean circulation models).
Technical Corrections:
No comments.
Citation: https://doi.org/10.5194/egusphere-2025-5313-RC4
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 494 | 187 | 39 | 720 | 70 | 34 | 32 |