the Creative Commons Attribution 4.0 License.
Computing Extreme Storm Surges in Europe Using Neural Networks
Abstract. Because of the computational costs of computing storm surges with hydrodynamic models, projections of changes in extreme storm surges are often based on small ensembles of climate model simulations. This may be resolved by using data-driven storm-surge models instead, which are computationally much cheaper to apply than hydrodynamic models. However, the potential performance of data-driven models at predicting extreme storm surges is unclear because previous studies did not train their models to specifically predict the extremes, which are underrepresented in observations. Here, we investigate the performance of neural networks at predicting extreme storm surges at 9 tide-gauge stations in Europe when trained with a cost-sensitive learning approach based on the density of the observed storm surges. We find that density-based weighting improves both the error and timing of predictions of exceedances of the 99th percentile made with Long Short-Term Memory (LSTM) models, with the optimal degree of weighting depending on the location. At most locations, the performance of the neural networks also improves by exploiting spatiotemporal patterns in the input data with a convolutional LSTM (ConvLSTM) layer. The neural networks generally outperform an existing multi-linear regression model, and at the majority of locations, the performance of especially the ConvLSTM models approximates that of the hydrodynamic Global Tide and Surge Model. While the neural networks still predominantly underestimate the highest extreme storm surges, we conclude that addressing the imbalance in the training data through density-based weighting helps to improve the performance of neural networks at predicting the extremes and is a step towards their use for climate projections.
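The density-based weighting described in the abstract can be sketched as follows. This is a minimal illustration of cost-sensitive sample weighting in the spirit of DenseLoss, not the authors' exact implementation: the Gaussian KDE, the Silverman bandwidth rule, the normalisation steps, and the names `density_weights` and `weighted_mse` are all assumptions for the sketch.

```python
import numpy as np

def density_weights(y, alpha=1.0, eps=1e-6, bandwidth=None):
    """Per-sample weights that up-weight rare target values.

    Estimate the density of the training targets with a Gaussian KDE,
    normalise it to [0, 1], and give low-density (rare, extreme) samples
    weights close to the maximum while common samples are down-weighted
    by a factor controlled by alpha.
    """
    y = np.asarray(y, dtype=float)
    if bandwidth is None:
        # Silverman's rule of thumb
        bandwidth = 1.06 * y.std() * len(y) ** (-1 / 5)
    # Gaussian KDE evaluated at each training target
    diffs = (y[:, None] - y[None, :]) / bandwidth
    dens = np.exp(-0.5 * diffs**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    dens = (dens - dens.min()) / (dens.max() - dens.min())  # normalise to [0, 1]
    w = np.maximum(1.0 - alpha * dens, eps)  # rare values keep high weight
    return w / w.mean()                      # mean-1 weights keep the loss scale stable

def weighted_mse(y_true, y_pred, w):
    """Cost-sensitive MSE with per-sample weights."""
    return np.mean(w * (np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```

In practice such weights would be passed to the training loop of the LSTM (e.g. as per-sample weights on the loss), with alpha controlling how strongly the rare extremes are emphasised.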
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-196', Anonymous Referee #1, 10 Feb 2025
- AC1: 'Reply on RC1', Tim Hermans, 10 Sep 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-196/egusphere-2025-196-AC1-supplement.pdf
- RC2: 'Comment on egusphere-2025-196', Anonymous Referee #2, 06 Aug 2025
This paper presents an investigation into the use of neural networks (NNs) for predicting extreme storm surges. The core contribution is the application and evaluation of a cost-sensitive learning approach (DenseLoss) to specifically improve the prediction of the rare, high-impact events. Two NN architectures (LSTM and ConvLSTM) were compared against both a simpler statistical model (MLR) and a hydrodynamic model (GTSM) across nine European tide-gauge locations. This is a well-written paper on an interesting topic. I believe the manuscript could be strengthened by considering the following points.
- The paper identifies data imbalance as a major issue, but the choice of DenseLoss needs stronger justification. It is essentially a simple re-weighting technique; why was it chosen over more advanced options such as SMOGN? Does up-weighting rare events actually help the model learn their complex, nonlinear physics, or does it merely force better scores on a few outliers at the expense of overall physical consistency?
- The conclusions are based on an experimental setup with several fixed, important parameters—like using just nine tide gauges. Why these, and do they really capture Europe’s varied coastal dynamics? And how did you choose a 5×5° domain and a 24-hour lookback? A sensitivity analysis would show whether your results hold up when these parameters change.
- Based on your results, the NNs still tend to underestimate the very highest extremes (99.9th percentile). Since accurate tail behavior is key for hazard assessment, it would help to investigate why this happens. Is it a smoothing effect in the ERA5 reanalysis, or a limit in the network's ability to extrapolate even with DenseLoss? A brief investigation could really strengthen your conclusions.
- Noting that the best α parameter changes from site to site raises practical challenges: is the model capturing general patterns or simply fitting each location’s unique data distribution? If you must tune α for every gauge, rolling this out to hundreds of sites becomes both computationally heavy and methodologically challenging.
- The comparison between the ConvLSTM and GTSM would be more balanced if both models used the same input cadence. The ConvLSTM is driven by 3-hourly data, whereas GTSM benefits from hourly forcing, which may contribute to its sharper extreme peaks. Ideally, the authors could run GTSM on the same 3-hourly inputs; if that is not feasible, a clearer justification for the differing cadences would be helpful.
- The paper would be improved by a brief discussion of its findings in the context of other advanced architectures. The authors should consider contextualizing their work with respect to models like Graph Neural Networks (GNNs), hierarchical deep neural networks, and Gaussian Process models, which have been successfully applied to similar spatiotemporal problems. This would provide valuable perspective on why LSTM/ConvLSTM were chosen and how they fit within the rapidly evolving field.
Minor Comment:
- Line 487 (Appendix A): "Regularization and normalization help to avoid overfitting..." It should be "Regularization and dropout help to avoid overfitting...". Batch normalization serves a different primary purpose (stabilizing and accelerating training).
Citation: https://doi.org/10.5194/egusphere-2025-196-RC2
- AC2: 'Reply on RC2', Tim Hermans, 10 Sep 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-196/egusphere-2025-196-AC2-supplement.pdf
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
917 | 143 | 34 | 1,094 | 43 | 55
Review of the Manuscript “Computing Extreme Storm Surges in Europe Using Neural Networks”
The manuscript presents an approach to storm surge prediction using deep learning. Several methodological issues must be addressed to improve the manuscript's clarity and rigor. In particular, clarifying the dataset construction, justifying the hyperparameter choices, and improving the performance evaluation would significantly strengthen the manuscript. In its current form, the manuscript is not appropriate for publication.
Introduction:
Methodology – Data Preparation:
Model Training & Hyperparameter Tuning:
Performance Evaluation:
Discussion & Conclusion:
Minor Edits: