the Creative Commons Attribution 4.0 License.
Assessing the storm surge model performance: What error indicators can measure the skill?
Abstract. A well-validated storm surge numerical model is crucial, offering precise coastal hazard information and serving as a basis for extensive databases and advanced data-driven algorithms. However, selecting the best model setup based solely on common error indicators such as RMSE or Pearson correlation does not always yield optimal results. To illustrate this, we conducted 34-year high-resolution storm surge simulations under barotropic (BT) and baroclinic (BC) configurations, using atmospheric data from ERA5 and a high-resolution downscaling of the Climate Forecast System Reanalysis (CFSR) developed by the University of Genoa (UniGe). We combined forcings and configurations to produce three datasets: 1) BT-ERA5, 2) BC-ERA5, and 3) BC-UniGe. The model performance was assessed against nearshore station data using various statistical metrics. While RMSE and Pearson correlation suggest BT-ERA5, i.e. the coarsest and simplest setup, as the best model, followed by BC-ERA5, we demonstrate that these indicators are not always reliable for performance assessment. The most sophisticated model, BC-UniGe, shows worse RMSE and Pearson correlation values due to the so-called “double penalty” effect. Here we propose new skill indicators that assess the ability of the model to reproduce the distribution of the observations. This, combined with an analysis of values above the 99th percentile, identifies BC-UniGe as the best model, while the ERA5 simulations tend to underestimate the extremes. Although the study focuses on the accurate representation of storm surges by the numerical model, the analysis and proposed metrics can be applied to any problem involving the comparison between simulated and observed time series.
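The “double penalty” effect mentioned in the abstract can be illustrated with a minimal synthetic sketch (hypothetical series, not the study's data): a simulated peak with the correct height but a small time lag is penalized once for missing the observed peak and again for the false alarm at the lagged time, so RMSE grows and the Pearson correlation collapses even though the surge distribution is reproduced exactly.

```python
import numpy as np

t = np.arange(500)
obs = np.exp(-0.5 * ((t - 250) / 4.0) ** 2)  # observed surge peak at t = 250
sim = np.exp(-0.5 * ((t - 262) / 4.0) ** 2)  # same peak, lagged by 12 steps

rmse = np.sqrt(np.mean((sim - obs) ** 2))
r = np.corrcoef(sim, obs)[0, 1]

# Identical peak height and distribution, yet both pointwise
# indicators report a severe error purely because of the phase shift.
print(f"RMSE = {rmse:.3f}, Pearson r = {r:.3f}")
```

Notably, a flat simulation that misses the peak entirely would score a lower RMSE than the lagged run in this sketch, which is exactly the pathology that distribution-based indicators are designed to avoid.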
Status: open (until 17 Jul 2024)
RC1: 'Comment on egusphere-2024-1415', Anonymous Referee #1, 25 Jun 2024
General comments
In the presented study, the authors applied a storm surge model of the Northern Adriatic Sea, running it in both barotropic and baroclinic modes and utilizing different atmospheric forcings. Afterwards, they introduced new metrics (MADp, MADc) to evaluate the skill of these different model configurations in reproducing regional storm surges. Traditional metrics, such as the Pearson correlation coefficient and RMSE, indicated that the model performed best when run in barotropic mode and with a coarser global atmospheric forcing. In contrast, the new metrics suggested that the model yielded the best results when run in baroclinic mode and with downscaled high-resolution atmospheric forcing. The findings underscore the importance of using multiple metrics, understanding their limitations, and applying expert judgement to properly evaluate model performance.
Overall, the study presents valuable new ideas for evaluating the skill of storm surge models. The manuscript is written in a comprehensible manner and the results are well-presented and thoroughly discussed. However, there are some issues that should be addressed by the authors. A list with specific comments can be found below:
Specific comments
- LL59-66: “Over recent years, unstructured grid models have increasingly emerged as alternatives to regular grids for large-scale simulations (e.g. Mentaschi et al., 2020; Muis et al., 2016; Vousdoukas et al., 2018; Fernández-Montblanc et al., 2020; Saillour et al., 2021; Wang et al., 2022; Zhang et al., 2023; Mentaschi et al., 2023), with established circulation unstructured models like […] Delft3D (Deltares: Delft, 2024), among others.”
The standard Delft3D model is actually based on structured grids. In this context, you should specifically refer to Delft3D-FM, which uses unstructured grids.
- LL142-144: “Tides with hourly resolution from the Finite Element Solution (FES) 2014 (Lyard et al., 2021) were also included to account for the total sea level in the simulations.”
Were all tidal constituents from FES2014 included in the boundary forcing, or was just a selection of constituents used?
- LL206-207: “Additionally, with the aim of considering the representation of extremes by the simulations, we introduce two new metrics based on customized versions of the Mean Absolute Deviation (MAD):”
a) Why don’t you use the traditional MAD as a metric to assess your models’ qualities, given that your new metrics are based on it? Does the traditional MAD already exhibit similar behaviour to MADp and MADc in identifying BC-UniGe as the best model? If it does, this would raise the question of why the new metrics are necessary.
b) Have you also considered introducing RMSEp and RMSEc, either instead of or in addition to MADp and MADc? If so, why have you decided against using them?
- Figures 3-7
a) When looking at Figure 7, it appears that BC-UniGe overestimates the extremes, while the other configurations underestimate them. A closer look at the models’ abilities to simulate these extremes (99th percentile in Figure 6) actually shows that most metrics indicate BC-UniGe as the best model setup. When assessing the overall quality of the models, as highlighted in the other figures, it is actually challenging to identify a clear best model. It’s true that the Pearson correlation coefficient and the RMSE indicate better performance for the ERA5 configurations, while MADp and MADc mostly identify BC-UniGe as superior. However, the differences in the metrics across the various configurations are often only of the order of millimetres. This should be highlighted and discussed in further detail. Maybe models of other regions (with higher variations in water levels) might even be better suited to demonstrate the benefits (and limitations) of your newly introduced metrics.
b) Do you also have any insights into why model performance for the locations of Monfalcone and Caorle differs from that of the other locations?
c) Furthermore, is it necessary to show 6 panels in all these figures (one for each tide gauge)? If most of the gauges show similar behaviour, it might be more effective to display only selected representative gauges and provide the rest in the supplementary material.
- LL343-345: “In RMSE, “double penalty” is further amplified compared to MAD, as the penalizations due to the peak mismatch are squared. This means that phase errors have a disproportionately large impact on RMSE.”
To provide a clearer picture, it would be beneficial to include the traditional MAD as an additional metric. Without this, it is challenging to determine whether the “double penalty” is the primary issue or if this amplification leads to the RMSE favouring the ERA5 configurations.
- LL371-372: “This is because BC-UniGe is more prone to phase error than BT-ERA5, and is thus doubly penalized in indicators such as RMSE, MAD and Pearson correlation.”
Is this speculation or have you quantified it? It would be beneficial to understand the magnitude of phase differences between simulated and observed peaks across the different configurations of your model.
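The magnitude of such phase errors could be quantified with a cross-correlation lag estimate; the helper below is a hypothetical sketch (not from the manuscript under review) that reports, for a pair of station series, the lag in time steps maximizing the simulated–observed cross-correlation.

```python
import numpy as np

def estimate_lag(sim, obs):
    """Estimate the phase error in time steps: positive if the
    simulation is delayed relative to the observation.
    Hypothetical helper, not part of the manuscript under review."""
    sim = sim - sim.mean()
    obs = obs - obs.mean()
    cc = np.correlate(sim, obs, mode="full")
    return int(np.argmax(cc)) - (len(obs) - 1)

# Synthetic check: a surge peak simulated 8 steps too late
t = np.arange(200)
obs = np.exp(-0.5 * ((t - 100) / 5.0) ** 2)
sim = np.exp(-0.5 * ((t - 108) / 5.0) ** 2)
print(estimate_lag(sim, obs))  # → 8
```

Tabulating such lags per station and configuration would show directly whether one configuration is more phase-prone than another, as the referee asks.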
Technical corrections
- LL15-18: “To illustrate this, we conducted 34-year high-resolution simulations for storm surge under barotropic (BT) and baroclinic (BC) configurations, using atmospheric data from ERA5 and a high-resolution downscaling of the Climate Forecast System Reanalysis (CFSR) developed by the University of Genoa (UniGe).”
I have noticed this throughout your paper. Since your simulations deal with multiple extreme events, it would often be more fitting to use “storm surges” instead of “storm surge”.
- LL88-90: “The model has been already implemented in operational (Federico et al., 2017) and relocatable (Trotta et al., 2016) forecasting framework, and for storm surge events (Park et al., 2022; Alessandri et al., 2023).”
The term “frameworks” should be used here instead of “framework”.
- LL104-105: “Finally, the conclusion show on Section 5 summarizes the key points of the study.”
The word “show” should be omitted here.
- LL112-114: “ERA5 is relatively high resolution and accurate for a global reanalysis, although it is known to be affected by negative biases at high percentiles, particularly when is compared with measured wind speed (Pineau-Guillou et al., 2018; Vannucchi et al., 2021; Benetazzo et al., 2022; Gumuscu et al., 2023).”
“[…], particularly when compared with measured wind speeds […].”
- Figure 1
The red dashed box in panel (a) is barely visible; please use a thicker line. Additionally, the font sizes in panels (a) and (b) differ, and some fonts in panel (a) are extremely small. Have you also verified whether the chosen colours in all your figures are readable for people with colour vision deficiencies? Please refer to the journal guidelines for further details.
- LL158-160: “The observational data were acquired from Italian National Institute for Environmental Protection and Research (ISPRA), the Civil Protection of the Friuli-Venezia Giulia Region, and Raicich (2023).”
“[…] the Italian National Institute […]”
- L169: “Both the model output and the observations were processed as follow to enable their intercomparability.”
“[…] as follows […]”
Citation: https://doi.org/10.5194/egusphere-2024-1415-RC1
RC2: 'Comment on egusphere-2024-1415', Anonymous Referee #2, 28 Jun 2024
Dear authors,
Congratulations on the manuscript. The manuscript compares different model assessment metrics (including two new ones proposed by the authors) using storm surge simulations forced with different boundary conditions. The authors find that, depending on the metric used, different models have the best performance, highlighting the sensitivity of such metrics and the need to use multiple ones when analyzing model performance. The manuscript is well-written and of high societal interest. The conclusions are supported by the results. I have some minor comments and suggestions for the authors.
Introduction:
It is important to explicitly state the objective of the paper in the introduction. We infer it from the title and the discussion, but it is important to write it down explicitly. Also, it is important to introduce the problem before the objective (e.g. why is there a need to assess the error indicators for numerical models, even though they are established?).
Line 88: Fix citation
Line 124: Fix citation
Line 131: Add the definition of nearshore used throughout the paper (it may differ across disciplines)
Line 150-152: Consider entering the formula here instead.
Line 205: Could you also add the standard MAD for comparison, so as to understand the effects of your changes on the metric they are based on? Also, since these are new metrics, they would benefit from some sort of “validation”, or at least a sensitivity analysis based on well-defined synthetic time series. Adding such a comparison between your new metrics and the standard ones on synthetic series (thus with specific, known errors for testing) would enhance this paper significantly, especially considering that the variability across the different metrics is small.
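A synthetic test of the kind suggested here might look like the following sketch (entirely hypothetical series with controlled, known errors; the quantile-based MAD is only a stand-in for the proposed MADp/MADc, whose exact definitions it does not reproduce). It contrasts a run that reproduces the observed peak with a phase lag against a run with perfect timing but a damped peak:

```python
import numpy as np

t = np.arange(500)

def peak(center, amp):
    return amp * np.exp(-0.5 * ((t - center) / 4.0) ** 2)

obs = peak(250, 1.0)      # observed surge peak
lagged = peak(262, 1.0)   # known error: 12-step phase lag, exact amplitude
damped = peak(250, 0.5)   # known error: perfect timing, halved amplitude

def rmse(sim):
    return float(np.sqrt(np.mean((sim - obs) ** 2)))

def qmad(sim):
    """Distribution-based MAD between sorted (quantile) values,
    insensitive to phase errors by construction."""
    return float(np.mean(np.abs(np.sort(sim) - np.sort(obs))))

# RMSE rewards the smooth, damped run; the distribution-based
# metric rewards the run that reproduces the observed peak height.
print(rmse(lagged), rmse(damped))   # lagged scores worse under RMSE
print(qmad(lagged), qmad(damped))   # lagged scores better under qmad
```

Because the injected errors are known, such a test makes explicit which error type each metric penalizes.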
Results: Consider adding a small section where you discuss the storm patterns in the dataset (e.g. histograms, frequency, and so forth). It does not need to be long, since the core of the paper is about testing the skill assessment methods. Nonetheless, it is important for the reader to conceptualize the sort of storm patterns seen here.
Radar charts: Add to the captions of the applicable figures throughout the manuscript that values towards the fringe indicate better performance.
Citation: https://doi.org/10.5194/egusphere-2024-1415-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 182 | 71 | 12 | 265 | 6 | 7 |