This work is distributed under the Creative Commons Attribution 4.0 License.
Assessing the storm surge model performance: What error indicators can measure the skill?
Abstract. A well-validated storm surge numerical model is crucial, offering precise coastal hazard information and serving as a basis for extensive databases and advanced data-driven algorithms. However, selecting the best model setup based solely on common error indicators such as RMSE or the Pearson correlation does not always yield optimal results. To illustrate this, we conducted 34-year high-resolution simulations for storm surge under barotropic (BT) and baroclinic (BC) configurations, using atmospheric data from ERA5 and a high-resolution downscaling of the Climate Forecast System Reanalysis (CFSR) developed by the University of Genoa (UniGe). We combined forcings and configurations to produce three datasets: 1) BT-ERA5, 2) BC-ERA5, and 3) BC-UniGe. The model performance was assessed against nearshore station data using various statistical metrics. While RMSE and the Pearson correlation suggest BT-ERA5, i.e. the coarsest and simplest setup, as the best model, followed by BC-ERA5, we demonstrate that these indicators are not always reliable for performance assessment. The most sophisticated model, BC-UniGe, shows worse values of RMSE and Pearson correlation due to the so-called “double penalty” effect. Here we propose new skill indicators that assess the ability of the model to reproduce the distribution of the observations. This, combined with an analysis of values above the 99th percentile, identifies BC-UniGe as the best model, while the ERA5 simulations tend to underestimate the extremes. Although the study focuses on the accurate representation of storm surge by the numerical model, the analysis and proposed metrics can be applied to any problem involving the comparison of simulated and observed time series.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-1415', Anonymous Referee #1, 25 Jun 2024
General comments
In the presented study, the authors applied a storm surge model of the Northern Adriatic Sea, running it in both barotropic and baroclinic modes and utilizing different atmospheric forcings. Afterwards, they introduced new metrics (MADp, MADc) to evaluate the skill of these different model configurations in reproducing regional storm surges. Traditional metrics, such as the Pearson correlation coefficient and RMSE, indicated that the model performed best when run in barotropic mode and with a coarser global atmospheric forcing. In contrast, the new metrics suggested that the model yielded the best results when run in baroclinic mode and with downscaled high-resolution atmospheric forcing. The findings underscore the importance of using multiple metrics, understanding their limitations, and applying expert judgement to properly evaluate model performance.
Overall, the study presents valuable new ideas for evaluating the skill of storm surge models. The manuscript is written in a comprehensible manner and the results are well-presented and thoroughly discussed. However, there are some issues that should be addressed by the authors. A list with specific comments can be found below:
Specific comments
- LL59-66: “Over recent years, unstructured grid models have increasingly emerged as alternatives to regular grids for large-scale simulations (e.g. Mentaschi et al., 2020; Muis et al., 2016; Vousdoukas et al., 2018; Fernández-Montblanc et al., 2020; Saillour et al., 2021; Wang et al., 2022; Zhang et al., 2023; Mentaschi et al., 2023), with established circulation unstructured models like […] Delft3D (Deltares: Delft, 2024), among others.”
The standard Delft3D model is actually based on structured grids. In this context, you should specifically refer to Delft3D-FM, which uses unstructured grids.
- LL142-144: “Tides with hourly resolution from the Finite Element Solution (FES) 2014 (Lyard et al., 2021) were also included to account for the total sea level in the simulations.”
Were all tidal constituents from FES2014 included in the boundary forcing, or was just a selection of constituents used?
- LL206-207: “Additionally, with the aim of considering the representation of extremes by the simulations, we introduce two new metrics based on customized versions of the Mean Absolute Deviation (MAD):”
a) Why don’t you use the traditional MAD as a metric to assess your models’ qualities, given that your new metrics are based on it? Does the traditional MAD already exhibit similar behaviour to MADp and MADc in identifying BC-UniGe as the best model? If it does, this would raise the question of why the new metrics are necessary.
b) Have you also considered introducing RMSEp and RMSEc, either instead of or in addition to MADp and MADc? If so, why have you decided against using them? (A sketch of how such percentile-based metric variants might be implemented is given after these specific comments.)
- Figures 3-7
a) When looking at Figure 7, it appears that BC-UniGe overestimates the extremes, while the other configurations underestimate them. A closer look at the models’ abilities to simulate these extremes (99th percentile in Figure 6) actually shows that most metrics indicate BC-UniGe as the best model setup. When assessing the overall quality of the models, as highlighted in the other figures, it is actually challenging to identify a clear best model. It is true that the Pearson correlation coefficient and the RMSE indicate better performance for the ERA5 configurations, while MADp and MADc mostly identify BC-UniGe as superior. However, the differences in the metrics across the various configurations are often only of the order of millimetres. This should be highlighted and discussed in further detail. Models of other regions (with larger variations in water levels) might even be better suited to demonstrate the benefits (and limitations) of your newly introduced metrics.
b) Do you also have any insights into why model performance for the locations of Monfalcone and Caorle differs from that of the other locations?
c) Furthermore, is it necessary to show 6 panels in all these figures (one for each tide gauge)? If most of the gauges show similar behaviour, it might be more effective to display only selected representative gauges and provide the rest in the supplementary material.
- LL343-345: “In RMSE, “double penalty” is further amplified compared to MAD, as the penalizations due to the peak mismatch are squared. This means that phase errors have a disproportionately large impact on RMSE.”
To provide a clearer picture, it would be beneficial to include the traditional MAD as an additional metric. Without this, it is challenging to determine whether the “double penalty” is the primary issue or if this amplification leads to the RMSE favouring the ERA5 configurations.
- LL371-372: “This is because BC-UniGe is more prone to phase error than BT-ERA5, and is thus doubly penalized in indicators such as RMSE, MAD and Pearson correlation.”
Is this speculation, or have you quantified it? It would be beneficial to understand the magnitude of phase differences between simulated and observed peaks across the different configurations of your model (one possible way to quantify this is sketched after these comments).
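For concreteness, here is a minimal sketch of how percentile- and distribution-conditioned MAD variants of the kind discussed above could be implemented. The exact definitions of MADp and MADc are those given in the manuscript (LL206-207); the function names and formulations below are illustrative assumptions, not the authors' definitions.

```python
import numpy as np

def mad(obs, sim):
    """Traditional Mean Absolute Deviation between time-paired samples."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return np.mean(np.abs(sim - obs))

def mad_quantile(obs, sim, n_q=100):
    """Illustrative distribution-based MAD: compare matched quantiles
    instead of time-paired values, so a pure phase shift of a peak is
    not penalised (no 'double penalty')."""
    q = np.linspace(0.0, 1.0, n_q)
    return np.mean(np.abs(np.quantile(sim, q) - np.quantile(obs, q)))

def mad_tail(obs, sim, p=99.0, n_q=50):
    """Illustrative tail-focused MAD: matched quantiles above a high
    percentile (e.g. the 99th) to target the extremes."""
    q = np.linspace(p / 100.0, 1.0, n_q)
    return np.mean(np.abs(np.quantile(sim, q) - np.quantile(obs, q)))
```

RMSEp- and RMSEc-style analogues would follow by squaring the quantile differences and taking the root of the mean, which would be one way to test whether the squaring itself, rather than the phase error, drives the behaviour discussed above.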
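Regarding the quantification of the phase errors raised in the last comment, one simple approach (an illustrative assumption, not necessarily the authors' method) is to estimate, per event, the lag that maximizes the cross-correlation between observed and simulated series:

```python
import numpy as np

def phase_lag(obs, sim, max_lag=12):
    """Lag (in time steps, e.g. hours) that maximizes the correlation
    between observation and simulation; a positive value means the
    simulated signal lags the observed one."""
    obs = np.asarray(obs, dtype=float) - np.mean(obs)
    sim = np.asarray(sim, dtype=float) - np.mean(sim)
    lags = np.arange(-max_lag, max_lag + 1)
    corrs = []
    for l in lags:
        if l >= 0:
            o, s = obs[:len(obs) - l], sim[l:]
        else:
            o, s = obs[-l:], sim[:len(sim) + l]
        corrs.append(np.corrcoef(o, s)[0, 1])
    return int(lags[int(np.argmax(corrs))])
```

Applied to windows around the peaks above the 99th percentile, this would yield the per-configuration distribution of phase differences that the comment asks about.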
Technical corrections
- LL15-18: “To illustrate this, we conducted 34-year high-resolution simulations for storm surge under barotropic (BT) and baroclinic (BC) configurations, using atmospheric data from ERA5 and a high-resolution downscaling of the Climate Forecast System Reanalysis (CFSR) developed by the University of Genoa (UniGe).”
I have noticed this throughout your paper. Since your simulations deal with multiple extreme events, it would often be more fitting to use “storm surges” instead of “storm surge”.
- LL88-90: “The model has been already implemented in operational (Federico et al., 2017) and relocatable (Trotta et al., 2016) forecasting framework, and for storm surge events (Park et al., 2022; Alessandri et al., 2023).”
The term “frameworks” should be used here instead of “framework”.
- LL104-105: “Finally, the conclusion show on Section 5 summarizes the key points of the study.”
The word “show” should be omitted here.
- LL112-114: “ERA5 is relatively high resolution and accurate for a global reanalysis, although it is known to be affected by negative biases at high percentiles, particularly when is compared with measured wind speed (Pineau-Guillou et al., 2018; Vannucchi et al., 2021; Benetazzo et al., 2022; Gumuscu et al., 2023).”
“[…], particularly when compared with measured wind speeds […].”
- Figure 1
The red dashed box in panel (a) is barely visible; please use a thicker line. Additionally, the font sizes in panels (a) and (b) differ, and some fonts in panel (a) are extremely small. Have you also verified whether the chosen colours in all your figures are readable for people with colour vision deficiencies? Please refer to the journal guidelines for further details.
- LL158-160: “The observational data were acquired from Italian National Institute for Environmental Protection and Research (ISPRA), the Civil Protection of the Friuli-Venezia Giulia Region, and Raicich (2023).”
“[…] the Italian National Institute […]”
- L169: “Both the model output and the observations were processed as follow to enable their intercomparability.”
“[…] as follows […]”
Citation: https://doi.org/10.5194/egusphere-2024-1415-RC1
RC2: 'Comment on egusphere-2024-1415', Anonymous Referee #2, 28 Jun 2024
Dear authors,
Congratulations on the manuscript. The manuscript compares different model assessment metrics (including two new ones proposed by the authors) using storm surge simulations forced with different boundary conditions. The authors find that, depending on the metric used, different models show the best performance, highlighting the sensitivity of such metrics and the need to use multiple ones when analyzing model performance. The manuscript is well-written and of high societal interest. The conclusions are supported by the results. I have some minor comments and suggestions for the authors.
Introduction:
It is important to explicitly state the objective of the paper in the introduction. We infer it from the title and the discussion, but it is important to write it down explicitly. Also, it is important to introduce the problem before the objective (e.g. why is there a need to assess the error indicators for numerical models, even though they are established?).
Line 88: Fix citation
Line 124: Fix citation
Line 131: Add the definition of nearshore used throughout the paper (it may differ across disciplines)
Lines 150-152: Consider including the formula here instead.
Line 205: Could you also add the standard MAD for comparison, so as to understand the effects of your changes relative to the metric they are based on? Also, since these are new metrics, they would benefit from some sort of “validation”, or at least a sensitivity analysis based on well-defined synthetic time series. Adding such a comparison between your new metrics and the standard ones on synthetic series (thus with specific, known errors for testing) would enhance this paper significantly, especially considering that the variability across the different metrics is small. A minimal example of such a synthetic test is sketched below.
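To illustrate what such a synthetic sensitivity test could look like (a minimal sketch under assumed, idealized error types, not the authors' analysis), consider one series with a pure phase error and one with a pure amplitude error:

```python
import numpy as np

# Synthetic 'observed' surge: a single Gaussian peak on a flat background.
t = np.arange(200)
obs = 0.8 * np.exp(-0.5 * ((t - 100) / 10.0) ** 2)

# Two synthetic 'models' with known, controlled errors:
shifted = np.roll(obs, 5)   # correct amplitude, 5-step phase error
damped = 0.75 * obs         # correct phase, 25 % amplitude underestimation

def rmse(o, s):
    return np.sqrt(np.mean((s - o) ** 2))

def mad_quantile(o, s, n_q=100):
    # Distribution-based MAD on matched quantiles (phase-insensitive).
    q = np.linspace(0.0, 1.0, n_q)
    return np.mean(np.abs(np.quantile(s, q) - np.quantile(o, q)))

for name, sim in [("phase-shifted", shifted), ("damped", damped)]:
    print(f"{name}: RMSE={rmse(obs, sim):.3f}, "
          f"r={np.corrcoef(obs, sim)[0, 1]:.3f}, "
          f"quantile-MAD={mad_quantile(obs, sim):.3f}")
```

The phase-shifted series has a perfect value distribution (quantile-MAD of zero) but degraded RMSE and correlation, while the damped series keeps a perfect correlation although its distribution metric degrades; controlled cases like these would let readers see exactly which error type each metric rewards or penalizes.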
Results: Consider adding a small section where you discuss the storm patterns in the dataset (e.g. histograms, frequency, and so forth). It does not need to be long, since the core of the paper is about testing the skill assessment methods. Nonetheless, it is important for the reader to conceptualize the sort of storm patterns seen here.
Radar charts: Add to the captions of the applicable figures throughout the manuscript that values towards the fringe indicate better performance.
Citation: https://doi.org/10.5194/egusphere-2024-1415-RC2
RC3: 'Comment on egusphere-2024-1415', Anonymous Referee #3, 30 Jun 2024
The present paper compares the performance of various storm surge models in a domain within the North Adriatic Sea. The authors aim to illustrate the differences between three distinct approaches and examine the impact of the chosen statistical metrics on evaluating the model results. The article ultimately suggests extending the metrics used in storm surge sea level analysis to include a corrected Mean Absolute Deviation (MAD) term, which avoids the potential double penalty effect caused by phase shifts in the peaks of model simulations compared to tide gauge observations.
I find this article to be well-written and insightful. Therefore, I recommend it for publication, subject to addressing the following concerns:
1. In the article, the authors discuss several modelling choices that could potentially impact the results. However, there is no explanation for the selection of SHYFEM for this modelling effort. I suggest that the authors elaborate on their choice, discussing the advantages and disadvantages of using SHYFEM.
2. The comparison between BT-ERA5, BC-ERA5, and BC-UniGe is influenced by the use of different atmospheric simulations. For the first two, the base reanalysis model is ERA5, whereas the comparison is later shifted to a downscaled CFSR product. Given the sensitivity of extreme results for these two models, this comparison seems somewhat unfair. While I understand the role of metrics and agree with the overall message of the paper, you are comparing different products. Although ERA5 and CFSR might be similar for average statistics, it is unclear how they compare for extremes, particularly in an enclosed basin like the Adriatic Sea. The paper should clearly state this, or if possible, show the performance of the native ERA5 and CFSR storm surge products in the region. Additionally, consider discussing what a WRF ERA5 downscaling would contribute to the analysis.
3. In comparing the tide gauge time series with the different simulations, there are noticeable differences between the modelled and observed time series. I find it puzzling that the impact of waves is not mentioned in the paper. What effect would waves have on these comparisons? I suggest the authors comment on whether waves play a crucial role in the storm surge levels in the Adriatic Sea.
Citation: https://doi.org/10.5194/egusphere-2024-1415-RC3