Using Random Forests to Predict Extreme Sea-Levels at the Baltic Coast at Weekly Timescales
Abstract. We have designed a machine-learning method to predict the occurrence of daily extreme sea levels at the Baltic Sea coast with lead times of a few days. The method is based on a Random Forest Classifier and uses spatially resolved fields of daily sea level pressure, surface wind, precipitation, and the prefilling state of the Baltic Sea as predictors of daily sea level above the 95 % quantile at each of seven tide-gauge stations representative of the Baltic coast.
The method is purely data-driven and is trained with sea-level data from the Global Extreme Sea Level Analysis (GESLA) data set and with the meteorological reanalysis ERA5 of the European Centre for Medium-Range Weather Forecasts. Sea-level extremes at lead times of up to 3 days are satisfactorily predicted by the method, and the relevant predictor regions are identified. The sensitivity, measured as the proportion of correctly predicted extremes, is of the order of 70 %, depending on the station.
The proportion of false warnings, related to the specificity of the predictions, is typically as low as 10 to 20 %. For lead times longer than 3 days, the predictive skill degrades; at 7 days, it is comparable to random skill. These values are generally higher than those derived from storm-surge reanalyses of dynamical models.
The importance of each predictor depends on the location of the tide gauge. Usually, the most relevant predictors are sea level pressure, surface wind and prefilling. Extreme sea levels in the Northern Baltic are better predicted by surface pressure and the meridional surface wind component. By contrast, for stations located in the south, the most relevant predictors are surface pressure and the zonal wind component. Precipitation was not a relevant predictor for any of the stations analysed.
The Random Forest classifier does not need to be particularly complex, and the computing time to issue predictions is typically a few minutes on a personal laptop. The method can therefore be used as a pre-warning system, triggering the application of more sophisticated algorithms to estimate the height of the ensuing extreme sea level, or as a warning to run larger ensembles with physically based numerical models.
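As an illustration of the kind of workflow described above, the following minimal sketch (our illustration with synthetic placeholder data and hypothetical variable names, not the authors' published code) shows how a daily 95th-percentile exceedance label could be built and a Random Forest classifier fitted on flattened predictor fields with a lead time of a few days, using scikit-learn:

```python
# Minimal, illustrative sketch (not the authors' code): build a binary
# predictand of daily sea-level extremes and fit a Random Forest classifier
# on flattened meteorological predictor fields. All names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real inputs:
#   sea_level - daily sea level at one tide gauge, shape (n_days,)
#   slp, u10  - daily predictor fields, here shape (n_days, n_lat, n_lon)
n_days, n_lat, n_lon = 1000, 10, 20
sea_level = rng.normal(size=n_days)
slp = rng.normal(size=(n_days, n_lat, n_lon))
u10 = rng.normal(size=(n_days, n_lat, n_lon))

# Binary labels: 1 where daily sea level exceeds its 95th percentile.
threshold = np.quantile(sea_level, 0.95)
y = (sea_level > threshold).astype(int)

# Flatten the 2-D fields so that each grid cell becomes one feature.
X = np.concatenate([slp.reshape(n_days, -1), u10.reshape(n_days, -1)], axis=1)

# Shift predictors by the desired lead time (e.g. 3 days ahead of the event).
lead = 3
X_lead, y_lead = X[:-lead], y[lead:]

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=42)
clf.fit(X_lead, y_lead)
print("training accuracy:", clf.score(X_lead, y_lead))
```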
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-2222', Anonymous Referee #1, 17 Sep 2024
I would like to extend my congratulations to Bellinghausen et al. for their meticulously researched, comprehensive, and well-crafted manuscript. Their work is both concise and impactful. By leveraging machine learning to improve the early prediction of storm surges, their research holds significant potential for benefiting society and the risk community. If the model's performance indeed surpasses that of currently feasible hydrodynamic and empirical models, as suggested in the manuscript, it could serve as an excellent and computationally cheap tool for early prediction of storm surges in the Baltic Sea, after which more expensive models can be implemented. Additionally, this manuscript serves as a motivation for further exploring machine learning models in the field of disaster risk management.
I only have a few suggestions for further improvement:
1. The manuscript can be bolstered by including more comparisons with other hydrodynamic and empirical models, especially if any of them are currently used for early prediction in the Baltic Sea.
2. Fig 1 - As a minor suggestion, changing the white color of continental mass to gray shade would help with readability. Since white is often used to represent water on similar maps, it took me some time to understand it.
3. Line 186 - The acronym PF (prefilling, I assume) is used for the first time without defining it.
4. Line 192 - Could you please clarify whether the linear de-trending and 95th%ile selection as positive labels were done prior to or after splitting the data into M_T and M_V? If these were done prior to the split, data leakage would have occurred in training by having included information from the validation set. This could raise questions about the validity of the model performance. This would be less of a concern if it can be demonstrated that the linear trend of the mean is parallel to the x-axis (i.e., not varying by time), and the frequency of occurrence of 95th%ile events remains consistent over the two datasets.
5. Line 194 - DJF is defined on page 510 while first used here. Suggest moving the definition to first usage.
6. Line 232 - u10 -> U10
7. Line 239 - interpretable *models*
8. Line 253 - It would be helpful here to include whether the 75-25 split was done temporally similar to M_T and M_V, and the reasoning behind that decision.
9. Line 327- The statement is unclear about why TTPR is more sensible than VTPR.
10. Section 5.2 - Suggest adding the time lag range used in this experiment, in addition to in Table A5.
11. Figures 13, 15, 16 - Similar to comment 9. Using VTPR and TTPR selectively for different stations gives the impression that only the best TPR is shown. To remain consistent with the text, I would suggest only providing VTPR.
12. Table A2 - The predictors are represented in uppercase in text, but lowercase in the table. Suggest changing all to same case.
Citation: https://doi.org/10.5194/egusphere-2024-2222-RC1
AC1: 'Reply on RC1', Kai Bellinghausen, 24 Oct 2024
We appreciate and thank the reviewer for the useful comments!
Below, the reviewer comments are repeated; our replies follow each comment.

1. The manuscript can be bolstered by including more comparisons with other hydrodynamic and empirical models, especially if any of them are currently used for early prediction in the Baltic Sea.

We agree and have reached out to the BSH for more data to compare our results with. Unfortunately, we cannot yet guarantee that we will get access to these data.

2. Fig 1 - As a minor suggestion, changing the white color of continental mass to gray shade would help with readability. Since white is often used to represent water on similar maps, it took me some time to understand it.

Point partially taken. We took the image from the reference indicated, hence changing the color is not possible. We added a note to the description to clarify the point of view.

3. Line 186 - The acronym PF (prefilling, I assume) is used for the first time without defining it.

Point taken.

4. Line 192 - Could you please clarify whether the linear de-trending and 95th%ile selection as positive labels were done prior to or after splitting the data into M_T and M_V? If these were done prior to the split, data leakage would have occurred in training by having included information from the validation set. This could raise questions about the validity of the model performance. This would be less of a concern if it can be demonstrated that the linear trend of the mean is parallel to the x-axis (i.e., not varying by time), and the frequency of occurrence of 95th%ile events remains consistent over the two datasets.

Indeed, the linear detrending was done before splitting the dataset into calibration (M_T) and validation (M_V) sets. In this case it is not an issue and was done intentionally, for two reasons:

(1) We subtract the long-term trend from the whole dataset beforehand in order to remove the anthropogenic effect and the trend of sea-level rise, since the model is used to predict short-term variability. We are currently preparing a revised version with a clearer split of the data, which follows these steps:
- Linear detrending of the whole time period (2005-2018) for each station
- Selecting months (9, 10, 11, 12, 1, 2)
- Split the dataset into training (2005 to 2013), testing (2014 to 2016) and validation (2017 to 2018).
- Compute the 95th percentile of the training set based on hourly data and classify all datasets based on that percentile threshold to build an hourly storm-surge index.
- Reduce the hourly storm-surge index to a daily index by counting a day as a storm-surge day if at least one hour is indicated as a storm surge.
(2) We tested the sea-level trends of all stations using an augmented Dickey-Fuller test, which showed stationarity of the time series for all stations investigated.

5. Line 194 - DJF is defined on page 510 while first used here. Suggest moving the definition to first usage.

Point taken.

6. Line 232 - u10 -> U10

Point taken.

7. Line 239 - interpretable *models*

Point taken.

8. Line 253 - It would be helpful here to include whether the 75-25 split was done temporally similar to M_T and M_V, and the reasoning behind that decision.

The reasoning follows in the next sentence. Only the M_T set is split into two subsets (M_T1, M_T2), which are used for the optimization of the RF's hyperparameters using a grid search. The grid-search algorithm optimizes the parameters by training on one set (M_T1) and evaluating on a test set (M_T2). Hence, in the end, we have three sets in total: one for training (M_T1), one for testing (M_T2) and one for validation (M_V). The validation set M_V is not involved in the model-fitting process; it is only used for the evaluation at the end to check whether the generalization worked. We clarified this in a few more sentences. (A minimal illustration of such a split-based grid search is sketched after this comment.)

9. Line 327 - The statement is unclear about why TTPR is more sensible than VTPR.

The reasoning is that for the ERA5 predictor data we used a daily time step, whereas for PF we used hourly data (if it was used as a sole predictor). Hence, when splitting the data into train-test subsets, the PF dataset based on hourly data has a sufficient test-set size, making it more reliable to look at the TTPR. We agree, though, that looking at the TTPR alone is not sufficient, and we will adjust this by also looking at the VTPR.

10. Section 5.2 - Suggest adding the time lag range used in this experiment, in addition to in Table A5.

Point taken.

11. Figures 13, 15, 16 - Similar to comment 9. Using VTPR and TTPR selectively for different stations gives the impression that only the best TPR is shown. To remain consistent with the text, I would suggest only providing VTPR.

We will only show VTPRs in the revised version.

12. Table A2 - The predictors are represented in uppercase in text, but lowercase in the table. Suggest changing all to same case.

Point taken.

Citation: https://doi.org/10.5194/egusphere-2024-2222-AC1
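As referenced in the reply to comment 8 above, a grid search that trains on one fixed subset (M_T1) and scores on another (M_T2) could look as follows. This is an illustrative sketch under our own assumptions (synthetic data, placeholder parameter ranges, scikit-learn's PredefinedSplit), not the authors' published code:

```python
# Illustrative sketch: tune RF hyperparameters by training on M_T1 and
# evaluating on a fixed held-out subset M_T2, as described in the reply.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 50))             # stand-in for the calibration set M_T
y = (rng.random(800) < 0.05).astype(int)   # roughly 5 % positive labels (extremes)

# 75-25 split of M_T into M_T1 (train) and M_T2 (test); -1 marks training rows.
n_train = int(0.75 * len(y))
test_fold = np.full(len(y), -1)
test_fold[n_train:] = 0
split = PredefinedSplit(test_fold)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=split, scoring="recall")
search.fit(X, y)
print(search.best_params_)   # hyperparameters selected on M_T2 only
```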
RC2: 'Comment on egusphere-2024-2222', Anonymous Referee #2, 18 Sep 2024
This paper presents a machine-learning approach to predict the occurrence of extreme sea levels (storm surges) at the Baltic Sea coast using a Random Forest Classifier. The approach uses atmospheric variables (sea level pressure, wind speed components and precipitation rates) and the pre-filling of the Baltic Sea as predictors. The predictand is the occurrence of storm surges, defined as water levels exceeding the 95th percentile, with observations taken from the GESLA database. Results show that the model can predict the occurrence of storm surges with some degree of accuracy, achieving true positive rates in the validation dataset of approximately 70-75% at most stations for lead times of up to 3 days. The authors thoroughly analyze the results, exploring feature importance and generating Predictor Maps to connect the physical processes driving storm surges with the performance of the model.
The paper is well-written, and the authors provide a comprehensive explanation of their model development, which is commendable for reproducibility, and they provide a thorough analysis of the results.
My main concerns with this study are:
- The authors chose to implement a Random Forest Classifier to predict whether or not a storm surge occurs (i.e., binary classification). While this is useful, it is not as practical as predicting the actual storm surge levels, which would provide much more actionable information for decision-makers. While I am not suggesting a rework of the analysis, I wonder if the authors considered using a Random Forest Regressor to predict the actual storm surge values instead of just the occurrence. This could offer a more valuable forecast for real-world applications.
- My second concern is related to the accuracy of their model. The authors compare their model’s performance to the global ocean reanalysis by Muis et al. (2016) and show that their machine learning model outperforms it. While this comparison is very promising, it is important to note that the Muis reanalysis was carried out more than 8 years ago and was not validated specifically for the Baltic Sea. More critically, the VTPR reported by the authors are around 75% at best. Does this mean that 1 in 4 storm surges could go unpredicted? And the VTPR is even lower for other stations. This raises the question of whether this level of accuracy is sufficient for practical purposes. Is there any comparable work in storm surge prediction that achieves a similar or better performance? In other words, is a VTPR of 75% a good benchmark?
Other minor considerations:
- The authors may want to consider incorporating sea-level pressure gradients as another predictor for storm surges (see for example, https://doi.org/10.1016/j.apor.2023.103496).
- The authors use VTPR and TTPR depending on which one is better for a given station. I suggest using only VTPR for consistency.
- The title, specifically "at weekly timescales" can be misleading. The model is shown to be accurate only for lead times of up to 3 days, with accuracy dropping significantly for longer lead times.
Citation: https://doi.org/10.5194/egusphere-2024-2222-RC2
AC2: 'Reply on RC2', Kai Bellinghausen, 24 Oct 2024
We appreciate and thank the reviewer for the useful comments!
Below, the reviewer comments are repeated; our replies follow each comment.

The authors chose to implement a Random Forest Classifier to predict whether or not a storm surge occurs (i.e., binary classification). While this is useful, it is not as practical as predicting the actual storm surge levels, which would provide much more actionable information for decision-makers. While I am not suggesting a rework of the analysis, I wonder if the authors considered using a Random Forest Regressor to predict the actual storm surge values instead of just the occurrence. This could offer a more valuable forecast for real-world applications.

We share this concern and indeed pointed to the possibility of an extension via a Random Forest Regressor in the discussion.

My second concern is related to the accuracy of their model. The authors compare their model’s performance to the global ocean reanalysis by Muis et al. (2016) and show that their machine learning model outperforms it. While this comparison is very promising, it is important to note that the Muis reanalysis was carried out more than 8 years ago and was not validated specifically for the Baltic Sea. More critically, the VTPR reported by the authors are around 75% at best. Does this mean that 1 in 4 storm surges could go unpredicted? And the VTPR is even lower for other stations. This raises the question of whether this level of accuracy is sufficient for practical purposes. Is there any comparable work in storm surge prediction that achieves a similar or better performance? In other words, is a VTPR of 75% a good benchmark?

A VTPR (sometimes termed probability of detection, hit rate or recall) of 75% indicates that the model correctly predicts 75% of the storm-surge occurrences. In other words, given that a storm surge occurs, the probability that the model issues a warning for it is 75%. This indeed means that there is a 25% chance that the model does not issue a storm-surge warning although a storm surge actually occurs (the false negative rate); it holds that TPR + FNR = 1 (a short numerical example is given after this comment). We compared the VTPRs for each station to the reanalysis product presented in Muis et al. (2016), which are indeed below 75%. We also reached out to the BSH for another dataset to compare our results to, but cannot yet guarantee access to these data.

The authors may want to consider incorporating sea-level pressure gradients as another predictor for storm surges (see for example, https://doi.org/10.1016/j.apor.2023.103496).

This is indeed an interesting approach and a valid predictor. We already suggested it in the discussion/outlook, and we also want to point out that the ML model should, due to its non-linear nature, be able to learn those pressure gradients indirectly when the surface pressure is used as a predictor. Nevertheless, it could be interesting to test the usability of pressure gradients separately.

The authors use VTPR and TTPR depending on which one is better for a given station. I suggest using only VTPR for consistency.

We will adjust that and are currently rerunning the model outputs to state only VTPRs.

The title, specifically "at weekly timescales", can be misleading. The model is shown to be accurate only for lead times of up to 3 days, with accuracy dropping significantly for longer lead times.

We adjusted the title accordingly.

Citation: https://doi.org/10.5194/egusphere-2024-2222-AC2
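As referenced in the reply above, a short numerical example (illustrative only, with made-up labels, not data from the paper) of how the TPR and FNR are obtained from a confusion matrix and sum to one:

```python
# TPR = TP / (TP + FN), FNR = FN / (TP + FN), hence TPR + FNR = 1.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # 4 observed surges
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])  # 3 hits, 1 miss, 1 false alarm

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)        # 0.75: 3 of 4 surges detected
fnr = fn / (tp + fn)        # 0.25: 1 of 4 surges missed
print(tpr, fnr, tpr + fnr)  # 0.75 0.25 1.0
```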
RC3: 'Comment on egusphere-2024-2222', Anonymous Referee #3, 18 Sep 2024
Review comment on “Using Random Forests to Predict Extreme Sea-Levels at the Baltic Coast at Weekly Timescales” by Bellinghausen et al.
Summary
The manuscript explores a machine learning method to predict daily extreme sea levels (>95th percentile) along the Baltic coast using a Random Forest Classifier and seven tide gauge stations. Predictor variables, chosen for their physical relevance, are sea level pressure, surface wind, precipitation and prefilling. The proposed method has the advantage of simplicity, practicality and low computational requirements, which are attractive features for implementation in early warning systems. Interestingly, the proposed method captures the regional variability in the predictors of extreme sea level, but it requires further discussion and comparison with similar predictive methods in the same region or under similar conditions. More clarity on the methodology is needed for the reproducibility of the model.
Major comments
I would suggest revising the title as it lacks accuracy in two ways. First, the method proposed can predict extreme sea-level occurrences but not the magnitude. Second, as stated by the authors, “for lead times longer than 3 days, the predictive skill degrades” so I wouldn’t go as far as stating that the prediction is at weekly timescales.
The sentence in line 12 states that the predictive skill of the proposed method is higher than that derived from storm-surge reanalysis of dynamical models. This statement needs to be proved in the discussion providing comparisons with previous studies and direct references.
How would the presented method complement or improve the regional models in operation in the Baltic Sea (e.g. the BSHcmod mentioned in the Introduction)? Could the authors compare the percentage of true or false warnings or the computational time required?
I understand the complexity of comparing the results with operational forecast systems in the Baltic sea due to the lack of available forecasts at lead time and therefore appreciate the comparison with extreme storm surge data from GTSR. However, I do not understand the choice of comparing in Table 1 the GTSR results with the best RF TPR as well as taking the testset or validation set depending on the station. A fair comparison between performances should show the same predictors and dataset for the RF.
As stated in line 2022 the authors decided to only test combinations of predictors that could theoretically explain the storm surge. However, given the moderate sensitivity obtained (70%), could the authors justify the missing processes/variables that would have been relevant to include or make any assumptions on how would the result change by letting the model derive the most important features?
A major concern is the sometimes arbitrary use of the results based on the test (Mt) or validation (Mv) dataset (see Line 324-325 “We selected promising results based on a combination of the TPR of the test dataset from Mt and validation datasets Mv”). This practice can inflate the perceived success rather than providing a fair assessment of its performance. Can the authors specify the criteria used to select the results or choose a single dataset for all the results to allow better comparison and have the other as a supplementary information?
Minor points
Line 2: Use “machine-learning” or “machine learning” consistently along the manuscript
Line 7: European Centre for Medium-Range Weather Forecasts
Line 13: Use “tide-gauge” or “tide gauge” consistently along the manuscript
Line 56: Remove “Also” for better English phrasing
Line 58 to 72: This is a descriptive synthesis of previous studies (Tadesse et al., 2020; Bruneau et al., 2020; Tiggleloven et al., 2021) that breaks the flow of the introduction for the reader switching from past and present verbal tenses. I suggest rewriting for better clarity and coherence with the scope of the manuscript.
Line 65: I suggest adding “e.g.,” before “Bruneau et al., (2020)” given the wider literature available (see “A Review of Application of Machine Learning in Storm Surge Problems” by Qin et al., 2023).
Line 61: Change across manuscript “sea level” with “sea-level” for coherence
Line 80: The acronym GESLA has already been expressed in the abstract and used in line 61
Line 83 and Line 94 unnecessarily repeat the formula “for the reader unfamiliar with…”
Line 165: the phrase "should be representative" implies an assumption rather than a statement backed by evidence or verification. What was the rationale behind the selection of the stations?
Figure 2: Along the text the stations are referred with their name (Line 344), with their number (Line 23) or with number and name in brackets (Line 426). I suggest a simpler notation, e.g. always referring to them with their name given the fact that they’re all different. Otherwise, I suggest keeping the numbering next to the crosses in the following figures (e.g., Figure 18) so that the reader knows which station is which without having to scroll back up to check Figure 2.
Line 182: This might be a personal preference, but I would remove Table A2 given the open-accessibility of ERA5 documentation with a full description of the parameters available to download.
Line 186: Specify that PF stands for prefilling
Line 193: correct time-series to timeseries for coherence with the rest of the manuscript
Line 214: Why if the predictand timeseries starts in 2005 does the Mt period start in 2009?
Line 228 and Figure 5: DT acronym already specified above
Line 232: It would be relevant and interesting for reproducibility to specify what questions and thresholds were applied for the predictors and how were they decided
Line 253: Were the authors using a random split or a chronological split that accounts for the fact that the sea-level predictions are time dependent?
Line 257: Remove brackets for Breiman 2001
Line 298: (Müller 2017)
Figure 7. The grey squares pointed out by the text box “Top 1% feature importance (grey squares)” could be referred to in the figure caption for better visibility of the plots. There is also a “b)” above the text box that I believe doesn’t belong to the figure. Also, replace “difference” with “different” in the caption.
Line 347: Not specified that AoI stands for area of interest
Line 370: The Baltic Sea is sometimes referred to as BS, sometimes as Baltic Sea. Choose one and maintain along the manuscript.
Figure 11: “The percentage indicates the corresponding VTPR or TTPR” and Figure 13: “Depending on the station VTPR or TTPR is shown”. Without a justification, selecting the best case scenario between validation or test true positive rates potentially obscures the model’s performance across different datasets, giving the impression that the model performs better than it actually does. I suggest showing the results from both periods also for better evaluating the model’s sensitivity to different time periods.
Fig. 12 and Fig. 14 are not mentioned in the text.
Line 505: Replace “We will” with a present or past tense.
Line 511: There’s already many acronyms and this is only used once, I suggest specifying September, October, November.
Line 514-517: If the authors intend to validate their machine learning model’s performance why are they not taking the GTSR as the baseline?
Table 1: Why is the testset used instead of the validation set for stations 3 and 4?
Line 560: timeseries
Line 569: remove brackets in “by Tyralis et al., 2019”.
Line 616: The reference paper of ERA5 by Hersbach et al., 2018 should be fully cited in the references.
Tables A4 to A9 can be reduced into a single Table as the legend to interpret them would be the same and there’s no need to have the experiments separated.
Citation: https://doi.org/10.5194/egusphere-2024-2222-RC3
AC3: 'Reply on RC3', Kai Bellinghausen, 24 Oct 2024
We appreciate and thank the reviewer for the useful comments and thorough proof-reading!
Below, the reviewer comments are repeated; our replies follow each comment.

I would suggest revising the title as it lacks accuracy in two ways. First, the method proposed can predict extreme sea-level occurrences but not the magnitude. Second, as stated by the authors, “for lead times longer than 3 days, the predictive skill degrades” so I wouldn’t go as far as stating that the prediction is at weekly timescales.

Point taken. We adjusted the title.

The sentence in line 12 states that the predictive skill of the proposed method is higher than that derived from storm-surge reanalysis of dynamical models. This statement needs to be proved in the discussion providing comparisons with previous studies and direct references.

Point taken. Currently we only compared it to the reanalysis product of Muis et al. and adjusted the sentence accordingly.

How would the presented method complement or improve the regional models in operation in the Baltic Sea (e.g. the BSHcmod mentioned in the Introduction)? Could the authors compare the percentage of true or false warnings or the computational time required?

The ML model can be used as a surrogate model that triggers more comprehensive and detailed computations based on regional models. Currently we do not have detailed information on the computing time of the BSHcmod, but we know it is based on solving numerical equations on a grid, which is usually more computationally expensive than data-driven models such as ML. We will reach out to the BSH for more detailed information and try to compute their TPR and FNR for further comparisons if we get access to their data.

I understand the complexity of comparing the results with operational forecast systems in the Baltic sea due to the lack of available forecasts at lead time and therefore appreciate the comparison with extreme storm surge data from GTSR. However, I do not understand the choice of comparing in Table 1 the GTSR results with the best RF TPR as well as taking the testset or validation set depending on the station. A fair comparison between performances should show the same predictors and dataset for the RF.

We agree that, for a fair comparison, we should consistently use either the VTPR or the TTPR, and we will adjust this. If we use the ML model in operation, we can feed in multiple combinations of predictors due to the quick model run time. Hence we picked the best TPR independent of the predictor combination.

As stated in line 2022 the authors decided to only test combinations of predictors that could theoretically explain the storm surge. However, given the moderate sensitivity obtained (70%), could the authors justify the missing processes/variables that would have been relevant to include or make any assumptions on how would the result change by letting the model derive the most important features?

We only considered local predictors and processes in this study. Indeed, one could extend the predictor spatial map to also resolve global teleconnections based on climate modes such as ENSO or NAO. This would increase the complexity of the model, but our aim was to develop a simple approach that is understandable and on which more complex models can be built. We agree that one could also pass an extensive set of predictors to the model and let the model derive the most important combination on its own. This would indeed be the initial approach if there were no prior theoretical understanding of the local processes of storm surges in the Baltic Sea. As there is already a sound understanding of those processes, we decided to focus on those particular predictors.

A major concern is the sometimes arbitrary use of the results based on the test (Mt) or validation (Mv) dataset (see Line 324-325 “We selected promising results based on a combination of the TPR of the test dataset from Mt and validation datasets Mv”). This practice can inflate the perceived success rather than providing a fair assessment of its performance. Can the authors specify the criteria used to select the results or choose a single dataset for all the results to allow better comparison and have the other as a supplementary information?

Point taken. We will adjust this comparison and only provide VTPRs.

Minor points

Line 2: Use “machine-learning” or “machine learning” consistently along the manuscript

Point taken.

Line 7: European Centre for Medium-Range Weather Forecasts

Point taken.

Line 13: Use “tide-gauge” or “tide gauge” consistently along the manuscript

Point taken.

Line 56: Remove “Also” for better English phrasing

This is a subjective style choice. We will leave it because it sounds more conversational and not too literary.

Line 58 to 72: This is a descriptive synthesis of previous studies (Tadesse et al., 2020; Bruneau et al., 2020; Tiggleloven et al., 2021) that breaks the flow of the introduction for the reader switching from past and present verbal tenses. I suggest rewriting for better clarity and coherence with the scope of the manuscript.

We disagree with that. The introduction introduces ML as a more complex statistical realm of data-driven models, and the following paragraphs synthesize ML-related work that has already been done. From our point of view this is a coherent storyline.

Line 65: I suggest adding “e.g.,” before “Bruneau et al., (2020)” given the wider literature available (see “A Review of Application of Machine Learning in Storm Surge Problems” by Qin et al., 2023).

Point taken.

Line 61: Change across manuscript “sea level” with “sea-level” for coherence

Point taken.

Line 80: The acronym GESLA has already been expressed in the abstract and used in line 61

Point taken.

Line 83 and Line 94 unnecessarily repeat the formula “for the reader unfamiliar with…”

Point taken. We removed it in line 94.

Line 165: the phrase "should be representative" implies an assumption rather than a statement backed by evidence or verification. What was the rationale behind the selection of the stations?

Point taken. We changed it to "This set of stations was chosen to represent all of the coastal orientations and bays of the Baltic Sea."

Figure 2: Along the text the stations are referred with their name (Line 344), with their number (Line 23) or with number and name in brackets (Line 426). I suggest a simpler notation, e.g. always referring to them with their name given the fact that they’re all different. Otherwise, I suggest keeping the numbering next to the crosses in the following figures (e.g., Figure 18) so that the reader knows which station is which without having to scroll back up to check Figure 2.

Point taken.

Line 182: This might be a personal preference, but I would remove Table A2 given the open-accessibility of ERA5 documentation with a full description of the parameters available to download.

We will keep it for readers who print the manuscript.

Line 186: Specify that PF stands for prefilling

Point taken.

Line 193: correct time-series to timeseries for coherence with the rest of the manuscript

Point taken.

Line 214: Why if the predictand timeseries starts in 2005 does the Mt period start in 2009?

We chose this period to have 10 years of training data. We could have taken even more, but, as we discussed earlier, Bruneau et al. showed that 6 years are already sufficient.

Line 228 and Figure 5: DT acronym already specified above

Point taken.

Line 232: It would be relevant and interesting for reproducibility to specify what questions and thresholds were applied for the predictors and how were they decided.

Those "questions" or "rules" by which the DT splits the data are not set beforehand; they are learned and optimized by the algorithm. Hence, they do not need to be known for reproducibility. If the software is loaded from GitHub and the same random seed is used, the same DTs will be generated by the end user (a minimal illustration is sketched after this comment). We did not investigate how the data were split, as we were only interested in the end result and the feature importance.

Line 253: Were the authors using a random split or a chronological split that accounts for the fact that the sea-level predictions are time dependent?

We used a random split in this version but will introduce a chronological split of the dataset, where the training set ranges from 2005-2013, the test set from 2014-2016 and the validation set from 2017-2018.

Line 257: Remove brackets for Breiman 2001

Point taken.

Line 298: (Müller 2017)

Point taken.

Figure 7. The grey squares pointed out by the text box “Top 1% feature importance (grey squares)” could be referred to in the figure caption for better visibility of the plots. There is also a “b)” above the text box that I believe doesn’t belong to the figure. Also, replace “difference” with “different” in the caption.

We adjusted the figure caption but left the box because the squares are somewhat transparent. We deleted the "b)". We actually mean the difference of the mean maps of TPPs and FNPs here, so c = b - a.

Line 347: Not specified that AoI stands for area of interest

Point taken.

Line 370: The Baltic Sea is sometimes referred to as BS, sometimes as Baltic Sea. Choose one and maintain along the manuscript.

Point taken.

Figure 11: “The percentage indicates the corresponding VTPR or TTPR” and Figure 13: “Depending on the station VTPR or TTPR is shown”. Without a justification, selecting the best case scenario between validation or test true positive rates potentially obscures the model’s performance across different datasets, giving the impression that the model performs better than it actually does. I suggest showing the results from both periods also for better evaluating the model’s sensitivity to different time periods.

We will only show VTPRs in the new version.

Fig. 12 and Fig. 14 are not mentioned in the text.

We now mention Fig. 12. Fig. 14 was already mentioned in line 449.

Line 505: Replace “We will” with a present or past tense.

Point taken.

Line 511: There’s already many acronyms and this is only used once, I suggest specifying September, October, November.

Point taken.

Line 514-517: If the authors intend to validate their machine learning model’s performance why are they not taking the GTSR as the baseline?

We used the GESLA dataset as the ground truth because it is a more comprehensive dataset than the GTSR, as it contains the information of local gauging stations.

Table 1: Why is the testset used instead of the validation set for stations 3 and 4?

Because the validation set only has 1 year of data for those stations. Hence, only a small change in the number of correctly predicted storm surges drastically changes the corresponding rates. In the new version we adjusted the split of the dataset, as mentioned above, such that we will show VTPRs for all stations.

Line 560: timeseries

Point taken.

Line 569: remove brackets in “by Tyralis et al., 2019”.

Point taken.

Line 616: The reference paper of ERA5 by Hersbach et al., 2018 should be fully cited in the references.

Point taken.

Tables A4 to A9 can be reduced into a single Table as the legend to interpret them would be the same and there’s no need to have the experiments separated.

This is a matter of taste. We kept them like this because it is easier to maintain and more clearly separated. Additionally, in the web version the links to the tables can be used, which brings the reader directly to the table necessary for understanding.

Citation: https://doi.org/10.5194/egusphere-2024-2222-AC3
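As referenced in the reply to the Line 232 comment above, a minimal illustrative sketch (our own assumption with synthetic data and placeholder feature names, not the published code) of how a fixed random state makes the grown trees reproducible and how their learned split rules can be inspected afterwards:

```python
# With a fixed random_state, refitting on the same data grows the same
# forest; the learned split "questions" (feature/threshold pairs) of a
# single tree can then be printed for inspection.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 1.0).astype(int)   # synthetic labels

clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)
clf.fit(X, y)

# Print the learned rules of the first tree in the forest.
print(export_text(clf.estimators_[0],
                  feature_names=["SLP", "U10", "V10", "PF"]))
```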
AC4: 'Final author comment', Kai Bellinghausen, 29 Oct 2024
Final Response (AC)
First of all, we thank the reviewers for their detailed and important feedback on the manuscript, especially regarding the methodology.
The main concern of all reviewers was the intertwined usage of a TPR based on test data and a TPR based on validation data. In the new manuscript we will revise this by analyzing the TPR of the validation data only. Additionally, we will look into the precision and F1 scores of the model.
A second major concern was the methodological approach when preprocessing the data and applying a random train-test split. We agree with the reviewers on this point and will implement a split of the dataset that is continuous in time instead of a random split. We will revise the methodology into the following main steps for the predictand data (a minimal sketch follows this list):
- Linear detrending of the whole time period (2005-2018) for each station
- Selecting months (9, 10, 11, 12, 1, 2)
- Split the dataset into training (2005 to 2013), testing (2014 to 2016) and validation (2017 to 2018).
- Compute the 95th percentile of the training set based on hourly data and classify all datasets based on that percentile threshold to build an hourly storm-surge index.
- Reduce the hourly storm-surge index to a daily index by counting a day as a storm-surge day if at least one hour is indicated as a storm surge.

We also conduct statistical tests on the stationarity of the predictand timeseries before detrending it.
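A minimal sketch of these predictand-processing steps (illustrative only, with a synthetic hourly series and hypothetical variable names, not the authors' code):

```python
# Detrend, select storm-season months, split chronologically, threshold on
# the training-period 95th percentile, and reduce hourly to daily index.
import numpy as np
import pandas as pd

# Hypothetical hourly sea-level series for one station, 2005-2018.
idx = pd.date_range("2005-01-01", "2018-12-31 23:00", freq="h")
sl = pd.Series(np.random.default_rng(0).normal(size=len(idx)), index=idx)

# 1. Linear detrending over the whole period.
t = np.arange(len(sl))
slope, intercept = np.polyfit(t, sl.values, 1)
sl = sl - (slope * t + intercept)

# 2. Keep only the storm-season months (Sep-Feb).
sl = sl[sl.index.month.isin([9, 10, 11, 12, 1, 2])]

# 3. Chronological split: training period only is used for the threshold.
train = sl["2005":"2013"]

# 4. 95th percentile from the training period, applied everywhere to build
#    an hourly storm-surge index.
thr = train.quantile(0.95)
hourly_index = (sl > thr).astype(int)

# 5. Daily index: a day counts as a storm-surge day if at least one hour
#    exceeds the threshold.
daily_index = hourly_index.resample("D").max().dropna().astype(int)
print(daily_index.value_counts())
```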
A third concern was the benchmark and the questioned availability of operational storm-surge forecast data. We reached out to the BSH regarding this matter and will add a new benchmark to the revised manuscript if we get access to the data.
We have also already taken into account most of the minor comments, especially regarding the readability of the figures, and adjusted them accordingly.
There were suggestions of testing different model structures and predictors. We will not address those points and leave them for additional research, as this study's purpose was to test a simple version of an ML model for storm-surge predictions in the Baltic Sea. Further research can use the results presented here and refine the model architecture to improve accuracy and predictability.
We again thank all reviewers for their thoughtful contributions, which will lead to a version that is methodologically more sound than before.
Citation: https://doi.org/10.5194/egusphere-2024-2222-AC4