the Creative Commons Attribution 4.0 License.
Application of HIDRA2 Deep Learning Model for Sea Level Forecasting Along the Estonian Coast of the Baltic Sea
Abstract. Sea level predictions, typically derived from 3D hydrodynamic models, are computationally intensive and subject to uncertainties stemming from physical representation and inaccuracies in initial or boundary conditions. As a complementary alternative, data-driven machine learning models provide a computationally efficient solution with comparable accuracy. This study employs the deep learning model HIDRA2 to forecast hourly sea levels at five coastal stations along the Estonian coastline of the Baltic Sea, evaluating its performance across various forecast lead times. Compared to the regional NEMOBAL and subregional NEMOEST hydrodynamic models, HIDRA2 consistently delivers superior results, particularly across all sea level ranges and stations. While HIDRA2 struggles to capture high-frequency variability above (6 h)⁻¹, it effectively reproduces energy in lower-frequency bands below (18 h)⁻¹. Errors tend to average out over longer time windows encompassing multiple seiche periods, enabling HIDRA2 to surpass the overall performance of the NEMO models. These findings underscore HIDRA2’s potential as a robust, efficient, and reliable tool for operational sea level forecasting and coastal management in the Eastern Baltic Sea region.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-3691', Anonymous Referee #1, 20 Jan 2025
GENERAL COMMENTS
This study presents the main results obtained from the application of the HIDRA2 deep learning model along the Estonian coast for sea level prediction. It compares the model's performance with hydrodynamic models and provides an analysis of the performance of these models based on different components of sea level. Overall, the document is well-structured, including a clear description of the data and methods used, a detailed presentation of the main results, a discussion of significant limitations of the applied model, and a conclusion summarizing the key findings.
However, there are several important points that the authors should address, as outlined below:
SPECIFIC COMMENTS
Section 2.1: Could you please provide the time coverage of the data at each location?
Lines 97–98: “covering a longitudinal range from 16.25°E to 28.5°E and a latitudinal range from 54.25°N to 64°N”. Is there a specific reason for the chosen spatial extent of the fields? Was it determined through trial and error during the study or based on a reference?
Lines 99–100: “The training data for SSH at the coastal stations were obtained from the Estonian Environmental Agency”. Is this the same dataset mentioned in Section 2.1?
Line 101: “…the period 2010 to 2019”. Does this timeframe apply to all five locations? How much of the data was used for training versus testing the model? Did you analyze the data for trends? If so, were they removed, or is the model capable of accounting for them?
Line 107: What do you mean by “adapted”? Did the BALMFC apply modifications to the model? Or do you mean “implemented”?
Lines 155–162: The comments regarding 24-hour and 72-hour lead times should focus on the comparison between HIDRA2 and NEMOEST, as Table 1 provides results for only the 12-hour lead time for NEMOBAL. Am I interpreting this correctly? I suggest reorganizing this paragraph slightly to make the description of results more fluid for the reader.
Additionally, while reading the document, I expected a comparison between HIDRA2 and NEMOEST, as the latter is expected to perform better than NEMOBAL. This assumption arises because NEMOEST has a higher resolution and uses NEMOBAL outputs as boundary conditions, which could lead to improved performance. Did you anticipate this performance hierarchy for NEMOEST? It may be beneficial to provide a more explicit description of the relative performance of all three models in this section.

Lines 155–156: “The RMSDs for each station are calculated using sea level data from April 2023 to April 2024”. I suggest moving this statement to the Methods section or the section where validation metrics are introduced.
Lines 164–167: “The performances are also separately assessed during extreme negative and positive observed SSH events (Figure 4b & c). Extreme negative SSH values are defined as those falling below the 5th percentile of observed SSH during the study period at each station, while extreme positive values exceed the 95th percentile (Cannaby et al., 2016; Mentaschi et al., 2023)”. I recommend moving this explanation to the Methods section.
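For reference, the percentile-based definition quoted here can be sketched in a few lines (a minimal illustration with synthetic data; all variable names are hypothetical and not taken from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(0)
ssh = rng.normal(0.0, 0.3, size=8760)  # synthetic hourly SSH series (m), one year

# Thresholds from the quoted definition: 5th and 95th percentiles of observed SSH
p05, p95 = np.percentile(ssh, [5, 95])

neg_extreme = ssh < p05   # extreme negative SSH events
pos_extreme = ssh > p95   # extreme positive SSH events

# By construction, roughly 5 % of hours fall into each extreme class
print(neg_extreme.mean(), pos_extreme.mean())
```

Defining thresholds per station, as the manuscript does, simply means applying this computation separately to each station's observed series.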
Lines 177–178: “For negative extreme SSH events, correlation is lower with a lower RMSD, while for positive extreme SSH events, correlation is higher but with a higher RMSD”. These comments appear to refer to Figure 5. If this is correct, please explicitly indicate this in the text. Additionally, I suggest including this description before presenting the figure.
Lines 181–182: “For clarity, the performance of NEMOEST, which is the least accurate (as detailed in Section 3.1), is not included in the figure”. The only explicit statement supporting this claim is found in Lines 171–172: “while the subregional model NEMOEST performs better under non-extreme high SSH conditions but struggles to accurately predict extreme SSH values”. This could be clarified further.
Line 182: What do you mean by “stable behavior”? Are you referring to the absence of gaps or offsets in the models, or to the models' ability to consistently replicate observed time series?
Line 182: I have some doubts regarding the use of the term “timescales”. Are you referring to the temporal range used for this comparison?
Line 187: “A visual comparison of model predictions on Figure 7 between 5 and 9 October 2023”. This part is a bit confusing. The seiche representation is not clearly visible in Figure 7, so I initially assumed the reference was to Figure 8. However, based on subsequent context, it seems these comments about the seiche do refer to Figure 7. Could you please confirm this? If the seiche is depicted in Figure 7, I suggest clearly highlighting it in the figure to avoid confusion.
Lines 198–204: “In contrast, the HIDRA2 ensemble mean is overly smooth, rendering it less capable of reproducing sea-level variability within the seiche frequency band. However, it adheres more closely to the overall observations than NEMOBAL. HIDRA2’s limited accuracy during seiche excitations leads to deviations from instantaneous sea-level values, increasing the short-term errors. However, over periods of several days, HIDRA2 exhibits minimal bias. This indicates that while individual hourly predictions may show higher error, HIDRA2’s forecasts do not consistently over- or underestimate SSH over longer time spans, resulting in minimal systematic bias”.
From Table 1, it is evident that HIDRA2 achieves a better RMSD compared to NEMOBAL at Pärnu. Given that NEMOBAL appears to attempt capturing the seiche and the highest sea levels, did you consider whether the RMSD “double penalty” might have affected NEMOBAL's evaluation, leading to HIDRA2 achieving a better score?
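The “double penalty” mentioned here is easy to demonstrate with a toy example (synthetic signals only, not the study's data): a forecast that reproduces a seiche-like oscillation but with a phase error is penalized once for missing each observed peak and again for placing a peak where none is observed, so its RMSD can exceed that of a flat forecast that ignores the oscillation entirely.

```python
import numpy as np

t = np.arange(96)                                  # hours
obs = 0.5 * np.sin(2 * np.pi * t / 24)             # observed seiche-like oscillation

smooth = np.zeros_like(obs)                        # flat forecast: misses the seiche
shifted = 0.5 * np.sin(2 * np.pi * (t - 6) / 24)   # right amplitude, 6 h phase error

def rmsd(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# The phase-shifted forecast scores WORSE than the flat one,
# despite capturing the oscillation's amplitude and period
print(rmsd(obs, smooth), rmsd(obs, shifted))  # ≈ 0.354 vs 0.500
```

Here the flat forecast's RMSD equals the RMS of the signal itself (0.5/√2 ≈ 0.354), while the phase-shifted forecast reaches 0.5, which illustrates why RMSD alone can favour a smooth model over one that attempts to resolve the seiche.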
I suggest incorporating additional performance statistics beyond RMSD to provide a more robust evaluation of the models. While RMSD offers a general sense of performance, it may not fully account for specific limitations, such as the “double penalty” effect. The following references may prove useful for clarifying the results obtained in your study:
• https://doi.org/10.1016/j.ocemod.2013.08.003
• https://doi.org/10.5194/os-20-1513-2024

TECHNICAL CORRECTIONS
Figure 1: There is an error in the y-axis label of Figure 1. It should read "Latitude" instead of "Longitude."
Line 175: Replace "predeicted" with "predicted."
Figure 7: I suggest including a proper title for Figure 7, even if most of the title overlaps with Figure 6. This ensures the reader does not need to refer back to the previous figure to understand the context of Figure 7.
Line 187: The phrase "quite telling" appears somewhat informal for a scientific article. A more suitable synonym could be "highly indicative" or "clearly demonstrates."
Lines 189 to 192: The sentence, “We have high-frequency variability in NEMOBAL and a completely smooth HIDRA2 ensemble mean. One might argue that this smoothing stems from the fact that we are working with the ensemble mean in which otherwise present oscillations in different ensemble members cancel out,” could be improved by rephrasing to avoid informal language. Suggested revision:
"NEMOBAL exhibits high-frequency variability, whereas the HIDRA2 ensemble mean is completely smooth. This smoothing effect may result from averaging the ensemble members, where oscillations present in individual members cancel each other out."

Lines 196 to 197: The sentence, "A clear excitation of the seiche is visible in filtered observations in Figure 8 from 3 October on NEMOBAL captures this excitation," could be rephrased for clarity as: "Filtered observations in Figure 8 clearly show the excitation of the seiche starting on 3 October, which is also captured by NEMOBAL."
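The cancellation mechanism invoked in the smoothing explanation can be checked with a quick synthetic experiment (a hypothetical 50-member ensemble with random phase errors, not the study's actual ensemble):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(96)  # hours

# 50 hypothetical ensemble members: identical seiche-like oscillation,
# each with a random phase error
phases = rng.uniform(0, 2 * np.pi, size=(50, 1))
members = 0.5 * np.sin(2 * np.pi * t / 24 + phases)

ens_mean = members.mean(axis=0)

# Each member oscillates with peak-to-peak amplitude ~1.0, but the
# out-of-phase oscillations largely cancel in the ensemble mean
print(np.ptp(members[0]), np.ptp(ens_mean))
```

With random phases the residual amplitude of the mean scales roughly as 1/√N, consistent with the smooth ensemble mean seen in the figures.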
Figure 9: I recommend placing Figure 9 immediately after the text where it is mentioned for the first time. This improves readability and ensures the reader can quickly reference the figure without searching for it later in the document.
Citation: https://doi.org/10.5194/egusphere-2024-3691-RC1
RC2: 'Comment on egusphere-2024-3691', Anonymous Referee #2, 12 Mar 2025
This study presents the results of applying the HIDRA2 deep learning model to forecast sea level along the Estonian coast of the Baltic Sea. Similar to its application in the Adriatic Sea, for which HIDRA2 was originally developed, the data-driven model generally outperforms dynamical models (3D ocean models), except for extreme events. This is expected, as extreme events are inherently rare and, therefore, challenging to accurately represent using a data-driven approach.
While the manuscript is clear and well-written, it does not significantly advance the understanding of sea surface height (SSH) forecasting or machine learning (ML)-based methods for this purpose. However, it provides a valuable report on a state-of-the-art ML-based system capable of producing fast and computationally “cheap” SSH forecasts for selected locations within the Baltic Sea.
I strongly encourage the authors to reduce the length of certain sections, particularly in the discussion and conclusion (which often reads like a summary), where some paragraphs are repetitive or restate well-known concepts—such as the efficiency and computational cost-effectiveness of ML methods compared to 3D ocean models based on primitive equations. Instead, I suggest expanding on the ensemble approach, assessing its limitations, and exploring potential strategies to improve the representativeness of the ensemble spread.
Other comments:
Line 89: Brackets should be only around the year: “Details of the encoding architecture are presented in (Rus et al. 2023).”
Lines 98-99: “The original meteorological data, with a domain size of 40 × 50, were subsampled to a 9 × 12 grid.”
Subsampling appears to discard valuable information. Have the authors attempted to use the full resolution? Do you anticipate any improvements by retaining the original grid size?

Figure 2: It is misleading to depict the model architecture using the Adriatic Sea instead of the Baltic Sea. If this figure is sourced from another article, please provide the appropriate citation. If it was created specifically for this study, consider replacing the Adriatic Sea with the Baltic Sea to avoid confusion.
Figure 6: This figure is not particularly informative, as most curves overlap, except for Feb-Mar 2024 in Haapsalu. Consider moving it to the supplementary material, as the key point is already well illustrated in Figure 7.
Line 190: “One might argue that this smoothing stems from the fact that we are working with the ensemble mean”. Please include a figure like Figure 7, but for both the best and worst ensemble members (perhaps based on RMSE). This will help illustrate the smoothing effect of the ensemble mean more effectively.
Figure 9: Indicate the meaning of the grey lines in the caption.
Lines 230-250: The discussion in this section largely reiterates well-established points about HIDRA2 without adding new insights. Writing a full paragraph to restate that ML-based methods are significantly more computationally efficient than traditional dynamical models seems unnecessary, as this fact is widely recognized, even by non-specialists.
An additional advantage of deep learning models, such as HIDRA2, lies in their computational efficiency. In the present study, HIDRA2 demonstrates the ability to generate 50 ensemble predictions for each 72-hour forecast at each station in approximately 30 seconds using a current typical personal computer or laptop (with performance rates typically ranging within the teraflops scale), using only atmospheric input data (wind and mean sea level pressure), background sea level, and the trained network file. In contrast, running a high-resolution hydrodynamic model for even a single ensemble simulation requires significant computational resources, a large amount of input data, and considerable processing time, often necessitating dedicated high-performance computing facilities. Consequently, HIDRA2’s efficiency not only enhances forecast accessibility but also enables a more adaptable and resource-effective approach to sea level prediction in operational settings, providing valuable support for optimizing processes in coastal management.
Line 283: This observation demonstrates the model’s enhanced capability to capture more common extreme event types. A similar discussion should be included for predicting normal vs. extreme (less frequent) SSH conditions, where HIDRA2 generally outperforms dynamical models.
Line 286-292: This section reads more like a summary rather than a conclusion and should be revised accordingly.
Line 305: I do not find this section informative. Deep-learning ensemble models, such as HIDRA2, are pertinent for advancing the development of Digital Twins and associated impact models (Li et al., 2023). These models, utilizing ensemble-based techniques, are particularly effective in capturing the complex, non-linear relationships in SSH data across diverse scales. By integrating multiple predictive models, ensemble approaches enhance the accuracy and robustness of forecasts, making them valuable for the creation of Digital Twins of Earth systems. This, in turn, supports more precise impact assessments and decision-making processes in coastal management and risk mitigation.
Citation: https://doi.org/10.5194/egusphere-2024-3691-RC2