This work is distributed under the Creative Commons Attribution 4.0 License.
Improving Seasonal Arctic Sea Ice Predictions with the Combination of Machine Learning and Earth System Model
Abstract. While dynamical models are essential for seasonal Arctic sea ice prediction, they often exhibit significant errors that are challenging to correct. In this study, we integrate a multilayer perceptron (MLP) machine learning (ML) model into the Norwegian Climate Prediction Model (NorCPM) to improve seasonal sea ice predictions. We compare the online and offline error correction approaches. In the online approach, ML corrects errors in the model’s instantaneous state during the model simulation, while in the offline approach, ML post-processes and calibrates predictions after the model simulation. Our results show that the ML models effectively learn and correct model errors in both methods, leading to improved predictions of Arctic sea ice during the test period (2003–2021). Both methods yield the most significant improvements in the marginal ice zone, where error reductions in sea ice concentration exceed 20 %. These improvements vary seasonally, with the most substantial enhancements occurring in the Atlantic, Siberian, and Pacific regions from September to January. The offline error correction approach consistently outperforms the online error correction approach. Notably, in September, the online approach reduces the error of the pan-Arctic sea ice extent by 50 %, while the offline approach achieves a 75 % error reduction.
Status: open (until 09 Apr 2025)
RC1: 'Comment on egusphere-2024-4092', Anonymous Referee #1, 16 Mar 2025
The authors have improved the seasonal prediction skill of Arctic sea ice in the NorCPM model using a machine learning method. Specifically, they experimented with two approaches: (a) an online method, in which the model's instantaneous fields are corrected after each time step, and (b) an offline method, in which the model runs freely and the results are uniformly corrected afterward. Both methods are based on simple concepts but have proven effective.
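To illustrate the distinction the reviewer summarizes, here is a minimal runnable sketch of the two correction strategies; the toy dynamics and the function names (model_step, ml_correct, ml_calibrate) are hypothetical placeholders, not the authors' NorCPM implementation:

```python
import numpy as np

# Hypothetical stand-ins for the dynamical model and the ML corrections.
def model_step(state):
    return 0.95 * state + 0.1        # toy dynamics with a systematic drift

def ml_correct(state):
    return state - 0.1               # toy correction of the instantaneous state

def ml_calibrate(state):
    return state - 0.5               # toy post-hoc calibration of the output

def run_online(state, n_steps):
    """Online: ML corrects the instantaneous state inside the integration
    loop, so each correction feeds back into the subsequent dynamics."""
    for _ in range(n_steps):
        state = ml_correct(model_step(state))
    return state

def run_offline(state, n_steps):
    """Offline: the model runs freely; ML calibrates the finished
    forecast once, as post-processing."""
    for _ in range(n_steps):
        state = model_step(state)
    return ml_calibrate(state)

x0 = np.array([0.8])                 # e.g., SIC at one grid point
print(run_online(x0, 10), run_offline(x0, 10))
```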
I acknowledge that the approach of integrating machine learning with dynamical models is quite advanced and interesting; however, I have some concerns regarding the broader scientific implications of this study. Please see the following general comments.
General Comments
The discussion section is currently somewhat underdeveloped. I would suggest that the authors discuss their results from the following perspectives:
a) How do the findings of this study inform physical enhancements to predictive systems? For instance, which sea ice processes benefit from the online correction of sea ice concentration, and thus lead to improved prediction skill?
b) As a hybrid approach integrating dynamical models with machine learning, what advantages does this study offer compared to purely data-driven machine learning methods (e.g., Andersson et al., 2021; Ren et al., 2024; Kim et al., 2025)? For instance, does it demonstrate enhanced scalability, such as in applications to ice thickness prediction corrections?
c) The current training framework employs a relatively short rolling forecast window (10-month input → 1-month output). However, would a simple MLP's nonlinear approximation capacity remain effective when applied to the longer windows required for daily sub-seasonal predictions (e.g., 90-day input → 1-day output)? Alternatively, might this require more sophisticated deep learning methods? I recommend expanding the discussion to explicitly address the method's generalizability across varying temporal scales.
d) Comparing online and offline methods based on predictive skill metrics is necessary but may be insufficient. Could the analysis be extended to incorporate additional dimensions to better delineate their applicable scenarios? This expanded discussion would provide more actionable guidance for researchers applying machine learning to calibrate model forecasts in operational settings.
Specific Comments
The language needs to be polished, and the logical flow between sentences should be strengthened. The paper's subtitles do not clearly convey the intended meaning and would benefit from revision.
Lines 9-10: If possible, please use one or two sentences to further elaborate on why the offline error correction approach performs better than the online error correction approach.
Line 14: Is there a newer paper to cite? This one is not "recent" enough.
Line 19: I suppose the word "compared" could be removed.
Line 58: The abbreviation "SIC" appears for the first time without a definition.
Line 69: The title should be changed to "Data and Methods" as this section introduces the model and data parts first.
Line 94: "the summer sea ice extent" should be revised to "the summer SIE".
Lines 102-103: Maybe I missed something, but why not directly use observations as the "truth"? How much discrepancy is there between the reanalysis of NorCPM and observations?
Line 130 & Table 1: Why consider latitude only but not longitude? Can any explanation be provided? And how about the relative importance of these input features?
Lines 142-143: I am curious about the exact post-processing of physically inconsistent fields. Could you give a concise and clear description rather than simply citing a paper?
Line 205: "Pan-arctic". Please make the capitalization of this term consistent throughout the paper. "Fig. 2" or "Figure 2"? Please also make this consistent throughout the paper.
Lines 217-218: "We define the IIEE as the area where the prediction and the truth disagree on the ice concentration being above or below 15%:" It would be better to rephrase as the IIEE metric has been defined by Goessling et al. (2016). Are the authors themselves defining a new metric called IIEE?
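For reference, the IIEE of Goessling et al. (2016) is the area over which the forecast and the verifying truth disagree on the concentration being above or below 15 % (generic notation; c denotes SIC, superscripts f and o the forecast and the observed truth):

```latex
% O = area where ice cover is overestimated, U = area where it is underestimated
\mathrm{IIEE}
  = \int_{A} \left| \mathbf{1}\!\left(c^{f} > 0.15\right)
                  - \mathbf{1}\!\left(c^{o} > 0.15\right) \right| \mathrm{d}A
  = O + U
```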
Lines 211 & 220-221: Why are consistent subscripts not used in these two equations?
Line 223: What is the meaning of "squared errors"? Do you mean "RMSE"?
Line 253: "NorCPM overestimates the Arctic cloudiness, and its summer-season snowmelt is too slow." I am not clear which figure I can draw such a conclusion from.
Line 256: "Both the OnlineML and OfflineML hindcasts exhibit similar behaviors regardless of the seasonality." This sentence is somewhat confusing, please rephrase it.
Lines 276-277: Why choose to analyze/present the reanalysis initialized in July? Could you provide some explanation? The later analysis should also clarify whether the results shown in Figure 6 depend on the initialization month.
Lines 289-293 (Figure 6d & 6e): Why is the result of the Online ML hindcast in August worse than the Reference hindcast in the Alaskan and Canadian regions?
Lines 297-299: As the author mentioned, the different performance of these two approaches (OnlineML and OfflineML) comes from the way they are constructed. Therefore, these two methods should be intended for different purposes. Is this comparison appropriate? Maybe rephrasing it would be better.
Line 302: Does the error correction performance vary with the initialization month (as mentioned above)?
Figures
Figure 1: The colors in Figure 1 are somewhat confusing (especially the purple and pink, which may be difficult for readers to distinguish). I recommend using more distinguishable colors.
Figure 2: "Regional domain definitions for Central Arctic, Atlantic, Siberian, Alaskan, Canadian, and Regions based on sea area definitions in Kimmritz et al. (2019)." The "Regions" should be corrected to "regions". It seems a bit crude to combine the Bering Sea and the Sea of Okhotsk into a single "Pacific Region"; is there any literature to support this approach?
Figure 3: Please indicate in the figure caption that this is the result of reanalysis minus the model (as in Figure 4's caption).
References:
Andersson, T. R., Hosking, J. S., Pérez-Ortiz, M., Paige, B., Elliott, A., Russell, C., et al. (2021). Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nature Communications, 12(1), 5124. https://doi.org/10.1038/s41467-021-25257-4
Goessling, H. F., Tietsche, S., Day, J. J., Hawkins, E., & Jung, T. (2016). Predictability of the Arctic sea ice edge. Geophysical Research Letters, 43(4), 1642–1650. https://doi.org/10.1002/2015GL067232
Kim, Y. J., Kim, H., Han, D., Stroeve, J., & Im, J. (2025). Long-term prediction of Arctic sea ice concentrations using deep learning: Effects of surface temperature, radiation, and wind conditions. Remote Sensing of Environment, 318, 114568. https://doi.org/10.1016/j.rse.2024.114568
Ren, Y., Li, X., & Wang, Y. (2024). SICNetseason V1.0: A transformer-based deep learning model for seasonal Arctic sea ice prediction by integrating sea ice thickness data. https://doi.org/10.5194/gmd-2024-200
Citation: https://doi.org/10.5194/egusphere-2024-4092-RC1
RC2: 'Comment on egusphere-2024-4092', Anonymous Referee #2, 19 Mar 2025
Review on “Improving Seasonal Arctic Sea Ice Predictions with the Combination of Machine Learning and Earth System Model” by He et al.
The Arctic Ocean is warming at a faster rate than the rest of the planet. Consequently, both the extent and thickness of sea ice have significantly decreased over the past few decades. These changes pose significant challenges to the reliability of seasonal Arctic sea ice predictions. This manuscript aims to enhance the Norwegian Climate Prediction Model (NorCPM) for seasonal Arctic sea ice prediction by integrating machine learning techniques with the Earth system model. Online and offline modules were selected to perform error corrections. The ultimate goal is to improve NorCPM's forecast performance for the marginal ice zone.
I find this study to be both timely and relevant. It aligns well with the scope of the TC Journal. I am inclined to give a positive recommendation. However, I believe there are several issues that need to be addressed before the manuscript can possibly be considered for publication.
Major comments:
(1) L158: "MLP excels in function approximation, making it particularly..." Please explain why the MLP (multilayer perceptron) model was chosen over a convolutional neural network (CNN). In my opinion, a CNN is better suited than an MLP for learning relationships between spatial neighbors. However, using an MLP model that operates on each grid point independently could lead to abrupt spatial changes in the predicted values.
(2) L171: The attention mechanism is a widely used deep learning technique. I suggest that the authors provide a detailed explanation of the attention mechanism employed in this study. Was a temporal attention mechanism or a self-attention mechanism used in this context? Generally, the number of parameters in an attention module is significantly larger than that in an MLP model. Given the volume of data in this study, there is a risk of overfitting. I would like to see the parameter counts for both the MLP model and the attention mechanism presented separately.
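As context for the parameter-count request, rough generic counts, assuming a standard single-head self-attention block of width d (Q, K, V, and output projections with biases) versus a fully connected MLP with layer widths d_0, d_1, ...:

```latex
% single-head self-attention block of width d
N_{\mathrm{attn}} \approx 4\left(d^{2} + d\right)
% fully connected MLP, summing weights and biases over layers
N_{\mathrm{MLP}} = \sum_{l} \left( d_{l}\, d_{l+1} + d_{l+1} \right)
```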
(3) L184: Training a separate model for each month of the test period (2003 to 2022) is not a convincing design. A model trained for each month during the training period should be generalizable to the test period. In my opinion, there is no need to train a distinct model for each month of each year during the test period.
Other comments:
The manuscript suffers from some unclarities concerning its structure:
(4) L140: This section introduces the limitations of online error correction. However, its placement in the methods section feels abrupt and lacks strong contextual relevance. I suggest moving this part to the discussion section, where it can be framed as a prospect for future work.
(5) L165: "The MLP architecture consists of five layers" Please consider presenting this with a diagram for better clarity and algorithm flow.
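As a point of reference for item (5) above (and for the linear output layer queried in item (6) below), here is a minimal sketch of a five-layer per-grid-point MLP; the input size, hidden widths, and activations are assumptions for illustration, not the manuscript's configuration:

```python
import torch.nn as nn

# Five fully connected layers acting on one grid point's feature vector;
# the final layer is linear (no activation), as for a regression output.
mlp = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),   # assumed: 10 input predictors per grid point
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),               # linear output: the predicted SIC error
)
print(sum(p.numel() for p in mlp.parameters()))  # parameter count, cf. major comment (2)
```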
(6) L172: The specific use of the "linear activation function" should be clarified. Did the authors apply an activation mechanism, or was it unnecessary? These details are critical for understanding the implementation.
(7) L175: Please provide a detailed split of the datasets. It is essential to ensure that there is no overlap in time or data between the training set, validation set, and test set to prevent data leakage. Such overlap could lead to an unreliable evaluation of the model's performance on the test set.
(8) L230: It is necessary to evaluate the performance of the MLP model itself, for example by reporting training, validation, and test set accuracies, to demonstrate the model's generalization ability and make the subsequent evaluation of the specific correction effects more credible.
(9) Figures 3, 4, and 7: Please add the specific accuracy or error in each subpanel. Present true errors with comma "true errors"
(10) L261: "Compared with the OnlineML hindcasts, the OfflineML hindcasts have a larger error reduction, particularly in September"; L269: "demonstrates larger error reductions in IIEE than the online approach..."; L273: "The offline approach outperforms the online approach in reducing both RMSE for SIE and IIEE for ice edge, especially in months with higher prediction errors"
The manuscript contains numerous vague expressions. Please provide more specific and concrete details. For example, instead of stating that "the error has been reduced," specify by how much (e.g., "the error has been reduced by xx%"). Including precise values of accuracy is essential for a comprehensive evaluation.
Citation: https://doi.org/10.5194/egusphere-2024-4092-RC2
RC3: 'Comment on egusphere-2024-4092', Anonymous Referee #3, 21 Mar 2025
The manuscript evaluates the performance of machine-learning-based error correction models in a coupled Earth system model, NorCPM. The evaluation compares online and offline correction schemes. In this study, the machine learning model is trained and validated against the reanalysis generated from the same coupled Earth system model, which is used as the truth. The online scheme is trained to correct the instantaneous state, whereas the offline scheme corrects monthly biases. The manuscript shows improvements from both correction schemes, but the offline scheme, as a post-processing method, outperforms the online scheme. This work is interesting. However, I recommend that the manuscript be reconsidered for publication after revision.
Major comments:
1. In the comparison between the online and offline schemes, the online correction is applied on the 15th of each month, similar to the reanalysis system. However, the manuscript lacks a discussion of how the DA increment is applied in the reanalysis system. For example, does the reanalysis system use incremental analysis update or nudging to provide a continuous correction? The manuscript also lacks information on the updated ocean and ice state of the reanalysis system. Are they the same as in the online correction scheme? For example, SSS is not mentioned in Sect. 2.2 but is corrected in the online scheme. This information would be useful because, in a perfect scenario, if the online correction scheme could produce the same increment as the analysis increment, it should be able to recover the reanalysis deemed as truth here. This, of course, cannot be the case in reality, but it can be useful for discussing the sources of the delta RMSE: for example, the lack of spatial correlation due to training on individual grid points, different strategies for applying corrections/increments, or the possibility that instantaneous random errors are averaged out so that the ML only learns systematic biases during training because long-term data are used.
2. The definition of error needs to be reformulated. The analysis increment is the difference between the analysis and the forecast, which is equivalent to the difference between the analysis error and the forecast error, x^a - x^f = e^a - e^f. Even if we take e^a = 0, the increment is the negative of the error of x^f. Therefore, if Eq. (2) is the estimated model error, Eq. (3) means that an error is added to the model forecast. In fact, the error should be removed. The authors may want to add a negative sign to Eq. (2). The same logic applies to Sect. 3.1, where I believe the negative error, instead of the actual error, is presented.
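The reviewer's sign argument, written out with errors defined relative to a truth x^t (e ≡ x − x^t):

```latex
% analysis increment expressed in terms of errors
\delta = x^{a} - x^{f} = e^{a} - e^{f}
% with a perfect analysis (e^{a} = 0), the increment is the negative forecast error,
% so applying it removes that error:
\delta = -\,e^{f} \quad\Longrightarrow\quad x^{f} + \delta = x^{f} - e^{f} = x^{t}
```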
3. As one of the selling points of this manuscript is the use of a fully coupled ESM, can the authors provide some analysis and discussion of the ocean state as well, so that one can get a better physical intuition of the results?
Minor comments:
1. L28: perhaps reads better with "transitioning to DA methods to...."
2. L73: Does NorCPM use the DEnKF instead of the stochastic EnKF? Would it be more informative to cite Sakov, P. and Oke, P. R. (2008): A deterministic formulation of the ensemble Kalman filter: an alternative to ensemble square root filters, Tellus A, 60, 361–371, https://doi.org/10.1111/j.1600-0870.2007.00299.x?
3. L100: I'm not sure the EnKF used here actually provides a spatiotemporal estimate, as filtering normally only provides spatial correlations in the error. Perhaps it would be better to say "time-dependent spatial error estimate"?
4. L140-143: It should be explicitly said that the same post-processing of NorCPM is used in the online correction scheme.
5. Sect. 2.4: Is post-processing applied to the output of the offline scheme for physical consistency when comparing the online and offline schemes?
6. Sect. 2.5: what is the objective function being used here? Is it RMSE?
7. L193: Why is the reference configuration run from 1991 to 2002, a period over which the online experiment is not performed?
8. L207: What is an areal sum of grid points? Is it an area-weighted sum of the SIC over all grid points with SIC >= 15%?
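For reference, the two conventional quantities this question distinguishes (A_i is the area of grid cell i and c_i its SIC): sea ice extent sums the cell areas where the concentration reaches 15 %, whereas the concentration-weighted sum is the sea ice area:

```latex
\mathrm{SIE} = \sum_{i} A_{i}\,\mathbf{1}\!\left(c_{i} \ge 0.15\right)
\qquad
\mathrm{SIA} = \sum_{i} A_{i}\, c_{i}
```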
9. L226-229: What does "10 data points from the 10 ensemble members" mean? Do you mean selecting one data point from each of the 10 ensemble members, leading to 10 data points in total, or selecting 10 data points from each of the 10 ensemble members, leading to 100 data points in total? Based on the RMSE in Eq. (4), the RMSE is calculated over time, which means that one RMSE is obtained for each grid point. How can a single RMSE over both time and space be obtained? Are there any results for the uncertainty of the RMSE in this manuscript?
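Generic forms of the two readings the reviewer distinguishes (the manuscript's Eq. (4) may differ; superscripts p and o denote prediction and truth, i indexes the N grid points and t the T times):

```latex
% one RMSE per grid point, computed over time
\mathrm{RMSE}_{i} = \sqrt{ \frac{1}{T} \sum_{t=1}^{T}
    \left( x^{p}_{i,t} - x^{o}_{i,t} \right)^{2} }
% a single RMSE pooled over both space and time
\mathrm{RMSE} = \sqrt{ \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T}
    \left( x^{p}_{i,t} - x^{o}_{i,t} \right)^{2} }
```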
10. L232: "...in predicting..."
Citation: https://doi.org/10.5194/egusphere-2024-4092-RC3