This work is distributed under the Creative Commons Attribution 4.0 License.
The Modèle Atmosphérique Régional – Intelligence Artificielle (MAR-IA): surface meltwater over Greenland
Abstract. Surface melting over the Greenland Ice Sheet has become one of the dominant sources of contemporary and projected global sea-level rise, with melt rates accelerating over recent decades. Understanding the processes and feedbacks that control Greenland's surface melt is central to improving projections of future mass loss and to clarifying how changes in surface energy balance components shape ice-sheet stability.
To this end, we developed MAR-IA – a machine-learning emulator of the MAR regional climate model – designed to emulate daily surface meltwater production over Greenland and to enable attribution of melt drivers. We implement two complementary emulators: a high-fidelity MAR-IA trained on full MAR surface energy balance fields and a reanalysis-compatible MAR-IA-ERA trained on variables available from products such as ERA5, thereby extending applicability beyond MAR-specific outputs. Both emulators employ gradient-boosted trees optimized via Bayesian hyperparameter search, achieving test-set performance up to R2 = 0.99 with low mean squared error and negligible bias relative to MAR meltwater outputs. We apply a SHAP-based explainable AI analysis to quantify how the importance of surface energy balance components – such as albedo and shortwave and longwave radiation – evolves across space and time over Greenland. Our results reveal robust spatial and temporal patterns in the dominance of radiative versus non-radiative drivers and demonstrate long-term trends in the relative contributions of temperature, shortwave radiation, and albedo to melt variability. These findings show that emulators are powerful tools that complement regional climate models by enabling computationally efficient ensemble simulations and physically interpretable attribution of past and future Greenland surface melt, and that the development of regional climate models should go hand in hand with that of ML-based tools.
Competing interests: At least one of the (co-)authors is a member of the editorial board of The Cryosphere.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 25 Mar 2026)
- RC1: 'Comment on egusphere-2026-490', Anonymous Referee #1, 20 Feb 2026
- CC1: 'Comment on egusphere-2026-490', Elke Schlager, 25 Feb 2026
This study develops an emulator using XGBoost to estimate surface melt from climatic and topographic variables taken from MAR and from ERA5 outputs. The authors then use the SHAP values of the trained emulator to analyse the melt drivers.
Working on emulating surface melt based on RCM climate data myself, I read the manuscript with great interest. Despite the importance of the topic and the promising approach of using XGBoost, I have several major concerns with the methodology and the conclusions drawn in this paper.
Main Concern #1: The train/validation/test split
Lines 189-190 describe that the data was split into 80% training, 20% validation, and 10% test set. Besides the fact that this sums to 110%, and that the use of the three subsets should be explained more closely, my major concern is the splitting strategy itself, which is not explained. The data splitting strategy for timeseries data needs to account for the high temporal correlation of samples: a random split yields a test set that is not independent of the training and validation sets and can thus result in overoptimistic performance scores on the test set. The authors therefore need to clearly describe their train/validation/test splitting strategy and how it sufficiently accounts for the temporal correlation of the data.
Since the cross-validation uses plain k-fold cross-validation, which is not appropriate for timeseries data, the associated cross-validation scores are not trustworthy. Moreover, if the authors also used a split strategy that does not account for temporal correlation in the subsequent model trainings, none of the reported scores can be trusted either, and the experiments would need to be repeated with a more appropriate split. The literature on timeseries forecasting discusses splitting methods and pitfalls when working with timeseries data, which also need to be considered when regressing on timeseries data (see e.g. Hyndman and Athanasopoulos, 2018).
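The leakage pitfall described above can be made concrete with a minimal expanding-window split, sketched in plain NumPy (this is an illustration, not the authors' code; the function name and parameters are hypothetical):

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits=3, test_size=None):
    """Yield (train_idx, test_idx) pairs where every test block
    strictly follows its training block in time, so no future
    information leaks into training (unlike a random k-fold)."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - i + 1) * test_size
        train_idx = np.arange(0, train_end)
        test_idx = np.arange(train_end, train_end + test_size)
        yield train_idx, test_idx

# With 10 daily samples and 3 splits, each fold tests on a later block:
for tr, te in expanding_window_splits(10, n_splits=3, test_size=2):
    assert tr.max() < te.min()   # training never sees the future
```

Each fold trains only on samples that precede its test block, which is exactly the property a random k-fold split destroys for daily, temporally autocorrelated data.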
Main Concern #2: Attribution analysis and SHAP values: The authors do not sufficiently separate the use of SHAP values to explain the model predictions on the one hand and to explain the physics on the other. While SHAP values can serve as a "sanity check" for the model and explain the reasons for the model's predictions, care needs to be taken when translating them into interpretations of the underlying physics.
- The explanation of the SHAP implementation is not clear to me. Some more clarification on the implementation and a mention of the libraries used are needed. I wanted to find out more via the code, as a Code/Data Availability statement is given; however, the linked Zenodo records include only data, not code.
- Lines 109-112, 219, and 326-327 explain the use of SHAP values for model interpretation, i.e. quantifying the contribution of each input feature to the model's prediction. However, lines 57 and 124 then indicate that these SHAP values are used to draw physical conclusions and find causal relations, and in line 334 it is claimed that the physical interpretability is now confirmed. Just because the SHAP values seem (mostly) reasonable does not mean that they explain causal relationships. Based on the current explanation, I am not convinced that the SHAP analysis supports conclusions beyond the explainability of the ML model itself. For example, line 145 mentions the issue of using correlated inputs for attribution analysis, and in line 296 the fluctuation of the SHAP values is explained by the remaining correlation of the features; yet the preceding discussion used the SHAP values to draw physical conclusions without discussing how these correlations may influence the SHAP values and thus the observed trends. Also, direct melt drivers are identified using SHAP values even though input variables were used that are not direct melt drivers. Furthermore, the fluctuation of the SHAP values for longitude in Figure 8 is not discussed at all.
- Besides my doubts on the validity of the attribution study using SHAP values, the use of an ML emulator for such an attribution study is not properly motivated. The MAR model already provides all the radiative and turbulent heat fluxes needed to calculate the surface energy balance and to explain the melt drivers directly; see also Wang et al. (2021), Zhang et al. (2025), and Hofer et al. (2017).
- The terminology related to SHAP is currently inconsistent (e.g., “Shapley values”, “SHAP values”, and “Shapley coefficients”), which is rather confusing. In the ML literature, “SHAP values” is typically used for the specific method, while “Shapley values” refers to the underlying game-theoretic concept.
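For clarity, the model-versus-physics distinction follows directly from what Shapley values compute: an exact additive decomposition of the model output, whatever relationships (causal or not) the model has learned. A toy two-feature example, purely illustrative and not taken from the manuscript:

```python
# Toy model with two interacting "drivers"; (b1, b2) is the background.
f = lambda x1, x2: 3.0 * x1 + 2.0 * x2 + 0.5 * x1 * x2
b1, b2 = 0.0, 0.0          # baseline values
x1, x2 = 1.0, 2.0          # instance to explain

# Exact Shapley values for two features: average each feature's
# marginal contribution over both possible orderings.
phi1 = 0.5 * ((f(x1, b2) - f(b1, b2)) + (f(x1, x2) - f(b1, x2)))
phi2 = 0.5 * ((f(b1, x2) - f(b1, b2)) + (f(x1, x2) - f(x1, b2)))

# Local accuracy: attributions sum to prediction minus baseline.
assert abs((phi1 + phi2) - (f(x1, x2) - f(b1, b2))) < 1e-12
```

The attributions always sum to the prediction minus the baseline (local accuracy), but nothing in this construction guarantees that phi1 and phi2 reflect physical causation, especially when the inputs are correlated.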
Main Concern #3: The selection of input variables
While the conclusion highlights that predictor selection was guided by physical relevance and statistical analysis, I do not see that this was done sufficiently.
- Line 80 mentions 2-meter air temperature as a driver of melt energy. However, while 2-m air temperature is closely related to the turbulent heat fluxes, considering the surface energy balance it is not a driver of melt itself: the surface energy balance is directly driven by net shortwave and longwave radiation, sensible heat flux, latent heat flux, and ground heat flux (Lenaerts et al., 2019). The use of additional inputs, especially when trying to interpret the drivers physically, needs more explanation.
- Specifically, the motivation for using topographic variables needs more explanation. Line 83 claims that topographic variables modulate the local energy balance and atmospheric conditions. However, when the atmospheric conditions themselves are used as model inputs, why also use the topographic variables that contribute to those atmospheric conditions?
Main Concern #4: Conflicting model scores
- The abstract claims an R2 of 0.99, but none of the resulting models reaches such a high score according to Table 2. Also, the formulation implies that this high score is reached by the model using ERA5 inputs, which seems misleading.
- The discussion, conclusion, and Table 2 do not mention if these are the scores on the test set. Also, the scores in the conclusion are different from those in the discussion and in Table 2.
- Inconsistent use of metrics: in Sect. 2.5 MSE is reported, in Sect. 3.1 RMSE, and in Sect. 3.3 mean absolute difference and mean difference are used.
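For reference, these metrics are closely related and could easily be reported as one consistent set on a single, clearly named subset; a minimal sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([0.0, 1.5, 3.0, 4.5])   # illustrative values only
y_pred = np.array([0.2, 1.0, 3.3, 4.0])

err  = y_pred - y_true
mse  = np.mean(err ** 2)
rmse = np.sqrt(mse)                 # same information as MSE, in the target's units
bias = np.mean(err)                 # "mean difference"
mae  = np.mean(np.abs(err))         # "mean absolute difference"

assert np.isclose(rmse, np.sqrt(mse))   # RMSE is just the square root of MSE
```

Reporting all four on the same (stated) subset would remove the ambiguity between sections.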
Main Concern #5: Related literature
- Lines 41-43: The authors mention only one paper (Doury et al., 2023) for a specific ML application, which refers to a downscaling method and thus seems only loosely connected to the melt-emulation task, as more closely related literature exists. Furthermore, the authors mention attribution analyses based on ML emulation, which are not backed up by the two given references.
- On the other hand, the ML approaches for downscaling climate fields (like Doury et al., 2023) should be discussed in the context of the MAR-IA-ERA models. Since these emulators rely on ERA5 data reprojected onto the MAR grid, it would be helpful to clarify that first downscaling ERA5 to the MAR configuration would ensure spatial consistency between the emulator inputs and the resulting melt predictions, and that much research is being done on developing such downscaling via ML.
- Lines 89-90: The given literature for the applications of XGBoost also seems quite unrelated to the task of the paper. Some additional references to work in climate/cryosphere research would be useful, e.g. Veldhuijsen et al. (2025).
- References should be double checked: Nghiem et al. (2012) was cited four times in the paper, but does not appear in the reference list. In line 291 it is used as only reference for a claim that refers to multiple studies. In contrast, Tedesco et al. 2016a and 2016b seem to be the same reference.
Further comments:
- Inconsistent terminology, e.g. surface temperature and skin temperature are used interchangeably; same with inputs, variables, features, and predictands
- While the formulation "approximately normal distribution" in line 142 is odd in general, albedo and upward longwave radiation in particular show distributions that are not at all comparable to a normal distribution. It would therefore be more accurate to state that the variables are not strongly concentrated within specific value ranges, rather than characterizing them as approximately normal. Furthermore, the x-axis (or the data itself) seems cropped for most variables, so the existence of long tails and extreme values cannot be judged.
- The description of the hyperparameter optimization does not state the ranges chosen for the different parameters to be optimized, nor how many combinations were tried in total.
- I do not understand the claim in lines 365-367: "the MAR-IA emulator allows running past and future scenarios without forcing the atmospheric model in MAR with fields that are not always available for past periods and future simulations." Which fields are not available, which emulator product do you mean, and how do you justify the validity of extrapolating to past and future periods?
- Table 2 includes RMSE and bias for the 95% CI, which is neither explained nor interpreted in the main text.
- Figure 4 includes so many data points that the distribution of errors is not visible; a density plot would be a better fit here. Furthermore, plotting test, validation, and train data on top of each other is not useful, as the data points from the train and validation sets are mostly hidden, and the error distributions of these subsets are not discussed. Again, it is unclear on which subset (train, validation, or test) the scores in the table were calculated.
- Figure 4 also shows negative melt predictions. In applications, such values would likely be set to zero. It therefore seems reasonable to truncate the predictions at zero and to report performance metrics based on the truncated results.
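The suggested truncation is a one-liner; a sketch with synthetic data (not the authors' results) shows that clipping at zero can only reduce the error when the target itself is non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.maximum(rng.normal(2.0, 2.0, 1000), 0.0)   # melt is non-negative
y_pred = y_true + rng.normal(0.0, 0.5, 1000)           # emulator may go below 0

y_clip = np.clip(y_pred, 0.0, None)                    # physical truncation

rmse_raw  = np.sqrt(np.mean((y_pred - y_true) ** 2))
rmse_clip = np.sqrt(np.mean((y_clip - y_true) ** 2))

# Clipping moves each negative prediction toward the (non-negative)
# truth, so the truncated RMSE can never be worse here:
assert rmse_clip <= rmse_raw
```

Reporting metrics on the truncated predictions would reflect how the emulator is actually used downstream.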
References:
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
Lazzeri, F. (2020). Machine learning for time series forecasting with Python. John Wiley & Sons.
Lenaerts, J. T., Medley, B., van den Broeke, M. R., & Wouters, B. (2019). Observing and modeling ice sheet surface mass balance. Reviews of Geophysics, 57(2), 376-420.
Veldhuijsen, S. B. M., van de Berg, W. J., Kuipers Munneke, P., Hansen, N., Boberg, F., Kittel, C., Amory, C., and van den Broeke, M. R.: Emulating the expansion of Antarctic perennial firn aquifers in the 21st century, The Cryosphere, 19, 5157–5173, https://doi.org/10.5194/tc-19-5157-2025, 2025.
Wang, W., Zender, C. S., van As, D., Fausto, R. S., & Laffin, M. K. (2021). Greenland surface melt dominated by solar and sensible heating. Geophysical Research Letters, 48(7), e2020GL090653.
Zhang, Q. L., Ding, M. H., Van Den Broeke, M. R., Noël, B., Fettweis, X., Wang, S., ... & Huai, B. J. (2025). Variations in Greenland surface melt and extreme events from 1958 to 2023. Advances in Climate Change Research, 16(5), 910-921.
Hofer, S., Tedstone, A. J., Fettweis, X., & Bamber, J. L. (2017). Decreasing cloud cover drives the recent mass loss on the Greenland Ice Sheet. Science Advances, 3(6), e1700584.
Citation: https://doi.org/10.5194/egusphere-2026-490-CC1
Summary
The study introduces a machine-learning-based (XGBoost) emulator of meltwater production over the Greenland ice sheet. SHAP values are used to understand the importance of the different variables for meltwater production over time. The authors provide different versions of the emulator, trained on different data/variable sets. The 'full' version is trained on the regional climate model MAR, while another version is trained on ERA5 and a limited subset of the MAR variables. They find that the emulator generally shows good agreement with the MAR model on the test set.
Generally, I think the study is interesting and publishable in TC. However, I have some problems with the study in the current state, mostly regarding the language (see below).
Major comments:
1. Unfortunately, it is quite obvious that large parts of the manuscript were either written by AI or at least strongly formulated by AI. There seems to be extensive use of "excess vocabulary". If you use AI to formulate text or fix your grammar (as acknowledged in the acknowledgments), it should at least not be this obvious. In my opinion, the credibility of the results suffers, even if all results are valid and correct. For the revision, I advise the authors to rely less on AI, or to actually only use it to fix grammatical issues without letting ChatGPT write whole sentences.
2. While it is true that ML approaches in the ice-sheet context are still quite sparse, I think this study has to make clearer why it is novel. In particular, it is not clear to me how the ML-based approach improves on, for example, a simple linear regression. I am missing a comparison with some baseline model: is there even a clear gain of information? For the revision, I think it is necessary to include such a comparison.
3. I think the introduction needs quite a lot of work. I am missing a discussion and context/comparison to recent advances of ML approaches in ice-sheet modelling/observations. Just some examples that should/could be discussed in the introduction at least: Lütjens et al. 2025 (https://arxiv.org/abs/2512.12142), Bochow et al. 2025 (https://egusphere.copernicus.org/preprints/2025/egusphere-2025-3927/). Especially a comparison with Schlager et al. 2026 (https://egusphere.copernicus.org/preprints/2026/egusphere-2026-7/) is necessary. To be fair, that preprint was posted after this paper but it seems like there is a very similar approach/idea described but instead of XGBoost a neural net is used. How does your model compare with theirs? What are the differences/similarities?
4. In general try to make some sentences short and more on the point. There are quite a lot of nested sentences that make it hard to follow thoughts.
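The baseline comparison asked for in major comment 2 could be as simple as an ordinary-least-squares fit evaluated with the same R2 metric; a self-contained sketch on synthetic data (purely illustrative, not the authors' setup):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                        # stand-in predictors
y = (X @ np.array([1.0, -2.0, 0.5, 0.0])             # linear part
     + 0.3 * X[:, 0] * X[:, 1]                       # mild nonlinearity
     + rng.normal(scale=0.1, size=500))

# Ordinary least squares (with intercept) as the "skill floor".
A = np.c_[np.ones(500), X]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# An ML emulator should clearly beat this R2 to justify its complexity.
assert 0.0 <= r2 < 1.0
```

Reporting the gap between such a baseline and the XGBoost emulator would directly answer whether the nonlinear model adds information.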
Specific comments:
L.19 What is a low MSE? I think a relative error would be better here.
L.20 SHAP is not clear to everyone (acronym).
L.21 The long dash is a dead giveaway for AI use.
L.26 The last sentence seems out of place and reads more like an opinion.
L.31 I would say increased melt instead of enhanced.
L.33 The MAR acronym is written out, RACMO and HIRHAM are not.
L.35 Driving processes?
L.37 “Meltwater production” sounds odd.
L.37 In Pirk et al., neither Greenland nor the word “melt” is mentioned even once. Are you sure it's the right reference here?
L.44-45 Where are the references for that claim?
L.45-46 Was there a previous study emulating melt dynamics, or is this something you do for the first time? If the latter, I do not understand the sentence.
L.51 What does it mean that predictands are “drawn” from SEB components?
L.54 Do you use only ERA5, or some other dataset as well? If only ERA5, I would remove “such as.”
L.55 What is “evolving importance”?
L.57 In the ML community, “benchmark” is used in a different context, it could be confusing here.
L.86 Maybe add example tasks, especially in the context of ice sheets/climate modelling.
L.92 I don’t think it’s necessary to list boosting and decision trees as (a) and (b) here. I would suggest rephrasing the sentence.
L.92f I also think the description of boosting, and especially trees, is not very clear. For example, in line 96, what kind of thresholds? For the audience of TC, where I assume there are not many ML experts, I think this should be described a bit more extensively. For example: what kind of regularisation term? What does “efficiently” mean, compared to what? Neural networks?
L.100 The term “features” was not introduced.
L.105 I’m not sure why the metaphor of players is introduced. I don’t think it’s necessary. I think it’s easier to understand if you phrase it in terms of variables/predictors rather than “players.”
L.109/110 Rephrase the sentence, or split it into two sentences.
L.122f Split the sentence.
L.123 What does “assign” mean? Is this also computed?
L.140 It is not necessarily clear what you mean by “distribution of predictors.”
L.144f Split the sentence.
L.147 Missing )
L.148 A word is missing.
L.149 Why were shortwave and longwave radiation removed?
L.164 Explain five-fold cross-validation.
L.189 80% + 20% + 10% = 110%
L.197 Here you introduce the term “bias”, however, you used it earlier. Maybe move the definition of bias, MSE, etc. to when you first mention these terms.
L.200 I’m not a fan of using superlatives like “extremely”
L.217 This sounds odd: “explanatory nature”
L.283 Why is there a “["?
L.340 Split the sentence; it’s also not clear what you want to say.
L.353 Again, split the sentence. Please avoid sentences that run over four lines or more.
L.373f What does “such as” refer to?
Figures and Tables
The captions are generally not descriptive enough. Please extend them. The figures and tables should be self-explanatory without looking into the text.
Tab. 2 Metrics on what? The test set?
Tab. 3 Avoid referring to other figures/tables in the caption.
Fig. 2 Did you check whether the correlation of the variables is approximately the same in the training/validation/test data sets? Also, the colorbar has no label.
Fig. 3 The last colorbar is missing its label.
Fig. 4 Panel labels (a, b, c, ...) are missing. Maybe make the data points transparent; currently the blue points overlay everything. The quality is not the best; maybe use a vector-based figure.