the Creative Commons Attribution 4.0 License.
Outlet Glacier Seasonal Terminus Prediction Using Interpretable Machine Learning
Abstract. Glacier terminus retreat involves complex processes superimposed at the interface between the ice sheet, the ocean, and the subglacial substrate, posing challenges for accurate physical modeling of terminus change. To enhance our understanding of outlet glacier ablation, numerous studies have investigated terminus position changes on a seasonal scale, yet no clear control on seasonal terminus change has been identified across all glaciers. Here, we explore the potential of machine learning to analyze glaciological time series data and gain insight into the seasonal changes of outlet glacier termini. Using machine learning models, we forecast seasonal changes in terminus positions for 46 outlet glaciers in Greenland. Through SHapley Additive exPlanations (SHAP) feature importance analysis, we identify the dominant predictors of seasonal terminus position change for each glacier. We find that glacier geometry is important for accurate predictions of the magnitude of terminus seasonality and that environmental variables (mélange, ocean thermal forcing, runoff, and air temperature) are important for determining the onset of seasonal terminus change. Our work highlights the utility of machine learning in understanding and forecasting glacier behavior.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-3483', Anonymous Referee #1, 29 Aug 2025
AC1: 'Reply on RC1', Kevin Shionalyn, 17 Sep 2025
Thank you for your review, discussion, and suggestions. We address them below, with portions of your original text in 'normal font' and our responses provided in 'bold'.
Data preparation:
Before your timeseries of data enter the machine learning setup, you remove the multi-annual component from the data leaving the seasonal component as input. You do this for every input variable (L153 - 154). Does this detrending apply to geometric variables such as bed slope and thickness, and to dynamic variables such as strain rate? If yes, what is the physical interpretation of removing the long-term component of bed slope or strain rate near the calving front (which changes position)? For instance, the velocity pattern and thus the strain rates at a given location are highly dependent on the geometry of the glacier (e.g. is it in a narrow or wider spot). Is it actually necessary to detrend the geometric data? Perhaps you can show the reader examples of the full and detrended time series of input variables like you have in Figure 2 for terminus advance. Also adding some more details of the method and on the resulting seasonal component would be useful to the reader, like does the timeseries have a zero mean?
Yes, from the singular spectrum analysis (SSA) we keep only the seasonal components for all data used in our models, including geometric and dynamic variables. We will add samples to our appendix, similar to Figure 2, as examples of seasonal and long-term components for each input variable. Performing the SSA on the geometric variables removes any multi-annual trends that may exist due to, for example, multi-year retreat. In other words, if a glacier is retreating through an overdeepening, the bed slope may increase for a number of years and then decrease for a number of years. Similarly, as the reviewer points out, the strain rates may decrease for a number of years as the glacier retreats through a wider spot and then increase for a number of years as it starts to stabilize in a narrower spot. We wish to remove these multi-annual trends to leave just the seasonal component of the variables for our analysis in this manuscript.
We add more details on the result of the singular spectrum analysis on line 155 to provide clarity for the reader. The resulting time series have a zero mean over the entire time period but not annually.
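For readers unfamiliar with SSA, the decomposition and the zero-mean seasonal component can be sketched on synthetic data. This is an illustration only, not our production code; the window length and the choice of which components to group as "seasonal" are assumptions made for this toy series:

```python
import numpy as np

def ssa_components(x, L):
    """Decompose series x into elementary SSA components via SVD of the
    trajectory (Hankel) matrix built with window length L."""
    N = len(x)
    K = N - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])  # L x K trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    comps = []
    for i in range(len(s)):
        Xi = s[i] * np.outer(U[:, i], Vt[i])             # rank-1 elementary matrix
        # Average each anti-diagonal to map the matrix back to a series
        comps.append(np.array([Xi[::-1].diagonal(k).mean()
                               for k in range(-L + 1, K)]))
    return comps

# Synthetic weekly record: offset + slow multi-annual trend + annual cycle + noise
rng = np.random.default_rng(0)
t = np.arange(520)                                        # ten years, weekly
x = 1.0 + 0.002 * t + np.sin(2 * np.pi * t / 52) + 0.1 * rng.standard_normal(520)
comps = ssa_components(x, L=104)                          # two-year window
seasonal = comps[1] + comps[2]      # here, the oscillatory pair of components
seasonal -= seasonal.mean()         # zero mean over the full record, not annually
```

The leading component absorbs the offset and multi-annual trend, while the paired oscillatory components carry the seasonal cycle, which is what we retain for all input variables.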
Because glaciers are not in steady state, more clarification is also needed on how the “magnitude of seasonal terminus change” is defined. Since termini do not necessarily return to the same position at the end of summer, how is this accounted for? In other words, do the long term changes still enter your prepared data through this? Are these situations where the algorithm struggles to make accurate predictions (L394)? Similarly, in Figure 2b some years show double peaks (e.g., 2013) —how are such cases treated?
This relates back to the points above on providing additional details about the Singular Spectrum Analysis (SSA) and the resulting time series. As we can see in Figure 4b, the SSA time series does not return to the same position each summer. As is true of all statistical tools, SSA is not perfect, and the resulting time series should be considered a representation of seasonality. While we use the SSA to remove the long-term components, some long-term signal can “leak” into the seasonality information, as we do see larger seasonal swings in the data, for instance, during periods of long-term retreat. Thus, when we refer to the ‘magnitude of seasonal terminus change’ we are indicating the magnitude or amplitude of the seasonal change, on average over the entire record. We have updated the text to provide this additional clarity.
Mélange proxy:
I am not convinced that SST is a reliable proxy for mélange conditions. SST does not necessarily capture mélange strength or thickness, as you also note (L406). I understand the motivation, given the lack of a Greenland-wide mélange product, but perhaps it would be more accurate to label this variable “fjord surface temperature,” acknowledging that it reflects mixed influences from open water, fjord ice, and mélange. Alternatively, additional examples establishing its validity as a mélange proxy would strengthen the argument.
We are also not satisfied using sea surface temperature (SST) as our closest proxy for mélange presence but we do so, recognizing that it serves as a temporary solution until better data products are made available on a Greenland-wide scale. To address this comment, we have added additional caveats to our use of SST as a proxy to include the other possible influences you mention, such as fjord ice. We have also added additional reminders to our figures that while we state “mélange”, we are referring to a proxy for mélange.
One advantage we have by using SST is that it creates a continuous data set that we can evaluate as a time series. This is an advantage for us because the machine learning model we employ struggles when mixing continuous and Boolean data. In the future, if a Greenland-wide data set of mélange rigidity values is created, it would be preferred to our use of SST.
Key findings:
In the abstract you write ‘We find that glacier geometry is important for accurate predictions of the magnitude of terminus seasonality and that environmental variables (mélange, ocean thermal forcing, runoff, and air temperature) are important for determining the onset of seasonal terminus change.’ I think these are the most important findings in your study, although perhaps not surprising? However, this is not really emphasized in your conclusions and outlook, where the focus is mostly on the potential of machine learning methods to improve our understanding e.g. seasonal terminus changes from a data driven perspective and how they can be implemented with numerical ice flow models. While these are important points, they are more general/outlook, more weight should be given to the glaciological insights gained—specifically, what your study reveals about the processes governing seasonal terminus changes.
Thank you for this insight---we agree. We have added an additional paragraph in our Outlooks section along with more details in the Conclusion. In these added sections, we discuss more of how our findings add to ongoing discussions on the Tidewater Glacier Cycle, long-term glacier retreat for Greenland’s outlet glaciers, and how our seasonal findings correspond to long-term trends.
Specific comments/suggestions
L125: Should be Danish Meteorological Institute (DMI) not Danish Meteorological Society
Changed as suggested.
L125: Figures 1 and 2 in Jensen et al. (2023) show that the station network is densest in South and Southwest Greenland. It is often very far from the glaciers in your study to the nearest weather station and the observations might not be representative of a particular glacier even when choosing the nearest one. For instance, the weather station is located much closer to the open ocean than the glacier terminus in question. While direct observations are great, would it not be better to apply re-analysis data from e.g. CARRA or similar?
This is a discussion we also had and debated. We chose to include re-analysis products where direct observations are not available, such as for our ocean thermal forcing data product. Where both direct observations and re-analysis products are available, we chose to go with direct observations to give our models a chance to learn from patterns in the direct data. We have provided this reasoning in the text.
L131: Just to clarify, is the timeseries of ocean thermal forcing near the glacier fronts for the entire column of water or only at a specific depth or interval?
Ocean Thermal Forcing is the difference between the in-situ temperature and the salt- and pressure-dependent freezing point of seawater, depth-averaged in the lowest 60% of the water column at each glacier front. We have updated the text to include this detail.
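In code, this depth-averaging can be sketched as follows. The linearized freezing point coefficients below are a commonly used approximation (Jenkins-style) and are an assumption for illustration, not necessarily the exact formulation of our data product:

```python
import numpy as np

def freezing_point(salinity, depth_m):
    """Linearized in-situ freezing point of seawater (degC).
    Salinity in PSU, depth positive downward in metres.
    Coefficients are a common approximation, assumed here."""
    return -0.0573 * salinity + 0.0832 - 7.61e-4 * depth_m

def thermal_forcing(temp, salinity, depth_m, lower_fraction=0.6):
    """Depth-average of T - T_f over the lowest `lower_fraction`
    of the water column."""
    temp, salinity, depth_m = map(np.asarray, (temp, salinity, depth_m))
    cutoff = depth_m.max() * (1.0 - lower_fraction)  # top of the lowest 60%
    mask = depth_m >= cutoff
    tf = temp[mask] - freezing_point(salinity[mask], depth_m[mask])
    return tf.mean()

# A toy 500 m profile near a glacier front
depth = np.linspace(0, 500, 51)
temp = np.full_like(depth, 1.0)      # 1 degC in-situ temperature
salinity = np.full_like(depth, 34.5)
print(round(thermal_forcing(temp, salinity, depth), 2))  # prints 3.16
```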
L162: ‘…heterogeneity in this population (Table 4).’ Should be Table 3, I think.
Changed as suggested.
L200: When evaluating the NRMSE, is it done once per year, at every minimum at which you evaluate the amplitude of the seasonal cycle, or at every timestep?
The mean seasonal range is evaluated annually, not at every minimum.
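A sketch of this bookkeeping, with the normalization by the mean annual range assumed here for illustration (the manuscript's exact definition is given in Section 2.2):

```python
import numpy as np
import pandas as pd

def nrmse(dates, observed, predicted):
    """RMSE over all timesteps, normalized by the mean annual range of
    the observed signal; the range is evaluated once per year."""
    df = pd.DataFrame({"obs": observed, "pred": predicted},
                      index=pd.DatetimeIndex(dates))
    rmse = np.sqrt(((df["obs"] - df["pred"]) ** 2).mean())
    annual_range = df["obs"].groupby(df.index.year).agg(lambda s: s.max() - s.min())
    return rmse / annual_range.mean()

# Two years of weekly terminus positions with an exact annual cycle
t = np.arange(104)
dates = pd.date_range("2000-01-05", periods=104, freq="7D")
obs = np.sin(2 * np.pi * t / 52)
```

A constant 0.1 offset in the prediction then yields an NRMSE of 0.05, i.e. 5% of the typical seasonal swing.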
L328 Perhaps change the subsection from ‘Feature Importance for Seasonal Terminus Prediction’ to ‘Feature Importance for Magnitude of Seasonal Terminus Prediction’?
Changed as suggested.
L319: You have a mean temporal offset of 6.7 weeks between the predicted and observed peak. This is ~1.5 months so quite a large difference in timing. Is it a systematic bias or is it random?
This is a systematic bias. We have added additional details to our description of mean offset to clarify that this is an absolute offset value.
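For concreteness, such an absolute peak offset can be computed along these lines (an illustrative sketch with an assumed peak-per-year definition, not our exact implementation):

```python
import numpy as np
import pandas as pd

def mean_peak_offset_weeks(dates, observed, predicted):
    """Mean absolute offset, in weeks, between the observed and predicted
    seasonal peak in each year (assumed definition, for illustration)."""
    df = pd.DataFrame({"obs": observed, "pred": predicted},
                      index=pd.DatetimeIndex(dates))
    offsets = []
    for _, year in df.groupby(df.index.year):
        delta = year["pred"].idxmax() - year["obs"].idxmax()
        # abs() reports magnitude only; the sign of delta would reveal a
        # systematic early/late bias and must be inspected separately
        offsets.append(abs(delta.days) / 7.0)
    return float(np.mean(offsets))

t = np.arange(104)
dates = pd.date_range("2000-01-05", periods=104, freq="7D")
obs = np.sin(2 * np.pi * t / 52)
pred = np.sin(2 * np.pi * (t - 2) / 52)  # same cycle, peaking two weeks late
```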
L357 forward: The studies you cite in L357-359 report findings on the mélange impact on the timing of retreat, while the focus of your study is the magnitude and not the timing (L338-340) of the seasonality. I am just wondering if you are comparing the right things.
Good point---we discuss starting on line 338 how our models work to reduce the Root Mean Squared Error (RMSE), or magnitude of error, instead of directly modeling to reduce offset (timing of seasonal change). The comparison is not 1:1, but we would expect a model that predicts the magnitude of retreat to infer the timing of retreat as well. We have added a clarification on line 360 that our model is trained to reduce RMSE.
L394: You write that the best performing models are for glaciers that have experienced little overall retreat during the period. Does this mean, that geometric parameters are the main controls for outlet glaciers close to a steady state?
We add an extra layer of nuance to your statement. Our findings show that geometric parameters are the main control on predictions for outlet glacier terminus seasonality, and that these predictions are more accurate for glaciers close to a steady state. We have added this change to the text on lines 394-395.
L455: suggests -> shows
Changed as suggested.
Figure 2: A Figure 2c is mentioned in the caption but is not present. Are both trends now in Figure 2b?
Changed as suggested.
Citation: https://doi.org/10.5194/egusphere-2025-3483-AC1
RC2: 'Comment on egusphere-2025-3483', Erik Loebel, 09 Sep 2025
This paper presents a novel approach to forecasting and understanding seasonal changes to the calving front of Greenland outlet glaciers. The authors use supervised machine learning, specifically gradient boosting, to analyse a large ensemble of glacier monitoring data and predict calving front seasonality. The benefits of the various input variables are made explainable using SHAP values. This assessment of the input features not only helps us to better understand the predictions of this particular machine learning algorithm, but will also benefit outlet glacier modelling efforts in general. I find this concept offers a highly promising approach to transforming our growing volume of glacier monitoring data into valuable glaciological insights. For these reasons, I believe that this paper will make a substantial contribution to research on outlet glaciers and is therefore within the scope of TC. That said, there are three points that I would like to see addressed before publication.
- I very much appreciate that this article focuses on the results and the glaciological discussion. However, I think it would be useful for the reader to at least understand the basics of how this method works. Section 2.1.1 needs more information and clarifications. For me, it currently raises a lot of questions. What are the 7 hyperparameters mentioned in L183? What learning rate was used? What about initialization? What is the stopping criterion used for the training? Also, it would be nice if it were briefly explained how the temporal aspects of the input time series are captured (e.g. were lag features or rolling statistics used?).
- SHAP values assume independence of input features and can be misleading when they are correlated. I don't see any problem with experiments where the inputs are grouped into geometric, dynamic, and climatic categories. However, it could be problematic when the inputs are assessed independently (e.g. for Fig 4 and 5b). For example, if there is a significant correlation between strain rate and velocity, or OTF and ice mélange, the corresponding SHAP values may become unstable, with small changes in the data or model leading to significant fluctuations in the SHAP value. Perhaps XGBoost is more robust in this respect than other methods, but I would like to see this mentioned or addressed in some way in the paper. A table showing the mean correlation between the inputs would already help.
- The model is evaluated using three accuracy metrics, which I think provide a great foundation for the subsequent experiments. That said, I don't think they currently do a particularly good job of giving one-time readers an idea of how well this model actually predicts. Although the metrics are briefly described in Section 2.2, it may still be difficult to understand what the sentence in L241 actually means in terms of the quality of the predictions. Would it make sense to specify an average distance between the actual and the model prediction, measured in metres? Perhaps even an absolute and a relative (to the seasonal terminus variation) average? Furthermore, I would like to see a handful of time series plots from the supplement (the '(b)' plots) included in the main paper. These would be very useful for understanding the text in section 3.2. So that there is at least one example for each of the different cases explained in the text.
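A correlation table of this kind can be produced in a few lines; the sketch below uses made-up data and illustrative feature names, not the paper's actual inputs:

```python
import numpy as np
import pandas as pd

# Pairwise Pearson correlations between model inputs, to flag pairs
# (e.g. velocity and strain rate) whose individual SHAP rankings may
# be unstable. Data here are synthetic, for illustration only.
rng = np.random.default_rng(1)
n = 200
velocity = rng.normal(size=n)
features = pd.DataFrame({
    "velocity": velocity,
    "strain_rate": 0.9 * velocity + 0.3 * rng.normal(size=n),  # strongly coupled
    "runoff": rng.normal(size=n),                               # independent
})
corr = features.corr()  # symmetric matrix of Pearson correlations
```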
Specific Comments
- L6: Consider naming the method here.
- L14: Does this 34/66 split refer to the 46-year period analysed in Mouginot et al. 2019? If yes, it should be clear. Other studies examining different periods yield different results (e.g. IMBIE 2018 got about 50/50 between 1992 and 2018).
- L170: Are the section numbers correct here?
- L188: To avoid confusion with the image processing technique, it would be better to use the terms 'input feature tracking' or 'feature importance tracking'.
- L225: ... in Zhang et al. 2023.
- L227: This makes it sound like these glaciers are not so important. However, when it comes to discharge, the story is different. The NEGIS ice stream alone drains around 12% of the GIS. Also south and southwest glaciers are missing.
- L673: This citation refers to the preprint.
- References: There seem to be many entries without a DOI. Consider adding them.
- Table 1: According to the paper (Zhang, E. 2023), the uncertainty of AutoTerm is 79 m. It's also not directly comparable to the 109 m of TermPicks. As far as I know AutoTerm uses an average minimal distance, whereas TermPicks uses a median Hausdorff distance. Perhaps it would be better to make these uncertainty statements more general (e.g. ~100 m).
- Table 2: Why don't the 'Best NRMSE' models have the best NRMSE?
Citation: https://doi.org/10.5194/egusphere-2025-3483-RC2
AC2: 'Reply on RC2', Kevin Shionalyn, 27 Sep 2025
Thank you for your review, discussion, and suggestions. We address them below, with portions of your original text in normal font and our responses provided in bold.
I very much appreciate that this article focuses on the results and the glaciological discussion. However, I think it would be useful for the reader to at least understand the basics how this method works. Section 2.1.1 need more information and clarifications. For me, it currently raises a lot of questions. What are the 7 hyperparameters mentioned in L183? What learning rate was used? What about initialization? What's the stopping criteria used for the training? Also, it would be nice if it were briefly explained how the temporal aspects of the input time series are captured (e.g. were lag features or rolling statistics used?).
To address this comment, we add additional details about our model setup. This includes: the hyperparameters used (including learning rate and stopping criterion) and how we tune them, and our use of standard XGBoost initialization with a random state. We did not use lag features or similar temporal controls other than assigning our training data to be the first 80% of each time series, so we add this specification to the text. We will also point readers towards the code and data availability section, where the model code can be viewed for further details.

SHAP values assume independence of input features and can be misleading when they are correlated. I don't see any problem with experiments where the inputs are grouped into geometric, dynamic, and climatic categories. However, it could be problematic when the inputs are assessed independently (e.g. for Fig 4 and 5b). For example, if there is a significant correlation between strain rate and velocity, or OTF and ice mélange, the corresponding SHAP values may become unstable, with small changes in the data or model leading to significant fluctuations in the SHAP value. Perhaps XGBoost is more robust in this respect than other methods, but I would like to see this mentioned or addressed in some way in the paper. A table showing the mean correlation between the inputs would already help.
While SHAP can often outperform other feature importance methods in handling correlated input variables, we also recognize that interpretation of individual input importances can be unstable, particularly where multiple inputs are correlated (e.g., Salih et al., 2025; https://doi.org/10.1002/aisy.202400304). To address this comment, we add a sub-figure to Figure 5 that shows the correlation between input variables. We will add a reference to this correlation in our “Limitations of this Study” section, lines 431-434, where we discuss the connection between velocity and strain rate and how this may lead to misleading rankings of individual feature importance.

The model is evaluated using three accuracy metrics, which I think provide a great foundation for the subsequent experiments. That said, I don't think they currently do a particularly good job of giving one-time readers an idea of how well this model actually predicts. Although the metrics are briefly described in Section 2.2, it may still be difficult to understand what the sentence in L241 actually means in terms of the quality of the predictions. Would it make sense to specify an average distance between the actual and the model prediction, measured in metres? Perhaps even an absolute and a relative (to the seasonal terminus variation) average? Furthermore, I would like to see a handful of time series plots from the supplement (the '(b)' plots) included in the main paper. These would be very useful for understanding the text in section 3.2. So that there is at least one example for each of the different cases explained in the text.
To address this comment, we add the suggested mean difference between the actual and model predictions, in meters, to Table 4. We also add a figure of additional prediction plots, providing a sample of the categories we refer to in Section 3.2.

Specific Comments
- L6: Consider naming the method here.
Changed as suggested.

- L14: Does this 34/66 split refer to the 46-year period analysed in Mouginot et al. 2019? If yes, it should be clear. Other studies examining different periods yield different results (e.g. IMBIE 2018 got about 50/50 between 1992 and 2018).
Correct. We cite Mouginot et al. 2019 as the source of the 34/66 split of mass lost to melt versus dynamic responses.

- L170: Are the section numbers correct here?
Thank you, this is a formatting error. We fix the section numbering, changing “Machine-learning Analysis” from 2.1.1 to 2.2 and “Model Accuracy and Experimental Design” from 2.2 to 2.3.

- L188: To avoid confusion with the image processing technique, it would be better to use the terms 'input feature tracking' or 'feature importance tracking'.
Changed 'feature tracking' to 'feature importance tracking' as suggested.

- L225: ... in Zhang et al. 2023.
Changed as suggested.

- L227: This makes it sound like these glaciers are not so important. However, when it comes to discharge, the story is different. The NEGIS ice stream alone drains around 12% of the GIS. Also south and southwest glaciers are missing.
To address this, we clarified the wording to emphasize that we are referring to the exclusion of a quantity of glaciers with terminus position data. We also added references to the missing south and southwest glaciers.

- L673: This citation refers to the preprint.
Updated as suggested to cite the published paper instead of the preprint.

- References: There seem to be many entries without a DOI. Consider adding them.
Thank you, these seem to have been lost in our formatting. We add additional DOIs where they are available.

- Table 1: According to the paper (Zhang, E. 2023), the uncertainty of AutoTerm is 79 m. It's also not directly comparable to the 109 m of TermPicks. As far as I know AutoTerm uses an average minimal distance, whereas TermPicks uses a median Hausdorff distance. Perhaps it would be better to make these uncertainty statements more general (e.g. ~100 m).
Uncertainty statements generalized as suggested.

- Table 2: Why don't the 'Best NRMSE' models have the best NRMSE?
In the “Comparative Experiments” section, we explain how the use of RACMO (Experiment 2a) and MAR (Experiment 2b) data products create similar results, and how we ultimately chose to use the MAR data product for a reduction in offset.
Citation: https://doi.org/10.5194/egusphere-2025-3483-AC2
Summary
This study applies machine learning to investigate the drivers of seasonal terminus changes in Greenland outlet glaciers. The analysis uses freely available datasets on glacier geometry, climate, and glacier dynamics for the period 2000–2020. The authors find that geometric variables are the strongest controls on the magnitude of seasonal change, and that their approach performs best for glaciers with limited long-term retreat.
The study addresses a highly relevant topic for the glaciological community and represents a valuable step toward understanding and predicting glacier calving front changes. The manuscript is generally well written and engaging. However, I believe several issues regarding data and data preparation must be addressed to strengthen the paper; my general comments and specific points are quoted, with the authors' responses, in the reply above.