An ensemble machine learning approach for filling voids in surface elevation change maps over glacier surfaces
Abstract. Glacier mass balance assessments in mountainous regions often rely on digital elevation models (DEMs) to estimate surface elevation change. However, these DEMs are prone to spatial data voids, particularly during historical reconstructions using older imagery. These voids, which are most common in glacier accumulation zones, introduce uncertainty into estimates of glacier mass balance and surface elevation change. Traditional void-filling methods, such as constant and hypsometric interpolation, have limitations in capturing spatial variability in elevation change. This study introduces a machine-learning-based approach using gradient-boosted tree regression (XGBoost) to estimate glacier surface-elevation change across voids. High Mountain Asia (HMA) is an ideal study area for assessing the accuracy of different void-filling approaches across glaciers with varying morphology and climatic settings. We compare XGBoost predictions to traditional void-filling methods across the Western and Eastern Himalayas using a dataset of DEM-derived elevation changes. Results indicate that XGBoost consistently outperforms simpler methods, reducing root mean square error (RMSE) and mean absolute error (MAE) while improving alignment with observed elevation changes. The study highlights the advantages of integrating multiple glaciological and topographic predictors, demonstrating the potential of machine learning to improve assessments of glacier mass balance and elevation change. Future research should explore additional predictors, such as climate data, to further enhance predictive accuracy.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4404', Romain Hugonnet, 18 Dec 2025
- CC1: 'Comment on egusphere-2025-4404', Codrut-Andrei Diaconu, 18 Dec 2025
Dear authors,
Thank you for sharing this very interesting preprint and for making your code and data processing pipeline available. I have a question about the hyperparameter optimization (HPO) and validation strategy in Section 3.4.2, and how it relates to the strongly spatially correlated nature of the data.
As I understand it, your scheme is:
- training data: all non-void pixels;
- validation + test data: all artificially voided pixels, randomly split 50/50 into a validation set (used for Optuna and early stopping) and a test set (used for the final skill assessment in the voided areas).
Given that pixels are very strongly spatially correlated, with a random split at pixel level between validation and test, both the HPO and the final evaluation are sampling from essentially the same spatial structures. So even if the model is successful, this mainly answers “How well can I interpolate at ~pixel-scale inside the accumulation area?” (where validation and test pixels are close neighbours inside the same voids and glaciers), versus “How well can I reconstruct an entire large contiguous gap?” (e.g. a big accumulation-area void on a glacier that was never used in HPO).
If the method is addressing the second question, then I believe the valid/test split should be different and I was wondering whether you have considered (or could comment on) alternative validation schemes. For instance:
- Using a subset of the covered (non-void) pixels as the validation set for HPO (basically a subset of the current training data). In this setup, Optuna would see only non-void pixels, and the entire void mask would be reserved for testing.
- Splitting at the glacier level rather than at the pixel level. For example, randomly splitting glaciers 50/50 into two folds, generating voids on all glaciers, and then using all void pixels from fold A as the validation set for HPO, and all void pixels from fold B as the test set. This might be easier than the previous setup (a pure extrapolation exercise) and also closer to the situation where only a subset of glaciers in a region have large gaps in their accumulation areas (which might be more realistic, as I imagine that in practice not all glaciers have gaps in their accumulation areas). A sketch of this glacier-level split is given below.
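A minimal sketch of this glacier-level split, assuming hypothetical arrays `X` (per-pixel features), `y` (dh/dt), `glacier_id` (one label per pixel), and a boolean `is_void` mask for the artificially generated voids; `GroupShuffleSplit` keeps all pixels of a glacier in the same fold:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Split glaciers (not pixels) 50/50; all variable names here are hypothetical.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
fold_a, fold_b = next(splitter.split(X, y, groups=glacier_id))

val_idx = fold_a[is_void[fold_a]]    # void pixels on fold-A glaciers: HPO / early stopping
test_idx = fold_b[is_void[fold_b]]   # void pixels on fold-B glaciers: never seen during HPO
train_idx = np.where(~is_void)[0]    # all non-void pixels, as in the current scheme
```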
Finally, would it be possible to report also the R^2 for the predictions on top of RMSE/MAE? This would make it easier to compare with results from other regions, which might have different dh/dt ranges.
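For reference, the standard definition, with $y_i$ the observed and $\hat{y}_i$ the predicted elevation changes on the test pixels, and $\bar{y}$ their observed mean:

$$ R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2} $$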
Many thanks again for the very clear paper and open workflow, and I'd be very interested in your thoughts on this point.
Best regards,
Codrut-Andrei Diaconu
Citation: https://doi.org/10.5194/egusphere-2025-4404-CC1
- RC2: 'Comment on egusphere-2025-4404', Robert McNabb, 06 Jan 2026
I think this is a very interesting study that builds on and extends previous work on void-filling methods for estimating elevation change and geodetic mass balance by applying a machine learning approach and comparing the results to other, simpler methods. The topic is of great value for the community, and I agree with the authors that improving these approaches is important, especially when attempting to use historical imagery to reconstruct past glacier changes. While the study is well-written and interesting and the results are encouraging, I think there are some potential issues with the reporting and the datasets used that need to be considered or addressed.
General comments
----------------
1. While the results are promising, I think that the tone of the findings needs to be tempered somewhat. While there is a difference between the RMSE values for the hypsometric approach and the XGBoost approach as reported in Table 3, they are fairly small. On a glacier-wide scale (Table 4, Figure 4), the differences are barely perceptible, implying that the hypsometric method is able to capture the majority of the variability in elevation change, and that the improvement provided by the XGBoost method is minor in comparison to the other uncertainties in the dataset. This is itself an interesting result, but I think that the findings in the abstract and conclusions need to be softened.
2. I don't know if using the ASTER-derived dataset from Shean et al. (2020) is necessarily the best choice here. As much as I love the ASTER-derived results, they can be very noisy. They are also already interpolated (linearly in time rather than spatially), which could pose additional problems when trying to train a model for predicting elevation change. I think it might be beneficial to use a smaller but more detailed dataset such as the HMA DEMs (Shean, 2017) or the Pléiades Glacier Observatory DEMs (Berthier et al. 2024). A combination of these high-resolution DEMs, differenced to the SRTM, might provide a cleaner, if smaller, input dataset to train and evaluate the model.
3. I think that the section on model interpretability is very interesting and useful, but I don't understand why this is only done for two glaciers, rather than a much larger sample or even the entire dataset. If it's not possible to apply this to a larger sample, then this at least needs to be stated/explained. Assuming that it is possible, though, I think it would be highly beneficial to look deeper into the variability of feature importance over larger areas.
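If the interpretability analysis is based on SHAP values (a common choice for gradient-boosted trees, though this is an assumption here; all variable names below are hypothetical), scaling it to the full dataset is usually cheap with the tree explainer:

```python
import numpy as np
import shap  # TreeExplainer is fast for tree ensembles such as XGBoost

explainer = shap.TreeExplainer(model)        # model: the trained XGBoost regressor
shap_values = explainer.shap_values(X_all)   # (n_pixels, n_features), all on-glacier pixels

# Global importance: mean |SHAP| per feature across the whole dataset
global_importance = np.abs(shap_values).mean(axis=0)

# Per-glacier importance, to map how feature relevance varies spatially
per_glacier = {gid: np.abs(shap_values[glacier_id == gid]).mean(axis=0)
               for gid in np.unique(glacier_id)}
```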
4. It appears that the method for generating artificial voids covers entire sections of the accumulation area of each glacier, which might also be much larger/more continuous than is needed. They are also only in the accumulation area, which means that the training data is more biased toward ablation areas. I think it would be helpful to compare examples of the voids generated by your approach with the voids included in historical datasets such as Maurer et al. 2019 - how well do your artificial voids capture the variability shown in that dataset? How often do you see voids at lower elevations? I think that a mix of void sizes and extents might be worth including.
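One simple diagnostic for such a comparison (a sketch under assumed inputs: a DEM and a boolean void mask per glacier) is the void fraction per elevation band, computed identically for the artificial masks and for historical ones such as Maurer et al. (2019):

```python
import numpy as np

def void_fraction_by_band(dem, void_mask, band=50.0):
    """Fraction of void pixels per elevation band (band width in metres)."""
    valid = np.isfinite(dem)
    z, v = dem[valid], void_mask[valid]
    zmin = z.min()
    idx = ((z - zmin) // band).astype(int)
    return {zmin + b * band: v[idx == b].mean() for b in np.unique(idx)}
```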
Specific comments
-----------------
l. 11-12: I agree that integrating multiple glaciological and topographic predictors is important and advantageous, but the results shown in section 4.3 would suggest that, at least in those two detailed cases, x, y, and z are the most important features in predicting elevation change, and the other glaciological/topographic predictors are less important. As with the main finding, I think this needs to be softened somewhat.
l. 23: A better citation here would be Berthier et al. 2023, I think - it's both more recent and more tailored to the topic of DEM differencing and geodetic mass balance
l. 40-50: McNabb et al. 2019 here is over-cited. There are many more appropriate references to discuss/cite individual methods, such as the original studies cited by McNabb et al., or a large number of studies published since 2019.
l. 83: why not use spatially-distributed maps of debris cover to assign "debris cover" as a feature for each pixel/point, rather than aggregating to the entire glacier? I think this might have more predictive power for individual pixels, and it should be easy enough to include given that the dataset cited here includes the spatial coverage.
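As a sketch of how such a per-pixel flag could be built (file names are hypothetical; `rasterio.features.rasterize` burns the debris polygons onto the dh/dt grid):

```python
import geopandas as gpd
import rasterio
from rasterio.features import rasterize

debris = gpd.read_file("debris_cover_outlines.gpkg")  # hypothetical outlines
with rasterio.open("dhdt.tif") as src:                # hypothetical dh/dt grid
    debris = debris.to_crs(src.crs)
    debris_mask = rasterize(
        [(geom, 1) for geom in debris.geometry],
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,
        dtype="uint8",
    )  # 1 = debris-covered pixel, usable directly as a per-pixel feature
```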
l. 86: how did you determine this? RGI v6.0 does not include any information about "evidence of calving" for glaciers in HMA, as lake-terminating status is only included in a few regions for v6.0.
l. 97-102: This would seem to introduce additional voids into your dataset - how is this handled? Do you just remove those pixels/points from consideration entirely?
l. 108-111: why were these datasets removed? Also, why are you removing glaciers for being heavily debris-covered? This wasn't mentioned in the previous section.
l. 133: why 37%? Is this a constant value for all glaciers, meaning that your "testing" dataset is ~37% of all on-glacier pixels? Additionally, given that this appears to cover entire sections of the upper glacier, I wonder if some of the differences seen between the XGBoost and hypsometric methods aren't due in part to the linear interpolation required to fill the missing elevation bands - have you looked at this as well?
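To make the interpolation concern concrete, a sketch of hypsometric filling where bands emptied by a large void are bridged by linear interpolation of the per-band mean dh/dt (array names hypothetical):

```python
import numpy as np

# band_z: band-centre elevations (ascending); band_dh: mean dh/dt per band,
# NaN where a band has no valid pixels (e.g. inside a large artificial void)
have = np.isfinite(band_dh)
band_dh_filled = np.interp(band_z, band_z[have], band_dh[have])
# Void pixels inherit their band's (possibly interpolated) value, so large
# contiguous voids are effectively filled with a linear ramp between bands.
```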
l. 144, elsewhere: define "outperformed" - do you mean that the RMSE and MAE were lower for the mean vs. the median?
Fig. 3: There's a linear feature in panel f) at about +0.3 on the vertical axis, running from -1 to 0 on the horizontal axis. Do you have any ideas what this artifact might be?
Table 3: in the previous section (lines 200-201), you mention that you used the non-void pixels for training (~67%), then split the void pixels roughly 50-50 into testing and validation datasets. Here in Table 3, you've reported a single void metric rather than maintaining this split. Is the "void" value reported here only for the testing set (i.e., the void pixels not used to "tune" the model)?
Table 4: Is the standard deviation value for the hypsometric method in the W. Himalaya a typo? Looking at the corresponding violin plot, I can't see how the standard deviation of the pink plots would be an order of magnitude larger than the other two methods, especially when the other statistics are so similar.
Additional references
---------------------
Berthier E and others (2023) Measuring glacier mass changes from space—a review. Reports on Progress in Physics 86(3), 036801. doi:10.1088/1361-6633/acaf8e.
Berthier E and others (2024) The Pléiades Glacier Observatory: high-resolution digital elevation models and ortho-imagery to monitor glacier change. The Cryosphere 18(12), 5551–5571. doi:10.5194/tc-18-5551-2024.
Shean D (2017) High Mountain Asia 8-meter DEMs Derived from Along-track Optical Imagery, Version 1. doi:10.5067/GSACB044M4PK.
Citation: https://doi.org/10.5194/egusphere-2025-4404-RC2
Model code and software
Ensemble Machine Learning for Void Filling, Cameron Markovsky, https://github.com/cmarkovsky/ensemble_void_fill
Review of “An ensemble machine learning approach for filling voids in surface elevation change maps over glacier surfaces”, Markovsky et al., submitted to “The Cryosphere”
by Romain Hugonnet, University of Alaska Fairbanks
General comment:
This study by Markovsky and co-authors explores a valuable topic for the glaciological community. In order to provide constrained estimates of past glacier mass/volume change, we need benchmarked void-filling approaches for glacier elevation changes. This is especially important as voids are increasingly common when using historical archives, which hold the potential to unlock decades of past glacier change. Thus, the work assesses the performance of an existing machine-learning approach that uses exclusively topographical variables, with the aim of improving on the simple binning approaches currently used for void-filling.
The study is clear, well-structured and reads well. However, I think important issues regarding its presentation and the reliability of its statistical analysis need to be addressed.
Firstly, the core finding (which is that there is only a marginal improvement in prediction using the machine-learning approach) is somewhat misrepresented, especially in the abstract (but less in the discussion/conclusion). If it is the authors’ intention to publish negative/neutral results, I think it is OK (and an important part of the scientific process), but that would need to be conveyed directly.
Secondly, I believe that the current statistical analysis has major limitations that have not been fully identified or discussed by the authors, namely: 1/ The test data is itself very noisy, so measurement errors are mixed with prediction errors, preventing a reliable statistical validation, 2/ This type of machine-learning approach is known to suffer from error autocorrelation and training regionalization, and can thus be poorly fit to provide uncertainty estimates, which is not discussed.
Finally, while the text reads well, I found that it greatly lacked diversity in the scope of its discussion and was somewhat biased in its scientific references (keeping to only 20 citations and omitting highly relevant work).
Major comments:
1/ Describing the prediction improvement (or lack thereof) accurately
From a statistical viewpoint, the improvement in prediction compared to the widely-used hypsometric method is clearly marginal:
- Per-pixel: RMSE 0.379 vs 0.328, i.e. barely 15%,
- Glacier-wide: Basically no difference (one region slightly worse, one region slightly better).
This “neutral finding” is not conveyed accurately by the authors, especially in their abstract. Compounded with the fact that the per-pixel data is noisy and thus does not necessarily represent true elevation changes (see next comment for details), the statistical significance of these results is quite limited.
While I suspect this could be partly due to the noisy test data, it might be that the prediction performance is also irreducible when using (almost) only topographical variables. If so, that would be an interesting finding in itself: the hypsometric method is largely sufficient when using only topographical characteristics, especially for glacier-wide estimates (which is what is currently used for model calibration). I know the authors put effort into developing a new prediction approach for this study, and thus a conclusion conveying only a narrow improvement is difficult to put forward, but negative/neutral conclusions are not a bad thing in research and should be clearly reported.
2/ Poor statistical validation due to noisy test data
As the core data for their entire analysis, the authors use elevation change estimates from Shean et al. (2020), which are derived mostly from ASTER DEMs known to be very noisy (Girod et al., 2017). In High Mountain Asia in particular, where many accumulation areas are extremely bright, ASTER cannot resolve high elevations reliably. This means that the input data of the authors (used for training/validation) is itself affected by measurement errors often higher than the elevation change signal itself, especially at the pixel scale. Therefore, this data is poorly adapted to study potential improvements in per-pixel prediction over hypsometric gap-filling. In Shean et al. (2020) (or Hugonnet et al. (2021), who performed a similar analysis with more validation and interpretation regarding errors), it is only by spatially aggregating many pixels that random errors cancel out (depending on their spatial autocorrelation) and that reliable glacier-wide estimates can eventually be derived with ASTER. (This is also why it was less of an issue for McNabb et al. 2019, which was mostly concerned with glacier-wide estimates.)
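To make the aggregation argument explicit: under a simplified geostatistical model (e.g., Rolstad et al., 2009; the exact form depends on the assumed variogram), the uncertainty of the glacier-wide mean elevation change scales with an effective, not raw, sample size,

$$ \sigma_{\overline{\Delta h}} \approx \frac{\sigma_{\Delta h}}{\sqrt{N_{\mathrm{eff}}}}, \qquad N_{\mathrm{eff}} \sim \frac{A}{\pi L^2}, $$

where $A$ is the glacier area and $L$ the correlation length of the errors, so that metre-scale per-pixel noise can still average down to reliable glacier-wide means on large glaciers.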
To address this issue of test data, I think the authors have several options:
- Ideally, I'd recommend using only high-resolution elevation changes, either from local surveys (lidar, aerial) or from high-resolution DEMs such as those of the Pléiades Glacier Observatory that are distributed at various sites globally (Berthier et al., 2024).
- Otherwise, as a "drop-in replacement" for the same region, the authors could potentially still use ASTER elevation change products, but would need to filter pixels with very high uncertainty relative to the signal. For this, the authors need a predicted uncertainty at the pixel level, which is not available from Shean et al. (2020). Hugonnet et al. (2021) provides uncertainty products based on a validated empirical framework, where the per-pixel uncertainty varies with slope and quality of stereo-correlation (Hugonnet et al., 2022) and is propagated during the temporal fit, with validation against high-precision measurements. However, using this data, the authors should expect to have to remove a large part of the dataset, including many of the accumulation areas they focused on, or to partition the relative per-pixel errors into input error and prediction error (more difficult)…
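A sketch of the suggested drop-in filtering, assuming per-pixel dh/dt and its 1-sigma uncertainty from such products (array names and threshold are hypothetical):

```python
import numpy as np

k = 2.0                                   # signal-to-error threshold (hypothetical)
reliable = np.abs(dhdt) > k * sigma_dhdt  # keep pixels where signal exceeds noise
dhdt_filtered = np.where(reliable, dhdt, np.nan)
# As noted above, this will likely discard much of the accumulation areas,
# where the elevation-change signal is small relative to the ASTER noise.
```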
3/ Relevance of the machine-learning approach and its validation
While the authors mostly praise the (potential) advantage of their approach, they fail to discuss known limitations. Many machine-learning methods have been shown to underperform in specific applications in geoscience, which include in particular variables prone to error autocorrelation, or subject to difficult regionalization during training (e.g. review by Hoffimann et al., 2021). Glacier elevation changes have errors that are highly autocorrelated, whether from noise in the DEMs (e.g., Rolstad et al., 2009; Hugonnet et al., 2022), or simply by adding error during temporal prediction, so the first limitation is highly relevant and potentially quite limiting here. Regionalization is also an issue here, given that elevation changes vary significantly from region to region (polar ice caps, alpine glaciers, tidewater glaciers), but also because the authors chose to only focus on upper-area voids (while voids can exist everywhere due to acquisition swath, see my line-to-line comment later) and chose a fixed relative size (37% of the accumulation area, defined as upper 50%).
In particular, providing reliable uncertainty estimates is something that this type of machine-learning approach can struggle with (by significantly overfitting the autocorrelated data), contrary to other machine-learning approaches (such as Gaussian Processes). As the reported improvement in prediction is marginal compared to hypsometric methods, I would argue that improving our estimate of the uncertainty in the prediction is currently as important (if not more so) as further improving the prediction itself, a topic that was covered briefly in Seehaus et al. (2020). However, this topic is omitted in the present manuscript.
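For contrast, a minimal sketch of a Gaussian-process regressor in scikit-learn, which returns a per-pixel predictive standard deviation alongside the mean (array names hypothetical; exact GPs scale poorly beyond ~10^4 training points without approximations):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# WhiteKernel absorbs the (in reality autocorrelated) observation noise
kernel = RBF(length_scale=500.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)
y_pred, y_std = gp.predict(X_void, return_std=True)  # prediction + 1-sigma
```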
All of these limitations should be thoroughly discussed, and the analysis expanded accordingly (e.g., using a varying size of void and not a fixed 37%).
Additionally, concerning the validation:
- Per-pixel accuracy analysis: RMSE and MAE are both pretty bad metrics as they mix random and systematic errors; consider reporting primarily the mean (or median) and standard deviation (or NMAD) of residuals, which capture both independently, as well as the metric used to optimize/learn (see the sketch after this list).
- Glacier-wide analysis: Good inclusion by the authors, because glacier-wide accuracy is the most important output for total mass change. However, this analysis is very size-dependent (as mentioned above, errors cancel out over the glacier based on area), so the authors should not group glaciers of all sizes together, and rather study the performance depending on glacier size. Currently, the errors are probably entirely driven by those of tiny glaciers.
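A short sketch of the suggested residual metrics, where NMAD (normalized median absolute deviation) is a robust analogue of the standard deviation (array names hypothetical):

```python
import numpy as np

res = y_pred - y_obs                           # prediction residuals
bias = np.median(res)                          # systematic component
nmad = 1.4826 * np.median(np.abs(res - bias))  # robust random component
mean_err, std_err = res.mean(), res.std()      # non-robust equivalents
```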
4/ Biased references
The authors repeatedly cite a few references for very different aspects of their study, omitting other relevant studies in the literature, sometimes even those at the origin of a given method. A couple of examples:
- McNabb et al. (2019) is used for the gap-filling methods, without citing original references,
- Shean et al. (2020) is used for most of the work on HMA/remote sensing, even when not especially relevant,
- Maurer et al. (2019) for everything historical and DEM-processing, even for methods widely used in much earlier and more generic studies.
The authors should diversify their citations, and find the original references for a given method or processing step (sometimes cited within the study they cite). I have included some of these references below in line-by-line comments, but I didn’t elaborate on all, and there are many more to address across the manuscript.
Line-by-line comments:
23: The reference to the old Bamber & Rivera review feels a bit specific, given that the end statement is about density. For density, cite for example Huss (2013) that is the most widely used. To add a more recent review including DEM differencing, cite for example Berthier et al. (2023).
26-31: In the whole section, it is not explained that the voids "predominant in accumulation areas" and later described as "common in historical images" are directly due to limits of stereophotogrammetry (this key term almost never appears in the manuscript) performed on optical imagery (thus including historical archives) to generate DEMs. This needs to be clarified. But beyond this, voids also exist in every large-scale (i.e., satellite) DEM simply because of fixed-width satellite swaths during acquisition, no matter the instrument (optical, radar).
60: Shean et al. (2020) is clearly not the right citation for this statement… There are extensive reviews, inventories, or other studies better adapted to describing HMA glaciers as a whole.
132: How is the artificial void grown from the seed? I assume you use a flood-filling (or seed-filling) algorithm with 4- or 8-pixel connectivity? If yes, describe which and include the appropriate reference, such as Newman and Sproull (1979).
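For illustration, a minimal 4-connected seed-fill of the kind this comment presumes (BFS from a seed pixel until a target void size is reached; function and variable names are hypothetical, not the authors' implementation):

```python
from collections import deque
import numpy as np

def grow_void(valid, seed, target_size):
    """Grow a contiguous void from `seed` over True pixels of `valid` (4-connected BFS)."""
    void = np.zeros_like(valid, dtype=bool)
    queue = deque([seed])
    n = 0
    while queue and n < target_size:
        r, c = queue.popleft()
        if not (0 <= r < valid.shape[0] and 0 <= c < valid.shape[1]):
            continue
        if void[r, c] or not valid[r, c]:
            continue
        void[r, c] = True
        n += 1
        queue.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return void
```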
145-154: The first occurrence of hypsometric void filling in glaciology is, to my knowledge, Arendt et al. (2002), and the elevation dependency has been extensively described long before the citations mentioned (Jakob et al., 2021, or McGrath et al., 2017; which can be removed), especially for spatial extrapolation. See for instance Huss (2012).
179-185: Those components are also called “Northness” and “Eastness”.
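For the record, with aspect $\alpha$ in radians (and slope $s$), these components are commonly defined as

$$ N = \cos\alpha, \qquad E = \sin\alpha, $$

sometimes weighted by $\sin s$ so that flat terrain carries no directional signal.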
Fig. 3: Add a colormap for the density, even if it is a linear scale?
Table 4: There’s probably an error in the reported value of the STD of Western Himalaya/Hypsometric (it is an order of magnitude above all other STDs).
New references from this review
Arendt, A. et al. (2002). Rapid Wastage of Alaska Glaciers and Their Contribution to Rising Sea Level. Science, 297, 382–386. https://doi.org/10.1126/science.1072497
Berthier, E., Floricioiu, D., Gardner, A. S., Gourmelen, N., Jakob, L., Paul, F., Treichler, D., Wouters, B., Belart, J. M. C., Dehecq, A., Dussaillant, I., Hugonnet, R., Kääb, A., Krieger, L., Pálsson, F., & Zemp, M. (2023). Measuring glacier mass changes from space—a review. Reports on Progress in Physics, 86(3), 036801. https://doi.org/10.1088/1361-6633/acaf8e
Girod, L., Nuth, C., Kääb, A., McNabb, R., & Galland, O. (2017). MMASTER: Improved ASTER DEMs for Elevation Change Monitoring. Remote Sensing, 9(7), 704. https://doi.org/10.3390/rs9070704
Hoffimann, J., Zortea, M., de Carvalho, B., & Zadrozny, B. (2021). Geostatistical Learning: Challenges and Opportunities. Frontiers in Applied Mathematics and Statistics, 7. https://doi.org/10.3389/fams.2021.689393
Huss, M. (2012). Extrapolating glacier mass balance to the mountain-range scale: the European Alps 1900–2100. The Cryosphere, 6, 713–727. https://doi.org/10.5194/tc-6-713-2012
Huss, M. (2013). Density assumptions for converting geodetic glacier volume change to mass change. The Cryosphere, 7(3), 877–887. https://doi.org/10.5194/tc-7-877-2013
Hugonnet, R., Brun, F., Berthier, E., Dehecq, A., Mannerfelt, E. S., Eckert, N., & Farinotti, D. (2022). Uncertainty Analysis of Digital Elevation Models by Spatial Inference From Stable Terrain. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15, 6456–6472. https://doi.org/10.1109/JSTARS.2022.3188922
Hugonnet, R., McNabb, R., Berthier, E., Menounos, B., Nuth, C., Girod, L., Farinotti, D., Huss, M., Dussaillant, I., Brun, F., & Kääb, A. (2021). Accelerated global glacier mass loss in the early twenty-first century. Nature, 592(7856), 726–731. https://doi.org/10.1038/s41586-021-03436-z
Newman, William M; Sproull, Robert Fletcher (1979). Principles of Interactive Computer Graphics (2nd ed.). McGraw-Hill. p. 253. ISBN 978-0-07-046338-7.
Rolstad, C., Haug, T., & Denby, B. (2009). Spatially integrated geodetic glacier mass balance and its uncertainty based on geostatistical analysis: Application to the western Svartisen ice cap, Norway. Journal of Glaciology, 55(192), 666–680. https://doi.org/10.3189/002214309789470950