the Creative Commons Attribution 4.0 License.
An ensemble machine learning approach for filling voids in surface elevation change maps over glacier surfaces
Abstract. Glacier mass balance assessments in mountainous regions often rely on digital elevation models (DEMs) to estimate surface elevation change. However, these DEMs are prone to spatial data voids, particularly during historical reconstructions using older imagery. These voids, which are most common in glacier accumulation zones, introduce uncertainty into estimates of glacier mass balance and surface elevation change. Traditional void-filling methods, such as constant and hypsometric interpolation, have limitations in capturing spatial variability in elevation change. This study introduces a machine-learning-based approach using gradient-boosted tree regression (XGBoost) to estimate glacier surface-elevation change across voids. High Mountain Asia (HMA) is an ideal study area for assessing the accuracy of different void-filling approaches across glaciers with varying morphology and climatic settings. We compare XGBoost predictions to traditional void-filling methods across the Western and Eastern Himalayas using a dataset of DEM-derived elevation changes. Results indicate that XGBoost consistently outperforms simpler methods, reducing root mean square error (RMSE) and mean absolute error (MAE) while improving alignment with observed elevation changes. The study highlights the advantages of integrating multiple glaciological and topographic predictors, demonstrating the potential of machine learning to improve assessments of glacier mass balance and elevation change. Future research should explore additional predictors, such as climate data, to further enhance predictive accuracy.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4404', Romain Hugonnet, 18 Dec 2025
- AC1: 'Reply on RC1', Cameron Markovsky, 03 Feb 2026
General Response
We thank the reviewer for their thoughtful and constructive comments. In response, we have substantially revised both the artificial void generation strategy and the model validation/training framework to address concerns about spatial autocorrelation, the representativeness of voids, and the analysis of model results. Moreover, pixel value filtering has been removed because some erratic values appeared to be caused by a projection misalignment, an issue that has been remedied and will be reflected in the revised manuscript.
One of the most important changes in our revised workflow is the shift from a pixel-random validation paradigm to a glacier-aware experimental design. Artificial voids are now generated using a stochastic, elevation-weighted, morphology-constrained approach that produces a distribution of void sizes, shapes, and elevation ranges consistent with historical DEM voids, rather than a fixed fraction of contiguous void masking. This change aims to still generate most of the voids in the glacier's accumulation zone while imposing less stringent constraints on void location and geometry. In addition, we have updated the workflow to generate multiple void realizations. By testing each method across different stochastic void realizations, we hope to better quantify the uncertainty associated with each method.
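For illustration, the elevation-weighted seeding step described above could look like the following minimal sketch (the function name `sample_void_seeds` and the rank-based weighting are hypothetical; the actual generator also applies morphology constraints and stochastic void growth, which are omitted here):

```python
import numpy as np

def sample_void_seeds(elevation, n_seeds, rng=None):
    """Sample void seed pixels with probability weighted toward high elevations.

    Illustrative sketch only: accumulation-zone pixels are preferred, but
    lower-elevation pixels retain a nonzero probability, so voids can also
    start in mid- to low-elevation zones.
    """
    rng = np.random.default_rng(rng)
    flat = elevation.ravel()
    # Weight each pixel by its normalized elevation rank (0 = lowest pixel).
    ranks = flat.argsort().argsort().astype(float)
    weights = ranks / ranks.sum()
    idx = rng.choice(flat.size, size=n_seeds, replace=False, p=weights)
    return np.unravel_index(idx, elevation.shape)
```

Each seed would then be grown into a contiguous void region, with multiple stochastic realizations generated per glacier.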
Correspondingly, the model training and validation strategy has been redesigned to reduce spatial leakage and overly optimistic skill estimates. Training, validation, and testing are now separated using glacier-level partitioning, rather than random pixel-level splits. Validation is performed on entire, unseen voids on entirely unseen glaciers. This directly addresses the reviewer’s concern that the previous validation primarily measured short-range interpolation skill rather than reconstruction of large contiguous gaps.
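A glacier-level partition of this kind can be sketched as follows (function name and split fractions are illustrative, not the manuscript's exact code):

```python
import numpy as np

def glacier_level_split(glacier_ids, frac_train=0.7, frac_val=0.15, rng=None):
    """Partition pixels by glacier so no glacier spans train/val/test.

    `glacier_ids` is a 1-D array giving the glacier each pixel belongs to.
    Returns boolean pixel masks (train, val, test). Sketch of the
    glacier-aware design described above, not the authors' actual code.
    """
    rng = np.random.default_rng(rng)
    glaciers = np.unique(glacier_ids)
    rng.shuffle(glaciers)
    n_train = int(frac_train * glaciers.size)
    n_val = int(frac_val * glaciers.size)
    # Assign whole glaciers to each fold, then map back to pixels.
    train = np.isin(glacier_ids, glaciers[:n_train])
    val = np.isin(glacier_ids, glaciers[n_train:n_train + n_val])
    test = ~(train | val)
    return train, val, test
```

Because every pixel of a glacier lands in exactly one fold, validation and test skill reflect reconstruction on unseen glaciers rather than short-range interpolation between neighbouring pixels.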
Finally, we have revised the framing of the results and conclusions to emphasize that the machine-learning approach provides incremental but robust improvements, particularly in spatial realism and pixel-scale structure, rather than dramatic glacier-wide mass-balance differences relative to hypsometric methods. The results will be updated to reflect the new workflow, and the language used to report them will better reflect the improvement (if any) of the machine-learning method over traditional methods.
Major Comment Responses
- Misrepresentation of marginal improvement
Reviewer comment:
The improvement relative to hypsometric methods is marginal and not accurately conveyed, especially in the abstract.
Response:
We agree that the language used in the abstract and conclusions does not accurately portray the results. Once the results have been updated to reflect the new artificial void generation and model-validation workflows, the manuscript will be edited to better reflect the incremental (if any) improvement provided by the machine learning method. Specifically, major sections of the abstract and conclusions will be revised to more clearly characterize the findings without overstating the machine-learning method's performance. Moreover, the discussion of when the hypsometric method may be suitable compared to the ML-based method will be expanded.
- Poor statistical validation due to noisy ASTER test data
Reviewer comment:
ASTER-derived dh/dt is too noisy for pixel-scale validation; measurement error overwhelms signal.
Response:
We agree and have clarified this limitation throughout the manuscript. While higher-resolution DEMs (e.g., from the Pléiades Glacier Observatory) would provide a superior benchmark, our study focuses on how these methods perform on noisy datasets. Although the spatial resolution of some historical satellite products (e.g., HEXAGON imagery) is high (~8 m), these products often contain artifacts from processing or digitization. We will revise the text to state explicitly that ASTER-derived elevation change products are not treated as error-free truth, but rather as representative of the data quality commonly available for large-scale historical reconstructions. While beyond the scope of this paper, we appreciate the recommendation and are currently exploring the PGO dataset for future work.
We appreciate the interesting points and directions you raise when discussing using a different dataset or including the uncertainty estimates from your previous paper. To address these concerns without removing large parts of the dataset, we plan to add an analysis that examines each pixel's error relative to its associated uncertainty. This addition aims to better contextualize the noise level in the dataset and provide insight into how measurement uncertainties propagate through each void-filling method.
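As a hypothetical sketch of the proposed pixel-wise analysis, the error at each pixel can be scaled by its reported 1-sigma uncertainty; for well-calibrated uncertainties, roughly 68% of the standardized errors should fall within ±1:

```python
import numpy as np

def standardized_errors(predicted, observed, sigma):
    """Per-pixel error scaled by its reported 1-sigma uncertainty.

    Illustrative sketch of the proposed analysis: if the errors are
    consistent with the stated uncertainties, about 68% of |z| values
    fall below 1 and about 95% below 2.
    """
    z = (predicted - observed) / sigma
    coverage_1s = np.mean(np.abs(z) <= 1.0)
    coverage_2s = np.mean(np.abs(z) <= 2.0)
    return z, coverage_1s, coverage_2s
```

Coverage well below these nominal values would indicate that method errors exceed the stated measurement uncertainty, rather than being hidden within it.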
- ML limitations, autocorrelation, and uncertainty
Reviewer comment:
Machine learning methods struggle with autocorrelated errors, regionalization, and uncertainty estimation, which are not discussed.
Response:
We agree and will substantially expand on these limitations when introducing the machine learning method. Specifically, we now explicitly discuss the presence of spatial autocorrelation in DEM errors, its implications for ML validation, the challenges of training a single model across heterogeneous glacier regimes, and the difficulty of deriving reliable uncertainty estimates from tree-based ML methods. Moreover, the revised workflow addresses some of these issues by introducing voids of multiple sizes and not limiting them to the accumulation zone. In addition, our new workflow provides uncertainty quantifications by introducing multiple void realizations for each glacier. By comparing multiple void realizations, we aim to avoid a “lucky seed” and instead assess how each method performs on the same glacier with artificial voids introduced at different sizes and locations. In addition, we will revise the manuscript to explicitly state that uncertainty estimation is as important as improving prediction accuracy and that other methods, such as a Gaussian Process, are better suited to this task.
- Biased and insufficient references
Reviewer comment:
Over-reliance on a small number of studies and omission of foundational work.
Response:
The revised manuscript will diversify its citations. Specifically, we’ve included original references for hypsometric methods, DEM differencing, and historical reconstructions. In addition, we’ve added new references on HMA/remote sensing, incorporated many of the reviewer’s suggested references, and reduced the number of redundant citations.
Line-by-Line Responses
- 23: Density references
The manuscript has been revised to include Huss (2013) in the background discussion of glacier mass balance assessments and Berthier et al. (2023) in the discussion of DEM differencing.
- 26–31: Origin of voids in spatial data
We have clarified that voids arise primarily from limitations of stereophotogrammetry (and added this term throughout the background discussion). Specifically, we’ve revised the text to frame voids as an inherent aspect of satellite products resulting from fixed-width swaths, and we've included references for this discussion.
- 60: HMA citations
We have expanded and diversified the citations to include other studies, such as Hugonnet et al. (2021).
- 132: Void growth algorithm
We have expanded this explanation to clarify that we use an 8-connectivity seed-fill algorithm for artificial void generation and have included a reference to Newman et al. (1979).
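A minimal sketch of an 8-connectivity seed fill (illustrative only; the manuscript's generator adds size and morphology constraints on top of this step):

```python
from collections import deque

def seed_fill_8(mask, seed, max_pixels):
    """Grow a contiguous region from `seed` using 8-connectivity.

    `mask` is a 2-D boolean grid of pixels allowed to become void (e.g.
    the glacier outline); growth stops once `max_pixels` is reached.
    Sketch of the seed-fill step, not the manuscript's exact code.
    """
    rows, cols = len(mask), len(mask[0])
    region = set()
    queue = deque([seed])
    while queue and len(region) < max_pixels:
        r, c = queue.popleft()
        if (r, c) in region or not (0 <= r < rows and 0 <= c < cols):
            continue
        if not mask[r][c]:
            continue
        region.add((r, c))
        # All 8 neighbours: orthogonal and diagonal.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr or dc:
                    queue.append((r + dr, c + dc))
    return region
```

With 8-connectivity, diagonally adjacent pixels are treated as connected, so voids can grow across diagonal links that a 4-connectivity fill would miss.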
- 145–154: Hypsometric void filling foundational references
We’ve added references to Arendt et al. (2002) and Huss (2012) as foundational references and reduced later citations to more recent studies.
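For readers unfamiliar with the method, the core of (local) hypsometric void filling is simply binning observed elevation change by elevation and filling voids with the bin statistic. A minimal sketch, omitting the handling of empty bins (e.g. interpolation between neighbouring bins) that real implementations require:

```python
import numpy as np

def hypsometric_fill(dh, elevation, void, bin_width=50.0):
    """Fill voids in dh with the mean of observed dh in each elevation bin.

    `dh`, `elevation`, and boolean `void` are arrays of the same shape.
    Illustrative sketch of the classic hypsometric method only.
    """
    filled = dh.copy()
    bins = np.floor(elevation / bin_width)
    for b in np.unique(bins[void]):
        in_bin = bins == b
        obs = in_bin & ~void
        if obs.any():
            # Replace void pixels in this band with the band's observed mean.
            filled[in_bin & void] = dh[obs].mean()
    return filled
```

When whole elevation bands are voided, there are no observed pixels in those bins, which is exactly the regime where interpolation between bins becomes necessary and errors grow.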
- 179–185: Aspect components
We’ve clarified the terminology of aspect components as “northness” and “eastness.”
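Because aspect is circular (0° and 360° are the same direction), it is commonly encoded as these two components before being used as a model feature. A small sketch (the optional slope weighting is a common variant in the literature, not necessarily the manuscript's choice):

```python
import numpy as np

def aspect_components(aspect_deg, slope_deg=None):
    """Decompose aspect (degrees clockwise from north) into northness/eastness.

    northness = cos(aspect), eastness = sin(aspect). Optionally weight by
    sin(slope) so flat terrain, where aspect is ill-defined, contributes less.
    """
    a = np.radians(aspect_deg)
    northness = np.cos(a)
    eastness = np.sin(a)
    if slope_deg is not None:
        w = np.sin(np.radians(slope_deg))
        northness, eastness = northness * w, eastness * w
    return northness, eastness
```

This avoids the discontinuity a raw aspect feature would have at the 0°/360° boundary.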
Figure 3
We’ve updated this figure to include a density colormap.
Table 4
This is a typo; the standard deviation of the original results should be 0.086, rather than 0.86. We will update this table with the new results and double-check/explain any odd values.
Citation: https://doi.org/10.5194/egusphere-2025-4404-AC1
- CC1: 'Comment on egusphere-2025-4404', Codrut-Andrei Diaconu, 18 Dec 2025
Dear authors,
Thank you for sharing this very interesting preprint and for making your code and data processing pipeline available. I have a question about the hyperparameter optimization (HPO) and validation strategy in Section 3.4.2, and how it relates to the strongly spatially correlated nature of the data.
As I understand it, your scheme is:
- training data: all non-void pixels;
- validation + test data: all artificially voided pixels, randomly split 50/50 into a validation set (used for Optuna and early stopping) and a test set (used for the final skill assessment in the voided areas).
Given that pixels are very strongly spatially correlated, with a random split at pixel level between validation and test, both the HPO and the final evaluation are sampling from essentially the same spatial structures. So even if the model is successful, this mainly answers “How well can I interpolate at ~pixel-scale inside the accumulation area?” (where validation and test pixels are close neighbours inside the same voids and glaciers), versus “How well can I reconstruct an entire large contiguous gap?” (e.g. a big accumulation-area void on a glacier that was never used in HPO).
If the method is addressing the second question, then I believe the valid/test split should be different and I was wondering whether you have considered (or could comment on) alternative validation schemes. For instance:
- Using a subset of the covered (non-void) pixels as the validation set for HPO (basically a subset of the current training data). In this setup, Optuna would see only non-void pixels, and the entire void mask would be reserved for testing.
- Splitting at the glacier level rather than at the pixel level. For example, randomly splitting glaciers 50/50 into two folds, generating voids on all glaciers, and then using all void pixels from fold A as the validation set for HPO, and all void pixels from fold B as the test set. This might be easier than the previous setup (a pure extrapolation exercise) and also closer to the situation where only a subset of glaciers in a region have large gaps in their accumulation areas (which might be more realistic as I imagine in practice not all the glaciers have gaps in the accumulation areas).
Finally, would it be possible to report also the R^2 for the predictions on top of RMSE/MAE? This would make it easier to compare with results from other regions, which might have different dh/dt ranges.
Again, many thanks again for the very clear paper and open workflow, and I’d be very interested in your thoughts on this point.
Best regards,
Codrut-Andrei Diaconu
Citation: https://doi.org/10.5194/egusphere-2025-4404-CC1
- AC3: 'Reply on CC1', Cameron Markovsky, 03 Feb 2026
Hello Codrut-Andrei,
Thank you very much for your comment. We agree with the issues you point out regarding the current model validation scheme. Splitting the validation and test pixels was not the correct setup for testing the effectiveness of these different methods on contiguous voids across various glaciers. After this review, we have completely reworked both the artificial void generation process and model validation scheme to address the limitations of the current methodology. Specifically, the new workflow uses a glacier-level training/validation/test split as you suggested. A more detailed write-up of these changes is included in our response to the reviewer comments. Moreover, we will update the results with additional error metrics, such as R^2 and NMAD, which are more appropriate for these data and transfer better across regions.
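For reference, the two additional metrics can be computed as follows (a generic sketch, not our evaluation code):

```python
import numpy as np

def nmad(residuals):
    """Normalized median absolute deviation, a robust spread estimator.

    NMAD = 1.4826 * median(|r - median(r)|); it matches the standard
    deviation for Gaussian residuals but is far less sensitive to outliers.
    """
    r = np.asarray(residuals, dtype=float)
    return 1.4826 * np.median(np.abs(r - np.median(r)))

def r_squared(observed, predicted):
    """Coefficient of determination relative to the mean of the observations."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Because R^2 is normalized by the variance of the observations, it transfers between regions with different dh/dt ranges better than RMSE or MAE alone.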
Thank you again for your thoughtful comment and suggestions. Please feel free to reach out if you have other questions/suggestions.
Best,
Cameron Markovsky
Citation: https://doi.org/10.5194/egusphere-2025-4404-AC3
- RC2: 'Comment on egusphere-2025-4404', Robert McNabb, 06 Jan 2026
I think this is a very interesting study that builds on and extends previous work on void-filling methods for estimating elevation change and geodetic mass balance by applying a machine learning approach and comparing the results to other, simpler methods. The topic is of great value for the community, and I agree with the authors that improving these approaches is important, especially when attempting to use historical imagery to reconstruct past glacier changes. While the study is well-written and interesting and the results are encouraging, I think there are some potential issues with the reporting and the datasets used that need to be considered or addressed.
General comments
----------------
1. While the results are promising, I think that the tone of the findings needs to be tempered somewhat. While there is a difference between the RMSE values for the hypsometric approach and the XGBoost approach as reported in Table 3, they are fairly small. On a glacier-wide scale (Table 4, Figure 4), the differences are barely perceptible, implying that the hypsometric method is able to capture the majority of the variability in elevation change, and the improvement provided by the XGBoost method is minor in comparison to the other uncertainties in the dataset. This is itself an interesting result, but I think that the findings in the abstract and conclusions need to be softened.
2. I don't know if using the ASTER-derived dataset from Shean et al. (2020) is necessarily the best choice to use here. As much as I love the ASTER-derived results, they can be very noisy. They are also already interpolated (linearly in time rather than spatially), which could pose additional problems when trying to train a model for predicting elevation change. I think it might be beneficial to use a smaller but more detailed dataset such as the HMA DEMs (Shean, 2017) or Pléiades Glacier Observatory DEMs (Berthier et al. 2024). A combination of these high-resolution DEMs, differenced to the SRTM, might provide a cleaner, if smaller, input dataset to train and evaluate the model from.
3. I think that the section on model interpretability is very interesting and useful, but I don't understand why this is only done for two glaciers, rather than a much larger sample or even the entire dataset. If it's not possible to apply this to a larger sample, then this at least needs to be stated/explained. Assuming that it is possible, though, I think it would be highly beneficial to look deeper into the variability of feature importance over larger areas.
4. It appears that the method for generating artificial voids covers entire sections of the accumulation area of each glacier, which might also be much larger/more continuous than is needed. They are also only in the accumulation area, which means that the training data is more biased toward ablation areas. I think it would be helpful to compare examples of the voids generated by your approach with the voids included in historical datasets such as Maurer et al. 2019 - how well do your artificial voids capture the variability shown in that dataset? How often do you see voids at lower elevations? I think that a mix of void sizes and extents might be worth including.
Specific comments
-----------------
l. 11-12: I agree that integrating multiple glaciological and topographic predictors is important and advantageous, but the results shown in section 4.3 would suggest that at least in those two detailed cases, x,y,z are the most important features in predicting elevation change, and the other glaciological/topographic predictors are less important. As with the main finding, I think this needs to be softened somewhat.
l. 23: A better citation here would be Berthier et al. 2023, I think - it's both more recent and more tailored to the topic of DEM differencing and geodetic mass balance
l. 40-50: McNabb et al. 2019 here is over-cited. There are many more appropriate references to discuss/cite individual methods, such as the original studies cited by McNabb et al., or a large number of studies published since 2019.
l. 83: why not use spatially-distributed maps of debris cover to assign "debris cover" as a feature for each pixel/point, rather than aggregating to the entire glacier? I think this might have more predictive power for individual pixels, and it should be easy enough to include given that the dataset cited here includes the spatial coverage.
l. 86: how did you determine this? RGI v6.0 does not include any information about "evidence of calving" for glaciers in HMA, as lake-terminating status is only included in a few regions for v6.0.
l. 97-102: This would seem to introduce additional voids into your dataset - how is this handled? Do you just remove those pixels/points from consideration entirely?
l. 108-111: why were these datasets removed? Also, why are you removing glaciers for being heavily debris-covered? This wasn't mentioned in the previous section.
l. 133: why 37%? Is this a constant value for all glaciers, meaning that your "testing" dataset is ~37% of all on-glacier pixels? Additionally, given that this appears to cover entire sections of the upper glacier, I wonder if some of the differences seen between the XGBoost and hypsometric methods aren't due in part to the linear interpolation required to fill the missing elevation bands - have you looked at this as well?
l. 144, elsewhere: define "outperformed" - do you mean that the RMSE and MAE were lower for the mean vs. the median?
Fig. 3: There's a linear feature in panel f) at about +0.3 on the vertical axis, running from -1 to 0 on the horizontal axis. Do you have any ideas what this artifact might be?
Table 3: in the previous section (lines 200-201), you mention that you used the non-void pixels for training (~67%), then split the void pixels roughly 50-50 into testing and validation datasets. Here in Table 3, you've reported a single void metric rather than maintaining this split. Is the "void" value reported here only for the testing set (i.e., the void pixels not used to "tune" the model)?
Table 4: Is the standard deviation value for the hypsometric method in the W. Himalaya a typo? Looking at the corresponding violin plot, I can't see how the standard deviation of the pink plots would be an order of magnitude larger than the other two methods, especially when the other statistics are so similar.
Additional references
---------------------
Berthier E and others (2023) Measuring glacier mass changes from space—a review. Reports on Progress in Physics 86(3), 036801. doi:10.1088/1361-6633/acaf8e.
Berthier E and others (2024) The Pléiades Glacier Observatory: high-resolution digital elevation models and ortho-imagery to monitor glacier change. The Cryosphere 18(12), 5551–5571. doi:10.5194/tc-18-5551-2024.
Shean D (2017) High Mountain Asia 8-meter DEMs Derived from Along-track Optical Imagery, Version 1. doi:10.5067/GSACB044M4PK.
Citation: https://doi.org/10.5194/egusphere-2025-4404-RC2
- AC2: 'Reply on RC2', Cameron Markovsky, 03 Feb 2026
General Response to the Reviewer
We thank the reviewer for their thoughtful and constructive comments. In response, we have substantially revised both the artificial void generation strategy and the model validation/training framework to address concerns about spatial autocorrelation, the representativeness of voids, and the analysis of model results. Moreover, pixel value filtering has been removed because some erratic values appeared to be caused by a projection misalignment, an issue that has been remedied and will be reflected in the revised manuscript.
One of the most important changes in our revised workflow is the shift from a pixel-random validation paradigm to a glacier-aware experimental design. Artificial voids are now generated using a stochastic, elevation-weighted, morphology-constrained approach that produces a distribution of void sizes, shapes, and elevation ranges consistent with historical DEM voids, rather than a fixed fraction of contiguous void masking. This change aims to still generate most of the voids in the glacier's accumulation zone while imposing less stringent constraints on void location and geometry. In addition, we have updated the workflow to generate multiple void realizations. By testing each method across different stochastic void realizations, we hope to better quantify the uncertainty associated with each method.
Correspondingly, the model training and validation strategy has been redesigned to reduce spatial leakage and overly optimistic skill estimates. Training, validation, and testing are now separated using glacier-level partitioning, rather than random pixel-level splits. Validation is performed on entire, unseen voids on entirely unseen glaciers. This directly addresses the reviewer’s concern that the previous validation primarily measured short-range interpolation skill rather than reconstruction of large contiguous gaps.
Finally, we have revised the framing of the results and conclusions to emphasize that the machine-learning approach provides incremental but robust improvements, particularly in spatial realism and pixel-scale structure, rather than dramatic glacier-wide mass-balance differences relative to hypsometric methods. The results will be updated to reflect the new workflow, and the language used to report them will better reflect the improvement (if any) of the machine-learning method over traditional methods.
General Comment Responses
General Comments:
- Tone of results and marginal improvement over hypsometric methods
Reviewer comment:
The RMSE differences are small, glacier-wide differences are barely perceptible, and conclusions should be softened.
Response:
We agree. In the revised manuscript, we have explicitly reframed the main contribution of the ML approach as improving the reconstruction of the spatial structure of elevation change and reducing bias within large voids, rather than large gains in glacier-wide mean mass balance. The analysis using the new void-generation methodology and training is still underway, but the abstract and conclusions will not overemphasize potential improvements over traditional methods.
- Choice of ASTER-derived dh/dt dataset
Reviewer comment:
ASTER data are noisy and already interpolated; higher-resolution DEMs might be preferable.
Response:
We agree and have clarified this limitation throughout the manuscript. While higher-resolution DEMs (e.g., from the Pléiades Glacier Observatory) would provide a superior benchmark, our study focuses on how these methods perform on noisy datasets. Although the spatial resolution of some historical satellite products (e.g., HEXAGON imagery) is high (~8 m), these products often contain artifacts from processing or digitization. We will revise the text to state explicitly that ASTER-derived elevation change products are not treated as error-free truth, but rather as representative of the data quality commonly available for large-scale historical reconstructions. Although this dataset is noisier than the Pléiades Glacier Observatory product, examining how these methods perform regionally across thousands of glaciers better aligns with our research goals of improving workflows for historical reconstructions. With this goal in mind, we retain this ASTER-based dh/dt product because it provides near-complete spatial coverage across many glaciers. While beyond the scope of this paper, we appreciate the recommendation and are currently exploring the PGO dataset, which we plan to use in a future study examining other void-filling methods and their region-to-region performance.
A further anticipated improvement is that the revised model-validation framework should reduce skill inflation caused by spatially correlated noise, thereby providing a more conservative test of model performance under noisy inputs. We also note that the revised artificial void-generation and validation workflow is dataset-independent and well suited for future application to higher-resolution DEM differencing products (e.g., Pléiades or HMA DEM stacks), which we flag as a next step rather than a prerequisite.
- Limited SHAP analysis to two glaciers
Reviewer comment:
Why only two glaciers? Is this scalable?
Response:
This is now explicitly addressed. In the revised manuscript, we clarify that the two glaciers are illustrative case studies, chosen to demonstrate interpretability across contrasting regimes. We also clarify that SHAP computation at the full regional scale is computationally expensive but feasible, and that the new workflow supports systematic sampling of representative glaciers and voids. After finishing runs with the new void-generation and model framework, we will examine the possibility of running the SHAP analysis for significantly more glaciers and giving regional summaries of variable importance.
- Artificial voids too large, too continuous, and only in accumulation zones
Reviewer comment:
Voids appear overly large, occur only in accumulation areas, and may be unrealistic.
Response:
This comment directly motivated the revised void-generation framework. Artificial voids are no longer defined as a fixed contiguous fraction of the upper glacier. Historical voids are predominantly in accumulation zones and tend to have large, contiguous regions. However, these voids vary in size and can extend into ablation zones. Our focus on these types of datasets informs our approach. We add a clearer discussion of the tendency for historical voids in the manuscript and revise our void-generation framework to allow voids outside the accumulation zone. Our new approach uses elevation-weighted sampling rather than a hard elevation cutoff. This means that voids are now allowed to extend into mid- to low-elevation zones, but with decreasing probability.
Line-by-Line Responses
- 11–12: Importance of x, y, z dominating other predictors
Response:
We have softened the language and clarified that spatial coordinates and elevation often act as proxies for unresolved climatic and topographic variables, rather than implying that other predictors are unimportant.
- 23: Citation update
Response:
We have updated the citation to include the suggested paper by Berthier et al. (2023).
- 40–50: Over-citation of McNabb et al. (2019)
Response:
We have reduced reliance on McNabb et al. (2019) and expanded citations to include a broader range of void-filling and DEM-differencing studies.
- 83: Glacier-wide vs spatially distributed debris cover
Response:
The initial decision to use a glacier-wide debris-cover value was made to limit computational complexity. However, with the improved computational efficiency of the new workflow, we now integrate spatially distributed supraglacial debris cover from Rounce et al. (2021) (https://doi.org/10.5067/8DQKWY03KJWT). The revised manuscript reflects this change and clarifies that this variable is now per-pixel rather than glacier-wide.
- 86: Evidence of calving in RGI v6
Response:
Thank you for pointing this out. This was an oversight when performing the initial data cleaning. The revised manuscript no longer aims to filter out lake-terminating glaciers (LTGs). Instead, we use an LTG database for High Mountain Asia (Yuo et al., 2025; https://doi.org/10.5281/zenodo.17369580) to assess whether differences exist between land- and lake-terminating glaciers in the results.
- 97–102: Additional voids introduced by filtering
Response:
The original methodology may have introduced additional voids by filtering out outlier pixels. Our new void generation workflow addresses this problem by eliminating the pixel filtering step and splitting the training/validation/test sets by glacier rather than by pixel.
- 108–111: Removal of debris-covered glaciers
Response:
The revised manuscript has been updated to reflect the change to a spatially distributed debris-cover dataset, which now contains data for each glacier in HMA. The previous language around the removal of glaciers without debris-cover data has been edited.
- 133: Why 37% void fraction and interaction with hypsometry
Response:
The original 37% void fraction was based on the mean/median(?) void sizes in Hexagon imagery. However, these void sizes vary significantly. The fixed 37% threshold has been removed. The new workflow uses stochastic void sizing to eliminate systematic biases associated with missing elevation bins and to reduce artificial inflation of hypsometric interpolation error.
- 144: Definition of “outperformed”
Response:
We now explicitly define performance metrics wherever “outperformed” is used and avoid qualitative language surrounding each void-filling method’s performance.
- Figure 3 artifact
Response:
This feature is likely an artifact of the binning process related to tree-based decision thresholds. If this feature persists after the results are updated to reflect the new void-generation and model workflows, we will discuss it explicitly in the figure caption and, if applicable, in the main text, and note any limitations it implies for the results.
- Table 3: Validation vs test confusion
Response:
We agree that this language is ambiguous and does not clearly convey the characteristics of each set. The new training-validation-test split at the glacier level should address this confusion. The table will be updated to display summary statistics for each dataset, and the language will be updated to clarify which dataset is used for each aspect of model training.
- Table 4: Hypsometric standard deviation typo
Response:
This is a typo; the standard deviation of the original results should be 0.086, rather than 0.86. We will update this table with the new results and double-check/explain any odd values.
Citation: https://doi.org/10.5194/egusphere-2025-4404-AC2
Model code and software
Ensemble Machine Learning for Void Filling Cameron Markovsky https://github.com/cmarkovsky/ensemble_void_fill
Review of “An ensemble machine learning approach for filling voids in surface elevation change maps over glacier surfaces”, Markovsky et al., submitted to “The Cryosphere”
by Romain Hugonnet, University of Alaska Fairbanks
General comment:
This study by Markovsky and co-authors explores a valuable topic for the glaciological community. In order to provide constrained estimates of past glacier mass/volume change, we need benchmarked void-filling approaches for glacier elevation changes. This is especially important as voids are increasingly common when using historical archives, which hold the potential to unlock decades of past glacier change. Thus, the work assesses the performance of an existing machine-learning approach to void-filling using exclusively topographical variables, with the aim of improving on the simple binning approaches currently in use.
The study is clear, well-structured and reads well. However, I think important issues regarding its presentation and the reliability of its statistical analysis need to be addressed.
Firstly, the core finding (which is that there is only a marginal improvement in prediction using the machine-learning approach) is somewhat misrepresented, especially in the abstract (but less in the discussion/conclusion). If it is the authors’ intention to publish negative/neutral results, I think it is OK (and an important part of the scientific process), but that would need to be conveyed directly.
Secondly, I believe that the current statistical analysis has major limitations that have not been fully identified or discussed by the authors, namely: 1/ The test data is itself very noisy, so measurement errors are mixed with prediction errors, preventing a reliable statistical validation, 2/ This type of machine-learning approach is known to suffer from error autocorrelation and training regionalization, and can thus be poorly fit to provide uncertainty estimates, which is not discussed.
Finally, while the text reads well, I found that it greatly lacked diversity in the scope of its discussion and was somewhat biased in its scientific references (keeping to only 20 citations and omitting highly relevant work).
Major comments:
1/ Describing the prediction improvement (or lack thereof) accurately
From a statistical viewpoint, the improvement in prediction compared to the widely-used hypsometric method is clearly marginal:
- Per-pixel: RMSE 0.379 vs 0.328, i.e. barely 15%,
- Glacier-wide: Basically no difference (one region slightly worse, one region slightly better).
This “neutral finding” is not conveyed accurately by the authors, especially in their abstract. Compounded with the fact that the per-pixel data is noisy and thus does not necessarily represent true elevation changes (see next comment for details), the statistical significance of these results is quite limited.
While I suspect this could be partly due to the noisy test data, it might be that the prediction performance is also incompressible when using (almost) only topographical variables. If so, that would be an interesting finding in itself: The hypsometric method is largely sufficient when using only topographical characteristics, especially for glacier-wide estimates (what is currently used for model calibration). I know the authors put effort into developing a new prediction approach for this study, and thus a conclusion conveying only a narrow improvement is difficult to put forward, but negative/neutral conclusions are not a bad thing in research and should be clearly reported.
2/ Poor statistical validation due to noisy test data
As the core data for their entire analysis, the authors use elevation change estimates from Shean et al. (2020), which are derived mostly from ASTER DEMs known to be very noisy (Girod et al., 2017). In High Mountain Asia in particular, where many accumulation areas are extremely bright, ASTER cannot resolve high elevations reliably. This means that the input data of the authors (used for training/validation) is itself affected by measurement errors often higher than the elevation change signal itself, especially at the pixel scale. Therefore, this data is poorly adapted to study potential improvements in per-pixel prediction in hypsometric gap-filling. In Shean et al. (2020) (or Hugonnet et al. (2021), which performed a similar analysis with more validation and interpretation regarding errors), it is only by spatially aggregating many pixels that random errors cancel out (depending on their spatial autocorrelation) and that reliable glacier-wide estimates can eventually be derived with ASTER. (This is also why it was less of an issue for McNabb et al. (2019), which is mostly concerned with glacier-wide estimates.)
To address this issue of test data, I think the authors have several options:
- Ideally, I'd recommend to use only high-resolution elevation changes, either from local surveys (lidar, aerial) or from high-resolution DEMs such as those of the Pléiades Glacier Observatory that are distributed at various sites globally (Berthier et al., 2024).
- Otherwise, as a “drop-in replacement” for the same region, the authors could potentially still use ASTER elevation change products, but would need to filter pixels with very high uncertainty relative to the signal. For this, the authors need a predicted uncertainty at the pixel level, which is not available from Shean et al. (2020). Hugonnet et al. (2021) provides uncertainty products based on a validated empirical framework, where the per-pixel uncertainty varies with slope and quality of stereo-correlation (Hugonnet et al., 2022) and is propagated during the temporal fit, with validation against high-precision measurements. However, using this data, the authors should expect to have to remove a large part of the dataset, including many of the accumulation areas they focused on, or to partition the relative per-pixel errors due to input error and prediction error (more difficult)…
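This filtering option could be sketched roughly as follows (a minimal illustration only; the array names `dh`, `dh_sigma` and the `max_ratio` threshold are hypothetical, not taken from either dataset):

```python
import numpy as np

def filter_noisy_pixels(dh, dh_sigma, max_ratio=1.0):
    """Mask elevation-change pixels whose 1-sigma uncertainty exceeds
    max_ratio times the magnitude of the signal.

    dh       : per-pixel elevation change (m)
    dh_sigma : per-pixel 1-sigma uncertainty (m), e.g. from a product
               such as Hugonnet et al. (2021)
    Returns a copy of dh with unreliable pixels set to NaN.
    """
    dh = np.asarray(dh, dtype=float)
    dh_sigma = np.asarray(dh_sigma, dtype=float)
    unreliable = dh_sigma > max_ratio * np.abs(dh)
    out = dh.copy()
    out[unreliable] = np.nan
    return out
```

As noted, applying such a cut in HMA would likely discard many bright accumulation-area pixels, precisely where voids are most common.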
3/ Relevance of the machine-learning approach and its validation
While the authors mostly praise the (potential) advantage of their approach, they fail to discuss known limitations. Many machine-learning methods have been shown to underperform in specific applications in geoscience, which include in particular variables prone to error autocorrelation, or subject to difficult regionalization during training (e.g. review by Hoffimann et al., 2021). Glacier elevation changes have errors that are highly autocorrelated, whether from noise in the DEMs (e.g., Rolstad et al., 2009; Hugonnet et al., 2022), or simply by adding error during temporal prediction, so the first limitation is highly relevant and potentially quite limiting here. Regionalization is also an issue here, given that elevation changes vary significantly from region to region (polar ice caps, alpine glaciers, tidewater glaciers), but also because the authors chose to only focus on upper-area voids (while voids can exist everywhere due to acquisition swath, see my line-to-line comment later) and chose a fixed relative size (37% of the accumulation area, defined as upper 50%).
In particular, providing reliable uncertainty estimates is something that this type of machine-learning approach can struggle with (by overfitting significantly the autocorrelated data), contrary to other machine-learning approaches (such as Gaussian Processes). As the reported improvement in prediction is marginal compared to hypsometric methods, I would argue that improving our estimate of the uncertainty in the prediction is currently as important (if not more) as further improving the prediction itself, which is a topic that was covered slightly in Seehaus et al. (2020). However this topic is omitted in the present manuscript.
All of these limitations should be thoroughly discussed, and the analysis expanded accordingly (e.g., using a varying size of void and not a fixed 37%).
Additionally, concerning the validation:
- Per-pixel accuracy analysis: RMSE and MAE are both pretty bad metrics as they mix random and systematic errors, consider reporting primarily the mean (or median) and standard deviation (or NMAD) of residuals, which capture both independently, as well as the metric used to optimize/learn.
- Glacier-wide analysis: Good inclusion by the authors, because glacier-wide accuracy is the most important output for total mass change. However, this analysis is very size-dependent (as mentioned above, errors cancel out over the glacier based on area), so the authors should not group glaciers of all sizes together, and rather study the performance depending on glacier size. Currently, the errors are probably entirely driven by those of tiny glaciers.
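The separation of systematic and random error components suggested in the first point above can be illustrated with a short sketch (assuming `residuals` is an array of predicted-minus-observed elevation changes; NMAD is the usual normalized median absolute deviation):

```python
import numpy as np

def error_summary(residuals):
    """Report systematic (mean/median) and random (std/NMAD) error
    components separately, rather than mixing them as RMSE/MAE do."""
    r = np.asarray(residuals, dtype=float)
    med = np.median(r)
    # NMAD: robust estimate of random error, scaled to match the
    # standard deviation for normally distributed residuals
    nmad = 1.4826 * np.median(np.abs(r - med))
    return {
        "mean": r.mean(),        # systematic error (non-robust)
        "median": med,           # systematic error (robust)
        "std": r.std(ddof=1),    # random error (non-robust)
        "nmad": nmad,            # random error (robust)
    }
```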
4/ Biased references
The authors repeatedly cite a few references for very different aspects of their study, omitting other relevant studies in the literature, sometimes even those at the origin of a given method. A few examples:
- McNabb et al. (2019) is used for the gap-filling methods, without citing original references,
- Shean et al. (2020) is used for most of the work on HMA/remote sensing, even when not especially relevant,
- Maurer et al. (2019) for everything historical and DEM-processing, even when widely used in much earlier and generic studies.
The authors should diversify their citations, and find the original references for a given method or processing step (sometimes cited within the study they cite). I have included some of these references below in line-by-line comments, but I didn’t elaborate on all, and there are many more to address across the manuscript.
Line-by-line comments:
23: The reference to the old Bamber & Rivera review feels a bit specific, given that the end statement is about density. For density, cite for example Huss (2013) that is the most widely used. To add a more recent review including DEM differencing, cite for example Berthier et al. (2023).
26-31: In the whole section, it is not explained that the voids “predominant in accumulation areas” and later described as “common in historical images” are directly due to limits during stereophotogrammetry (this key term almost never appears in the manuscript) performed on optical imagery (thus including historical archives) to generate DEMs. This needs to be clarified. But beyond this, voids also exist in every large-scale (= satellite) DEMs simply because of fixed-width satellite swaths during acquisition, no matter the instrument (optical, radar).
60: Shean et al. (2020) is clearly not the right citation for this statement… There are extensive review, inventories or other studies more adapted to describing HMA glaciers as a whole.
132: How is the artificial void grown from the seed? I assume you use a flood-filling (or seed-filling) algorithm with 4/8-pixel direction? If yes, describe which and include the appropriate reference, such as Newman et al. (1979).
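For illustration, a 4-connected seed fill of the kind referred to here could be sketched as follows (a generic example of growing a void mask from a seed pixel up to a target size, not the authors' actual implementation):

```python
from collections import deque
import numpy as np

def grow_void(glacier_mask, seed, target_size):
    """Grow an artificial void from a seed pixel by 4-connected
    breadth-first flood fill, constrained to the glacier mask."""
    rows, cols = glacier_mask.shape
    void = np.zeros_like(glacier_mask, dtype=bool)
    queue = deque([seed])
    while queue and void.sum() < target_size:
        r, c = queue.popleft()
        if not (0 <= r < rows and 0 <= c < cols):
            continue  # outside the raster
        if void[r, c] or not glacier_mask[r, c]:
            continue  # already in the void, or off-glacier
        void[r, c] = True
        # 4-pixel connectivity: up, down, left, right
        queue.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return void
```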
145-154: The first occurrence of hypsometric void filling in glaciology is, to my knowledge, Arendt et al. (2002), and the elevation dependency has been described at length well before the citations mentioned (Jakob et al., 2021, or McGrath et al., 2017; which can be removed), especially for spatial extrapolation. See for instance Huss (2012).
179-185: Those components are also called “Northness” and “Eastness”.
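For reference, these components are conventionally computed as the cosine and sine of the aspect angle (sometimes additionally scaled by the sine of slope); a minimal sketch, assuming aspect in degrees clockwise from north:

```python
import numpy as np

def northness_eastness(aspect_deg):
    """Decompose aspect (degrees clockwise from north) into its
    north-facing and east-facing components."""
    a = np.deg2rad(np.asarray(aspect_deg, dtype=float))
    return np.cos(a), np.sin(a)  # northness, eastness
```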
Fig. 3: Add a colormap for the density, even if it is a linear scale?
Table 4: There’s probably an error in the reported value of the STD of Western Himalaya/Hypsometric (it is an order of magnitude above all other STDs).
New references from this review
Arendt, A. et al. (2002). Rapid Wastage of Alaska Glaciers and Their Contribution to Rising Sea Level. Science, 297, 382–386. https://doi.org/10.1126/science.1072497
Berthier, E., Floricioiu, D., Gardner, A. S., Gourmelen, N., Jakob, L., Paul, F., Treichler, D., Wouters, B., Belart, J. M. C., Dehecq, A., Dussaillant, I., Hugonnet, R., Kääb, A., Krieger, L., Pálsson, F., & Zemp, M. (2023). Measuring glacier mass changes from space-a review. Reports on Progress in Physics, 86(3). https://doi.org/10.1088/1361-6633/acaf8e
Girod, L., Nuth, C., Kääb, A., McNabb, R., & Galland, O. (2017). MMASTER: Improved ASTER DEMs for Elevation Change Monitoring. Remote Sensing, 9(7), 704. https://doi.org/10.3390/rs9070704
Hoffimann, J., Zortea, M., de Carvalho, B., & Zadrozny, B. (2021). Geostatistical Learning: Challenges and Opportunities. Frontiers in Applied Mathematics and Statistics, 7. https://doi.org/10.3389/fams.2021.689393
Huss, M. (2012). Extrapolating glacier mass balance to the mountain-range scale: the European Alps 1900–2100. The Cryosphere, 6, 713–727. https://doi.org/10.5194/tc-6-713-2012
Huss, M. (2013). Density assumptions for converting geodetic glacier volume change to mass change. The Cryosphere, 7(3), 877–887. https://doi.org/10.5194/tc-7-877-2013
Hugonnet, R., Brun, F., Berthier, E., Dehecq, A., Mannerfelt, E. S., Eckert, N., & Farinotti, D. (2022). Uncertainty Analysis of Digital Elevation Models by Spatial Inference From Stable Terrain. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15, 6456–6472. https://doi.org/10.1109/JSTARS.2022.3188922
Hugonnet, R., McNabb, R., Berthier, E., Menounos, B., Nuth, C., Girod, L., Farinotti, D., Huss, M., Dussaillant, I., Brun, F., & Kääb, A. (2021). Accelerated global glacier mass loss in the early twenty-first century. Nature, 592(7856), 726–731. https://doi.org/10.1038/s41586-021-03436-z
Newman, W. M., & Sproull, R. F. (1979). Principles of Interactive Computer Graphics (2nd ed.). McGraw-Hill. p. 253. ISBN 978-0-07-046338-7.
Rolstad, C., Haug, T., & Denby, B. (2009). Spatially integrated geodetic glacier mass balance and its uncertainty based on geostatistical analysis: Application to the western Svartisen ice cap, Norway. Journal of Glaciology, 55(192), 666–680. https://doi.org/10.3189/002214309789470950