Machine learning for snow depth estimation over the European Alps, using Sentinel-1 observations, meteorological forcing data and physically-based model simulations
Abstract. Seasonal mountain snow is an indispensable resource, providing drinking water to more than a billion people worldwide, supporting agriculture, industry and hydropower generation, and sustaining river discharge, soil moisture and groundwater recharge. However, accurate estimates of this seasonal water storage remain limited, even in the European Alps, where there is a dense network of in situ monitoring stations. In this study, we address this issue by estimating Alpine snow depth at a 100 m spatial and sub-weekly temporal resolution with an extreme gradient boosting model (XGBoost) for the time period 2015–2024. We explore the potential for using Sentinel-1 C-band dual-polarized synthetic aperture radar polarimetry (PolSAR) observations to improve upon backscatter-based approaches, and include regionally downscaled meteorological forcing data and modeled snow depth inputs to further explain interannual and spatial variability. To account for the spatio-temporal dependencies present in the snow depth data, we conduct a threefold nested cross-validation, and incorporate spatial training data to better represent topographical patterns in snow depth variability. Finally, we utilize XGBoost's booster and Shapley additive explanation values to understand the relationship between the input features and predicted snow depths during both dry and wet snow conditions. Our results demonstrate that incorporating Sentinel-1 PolSAR observations leads to more accurate snow depth retrievals compared to using backscatter alone. In addition, our analyses indicate that including either meteorological forcing data or modeled snow depth estimates substantially improves the XGBoost snow depth estimates, both of which yield comparable accuracy. Finally, we demonstrate that the inclusion of spatial training data is essential for capturing the topographic influence on snow depth estimates, and to obtain good spatial prediction accuracy. 
Overall, this work contributes to an improved large-scale monitoring of water stored in seasonal mountain snow.
Summary:
This manuscript presents an extensive analysis of machine learning capabilities for snow depth estimation. The authors compare a variety of machine learning (XGBoost) model configurations and apply a threefold nested cross-validation to evaluate their approach. The inputs to the machine learning model are remote sensing data from Sentinel-1 (of which the PolSAR variables have not previously been used to estimate snow depth), downscaled meteorological forcing data and physically-based model simulations. The authors then evaluate the importance of features in the machine learning model, as well as its spatial predictions of snow depth at unseen locations.
The aims and findings of this study are interesting, with the main novelty being the inclusion of PolSAR variables, as well as meteorological forcings or a physically-based model driven by those forcings, to predict snow depth at high resolution (100 m) over the Alps. I am impressed with the amount of data processing and the careful methodological procedures the authors went through, which appear very robust.
However, I think the manuscript should state more clearly the novelty of this study in comparison with Dunmire et al. (2024). While the authors claim that the snow depth estimates are improved by the inclusion of PolSAR and meteorological forcings, it is hard to see any substantial improvement when comparing similar figures between the two manuscripts. Even within this manuscript, it is often claimed that a method improves the snow depth estimates without this being clearly visible in the figures. Furthermore, I have several concerns regarding the presentation of results: some results are not very clear, and there are many instances of “results not shown”. I believe the authors need to improve the manuscript before it can be published, and I hope my comments below will help.
Main comments:
The title claims snow depth estimation over the European Alps, but there is no map of estimated snow depth over the European Alps, and no map of the predicted snow depth validation over the entire mountain range (as there is in Figures 2 and 7 of Dunmire et al., 2024).
About XGBoost: besides referring the reader to Chen and Guestrin (2016), I think there should be at least a few lines describing what this ML model is and what its characteristics are, and why the authors (or previous authors) chose this model.
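To illustrate the level of intuition such a description could give unspecialised readers: XGBoost belongs to the family of gradient-boosted decision trees, in which each new tree is fitted to the residuals of the current ensemble. The following is a purely illustrative, stdlib-only sketch of that principle using one-feature decision stumps; it is not the authors' implementation, and XGBoost additionally uses second-order gradients, regularization and many engineering optimizations.

```python
def fit_stump(x, resid):
    """Find the split threshold on a single feature that minimizes the
    squared error of the current residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, resid) if xi <= t]
        right = [r for xi, r in zip(x, resid) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]  # threshold, left-leaf value, right-leaf value

def boosted_fit(x, y, n_rounds=100, lr=0.3):
    """Gradient boosting for regression: start from the mean prediction and
    sequentially add shrunken stumps fitted to the residuals."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, ml, mr = fit_stump(x, resid)
        pred = [pi + lr * (ml if xi <= t else mr)
                for xi, pi in zip(x, pred)]
    return pred
```

A couple of sentences along these lines (ensemble of trees fitted sequentially to residuals, with shrinkage and regularization controlling overfitting) would make the model choice much more accessible.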
The inclusion in the ML model of physically-based model simulations driven by the meteorological forcing yields accuracy comparable to using the meteorological forcings directly as input to the ML model. As I understand it, there is therefore no advantage to using the physically-based model simulations, as this adds unnecessary complexity; the ML model appears to learn the relevant physics from the meteorological forcing alone. I think this is an interesting finding and should be discussed in more depth.
Section 4.1 states a couple of times that differences are significant because p << 0.05 (e.g., the differences in Table C1). However, the improvements are quite marginal (R2 0.88 vs. R2 0.89; MAE 0.30 m vs. MAE 0.29 m). I think the authors should discuss significance based on the absolute improvement, which is very small, and not on the statistical significance, which in this case is clearly just a consequence of the large sample size. See https://www.nature.com/articles/s41598-021-00199-5 and https://linkinghub.elsevier.com/retrieve/pii/S026151771730078X . With this in mind, the authors should revise this section carefully.
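To make the sample-size point concrete, here is a small stdlib-only simulation with hypothetical numbers (the error magnitudes are invented for illustration, not taken from the manuscript): with hundreds of thousands of paired samples, a 1 cm mean improvement yields an enormous paired z-statistic and p effectively zero, even though the practical improvement is negligible.

```python
import math
import random

random.seed(42)
n = 200_000  # order of magnitude of pooled station-day samples (hypothetical)

# Hypothetical absolute errors (m) of two configurations on the same samples;
# configuration B is assumed to be only 1 cm better on average.
err_a = [abs(random.gauss(0.30, 0.25)) for _ in range(n)]
err_b = [max(ea - 0.01 + random.gauss(0.0, 0.05), 0.0) for ea in err_a]

diff = [ea - eb for ea, eb in zip(err_a, err_b)]
mean_d = sum(diff) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diff) / (n - 1))

# Paired z-test: the standard error shrinks with sqrt(n), so any nonzero
# mean difference becomes "significant" for large enough n.
z = mean_d / (sd_d / math.sqrt(n))
p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

print(f"mean improvement: {mean_d:.4f} m, z = {z:.1f}, p = {p:.2e}")
```

This is exactly why I suggest reporting effect sizes (absolute MAE/R2 differences) rather than p-values in Section 4.1.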
There are several instances of “results not shown” in the paper. I think they should all be included as they seem relevant (lines 396, 411, 465, 474, 484, 494, 501, 526, 550). There are also instances where a result is discussed but not seen on any figure (lines 397-399, 431-433, 497-498, 511-512, 534-537, 586)
About the novelty with respect to Dunmire et al. (2024): line 344 even states that the errors are slightly higher in this study than in Dunmire et al. (2024), and another example in line 434 shows very similar results. There should be a more open discussion of the limited improvement, despite the novelty of this paper.
Sometimes in the manuscript it appears that meteorological forcing data AND physically-based model simulations are used simultaneously in the ML model, but that is not the case. I suggest changing the following instances to OR (not in the title, as that is a list of all the inputs): lines 8 and 568.
Comments by line number:
L34: a reference for “essential climate variables” is missing.
L40: I suggest adding example datasets: “measurements offer frequent data at many locations globally (e.g. Matiu et al. 2021, https://doi.org/10.5194/tc-15-1343-2021; Fontrodona-Bach et al. 2023, https://doi.org/10.5194/essd-15-2577-2023; Mortimer and Vionnet, 2025, https://doi.org/10.5194/essd-17-3619-2025)”.
L53: Does “this work” refer to the one in this manuscript? Not clear if it refers to the previous references.
L54: “an increasing snowpack DEPTH”?
L55: I recommend against the use of etc. Either complete the list or simply state the examples.
Lines 62-64 and 75-77 seem to be a repetition of each other regarding the current gap in knowledge.
L74: perhaps: “snow depth retrieval”?
L83: “compared to in-situ measurements.” This needs references.
L85: such as instead of e.g.
L88: This needs a reference at the end of the sentence.
L91: perhaps: “contribute to improving SD predictions”?
L103: coarse instead of course.
L116: The GHC needs a reference (and shouldn’t it be GHCN?). Does the end of the sentence mean that only the stations in Germany and Slovenia are taken from this dataset?
L145: Why does rescaling matter for interannual start of season differences? It is unclear what this means.
L158: How many are these remaining gaps? How many were filled?
L166: a quick definition of majority resampling would be useful.
L178: What other downscaling techniques?
Equation 1: Where does this downscaling equation come from? A reference or explanation is needed.
L223: With a rather long paper and a lot of specific nomenclature, it is sometimes easy to forget what LIA or TPI mean, especially for unspecialised readers. I suggest including a table or list in the appendix with all abbreviations used (or expanding Table B1).
L239: Perhaps it is useful to remind the reader that here the input of meteorological data or physical model simulations is still not assessed. I thought there should be 5 configurations otherwise.
L241: Why “next” and not together with the previous?
L243: I suggest “The second configuration, focusing…” instead of “Conversely, the configuration…”
Table 2 caption: I suggest “within this study.”
Table 2: some of the features are presented in the text only after the table, so the reader does not yet know what all these variables are. I suggest spelling them out in the caption, or in the acronym table suggested above.
3.2.3 Snow depth prediction: I do not understand how this title links to the paragraph. It seems that the paragraph is about standardization of features.
L276: If every fold contains at least one station from each box of stations within 5 km of each other, how is this a blind validation? I may be misunderstanding this; please clarify.
L278: The procedure for the temporal folds is also not entirely clear to me. What does it mean that sites were kept separate, but grouped and divided into 5 folds?
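For the spatial folds, one way to guarantee a spatially blind validation (as I understand the intent) would be to assign each 5 km box of stations wholly to a single fold, so that no two folds share stations closer than the box size. A minimal stdlib-only sketch, with hypothetical station names and projected coordinates in metres:

```python
import random

def box_id(easting, northing, size_m=5000):
    """Map a station's projected coordinates to a 5 km x 5 km grid box."""
    return (int(easting // size_m), int(northing // size_m))

def spatially_blind_folds(stations, k=5, seed=0):
    """Group stations by 5 km box and assign each whole box to exactly one
    fold, so nearby (potentially correlated) stations never straddle the
    train/test split."""
    boxes = {}
    for name, (e, n) in stations.items():
        boxes.setdefault(box_id(e, n), []).append(name)
    keys = sorted(boxes)
    random.Random(seed).shuffle(keys)
    folds = [[] for _ in range(k)]
    for i, key in enumerate(keys):
        folds[i % k].extend(boxes[key])
    return folds
```

If the manuscript's procedure already works this way (whole boxes per fold, as in a grouped k-fold), stating it in these terms would resolve my confusion; if instead each fold samples stations from every box, the validation is not blind to spatial correlation.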
Figure 2: Please add a legend for the colours and textures.
L308: Does this mean that for these sites, the snow season is less than 10 days?
L314: The bias, although discussed, is not always shown in figures or tables. Please include it (e.g. Table 3, Table C1).
L332: Why is Table C1 not presented together with Table 3? It seems quite important and is thoroughly discussed.
L336: “the temporal framework overestimates model performance” what does this mean? Can model performance be overestimated or is snow depth overestimated?
L339: This says that the spatio-temporal framework provide a more realistic evaluation of model performance for this study, but lines 331-332 say that performance is highest for the temporal framework and progressively deteriorates in the spatial and spatio-temporal frameworks. These two statements contradict each other.
L349: the Figure C1b caption says observed minus predicted, so a negative bias would mean an overestimation of snow depth. Please standardise this.
L353-355: How is a deterioration of model performance seen as an accurate representation of model performance? This sentence is unclear.
L355: why “also”? Which other improvements were there?
L355-365: As stated in the main comments, I do not see this small improvement as meaningful; the statistical significance is likely just an effect of the sample size.
L368: FSC instead of fSC.
Figure 3: It is difficult to see differences between configurations, perhaps a table in the supplement or Appendix would be useful.
L397-399: How are these results seen in Fig. 4a?
Figure 5: The predicted snow depth time series show a clear, long flat period in the middle of the accumulation season (especially in 5a and 5b), which does not match the observations well, suggesting that mid-winter snowfalls and the associated increases in snow depth are not well captured. This should be discussed. Please also state which sites these are (name, location, source of the data).
L505-507: Linking to one of my main comments, I think the results underscore the potential of using meteorological forcing data alone, as input to ML models (as the improvement of Snowclim is minimal). I think this should be included here.
Figure 6. I suggest adding the title of each configuration on each row, to make the figure more easily readable.
L531: again, the improvement seems quite minimal.
L542-543: the potential inability to correctly predict snow density is a key limitation for further refining this method to predict daily time series in the future. This could be discussed.
L539: The authors say weatherML and snowclimML overestimate snow depth, based on the biases. However, comparing Figure 7b with the measured map in Figure 7d suggests the opposite: 7b (weatherML) shows much lower snow depths than measured. In fact, the scatter plots suggest that weatherML outperforms snowclimML, yet the snowclimML snow depth map resembles the observations more closely. This discrepancy should be clarified.
Figure 7. Why do maps have different MAE than their respective scatter plots? Why do the scatter plots have a low density of points when approaching 0 m snow depth?
Figure 8. It would be better to show the maps of snow depth with survey data, without survey data, the difference, and the measured maps, to enable a better comparison.
L593-595: Compare these results to estimates from other studies, such as the results from Dunmire et al 2024.
Equation A1: Can Wsat not become infinite if any weight is zero? Revise or clarify.
L633-635: what downscaling techniques and what parameters?
Figure B1: It would be interesting to see different scatter plots for the snow surveys and the point measurements.
Figure C1. Why not just a map with the bias per station, and compare it with the one from Dunmire et al. (2024) in their Figure 2?