This work is distributed under the Creative Commons Attribution 4.0 License.
Improving forecasts of snow water equivalent with hybrid machine learning
Abstract. Accurate characterization of snow water equivalent (SWE) is important for water resource management in large parts of the Northern Hemisphere, but its large spatio-temporal variability and limited observational data make it difficult to quantify. Complex physically-based models have been developed that allow long-term SWE prediction, including scenarios without snowpack observations or for future periods. However, these models still suffer from large simulation errors, have long run times at large scales and pose challenges for integrating observational data. There have been attempts to use machine learning (ML) to improve SWE forecasting from meteorological data, with promising results, but the data scarcity issue and concerns about the ability to extrapolate in time and space remain. In this study, we evaluate two hybrid setups that integrate physically-based simulations and ML. The first setup, referred to as post-processing, follows a common approach in which the simulated outputs of a numerical snow model, Crocus, are used as predictors for the ML component in addition to the meteorological data. The second setup, named data augmentation, involves an ML model trained not only on measured SWE but also on Crocus-simulated SWE at additional locations. These approaches are deployed using in-situ meteorological and SWE measurements available at ten stations throughout the Northern Hemisphere, and compared to Crocus and an ML setup using measured data only. The results show that the post-processing setup outperforms all other approaches when predicting on left-out years at the training stations, but performs poorly compared to Crocus when extrapolating to other locations. Adding a large set of Crocus-simulated variables besides SWE to the post-processing setup yields similar performance for left-out years but exacerbates the spatial extrapolation issue. The data-augmentation setup, on the other hand, performs slightly worse on the left-out years but transfers much better to new locations, greatly improving on the other ML-based setups and reducing the RMSE of Crocus by more than 10%. The feature importances of the ML models are consistent with physical knowledge, despite unusual deviations at extreme values, which the data-augmentation setup can further mitigate. Lastly, adding lagged variables improves performance, but the lags are only relevant up to about one week. These results demonstrate the usefulness of hybrid models, particularly the data-augmentation setup, for SWE prediction even in data-scarce domains, with the potential to improve forecasts of SWE at unprecedented spatio-temporal scales.
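In schematic terms, the two hybrid setups can be contrasted as follows. This is a hypothetical scikit-learn-style sketch with synthetic stand-in arrays; the paper's actual ML model, predictors and data handling may differ.

```python
# Hypothetical sketch of the two hybrid setups (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_days, n_met = 365, 6
meteo = rng.normal(size=(n_days, n_met))      # stand-in daily meteorology
crocus_out = rng.normal(size=(n_days, 3))     # stand-in Crocus-simulated variables
swe_obs = rng.gamma(2.0, 50.0, size=n_days)   # stand-in measured SWE (mm)

# Post-processing: Crocus outputs join the meteorology as predictors.
pp_model = RandomForestRegressor(random_state=0)
pp_model.fit(np.hstack([meteo, crocus_out]), swe_obs)

# Data augmentation: predictors stay meteorological, but the training set is
# extended with Crocus-simulated SWE at additional (unmeasured) locations.
meteo_extra = rng.normal(size=(n_days, n_met))        # meteorology, extra sites
swe_crocus_extra = rng.gamma(2.0, 50.0, size=n_days)  # Crocus SWE, extra sites
aug_model = RandomForestRegressor(random_state=0)
aug_model.fit(np.vstack([meteo, meteo_extra]),
              np.concatenate([swe_obs, swe_crocus_extra]))
```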
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-1845', Anonymous Referee #1, 26 Jun 2025
AC1: 'Reply on RC1', Oriol Pomarol Moya, 21 Jul 2025
Dear reviewer,
Thank you for your kind comments and also for raising some interesting points of discussion about our manuscript. We hope to provide a comprehensive answer to all of them in the attached text, written on behalf of all co-authors. The original text is preserved in black, while the answers are given in green for ease of comparison.
Sincerely,
Oriol
RC2: 'Comment on egusphere-2025-1845', Anonymous Referee #2, 04 Aug 2025
Summary
Pomarol Moya et al. in “Improving forecasts of snow water equivalent with hybrid machine learning” evaluate various machine learning (ML)-based approaches for making spatiotemporally in-sample and out-of-sample predictions of daily snow water equivalent (SWE) across 10 measurement sites in the Northern Hemisphere, derived from the ESM-SnowMIP project, spanning 7-20 years. The ML-based estimates are compared to and, in some cases, informed by a physics-based snow model, Crocus. The analysis shows that ML-based models can benefit from learning about daily SWE behavior from both observations and the physics-based model, sometimes helping to offset physics-based snow model errors (e.g., snowmelt rate/snow-off date). The order of importance of variables that influence the SWE prediction is also intercompared and rank-ordered; many of these variables relate to snowpack thermodynamics and could (potentially) be used to inform physics-based model development.
Overall, I think the paper fits within the scope of the Cryosphere and could be, given more work, a valuable contribution. ML-based methods have grown in popularity in recent years, and ML model development/sensitivity analyses like these help to inform where/when ML-based methods are or are not fit for purpose in predicting SWE spatiotemporally. However, I think there are still several major revisions that need to happen prior to this paper being accepted. While I appreciate the authors’ thorough analysis, the underlying data (ESM-SnowMIP station network) is quite sparse in space/time and makes me worry about the extensibility of their findings beyond the limited station locations and years assessed. I respect that the authors provided an entire section (Section 4.1) that discusses this very point, but I feel like a more thorough decomposition (e.g., snow climate, elevation, land surface heterogeneity, etc.) of station differences is still needed (given that only a few stations are used to train and assess ML model fidelity). I also think the authors could try to provide more take-home messages for the physics-based modeling community from the ML-based results on which variables/processes to target (e.g., fix the long-standing snowmelt rate/snow-off date biases in physics-based snow models). Too often it feels like ML-based papers try to show how they can outperform physics-based models rather than how ML-based models/methods can be used to advance physics-based model development. This is particularly salient given that ML-based models predict poorly out of sample in space/time and under different climate scenarios; therefore, physics-based models appear likely to be needed for the foreseeable future. I have provided, hopefully, constructive comments and suggested edits below for the authors to consider.
Comments and suggested edits
Line 24 – cite “The cryosphere has a large impact on the Northern Hemisphere…”, maybe with this study…
Huss, M., Bookhagen, B., Huggel, C., Jacobsen, D., Bradley, R.S., Clague, J.J., Vuille, M., Buytaert, W., Cayan, D.R., Greenwood, G., Mark, B.G., Milner, A.M., Weingartner, R. and Winder, M. (2017), Toward mountains without permanent snow and ice. Earth's Future, 5: 418-435. https://doi.org/10.1002/2016EF000514
Line 28 – cite “…due to its spatio-temporal variability…”, maybe with this study…
Alonso-González, E., Revuelto, J., Fassnacht, S. R., & López-Moreno, J. I. (2022). Combined influence of maximum accumulation and melt rates on the duration of the seasonal snowpack over temperate mountains. Journal of Hydrology, 608, 127574
Line 34 – add “machine learning (ML)” as this is the first time it is introduced/defined
Line 36 – “find non-linear structure” – can machine learning only identify non-linear structures or both linear and non-linear?
Line 39-40 – you might also include this citation…
Song, Y., W. Tsai, J. Gluck, A. Rhoades, C. Zarzycki, R. McCrary, K. Lawson, and C. Shen, 2024: LSTM-Based Data Integration to Improve Snow Water Equivalent Prediction and Diagnose Error Sources. J. Hydrometeor., 25, 223–237, https://doi.org/10.1175/JHM-D-22-0220.1
Line 45-46 – this sentence needs a citation for this bold statement. Couldn’t the ML models inherit and amplify biases learned from the physics-based models? Also, is there peer-reviewed evidence that ML models can skillfully produce “out of sample” predictions from one mountain/seasonal snow region to another?
Line 60 – change “features” to “conditions”
Line 71-72 and Line 77-79 – are 10 stations with 7-20 years of measurements enough to properly sample intra- and inter-annual variability of snowpack lifecycles across the Northern Hemisphere? Also, worryingly, only three of the stations are automatic and the others “only [have] manual measurements at irregular intervals”. How many snow climates (Sturm and Liston, 2021), elevations, etc. are represented across these stations? Could the authors provide a map plot with automated/manual station lat/lon locations?
Sturm, M., and G. E. Liston, 2021: Revisiting the Global Seasonal Snow Classification: An Updated Dataset for Earth System Applications. J. Hydrometeor., 22, 2917–2938, https://doi.org/10.1175/JHM-D-21-0070.1.
Line 76 – change “snow water equivalent” to “SWE”
Line 82-83 – what does it mean that “aggregation methods were performed for some variables according to expert knowledge”? Can you provide readers with the physical basis/intuition for how each of these various aggregation methods for meteorological variables impacts a snowpack’s energy/mass balance? This is needed to ensure that the ML method is learning and estimating snow physics for the right reasons.
Line 92 – change “50 layers” to “50 snow layers”. Also, I might mention snow temperature or some other thermodynamic variable (given the mention of “energy and mass balance” in the previous sentence(s).
Line 99 – change “layer” to “snow layer”. Also, does “layer information” mean the dynamic ranges of snow depth when delineating the 50 snow layers over space/time as the snowpack lifecycles evolve?
Line 101 – I would delete “2100 J/kg*K” as it seems like TMI (if an equation to compute cold content is not shown).
Line 103 – why are most of the variables used to train the ML model daily averages? Was there a sensitivity analysis performed that is not mentioned here? For example, wouldn’t minimum (e.g., nighttime) or maximum (e.g., daytime) temperatures be important too given that the snowpack might refreeze or quickly melt depending on the range of temperatures experienced in a given day? On Line 114-115 you also mention how there can be a delay in the response of the snowpack (presumably from the erosion of cold content before a phase change occurs) over a 14-day period from the given day. This also seems to be an argument that some information might be contained in minimum/maximum/etc. of meteorological variables.
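For example, a hypothetical pandas sketch (variable and column names invented for illustration) of how daily min/max aggregates and lagged features could be constructed alongside the daily mean:

```python
import numpy as np
import pandas as pd

# Stand-in hourly air temperature for one station (illustrative only).
idx = pd.date_range("2020-01-01", periods=24 * 60, freq="h")
rng = np.random.default_rng(0)
hourly = pd.Series(rng.normal(-5.0, 4.0, len(idx)), index=idx, name="tair")

# Daily aggregates: the mean alone hides overnight refreeze and daytime melt
# extremes, so retain the minimum and maximum as well.
daily = hourly.resample("1D").agg(["mean", "min", "max"])

# Lagged copies of the daily mean (1, 3, 7, 14 days) to capture the delayed
# snowpack response, e.g. erosion of cold content before a phase change.
for lag in (1, 3, 7, 14):
    daily[f"tair_mean_lag{lag}d"] = daily["mean"].shift(lag)
```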
Figure 1 – is there a reason that different brackets are used “[]” and “()” to describe time (t)?
Line 117 – “consecutive daily SWE measurements are available, that is, the automatic stations” does that mean you completely “throw out” the data from seven of the 10 stations? If so, I am even more worried about properly sampling intra-annual and inter-annual variability of snowpack lifecycles across the Northern Hemisphere. 1874 days (~5 years) is not very much data to train the ML model on purely observations of SWE/dSWE. A bigger question: can you more clearly state how the manual measurements are used then?
Line 139 – so you are splitting 1874 days of data into train, validation and test? Are manual measurements used for training, validation, and/or testing too?
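To frame the question, a hypothetical scikit-learn sketch of the two split strategies (group labels invented for illustration):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Stand-in metadata for the ~1874 daily samples (illustrative only).
rng = np.random.default_rng(0)
n = 1874
X = rng.normal(size=(n, 6))               # daily predictors
y = rng.normal(size=n)                    # daily SWE / dSWE target
years = rng.integers(2005, 2015, size=n)  # hydrological year per sample
stations = rng.integers(0, 3, size=n)     # automatic station id per sample

logo = LeaveOneGroupOut()
# Temporal split: hold out whole years, so test days are unseen in time.
temporal_folds = list(logo.split(X, y, groups=years))
# Station split: hold out whole stations, so test sites are unseen in space.
station_folds = list(logo.split(X, y, groups=stations))
```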
Figure 2 – change “a) the station split and b) the temporal split strategies” to “a) the temporal split and b) station split strategies”. Either the a) and b) in the figure are wrong or the caption is wrong.
Figure 3 – at the moment, a reader (who quickly glances at this plot) might infer that “Sample size auto. stations” of 171, 348, 1355 means the number of stations rather than the number of station measurements used (which I think is what the authors intend to convey). Please change this to be more specific. Also, why would NSE go down for Crocus as more information is used? Is that because model bias becomes more severe as more stations are compared with it?
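For reference, NSE is defined relative to the variance of the observations in the evaluation sample,

$$\mathrm{NSE} = 1 - \frac{\sum_{t}\left(\mathrm{SWE}_{t}^{\mathrm{sim}} - \mathrm{SWE}_{t}^{\mathrm{obs}}\right)^{2}}{\sum_{t}\left(\mathrm{SWE}_{t}^{\mathrm{obs}} - \overline{\mathrm{SWE}^{\mathrm{obs}}}\right)^{2}},$$

so pooling more (or more heterogeneous) station-days changes both the error term and the observed-variance benchmark, and NSE can drop without the model itself changing.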
Line 184-194 – are these results indicating that Crocus degrades ML model performance in the temporal split and enhances ML performance across stations (e.g., comparing the AUG result between the two data splits)? Why would this be the case? Also, physically, what does it mean when a model does not perform well in the temporal split but does in the station split?
Line 196-206 – do the authors know why Crocus systematically underrepresents peak SWE (even when run at a point scale) and melts out the snowpack too early? Does it have to do with the rain-snow partitioning scheme in the accumulation season? Could this be enhanced? For example, Jennings et al. (2018) provides a potential path forward. Similarly, what might be driving the snowmelt/snow-off date bias? Is there any literature to highlight this as a systematic snow model deficiency?
Jennings, K.S., Winchell, T.S., Livneh, B. et al. Spatial variation of the rain–snow temperature threshold across the Northern Hemisphere. Nat Commun 9, 1148 (2018). https://doi.org/10.1038/s41467-018-03629-7
Line 229-235 – do these differences in meteorological variables/etc. have to do with the stations being located in different snow climates, elevations, shaded/forested regions, etc.? Do the authors think they have sampled all of these properly in training/testing the ML models?
Line 263-264 – do the authors know which stations had more or less sensitivity to lagged meteorological variables at +7 day vs 7 day vs 3 day vs 1 day? Do these stations (and their sensitivities) fall into different snow climates, elevation bands, shaded/forested landscapes, etc.? This sort of information would be important to glean to guide future ML model development/application over a larger spatiotemporal set of stations.
Section 3.4.2 – this seems like it should be in the Data and Methods section (or Supplemental Material)
Line 276-277 – in Figure 3, didn’t the authors show that the AUG model (i.e., hybrid Crocus-ML model) resulted in poorer performance for the temporal split (i.e., a worse NSE range compared to all physics-based and ML models) and slightly better performance in the station split (i.e., the NSE range is more constrained and the mean NSE is slightly higher than all physics-based and ML models) than Crocus? Is the difference between Crocus and AUG performance statistically significant/appreciably different for the station split?
Line 281-284 – could the tendencies or corrections made by the ML models be used to inform physics-based model development (e.g., to “fix” the snowmelt rate/snow off date bias)? At the very least, could the ML models be used to identify if the variable(s) driving this bias in Crocus (and other physics-based models) are mass or energy related? This could be a major value add from ML models.
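One hypothetical way to probe this (sketch only; predictor names and the toy target are invented): train an ML model on the Crocus SWE error and rank the predictors by permutation importance to see whether mass- or energy-related variables dominate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Stand-in predictors and a toy Crocus SWE error dominated by shortwave
# radiation (illustrative only; real data would come from the stations).
rng = np.random.default_rng(0)
names = ["precip", "tair", "sw_down", "lw_down"]
X = rng.normal(size=(500, len(names)))
crocus_error = 1.5 * X[:, 2] + rng.normal(scale=0.3, size=500)

model = RandomForestRegressor(random_state=0).fit(X, crocus_error)
imp = permutation_importance(model, X, crocus_error, n_repeats=10,
                             random_state=0)
for name, score in sorted(zip(names, imp.importances_mean),
                          key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")  # energy-related sw_down should rank first
```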
Line 293-295 – what would constitute a “large, representative dataset”? How many days would be needed? How many stations? Etc.
Line 298 – “greater generalization capability” Do you mean Crocus has prognostic, physical equations that can make predictions “out of sample” rather than purely diagnostic/”in sample” inferences (as an ML model arguably does)?
Line 309 – change “downwards” to “downward”
Line 313-314 – Do you mean to say something like this “…variable selection should be based on an understanding of the snow climates and geographic heterogeneity (e.g., elevation, forest cover and topographic shading) of the location or region in which the ML model is applied”?
Line 315-338 – I appreciate that the authors explicitly stated the sample size issue here. I was looking for something like this earlier on though. Maybe a sentence or two in the Methods that references a larger discussion later on in the manuscript?
Line 325 – change “specially” to “especially”
Line 335 – see Song et al. (2024) citation above
Line 342 – change “northern hemisphere” to “Northern Hemisphere”
Citation: https://doi.org/10.5194/egusphere-2025-1845-RC2