Benchmarking Catchment-Scale Snow Water Equivalent Datasets and Models in the Western United States
Abstract. This study benchmarks a wide range of snow water equivalent (SWE) models and data products at a catchment scale in the Western US, and discusses an experimental protocol to facilitate community-wide intercomparisons. Utilizing lidar-based ASO (Airborne Snow Observatory) SWE estimates as a 'ground truth', this study evaluates the performance of multiple SWE products, including SNODAS, SWANN (4km and 800m), the US National Water Model (NWM), UCLA-SWE, SWEMLv2, NLDAS-2 (VIC, Noah, and Mosaic), ERA5-Land, Daymet and the CONUS404 dataset. We use SWE aggregated to hydrologic catchments as the standard spatial basis for assessment, focusing on multiple spatially-variable performance metrics. UCLA-SWE, SWANN (both 800 m and 4 km), and SWEMLv2 show the strongest agreement with ASO SWE, each achieving Kling–Gupta Efficiency (KGE) values above 0.6. SNODAS also performs competitively with these higher-performing models. The coarser-resolution products generally perform poorly at the catchment scale. Notably, ERA5-Land and the NLDAS-2 Mosaic and VIC models demonstrate strong skill for basin-average SWE (R² > 0.9), while the NLDAS-2 Noah model exhibits weak performance across both spatial scales. Noting the lack of a common community standard for SWE product and model evaluation, we use the results of the multi-dataset analysis to explore potential experimental protocols for a standardized SWE evaluation that could support community-wide intercomparison and benchmarking of existing and new SWE products. SWE datasets are a critical component in hydrologic prediction practices such as water supply forecasting, thus the use of experimental standards proposed herein could facilitate quantitative guidance for agency and stakeholder adoption of specific SWE products in decision support applications.
Ritchie et al. evaluate a large suite of snow water equivalent (SWE) datasets in the United States by comparing the performance of these data products, aggregated at the hydrofabric scale, against lidar-based SWE estimates from Airborne Snow Observatories. In my opinion, this is a valuable contribution to the community: SWE is a critical quantity, and many products are used for many different purposes, often without rigorous consideration of how product choice may impact outputs.
I think this is an important contribution to the field. However, I have major comments on how the comparison is framed. The authors should present this as a direct comparison (in known basins) of many products against ASO, but without calling ASO the ground truth, as it is not a direct observation. Further, it is unclear whether the goal of the paper is a rigorously quantified intercomparison of SWE products or a discussion of how to construct a protocol for such an intercomparison. In my opinion, the manuscript's strengths lie in the former, and it should focus there, with notes on the intercomparison protocol appropriate in the discussion. I recommend that the authors consider developing the protocol content as a separate manuscript.
Line edits:
L57: “it’s” to “it is”
L55: it is currently called Airborne Snow Observatories (slightly different name)
L58-59: the phrasing in the parenthetical is clunky, but noting that SWE is a model product is important
L60: a main limitation of ASO is temporal sparseness (as noted) as well as spatial coverage – it currently covers only select US basins
L60: “quasi-observational” is a confusing category in my opinion. The SD products are directly observed; the SWE products are better introduced by your longer explanation in the preceding phrases
L98: it is not appropriate to call the ASO SWE product an observation, because it is not one. There are significant modeling efforts behind the ASO SWE product, many of which are not public. Further, ASO SWE estimates are validated/calibrated by comparison to ground-based SWE observations, so they likely contain bias from prioritizing agreement at specific locations.
L98: “for” typo
L249: missing subsubsection title
Section 2.4: these are standard metrics (R², KGE, etc.) and the formulas do not need to be reproduced. If they are reproduced, please define all variables.
Section 5.2: This is interesting, but discussion and comparison of protocol choices could constitute another paper and, in my opinion, does not align with the main point of this work.
Figures:
General: please add letter labels to subplots
Fig 1: it is unnecessary to have separate panels with and without the hydrofabric overlay (panels a and b)
Fig 2 and Fig 6: are these showing essentially the same thing, except separated by state in Fig 6? If so, this is repetitive. If the point is to compare the states, consider another visualization that compares them directly, or report a numeric summary.