Assessing the performance and explainability of an avalanche danger forecast model
Abstract. During winter, public avalanche forecasts provide crucial information for professional decision-makers as well as recreational backcountry users in the Swiss Alps. While avalanche forecasting has traditionally relied exclusively on human expertise, the Swiss avalanche warning service has recently integrated machine-learning models to support the forecasting process. This study assesses a random forest classifier that uses weather data and physical snow-cover simulations as input to predict dry-snow avalanche danger levels, evaluated during its initial live testing in the 2020–2021 winter season. The model achieved ∼70 % agreement with the published danger levels, performing equally well in nowcast and forecast mode. Continuous expected danger values, computed from the model-predicted probabilities, correlated strongly with the sub-levels published in the Swiss forecast. The model effectively captured temporal dynamics and variations across slope aspects and elevations, although its performance decreased during periods with persistent weak layers in the snowpack. SHapley Additive exPlanations (SHAP) were employed to make the model's decision process more transparent, reducing its 'black-box' nature. Beyond increasing the explainability of model predictions, the model encapsulates twenty years of forecasters' experience in aligning weather and snowpack conditions with danger levels. The presented approach and visualization could therefore also serve as a training tool for new forecasters, highlighting relevant parameters and thresholds. In summary, machine-learning models such as the danger-level model, often considered 'black boxes', can provide high-resolution, comparatively transparent 'second opinions' that complement human forecasters' danger assessments.
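As a rough illustration of how continuous expected danger values can be derived from class probabilities (a minimal sketch, not the paper's exact Eq. 3; it assumes the random forest outputs one probability per dry-snow danger level and that the expected value is the probability-weighted mean of the levels):

```python
import numpy as np

# Hypothetical per-class probabilities from the random forest for
# danger levels 1 ("low") to 4 ("high"); values are illustrative only.
levels = np.array([1, 2, 3, 4])
probs = np.array([0.05, 0.20, 0.60, 0.15])  # must sum to 1

# Continuous "expected danger" as the probability-weighted mean of the levels.
expected_danger = float(np.dot(levels, probs))  # -> 2.85

# Rounding back to an integer level allows a categorical comparison
# with the published danger level.
predicted_level = int(round(expected_danger))   # -> 3

print(expected_danger, predicted_level)
```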
Status: final response (author comments only)
RC1: 'Comment on egusphere-2024-2374', Simon Horton, 14 Aug 2024
General comments
This paper evaluates a machine learning model that predicts avalanche danger levels in Switzerland. The authors developed the model in a previous study, so this study focused on comparing the model's predictions with expert forecasts during the 2020-21 winter across space, time, and under different avalanche conditions. The authors also analyze which model inputs influenced predicted danger in different scenarios, adding transparency and explanations for an otherwise black-box model.
The study is well-designed, clearly organized, and effectively communicated, making it both interesting and relevant. Numerical avalanche forecasting is a rapidly developing field with significant implications for public safety and natural hazard management. I recommend publishing this paper in NHESS after addressing the relatively minor comments and clarifications below.
Specific comments
- One limitation that could be more clearly emphasized is that the model only accounts for dry snow avalanches. It's important to clarify whether the analysis excluded situations when wet avalanches may have influenced the danger rating, especially since the study period extended into May. According to the EAWS workflow, the danger level is determined based on the highest level indicated by the EAWS matrix for each avalanche problem. If wet snow problems significantly contributed to the danger rating on certain days, those days should be excluded from the evaluation, as the ML model was designed exclusively for dry snow avalanches.
- Within/outside categories. The within/outside categories could be described more clearly. It seems that the "within" group requires both the station elevation to be within the critical elevation range of the bulletin and the virtual slope to be a critical aspect. How are flat slopes within the critical elevation treated? The "outside" predictions are defined as not being in the critical elevation range. But then where would a simulation that falls within the critical elevation range but on a virtual slope that isn't a critical aspect belong? Also, is a subcategory for predictions on critical aspects outside the critical elevation range relevant? It might also be clearer to consistently use "critical" elevations and slope aspects, as in Fig. 1 and Sect. 3, instead of switching to other terms that appear to be synonyms such as "core zone" and "active slope", which may contribute to confusion.
- 5.2.4 / Fig. B1: The deviations between the expected danger and sub-levels might stem from the fact that expected values calculated with Eq. 3 would gravitate towards average values and away from extremes. Fig. B1 suggests the sub-level assessments have a wider spread compared to the expected values. This characteristic of expected values could be worth discussing in more detail (a brief numerical sketch follows this list).
- I appreciate several interesting results from this study, including how the model often predicted lower danger than human forecasters, responded to increases/decreases in danger faster than humans, and showed overall poorer performance for persistent weak layer problems. It was encouraging to see the recommendations for improving performance on persistent weak layers, as this seems to be the biggest limitation for operational adoption.
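The following is a purely illustrative sketch of the point above about expected values gravitating toward the middle of the scale (the probability vectors are hypothetical, not taken from the paper): even when one danger level clearly dominates, the residual probability mass pulls the expected value away from the extremes, so expected values near levels 1 and 4 show a narrower spread than the discrete sub-level scale allows.

```python
import numpy as np

levels = np.array([1, 2, 3, 4])

# Hypothetical probability vectors for a confident "high" (4) and a
# confident "low" (1) prediction; values are illustrative only.
confident_high = np.array([0.02, 0.08, 0.20, 0.70])
confident_low  = np.array([0.70, 0.20, 0.08, 0.02])

# Expected danger is pulled toward the centre of the scale in both cases.
print(np.dot(levels, confident_high))  # ~3.58 rather than 4
print(np.dot(levels, confident_low))   # ~1.42 rather than 1
```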
Technical comments
- Line 20: Several hundred million Swiss francs “per year”?
- Line 48: Consider a narrative in-text citation to be very clear that “a model” is precisely Pérez-Guillén et al. (2022), which may not be clear with a parenthetical citation.
- Fig. 1 and 7: These figures label wind-drifted snow problems as “snowdrift/SD” instead of “wind slab/WS” as defined in line 68.
- Eq. 1: The lowercase probability from each tree (p_t) is not defined.
- Fig. 2: This is an excellent and clear illustration that helps the reader understand a complex model system.
- Line 157-163: I found these lines difficult to understand until reading Appendix B. Consider moving some details from the appendix to the main text for clarity. This also warrants its own paragraph. I assume that the nowcast versus forecast comparison involved comparing a nowcast with the forecast issued 24 hours earlier — if so, this could be stated explicitly. Additionally, please clarify that the rounding strategy was applied to the expected danger values, as it is initially unclear which variable is being rounded for the comparison.
- Line 183: Were the forecast predictions often higher than nowcast predictions due to a systematic bias in the COSMO forecasts compared to what was measured at stations? This could be worth discussing later, and perhaps linking with the case where COSMO underestimated precipitation from March 15 to 17.
- Line 192: “Frequently” or “often” are better choices than “essentially always”.
- Fig. 4: Perhaps clarify in the caption that the numbers below the percentages represent counts.
- Sect 5.2.2: It would be interesting to discuss possible reasons for trends in model performance on flat/south/north aspects in the discussion section, perhaps linking with which meteorological/snowpack features may be causing the differences.
- Line 216: Why is Table A2 with nowcast predictions cited when sections 5.2.2 onward are supposed to focus on forecast predictions? Table 1 seems like a better citation showing many predictions were within one level.
- Line 227: The phrase "model bias was towards the forecast in the bulletin" is unclear. Bias typically suggests a consistent directional trend, but Fig. 5b shows that when the model assigns the highest probability to a rating different from the bulletin, its second-highest probability often aligns with the bulletin rating. This doesn’t necessarily suggest a positive/negative bias.
- Fig. 5: The second-row plot for level 2 has some random characters in the middle (7BA7F5).
- Line 237: The phrase “showing a decrease in the number of samples with a larger difference” is unclear.
- Line 261: The model's response to precipitation several hours earlier might be attributed to its 3-hour temporal resolution, compared to the 24-hour resolution of the bulletin. Similarly, the patterns in Fig. 7 could reflect both the different temporal resolutions and the inherent differences between the two methods. A fairer comparison might involve using only the 1800 LT model predictions, unless the goal was to emphasize the advantages of higher temporal resolution.
- Line 288: I agree that the SHAP distribution of MS_Snow for levels 1 and 2 are inverted compared to level 4; however, the distribution for level 3 appears to be scattered.
- Fig. 9: The plot titles for Moderate and Considerable are incorrect. Additionally, could the top panel legend for the black and blue lines use consistent terms from the rest of the paper? I assume the black line is the sub-levels forecast in the bulletin and the blue line is the expected danger from the model.
- Line 310-311: It would be more intuitive to explain the transition from level 1 to level 2 before discussing the transition from level 2 to level 3. This order would make it easier for readers to follow and avoid confusion (I initially looked at the wrong table when trying to visualize the thresholds). Additionally, it might be helpful to guide readers on how to interpret approximate thresholds from Fig. 9. You could clarify this by stating: "Approximate thresholds for a given danger level can be estimated by identifying feature values when the SHAP values switch from negative to positive" (assuming this was the method). (A minimal sketch of this sign-change reading follows the technical comments.)
- Line 318: MS_Snow is not shown for level 1.
- Line 324-326: Why would an unstable snowpack with regards to natural avalanches favour level 2? This would make sense for higher danger levels, but I would expect natural avalanches to be unlikely for level 2. The fact that low Sn values favour levels 2 and 4, while high Sn values favour levels 1 and 3, suggests the impact of this variable may not be that simple.
- Line 333: Why is Fig. 6 cited here?
- Line 371: Wrong citation style.
- Appendices: The grouping of tables and figures into 2 appendices seems illogical. Why is Fig. B1 included in an appendix titled “Evaluation Metrics”? Consider splitting these into separate appendices.
- Line 490: Dbu,a should be defined here.
- Line 577: Is there a reason both the discussion paper and final paper are listed?
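Regarding the suggestion in the comment on lines 310-311 above, a minimal sketch of reading an approximate threshold from the point where SHAP values switch sign; it assumes arrays of feature values and per-sample SHAP contributions for one danger-level class are already available (the function name and example numbers are hypothetical, not taken from the paper):

```python
import numpy as np

def approximate_threshold(feature_values, shap_values, n_bins=20):
    """Estimate the feature value at which the mean SHAP contribution
    changes sign from negative to positive, by binning the sorted feature."""
    order = np.argsort(feature_values)
    x = np.asarray(feature_values, dtype=float)[order]
    s = np.asarray(shap_values, dtype=float)[order]
    for idx in np.array_split(np.arange(len(x)), n_bins):
        if len(idx) and s[idx].mean() > 0:   # first bin favouring this level
            return x[idx].mean()             # approximate threshold value
    return None                              # SHAP values never turn positive

# Hypothetical example: 3-day new-snow sums and their SHAP values
# for the "considerable" class (illustrative numbers only).
hn72 = np.array([0, 5, 10, 15, 20, 25, 30, 40, 50, 60], dtype=float)
shap_l3 = np.array([-0.3, -0.25, -0.2, -0.1, -0.02, 0.05, 0.1, 0.2, 0.3, 0.35])
print(approximate_threshold(hn72, shap_l3, n_bins=5))  # ~22.5
```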
Citation: https://doi.org/10.5194/egusphere-2024-2374-RC1
RC2: 'Comment on egusphere-2024-2374', Spencer Logan, 22 Oct 2024
Excellent presentation of the model chain and results.
Citation: https://doi.org/10.5194/egusphere-2024-2374-RC2
RC3: 'Comment on egusphere-2024-2374', Karsten Müller, 08 Nov 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-2374/egusphere-2024-2374-RC3-supplement.pdf
Model code and software
deapsnow_live_v1 Cristina Pérez-Guillén, Martin Hendrick, Frank Techel, Tasko Olevski, and Michele Volpi https://gitlabext.wsl.ch/perezg/deapsnow_live_v1