Understanding the relationship between streamflow forecast skill and value across the western US

Modi, Parthkumar A.; Carbone, Jared C.; Jennings, Keith S.; Kamen, Hannah; Kasprzyk, Joseph R.; Szafranski, Bill; Wobus, Cameron W.; Livneh, Ben

doi:https://doi.org/10.5194/egusphere-2024-4046

Preprints

14 Jan 2025

Abstract. Accurate seasonal streamflow forecasts are essential for effective decision-making in water management. In a decision-making context, it is important to understand the relationship between forecast skill (the accuracy of forecasts against observations) and forecast value, which is the forecast’s economic impact assessed by weighing potential mitigation costs against potential future losses. This study explores how errors in these probabilistic forecasts can reduce their economic “value”, especially during droughts when decision-making is most critical. This value varies by region and is contextually dependent, which often limits retrospective insights to specific operational water management systems. Additionally, the value is shaped by the intrinsic qualities of the forecasts themselves. To address this gap, this study examines how forecast skill transforms into value for true forecasts (using real-world models) in unmanaged snow-dominated basins that supply flows to downstream managed systems. We measure forecast skill using quantile loss and quantify forecast value through the Potential Economic Value framework. The framework is well-suited for categorical decisions and uses a cost-loss model, where the economic implications of both correct and incorrect decisions are considered for a set of hypothetical decision-makers. True forecasts are included, made with commonly used models within an Ensemble Streamflow Prediction (ESP) framework using a process-based hydrologic modeling system, WRF-Hydro; a deep learning model, Long Short-term Memory Networks; as well as operational forecasts from the NRCS. To better interpret the relationship between skill and value, we compare true forecasts with synthetic forecasts that are created by imposing regular error structures on observed streamflow volumes. We evaluate the sensitivity of skill and value from both synthetic and true forecasts to fundamental statistical measures: errors in mean and standard deviation.
Our findings indicate that errors in mean and standard deviation consistently explain variations in forecast skill for true forecasts. However, these errors do not fully explain the variations in forecast value across the basins, primarily due to irregular error structures, which impact categorical measures such as hit and false alarm rates, so that high forecast skill does not necessarily result in high forecast value. We identify two key insights: first, hit and false alarm rates capture variability in forecast value more effectively than errors in mean and standard deviation; second, the relationship between forecast skill and value shifts monotonically with drought severity. These findings emphasize the need for a deeper understanding of how forecast performance metrics relate to both skill and value, highlighting the complexities in assessing the effectiveness of forecasting systems.
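
The idea of a synthetic forecast with an imposed error structure can be sketched roughly as follows. This is an illustration only, not the authors' procedure: the function name, the fractional definition of the mean and standard deviation errors, the base spread, the ensemble size, and the clipping at zero are all assumptions. The normal distribution is taken from the referee comments, which mention a normally distributed ensemble.

```python
import numpy as np

def synthetic_ensemble(obs_volume, base_sd, mean_error=0.0, sd_error=0.0,
                       n_members=30, seed=0):
    """Return a hypothetical synthetic ensemble for one forecast year.

    mean_error and sd_error are fractional errors (an assumed convention):
    mean_error=0.1 shifts the ensemble mean 10% above the observation,
    sd_error=-0.5 halves the ensemble spread relative to base_sd.
    """
    if sd_error < -1.0:
        raise ValueError("sd_error must be >= -1 so the spread stays non-negative")
    rng = np.random.default_rng(seed)
    mean = obs_volume * (1.0 + mean_error)
    sd = base_sd * (1.0 + sd_error)
    members = rng.normal(loc=mean, scale=sd, size=n_members)
    # Seasonal volumes cannot be negative, so clip at zero (an assumption).
    return np.clip(members, 0.0, None)
```

Sweeping `mean_error` and `sd_error` over a grid would give one ensemble per error combination, which could then be scored for skill and value in the same way as a true forecast.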

Received: 20 Dec 2024 – Discussion started: 14 Jan 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 3313 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

22 Oct 2025

Understanding the relationship between streamflow forecast skill and value across the western US

Parthkumar A. Modi, Jared C. Carbone, Keith S. Jennings, Hannah Kamen, Joseph R. Kasprzyk, Bill Szafranski, Cameron W. Wobus, and Ben Livneh

Hydrol. Earth Syst. Sci., 29, 5593–5623, https://doi.org/10.5194/hess-29-5593-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-4046', Anonymous Referee #1, 17 Feb 2025

Summary
The manuscript by Modi et al. presents a study on the link between forecast skill and value, in the case of a sample of unmanaged, snow-dominated stations in the United States. The authors focus on the prediction of low AMJJ volumes issued based on the ESP method with a distributed model and an LSTM model, or taken directly from the NRCS operational forecasts. Synthetic forecasts based on streamflow climatology and introducing deviations in mean and in standard deviation serve as a reference to assess errors in true forecasts and derive skill and value for controlled forecast errors. Results reveal a symmetry in forecast skill, but an asymmetry in forecast value, and discuss the inadequacy of the initially chosen skill metric to explain value.
The paper is of very high quality, well written and very well illustrated. It tackles a lot of different scientific objectives, which include comparing the chosen LSTM and distributed models, and studying the relationship between skill and value in controlled and real forecast systems. I was unsure to what extent the first objective serves the second, because the paper becomes lengthy with information that is secondary to the skill-value relationship. Nevertheless, I recommend this paper for publication provided that the points below are addressed or commented on.

General comments
Both WRFH and the LSTM generate daily streamflow volumes summed up to generate AMJJ volumes. In the case of the LSTM, it is not clear why the model was not trained on AMJJ volumes directly.
Section 2.1.3: The same cost is used for hits and false alarms. One could argue that a false alarm does more damage than just the preventive cost since it may deteriorate trust in or reputation of the decision-making institutions. This is not accounted for here, but it would be worth discussing.
Throughout the results section, and related to Figure 8 and L237, it was not clear to me which tau value was chosen, or if a range of tau values was used in the assessment of the APEVmax. I think this point requires clarification in Section 2.1.3, and potentially reminders in the interpretation of results.
Section 2.1.1 could benefit from a few clarifications. In particular the phrases “percentile of dryness” or “driest 2% conditions” were a bit unclear. The variables are listed, but the time step or period to be considered are unclear, and would be interesting to have for reference. It would also be interesting to add a sentence to state why this choice of methodology here, and why deviate from the methodology proposed by the USDM. The length of the historical period would also be interesting to have at this stage.
Section 2.1.2: A sum may be missing (in the equation or in the text) to compute losses for several forecasts. Related to this, the term n is not defined. Regarding notations, z is rather a probability of exceedance/non-exceedance associated with quantile y_z. Related to the final discussion on the inadequacy of this skill metric to reflect value, the equal weighting of the 3 quantiles is probably not suitable, nor resembling actual decision-making contexts (reflecting unequal importance on high/low volumes or asymmetrical decision thresholds). Could the author discuss this? Could another weighting or picking of quantile values be enough to match value patterns?
Section 2.3: This section may benefit from some discussion points about the choice of a normally distributed ensemble, which later appears to be a limit, about the fact that forecasts often overestimate in dry conditions and underestimate in wet conditions which is not mimicked here, about the fact that deviations applied reflect errors in mean (bias) or characteristics in terms of spread (sharpness), but the likely important feature here is rather discrimination, which is not experimented on. Related to Figure 4, a comment on the year-to-year variation in the forecast would be helpful. Are they solely due to the exclusion of the forecast year?
Section 2.5: Please clarify the RMAD criterion, in particular how errors between true and synthetic forecasts are calculated given that they are ensemble forecasts.
Throughout the manuscript and more specifically L521 “higher forecast skill and value were associated with negative errors in standard deviation”: “negative errors in standard deviation” can be misleading. Changes in sharpness, in themselves, are not errors if they are not associated with an absence of bias (seen in the skill matrices). Sharpness is not a performance metric. Here negative errors in standard deviation rather mean that the ensemble is close to a deterministic forecast, which is only associated with high forecast skill if, and only if, the forecasts are not biased. I recommend changing the phrase error in standard deviation throughout the paper, and recommend the following paper: Gneiting, T., Balabdaoui, F., Raftery, A.E., 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 243–268.
Figure 12: I don’t understand why the three models appear in Figure 12a if this figure shows the synthetic forecasts. If my understanding is correct, only observations are used to generate the synthetic forecasts. Could you please clarify? Also L611 “the true LSTM and the corresponding synthetic forecast” has me confused.
The use of term “skill” in the paper is not always consistent: L630 “better captured in categorical measures than skill”: In the way the word “skill” is used in this study, it is not a metric (sometimes the case when it is the comparison of the performance of a forecast system with the performance of a benchmark) but rather a term used to qualify the performance of the forecast. Based on the use of the word “skill” in this study, categorical measures could very well be metrics used to define forecast skill. I suggest rephrasing. Also L654 “forecast skill generally reflects the accuracy of forecasts” can be unclear as accuracy can be perceived as one feature of forecast skill.
Key references are well used to corroborate or discuss results or limits of this work. Some references about asymmetry in decision-making, about synthetic forecasts, or about the need for adequacy between skill and value metrics can be found in the following works. I let the authors consider their relevance for their work:
Peñuela, A., Hutton, C., Pianosi, F., 2020. Assessing the value of seasonal hydrological forecasts for improving water resource management: insights from a pilot application in the UK. Hydrology and Earth System Sciences 24, 6059–6073. https://doi.org/10.5194/hess-24-6059-2020
Rouge, C., Peñuela, A., Pianosi, F., 2023. Forecast Families: A New Method to Systematically Evaluate the Benefits of Improving the Skill of an Existing Forecast. Journal of Water Resources Planning and Management 149, 04023015. https://doi.org/10.1061/JWRMD5.WRENG-5934
Crochemore, L., Materia, S., Delpiazzo, E. et al. 2024. A framework for joint verification and evaluation of seasonal climate services across socio-economic sectors. Bulletin of the American Meteorological Society. https://doi.org/10.1175/BAMS-D-23-0026.1
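
To make the Section 2.1.2 comment above concrete, the following sketch shows the standard quantile (pinball) loss with the sum over n forecasts written out, followed by a weighted combination across quantile levels. It is not the manuscript's equation: the function names, the levels used in the usage note, and the optional `weights` argument are hypothetical and only illustrate what equal versus unequal weighting of quantiles would look like.

```python
import numpy as np

def pinball_loss(obs, q_forecast, z):
    """Quantile (pinball) loss for level z, averaged over the n forecasts."""
    obs = np.asarray(obs, dtype=float)
    q_forecast = np.asarray(q_forecast, dtype=float)
    diff = obs - q_forecast
    # Under-forecasts are penalized by z, over-forecasts by (1 - z).
    return float(np.mean(np.maximum(z * diff, (z - 1.0) * diff)))

def combined_quantile_loss(obs, forecasts_by_level, weights=None):
    """Weighted average of pinball losses across quantile levels.

    forecasts_by_level maps a level z to an array of forecast quantiles.
    weights=None gives equal weighting; any other weights are hypothetical.
    """
    levels = sorted(forecasts_by_level)
    if weights is None:
        weights = {z: 1.0 for z in levels}
    total = sum(weights[z] for z in levels)
    return sum(weights[z] * pinball_loss(obs, forecasts_by_level[z], z)
               for z in levels) / total
```

For instance, calling `combined_quantile_loss(obs, {0.1: q10, 0.5: q50, 0.9: q90}, weights={0.1: 2.0, 0.5: 1.0, 0.9: 1.0})` would put more emphasis on the low-volume quantile, which is one way to explore whether a different weighting tracks value more closely.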

Detailed comments
L58-59: Please review the definition of probabilistic seasonal streamflow forecasts, as it is not necessarily volumes, the concept of season can be unclear, and the cited methods are not always combined.
L65: A definition and references for ESP would be necessary. Consider the following work: Day, G., 1985. Extended Streamflow Forecasting Using NWSRFS. J. Water Resour. Plann. Manage. 111, 157–170.
L66 “more accurate”: I am unsure whether this is about accuracy since ESP relies on climatology. It is rather about using outputs from dynamical meteorological or climate model instead.
L115-116 “particularly during extreme events like droughts”: references would be needed to support this. Consider the following work: Giuliani, M., Crochemore, L., Pechlivanidis, I., Castelletti, A., 2020. From skill to value: isolating the influence of end user behavior on seasonal forecast assessment. Hydrology and Earth System Sciences 24, 5891–5902. https://doi.org/10.5194/hess-24-5891-2020
L124 “forecasts respond to fundamental statistical measures” Consider reformulating.
L128 and elsewhere: the word “evaluate” may be ambiguous in a paper about forecast value if it is used for both skill and value. “assess” could be a more neutral option.
L128: Section 2.1.2 is rather about defining drought
L135 “fundamental performance metrics” as above, the choice of the adjective “fundamental” is not clear to me. I would suggest reformulating or clarifying.
L196-197: “rate of occurrence” I suggest introducing s here.
L224-225: This sentence is key for understanding these 3 parameters. I would suggest placing it earlier in the section.
L234-239: How are negative values accounted for when calculating the area? Is it possible to have negative PEVmax values?
L239: While 0 is the theoretical minimum, 0.9 seems to be the observed maximum. If that is correct, I suggest clarifying this by stating the theoretical maximum (infinity?) before this observed maximum.
L247 “snow-dominated basins (i.e., unmanaged headwater systems)”: the correspondence between the two basin types is not direct. Some snow-dominated basins in areas with altitude gradients can be heavily managed/influenced by hydropower dams. Please clarify.
L269: Give the full name for SWE as this is the first occurrence.
L269: “water-year-to-date” may be worth explaining in its first occurrence as well.
L270: The reference to Table A1 is not clear to me.
L283 “WY2006-2022”: this notation used throughout the manuscript should be explained here.
L290: I suggest citing the number of ensemble members used in practice here
L315 “snowpack information in the form of snow water equivalent”: Based on Figure 5, it seems only to be the case for the LSTM. Is this information also used in the case of WRFH?
L315: The reference to Table A1 does not seem correct.
L329: the probabilities extracted from the forecast ensembles can only be comparable if the ensembles have the same number of members. Here it seems to be the case (Section 2.4.1), but I suggest mentioning this here in Section 2.4 already for clarification.
L350 “~20-30 years”: In Figure 6, 23 years are mentioned, and L387 and 419, you mention the period 1983-2022 (40 years minus the forecast year). Please clarify.
L377 Here as well, the reference to Table A1 does not seem to match its content.
L389-390: I suggest the term “initial states” instead of “memory states”
Sections 2.4.2 and 2.4.3: it would be helpful to state in these sections the years used for model training/calibration (now in Appendices), in addition to the years of historical meteorology inputs and for which the forecasts are generated (already clear).
Section 2.5: It is generally advised to use different metrics for calibration/training and verification/validation. It is not clear here which metrics are used for which purpose.
L473: If my understanding is correct, there is a single AMJJ value per year for the period 2006-2022 (17 values). How many years remain once only dry years are selected? Can it really ensure robust results for the rest of the study?
L481 “As errors in mean or standard deviation increase beyond these ranges, forecast skill worsens”: This is arguable. Here the standard deviation reflects the sharpness of the probabilistic forecast. However, sharpness is a forecast characteristic rather than a performance metric.
L483: In the text, values seem to reach 0.9, but in Figure 8, the color scale ends at 0.5.

L496-498: This asymmetry is indeed interesting, and would benefit from some further discussion as to which parameters in the methodology cause this effect (AMJJ variable bounded by 0, tau value, alpha-s relationship and frequency of occurrence below 0.5 for droughts, …).
L532 “Each dot in Fig. 10 represents a basin with colors showing the median skill and value”: is it just skill?
L580: Given that only a type of forecast skill is investigated here, I suggest the following “This skill-value comparison between synthetic and true forecast systems indicates that factors beyond forecast skill, as defined in this study…”
L653: References would be helpful at the end of this sentence.
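
Several of the comments above (the L329 remark on ensemble-derived probabilities and the L473 remark on the number of dry years) concern how categorical measures are obtained from ensembles. A minimal sketch is given below, under the assumptions that the drought event is a volume falling below a threshold and that a warning is issued when the forecast probability reaches a decision threshold. The names `drought_probability`, `hit_and_false_alarm_rates`, and `prob_threshold` are hypothetical and do not come from the manuscript.

```python
import numpy as np

def drought_probability(ensemble, threshold):
    """Fraction of ensemble members falling below the drought threshold."""
    ensemble = np.asarray(ensemble, dtype=float)
    return float(np.mean(ensemble < threshold))

def hit_and_false_alarm_rates(event_probs, event_observed, prob_threshold):
    """Hit rate and false alarm rate for one probability decision threshold."""
    warn = np.asarray(event_probs, dtype=float) >= prob_threshold
    obs = np.asarray(event_observed, dtype=bool)
    hits = int(np.sum(warn & obs))
    misses = int(np.sum(~warn & obs))
    false_alarms = int(np.sum(warn & ~obs))
    correct_negatives = int(np.sum(~warn & ~obs))
    # Rates are undefined (NaN) when there are no events or no non-events.
    n_events = hits + misses
    n_non_events = false_alarms + correct_negatives
    hit_rate = hits / n_events if n_events else float("nan")
    false_alarm_rate = false_alarms / n_non_events if n_non_events else float("nan")
    return hit_rate, false_alarm_rate
```

With only 17 forecast years, the number of observed drought events in the denominator of the hit rate can be very small, which is where the robustness concern raised for L473 becomes visible in practice.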

Figures
Figure 1: Equation numbers preceded by minus signs can be confusing. In the caption, “forecast probabilities are calculated from probabilistic forecasts” is redundant. Do you mean “threshold exceedance probabilities are calculated from probabilistic forecasts”?
Figure 2: The arrows used to indicate the cases when C<0 and C>L point to ranges where C>0 and C
Figure 4: This figure is helpful. It may be worth mentioning in sub-figure (a) that the AMJJ volume for the forecast year is excluded, if that is the case.
Figure 5: “Historical meteorology” and “Basin attributes” are rather vague. I suggest specifying these to better highlight the differences between the three types of modelling/forecasting chains.
Figure 7: The caption should explain the difference between shaded areas (distributions over the 76 basins) and the vertical lines.
Figure 8e: It is not clear why there is a miss based on the time series plot.
Figure 9: “Synthetic errors”: if they are calculated from the true forecasts, these are no longer synthetic errors.

Typos
L63: Give the full name for NRCS
L122 “performance of true forecasts against observations generated in this study” can be unclear as to what is generated in this study.
L123: “models”
L144: “when the AMJJ streamflow volume falls below”
L205 “where the value of”
L232: “REV” instead of PEV.
L260 “with one or fewer” : not sure about what this means, maybe this is correct.

L320 “statistical forecasts (…) operational forecasts”
L452-454: “during WY2001-2010” appears twice in this sentence.
L456: “for both models”
L458 “These results suggest that the LSTM models, particularly LSTM”
Figure A3 is titled Figure A2 and referenced as Figure A3 in the text.
L473 “only for the drought years only”
L493: “the higher number of false alarms reduces”
L529 and L571 “the three true forecasts”

Citation: https://doi.org/10.5194/egusphere-2024-4046-RC1
- AC1: 'Reply on RC1', Parthkumar Modi, 24 Mar 2025
  
  Dear Referee 1,
  Thank you for your valuable feedback. Please find attached the detailed response document.
  We are grateful for your time and effort in reviewing our work.
  Regards,
  Parth
  
  Citation: https://doi.org/10.5194/egusphere-2024-4046-AC1
RC2:
'Comment on egusphere-2024-4046', Anonymous Referee #2, 20 Feb 2025

Dear authors,

Thank you for the interesting paper, which, in my view, reports extensive, well-documented research on seasonal hydrological forecasting, built around the observation and concern that many forecast verification publications do not report performance in terms of potential added value for decision making (e.g. through PEV, HR, FR). I support your call in the final sentence of your paper to '..adopt more sophisticated forecast evaluation approaches that prioritize forecast value..'
I do have the following general comments and questions:

I miss in the Introduction and the Discussion and Conclusion the clear recognition that the various performance scores and skill scores have been designed for, and serve, their own purpose. Continuous scores assessing accuracy (e.g. mean error, NSE), reliability (BS), overall performance (CRPS), etc., have been designed and are primarily used to intercompare forecasting systems and measure progress. This is, I believe, an understandable reason why in most scientific literature introducing a new or updated forecasting system, focus has been on such metrics. The potential economic value metric, and others based on contingency tables, and on multi-class decision problems, have been designed to assess and analyse forecast performance for operation and decision making, e.g. for specific applications informing the forecast and user community whether the performance is potentially good enough to use the forecasts and provide guidance on how to use them.

I would kindly request the authors to reflect on which findings were to be expected, and which were surprising and why. E.g., with hit rate as a positive term and false alarm rate as a negative term in the definition of potential economic value, they indeed explain the variability in PEV. And with PEV assessed for low flow warnings, it is perhaps as expected that errors in mean and standard deviation do not work through to PEV in a consistent way?

The results are presented for three different forecasting systems, such that perhaps the following question can also be addressed in the paper: Would the differences in error, quantile loss, and PEV lead to different conclusions on which forecasting system best to use for low flow forecasting? And then for more general discussion/conclusion: Do your results indicate, or not, if there is a risk in potential users referring to papers intended for measuring performance progress (publishing only overall performance metrics) when selecting a forecasting system to use?
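
The point above about hit rate entering positively and false alarm rate negatively can be seen in the standard static cost-loss expression for relative economic value, sketched below. This is the textbook form, assumed here for illustration; the manuscript's Potential Economic Value framework may include further elements (for example the tau parameter and the APEVmax summary mentioned by Referee #1) that are not reproduced. The function and argument names are hypothetical.

```python
def relative_economic_value(hit_rate, false_alarm_rate, base_rate, cost_loss_ratio):
    """Relative value of a forecast in the static cost-loss model.

    Expenses are expressed per unit loss L, so cost_loss_ratio = C / L.
    Returns 1 for a perfect forecast and 0 for the best climatological action.
    """
    s, a = base_rate, cost_loss_ratio
    if not (0.0 < s < 1.0 and 0.0 < a < 1.0):
        raise ValueError("base_rate and cost_loss_ratio must lie strictly in (0, 1)")
    # Climatology: always act (expense a) or never act (expense s), whichever is cheaper.
    e_climate = min(a, s)
    # Perfect forecast: pay the cost only when the event occurs.
    e_perfect = s * a
    # Real forecast: cost on hits and false alarms, full loss on misses.
    e_forecast = a * (hit_rate * s + false_alarm_rate * (1.0 - s)) + s * (1.0 - hit_rate)
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Expanding the numerator gives min(a, s) - F a (1 - s) + H s (1 - a) - s, so for a fixed base rate and cost-loss ratio the value rises linearly with the hit rate H and falls linearly with the false alarm rate F.
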
Detailed comments

Methodology: The paper seems to be quite long. The introduction and explanation of the development of LSTM and WRFH models and forecasts is extensive, while comparing them or whether one performance metric relates differently to the next for each of the models is not the focus of the paper. Consider further shortening the model and forecast descriptions and moving more information to the Annex.

Consider leaving out Figure 1 as the procedure is explained in the text as well and references to PEV papers have been given.

Consider leaving out Figure 5, as well as reducing the level of detail in explaining the steps of generating forecasts as this is I believe well-known by most people experienced in or interested in hydrometeorological forecasting.

Results: Heading 3.3.3 Consider just 'Hit and False Alarm Rate and forecast value', because HR and FR are per definition better estimators of forecast value (because expressed as PEV in this paper).

Consider merging section 3.3.4 (value (defined as PEV) is per definition largely explainable by hit and false alarm rates) with 3.3.3

Discussion:

p32 l653 - p33 l688 reads mostly as a summary of the paper, which fits better in a shortened version in Conclusions. Consider leaving out here.

Conclusions:

p35 l730 - 735. These sentences discuss LSTM outperforming the conceptual model, while such analysis is I believe not the main objective of this paper. Consider leaving out or reducing here.

l736: reformulate because now reads as if defining skill as error, while forecast skill is defined as improvement over a reference forecasting system (in this paper always climatology).

l743-744: "This disconnect is further compounded.." I do not understand what is intended here. Please kindly clarify.
Editorial comments

p9 l232: REV should probably be PEV

p22: check caption. Default and calibrated, and initial and final models are mentioned in the caption, but not shown in the legend or figure.

p30 l605: consider "..a low false alarm rate limits unnecessary.."

p30 l611-612: "..LSTM forecasts.." "..synthetic forecasts.."

p30 l616: "(Fig. 14a - right)"

p31: Caption Figure 14, l636: consider "..to each forecast system."

p35 l737: "..- exhibit complex.."

p35 l738-739: kindly clarify, "skill was more sensitive to error and SD", compared to what?

I prefer Annex to be after Reference list, but this is probably governed by HESS.

Citation: https://doi.org/10.5194/egusphere-2024-4046-RC2
- AC2: 'Reply on RC2', Parthkumar Modi, 24 Mar 2025
  
  Dear Referee 2,
  Thank you for your valuable feedback. Please find attached the detailed response document.
  We are grateful for your time and effort in reviewing our work.
  Regards,
  Parth
  
  Citation: https://doi.org/10.5194/egusphere-2024-4046-AC2

L239: While 0 is the theoretical minimum, 0.9 seems to be the observed maximum. If that is correct, I suggest clarifying this by stating the theoretical maximum (infinity?) before this observed maximum.
L247 “snow-dominated basins (i.e., unmanaged headwater systems)”: the correspondence between the two basin types is not direct. Some snow-dominated basins in areas with altitude gradients can be heavily managed/influenced by hydropower dams. Please clarify.
L269: Give the full name for SWE as this is the first occurrence.
L269: “water-year-to-date” may be worth explaining in its first occurrence as well.
L270: The reference to Table A1 is not clear to me.
L283 “WY2006-2022”: this notation used throughout the manuscript should be explained here.
L290: I suggest citing the number of ensemble members used in practice here
L315 “snowpack information in the form of snow water equivalent”: Based on Figure 5, it seems only to be the case for the LSTM. Is this information also used in the case of WRFH?
L315: The reference to Table A1 does not seem correct.
L329: the probabilities extracted from the forecast ensembles can only be comparable if the ensembles have the same number of members. Here it seems to be the case (Section 2.4.1), but I suggest mentioning this here in Section 2.4 already for clarification.
L350 “~20-30 years”: In Figure 6, 23 years are mentioned, and L387 and 419, you mention the period 1983-2022 (40 years minus the forecast year). Please clarify.
L377 Here as well, the reference to Table A1 does not seem to match its content.
L389-390: I suggest the term “initial states” instead of “memory states”
Sections 2.4.2 and 2.4.3: it would be helpful to state in these sections the years used for model training/calibration (now in Appendices), in addition to the years of historical meteorology inputs and for which the forecasts are generated (already clear).
Section 2.5: It is generally advised to use different metrics for calibration/training and verification/validation. It is not clear here which metrics are used for which purpose.
L473: If my understanding is correct, there is a single AMJJ value per year for the period 2006-2022 (17 values). How many years remain once only dry years are selected? Can it really ensure robust results for the rest of the study?
L481 “As errors in mean or standard deviation increase beyond these ranges, forecast skill worsens”: This is arguable. Here the standard deviation reflects the sharpness of the probabilistic forecast. However, sharpness is a forecast characteristic rather than a performance metric.
L483: In the text, values seem to reach 0.9, but in Figure 8, the color scale ends at 0.5.

L496-498: This asymmetry is indeed interesting, and would benefit from some further discussion as to which parameters in the methodology cause this effect (AMJJ variable bounded by 0, tau value, alpha-s relationship and frequency of occurrence below 0.5 for droughts, …).
L532 “Each dot in Fig. 10 represents a basin with colors showing the median skill and value”: is it just skill?
L580: Given that only a type of forecast skill is investigated here, I suggest the following “This skill-value comparison between synthetic and true forecast systems indicates that factors beyond forecast skill, as defined in this study…”
L653: References would be helpful at the end of this sentence.

Figures
Figure 1: Equation numbers preceded by minus signs can be confusing. In the caption, “forecast probabilities are calculated from probabilistic forecasts” is redundant. Do you mean “threshold exceedance probabilities are calculated from probabilistic forecasts”?
Figure 2: The arrows used to indicate the cases when C<0 and C>L point to ranges where C>0 and C
Figure 4: This figure is helpful. It may be worth mentioning in sub-figure (a) that the AMJJ volume for the forecast year is excluded, if that is the case.
Figure 5: “Historical meteorology” and “Basin attributes” are rather vague. I suggest specifying these to better highlight the differences between the three types of modelling/forecasting chains.
Figure 7: The caption should explain the difference between shaded areas (distributions over the 76 basins) and the vertical lines.
Figure 8e: It is not clear why there is a miss based on the time series plot.
Figure 9: “Synthetic errors”: if they are calculated from the true forecasts, these are no longer synthetic errors.

Typos
L63: Give the full name for NRCS
L122 “performance of true forecasts against observations generated in this study” can be unclear as to what is generated in this study.
L123: “models”
L144: “when the AMJJ streamflow volume falls below”
L205 “where the value of”
L232: “REV” instead of PEV.
L260 “with one or fewer” : not sure about what this means, maybe this is correct.

L320 “statistical forecasts (…) operational forecasts”
L452-454: “during WY2001-2010” appears twice in this sentence.
L456: “for both models”
L458 “These results suggest that the LSTM models, particularly LSTM”
Figure A3 is titled Figure A2 and referenced as Figure A3 in the text.
L473 “only for the drought years only”
L493: “the higher number of false alarms reduces”
L529 and L571 “the three true forecasts”

Citation: https://doi.org/10.5194/egusphere-2024-4046-RC1
- AC1: 'Reply on RC1', Parthkumar Modi, 24 Mar 2025
  
  Dear Referee 1,
  Thank you for your valuable feedback. Please find attached the detailed response document.
  We are grateful for your time and effort in reviewing our work.
  Regards,
  Parth
  
  Citation: https://doi.org/10.5194/egusphere-2024-4046-AC1
- RC2: 'Comment on egusphere-2024-4046', Anonymous Referee #2, 20 Feb 2025

Dear authors,

Thank you for the interesting paper, which, in my view, reports extensive, well-documented research on seasonal hydrological forecasting, built around the observation and concern that many forecast verification publications do not report performance in terms of potential added value for decision making (e.g. through PEV, HR, FR). I support your call in the final sentence of your paper to '..adopt more sophisticated forecast evaluation approaches that prioritize forecast value..'
I do have the following general comments and questions:

I miss in the Introduction and the Discussion and Conclusion the clear recognition that the various performance scores and skill scores have been designed for, and serve, their own purpose. Continuous scores assessing accuracy (e.g. mean error, NSE), reliability (BS), overall performance (CRPS), etc., have been designed and are primarily used to intercompare forecasting systems and measure progress. This is, I believe, an understandable reason why most scientific literature introducing a new or updated forecasting system has focused on such metrics. The potential economic value metric, and others based on contingency tables and on multi-class decision problems, have been designed to assess and analyse forecast performance for operation and decision making, e.g. for specific applications, informing the forecast and user community whether the performance is potentially good enough to use the forecasts and providing guidance on how to use them.

I would kindly request the authors to reflect on which findings were to be expected, and which were surprising and why. E.g., with hit rate as a positive term and false alarm rate as a negative term in the definition of potential economic value, they indeed explain the variability in PEV. And with PEV assessed for low flow warnings, it is perhaps as expected that errors in mean and standard deviation do not work through to PEV in a consistent way?
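The dependence of value on hit rate and false alarm rate can be sketched with the standard cost-loss formulation of relative economic value. This is an illustrative sketch only, not the authors' implementation: the function name `relative_value` and the numeric inputs are hypothetical, `alpha` denotes the cost-loss ratio C/L, and `s` the rate of occurrence of the event. Expected expenses are normalized by the loss L.

```python
def relative_value(hit_rate, false_alarm_rate, s, alpha):
    """Relative economic value in a cost-loss model (illustrative sketch).

    hit_rate, false_alarm_rate: from the 2x2 contingency table.
    s: rate of occurrence of the event, 0 < s < 1.
    alpha: cost-loss ratio C/L, 0 < alpha < 1.
    """
    # Climatology: always act (expense alpha) or never act (expense s).
    e_climate = min(alpha, s)
    # Perfect forecast: pay the mitigation cost only when the event occurs.
    e_perfect = s * alpha
    # Actual forecast: cost on false alarms and hits, loss on misses.
    e_forecast = (
        false_alarm_rate * (1.0 - s) * alpha
        + hit_rate * s * alpha
        + (1.0 - hit_rate) * s
    )
    return (e_climate - e_forecast) / (e_climate - e_perfect)


# Hypothetical drought-like setting: rare event, modest cost-loss ratio.
s, alpha = 0.3, 0.2
print(relative_value(1.0, 0.0, s, alpha))  # perfect forecast
print(relative_value(0.8, 0.2, s, alpha))  # useful forecast
print(relative_value(0.5, 0.9, s, alpha))  # false alarms dominate
```

Under these assumptions the hit rate enters with a positive sign and the false alarm rate with a negative sign. Value can fall below zero when false alarms dominate, meaning the decision-maker would do better relying on climatology.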

The results are presented for three different forecasting systems, such that perhaps the following question can also be addressed in the paper: Would the differences in error, quantile loss, and PEV lead to different conclusions on which forecasting system best to use for low flow forecasting? And then for more general discussion/conclusion: Do your results indicate, or not, if there is a risk in potential users referring to papers intended for measuring performance progress (publishing only overall performance metrics) when selecting a forecasting system to use?
Detailed comments

Methodology: The paper seems to be quite long. The introduction and explanation of the development of LSTM and WRFH models and forecasts is extensive, while comparing them or whether one performance metric relates differently to the next for each of the models is not the focus of the paper. Consider further shortening the model and forecast descriptions and moving more information to the Annex.

Consider leaving out Figure 1 as the procedure is explained in the text as well and references to PEV papers have been given.

Consider leaving out Figure 5, as well as reducing the level of detail in explaining the steps of generating forecasts as this is I believe well-known by most people experienced in or interested in hydrometeorological forecasting.

Results: Heading 3.3.3 Consider just 'Hit and False Alarm Rate and forecast value', because HR and FR are per definition better estimators of forecast value (because expressed as PEV in this paper).

Consider merging section 3.3.4 (value (defined as PEV) is per definition largely explainable by hit and false alarm rates) with 3.3.3

Discussion:

p32 l653 - p33 l688 reads mostly as a summary of the paper, which fits better in a shortened version in Conclusions. Consider leaving out here.

Conclusions:

p35 l730 - 735. These sentences discuss LSTM outperforming the conceptual model, while such analysis is I believe not the main objective of this paper. Consider leaving out or reducing here.

l736: reformulate, because it now reads as if skill is defined as error, while forecast skill is defined as improvement over a reference forecasting system (in this paper always climatology).

l743-744: "This disconnect is further compounded.." I do not understand what is intended here. Please kindly clarify.
Editorial comments

p9 l232: REV should probably be PEV

p22: check caption. Default and calibrated, and initial and final models are mentioned in the caption, but not shown in the legend or figure.

p30 l605: consider "..a low false alarm rate limits unnecessary.."

p30 l611-612: "..LSTM forecasts.." "..synthetic forecasts.."

p30 l616: "(Fig. 14a - right)"

p31: Caption Figure 14, l636: consider "..to each forecast system."

p35 l737: "..- exhibit complex.."

p35 l738-739: kindly clarify, "skill was more sensitive to error and SD", compared to what?

I prefer Annex to be after Reference list, but this is probably governed by HESS.

Citation: https://doi.org/10.5194/egusphere-2024-4046-RC2
- AC2: 'Reply on RC2', Parthkumar Modi, 24 Mar 2025
  
  Dear Referee 2,
  Thank you for your valuable feedback. Please find attached the detailed response document.
  We are grateful for your time and effort in reviewing our work.
  Regards,
  Parth
  
  Citation: https://doi.org/10.5194/egusphere-2024-4046-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Publish subject to revisions (further review by editor and referees) (07 Apr 2025) by Albrecht Weerts

AR by Parthkumar Modi on behalf of the Authors (20 May 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (15 Jun 2025) by Albrecht Weerts

ED: Publish as is (17 Jul 2025) by Albrecht Weerts

AR by Parthkumar Modi on behalf of the Authors (01 Aug 2025)

Journal article(s) based on this preprint

22 Oct 2025

Understanding the relationship between streamflow forecast skill and value across the western US

Parthkumar A. Modi, Jared C. Carbone, Keith S. Jennings, Hannah Kamen, Joseph R. Kasprzyk, Bill Szafranski, Cameron W. Wobus, and Ben Livneh

Hydrol. Earth Syst. Sci., 29, 5593–5623, https://doi.org/10.5194/hess-29-5593-2025, 2025


Model code and software

Long Short Term Memory simulations and code for 664 basins in the Ensemble Streamflow Prediction framework (LSTM-ESP) P. Modi and B. Livneh https://doi.org/10.5281/zenodo.14213155


Viewed

Total article views: 473 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
306	143	24	473	22	50


Views and downloads (calculated since 14 Jan 2025)

Month	HTML	PDF	XML	Total
Jan 2025	79	42	3	124
Feb 2025	46	16	3	65
Mar 2025	17	9	2	28
Apr 2025	19	11	3	33
May 2025	16	12	2	30
Jun 2025	34	10	4	48
Jul 2025	36	13	2	51
Aug 2025	19	11	3	33
Sep 2025	23	9	1	33
Oct 2025	17	10	1	28
Nov 2025	0

Viewed (geographical distribution)

Total article views: 453 (including HTML, PDF, and XML) Thereof 453 with geography defined and 0 with unknown origin.

Latest update: 05 Nov 2025

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.


Short summary

This study shows that in unmanaged snow-dominated basins, high forecast accuracy doesn’t always lead to high economic value, especially during extreme conditions like droughts. It highlights how irregular errors in modern forecasting systems weaken the connection between accuracy and value. These findings call for forecast evaluations to focus not only on accuracy but also on economic impacts, providing valuable guidance for better water resource management under uncertainty.

