Exploring uncertainties across the modelling chain in machine-learning-based streamflow forecasting

Vinokić, Luka; Dotlić, Milan; Samac, Ana; Prodanović, Veljko; Kolaković, Slobodan; Stojković, Milan

doi:10.2139/ssrn.6737296

Preprints

https://doi.org/10.2139/ssrn.6737296

23 Jun 2026

Abstract. Operational streamflow forecasts underpin flood preparedness and reservoir operations, yet their utility is often constrained by poorly characterized and attributed predictive uncertainty. In machine-learning-based forecasting, uncertainty is frequently omitted or reported as a single aggregate output, leaving it unclear which parts of the end-to-end forecasting chain drive overconfidence and forecast degradation, particularly with increasing lead time. In this work, we develop an end-to-end uncertainty decomposition framework for operational streamflow forecasting that attributes predictive uncertainty across meteorological forcing choice, feature design, model architecture, hyperparameter optimization, and training variability, evaluated across multi-day horizons. The decomposition reveals a systematic, horizon-dependent shift in dominant uncertainty sources, with forcing-related contributions increasing with lead time while model-structure and feature choices remain influential at shorter horizons. During high-flow events, predictive intervals remain essential because pipeline heterogeneity can bias the central estimate even when ensemble dispersion widens appropriately. Tuning contributes little to the uncertainty budget but strongly affects compute–skill trade-offs, with Bayesian optimization delivering the most favorable cost–benefit performance under the tested constraints. Together, these results provide actionable guidance for operational freshwater management, showing where investment yields the largest reliability gains: model design at short lead times and forcing quality at longer lead times. This guidance can reduce the risk of costly or unsafe decisions in flood preparedness, reservoir operation, and other critical decision-making contexts in water management.

Received: 22 May 2026 – Discussion started: 23 Jun 2026

Status: final response (author comments only)

RC1:
'Comment on egusphere-2026-2966', Anonymous Referee #1, 20 Jul 2026
Dear editor and authors,

The following comments detail my review of the manuscript "Exploring uncertainties across the modelling chain in machine-learning-based streamflow forecasting" submitted to Hydrology and Earth System Sciences.
Overall Recommendation

The paper addresses a current and very active area of research regarding data-driven models to forecast discharge. It is well written and the results are supported by the evidence shown. However, I suggest major revisions before publication.
Major Comments
I cannot help but feel that the experimental setting for this experiment is quite limited. Just over a year for a testing period for a single basin really doesn't tell me much about how the results of this study are generalizable or if the overall framework is useful to adopt. For this particular basin, adopting a data-splitting strategy of time-series cross-validation would be a step toward obtaining results that are more representative of the study area in general, rather than simply the periods chosen. Furthermore, it would be highly beneficial for the authors to comment on if and how this experimental framework could be adapted to a setting of large sample hydrology.
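The time-series cross-validation suggested above can be sketched as follows. This is a minimal illustration that assumes an expanding training window with fixed-size test blocks; the function name `expanding_window_splits` and the fold sizes are hypothetical and are not taken from the manuscript or its code.

```python
def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Every training set ends strictly before its test block begins, so no
    future information leaks into training.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # all data before the test block
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Toy example: 10 time steps, 3 folds, 2 test steps per fold.
splits = list(expanding_window_splits(n_samples=10, n_folds=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Averaging the metrics over such folds would indicate how sensitive the reported results are to the particular test period chosen.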

I have the feeling that there is a lot of information left out of the paper regarding the model setups. Particularly, the authors refer to 1024 pipeline realizations, yet within each pipeline, there is independent testing between different weather products, combinations of features, and hyperparameters of models. For the last one especially, combinations of hyperparameters can be infinite; therefore, what were the constraints used for the HPO algorithms? I can see them in the code, but they are not mentioned in the text. Mentioning the number of training replicates obtained would also be useful. What is the total number of model configurations tested?

This study doesn't have a benchmark. The authors only report the performance of their ensembles, and a comparison against simply predicting the last measured discharge value for the deterministic performance metrics would be very useful. For the probabilistic models, perhaps a single model configuration performing quantile regression across all quantiles could be used as a benchmark too.

Particularly, the fan for Peak 1 in Figure 2c is worrying because the model wouldn't be able to give the correct signal in a clear decision-making scenario of issuing a flood warning. Even at t-1, simply forecasting the last measured discharge value would be a better predictor than the ensemble.
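A persistence benchmark of the kind requested here can be sketched in a few lines. The discharge values and the "model" forecasts below are invented for illustration only, and the skill score is defined as 1 minus the ratio of model RMSE to benchmark RMSE, which is one common convention and an assumption rather than the manuscript's definition.

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def persistence_forecast(obs, lead):
    """Pair each target obs[t + lead] with the last measured value obs[t]."""
    return obs[:-lead], obs[lead:]


def skill_score(model_rmse, benchmark_rmse):
    """Positive values mean the model beats the benchmark; 1.0 is perfect."""
    return 1.0 - model_rmse / benchmark_rmse


# Hypothetical daily discharge (m3/s) with a peak, and hypothetical
# model forecasts for the same target days at a 1-day lead time.
observed = [10.0, 12.0, 15.0, 30.0, 22.0, 18.0]
model_pred = [11.0, 14.0, 28.0, 24.0, 19.0]

persist_pred, target = persistence_forecast(observed, lead=1)
benchmark_rmse = rmse(persist_pred, target)
model_rmse = rmse(model_pred, target)
skill = skill_score(model_rmse, benchmark_rmse)
print(f"persistence RMSE={benchmark_rmse:.2f}, model RMSE={model_rmse:.2f}, skill={skill:.2f}")
```

A skill score at or below zero at a given lead time would mean the ensemble adds nothing over repeating the last measurement.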

Following the previous point, the manuscript introduces this framework as a tool to support operational management and decision-making; however, the evaluation relies entirely on standard goodness-of-fit metrics. If the purpose of the framework is to inform operations, there is a disconnect between the stated goals and the evaluation. It would be highly beneficial for the authors to clarify the exact scope of the framework and to consider evaluating it, or discussing its limitations, using decision-centric metrics (e.g. metrics derived from a confusion matrix, such as the false alarm rate).
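As an illustration of such decision-centric metrics, the sketch below builds a 2x2 contingency table for exceedance of a warning threshold and derives standard scores from it. The threshold and the data are hypothetical. Note that the "false alarm ratio" (false alarms among issued warnings) and the "false alarm rate" (false alarms among non-events) are different quantities, so both are computed.

```python
def contingency_table(forecast, observed, threshold):
    """Count hits, misses, false alarms and correct negatives for an
    event defined as value >= threshold."""
    counts = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_negatives": 0}
    for f, o in zip(forecast, observed):
        warned, occurred = f >= threshold, o >= threshold
        if warned and occurred:
            counts["hits"] += 1
        elif occurred:
            counts["misses"] += 1
        elif warned:
            counts["false_alarms"] += 1
        else:
            counts["correct_negatives"] += 1
    return counts


def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def decision_scores(c):
    """Standard categorical verification scores from a contingency table."""
    h, m = c["hits"], c["misses"]
    fa, cn = c["false_alarms"], c["correct_negatives"]
    return {
        "pod": _ratio(h, h + m),                 # probability of detection
        "false_alarm_ratio": _ratio(fa, h + fa),
        "false_alarm_rate": _ratio(fa, fa + cn),
        "csi": _ratio(h, h + m + fa),            # critical success index
    }


# Hypothetical forecasts and observations against a warning level of 10.
forecast = [5.0, 12.0, 9.0, 15.0, 3.0, 11.0]
observed = [4.0, 13.0, 11.0, 14.0, 2.0, 8.0]
table = contingency_table(forecast, observed, threshold=10.0)
scores = decision_scores(table)
print(table, scores)
```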

Section 3.3: For an uncertainty quantification paper, leaving 40% of the variance unexplained seems quite high. It would be useful if the authors commented on the unexplained sources of uncertainty or how they could be mapped within their framework.

Minor Comments
L27: The same could be said of underconfident predictions.

L80: At this point, I'm not sure what the authors refer to as 'idealized observed inputs'; a forecasting hydrological model necessarily has to be driven by NWP.

L124 and L131: Not that I'm asking you to do this, but could the proposed framework be used to assess the uncertainties introduced by re-gridding and interpolating?

L139: Using only sine encodes different days with the same value, e.g., days 45 and 137 in a 365-day year. Normally, this is addressed using both sine and cosine encoding.
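The ambiguity can be checked numerically. The sketch below assumes a 365-day period and a hypothetical helper `encode_day_of_year`; it shows that the sine values for days 45 and 137 are nearly identical, while the cosine component separates them by sign.

```python
import math


def encode_day_of_year(day, period=365.0):
    """Return the (sine, cosine) pair that places a day on the unit circle."""
    angle = 2.0 * math.pi * day / period
    return math.sin(angle), math.cos(angle)


sin_45, cos_45 = encode_day_of_year(45)
sin_137, cos_137 = encode_day_of_year(137)

# The sine components are almost indistinguishable...
print(f"sin: day 45 = {sin_45:.4f}, day 137 = {sin_137:.4f}")
# ...but the cosine components have opposite signs.
print(f"cos: day 45 = {cos_45:.4f}, day 137 = {cos_137:.4f}")
```

With both components, every day of the year maps to a distinct point on the circle, and 31 December remains adjacent to 1 January.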

L183: 100 trials sounds low. For example, in https://doi.org/10.1016/j.jhydrol.2023.129414, even if the reference is for conceptual models, trials for SCE are set to 1000.

Figure 4: I don't think these colors work for a person with color-blindness (monochromacy/achromatopsia).
Citation: https://doi.org/10.5194/egusphere-2026-2966-RC1
RC2:
'Comment on egusphere-2026-2966', Anonymous Referee #2, 02 Aug 2026
Dear Authors and Editors,
I believe the manuscript “Exploring uncertainties across the modelling chain in machine-learning-based streamflow forecasting” is relevant to the machine learning and forecasting community and fits the aims and scope of Hydrology and Earth System Sciences (HESS). However, I believe the manuscript needs major revisions before being published. My comments are below.
Major comments:
The manuscript employs a single catchment to attribute uncertainty to the different phases of the Machine Learning (ML) modelling chain. How can the results be generalised and the same workflow be applied to large sample hydrology modelling (e.g. all the regional LSTM models developed with the different versions of CAMELS dataset)?

The possible combinations of model architectures could be endless, considering that an unlimited number of layers could be added. Were external constraints applied? And what about the uncertainty introduced by the choice of loss function (I am not suggesting adding it to the pipelines, but it should at least be mentioned)?

Numerical Weather Prediction (NWP) model runs were used as input for the different ML pipelines, but important information is missing: how often are the models initialised and, hence, how many samples are available for training? Is the full NWP ensemble used for training the models, or only the control member (or the ensemble mean)?

In building the input feature sequences, are the NWP data used from t-L up to t hindcasts or reanalysis? If they are hindcasts, what is their lead time?

Still on the building of the input feature sequences: I believe using only 3 days of past information to forecast the next 5 days might be too little, even if the concentration time is around 1 day. There is a well-established literature of LSTM models using one year (365 days) of past observations to predict next-day streamflow; how do you justify using only 3 days to forecast the next 5 days of streamflow? Especially in snow-driven catchments, where snowmelt is a very slow process, a longer input history is needed.
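To make the lookback question concrete, the sketch below shows how a series would be cut into training samples for a lookback of 3 steps and a horizon of 5 steps. It windows a single series only and ignores the NWP forcing, so it is a simplified assumption about the setup, not the authors' implementation; `make_windows` is a hypothetical helper.

```python
def make_windows(series, lookback, horizon):
    """Slice a series into (past, future) pairs.

    Each sample uses `lookback` consecutive values as input and the
    following `horizon` values as the forecast target.
    """
    samples = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        past = series[start:start + lookback]
        future = series[start + lookback:start + lookback + horizon]
        samples.append((past, future))
    return samples


# Toy series of 10 daily values, lookback L = 3, horizon = 5.
flows = list(range(10))
samples = make_windows(flows, lookback=3, horizon=5)
for past, future in samples:
    print(past, "->", future)
```

Raising `lookback` from 3 to 365 changes only the input length, so the sensitivity to this choice is cheap to test within the same framework, at the cost of fewer samples from a fixed record.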

Tables are used to summarise the reasons for selecting a specific model or metric, but in my view they could be omitted, or at least modified so that the reason for selection is retained within the main text. In Table 6, the 'Key reference' column mixes references with reasons for selecting the metric.

About the metrics used: are the probabilistic metrics applied to the meteorological NWP ensemble or to the ensemble created by all the tested pipelines j? If they are applied to the ensemble of pipelines, what is the added value of using these probabilistic metrics?
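For reference, a probabilistic score can be computed on any finite set of members, regardless of whether they come from NWP perturbations or from different pipelines. The sketch below uses the standard empirical estimator of the continuous ranked probability score (CRPS) for an ensemble; whether the manuscript uses this exact estimator is an assumption, and the member values are invented.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble for a single observation.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|, lower is better.
    """
    m = len(members)
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (2.0 * m * m)
    return accuracy - spread


# Three hypothetical pipeline forecasts for one day, observed flow = 2.0.
crps_ensemble = ensemble_crps([1.0, 2.0, 3.0], 2.0)
# With a single member, CRPS reduces to the absolute error.
crps_single = ensemble_crps([2.0], 5.0)
print(crps_ensemble, crps_single)
```

Because the score rewards both accuracy and appropriate spread, it would be informative for a pipeline ensemble only if that spread is meant to represent predictive uncertainty, which is the point of the question above.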

About metrics again: I feel too many metrics are used but not all of them are discussed. Maybe some can be dropped.

Most of the manuscript, including the title, focuses on uncertainty quantification, and very little is said in the introduction about the analysis of computational cost versus forecast accuracy. I think more should be said about this analysis, or it could be dropped to keep the manuscript within the uncertainty quantification scope. Additionally, the effects of the tuner on the uncertainty quantification are not discussed, although the tuner is presented as part of the UQ pipeline.

In the computational costs-accuracy analysis, in L226 it is mentioned that the aim is to minimize both costs and RMSE. It should be clarified, however, that there is no multi-objective optimization leading to the construction of the Pareto front, as the front is artificially built later by using the results of all independently trained pipelines.
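The distinction can be illustrated with a post hoc non-dominated filter: the front is extracted from already completed runs, with no optimizer steering the search toward it. The (cost, RMSE) pairs below are invented, and `pareto_front` is a hypothetical helper rather than code from the manuscript.

```python
def pareto_front(points):
    """Return the non-dominated (cost, rmse) pairs, both to be minimized.

    A point is dominated if another point is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for i, (cost_i, rmse_i) in enumerate(points):
        dominated = any(
            cost_j <= cost_i and rmse_j <= rmse_i
            and (cost_j < cost_i or rmse_j < rmse_i)
            for j, (cost_j, rmse_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((cost_i, rmse_i))
    return sorted(front)


# Hypothetical results of independently trained pipelines:
# (compute cost in hours, test RMSE).
runs = [(1.0, 5.0), (2.0, 3.0), (3.0, 3.5), (4.0, 2.0), (2.5, 4.0)]
front = pareto_front(runs)
print(front)
```

A front obtained this way describes the trade-off among the configurations that happened to be sampled, not the best attainable trade-off.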

Minor comments:
L80: what are “idealized observed inputs”?

L82 claims that statistically robust pipelines are identified. Where is this shown?

L131: gaps are filled with interpolation. I think this part should be clarified in terms of how many gaps were present and mention that this could also contribute to the uncertainty of the modelling chain.

Charts in Fig 2c in the last row do not correspond to those of Fig 2b, which should be referred to the same lead time. Can you clarify why?

L273 mentions a “skill”, but there is no benchmark model here?

L303 mentions that snow water equivalent is not relevant as input, despite the catchment being snow-driven. This is because the input sequence covers only a few days, which is short relative to the slow snowmelt process.

In the conclusion (from L384) it is mentioned that richer feature exploration could be done. However, the conclusions do not mention that this would help only for the shorter lead times, while for the longer it would make more sense to invest in the NWP forecasts quality (as rightfully stated in the abstract).
Citation: https://doi.org/10.5194/egusphere-2026-2966-RC2

Short summary

Reliable streamflow forecasts require knowing not just what the model predicts, but how uncertain that prediction is and why. This study shows that uncertainty in machine-learning-based forecasts shifts with lead time: model design dominates at short horizons, while weather forecast quality takes over beyond day two. Ensemble mean forecasts can mislead during floods; predictive intervals remain reliable. Bayesian optimization offers the best accuracy-to-cost ratio among tuning strategies tested.

