This work is distributed under the Creative Commons Attribution 4.0 License.
BiasCast: Learning and adjusting real time biases from meteorological forecasts to enhance runoff predictions
Abstract. Deep learning models are increasingly used in operational flood forecasting. Such operational systems face performance degradation when transitioning from high-quality reanalysis data to less accurate meteorological forecast data. This study investigates training strategies and Long Short-Term Memory (LSTM) network architectures to mitigate forecast-induced bias in maximum daily discharge predictions, using the Extended LamaH-CE dataset and a subset of 451 basins. We systematically evaluated cross-domain generalization, transfer learning approaches, Encoder–Decoder LSTMs, Sequential Forecast LSTMs, and the role of input embeddings and of integrating past discharge observations. The results show that domain shifts between reanalysis and forecast data lead to substantial skill loss, with the median Nash–Sutcliffe Efficiency (NSE) decreasing from 0.58 to 0.33. Among the tested strategies, the Sequential Forecast LSTM demonstrated the most stable improvements, achieving a median NSE of 0.63. Integrating recent discharge observations further enhanced performance, raising the median NSE to 0.71 and surpassing even the reanalysis-driven baseline. In contrast, integrating archived forecasts or using more complex input embeddings did not yield consistent benefits and in some cases degraded model stability. These findings highlight the value of training strategies that allow models to learn bias correction directly during forecast transitions, and they emphasize the operational potential of combining sequential processing with near-real-time discharge observations.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4978', Anonymous Referee #1, 02 Jan 2026
- RC2: 'Comment on egusphere-2025-4978', Anonymous Referee #2, 01 Mar 2026
Review of HESS Manuscript: “BiasCast: Learning and adjusting real time biases from meteorological forecasts to enhance runoff predictions”
Dear editor, please find attached my review of the manuscript.
1. Scope
The article falls within the scope of HESS.
2. Summary
The authors propose methods to improve model performance under forecast-induced bias (the performance drop when models trained on reanalysis data are driven with forecast data). They test different strategies, including Encoder–Decoder LSTMs, Sequential LSTMs, and transfer learning. Moreover, they investigate how linear embeddings and the inclusion of past observed discharge influence performance.
3. General comments
I think the article is really well written, with a good introduction, clear objectives and well-posed experiments. The results are presented in a clean way. Even with many model variations, it was easy to follow, which is not always the case. The only major limitation of the study is that the authors limit themselves to one-day-ahead prediction, even though the LSTM architectures and the available data would already allow them to increase the lead times. This limits the conclusions that can be drawn, especially regarding the effect of decaying forecast quality as the lead time increases. From lines 95 and 489, I understand that this article is a stepping stone toward multi-day prediction, so I understand that the authors want to leave that for a future study, but then they should clearly state this in the limitations section.
4. Specific comments
Line 139-142: It is good that you used embeddings, but this paragraph makes it sound as if an LSTM without embeddings can neither capture non-linear relationships nor learn how to combine the features, which it actually can. The main advantages of embeddings are that (1) you can reduce large input dimensions into smaller latent spaces, (2) the embeddings can learn to compensate for systematic biases in your data, for example if you use a different embedding for hindcast and forecast, and (3) if you use different types/groups of inputs with different numbers of variables, you can map them to a shared dimension for further processing (e.g., Acuna2025 for multiple frequencies or Gauch2025 for missing data).
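As a minimal sketch of point (3), two input groups with different numbers of variables can each get their own linear embedding into a shared latent dimension (all shapes and names here are hypothetical, not taken from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input groups with different numbers of variables:
# hindcast meteorology (8 variables) and forecast meteorology (5 variables).
x_hindcast = rng.normal(size=(365, 8))   # (timesteps, n_vars)
x_forecast = rng.normal(size=(1, 5))

latent_dim = 16  # shared embedding dimension fed to the LSTM

# One linear embedding per input group maps both to the same latent size;
# the per-group bias can also absorb systematic offsets between sources.
W_h, b_h = rng.normal(size=(8, latent_dim)), np.zeros(latent_dim)
W_f, b_f = rng.normal(size=(5, latent_dim)), np.zeros(latent_dim)

z_hindcast = x_hindcast @ W_h + b_h
z_forecast = x_forecast @ W_f + b_f

# Both groups now share a dimension and can feed one LSTM cell.
print(z_hindcast.shape, z_forecast.shape)
```

After this projection, a single recurrent cell can process both data sources without any change to its input size.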
Line 146-148: I do not understand what you are trying to say, can you please rephrase or further explain?
Line 259-262: The problem with using tanh as the activation of the embeddings is that tanh saturates. Saturation is a known problem for LSTMs (Kratzert2024, Acuna2025, Baste2025), and I think it can be further aggravated if you also saturate the input before it goes into the LSTM. Was there a specific reason you used tanh? Have you tested whether ReLU gives you better results, especially considering that in section 3.5 you indicate that more complex embeddings gave you worse performance for the enc-dec and seq-lstm? Are you using dropout in the more complex embeddings to avoid overfitting?
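To illustrate the saturation argument numerically (a stdlib-only sketch, unrelated to the manuscript's code): the gradient of tanh collapses for large pre-activations, while the ReLU gradient stays at 1 for positive inputs.

```python
import math

def dtanh(x):
    # Derivative of tanh: 1 - tanh(x)^2, which vanishes as |x| grows.
    return 1.0 - math.tanh(x) ** 2

def drelu(x):
    # Derivative of ReLU (for x != 0): 1 for positive inputs, else 0.
    return 1.0 if x > 0 else 0.0

# For a large pre-activation (e.g. an extreme precipitation input),
# tanh's gradient nearly vanishes while ReLU's does not.
for x in (0.5, 2.0, 5.0):
    print(f"x={x}: dtanh={dtanh(x):.4f}, drelu={drelu(x):.1f}")
```

This is why a saturating embedding in front of the (already tanh-based) LSTM gates can compound the vanishing-gradient problem for extreme inputs.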
Section 3.1: Can you further explain the difference between BaseLine Reanalysis and CrossDomain (Reanalysis, Pretrain)?
Line 324-326: The sequential data processing is not only done in the sequential LSTM, is it? I agree with what you say at the end of the paragraph, that the sequential LSTM is better than the encoder-decoder because the hindcast-forecast transition is done in the same LSTM instead of having to initialize a new one, especially in your case, where the forecast part is only run for one day. But the current phrasing makes it sound as if only the sequential LSTM processes data sequentially and in temporal order, which is not true.
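A toy sketch of why the state handoff matters (a hand-rolled single-gate-matrix LSTM step with random weights; purely illustrative, not the authors' architecture): the forecast step that reuses the hindcast state produces a different hidden state than one started from zeros.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8

# One combined weight matrix for the four gates (i, f, g, o).
W = rng.normal(scale=0.3, size=(n_in + n_hidden, 4 * n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    z = np.concatenate([x, h]) @ W
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

hindcast = rng.normal(size=(30, n_in))   # reanalysis-driven warm-up
forecast = rng.normal(size=(1, n_in))    # one-day-ahead forecast input

# Sequential processing: the forecast step reuses the hindcast state.
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in hindcast:
    h, c = lstm_step(x, h, c)
h_seq, _ = lstm_step(forecast[0], h, c)

# Without a state handoff, the forecast step would start from zeros.
h_cold, _ = lstm_step(forecast[0], np.zeros(n_hidden), np.zeros(n_hidden))

# The warm-up state changes the forecast-step output.
print(np.abs(h_seq - h_cold).max())
```

Both variants process their inputs sequentially and in temporal order; what differs is only whether the forecast step inherits the hindcast state or not.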
Line 435: The limitation of integrating reanalysis data depends on the test case. Multiple meteorological services provide real-time observed data (from stations or radar), which can be included in the hindcast period, with the forecast data then coming from the meteorological models. I understand that if you are thinking of a global or continental scale, you might need reanalysis data, but in national-scale applications you can directly use observed data (if the country has it available).
5. Recommendation
Dear editor, given the quality of the preprint, I recommend accepting it subject to minor revisions.
References:
Acuña Espinoza, E., Kratzert, F., Klotz, D., Gauch, M., Álvarez Chaves, M., Loritz, R., & Ehret, U. (2025). Technical note: An approach for handling multiple temporal frequencies with different input dimensions using a single LSTM cell. Hydrology and Earth System Sciences, 29(6), 1749–1758.
Acuña Espinoza, E., Loritz, R., Kratzert, F., Klotz, D., Gauch, M., Álvarez Chaves, M., & Ehret, U. (2025). Analyzing the generalization capabilities of a hybrid hydrological model for extrapolation to extreme events. Hydrology and Earth System Sciences, 29(5), 1277–1294. https://doi.org/10.5194/hess-29-1277-2025
Baste, S., Klotz, D., Acuña Espinoza, E., Bardossy, A., & Loritz, R. (2025). Unveiling the limits of deep learning models in hydrological extrapolation tasks. Hydrology and Earth System Sciences, 29(21), 5871–5891. https://doi.org/10.5194/hess-29-5871-2025
Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Cohen, D., & Gilon, O. (2025). How to deal with missing input data. Hydrology and Earth System Sciences, 29(21), 6221–6235. https://doi.org/10.5194/hess-29-6221-2025
Kratzert, F., Gauch, M., Klotz, D., & Nearing, G. (2024). HESS Opinions: Never train a Long Short-Term Memory (LSTM) network on a single basin. Hydrology and Earth System Sciences, 28(17), 4187–4201. https://doi.org/10.5194/hess-28-4187-2024
Citation: https://doi.org/10.5194/egusphere-2025-4978-RC2
Data sets
Experimental Setups and Results for "BiasCast: Learning and adjusting real time biases from meteorological forecasts to enhance runoff predictions" Oliver Konold et al. https://doi.org/10.5281/zenodo.17241922
Extended LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe Oliver Konold et al. https://doi.org/10.5281/zenodo.17119634
Model code and software
Forked NeuralHydrology Version Oliver Konold https://github.com/conestone/neuralhydrology
Interactive computing environment
Experiments and Results Code for "BiasCast: Learning and adjusting real time biases from meteorological forecasts to enhance runoff predictions" Oliver Konold https://github.com/conestone/biascast
This manuscript addresses the challenge of deploying machine-learning hydrological models in operational forecasting by explicitly considering the domain shift between reanalysis and forecast meteorological inputs. The authors explore alternative training strategies and LSTM architectures to improve 1-day streamflow forecasts, and the results suggest that architectures combining hindcast and forecast phases, which use reanalysis and forecast data respectively, provide the greatest performance gains. The study tackles an important problem, presents interesting results, and is well structured. Some additional analysis and clarifications would further strengthen the interpretation of the experiments and results.
General comments
Specific comments
Technical corrections