the Creative Commons Attribution 4.0 License.
Multi-site learning for hydrological uncertainty prediction: the case of quantile random forests
Abstract. To improve hydrological uncertainty estimation, recent studies have explored machine learning (ML)-based post-processing approaches that enable both enhanced predictive performance and hydrologically informed probabilistic streamflow predictions. Among these, random forests (RF) and their probabilistic extension, quantile random forests (QRF), are increasingly used for their balance between interpretability and performance. However, the application of QRF in regional post-processing settings remains unexplored. In this study, we develop a hydrologically informed QRF post-processor trained in a multi-site setting and compare its performance against a locally (at-site) trained QRF using probabilistic evaluation metrics. The QRF framework leverages simulations and state variables from the GR6J hydrological model, along with readily available catchment descriptors, to predict daily streamflow uncertainty. Our results show that the regional QRF approach is beneficial for hydrological uncertainty estimation, particularly in catchments where local information is insufficient. The findings highlight that multi-site learning enables effective information transfer across hydrologically similar catchments and is especially advantageous for high-flow events. However, the selection of appropriate catchment descriptors is critical to achieving these benefits.
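For readers unfamiliar with quantile random forests, the core idea (Meinshausen, 2006) is that an ordinary random forest already induces a conditional distribution: pool the training targets that share leaves with a new point, then read off empirical quantiles. The sketch below uses toy heteroscedastic data, not the authors' GR6J-based setup; `predict_quantiles` is an illustrative helper, and pooling raw targets is a simplification of Meinshausen's leaf-size weighting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data: one predictor, noise that grows with x (heteroscedastic).
X = rng.uniform(0, 10, size=(500, 1))
y = X[:, 0] + rng.normal(0, 0.1 + 0.2 * X[:, 0])

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)
rf.fit(X, y)

def predict_quantiles(rf, X_train, y_train, X_new, qs):
    """QRF-style prediction: for each new point, pool the training
    targets that fall in the same leaf in each tree, then take
    empirical quantiles of the pooled sample."""
    train_leaves = rf.apply(X_train)   # (n_train, n_trees) leaf ids
    new_leaves = rf.apply(X_new)       # (n_new, n_trees) leaf ids
    out = np.empty((len(X_new), len(qs)))
    for i, leaves in enumerate(new_leaves):
        pooled = np.concatenate([
            y_train[train_leaves[:, t] == leaf]
            for t, leaf in enumerate(leaves)
        ])
        out[i] = np.quantile(pooled, qs)
    return out

# 10/50/90% quantiles at x = 2 and x = 8: the interval should be
# wider at x = 8, where the noise is larger.
q = predict_quantiles(rf, X, y, np.array([[2.0], [8.0]]), [0.1, 0.5, 0.9])
```

The same mechanism is what lets a single multi-site forest produce catchment-specific predictive distributions: points from hydrologically similar catchments end up in the same leaves.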
Status: open (until 26 Oct 2025)
- AC1: 'Minor typo detected', Taha-Abderrahman El-Ouahabi, 18 Sep 2025
- RC1: 'Comment on egusphere-2025-3586', Anonymous Referee #1, 25 Sep 2025
Overall comments
This is a consistently interesting and well-considered study on the benefits of using multiple sites to train a probabilistic machine learning method (QRF) to predict hydrological model errors. The study shows that the inclusion of multiple sites indeed does improve the performance of QRF. The study has clear aims, and its conclusions are well supported by rigorous cross-validation and forecast verification. This is a minor point, but I was quite taken with their innovative way of measuring sharpness using the CRPS, which neatly sidesteps the issue of having to focus on one or two intervals when using the average width of prediction intervals, the conventional method for assessing sharpness (which can give contradictory results for different intervals, and of course tells you nothing about intervals that are omitted). The finding that the use of multiple sites helps the QRF may in some ways seem obvious in retrospect, but this is the thing with the best studies: the findings often look obvious after the authors have presented them! Accordingly, I think the study makes a significant contribution to the literature on hydrological error modelling. Their further investigation of the use of regional/national methods of including sites provides practical guidance to anyone wishing to implement their methods, of which I am one.
For my own interests I would have liked to have seen a comparison with a more conventional error modelling technique - e.g. a simple AR1 model assuming Gaussian errors after transformation - but I understand that this would have considerably lengthened the study, and is not strictly within the aims of what the authors set out to do. So I am happy for this to be omitted. I have a few questions about methods in the specific comments, the most notable of which is whether static climatic/hydrologic predictors are cross-validated. Assuming the answer is 'yes', I recommend this study be published essentially in its present form, subject to technical corrections.
Specific comments
L53 "They found that larger LSTM models trained on all available basins outperform smaller models trained on a limited set of catchments. This is because, for some ML approaches, models calibrated on larger training datasets can outperform smaller and more specialized models" This is effectively saying "LSTMs perform better on larger datasets because LSTMs perform better on larger datasets". Please avoid instances of circular reasoning like this. An additional point is that, as far as I understand it, LSTMs significantly outperform conventional rainfall-runoff models for predictions in ungauged basins. This differs from applications where models are calibrated and used on the same catchment, where conventional rainfall-runoff models can perform similarly well to LSTMs. This may be worth mentioning.

L83 "Potential evaporation (PET) is calculated using the formula proposed by Oudin et al. (2005)." Could the authors briefly list the inputs used in this formula?
L84 "Since our interest is in developing a multi-site QRF post-processor, we used several static basin-averaged attributes describing climate, topography and geology." Would be good to foreshadow that these are listed in Table 1.
L108 "with a power of 0.5 and -0.5 prior transformations on streamflow" It's not clear to me where the power is being applied and what this transformation is. Please specify - in an appendix is fine.
L154 Table 1. As is common, many of the static descriptors are based on climatic/hydrologic predictions (mean precip, PE, temp, etc.). I just wanted to confirm that these were cross-validated for this study: i.e., that they were computed separately for each of the training, validation and testing periods. As I'm sure the authors are aware, rigorously cross-validating predictors is an important aspect of testing any prediction system. I realise this kind of cross-validation is sometimes not done in ML studies, but it should be.
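The period-wise computation asked about above can be sketched in a few lines. The series and period boundaries below are invented for illustration (they are not CAMELS-FR fields or the study's actual split); the point is only that each static descriptor is recomputed within each period rather than over the full record.

```python
import numpy as np
import pandas as pd

# Hypothetical daily precipitation for one catchment.
idx = pd.date_range("1990-01-01", "2009-12-31", freq="D")
rng = np.random.default_rng(1)
precip = pd.Series(rng.gamma(0.5, 4.0, len(idx)), index=idx)

# Illustrative train/validation/test split (not the study's).
periods = {
    "train": slice("1990", "1999"),
    "valid": slice("2000", "2004"),
    "test":  slice("2005", "2009"),
}

# Leakage-free static descriptor: mean precipitation computed
# separately within each period, never over the whole record.
mean_p = {name: precip.loc[sl].mean() for name, sl in periods.items()}
```

Computing the descriptor once over the full record would let the test period's climate leak into the training features, which is exactly the concern raised here.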
L225 "The sharpness metric is the continuous ranked probability score (CRPS)" This is a really nifty way of measuring sharpness!
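For readers less familiar with the CRPS, a minimal empirical version for an ensemble forecast is shown below. The decomposition into an accuracy term and a spread term is what makes a sharpness reading possible: the spread term depends on the forecast alone. Exactly how the manuscript extracts sharpness from the CRPS is defined there, not here; this is only the standard formula (Gneiting and Raftery, 2007).

```python
import numpy as np

def crps_ensemble(ens, obs):
    """Empirical CRPS for an ensemble `ens` and a scalar observation:
    CRPS = E|X - y| - 0.5 * E|X - X'|, where X, X' are independent
    draws from the forecast distribution."""
    ens = np.asarray(ens, dtype=float)
    accuracy = np.mean(np.abs(ens - obs))                       # depends on obs
    spread = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))  # forecast only
    return accuracy - spread
```

For a perfect deterministic forecast the CRPS is zero, and for a single-member ensemble it reduces to the absolute error.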
L258 "In terms of sharpness, the different QRF variants performed similarly, which is interesting given that multi-site setups significantly improve CRPSS values". Might be worth saying why it's interesting: it's quite possible (even unsurprising) that sharpness (a property of the prediction only) is the same but CRPSS (which considers the joint distribution of obs and predictions) differs.
L272 Figure 4. Might be worth stating that curves that track closer to the right of the plots indicate better performance in the caption.
L286 "Table 3 summarizes the average values of the alpha, dispersion, CRPSS, and interval scores for three flow groups: high (> 67%Qsim), medium (> 34%Qsim and < 66%Qsim), and low flows (< 33%Qsim)." Please confirm that performance scores are stratified based on when predictions exceed these thresholds, not when observations exceed them.

L314 "Furthermore, the aforementioned scale discrepancies occurred specifically for catchments characterized by frequent zero values in simulated and observed streamflows." I wondered about this. Normalising errors with a log transformation is one thing, but maintaining normality in the presence of zeros in observations - and potentially also in the simulations after the QRF is applied - is quite another. While this isn't a total solution, might it be helpful to consider the proportion of zero flows as a static predictor?
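On the stratification question about Table 3 above: a grouping keyed to the simulations (rather than the observations, which would condition on the outcome) could look like the sketch below. The flows and scores are synthetic placeholders, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(2)
q_sim = rng.gamma(2.0, 5.0, 1000)    # simulated daily flows (synthetic)
score = rng.normal(0.0, 1.0, 1000)   # per-day score, e.g. a CRPS series

# Tercile thresholds computed from the *simulated* flows, so group
# membership is known at prediction time and never peeks at the obs.
lo, hi = np.quantile(q_sim, [1 / 3, 2 / 3])
groups = {
    "low":    score[q_sim < lo],
    "medium": score[(q_sim >= lo) & (q_sim <= hi)],
    "high":   score[q_sim > hi],
}
```

Stratifying on observed flow instead would mix forecast quality with the question of which days turned out to be high-flow days, which is presumably why the referee asks for confirmation.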
L320 Discussion. I would have liked to see a paragraph or two added that briefly discusses the following topics. However, I understand that my interests are not necessarily the authors' and also may not be interesting to a more general audience, so I leave it to the authors to decide which of these issues (if any) they may wish to discuss:
- The weaknesses of the method (discussed throughout the manuscript) - e.g. application to ephemeral catchments - and how the authors might improve these.
- The sensitivity of the method to data availability. The authors used the astonishingly comprehensive CAMELS-FR dataset, but many of us work in regions with only a fraction of this gauge coverage. e.g. What would have happened if they only had access to 50 gauges in their dataset? What might have happened if observations are concentrated on a particular hydrological type, but applied outside this type?
- Are there prospects for applying this method to produce reliable probabilistic predictions in ungauged basins?
L362 "However, because of memory issues, we trained QRF-national on Jean-Zay HPC, where a single node with two CPUs (at 2.5 GHz) and 128 GBs of memory was sufficient." I'd be interested to know how long the parameter estimation took on this hardware. Cloud computing means that many now have access to large computers, but the run-time can still make these resources expensive.
Grammar etc.
L141 "But, it is important to note that these scale features are not available" suggest deleting 'But,'.
Citation: https://doi.org/10.5194/egusphere-2025-3586-RC1
- RC2: 'Comment on egusphere-2025-3586', Derek Karssenberg, 17 Oct 2025
The manuscript proposes and evaluates the use of quantile random forests for correction of streamflow predicted with a process-based model. The main innovation compared to previous studies on streamflow error correction is the use of quantile random forests as it provides a means to estimate uncertainty in corrected streamflow. Also, unlike previous studies, this study extensively compares results for approaches that use local, regional, or national (France) data for training the error correction model. In my opinion this is a very interesting study. The methodology is state of the art, and the manuscript is relevant to development of error correction methodology (also in other domains).
My comments mainly refer to how the study is presented, and I suggest a number of relatively small additions. Please find below my main comments followed by a list of minor comments.
Introduction
The introduction needs some revision to further increase the impact of the study and to make it more accessible. The problem needs to be defined more completely and more precisely.
It remains unclear what the ‘simulation context’ (used in the methods section, line 125) is. In my opinion it is important to clearly state that this paper is about error correction of process-based model predictions, in the situation/context where predictions are made without relying on extrapolation of past observations of streamflow (for short range (small lead time) forecasts this would be more powerful). Also, the paper is not about prediction for ungauged catchments as all models are trained on historical streamflow at the location for which predictions are made. The simulation context thus is, I think, mainly reconstructing or projecting (e.g. under scenarios of climate change) streamflow for catchments that have streamflow available.
Also, the second contribution (line 62, spatial catchment descriptions) does not come with a clearly substantiated problem addressed by this contribution.
Please clearly describe what is meant by ‘regional’ in ‘regional learning’, ‘regional approaches’, ‘regional bias’, ‘regional post-processing’, etc. It is central to the study but it is not clearly defined. Is ‘multi-site learning’ (line 70) the same (please explain in manuscript).
Please clearly describe what is error corrected (i.e. streamflow from a process-based model). This is not clearly stated in the introduction (e.g. line 64 ‘model states’, model states from what?).
What are ‘spatially varying catchment characteristics’? Line 64. Please explain or rephrase.
Hyper parameter tuning – metrics used
I suggest moving information from Appendix A (line 398 – 403) to the main text (Methods), in particular the fact that hyperparameter tuning is done on metrics that refer to probability distributions (instead of deterministic ones). To my knowledge this study is quite unique in doing so (I may be wrong, but even then I would still move it to the main text). It is also suggested to state in the main text that in the local modelling, hyperparameters are different between catchments (which should further improve the results for the local model compared to an approach fixing hyperparameters across catchments).
Assessment criteria
The assessment criteria are well chosen. However, I think they can be presented better in the Methods and Results sections.
First, I suggest giving additional explanation on the terminology. If I am correct ‘sharpness’ refers to the magnitude of the uncertainty of the prediction, i.e. the lower the better. Please try to explain this more extensively as not all readers will be familiar with this term. The term ‘reliability’ in the context of your manuscript refers to whether the modelling approach is capable of providing correct estimates of the uncertainty (preferably the complete distribution should be correct).
Second, I suggest then to somewhat more clearly explain to what (sharpness or reliability) each metric refers. For instance, both the alpha score and coverage refer to ‘reliability’. Connecting these metrics could also be done in the Results section; e.g. one would expect similar results (relative performance between local, regional, national) for alpha and coverage ratio as these both refer to reliability. This is not stated at all in the Results and the reader has to make it up by herself.
Third, please be precise in the explanation. For instance ‘It calculates the closeness of predicted uncertainty distributions to the statistical distribution of observed streamflows’ (line 221) reads like you compare the distribution of the error term (from the model) with distributions of streamflow (over time?). This is not at all the case! Instead alpha is a metric summarizing the QQ plot, which is really a probabilistic property (as the authors will certainly be aware of). Also, please use correct units (for instance, CRPSS is given as percentage in the figure while in the main text it is in the range of 0-1 it seems).
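To make the QQ-plot reading of alpha concrete: one common formulation of a QQ-plot-based reliability index compares sorted PIT values (probability integral transforms of the observations through the predictive distributions) against uniform quantiles. Whether the manuscript uses exactly this definition is for the authors to state; this is only an illustrative sketch.

```python
import numpy as np

def alpha_index(pit):
    """Reliability index summarizing the QQ plot of PIT values
    against the uniform distribution:
    alpha = 1 - 2 * mean |p_(i) - i/(n+1)|,
    so 1 means perfectly reliable and 0 means maximally unreliable."""
    p = np.sort(np.asarray(pit, dtype=float))
    n = len(p)
    theoretical = np.arange(1, n + 1) / (n + 1)
    return 1.0 - 2.0 * np.mean(np.abs(p - theoretical))
```

This makes the referee's point explicit: alpha measures how uniform the PIT values are, a probabilistic property of the forecast-observation pairs, not a comparison of two streamflow distributions over time.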
Results
Please summarize the results of the hyperparameter tuning.
Magnitude of uncertainty (sharpness)
It seems the modelling approach underestimates uncertainty for all scenarios. This is an important outcome. Please add this information to the Results (it is not mentioned at all) and provide possible explanations in the Discussion section. One possible cause is the fact that the approach neglects uncertainty in the streamflow prediction of the process-based model.
Process-based model as benchmark
I am aware the main question of the manuscript is not how much error correction contributes to improved streamflow prediction compared to using streamflow from the process-based model (without error correction). However, I am of the opinion that it is still extremely interesting to add information (and if possible a short discussion) on the performance of the process-based model before adding the error correction. This could be done by adding curves for this process-based-only benchmark to figures, or values in tables, or values in the main text. This would also allow you to compare the results regarding the improvement of streamflow prediction after error correction with those in Magni (2023), Shen (2022) and possibly others. Is the improvement in your study comparable to other studies?
Title
Consider revising such that it also covers the fact that this manuscript is about error correction (or combining process-based modelling and machine learning – sometimes referred to as hybrid modelling). I agree that the case is quantile random forests, but the case is also error correction (maybe more so).
Minor comments
Line 64: What does ‘For this..’ refer to?
Figure 1: Please add a scale bar.
Line 105: Please state what parameters were calibrated.
Line 105, ‘prior transformations’: What is meant by this?
Line 125: It is stated here that in the simulation context of this study, streamflow is not available. This is not really true. The manuscript describes a methodology that only applies to locations where streamflow is available (for training, validation). For testing (or projections/reconstruction) I agree it can be done without measured streamflow (for the timesteps for which testing is done), but in this manuscript, results/testing metrics are only presented for locations where streamflow was used for training (i.e. this is not an ungauged catchment study). This is in my opinion not an important limitation, but it has to be clearly stated what this study is about (please refer to my comments related to the introduction).
Line 130, ‘production’: What is meant here?
Line 131, ‘moving averages’: What was the filter size?
Line 194: Refer to Figure 1.
Line 236: Number -> proportion
Citation: https://doi.org/10.5194/egusphere-2025-3586-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 1,047 | 60 | 18 | 1,125 | 24 | 32 |
Dear colleagues,
Please note there was an error in the description of the model's calibration period. The correct period is 1977-1999. This error does not affect the results. However, it does change how the period selection is presented, as the statistical model's training period now partially overlaps with the calibration period.
We will introduce a correction in the next version of the manuscript.
Thank you for your understanding,