the Creative Commons Attribution 4.0 License.
Multi-site learning for hydrological uncertainty prediction: the case of quantile random forests
Abstract. To improve hydrological uncertainty estimation, recent studies have explored machine learning (ML)-based post-processing approaches that enable both enhanced predictive performance and hydrologically informed probabilistic streamflow predictions. Among these, random forests (RF) and their probabilistic extension, quantile random forests (QRF), are increasingly used for their balance between interpretability and performance. However, the application of QRF in regional post-processing settings remains unexplored. In this study, we develop a hydrologically informed QRF post-processor trained in a multi-site setting and compare its performance against a locally (at-site) trained QRF using probabilistic evaluation metrics. The QRF framework leverages simulations and state variables from the GR6J hydrological model, along with readily available catchment descriptors, to predict daily streamflow uncertainty. Our results show that the regional QRF approach is beneficial for hydrological uncertainty estimation, particularly in catchments where local information is insufficient. The findings highlight that multi-site learning enables effective information transfer across hydrologically similar catchments and is especially advantageous for high-flow events. However, the selection of appropriate catchment descriptors is critical to achieving these benefits.
Status: open (until 26 Oct 2025)
- AC1: 'Minor typo detected', Taha-Abderrahman El-Ouahabi, 18 Sep 2025
- RC1: 'Comment on egusphere-2025-3586', Anonymous Referee #1, 25 Sep 2025
Overall comments
This is a consistently interesting and well-considered study on the benefits of using multiple sites to train a probabilistic machine learning method (QRF) to predict hydrological model errors. The study shows that the inclusion of multiple sites does indeed improve the performance of QRF. The study has clear aims, and its conclusions are well supported by rigorous cross-validation and forecast verification. This is a minor point, but I was quite taken with their innovative way of measuring sharpness using the CRPS, which neatly sidesteps the issue of having to focus on the average width of one or two prediction intervals, the conventional method for assessing sharpness (which can give contradictory results for different intervals, and of course tells you nothing about intervals that are omitted). The finding that the use of multiple sites helps the QRF may in some ways seem obvious in retrospect, but this is the way with the best studies: the findings often look obvious after the authors have presented them! Accordingly, I think the study makes a significant contribution to the literature on hydrological error modelling. Their further investigation of regional/national strategies for including sites provides practical guidance to anyone wishing to implement their methods, of which I am one.
For my own interests I would have liked to have seen a comparison with a more conventional error modelling technique - e.g. a simple AR1 model assuming Gaussian errors after transformation - but I understand that this would have considerably lengthened the study, and is not strictly within the aims of what the authors set out to do. So I am happy for this to be omitted. I have a few questions about methods in the specific comments, the most notable of which is whether static climatic/hydrologic predictors are cross-validated. Assuming the answer is 'yes', I recommend this study be published essentially in its present form, subject to technical corrections.
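To make concrete the kind of conventional benchmark I have in mind, here is a minimal sketch (my own illustration, not anything from the manuscript) of an AR(1) error model with Gaussian innovations, assuming the residuals have already been transformed towards normality:

```python
import numpy as np
from statistics import NormalDist

def fit_ar1(resid):
    """Lag-1 regression fit of eta_t = rho * eta_{t-1} + sigma * eps_t
    to a series of (already transformed) model residuals."""
    resid = np.asarray(resid, dtype=float)
    x, y = resid[:-1], resid[1:]
    rho = float(np.dot(x, y) / np.dot(x, x))       # lag-1 slope
    sigma = float(np.std(y - rho * x, ddof=1))     # innovation std dev
    return rho, sigma

def one_step_quantiles(last_resid, rho, sigma, probs=(0.05, 0.5, 0.95)):
    """Gaussian one-step-ahead predictive quantiles for the next residual."""
    mu = rho * last_resid
    return [NormalDist(mu, sigma).inv_cdf(p) for p in probs]
```

The predictive quantiles would then be added to the deterministic simulation and back-transformed, giving a very cheap probabilistic baseline against which to judge the QRF variants.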
Specific comments
L53 "They found that larger LSTM models trained on all available basins outperform smaller models trained on a limited set of catchments. This is because, for some ML approaches, models calibrated on larger training datasets can outperform smaller and more specialized models" This is effectively saying "LSTMs perform better on larger datasets because LSTMs perform better on larger datasets". Please avoid instances of circular reasoning like this. An additional point is that, as far as I understand it, LSTMs significantly outperform conventional rainfall-runoff models for predictions in ungauged basins. This differs from applications where models are calibrated and used on the same catchment, where conventional rainfall-runoff models can perform similarly well to LSTMs. This may be worth mentioning.
L83 "Potential evaporation (PET) is calculated using the formula proposed by Oudin et al. (2005)." Could the authors briefly list the inputs used in this formula?
L84 "Since our interest is in developing a multi-site QRF post-processor, we used several static basin-averaged attributes describing climate, topography and geology." Would be good to foreshadow that these are listed in Table 1.
L108 "with a power of 0.5 and -0.5 prior transformations on streamflow" It's not clear to me where the power is being applied and what this transformation is. Please specify - in an appendix is fine.
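For illustration of what I am asking: one common reading of a "prior power transformation" in hydrological error modelling is a Box-Cox-style transform applied to streamflow before fitting. The sketch below is purely my assumption of what may be meant (the function names and parameterization are mine, not the manuscript's):

```python
import numpy as np

def power_transform(q, lam):
    """Box-Cox-style power transform of streamflow q (q > 0).
    Hypothetical reading of the manuscript's 0.5 / -0.5 powers:
    lam = 0.5 compresses high flows; lam = -0.5 compresses them
    more strongly still."""
    return (np.asarray(q, dtype=float) ** lam - 1.0) / lam

def inverse_power_transform(z, lam):
    """Back-transform predictions to the original streamflow scale."""
    return (lam * np.asarray(z, dtype=float) + 1.0) ** (1.0 / lam)
```

If something like this is intended, stating where it is applied (to observed flow, simulated flow, or both) would resolve the ambiguity.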
L154 Table 1. As is common, many of the static descriptors are based on climatic/hydrologic predictions (mean precip, PE, temp, etc.). I just wanted to confirm that these were cross-validated for this study: i.e., that they were computed separately for each of the training, validation and testing periods. As I'm sure the authors are aware, rigorously cross-validating predictors is an important aspect of testing any prediction system. I realise this kind of cross-validation is sometimes not done in ML studies, but it should be.
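To illustrate what I mean by cross-validating the static predictors, a hypothetical sketch (the column names and helper are my own, purely illustrative) that computes the climatic descriptors separately for each split:

```python
import pandas as pd

def static_descriptors(df, periods):
    """Compute climatic descriptors separately for each CV split,
    so that no test-period statistic leaks into training.
    df: daily series with a DatetimeIndex and columns
        'precip' and 'pet' (illustrative names).
    periods: mapping split name -> (start, end) date strings."""
    rows = {}
    for name, (start, end) in periods.items():
        sub = df.loc[start:end]  # date-string slicing is inclusive
        rows[name] = {"mean_precip": sub["precip"].mean(),
                      "mean_pet": sub["pet"].mean()}
    return pd.DataFrame(rows).T
```

The point is simply that each split's descriptor values are computed from that split's data only.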
L225 "The sharpness metric is the continuous ranked probability score (CRPS)" This is a really nifty way of measuring sharpness!
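For readers unfamiliar with the idea: the empirical CRPS of an ensemble decomposes into an accuracy term and a spread term, and the spread term depends only on the forecast, which is what makes it usable as a sharpness measure. A minimal sketch (my own illustration, not the authors' implementation):

```python
import numpy as np

def crps_ensemble(ens, obs):
    """Empirical CRPS of an m-member ensemble against one observation,
    CRPS = E|X - y| - 0.5 * E|X - X'|.
    Returns (crps, spread); the spread term involves only the
    forecast members, so it is a forecast-only sharpness measure."""
    ens = np.asarray(ens, dtype=float)
    accuracy = np.mean(np.abs(ens - obs))                          # E|X - y|
    spread = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))    # 0.5 E|X - X'|
    return accuracy - spread, spread
```

A point forecast has zero spread (perfect sharpness but no reliability), which is exactly the trade-off the authors' evaluation captures.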
L258 "In terms of sharpness, the different QRF variants performed similarly, which is interesting given that multi-site setups significantly improve CRPSS values". Might be worth saying why it's interesting: it's quite possible (even unsurprising) that sharpness (a property of the prediction only) is the same but CRPSS (which considers the joint distribution of obs and predictions) differs.
L272 Figure 4. Might be worth stating that curves that track closer to the right of the plots indicate better performance in the caption.
L286 "Table 3 summarizes the average values of the alpha, dispersion, CRPSS, and interval scores for three flow groups: high (> 67%Qsim), medium (> 34%Qsim and < 66%Qsim), and low flows (< 33%Qsim)." Please confirm that performance scores are stratified based on when predictions exceed these thresholds, not when observations exceed them.
L314 "Furthermore, the aforementioned scale discrepancies occurred specifically for catchments characterized by frequent zero values in simulated and observed streamflows." I wondered about this. Normalising errors with a log transformation is one thing, but maintaining normality in the presence of zeros in observations - and potentially also in the simulations after the QRF is applied - is quite another. While this isn't a total solution, might it be helpful to consider the proportion of zero flows as a static predictor?
L320 Discussion. I would have liked to see a paragraph or two added that briefly discusses the following topics. However, I understand that my interests are not necessarily the authors' and also may not be interesting to a more general audience, so I leave it to the authors to decide which of these issues (if any) they may wish to discuss:
- The weaknesses of the method (discussed throughout the manuscript) - e.g. application to ephemeral catchments - and how the authors might improve these.
- The sensitivity of the method to data availability. The authors used the astonishingly comprehensive CAMELS-FR dataset, but many of us work in regions with only a fraction of this gauge coverage. E.g., what would have happened if they only had access to 50 gauges? What might have happened if observations were concentrated on a particular hydrological type, but the method were applied outside that type?
- Are there prospects for applying this method to produce reliable probabilistic predictions in ungauged basins?
L362 "However, because of memory issues, we trained QRF-national on Jean-Zay HPC, where a single node with two CPUs (at 2.5 GHz) and 128 GBs of memory was sufficient." I'd be interested to know how long the parameter estimation took on this hardware. Cloud computing means that many now have access to large computers, but the run-time can still make these resources expensive.
Grammar etc.
L141 "But, it is important to note that these scale features are not available" suggest deleting 'But,'
Citation: https://doi.org/10.5194/egusphere-2025-3586-RC1
Viewed
- HTML: 1,004
- PDF: 49
- XML: 15
- Total: 1,068
- BibTeX: 21
- EndNote: 29
Dear colleagues,
Please note there was an error in the description of the model's calibration period. The correct period is 1977-1999. This error does not affect the results. However, it does change how the period selection is presented, as the statistical model's training period now partially overlaps with the calibration period.
We will introduce a correction in the next version of the manuscript.
Thank you for your understanding,