the Creative Commons Attribution 4.0 License.
Skilful probabilistic predictions of UK floods months ahead using machine learning models trained on multimodel ensemble climate forecasts
Abstract. Seasonal streamflow forecasts are an important component of flood risk management. Hybrid forecasting methods that predict seasonal streamflow using machine learning models driven by climate model outputs are currently underexplored, yet have some important advantages over traditional approaches using hydrological models. Here we develop a hybrid subseasonal to seasonal streamflow forecasting system to predict the monthly maximum daily streamflow up to four months ahead. We train a random forest machine learning model on dynamical precipitation and temperature forecasts from a multimodel ensemble of 196 members (eight seasonal climate forecast models) from the Copernicus Climate Change Service (C3S) to produce probabilistic hindcasts for 579 stations across the UK for the period 2004–2016, with up to four months lead time. We show that multi-site ML models trained on pooled catchment data together with static catchment attributes are significantly more skilful compared to single-site ML models trained on data from each catchment individually. Considering all initialization months, 60 % of stations show positive skill (CRPSS > 0) relative to climatological reference forecasts in the first month after initialization. This falls to 41 % in the second month, 38 % in the third month and 33 % in the fourth month.
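The headline skill metric here is the CRPSS, the continuous ranked probability skill score of an ensemble forecast relative to a climatological reference ensemble. As an illustrative sketch only (the standard empirical ensemble CRPS formula, not the authors' code):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble forecast against one observation:
    mean |member - obs| minus half the mean pairwise member spread."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = np.mean(np.abs(members[:, None] - members[None, :])) / 2.0
    return term1 - term2

def crpss(fc_members, clim_members, obs):
    """Skill relative to a climatological ensemble; CRPSS > 0 means the
    forecast beats climatology, 1 is a perfect deterministic hit."""
    return 1.0 - crps_ensemble(fc_members, obs) / crps_ensemble(clim_members, obs)
```

In the paper's setup the climatological ensemble would be the observed monthly maxima from the previous 20 years, and the forecast ensemble the quantile-regression-forest output.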
Status: open (until 18 Nov 2024)
RC1: 'Comment on egusphere-2024-2324', Anonymous Referee #1, 25 Oct 2024
Review comment: Skilful probabilistic predictions of UK floods months ahead using machine learning models trained on multimodel ensemble climate forecasts by Simon Moulds et al.
The manuscript by Moulds et al. presents a new hybrid model approach for flood forecasting at a subseasonal to seasonal (S2S) scale for the UK. In the process of developing the forecasting system, different model setups were tested, comparing a single vs multisite approach and including additional catchment attributes alongside the dynamical input data (precipitation and temperature) from the multimodel seasonal forecasting system C3S, to predict monthly maximum daily streamflow values.
The manuscript highlights the importance of incorporating the multisite approach into modelling practices and further elaborates on the skill of the framework for the four lead months considered in the analysis. While 60% of stations (over all initialization months) indicate positive skill compared to a climatological reference forecast, the skill over the following lead months decreases, which is to be expected. However, the skill compared to commonly used forecasting systems like EFAS remains higher over the lead periods considered. Overall, the manuscript tackles the questions of a) how skilfully monthly maximum daily flow can be predicted up to four months lead time and b) to what extent the skill of S2S streamflow predictions can be improved by a multi vs single site framework. While the manuscript is well written and showcases an interesting approach to hybrid modelling for flood forecasting, some extra information on the model development process, such as the training, as well as a more in-depth comparison of the single vs multi-site model results, would help the reader understand and follow the key findings more.
The following major comments/suggestions/questions came up during the reading process and could help to strengthen and clarify some aspects of the manuscript:
Model development:
- lines 150-161: I would suggest adding an overview figure highlighting the forecast design setup (with the different forecasting months and the data and timesteps used), as well as the different model options described (line 199) in the following section (and potentially combine it with table 1). I think something like this would greatly help the reader and prepare them for understanding and following the figures in the results section quicker.
- lines 204-218: It would generally be interesting and helpful to include some more aspects of the training, testing and validation approach to better understand the model setup.
-- For example: How many or which percentage of the climate forecast ensembles was used for training, testing and validation? (Table S1 highlights all the different options with varying ensemble sizes.)
-- Was there a clear split between training, testing and validation datasets?
-- In line 216 it’s mentioned that the training period gets extended with every year: was the model retrained completely from scratch with all the data or just updated? Would that not lead to overfitting the model?
-- Was it tested separately what the influence of this training method is compared to not extending the training periods and just providing the current available data?
-- How did the training, testing and validation differ between the single and multi-site (ID or 15 attributes) approaches (if they did)? How were the attributes incorporated? Are only the medians used as listed in Table S3? Was there a random selection of the catchments for training or were all considered?
Single vs multi-site model results:
- line 259: it would be nice if it could be highlighted in the description of Figure 1b) (and 1a)) that this shows the average performance of the models. Furthermore, as a reader I was more interested in Figure S2 and in understanding where and when the multi-model with attributes tends to outperform the single-site model than in the Gini importance. Maybe consider switching Figure S2 and 1c) from/to the appendix? While I think the Gini importance is interesting and the discussion in lines 275-284 should remain, I believe having a more detailed look at Figure S2 might also help with the interpretation of Figure 2 and others where the discussion also goes over different months. Furthermore, the difference between single and multisite model performance is apparently one of the key findings of the work.
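The retraining scheme the review asks about (an expanding training window, refit for each new test year) can be sketched as follows. This is a minimal illustration under assumed variable names, with a plain `RandomForestRegressor` standing in for the paper's quantile regression forest; it is not the authors' code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def expanding_window_forecasts(X, y, years, first_test_year):
    """For each test year, refit a fresh model from scratch on all
    preceding years (the training window grows by one year each step)."""
    preds = {}
    for test_year in range(first_test_year, int(years.max()) + 1):
        train = years < test_year    # expanding window: everything before the test year
        test = years == test_year    # strictly out-of-sample evaluation year
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train], y[train])  # full retrain, not an incremental update
        preds[test_year] = model.predict(X[test])
    return preds
```

Because each test year is never part of its own training window, this scheme does not overfit to the evaluation data; the reviewer's question is rather whether retraining from scratch vs updating changes the resulting skill.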
Minor comments:
- line 23: could refer directly to the model used (quantile regression forest)
- line 33: abstract could include additional line on outcome, relevance or future plans of this developed framework to highlight the relevance of their findings
- line 53: one period too many before 'While the skill…'
- lines 105-108: a few lines explaining the necessity and urgency of implementing and testing such an approach (in the UK but also generally) could be added to strengthen the objectives of the manuscript
- line 110: short explanation of C3S multimodel would be nice to be able to follow (similar to the EFAS one in previous lines). Furthermore, consider introducing the abbreviation already here and not only in the Data section on line 133
- section 2.1 and 2.2: could consider adding subsections for the different paragraphs to make it more obvious for the reader to find which information where
- line 128: could consider adding a map with the locations of the 579 stations throughout the country to give the reader a better understanding whether these locations are equally distributed or whether there are for example multiple stations for the catchments (also in regards to training the model on static variables and whether some locations might be overrepresented compared to others).
- line 148: do the authors know how much of a bias there is in the precipitation and temperature hindcasts compared to the observations of the catchment? Just curious.
As it was mentioned previously that the aim is to include uncorrected monthly dynamical climate forecasts (line 118), I was wondering what the specific reasoning for this is as well?
- line 228: general question out of curiosity: why evaluate only on the monthly scale? The model seems to be trained on daily (?) timesteps and floods are a relatively quick phenomenon. Would it not also be interesting to see if the model can forecast the peaks or the timing of floods at a shorter timescale?
- line 250: same as before: consider adding subchapters to make it easier for the reader to follow, and potentially add a line on how the results are structured
- line 251: for these results and this statement it would again be interesting to know where the stations are located, in terms of their ID and static catchment attributes
- line 287: consider clarifying also in the text that the following focus of the results lies on the multisite model with catchment attributes (it's in the description of Figure 2 but would be nice to have in the main text as well)
- Figure 2: is it possible to make the same figure for comparison also for the single-site model? It would also be interesting to see the single-site and multisite models in the EFAS comparison
- line 288: 'with lower skill during spring and autumn.' Would it be possible to give the reader an estimate of which seasons or months are generally high flood seasons for the UK or different catchments? In other words, is the model able to simulate floods in those months or not, or does it only show good performance in months where there are fewer floods?
- line 309: appreciate that explanation and clarification as a reader
- Table 2: out of curiosity: any idea why the percentage of stations in lead time 2 in March seems to increase and is even higher than in lead 0?
- Figure 3: would it be possible to change the background/areas of the map that is not considered in the analysis to a different color (e.g. white or transparent) to make the distinction between catchments with low CRPSS a bit clearer?
- Figure 4: What is the Qmax range of the observations for the different catchments shown? Might the difference in catchments (e.g. size or average Q) have an impact? And was this considered in the training selection? Or can it be related to the Gini importance of the different variables?
- line 342: are these 90 stations roughly in the same area? Is there a common ground that explains their positive skill for all four lead times in a few selective months?
- line 338, Discussion: consider coming back to the initial research question from the introduction more clearly
- line 353: consider adding a line with the comparison to the reference (climatology and EFAS) to highlight how much or little difference the model framework brings to the current forecasting systems used or available in the UK.
- is there a specific reason for not having a conclusion to round up the manuscript, highlighting the main findings? I believe adding one and coming back to and answering the initial research questions from the introduction would strengthen the manuscript
Citation: https://doi.org/10.5194/egusphere-2024-2324-RC1
RC2: 'Comment on egusphere-2024-2324', Anonymous Referee #2, 03 Nov 2024
Summary
This is a consistently interesting and very well presented study where the authors have used a machine learning method (Quantile Regression Forests) to forecast maximum daily streamflow occurring in a month to long lead times. The authors use a very large number of catchments (579) and stringent cross-validation to show that they can produce skillful forecasts in a majority of catchments in the first month. Skill declines in subsequent months, but a substantial minority of catchments is still skillful at long lead times. Importantly, the authors show they are able to outperform a credible dynamical forecasting system for these predictions. The analyses are appropriately rigorous, and strongly support the authors' conclusions. The discussion clearly outlines the significance of the work and limitations of their study. The paper is concise and enjoyable to read.
Accordingly, I recommend the study be published essentially as is, with minor technical corrections based on the comments below.
Minor comments/typos
L27-L30 "We show that multi-site ML models trained on pooled catchment data together with static catchment attributes are significantly more skilful compared to single-site ML models trained on data from each catchment individually." I think the authors should use the same phrasing for this result as they have used in the body of the paper - i.e. 'narrowly but significantly more skillful'. Figure S2 shows that the advantage of multi-site forecasts over single-site forecasts, while there, is generally slight. It is a sad truth that many who cite this paper will deprive themselves of the excellent contents and look only at the abstract, and the current phrasing is a little at odds with the body of the paper.
L92 "Hybrid methods are unconstrained by the need to conserve the water balance and implicitly handle biases in the climate data" No change here, but this is also true of conceptual models.
158-159 "estimate initial hydrologic condition predictability" I would put this as: "are proxies for hydrologic initial conditions"
L176-177 "to predict the monthly maximum of mean daily streamflow (Qmax) using" What does 'mean' imply here? Is it not simply the maximum daily streamflow?
L213-218 An admirably stringent cross-validation scheme!
L223-225 "We evaluated our forecasts against an observation-based ensemble climatological forecast consisting of the observed monthly streamflow values from the previous 20 years (e.g. Hauswirth et al., 2023)." I assume the climatology was an ensemble of observed Qmax (i.e. the same variable as is being forecast)? Please confirm.
L232 "reliability index (RI)" please provide more details on this index or a reference.
L302-303 "We bias corrected the EFAS outputs using a quantile mapping approach" Did the QM use parametric or empirical distributions to describe the CDFs? Because Qmax will tend to fall in the tails, fitting an appropriate parametric distribution as part of the QM could matter.
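For context on this question, a minimal non-parametric quantile-mapping sketch (one common empirical variant; the paper's exact scheme may differ, and names here are illustrative only):

```python
import numpy as np

def empirical_quantile_map(model_ref, obs_ref, model_new):
    """Empirical quantile mapping: locate each new model value on the
    model reference quantile curve and read off the matching observed
    quantile. Non-parametric, so behaviour beyond the reference range
    is simply clamped by np.interp - exactly why a fitted parametric
    tail can matter for extremes like Qmax."""
    q = np.linspace(0.0, 1.0, 101)
    model_q = np.quantile(model_ref, q)   # model CDF as paired quantiles
    obs_q = np.quantile(obs_ref, q)       # observed CDF at the same probabilities
    return np.interp(model_new, model_q, obs_q)
```

With a parametric variant, one would instead fit distributions (e.g. a suitable extreme-value family) to `model_ref` and `obs_ref` and map through their CDFs, which extrapolates into the tails rather than clamping.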
L305 "Figure S1" - Personally I think this is easily interesting enough to include in the main body of the text, and adds to the growing body of forecasting systems where ML methods outperform dynamical systems. I think this figure in particular is likely to be of considerable significance because it is showing a prediction for events in the tails of distributions; I've often heard the view expressed that ML models are less able to make such predictions than dynamical models. I urge the authors to consider including this figure in the main body of the text.
Citation: https://doi.org/10.5194/egusphere-2024-2324-RC2