Comparing multi-model mosaic and multi-model combination methods to simulate streamflow across the contiguous USA

Thébault, Cyril; Knoben, Wouter J. M.; Addor, Nans; Newman, Andrew J.; Spieler, Diana; Vásquez, Nicolás A.; Song, Yalan; Gründemann, Gaby J.; Carney, Shaun; Kumar, Mukesh; van Werkhoven, Katie; Shen, Chaopeng; Wood, Andrew W.; Clark, Martyn P.

doi:10.5194/egusphere-2025-6083

Preprints

https://doi.org/10.5194/egusphere-2025-6083

28 Jan 2026

Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Abstract. The ability to accurately predict streamflow underpins decisions in water management, flood prevention, and sectoral planning. Traditional approaches for streamflow prediction often rely on a single model, thereby overlooking potential benefits from using multiple models. To address this limitation, this study explores alternative methods that select and combine multiple models to enhance streamflow simulations. Specifically, we assess the performance of multi-model mosaic methods that assign a single model to each catchment, and multi-model combination methods that merge multiple models using static or dynamic weighting schemes. The Framework for Understanding Structural Errors (FUSE) is used to create an ensemble of 78 hydrological models, which are applied to 559 catchments from the CAMELS dataset across the contiguous United States. Each of the 78 models is calibrated using a composite objective function, calculated as the average of a high-flow and a low-flow performance metric, to cover a wide range of streamflow conditions. The results show that a carefully chosen single model from a larger ensemble can closely approach the performance of more complex multi-model strategies. Among the multi-model approaches, the combination and mosaic methods show broadly similar overall skill, although the combination approaches deliver slightly higher performance and lower sampling uncertainty. However, per-catchment differences persist, indicating that no single multi-model strategy dominates everywhere. This heterogeneity in performance makes it difficult to determine a priori which multi-model method will best represent streamflow in a given catchment.
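As an illustration of the composite objective described in the abstract, the following minimal sketch averages a high-flow and a low-flow score. It rests on assumptions that this page does not confirm: the high-flow metric is taken to be the Kling-Gupta efficiency (KGE) on raw flows, the low-flow metric is taken to be KGE on reciprocal flows, and the offset `eps_fraction` and the function names are hypothetical.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 is a perfect match, lower is worse."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]      # linear correlation
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def composite_kge(sim, obs, eps_fraction=0.01):
    """Average of a high-flow and a low-flow score.

    ASSUMPTION: the low-flow score is KGE on reciprocal flows, with a small
    offset (a fraction of the mean observed flow) to avoid division by zero.
    The manuscript's actual pair of metrics may differ.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    eps = eps_fraction * obs.mean()
    high_flow_score = kge(sim, obs)
    low_flow_score = kge(1.0 / (sim + eps), 1.0 / (obs + eps))
    return 0.5 * (high_flow_score + low_flow_score)

obs = np.array([0.4, 0.6, 3.0, 9.0, 4.0, 1.5, 0.8])
print(composite_kge(obs, obs))        # perfect simulation
print(composite_kge(1.2 * obs, obs))  # simulation with a 20 % volume bias
```

Averaging the two scores means that a model cannot reach a high composite value by fitting flood peaks alone while misrepresenting recessions, or the reverse.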

Received: 06 Dec 2025 – Discussion started: 28 Jan 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 11 Mar 2026)

RC1: 'Comment on egusphere-2025-6083', Anonymous Referee #1, 20 Feb 2026
General comments
This paper demonstrates and compares approaches to combine multiple hydrological models (i.e. multi-model mosaics vs multi-model combinations), answering the question of which multi-model approach performs best over a large sample of catchments in the US. First, I’d like to say that I really enjoyed reading this paper. It covers an important topic – how to improve streamflow simulations through multi-model approaches – in a novel way. I also appreciated the discussion of sampling uncertainty, which is often overlooked in modelling studies. The figures were excellent, well presented and very clear, and the paper was well written. I recommend this manuscript for publication after minor corrections and clarification of the methods. Further comments and suggestions are outlined below.
Specific comments
Further justification is needed for the use of model structure with best median KGE as a benchmark. This is a difficult benchmark to beat – it already requires a multi-model approach running all 78 combinations of model structures and selecting the ones with the highest overall performance. The selected benchmark model is dependent on your catchment selection, and I wonder if this gives an unfair advantage to catchments which are the least similar to other CAMELS-US catchments (i.e. where the benchmark model structure is less suitable and therefore easier to beat). I am curious why you did not use the FUSE variants based on the four existing models (i.e. relating to VIC, PRMS, SAC and TOPMODEL) as benchmarks, as these may be a better representation of what we might expect from a single model approach which your multi-model approaches then build on.
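The benchmark concern can be made concrete with a small sketch. It assumes a matrix of calibration scores with one row per model and one column per catchment; the numbers and function names are invented for illustration and are not taken from the manuscript. The per-catchment selection shown here is the simplest possible mosaic and may differ from the mosaic variants the authors use.

```python
import numpy as np

def best_median_model(scores):
    """Index of the model with the highest median score across catchments."""
    return int(np.argmax(np.median(scores, axis=1)))

def mosaic_assignment(scores):
    """For each catchment, index of the model with the highest score."""
    return np.argmax(scores, axis=0)

# Hypothetical scores: 3 models (rows) x 4 catchments (columns).
scores = np.array([
    [0.70, 0.72, 0.71, 0.20],
    [0.60, 0.80, 0.50, 0.65],
    [0.40, 0.45, 0.75, 0.60],
])

benchmark = best_median_model(scores)   # model 0 (median 0.705)
mosaic = mosaic_assignment(scores)      # models [0, 1, 2, 1]
catchments = np.arange(scores.shape[1])
gain = scores[mosaic, catchments] - scores[benchmark]
print(benchmark, mosaic, gain)
```

In this toy case the gain over the benchmark is largest in the last catchment, where the benchmark model scores only 0.20. That is the situation the comment describes: catchments for which the sample-wide best model is poorly suited are the easiest ones in which to beat it.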

The FUSE model variants share many similarities: all lumped conceptual hydrological models run at a daily timestep with the same input data. It would be worth discussing their similarities as well as the key differences given in Table 2 – as a more diverse multi-model ensemble may have even greater benefits. This is briefly touched upon in the discussion, but I feel that it is also worth elaborating upon in the methods.

Section 2.3: “Each model is calibrated for each catchment over the period 1989-1998 with a preliminary warm-up period of two years.” Please could you further specify these dates – were they run over water years or calendar years, and was the warm-up period before the calibration i.e. 1987-1988 inclusive or the first two years of the calibration period 1989-1990? Is two years sufficient for a warm-up period for your catchments? We have found that some groundwater dominated catchments require longer warm-up times depending on the model initialisation, but I have no experience of modelling catchments in the USA. This choice of calibration period should be explained in the paper – 10 years is relatively short and may not capture particularly dry/wet years.
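To show why the requested clarification matters, the sketch below lists the date ranges implied by the different readings of "1989-1998 with a preliminary warm-up period of two years". It assumes the usual US convention that water year N runs from 1 October of year N-1 to 30 September of year N; the function names are hypothetical and none of the readings is confirmed by the manuscript.

```python
from datetime import date, timedelta

def calendar_year_span(first_year, last_year):
    """Period covering whole calendar years."""
    return date(first_year, 1, 1), date(last_year, 12, 31)

def water_year_span(first_year, last_year):
    """Period covering whole US water years (1 Oct to 30 Sep)."""
    return date(first_year - 1, 10, 1), date(last_year, 9, 30)

def warmup_before(period_start, years=2):
    """Reading A: warm-up occupies the years just before the period."""
    warm_start = date(period_start.year - years, period_start.month, period_start.day)
    return warm_start, period_start - timedelta(days=1)

def warmup_inside(period_start, years=2):
    """Reading B: warm-up uses up the first years of the period itself."""
    scoring_start = date(period_start.year + years, period_start.month, period_start.day)
    return period_start, scoring_start - timedelta(days=1)

for label, span in [("calendar years", calendar_year_span),
                    ("water years", water_year_span)]:
    start, end = span(1989, 1998)
    print(label, "period:", start, "to", end)
    print("  warm-up before:", *warmup_before(start))
    print("  warm-up inside:", *warmup_inside(start))
```

Under reading B only eight of the ten years contribute to the objective function, which shortens an already short calibration record.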

Section 2.5 would benefit from a more thorough description of the multi-model approaches. In particular, I noted the following:

(1) section 2.5.2.2 left me with questions such as what are the benefits of minimising the number of models, how exactly does the method reduce the number of models required, and how many model structures remained? On further reading I found that more details are given in appendix A – it would be helpful to refer to this in the main text.
(2) section 2.5.3.1. Could you clarify how the models were combined? “using a simple average of up to four models” – did you take an equally weighted mean of discharge values from all four models for each timestep?
(3) Section 2.5.3.2. – the method selects “the combination of up to three models that yields the highest KGEcomp scores over the calibration period” – I’d be curious to know if there are any cases where a single model is better than any combination of 2 or 3 models. In that case, would you use the single model as ‘the best combination’, or does this method require a minimum of 2 models? Again, this section could refer to appendix A.
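One possible reading of points (2) and (3) is sketched below: an equally weighted mean of the member simulations at each timestep, and an exhaustive search over all subsets of up to three models. Both are assumptions about what the manuscript does. Plain KGE stands in for KGEcomp, subsets of size one are allowed, and the function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 is a perfect match, lower is worse."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def best_equal_weight_combination(sims, obs, max_size=3):
    """Search all subsets of 1..max_size models (rows of `sims`).

    Each candidate is the equally weighted mean of its members at every
    timestep. Returns the best subset (tuple of row indices) and its score.
    ASSUMPTION: single models count as valid 'combinations'.
    """
    sims = np.asarray(sims, dtype=float)
    best_subset, best_score = None, -np.inf
    for size in range(1, max_size + 1):
        for subset in combinations(range(sims.shape[0]), size):
            combined = sims[list(subset)].mean(axis=0)
            score = kge(combined, obs)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
dev = np.array([0.5, -0.5, 1.0, -1.0, 0.5])
sims = np.stack([obs + dev, obs - dev, 2.0 * obs])  # errors of models 0 and 1 cancel
print(best_equal_weight_combination(sims, obs))
```

With 78 models, this search evaluates 78 + 3,003 + 76,076 = 79,157 candidate subsets per catchment. Whether a single model can be returned as the "best combination" depends on whether size-one subsets are admitted, which is the question raised in point (3).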
Figure 9. This figure has a lot of information content with the locations of all gauges, but it is hard to see at a glance which methods are doing better and which are equivalent. I found myself trying to read and compare the numbers written above each map and struggled to see any patterns with so much information available. Could it be presented more clearly, e.g. as a table of pie charts/bar graphs rather than a table of maps?

Appendix line 646: “Interestingly, these are the two top-performing models in Figure A2, but model 72 is not selected in Figure A4, suggesting a large degree of similarity (i.e., equivalence) between both models.” Could this also be because model #72 is equivalent to other model structures (e.g. 96) – rather than necessarily being equivalent to #126? And does equivalence (i.e. similar KGE scores) necessarily mean similarity (i.e. similar hydrographs)?
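The distinction between equivalence and similarity can be shown with a deliberately contrived example. The observed series below is symmetric in time, and the two simulations carry the same errors in opposite time order, so their KGE scores coincide while the hydrographs differ. The data are invented and do not represent any FUSE model.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 is a perfect match, lower is worse."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

# Synthetic observed hydrograph, symmetric around its peak.
obs = np.array([1.0, 2.0, 4.0, 7.0, 4.0, 2.0, 1.0])
err = np.array([0.5, -0.2, 0.8, 0.0, -0.6, 0.3, -0.1])

sim_a = obs + err         # errors in original time order
sim_b = obs + err[::-1]   # same errors, reversed in time

score_a = kge(sim_a, obs)
score_b = kge(sim_b, obs)
max_gap = np.max(np.abs(sim_a - sim_b))

print(score_a, score_b)   # equal up to floating-point rounding
print(max_gap)            # largest difference between the two hydrographs
```

Here `sim_a` overestimates the rising limb and underestimates the recession, while `sim_b` does the opposite, yet an aggregate score cannot tell them apart. Comparing the simulated series directly, for example by scoring one model against the other, would separate the two notions.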

Figure A10: how are the catchments ordered in this figure? Knowing if catchments are grouped by location, key characteristics, or performance would help with the interpretation of this plot.

Citation: https://doi.org/10.5194/egusphere-2025-6083-RC1

Viewed

Total article views: 359 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
223	124	12	359	11	13

Views and downloads (calculated since 28 Jan 2026)

Month	HTML	PDF	XML	Total
Jan 2026	87	55	5	147
Feb 2026	136	69	7	212

Latest update: 28 Feb 2026

Short summary

Reliable river flow predictions guide water supply planning and flood protection. We tested whether selecting or combining many models improves accuracy compared with a single model. We applied 78 models to 559 river basins across the United States. A carefully chosen single model nearly matched more complex multi-model approaches, while combining models gave slightly higher accuracy and lower uncertainty. However, no approach worked best everywhere.

