Comparing multi-model mosaic and multi-model combination methods to simulate streamflow across the contiguous USA

Thébault, Cyril; Knoben, Wouter J. M.; Addor, Nans; Newman, Andrew J.; Spieler, Diana; Vásquez, Nicolás A.; Song, Yalan; Gründemann, Gaby J.; Carney, Shaun; Kumar, Mukesh; van Werkhoven, Katie; Shen, Chaopeng; Wood, Andrew W.; Clark, Martyn P.

doi:10.5194/egusphere-2025-6083

Preprints

28 Jan 2026

Abstract. The ability to accurately predict streamflow underpins decisions in water management, flood prevention, and sectoral planning. Traditional approaches for streamflow prediction often rely on one single model, thereby overlooking potential benefits from using multiple models. To address this limitation, this study explores alternative methods that select and combine multiple models to enhance streamflow simulations. Specifically, we assess the performance of multi-model mosaic methods that assign a single model to each catchment, and multi-model combination methods that merge multiple models using static or dynamic weighting schemes. The Framework for Understanding Structural Errors (FUSE) is used to create an ensemble of 78 hydrological models, which are applied to 559 catchments from the CAMELS dataset across the contiguous United States. Each of the 78 models is calibrated using a composite objective function, calculated as the average of a high-flow and a low-flow performance metric, to cover a wide range of streamflow conditions. The results show that a carefully chosen single model from a larger ensemble can closely approach the performance of more complex multi-model strategies. Among the multi-model approaches, the combination and mosaic methods show broadly similar overall skill, although the combination approaches deliver slightly higher performance and lower sampling uncertainty. However, per-catchment differences persist, indicating that no single multi-model strategy dominates everywhere. This heterogeneity in performance makes it difficult to determine a priori which multi-model method will best represent streamflow in a given catchment.
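
The composite objective described in the abstract can be illustrated with a minimal sketch. It assumes the Kling-Gupta efficiency (KGE) as the base metric, KGE on raw flows as the high-flow term, and KGE on inverse flows as the low-flow term. The abstract only states that the two terms are averaged, so the flow transformation, the `eps` offset, and the function names `kge` and `kge_composite` are illustrative assumptions rather than the authors' implementation.

```python
import math
from statistics import mean, pstdev


def kge(sim, obs):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2).

    r is the linear correlation, alpha the ratio of standard deviations
    (sim/obs), and beta the ratio of means (sim/obs).
    Not defined when either series is constant or when mean(obs) is zero.
    """
    mu_s, mu_o = mean(sim), mean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = mean([(s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)])
    r = cov / (sd_s * sd_o)
    alpha = sd_s / sd_o
    beta = mu_s / mu_o
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


def kge_composite(sim, obs, eps=0.01):
    """Average of a high-flow and a low-flow KGE term.

    ASSUMPTION: the low-flow term is KGE on inverse flows 1 / (Q + offset),
    with offset = eps * mean(obs) to avoid division by zero. The source
    only says that a high-flow and a low-flow metric are averaged.
    """
    offset = eps * mean(obs)
    inv_sim = [1.0 / (q + offset) for q in sim]
    inv_obs = [1.0 / (q + offset) for q in obs]
    high_flow = kge(sim, obs)
    low_flow = kge(inv_sim, inv_obs)
    return 0.5 * (high_flow + low_flow)
```

Under these assumptions a perfect simulation scores 1 on both terms, while a simulation that is perfectly correlated with the observations but has twice their mean and variability scores 1 - sqrt(2) on the raw-flow term.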

Received: 06 Dec 2025 – Discussion started: 28 Jan 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: closed

RC1:
'Comment on egusphere-2025-6083', Anonymous Referee #1, 20 Feb 2026
General comments
This paper demonstrates and compares approaches to combine multiple hydrological models (i.e. multi-model mosaics vs multi-model combinations), answering the question of which multi-model approach performs best over a large sample of catchments in the US. First, I’d like to say that I really enjoyed reading this paper. It covers an important topic – how to improve streamflow simulations through multi-model approaches – in a novel way. I also appreciated the discussion of sampling uncertainty, which is often overlooked in modelling studies. The figures were excellent, well presented and very clear, and the paper was well-written. I would recommend that this manuscript is worthy of publication with minor corrections and clarification of the methods. Further comments and suggestions are outlined below.
Specific comments
Further justification is needed for the use of model structure with best median KGE as a benchmark. This is a difficult benchmark to beat – it already requires a multi-model approach running all 78 combinations of model structures and selecting the ones with the highest overall performance. The selected benchmark model is dependent on your catchment selection, and I wonder if this gives an unfair advantage to catchments which are the least similar to other CAMELS-US catchments (i.e. where the benchmark model structure is less suitable and therefore easier to beat). I am curious why you did not use the FUSE variants based on the four existing models (i.e. relating to VIC, PRMS, SAC and TOPMODEL) as benchmarks, as these may be a better representation of what we might expect from a single model approach which your multi-model approaches then build on.
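
The distinction this comment draws between the benchmark and a mosaic can be made concrete with a small sketch. It assumes a table of per-catchment, per-model scores (for example KGE over the calibration period); the function names `benchmark_model` and `mosaic`, and the scores in the usage note, are hypothetical and only illustrate the selection logic, not the manuscript's code.

```python
from statistics import median


def benchmark_model(scores):
    """One model for all catchments: the model with the highest median score.

    `scores` maps catchment id -> {model id -> score}. Every catchment is
    assumed to have a score for every model.
    """
    models = sorted({m for per_model in scores.values() for m in per_model})
    return max(models, key=lambda m: median([scores[c][m] for c in scores]))


def mosaic(scores):
    """One model per catchment: each catchment keeps its own top-scoring model."""
    return {
        catchment: max(sorted(per_model), key=per_model.get)
        for catchment, per_model in scores.items()
    }
```

With hypothetical scores `{"c1": {"m1": 0.8, "m2": 0.7}, "c2": {"m1": 0.6, "m2": 0.75}, "c3": {"m1": 0.7, "m2": 0.72}}`, the benchmark is `m2` (median 0.72 against 0.70), whereas the mosaic assigns `m1` to `c1` and `m2` to the other two catchments. Because the benchmark is an argmax over the sampled catchments, adding or removing catchments can change which model is selected, which is the dependence the comment points out.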

The FUSE model variants share many similarities: all lumped conceptual hydrological models run at a daily timestep with the same input data. It would be worth discussing their similarities as well as the key differences given in Table 2 – as a more diverse multi-model ensemble may have even greater benefits. This is briefly touched upon in the discussion, but I feel that it is also worth elaborating upon in the methods.

Section 2.3: “Each model is calibrated for each catchment over the period 1989-1998 with a preliminary warm-up period of two years.” Please could you further specify these dates – were they run over water years or calendar years, and was the warm-up period before the calibration i.e. 1987-1988 inclusive or the first two years of the calibration period 1989-1990? Is two years sufficient for a warm-up period for your catchments? We have found that some groundwater dominated catchments require longer warm-up times depending on the model initialisation, but I have no experience of modelling catchments in the USA. This choice of calibration period should be explained in the paper – 10 years is relatively short and may not capture particularly dry/wet years.

Section 2.5 would benefit from a more thorough description of the multi-model approaches. In particular, I noted the following:

(1) Section 2.5.2.2 left me with questions such as what are the benefits of minimising the number of models, how exactly does the method reduce the number of models required, and how many model structures remained? On further reading I found that more details are given in Appendix A – it would be helpful to refer to this in the main text.
(2) Section 2.5.3.1. Could you clarify how the models were combined? “using a simple average of up to four models” – did you take an equally weighted mean of discharge values from all four models for each timestep?
(3) Section 2.5.3.2. – the method selects “the combination of up to three models that yields the highest KGEcomp scores over the calibration period” – I’d be curious to know if there are any cases where a single model is better than any combination of 2 or 3 models? And in this case would you use the single model as ‘the best combination’ or does this method require a minimum of 2 models? Again, this section could refer to Appendix A.
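
Points (2) and (3) can be illustrated with a short sketch of one possible reading: an equally weighted, timestep-wise mean of discharge, and an exhaustive search over subsets of one to `max_size` models. Whether the manuscript uses this weighting, and whether it allows a single-model "combination", is exactly what the comment asks, so both are assumptions here. The names `equal_weight_mean`, `best_combination`, and `neg_mae` are hypothetical, and `neg_mae` merely stands in for the KGEcomp score quoted above.

```python
from itertools import combinations


def equal_weight_mean(sims):
    """Timestep-wise equally weighted mean of several simulated series."""
    n = len(sims)
    return [sum(values) / n for values in zip(*sims)]


def neg_mae(sim, obs):
    """Stand-in score (higher is better); replace with the metric of interest."""
    return -sum(abs(s - o) for s, o in zip(sim, obs)) / len(obs)


def best_combination(simulations, obs, score=neg_mae, max_size=3):
    """Search all subsets of 1..max_size models and return (model_ids, score).

    `simulations` maps model id -> simulated series. Subsets of size 1 are
    included (an ASSUMPTION), so a single model can be returned when no
    average beats it. The strict '>' keeps the smallest subset on ties.
    """
    best_ids, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for ids in combinations(sorted(simulations), size):
            combined = equal_weight_mean([simulations[i] for i in ids])
            value = score(combined, obs)
            if value > best_score:
                best_ids, best_score = ids, value
    return best_ids, best_score
```

For 78 models and `max_size=3` this evaluates 78 + 3,003 + 76,076 = 79,157 subsets per catchment, which is small enough for a brute-force search.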
Figure 9. This figure has a lot of information content with the locations of all gauges, but it is hard to see at a glance which methods are doing better and which are equivalent. I found myself trying to read and compare the numbers written above each map and struggled to see any patterns with so much information available. Could it be presented more clearly, e.g. as a table of pie charts/bar graphs rather than a table of maps?

Appendix line 646: “Interestingly, these are the two top-performing models in Figure A2, but model 72 is not selected in Figure A4, suggesting a large degree of similarity (i.e., equivalence) between both models.” Could this also be because model #72 is equivalent with other model structures (e.g. 96) – rather than necessarily being equivalent with #126? And does equivalence (i.e. similar KGE scores) necessarily mean similarity (i.e. similar hydrographs)?

Figure A10: how are the catchments ordered in this figure? Knowing if catchments are grouped by location, key characteristics, or performance would help with the interpretation of this plot.
Citation: https://doi.org/10.5194/egusphere-2025-6083-RC1
- AC1: 'Reply on RC1', Cyril Thébault, 15 Apr 2026
  
  Thank you for this very positive feedback. Please find attached our detailed reply.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6083-AC1
RC2:
'Comment on egusphere-2025-6083', Anonymous Referee #2, 20 Mar 2026
This paper investigates and compares several instances of multi-model approaches over a large set of catchments:
- mosaic methods that assign a single model to each catchment,
- combination methods that merge multiple models using static or dynamic weighting schemes.

1. General assessment
By all means an excellent work: clear presentation, smooth writing, a pleasure to read and (why not) an example to cite.
2. Details
[introduction]
If you know who first introduced the expression “multi-model mosaic”, write it.

I may be old-fashioned, but I like to pay tribute to the “eminent forebears” at least in an introduction. I would suggest citing here the initial ambition of Linsley (1982), who defended the idea of a single model, arguing that it should no longer be “necessary for each hydrologist to develop his or her own model for each catchment”. One of the arguments of Linsley was that “a new model for every application eliminates the opportunity for learning that comes with repeated applications of the same model.” Note that your “Spatially and temporally static combination” represents to some extent a “multi-model” extension of this ambition.

You could possibly also cite the work of van Esse et al. (2013) as an (unsuccessful) attempt at mosaic-type approaches.

[Materials and methods]
147 : “couple” -> “coupled”

[Results]
Why did you wait until the Results section to mention the 15 catchments that you excluded? I would have said it from the beginning.

[Conclusion]
Based on the surprising result you obtained with your benchmark, one of the first things I would personally try would be to test the multi-calibration approach of Oudin et al. (2006) with this one-size-fits-all structure!

Also, it would be extremely interesting to check whether, even if a multi-model is not much superior to a single one-size-fits-all structure, it does not prove more robust in a climate-change perspective. A possibility would be to run some kind of climate robustness test (for example the RAT of Nicolle et al., 2021).

3. References
van Esse, W. R., C. Perrin, M. J. Booij, D. C. M. Augustijn, F. Fenicia, D. Kavetski, F. Lobligeois, 2013. The influence of conceptual model structure on model performance: a comparative study for 237 French catchments. Hydrol. Earth Syst. Sci. 17(10): 4227-4239, doi: 10.5194/hess-17-4227-2013.
Linsley, R.K., 1982. Rainfall-runoff models-an overview. In: V.P. Singh (Editor), Proceedings of the international symposium on rainfall-runoff modelling. Water Resources Publications, Littleton, CO, pp. 3-22.
Nicolle, P., V. Andréassian, P. Royer-Gaspard, C. Perrin, G. Thirel, L. Coron, & L. Santos. 2021. Technical Note – RAT: a Robustness Assessment Test for calibrated and uncalibrated hydrological models. Hydrol. Earth Syst. Sci., 25, 5013–5027. https://doi.org/10.5194/hess-25-5013-2021.
Oudin, L., Andréassian, V., Mathevet, T., Perrin, C. & Michel, C. 2006. Dynamic averaging of rainfall-runoff model simulations from complementary model parameterizations, Water Resources Research, 42(7): W07410, https://dx.doi.org/10.1029/2005WR004636.
Citation: https://doi.org/10.5194/egusphere-2025-6083-RC2
- AC2: 'Reply on RC2', Cyril Thébault, 15 Apr 2026
  
  Thank you for this very positive feedback. Please find attached our detailed reply.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6083-AC2

Viewed

Total article views: 2,720 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,741	867	112	2,720	93	121

Views and downloads (calculated since 28 Jan 2026)

Month	HTML	PDF	XML	Total
Jan 2026	435	275	25	735
Feb 2026	746	366	43	1,155
Mar 2026	392	140	35	567
Apr 2026	95	44	5	144
May 2026	60	35	3	98
Jun 2026	13	7	1	21

Viewed (geographical distribution)

Total article views: 2,708 (including HTML, PDF, and XML), of which 2,708 have a defined geography and 0 are of unknown origin.

Latest update: 10 Jun 2026

Short summary

Reliable river flow prediction guides water supply planning and flood protection. We tested whether selecting or combining many models improves accuracy compared with a single model. We used 78 models and tested them in 559 river basins across the United States. A carefully chosen single model nearly matched more complex multi-model approaches, while combining models gave slightly higher accuracy and lower uncertainty. However, no approach worked best everywhere.

