Comparative Evaluation of ERA5-Land and ISIMIP3 Runoff Forcing for Global River Streamflow Simulation

Boulange, Julien E. S.; Zhao, Fang; Gosling, Simon N.; Pokhrel, Yadu; Yamazaki, Dai; Zhou, Xudong

doi:10.5194/egusphere-2026-2739

Preprints

https://doi.org/10.5194/egusphere-2026-2739

05 Jun 2026

Abstract. Flooding is among the most widespread natural hazards worldwide, yet many high-risk regions lack the observational data needed for effective flood planning. In these data-sparse regions, global flood models remain essential tools for estimating flood hazard, although their performance is strongly influenced by the choice of runoff forcing data. Two widely used global runoff products are the reanalysis-based ERA5-Land dataset and the ISIMIP3a multi-model hydrological ensemble. Their selection involves an inherent trade-off between high-resolution reanalysis runoff and runoff simulated by hydrological models driven by bias-corrected meteorological inputs, the latter also providing an explicit representation of uncertainty through ensemble spread. This study presents a comparative evaluation of these two products by routing both through a consistent global hydrodynamic framework (CaMa-Flood). Model performance was assessed across IPCC SREX regions against observations from 5,071 gauging stations using the Kling-Gupta Efficiency and its components, while long-term trends in low, mean, and high streamflow were evaluated from a subset of 3,135 stations with sufficient temporal coverage. Simulations forced by ERA5-Land show superior skill in reproducing observed daily streamflow, with consistently higher correlation and stronger agreement in the spatial pattern of regional streamflow trends. However, systematic biases in streamflow magnitude and a tendency to exaggerate drying trends, particularly for low streamflow, are also evident. In contrast, the ISIMIP3a ensemble shows lower skill in reproducing observed daily streamflow metrics but provides more conservative and observation-consistent estimates of long-term trends. Ensemble averaging further improves robustness, with simulated trend ranges more frequently overlapping observational uncertainty bounds, albeit at the expense of dampened variability and extremes. 
Differences between native and spatially aggregated ERA5-Land runoff were negligible within the present modelling framework. Overall, the results demonstrate that no single runoff product is universally optimum: ERA5-Land is well suited for reproducing historical streamflow dynamics, whereas ISIMIP3a is particularly valuable for robust assessments of long-term hydrological change and uncertainty.
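
For reference, the Kling-Gupta Efficiency and the components mentioned in the abstract are commonly written as below. This is the original 2009 formulation; the abstract does not state which KGE variant the authors used, so the exact form here is an assumption.

```latex
\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},
\qquad
\alpha = \frac{\sigma_{\mathrm{sim}}}{\sigma_{\mathrm{obs}}},
\qquad
\beta = \frac{\mu_{\mathrm{sim}}}{\mu_{\mathrm{obs}}}
```

Here r is the linear correlation between simulated and observed streamflow, alpha is the variability ratio, and beta is the bias ratio. A perfect simulation has r = alpha = beta = 1 and therefore KGE = 1.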

Received: 13 May 2026 – Discussion started: 05 Jun 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1: 'Comment on egusphere-2026-2739', Anonymous Referee #1, 10 Jun 2026

This manuscript compares streamflow simulations forced by ERA5-Land and ISIMIP3a runoff products within a consistent CaMa-Flood routing framework, using GRDC observations for global validation. The topic is timely and potentially important. However, the manuscript in its current form requires very substantial revision. The study design does not cleanly isolate the causes of the differences between ERA5-Land and ISIMIP3a, the statistical treatment of trends and uncertainty is underdeveloped, several conclusions are overstated, and key methodological details are missing. There are also serious presentation problems, including duplicated text in the Methods section. I would not recommend publication in the present form.
1. The manuscript presents a comparison between ERA5-Land and ISIMIP3a runoff to address an important research gap, but the novelty is not yet convincingly established. Many previous studies have already evaluated global runoff products, streamflow simulations, hydrological model ensembles, reanalysis products, and routing uncertainty. The authors need to explain more explicitly what is new in this study and how it advances beyond existing benchmark comparisons.
2. Currently, the paper mostly comes across as a benchmark comparison. It would be helpful to clearly explain the main goal of the regional performance assessment. While the authors share regional KGE maps, they don’t specify the scientific question this analysis aims to address. The authors should clarify whether this part is intended to evaluate the spatial consistency of ERA5-Land’s benefits, to identify regions where both runoff products perform poorly, or to explore regional relationships among the ISIMIP3a models.
3. The experiment does not isolate the effect of forcing dataset, hydrological model structure, bias correction, calibration, human impacts, or spatial resolution. Yet the Discussion, especially Section 4.2, frequently suggests causal explanations. The authors should avoid causal overinterpretation unless additional experiments are added. At minimum, they should clearly state that the comparison is between two runoff-product configurations rather than a controlled attribution of individual uncertainty sources.
4. The authors should explain why the SREX regional scale is appropriate for evaluating model performance. SREX regions are broad climate-impact regions rather than hydrological units, and aggregation at this scale may mask substantial basin-level variability. The authors should justify this choice and discuss whether using major river basins, hydroclimatic zones, or aridity-based regions would affect the conclusions.
5. ERA5-Land runoff does not include the same human impact representation as ISIMIP3a. Therefore, it is not clear whether the two products are fully comparable in regions strongly affected by reservoirs, water abstraction, irrigation, or other human interventions. The authors should explain how human impacts are represented in each product and whether this difference may bias the comparison.
6. Related to the previous point, the manuscript states that the CaMa-Flood dam module is enabled. However, ISIMIP3a simulations may already include some human influences depending on the hydrological model and experimental setup. The authors should clarify whether applying the CaMa-Flood dam module to ISIMIP3a runoff could create inconsistencies or double-count some forms of regulation. A sensitivity experiment without the dam module would strengthen the study.
7. The Discussion needs significant rewriting. Right now, it repeats the Results and includes some speculative interpretations. It would be helpful for the authors to concentrate the Discussion on the main scientific question: why does ERA5-Land seem to do better for daily streamflow changes, while ISIMIP3a appears to be more conservative or more consistent when looking at long-term trends?
8. The CaMa-Flood description is duplicated. The text describing CaMa-Flood appears in Section 2.1.2 and then again at the start of Section 2.2.
9. The Introduction spends too much space on global flood exposure, climate pledges, and future flood risk. It should be shortened and refocused on runoff forcing uncertainty, streamflow simulation, and trend evaluation.
10. The authors have done an interesting job estimating Theil-Sen (TS) slopes for annual Q10, mean flow, and Q90, and then comparing regional medians. To make the trend analysis more robust, they could add Mann-Kendall (MK) tests at the station level. Discussing field significance or multiple testing, and checking whether the modeled and observed regional trends differ significantly, would also add valuable insight. The suggestion that ISIMIP3a offers more observation-aligned trend estimates is promising and would be more convincing with more statistical backing.
11. KGE is useful, but KGE alone cannot fully characterize hydrological performance. Possible additions include NSE, logNSE, flow duration curve bias, or others. This is very important because the manuscript is motivated by flood modeling, but the current evaluation does not directly assess flood peaks or floodplain inundation.
12. The manuscript presents trends as percent changes per decade relative to the long-term mean for each flow metric. This can produce very large or unstable values when mean low flow is small, especially in arid and semi-arid regions.
13. The conclusion that spatial resolution has limited influence is too broad. The experiment only shows that, within the present CaMa-Flood configuration and for the selected streamflow metrics, upscaling ERA5-Land runoff from 0.1° to 0.5° has little effect.
14. The regional median KGE values don’t fully capture how well the model performs in each region. Keep in mind that within each SREX region, the performance at individual stations can differ quite a bit, highlighting the importance of looking at more detailed data.
15. The interpretation of the uncertainty overlap in Table 1 might benefit from a little bit of clarification. Keep in mind that a model with broader uncertainty intervals tends to overlap with observations more frequently, but this doesn’t automatically mean it performs better.
16. The station filtering and allocation procedure needs more detail.
17. The manuscript should better connect the daily streamflow performance analysis with the long-term trend analysis. For example, do regions with high KGE also show better trend agreement? Do regions with low KGE show larger trend errors? The current manuscript treats these two analyses mostly separately, but their relationship is central to the paper’s interpretation.
18. The manuscript should be more cautious when stating that ERA5-Land is well suited for reproducing historical streamflow dynamics and that ISIMIP3a is especially valuable for long-term trend assessment. These are reasonable hypotheses, but the current evidence isn't yet strong enough to support such broad recommendations.
19. The authors should also address the non-independence between ERA5-Land and ISIMIP3a. The ISIMIP3a W5E5 forcing partly uses ERA5 data and applies bias adjustments based on observational products, and ERA5-Land is similarly driven by ERA5 atmospheric forcing. This connection needs acknowledgment.
20. Additionally, the paper should avoid suggesting that the results directly apply to flood hazard modeling unless flood-specific metrics are incorporated. Currently, the analysis focuses only on streamflow skill and trends.
21. Do the CaMa-Flood simulations use the same river width and depth for ERA5-Land and ISIMIP3a, or are these calculated separately for each forcing?
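
The station-level trend statistics raised in comments 10 and 12 can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names are hypothetical, the Mann-Kendall test uses the normal approximation without a tie correction, and serial correlation is ignored.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen_slope(years, values):
    """Theil-Sen estimator: the median of all pairwise slopes."""
    slopes = [
        (values[j] - values[i]) / (years[j] - years[i])
        for i, j in combinations(range(len(years)), 2)
        if years[j] != years[i]
    ]
    return median(slopes)


def mann_kendall(values):
    """Return (S, Z, two-sided p) for the Mann-Kendall trend test.

    Assumption: no tied values and no autocorrelation, so the simple
    variance formula and the normal approximation are used.
    """
    n = len(values)
    # S counts concordant minus discordant pairs in time order.
    s = sum(
        (values[j] > values[i]) - (values[j] < values[i])
        for i, j in combinations(range(n), 2)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p


def percent_per_decade(slope_per_year, long_term_mean):
    """Express a slope as percent change per decade relative to the mean."""
    return 100.0 * 10.0 * slope_per_year / long_term_mean
```

With these definitions, the same absolute slope of -0.01 units per year corresponds to about -0.1 % per decade when the long-term mean is 100 but about -200 % per decade when the mean is 0.05, which is the instability for small low flows described in comment 12.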

Citation: https://doi.org/10.5194/egusphere-2026-2739-RC1
RC2: 'Comment on egusphere-2026-2739', Anonymous Referee #2, 15 Jun 2026
The study simulates global river streamflow using the ERA5-Land runoff dataset and the ISIMIP3a hydrological runoff ensemble within a consistent CaMa-Flood modelling framework, and evaluates the resulting simulations against GRDC observational data. The topic is timely and potentially valuable, given the widespread use of both runoff products in global hydrological and flood-risk applications. However, the manuscript does not clearly establish the rationale for comparing ERA5-Land and ISIMIP3a, nor does it sufficiently explain the mechanisms behind their contrasting performance. The analysis mainly reports that ERA5-Land performs better for daily streamflow, while ISIMIP3a gives more conservative trend estimates, but the roles of forcing construction, bias correction, model structure, ensemble design, routing effects, and spatial resolution are not adequately disentangled. Therefore, I consider the manuscript unsuitable for publication in its current form and recommend rejection. My comments are as follows:

The scientific question is not sufficiently developed. The manuscript shows that ERA5-Land performs better for daily streamflow KGE, while ISIMIP3a gives more conservative long-term streamflow trends within the CaMa-Flood framework. However, the paper mainly reports these contrasts rather than explaining why they occur. More diagnostic analysis is needed to identify whether the differences arise from forcing construction, precipitation bias correction, model structure, ensemble averaging, human impacts, calibration, or routing interactions. Additional experiments or targeted diagnostic analyses should be conducted to explain the mechanisms underlying the differences between the two runoff products in streamflow simulations, thereby strengthening the scientific significance of the study.

The rationale for comparing ERA5-Land and ISIMIP3a is not clearly established. Lines 65–75 first discuss streamflow-corrected runoff products and ISIMIP-type runoff generated from bias-corrected climate inputs, but ERA5-Land is then introduced rather abruptly as a reanalysis-based dataset, mainly with emphasis on its high resolution and near-real-time availability. In essence, both ERA5-Land and ISIMIP3a generate runoff from meteorological forcing through land or hydrological models, but their forcing construction and modelling chains differ substantially. ERA5-Land runoff is produced by a land-surface model driven by ERA5 atmospheric reanalysis forcing, whereas ISIMIP3a runoff is produced by multiple hydrological models driven by bias-adjusted meteorological forcing. The authors should therefore more clearly justify why these two products are selected and explain their fundamental differences in terms of data assimilation versus bias correction, land-surface versus hydrological model structures, single-product versus ensemble design, and their expected implications for streamflow simulation. Without this clarification, the study reads more like a comparison of two available datasets than a well-justified scientific evaluation.

The conclusions may be too dependent on a single routing model. All main results are derived from CaMa-Flood simulations, so it is difficult to separate runoff-forcing effects from routing-model effects. Streamflow timing, variability, high-flow behaviour, and trend propagation may depend on the specific routing scheme, river network, dam treatment, floodplain storage, and parameter settings. The authors should either test whether the conclusions are robust using other routing models or clearly limit their claims to the CaMa-Flood framework.

The conclusion that spatial resolution has limited influence is too broad. The manuscript only compares the native ERA5-Land runoff resolution with one aggregated resolution, and this single sensitivity test is insufficient to support a general statement that resolution has little impact on streamflow simulations. The result should be interpreted more cautiously as showing limited sensitivity within the present CaMa-Flood configuration and selected metrics. A more robust assessment would require additional resolution levels and, ideally, tests across different hydrological settings or basin sizes.

The Introduction devotes considerable space to the general impacts of flood hazards, but gives relatively limited attention to the uncertainty of runoff forcing products, which is the central issue of this study. The authors should shorten the broad background on flood disasters and refocus the Introduction on how different runoff datasets affect global streamflow and flood simulations. A more comprehensive review of previous studies on forcing-data uncertainty, runoff-product differences, and their propagation into flood modelling would help better motivate the comparison between ERA5-Land and ISIMIP3a.

Sections 2.1.2 (“Global flood simulation”) and 2.2 (“Evaluation of simulations”) contain duplicated descriptions of the CaMa-Flood model. The repeated text should be removed, and the Methods section should be reorganized to avoid redundancy and improve readability.

In Section 3.1.1, the use of regional median KGE alone may not adequately represent model performance within each SREX region. Although the median is a useful robust summary statistic, it can mask substantial station-level variability, especially in large and hydroclimatically heterogeneous regions. The authors should provide additional information, such as interquartile ranges, boxplots, station-level distributions, or the fraction of stations with positive KGE values, to better support the regional performance assessment.

In Section 3.2.1, the statement that the leave-one-out sensitivity tests indicate “moderate dependence on individual regions with pronounced hydroclimatic signals” is not clearly supported by Table S2. The table only reports the range of leave-one-out cross-regional correlation coefficients, but does not identify which excluded regions drive the minimum or maximum correlations, nor does it show that these regions correspond to pronounced hydroclimatic signals. The authors should either provide the region-specific leave-one-out results and explain which regions control the sensitivity, or soften this interpretation.

Figure 4 presents trend comparisons using regional averages, but this aggregation may mask important station-level variability. The authors should consider adding a station-level comparison using all available gauging stations, for example as a supplementary scatter or density plot. This would allow readers to better assess whether the reported trend relationships are consistently supported across individual stations rather than mainly reflecting regional aggregated values.
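
The additional regional summaries requested above for Section 3.1.1 (interquartile range and fraction of stations with positive KGE alongside the median) could be computed along these lines. This is a hedged sketch: `summarise_region` and its input list of station-level KGE values are hypothetical names, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is one choice among several.

```python
from statistics import quantiles


def summarise_region(station_kge):
    """Summarise station-level KGE values for one region.

    Returns the median, the interquartile range, and the fraction of
    stations with KGE above zero.
    """
    q1, q2, q3 = quantiles(station_kge, n=4, method="inclusive")
    positive = sum(1 for k in station_kge if k > 0)
    return {
        "median": q2,
        "iqr": q3 - q1,
        "frac_positive": positive / len(station_kge),
    }
```

Reporting the three values together shows whether a given regional median sits within a tight cluster of stations or a wide spread that includes many poorly simulated gauges.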
Citation: https://doi.org/10.5194/egusphere-2026-2739-RC2
RC3: 'Comment on egusphere-2026-2739', Anonymous Referee #3, 03 Aug 2026

The manuscript presents a large global benchmarking exercise comparing ERA5-Land and ISIMIP3a runoff forcing within a common CaMa-Flood framework. The manuscript is based on substantial analysis, but the scientific contribution remains unclear. In particular, the study documents performance differences, but it does not explain their causes or provide results that can be readily transferred to broader hydrological questions. For this reason, I find the manuscript reads more like a technical report for a targeted group of users working with ERA5-Land, ISIMIP3a, and CaMa-Flood, but it provides limited transferable insight for the wider HESS community. I therefore do not recommend publication in its current form.
1. The main conclusion is that ERA5-Land performs better for historical daily streamflow dynamics, whereas ISIMIP3a provides more conservative estimates of long-term trends and a broader representation of uncertainty. However, ERA5-Land and ISIMIP3a differ simultaneously in meteorological bias correction, land-surface or hydrological model structure, calibration, treatment of human influences, ensemble design, and spatial resolution. As the authors acknowledge, these simultaneous differences prevent attribution of the reported performance contrasts to any specific process or modeling choice. Consequently, the analysis remains largely descriptive and provides limited general insight beyond the two products evaluated here. Moreover, the principal result largely confirms expectations already established in the Introduction "Reanalysis-based datasets such as ERA5-Land provide comparatively high spatial resolution and near-real-time coverage, which may better capture fine-scale hydrological patterns (Muñoz-Sabater et al., 2021). In contrast, ensemble hydrological products, such as ISIMIP3, employ bias-adjusted meteorological forcing and multi-model hydrological ensembles designed to reduce systematic forcing biases while preserving large-scale climatic trends and representing structural uncertainty (Hempel et al., 2013; Lange, 2021)". The reported findings are broadly consistent with these expected product characteristics. The manuscript does not yet demonstrate what new hydrological understanding emerges beyond this confirmation.
2. The manuscript does not clearly describe at which stage the ISIMIP3a multi-model mean is calculated. The methods suggest that runoff from the nine hydrological models is routed separately through CaMa-Flood, after which multi-model means are calculated during the regional assessment. However, it remains unclear whether the authors average routed daily streamflow series, station-level performance metrics, or regional statistics.
3. The best ISIMIP3a model is selected using observed streamflow performance within each region and is then evaluated using the same general evidence base. This is an in-sample selection and does not show that the best member can be identified in an ungauged basin, a data-poor region, or an independent period. It does not provide an operational model selection method and has limited practical meaning IMO.
4. If I understand correctly, the ISIMIP3a simulations already include time-varying human influences such as water abstraction, dams, and reservoirs. At the same time, the dam module is enabled in CaMa-Flood for the routing simulations. The manuscript does not clearly explain how reservoir and regulation effects represented in the ISIMIP3a runoff interact with the CaMa-Flood dam module. It is also unclear whether ERA5-Land and ISIMIP3a are treated consistently in this respect.
5. The manuscript states that station-level trends are bootstrapped for all datasets, but the ISIMIP3a analysis additionally includes variability across nine hydrological models. It's not clear how these two dimensions are combined. Are model members pooled within the bootstrap, resampled separately, or averaged before station resampling?
6. The sensitivity test remapped an existing runoff product, but it does not test how runoff generation would change if the land-surface model and meteorological inputs operated natively at 0.5 degrees. The experiment therefore supports only the narrower conclusion that CaMa-Flood results are relatively insensitive to post hoc aggregation of this particular runoff field. It does not mean that forcing resolution is generally less important than runoff realism.
7. The presentation quality needs substantial improvement. Figures 1 and 2 are difficult to read at standard journal-page size. The Methods section contains substantial duplicate text. The Introduction is longer and broader than needed for the research question. The opening discussion covers global flood losses, population exposure, the Paris Agreement, future warming, and demographic projections before reaching the specific issue of runoff forcing. Much of this material is generic and does not help define the comparison between ERA5-Land and ISIMIP3a.
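
Comment 2 matters because the stage at which the multi-model mean is taken changes the score. The sketch below is illustrative only: it uses the 2009 KGE form, hypothetical function names, and synthetic numbers rather than anything from the manuscript, and it contrasts scoring the averaged streamflow series with averaging the per-member scores.

```python
import math
from statistics import mean, pstdev


def kge(sim, obs):
    """Kling-Gupta Efficiency (2009 form) for paired series."""
    mu_s, mu_o = mean(sim), mean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / len(obs)
    r = cov / (sd_s * sd_o)      # linear correlation
    alpha = sd_s / sd_o          # variability ratio
    beta = mu_s / mu_o           # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def kge_of_ensemble_mean(members, obs):
    """Average the member series per time step, then score once."""
    ens_mean = [mean(step) for step in zip(*members)]
    return kge(ens_mean, obs)


def mean_of_member_kge(members, obs):
    """Score each member separately, then average the scores."""
    return mean(kge(m, obs) for m in members)


# Synthetic example: two members with opposite biases.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
members = [[1.5 * q for q in obs], [0.5 * q for q in obs]]
```

In this synthetic case the opposite biases cancel in the averaged series, so `kge_of_ensemble_mean(members, obs)` is close to 1, while `mean_of_member_kge(members, obs)` is close to 0.29. The two procedures are therefore not interchangeable, and the Methods need to state which one was applied.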

Citation: https://doi.org/10.5194/egusphere-2026-2739-RC3

Supplement

https://doi.org/10.5194/egusphere-2026-2739-supplement

Short summary

Many flood-prone regions lack the river observations needed for flood planning. We compared two widely used global datasets by testing how well they reproduced observed river flow at more than 5,000 gauging stations worldwide. One dataset better captured daily river flow changes, while the other provided more reliable estimates of long-term change. The results show that the most suitable dataset depends on whether the goal is flood monitoring or long-term risk assessment.

