the Creative Commons Attribution 4.0 License.
Calibrating a large-domain land/hydrology process model in the age of AI: the SUMMA CAMELS experiments
Abstract. Process-based hydrological modeling is a long-standing strategy for simulating and predicting complex water processes over large, hydro-climatically diverse domains, yet model parameter estimation (calibration) remains a persistent challenge for large-scale applications. New techniques and concepts arising in the artificial intelligence (AI) context for hydrology point to new opportunities to tackle this problem in process-based models. This study presents a machine learning (ML) based calibration strategy for large-domain modeling, implemented using the Structure for Unifying Multiple Modeling Alternatives (SUMMA) land/hydrology model coupled with the mizuRoute channel routing model. We explore various ML methods to develop and evaluate a model emulation and parameter estimation scheme, applied here to optimizing SUMMA parameters for streamflow simulation. Leveraging a large-sample catchment dataset, the large-sample emulator (LSE) approach integrates static catchment attributes, model parameters, and performance metrics, providing a basis for large-domain regionalization to unseen watersheds. The LSE approach is compared with a single-site emulator (SSE), demonstrating improved calibration outcomes across temporal and spatial cross-validation experiments. The joint training of the LSE framework yields comparable performance to traditional individual basin calibration while enabling potential for parameter regionalization to out-of-sample, unseen catchments. Motivated by the need to optimize complex hydrology models over continental-scale domains to support national water security applications, this work introduces a scalable strategy for the calibration of large-domain process-based hydrological models.
Status: open (until 21 Apr 2025)
-
RC1: 'Comment on egusphere-2025-38', Anonymous Referee #1, 18 Mar 2025
Dear Editor,
In the manuscript ‘Calibrating a large-domain land/hydrology process model in the age of AI: the SUMMA CAMELS experiments’ the authors present a novel hydrological model calibration method. Using a machine learning model, the authors map directly from the calibration parameters, and catchment attributes for the generalized calibration experiment, to the model performance. Subsequently, increasingly better calibration parameters are iteratively selected by using a genetic algorithm in tandem with the machine learning model, updating the machine learning model when new results are in. The manuscript is well written, thorough, and relevant, although some parts of the manuscript remain vague and could be improved. Therefore, I would recommend minor revisions for this manuscript. Below is a more expansive description of my main arguments, as well as a list of line-by-line comments.
Vague manuscript sections
Although the manuscript is well written, the novel calibration approach introduced in the manuscript remains unclear and in the background throughout the manuscript (except for the methods). The title, abstract and introduction could be improved by clearly stating what this study has done, instead of which model or which dataset was used. The same holds true for the discussion and conclusions, where more focus should be on the specific contribution of this study’s calibration approach, instead of generalization statements already made by various other studies. See the below line-by-line comments for more details.
Specific comments
Title: The title is not very descriptive and does not capture the study well. There are many different studies that calibrate large-domain hydrological models using AI. I would suggest revising the title.
Line 12: “a machine learning (ML) based calibration strategy”: What are the novel aspects of this strategy? This provides little information about the study.
Lines 15-18: “the large-sample emulator (LSE) approach” / ”a single-site emulator (SSE)” these terms are very unclear as they have not been properly introduced.
Line 64: “physics-based PB”: double
Line 71: “model emulation”: This study does not actually emulate the model, but the model performance. This distinct difference should be made more clear, especially as this is contrary to most of the other studies discussed in the introduction.
Lines 98-103: But how is the model actually configured? Are these gridded simulations (which seems to be the case based on lines 112-120)?
Lines 122-123: “expert judgment and review of model parameterizations (i.e. process algorithms)”. This sentence is unclear. What does expert judgement and review entail? In addition, are the model parameters the input parameters to the model or the processes included in the model? If the latter, maybe it is better to find a different term than “parameters”, maybe “configuration”?
Section 2.3: This section could benefit from some restructuring (see comments below).
Lines 148-154: Up until this paragraph, the study’s subbasin calibration approach (i.e. each subbasin is seen as a single calibration element; not, for example, each grid cell) was unclear to me. This approach could be better introduced in the introduction.
Lines 164-169: Up until this paragraph, the study’s iterative calibration approach was unclear to me. This approach could be better introduced in the introduction. In addition, this paragraph is better explained in section 2.3.2 (lines 189-199). Perhaps these sections could be restructured as there is a large overlap between the SSE and LSE experiments?
Lines 164-175: This iterative approach is very similar to traditional calibration approaches except for the speedup offered by the model performance emulator. Moreover, significant numbers of process-based model simulations are still needed, even when considering the generalization opportunities. This trade-off could be better discussed in the discussion.
Line 199: This could be a new section, which allows for more detailed description of hyperparameters and cross-validation.
Lines 420-455: These paragraphs do not discuss the novel aspects and strengths of this study. They do not have to, but they take up a relatively large portion of the discussion.
Line 423: “differentiablee”: differentiable
Discussion section: Personally, I would love to see more discussion on the trade-offs between different ML/DL based calibration approaches, and the place of this study’s calibration approach among them. In addition, I would like to know if and how this study’s calibration approach could be used for other (gridded) hydrological models, and to what extent fewer model simulations (e.g. only iteration 0) could be used to generate the same results.
Code and data availability: The code and associated datasets should be made public and cited.
Citation: https://doi.org/10.5194/egusphere-2025-38-RC1
-
AC1: 'Reply on RC1', Mozhgan Askarzadehfarahani, 15 Apr 2025
We thank Referee #1 for the thoughtful and constructive feedback. We appreciate the recognition of the manuscript’s strengths and the clear suggestions for improvement. The comments helped us clarify several parts of the manuscript. Below we provide point-by-point responses, with reviewer comments followed by our responses in italics.
Title: The title is not very descriptive and does not capture the study well. There are many different studies that calibrate large-domain hydrological models using AI. I would suggest revising the title.
Response: Thank you for this suggestion. We are considering revising the title to explicitly highlight the emulator-based calibration method: "Calibrating a large-domain land/hydrology process model in the age of AI: the SUMMA-CAMELS emulator experiments". That said, we like the original title because it captures both the general context of the study, i.e., that opportunities for calibration are changing in this new era of AI methods (where AI has become a common umbrella term for ML, DL, generative AI and other techniques), and also that the study focuses on a particular set of experiments conducted with a recognizable model and dataset. It’s unclear that adding ‘emulator’ is a necessary detail, but we will consider it. We somewhat disagree that there are many different studies calibrating large-domain process-based/complex models using AI … versus the many studies calibrating ML or simple conceptual models. There are actually fairly few studies (as we explain in the intro) using a fully dynamical model emulation approach for complex models, and none that have tried this approach to date, except for the companion paper that we reference, Tang et al. (2025). Incidentally, that paper appears to be heading for acceptance based on the reviews.
Line 12: “a machine learning (ML) based calibration strategy”: What are the novel aspects of this strategy? This provides little information about the study.
Response: We revised the abstract to emphasize that the emulator joint training is the novelty of this work. The abstract is necessarily concise and further information is provided in the manuscript.
Lines 15-18: “the large-sample emulator (LSE) approach” / ”a single-site emulator (SSE)” these terms are very unclear as they have not been properly introduced.
Response: We revised the abstract to clarify LSE and SSE: “This study introduces a new scalable calibration framework that jointly trains a machine learning emulator for model responses across a large-sample collection of watersheds while leveraging sequential optimization to iteratively refine hydrological model parameters. We evaluate this strategy using the Structure for Unifying Multiple Modeling Alternatives (SUMMA) hydrological modeling framework coupled with the mizuRoute channel routing model for streamflow simulation. This ‘large-sample emulator’ (LSE) approach integrates static catchment attributes, model parameters, and performance metrics, and yields a powerful new strategy for large-domain PB model parameter regionalization to unseen watersheds. The LSE approach is compared to using a more traditional individual basin calibration approach, in this case using a single-site emulator (SSE), trained separately for each basin.”
Line 64: “physics-based PB”: double
Response: Corrected.
Line 71: “model emulation”: This study does not actually emulate the model, but the model performance. This distinct difference should be made more clear, especially as this is contrary to most of the other studies discussed in the introduction.
Response: We added a paragraph in introduction to explain how different studies use emulators: “Generally, emulator strategies have evolved along two primary lines: (i) emulating model performance by directly relating model parameters to one or more performance objective functions, without explicitly modeling the dynamic behavior of the system (Gong et al., 2016; Herrera et al., 2022; Maier et al., 2014; Razavi et al., 2012; Sun et al., 2023), and (ii) emulating key dynamic model states or fluxes, then using the resulting emulator outputs (e.g., time series) to cheaply explore parameter-output sensitivities (Bennett et al., 2024; Maxwell et al., 2021). Importantly, this study explicitly focuses on the first strategy—emulation of model performance metrics—which originated primarily within hydrological modeling contexts. This distinct choice bypasses the need to iteratively run the full hydrological model during calibration, substantially reducing computational expense and enabling scalable optimization for increasingly complex, large-domain hydrology models.”
We also revised this sentence: “The large-sample emulator (LSE) approach employs a novel joint training strategy that combines model performance (i.e., response surface) emulation and parameter optimization scheme to estimate parameters jointly across diverse catchments…”, and there is an explanation of how we trained the emulator: “By training an emulator on a large sample catchment dataset to predict model performance as a function of catchment geo-attributes and parameters…”
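To make the performance-emulation idea in this response concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `toy_model_score` in place of an expensive SUMMA/mizuRoute run, a single scalar basin attribute, two parameters scaled to [0, 1], a k-nearest-neighbour regressor in place of the ML emulator, and a simple random-plus-mutation search in place of the genetic algorithm. All names and settings are illustrative assumptions.

```python
import math
import random


def toy_model_score(attr, params):
    """Hypothetical stand-in for an expensive process-model run (score <= 1)."""
    # Assumption: the best parameter values shift with the basin attribute.
    target = [0.2 + 0.6 * attr, 0.8 - 0.5 * attr]
    return 1.0 - sum((p - t) ** 2 for p, t in zip(params, target))


class KnnEmulator:
    """Distance-weighted k-NN regressor: (attribute, parameters) -> score."""

    def __init__(self, k=5):
        self.k = k
        self.rows = []  # list of (feature vector, score)

    def add(self, attr, params, score):
        self.rows.append(([attr, *params], score))

    def predict(self, attr, params):
        x = [attr, *params]
        nearest = sorted((math.dist(x, f), s) for f, s in self.rows)[: self.k]
        weights = [1.0 / (d + 1e-6) for d, _ in nearest]
        return sum(w * s for w, (_, s) in zip(weights, nearest)) / sum(weights)


def calibrate(basin_attrs, n_iter=4, n_initial=20, n_candidates=300, n_runs=5, seed=0):
    rng = random.Random(seed)
    emulator = KnnEmulator()  # one emulator shared by all basins (joint training)
    best = {}  # basin index -> (best score so far, parameters)

    def run_and_store(i, attr, params):
        score = toy_model_score(attr, params)
        emulator.add(attr, params, score)
        if i not in best or score > best[i][0]:
            best[i] = (score, params)

    # Iteration 0: random parameter samples run through the "expensive" model.
    for i, attr in enumerate(basin_attrs):
        for _ in range(n_initial):
            run_and_store(i, attr, [rng.random(), rng.random()])
    history = [sum(s for s, _ in best.values()) / len(best)]

    for _ in range(n_iter):
        for i, attr in enumerate(basin_attrs):
            # Cheap search on the emulator: random candidates plus mutations
            # of the current best parameter set for this basin.
            candidates = [[rng.random(), rng.random()] for _ in range(n_candidates // 2)]
            candidates += [
                [min(1.0, max(0.0, p + rng.gauss(0.0, 0.05))) for p in best[i][1]]
                for _ in range(n_candidates // 2)
            ]
            candidates.sort(key=lambda c: emulator.predict(attr, c), reverse=True)
            # Only the top-ranked candidates are run through the expensive model,
            # and their results are fed back into the emulator.
            for params in candidates[:n_runs]:
                run_and_store(i, attr, params)
        history.append(sum(s for s, _ in best.values()) / len(best))
    return best, history


best, history = calibrate([0.1, 0.3, 0.5, 0.7, 0.9])
print([round(h, 3) for h in history])  # mean best score after each iteration
```

In this sketch the emulator never predicts time series; it only ranks candidate parameter sets by predicted score, and each iteration still spends `n_runs` real model evaluations per basin, which is the trade-off between emulator use and process-model runs raised elsewhere in this discussion.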
Lines 98-103: But how is the model actually configured? Are these gridded simulations (which seems to be the case based on lines 112-120)?
Response: As explicitly stated in lines 125-126 (v1): “The SUMMA model configuration adopted a single HRU per GRU, in which the GRU was the entire lumped area of each catchment.” We also clarified lines 112-114 (v1) to remove ambiguity: “The associated sub-daily forcing, including precipitation, temperature, specific humidity, shortwave and longwave radiation, wind speed, and air pressure, were derived from gridded datasets but spatially aggregated across each basin area, resulting in basin-averaged input time series.”
Lines 122-123: “expert judgment and review of model parameterizations (i.e. process algorithms)”. This sentence is unclear. What does expert judgement and review entail? In addition, are the model parameters the input parameters to the model or the processes included in the model? If the latter, maybe it is better to find a different term than “parameters”, maybe “configuration”?
Response: We revised this sentence to explicitly define what "expert judgment" entailed and clarified that the term "parameters" refers to numerical model inputs rather than processes themselves. We have revised lines 122–123 (v1) to: “expert judgment involving consultation with model developers, evaluation of previous modeling experiments and sensitivity analyses, and model process algorithms that directly influence runoff generation. These choices include model physics selections, soil and aquifer configuration, spatial and temporal resolution, an a priori parameter set and target calibration parameters.” Also, model ‘parameters’ is a widely used and well understood term in both hydrologic and land modeling, and will be retained. ‘Configuration’ relates to other modeling choices and is more general. We retain the term to abide by convention.
Section 2.3: This section could benefit from some restructuring (see comments below).
Lines 148-154: Up until this paragraph, the study’s subbasin calibration approach (i.e. each subbasin is seen as a single calibration element; not, for example, each grid cell) was unclear to me. This approach could be better introduced in the introduction.
Response: We add a clarifying sentence to Figure 2.1: “The spatial unit for the calibration experiments is each CAMELS watershed.”
Lines 164-169: In addition, this paragraph is better explained in section 2.3.2 (lines 189-199). Perhaps these sections could be restructured as there is a large overlap between the SSE and LSE experiments?
Response: The introduction contains more general framing material about the study without presenting the details of the method. It now gives a high-level description of the method (e.g., emulator-based, iterative, large-sample; see earlier comments), but a higher level of detail is appropriate in the methods section. We restructure and consolidate by moving figure 2 to section 2.3, with some associated discussion. The main point of this intro part of 2.3 is to give the overview context that there are two main sets of experiments, and some cross-cutting details, e.g., length of run, spinup.
Lines 164-175: This iterative approach is very similar to traditional calibration approaches except for the speedup offered by the model performance emulator. Moreover, significant numbers of process-based model simulations are still needed, even when considering the generalization opportunities. This trade-off could be better discussed in the discussion.
Response: This computational trade-off is highlighted in Section 2.3.2, lines 203-207 (v1): "The computational demand of the LSE approach was significant; even using an emulator, it still requires conducting a large number of simulations to generate parameter sets based on optimization algorithms, as well as testing them in a computationally expensive LHM. To address this, the number of iterations was minimized while the number of parameter trials per iteration was increased, which we found improved efficiency without sacrificing accuracy."
However, we recognize the importance of further discussing this trade-off explicitly in the Discussion section. Therefore, we have added the following paragraph to Section 4 (Discussion): “While the LSE strategy still requires a set of process-based model simulations for training, it offers a substantial computational advantage over traditional calibration approaches by drastically reducing the number of required simulations in subsequent iterations. Rather than incurring the cost of repeated full-model evaluations across basins, the emulator enables efficient exploration of the parameter space with far fewer model runs. As described in Section 2.3.2, we further improved efficiency by increasing the number of parameter trials per iteration while reducing the total number of iterations—an approach that maintained accuracy while accelerating convergence. This balance between emulator fidelity and computational cost demonstrates the practicality of the method for large-domain hydrological modeling. Looking ahead, we are optimistic that future enhancements such as adaptive sampling, transfer learning, or cross-domain emulator reuse could further reduce the up-front simulation demand, opening new possibilities for applying this approach to even more complex or higher-resolution modeling systems.”
Also, we disagree with this comment: “This iterative approach is very similar to traditional calibration approaches except for the speedup offered by the model performance emulator.” We explain throughout the paper that there are similarities to the emulator-based optimization described in Gong et al (MO-ASMO), and indeed it was a starting point for this work. A major conceptual difference, however, is the large-sample joint training using geo-attributes, a concept which is now common in new ML modeling (as we discuss) but has not been applied to a PB or conceptual model before (because it takes a lot of computational effort). The resulting jointly trained emulator offers more than ‘speedup’ -- it opens the door to potential regionalization, as well as transfer learning from the large sample that we show leads to better performance than can be found in single site training. These are both significant advantages, and very different from the current practice for PB models, and we believe the reviewer is overlooking these aspects.
Line 199: This could be a new section, which allows for more detailed description of hyperparameters and cross-validation.
Response: This paper has several overarching objectives, the main ones being to present the key findings and outcomes from a new conceptual strategy for calibrating a large-domain complex process model, and to describe the approach and concept. In the years of development leading to the paper, many sub-focus areas emerged, such as understanding and optimizing the impact of hyper-parameters, which could be the core of entirely different papers. We probably tested about a dozen different variations, at different stages in the development process. But given that our work is deliverable-directed for US water agencies (i.e., we need to provide a calibrated US-wide model by a certain date), we did not have time to do the kind of controlled/extended experiments you would ideally conduct if you were going to present this in a paper. Also, the results are likely to be highly dependent on the exact application (e.g., the model, the size of the catchment collection, the geo-attributes chosen, the computing infrastructure and so on), and along dimensions we had no time to explore. For these reasons, such a broader discussion won’t be included here, but reserved for future work should we receive the funding to undertake it. Or it can and probably will be taken up by others who are motivated to do so. We now include the sentence at the end of this paragraph: “Further discussion of these hyperparameter experiments and workflow development is beyond the scope of this paper, but may be tackled in a subsequent publication after more controlled experimentation.”
Lines 420-455: These paragraphs do not discuss the novel aspects and strengths of this study. They do not have to, but they take up a relatively large portion of the discussion.
Response: These paragraphs emphasize important context to understand the overall significance of the approach and its findings. For instance, we describe how the work overturns a commonly held belief about joint calibration for PB models. We re-emphasize our original motivations arising from the ML community. We summarize/highlight the key conceptual advance and also the potential value. These paragraphs do, in fact, "discuss the novel aspects and strengths of this study" and as such we feel that they are entirely appropriate for inclusion in the paper. They may help some readers to a more complete understanding.
Line 423: “differentiablee”: differentiable
Response: corrected.
Discussion section: Personally, I would love to see more discussion on the trade-offs between different ML/DL based calibration approaches, and the place of this study’s calibration approach among them. In addition, I would like to know if and how this study’s calibration approach could be used for other (gridded) hydrological models, and to what extent fewer model simulations (e.g. only iteration 0) could be used to generate the same results.
Response: Unfortunately, such a discussion, unless it is just brief and speculative, is beyond scope. We did compare the new innovation (LSE) to its major logical benchmark, SSE. Such broader papers, involving techniques that would take our group a long time to set up and use in comparative experiments, or alternatively to organize with other method authors, will likely follow as this study's approach is digested by the community. Already, the approach has been recreated by a colleague and is being compared in a subsequent separate paper involving some of the authors (but using a conceptual model, HBV) to both pure ML and differentiable learning models run by another group. We feel that the multiple paragraphs included already to orient this work against the concepts arising in ML, or those used traditionally in hydrology, are enough to place it in context.
That said, in response to the reviewer’s suggestion, we have added a brief paragraph toward the end of the Discussion and Conclusions section that acknowledges the broader momentum in ML/DL-based calibration methods and highlights the adaptability of our framework to other model structures, including gridded implementations:
“While this study focuses on introducing and testing a large-sample emulator framework, we recognize the broader momentum within the ML/DL hydrology community toward methodological innovation and intercomparison. Although a full review or benchmarking against other ML/DL-based calibration strategies is beyond the scope of this paper, we view our approach as complementary to ongoing work exploring differentiable, hybrid, and purely data-driven methods. While we applied the method to lumped basin-scale configurations, the emulator framework itself is generalizable and could be adapted to models with different spatial structures, including gridded domains. Still, the question of spatial scale alignment between the training basins and the regionalization targets introduces important challenges. The method may remain robust under moderate scale inconsistencies, but further research is needed to evaluate this and to understand the limits of spatial generalization in emulator-guided calibration.”
Code and data availability: The code and associated datasets should be made public and cited.
Response: Our initial draft states that they will be made public, which is our goal (pending a publication). We further revised our statement to the following: “The LSE-based optimization codes and associated datasets will be shared on an open-access platform if/when the manuscript is accepted and in advance of publication. This statement will be revised accordingly with the final access details.”
Citation: https://doi.org/10.5194/egusphere-2025-38-AC1
-
RC2: 'Comment on egusphere-2025-38', Anonymous Referee #2, 16 Apr 2025
Farahani et al. evaluate an emulator-based calibration technique, as proposed by Tang et al. (2024, currently a preprint submitted elsewhere), on 600+ CAMELS basins over the US using the SUMMA modelling framework, coupled with the mizuRoute channel routing model. The authors explore the single basin approach (single-site emulator) as a benchmark against the large-sample emulator approach, which integrates static attributes and performance metrics, and provides a basis for large-domain regionalization to unseen basins. The authors have established a comprehensive framework, and their current manuscript is very suitable for publication in HESS after addressing some minor comments listed below. The paper is clearly written and well-referenced.
- Figures 3, 4, 5: it’s not clear if this is an example of a random gauge/basin or whether it covers the entire dataset shown in Figure 1. In the case of the first one, how are these figures representative? Add this information to the figure captions.
- Figure 6: can you also show the native KGE values (unscaled into -1 to 1 interval), in the supporting information document? What is the motivation for using 6 iterations?
- It might be useful to plot model CDF performance to some other benchmark performances, which are available from the earlier studies over the used CAMELS basins.
- To my understanding, yes, SUMMA can simulate river flows. Still, its main strength over simple rainfall-runoff models is that it can also provide more realistic estimates of soil moisture, snow, etc. Could you also check how the model performance against variables independent from the discharge calibration has changed after the emulator-based calibration, similar to what Tsai et al. (2021), Nature Communications 12(1):5988, showed for either evapotranspiration or soil moisture?
- From the figures, I see some evidence that your methodology improved regarding streamflow. Can you discuss how the model run times differ between your methodology and the hypothetical example of a full calibration of your SUMMA runs (without an emulator)?
- Figure 1: polylines of the river should be overlaid over state boundaries. Currently, rivers are ending up in the middle of nowhere, not flowing towards the sea. Why are the colours of lakes different from rivers?
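For readers following the Figure 6 comment on native versus scaled KGE, here is a small sketch of how the two quantities could relate. The KGE definition is the standard three-component form; the rescaling `KGE / (2 - KGE)` is an assumption made for illustration only, since this discussion does not state which transform the manuscript uses to map KGE into the -1 to 1 interval. The function names are hypothetical.

```python
import math
import statistics


def kge(sim, obs):
    """Kling-Gupta efficiency from correlation, variability ratio and bias ratio."""
    mu_s, mu_o = statistics.fmean(sim), statistics.fmean(obs)
    sd_s, sd_o = statistics.pstdev(sim), statistics.pstdev(obs)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / len(obs)
    r = cov / (sd_s * sd_o)   # linear correlation
    alpha = sd_s / sd_o       # variability ratio
    beta = mu_s / mu_o        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def normalize_kge(value):
    """Assumed rescaling: maps native KGE in (-inf, 1] onto (-1, 1]."""
    return value / (2.0 - value)


def denormalize_kge(scaled):
    """Inverse of the assumed rescaling, recovering the native KGE."""
    return 2.0 * scaled / (1.0 + scaled)


obs = [1.0, 3.0, 2.0, 5.0, 4.0]   # illustrative flows, not real data
sim = [2.0 * q for q in obs]      # perfectly correlated but twice too large
native = kge(sim, obs)
print(native, normalize_kge(native))
```

Under this assumed transform, a native KGE of 1 stays at 1, a native KGE of 0 stays at 0, and large negative values are compressed toward -1, so native values can be recovered with `denormalize_kge` if only scaled values are plotted.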
Citation: https://doi.org/10.5194/egusphere-2025-38-RC2