the Creative Commons Attribution 4.0 License.
Process diagnostics of snowmelt runoff in global hydrological models: Part II – Are more complex models better?
Abstract. The added value of increased process complexity has long been a central yet unresolved question in hydrological modeling, particularly for snowmelt runoff (SMR), where multiple physical processes interact in complex ways. To address this, we develop a Tree-Based Model Complexity Scoring (TBMCS) method to systematically quantify the complexity of snow-related processes across 13 global hydrological and land surface models. Then, using SMR characteristics, i.e., total runoff (Qsum), peak discharge (Qmax), and centroid timing (CTQ), as integrated indicators to evaluate these models, we quantify the linkage between model complexity and model performance in 1,513 snow-dominated basins. Results show that (1) models differ substantially in their representation of physical processes, with the largest divergence in melting process treatments, followed by sublimation, interception, and rainfall-snowfall partitioning processes. (2) While model performance for Qsum and Qmax shows limited sensitivity to model complexity, CTQ performance exhibits a positive correlation with model complexity (r = 0.56, P < 0.05), particularly in highly complex basins, highlighting the role of process complexity under challenging conditions. (3) We also find that model performance depends more on systematic and balanced representations of key processes than on complexity alone. High-complexity models with well-integrated processes (e.g., DBH) show high robustness, whereas models lacking critical modules exhibit poor accuracy, and even simpler models with well-designed modules (e.g., PCR-GLOBWB) can perform robustly. This study provides a quantitative framework for assessing model complexity and emphasizes that systematic process design is critical for improving SMR simulations in complex environments, offering guidance for future model development.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-6073', Anonymous Referee #1, 24 Feb 2026
- RC2: 'Comment on egusphere-2025-6073', Anonymous Referee #2, 16 Mar 2026
Overall, this manuscript addresses an interesting question by linking snow-related model process complexity to SMR performance across a large sample of basins. The proposed TBMCS framework is potentially useful and may provide a valuable starting point for future discussion of process complexity. However, the manuscript still has several important limitations, particularly its close dependence on Part I, the interpretation of complexity effects under inconsistent calibration status, and the limited validation of the new complexity metric. In its current form, I am not yet fully convinced that Part II is sufficiently strong as a standalone paper. Addressing these concerns may require substantial revision and further clarification of the study scope and framing.
Major comments:
- I have a generally positive view of Part I, which provides substantial large-sample evaluation results and offers a useful basis for understanding model performance during snowmelt periods. In contrast, I find Part II somewhat less convincing as a standalone paper. In its current form, Part II reads more as a natural extension of Part I than as a fully independent study. Its design, performance metrics, basin framework, and much of the interpretive context are inherited directly from Part I, while the TBMCS-based complexity analysis is the main new element. I therefore encourage the authors to further clarify the independence and scientific completeness of Part II as a separate contribution. Given that Part I contains only a limited number of main-text figures, whereas Part II includes a fair amount of supporting or less essential material, one possible option would be to integrate the core complexity analysis into Part I. Such integration might improve the overall coherence of the study, especially if some of the less essential analyses in Part I are simplified.
- A major concern is that the interpretation of model complexity may be strongly confounded by differences in calibration status across models. In Part I, this issue is less critical because the main goal is model evaluation itself. In Part II, however, the manuscript attempts to interpret SMR performance differences in terms of process complexity and to draw implications for model development. In this context, calibration becomes a more important limitation. If some models are calibrated, some are not, and others are only partially calibrated, the reported complexity–performance relationships may not be cleanly attributable to model complexity itself. For any model, performance can change substantially with parameter values while model complexity remains unchanged. Ideally, this type of analysis would be more convincing under calibrated conditions, which also suggests that relying on existing public runoff datasets may impose important constraints on the current study design.
- The discussion of model complexity could be further strengthened. At present, the literature review is somewhat limited and does not sufficiently engage with the broader literature on model structural complexity, structural uncertainty, and the interaction between complexity and calibration. The current framing sometimes gives the impression that the central question is whether “more complex is better,” whereas the existing literature suggests a much more nuanced picture. I encourage the authors to broaden the review and place the present study in a more balanced methodological context.
- Although TBMCS is an interesting idea, its validation remains somewhat limited. The manuscript does not provide enough comparison with previous or simpler ways of describing model complexity, making it difficult to assess whether TBMCS offers a clear advantage beyond being a new scoring framework. A new complexity metric would be more convincingly introduced through stricter comparisons at smaller scales, under calibrated conditions, and against existing measures. At minimum, the present limitations should be more clearly acknowledged.
- I also have some reservations about how rainfall–snowfall partitioning complexity is treated in TBMCS. Unlike many internal hydrological processes, precipitation phase partitioning is closely tied to forcing data and upstream processing. Some models diagnose phase internally, whereas others rely on externally provided phase information or forcing products that already contain this distinction. In the current framework, such externally handled complexity may be scored as lower simply because it is not explicitly represented in the model code, while the complexity embedded in the forcing data is not considered. For this reason, it could be better to exclude rainfall–snowfall partitioning from the complexity analysis.
Other:
- Figures 2–5 contain useful information on model structure, but some of them may be better placed in the Appendix or Supplement, since they mainly support the construction of TBMCS rather than serving as central result figures.
Citation: https://doi.org/10.5194/egusphere-2025-6073-RC2
Review of the manuscript: “Process diagnostics of snowmelt runoff in global hydrological models: Part II – Are more complex models better?” by Lei et al.
Summary and general assessment
In their manuscript, the authors study how the complexity of snow processes in global hydrological and land-surface models influences their representation of snowmelt runoff. They introduce a new tree-based complexity scoring method and apply this to the snow process formulations of 13 global hydrological and land-surface models. The authors then compare the performance of these models for snowmelt runoff over a large set of catchments. They show that (1) there is diversity in how models represent snow processes, (2) model complexity correlates well with performance for the centroid timing of snowmelt but not with the total snowmelt and the snowmelt peak, and (3) model complexity alone cannot explain performance. They conclude that a balanced representation of snow processes is necessary for good model performance for snowmelt runoff.
While I think their study could be of value to the hydrological community, I think that major revisions are necessary to improve the clarity of the text and the interpretation of the results. In particular, methodological concepts need to be better introduced, some methodological choices surrounding the complexity scoring method require additional clarification, and the results require better discussion in light of the existing literature.
Please find my detailed comments below.
Major
1. While I understand that this is an accompanying paper to Part 1, information that is necessary to understand this paper on its own is not properly introduced or only very late in the text. For example, the term “Robustness” is only introduced around line 272 in the Results and the reader is referred to Part 1 for details, but significant parts of the subsequent analysis depend on this concept (e.g. Figure 9). The robustness index should be introduced in the methodological section. Similarly, the term “model performance” is now introduced in the results in L219, but should be defined in the methodology (where there are already references to this). Lastly, the metrics Qsum, Qmax, and CTQ should be better introduced.
2. Terminology should be precise and consistent. For example, the authors use “total complexity score”, “model complexity”, and “complexity” interchangeably, which can be confused with basin complexity; “Qsum” is sometimes used instead of “model performance for Qsum”; etc. See other examples below.
3. L100. What remains unclear from the text, is what the authors base their assessment of model complexity on. Do they base the assessment on previous model intercomparison papers (e.g. Telteu et al., 2021; Müller-Schmied et al., 2025), on discussions with experts, on the model code, or on the supporting publications of the models? Please elaborate.
4. Table 1. A major component missing in the complexity analysis is the representation of the snow storage itself. Different models have different numbers of snow layers, and some have a liquid snow storage in the snow and others don’t (e.g. Telteu et al., 2021).
Furthermore, models can have sub-grid routines to differentiate between different elevation zones within the cells which can thus melt at different times (e.g. CWatM).
Such complexity - in particular elevation zones and liquid water storage where meltwater is temporarily stored - might cause a delay in the release of snowmelt runoff and thus might strongly influence the metrics considered in this study.
The authors need to explain why they did not consider such model complexity or include it in additional analyses.
5. Throughout the results: Results should be discussed more thoroughly and placed in the context of the wider literature.
For example, performance differences between the different models are now all attributed to model complexity. The authors should reflect more on the possibility that other factors play a role, such as differences in calibration between the models, the complexity from other processes (complexity in the representation of soil layers, groundwater or glaciers), or uncertainty in the forcing data.
Also, differences between the presented findings and previous studies should be better discussed. For example, what explains differences between the findings here (e.g. that CTQ performance correlates with model complexity) and other studies, such as Beck et al. (2017), who found that uncalibrated “simpler” GHMs seem to outperform more elaborate LSMs in snow-dominated basins, and Merz et al. (2022), who show that complexity does not always lead to better performance?
6. Finally, I would like to point out that improvements might also depend on potential changes made in the accompanying Part 1 paper.
Minor
Title: should it not be “global hydrological and land-surface models”?
L57: Explain what centroid timing is.
L85: While I understand that this is a paper accompanying Part 1, I would still advise to elaborate more on the set-up to make it more likely that complexity is the main explanation for the results, e.g. that model resolutions, forcing, and routing are largely held constant.
L88: “The key runoff characteristics considered here are Qsum, Qmax, and CTQ, which are crucial for water resource utilization, flood hazard prevention, and water resource management, respectively.” I would better introduce Qsum, Qmax and CTQ here and discuss how they are defined. Furthermore, references are required when stating that these are crucial for water resources and flood hazards.
L93: Canopy radiative transfer and surface albedo are later not treated as separate processes and do not have their own process tree. I would clarify which processes are the focus of the analysis and so I would not discuss these here.
L116: Highlight that, after weights are assigned, everything is summed to result in a complexity score for each process.
L118: Provide some more elaboration on why the weights were chosen this way. I do think it could influence the ranking of the models.
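One concrete way to substantiate this concern is a weight-perturbation test: recompute the model ranking under randomly perturbed weights and count how often it changes. A minimal sketch is given below; the per-process levels, model names, and weight ranges are entirely hypothetical (not the manuscript's actual TBMCS scores), and serve only to show how such a check could be set up.

```python
import random

# Hypothetical per-process complexity levels for three toy models
# (ordinal 0-3 scale); NOT the actual TBMCS scores from the manuscript.
processes = ["partitioning", "interception", "sublimation", "melt"]
models = {
    "model_A": {"partitioning": 1, "interception": 2, "sublimation": 1, "melt": 3},
    "model_B": {"partitioning": 2, "interception": 2, "sublimation": 2, "melt": 2},
    "model_C": {"partitioning": 0, "interception": 3, "sublimation": 1, "melt": 1},
}

def total_score(levels, weights):
    """Weighted sum of per-process levels -> total complexity score."""
    return sum(weights[p] * levels[p] for p in processes)

def ranking(weights):
    """Models ordered from most to least complex under the given weights."""
    return sorted(models, key=lambda m: total_score(models[m], weights), reverse=True)

base_weights = {p: 1.0 for p in processes}
rng = random.Random(0)  # fixed seed for reproducibility
changed = sum(
    ranking({p: rng.uniform(0.5, 1.5) for p in processes}) != ranking(base_weights)
    for _ in range(200)
)
print(f"ranking changed in {changed}/200 weight perturbations")
```

If the ranking of the 13 models proved stable under such perturbations, the choice of weights would matter less than it currently appears to.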
L122: Highlight better in Section 2.2 what the four key processes are (see also comment on L93).
L134: What is plant functional type entropy?
L150: I am wondering if taking the precipitation phase of the forcing data makes a model’s process representation necessarily less complex. The phase-partitioning in the forcing data is likely very similar as the way it is handled in the model and might even be more complex than just applying a constant temperature threshold. So does it actually make sense to add this to model complexity? Please elaborate.
L174: The authors write that almost all models explicitly represent sublimation. However, Telteu et al., 2021 and Müller-Schmied et al., 2025 classify PCR-GLOBWB as not representing snow sublimation. Please elaborate.
L204: Here individual complexity scores (for each process) are summed to form the total complexity score. Later on in the text, this is always referred to as model complexity. Make sure that the terminology of complexity score vs total complexity score vs model complexity vs complexity is clear and consistent (particularly because there is also basin complexity). See also comments below.
L218: Performance needs to be introduced before in the methodology section and the text should be consistent with the terms performance vs model performance.
L224: “Figure 7 shows that, overall, the correlation between model complexity and performance is stronger under high basin complexity.” This is too strong a statement, as it is only true for CTQ; the differences are not meaningful for Qsum (−0.04 to 0.14) and Qmax (−0.12 to −0.16).
L231 “Qmax depends more on reproducing input peaks and runoff routing, yet most models remain simple in key processes such as rainfall–snowfall partitioning and interception, so added complexity offers limited benefit.”
I would argue that this maximum also very much depends on the rate at which the snow melts (fast melt can lead to high peaks) and thus the melt process complexity. Why would added complexity for melt parameterization not lead to better performance at Qmax?
L231 Furthermore, if the authors attribute their findings to the representation of individual snow processes, the authors should also test the correlation of model performance vs the complexity scores for individual processes (e.g. rainfall-snowfall partitioning only, interception only, snowmelt only) and see if this leads to more significant results.
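The per-process test suggested here is cheap to run. As an illustration, the sketch below computes Pearson correlations between a skill metric and each individual process score; all inputs are fabricated (five toy models with made-up scores and skill values), whereas the real analysis would use the 13 models' TBMCS process scores and their CTQ performance.

```python
import numpy as np

# Fabricated inputs, illustrative only: per-process complexity scores for
# five toy models (rows) and one CTQ skill value per model.
process_names = ["partitioning", "interception", "sublimation", "melt"]
scores = np.array([
    [1, 2, 1, 0],
    [2, 1, 2, 1],
    [0, 3, 1, 2],
    [2, 2, 2, 3],
    [1, 1, 0, 3],
], dtype=float)
ctq_skill = np.array([0.2, 0.4, 0.5, 0.8, 0.7])  # made-up performance values

def per_process_correlations(scores, skill):
    """Pearson r of model skill against each individual process score."""
    return {name: float(np.corrcoef(scores[:, j], skill)[0, 1])
            for j, name in enumerate(process_names)}

for name, r in per_process_correlations(scores, ctq_skill).items():
    print(f"{name:12s} r = {r:+.2f}")
```

Reporting such per-process correlations (with significance tests) would show which individual processes, if any, drive the aggregate complexity–performance relationship.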
L234 “Simple degree-day schemes often misrepresent melt timing, whereas more complex energy-based formulations capture the on set and magnitude of snowmelt more accurately, ...” Does this not depend on the calibration of the degree-day schemes? For example, other studies suggest that temperature index approaches and energy balance can reach similar results (e.g. Magnusson et al., 2015) and that uncalibrated simple GHMs can outperform uncalibrated complex LHMs in snow-dominated basins (Beck et al., 2017).
L241: “Error compensation further explains these differences. For Qsum and Qmax, process-level errors can be offset through calibration, masking the potential advantages of higher complexity. By comparison, CTQ, as a normalized timing metric, is less amenable to such compensation because it reflects the full temporal distribution of flows.”
I do not fully agree with this statement and I wonder if it is not the opposite: Qsum and Qmax depend directly on the amount of water, which is fixed (e.g. forcing input). In contrast, CTQ does not depend on the actual amount of water, but depends on the timing. Timing can be calibrated (e.g. DDF is in m/day, which is a rate). Please explain.
L248-L250: These sentences are not precise and should be rephrased. Do the authors mean the following? “Model complexity exhibits ... negative correlations with model performance for Qsum and Qmax..”, “whereas the relationship between model performance for CTQ and model complexity is more strongly influenced by basin complexity factors, ...”
L250: “stronger under higher heterogeneity”: What is meant by heterogeneity? And the correlations are not always stronger with higher quantiles but instead become weaker again with higher values. Why is it not “strongest under intermediate heterogeneity”?
L256: “However, beyond this threshold...begin to dominate..” This statement appears very certain, but it is not tested and is a hypothesis. The statement should be supported by references or weakened with words such as “likely” or “we expect”.
L278: “significant”: I would not consider a correlation with P<0.1 to be significant
L267: “high-elevation regions receive strong surface radiation...acceleration of snowmelt” A reference should be added for this statement. Furthermore, this could also be related to the low temperatures here (which might not rise above the melt threshold of temperature index models) and the issue of snow towers (Freudiger et al., 2017).
L287: What is resilience? The term is not used before.
L287: HH, HL, LH, LL need to be defined in the text. Now they are only defined in the figure caption of Figure 9.
L290: “This suggests that robust performance...designed process representation”. The distribution of complexity in the HH plots and the HL plots looks the same to me: I would not say that one has a more balanced distribution than the other. I think that this statement is not supported by this observation.
L302: “CWATM enhances snowmelt simulation by introducing additional parameters”. What additional parameters?
L331: Do you mean “model performance for Qsum and Qmax”?
L334: “CTQ exhibits a strong positive effect” Do you mean “performance for CTQ”? And a correlation with what?
Figure 8: Caption is not completely clear to me.
“...from the 10th to 100th percentile..” Percentile of the basin complexity factor?
“...values shows..” What values, e.g. the median value in this decile?
Figure 10: What is the “ideal model”? Do you mean the most complex model? “Ideal” suggests the best performing model, so maybe reconsider the terminology.
Technical points:
Figures 6, 7, and 9: The color coding is confusing to me. Both the high complexity/high performance quadrant and the low complexity/low performance quadrant are colored red, which are opposite concepts. The same applies to HL and LH, which are both green. Please reconsider the color design of these plots.
Figure 10: Often the legend is overlapping with the figures, making it difficult to read.
L306. Do you mean Figure 9c?
L307: I would add: “These models illustrate a gradient of structural complexity and robustness”.
L356: Remove the title “Appendix A”.
References:
Beck, H. E., Van Dijk, A. I., De Roo, A., Dutra, E., Fink, G., Orth, R., & Schellekens, J. (2017). Global evaluation of runoff from 10 state-of-the-art hydrological models. Hydrology and Earth System Sciences, 21(6), 2881-2903.
Freudiger, D., Kohn, I., Seibert, J., Stahl, K., & Weiler, M. (2017). Snow redistribution for the hydrological modeling of alpine catchments. Wiley Interdisciplinary Reviews: Water, 4(5), e1232.
Magnusson, J., Wever, N., Essery, R., Helbig, N., Winstral, A., & Jonas, T. (2015). Evaluating snow models with varying process representations for hydrological applications. Water Resources Research, 51(4), 2707-2723.
Merz, R., Miniussi, A., Basso, S., Petersen, K. J., & Tarasova, L. (2022). More complex is not necessarily better in large-scale hydrological modeling: a model complexity experiment across the contiguous United States. Bulletin of the American Meteorological Society, 103(8), E1947-E1967.
Müller Schmied, H., Gosling, S. N., Garnsworthy, M., Müller, L., Telteu, C. E., Ahmed, A. K., ... & Yokohata, T. (2025). Graphical representation of global water models. Geoscientific Model Development, 18(8), 2409-2425.
Telteu, C. E., Müller Schmied, H., Thiery, W., Leng, G., Burek, P., Liu, X., ... & Herz, F. (2021). Understanding each other's models: an introduction and a standard representation of 16 global water models to support intercomparison, improvement, and communication. Geoscientific Model Development, 14(6), 3843-3878.