Autocalibration of a physically-based hydrological model: does it produce physically realistic parameters?
Abstract. Hydrological models are essential tools for predicting water availability, floods, and droughts. Physically-based models can represent catchment processes with a greater degree of realism than conceptual or data-driven models, as they explicitly solve equations based on well-established physical laws that relate directly to catchment processes. However, they can require extensive calibration, which can be computationally demanding. This study develops and applies an autocalibration method for SHETRAN, a physically-based model, to improve its performance across 698 catchments in the UK. The paper discusses the process of model calibration and the benefits and caveats of the approach, and examines the extent to which the physical realism of the parameters is preserved through the autocalibration.
Results show that the autocalibration process significantly improves SHETRAN’s performance, raising the median NSE value for the 698 catchments from 0.69 to 0.82. After calibration, 85 % of catchments achieve NSE values of ≥0.7, demonstrating a substantial enhancement in accuracy of simulations across a range of catchments with different climatic, hydrological, topographical, and geological characteristics. The greatest improvements were observed in groundwater-dominated catchments, where uncalibrated simulations struggled. Additionally, simulated transmissivity values align well with measured data, providing confidence in the model’s ability to produce parameters that mirror physical realism.
This study highlights the feasibility of applying physically-based models at a national scale when combined with effective autocalibration techniques. Autocalibrated-SHETRAN-UK performs comparably to conceptual and data-driven models, whilst offering improved transparency of hydrological processes. Future work will focus on integrating groundwater levels into the calibration process of SHETRAN and refining the model by introducing more spatial complexity in its soil and aquifer representation to better reflect real-world variability. These advancements will further enhance our capability to simulate hydrological responses under changing climatic and land-use conditions using SHETRAN.
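The performance figures above are reported as Nash–Sutcliffe efficiency (NSE) values computed against gauged daily flows. As a point of reference only (a minimal sketch of the metric with illustrative values, not the authors' code):

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better than the
    mean of the observations, and negative values are worse than the mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

# Illustrative daily flows (m3/s); real use would pass the gauged and
# SHETRAN-simulated series for the calibration or evaluation period.
obs = np.array([2.1, 3.4, 5.0, 4.2, 3.1, 2.6])
sim = np.array([2.0, 3.6, 4.7, 4.4, 3.0, 2.8])
print(round(nse(obs, sim), 3))
```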
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-1824', Anonymous Referee #1, 15 Jul 2025
AC1: 'Reply on RC1', Eleyna McGrady, 14 Dec 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1824/egusphere-2025-1824-AC1-supplement.pdf
RC2: 'Comment on egusphere-2025-1824', Anonymous Referee #2, 16 Nov 2025
This work uses the SCE-UA automatic calibration platform to calibrate surface water flows predicted over a number of watersheds in the UK. The SHETRAN distributed model is used as the simulation platform, run daily and compared to daily streamflow values. While I appreciate the goal of arriving at the right answers for the right reasons, the analysis presented is insufficient to establish the physical basis of the calibrated values. Furthermore, the approach used to conceptualize each model is very similar, if not identical, to that of the watershed models this work is contrasted against. I think more work is needed, along with revisions to the manuscript, to establish these points. Comments are below.
* What are the limitations and biases in the shallow subsurface? The authors state that a single aquifer unit 20 m in thickness was used; this represents only shallow, surficial, unconfined groundwater. The fixed 20 m thickness has an influence on the transmissivity estimates (see the first sketch after this list). The authors should explore this sensitivity with some number of representative watersheds to ensure this does not impart bias.
* The authors state that some watersheds were modeled with 1 km columns while others were modeled with 5 km columns. This is a substantial and seemingly arbitrary adjustment in resolution. What are the limitations of the grid-resolution switch for the 16 catchments run at 5 km? Did the authors conduct a sensitivity study on resolution to establish these final values? Given that topography will be smoothed out substantially at 5 km, a sensitivity study is needed to ensure that results are transferable between these different resolutions.
*It is well known that groundwater does not follow surface topographic divides and that aquifer systems connect watersheds laterally. It appears that each of these watersheds is modeled independently, which does not allow for groundwater import and export between watersheds or for regional groundwater flow. Can the authors show that lateral groundwater flow does not impact their results?
*The authors visually compare the transmissivity values produced at the end of the calibration process with BGS aquifers (Fig 11 and 12). Visually, the maps appear to bear no similarity. The other continental scale efforts the authors mention (e.g. Naz et al, Yang et al) start with geologic maps and aquifers to parameterize their model, then adjust parameter values accordingly. 1. Why was this more common approach not taken? 2. What quantitative steps can be taken in the manuscript to show any agreement between the geologically derived transmissivity values and the ones arrived at from this current study?
*It would appear that a major component of the physics-based approach would be to generate water table depth values and ET estimates along with streamflow. These values could help characterize changes in storage and ET fluxes, which, along with streamflow discharge, would close the water budget for each watershed, potentially limiting the space of equifinality (see the second sketch after this list). Given the unconstrained nature of the calibration exercise and the simple model configuration, it appears that many solutions might provide the same streamflow estimates. A very thick soil layer could be produced that mimics the aquifer, or AE/PE values could change the water budget. These could easily take on unrealistic values, detracting from the central theme of the work. The authors should justify the use of just streamflow in calibration. More discussion is needed, perhaps also with sensitivity cases, to ensure that parameter values are not unrealistic.
*There are no surface water parameters, such as channel widths or Manning's roughness coefficients, included in the analysis or sensitivity. Is this because channel routing was somehow not included in the study, or for some other reason?
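To illustrate the first bullet's point about the fixed aquifer thickness, a minimal sketch (with hypothetical conductivity and thickness values, not taken from the manuscript) of how the assumed saturated thickness b propagates linearly into the transmissivity estimate T = K·b:

```python
# Transmissivity T = K * b. With a fixed saturated thickness b, any error in b
# maps linearly onto the calibrated transmissivity, or equivalently onto the
# hydraulic conductivity K inferred from it. All values below are hypothetical.
conductivity_m_per_day = 0.5                # hypothetical calibrated K (m/day)
for thickness_m in (10.0, 20.0, 40.0):      # candidate aquifer thicknesses (m)
    transmissivity = conductivity_m_per_day * thickness_m  # m2/day
    print(f"b = {thickness_m:>5.1f} m  ->  T = {transmissivity:6.1f} m2/day")
```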
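Likewise, the water-budget point in the fifth bullet amounts to a simple closure check, P ≈ Q + ET + ΔS, that additional calibration targets (ET, water-table depth) would help constrain. A sketch with hypothetical annual totals, showing how two parameter sets can reproduce the same streamflow while partitioning the remainder differently:

```python
# Catchment water balance over a period (all terms in mm): P = Q + ET + dS.
# Two hypothetical parameter sets reproduce the same streamflow Q but differ
# in ET and storage change -- the equifinality concern raised above.
precip_mm = 1100.0
parameter_sets = {
    "set A": {"Q": 600.0, "ET": 480.0, "dS": 20.0},
    "set B": {"Q": 600.0, "ET": 380.0, "dS": 120.0},
}
for name, terms in parameter_sets.items():
    residual = precip_mm - (terms["Q"] + terms["ET"] + terms["dS"])
    print(f"{name}: Q={terms['Q']}, ET={terms['ET']}, dS={terms['dS']}, "
          f"closure residual = {residual:.1f} mm")
```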
Citation: https://doi.org/10.5194/egusphere-2025-1824-RC2
AC2: 'Reply on RC2', Eleyna McGrady, 14 Dec 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1824/egusphere-2025-1824-AC2-supplement.pdf
Title: Autocalibration of a physically-based hydrological model: does it produce physically realistic parameters?
The paper presents the results of a calibrated process-based model and compares them with other models. Moreover, the paper qualitatively describes the similarity of the hydrogeological parameters to other sources of information. The improvement is substantial compared with the uncalibrated model, which puts the calibrated model at the level of conceptual and machine-learning models, with the added benefit of interpretable parameters.
Two main concerns emerge from the paper. First, it is well known by the hydrology community that uncalibrated models can improve substantially if some calibration of their parameters is applied. Therefore, the authors should put more effort into highlighting the difficulty of applying a calibration to such models. In this context, a better description of the process applied during the calibration would benefit a broader community that needs to calibrate these models. However, when the methodology finally mentions the autocalibration, the authors relegate the information to the appendix, which undermines the importance of presenting a methodology for calibrating such models. From my point of view, this is a key point that is not presented adequately.
The second main concern is about the “realism” of the parameters. The authors spend many sections of the paper trying to prove this, but they do so only qualitatively. Proving such a strong claim requires more than comparing maps. The authors could analyze the results with scatter plots, correlations, the percentage of catchments with consistent parameters, parameter maps developed with kriging (or co-kriging), etc., to mention just a few more robust analyses. Without that analysis, it is impossible to answer the title question. Moreover, the authors use the word “real” many times to refer to the comparison with other maps. However, they fail to acknowledge that such maps are themselves the result of some model; therefore, they cannot be considered “real” data. If the authors want to compare with real data, they should compare with parameters extracted from wells, i.e., point observations. Any other spatial distribution of such parameters is just a model (synthetic data). Another issue the authors do not mention is that the groundwater parameters are probably the most uncertain of all. The authors should incorporate an analysis of other parameters that are more easily constrained by observations or remote-sensing products.
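The quantitative checks suggested here are straightforward to script. A minimal sketch, assuming hypothetical per-catchment arrays of calibrated and map-derived transmissivity (the names, values, and the factor-of-3 tolerance are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-catchment values: calibrated transmissivity vs. the value
# read off an external (map-based) source, both in m2/day.
calibrated = np.array([12.0, 150.0, 8.0, 300.0, 45.0, 900.0])
map_based  = np.array([20.0, 100.0, 5.0, 450.0, 60.0, 700.0])

# Rank correlation is robust to the strong skew typical of transmissivity.
rho, p = spearmanr(calibrated, map_based)

# Fraction of catchments where the two estimates agree within a factor of 3
# (one possible, arbitrary, definition of "consistent").
within_factor_3 = np.mean(np.abs(np.log10(calibrated / map_based)) < np.log10(3.0))

print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
print(f"Catchments consistent within a factor of 3: {within_factor_3:.0%}")
```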
In summary, I consider that the paper has results that are valuable for the hydrology community, but major changes must be made to highlight the points that are important for such a community, moving away from just presenting the result of a model.
Minor comments:
Line 11-12. That is a very strong argument that will create a lot of controversy. I like process-based models because they can represent a very complicated world with their simplified equations. However, they do not represent the truth, because the complexity of a catchment is many orders of magnitude greater than what a process-based model can represent. Moreover, the fact that machine-learning models can generally produce better results than process-based models given the same data shows us that there is more to improve in such models. Therefore, I don’t think the authors need to enter into this controversy.
Line 27-29. I do not think talking about future work in the abstract is a good idea because authors should summarize their findings, not what they did not do.
Line 50-52. This statement could easily be said of a process-based model too. A lot of data and years of experiments were needed to generate the equations used in the model; therefore, saying that data-based models require more data is not fair.
Line 56. “well-established physical laws”. I am pretty sure that the only physical laws implemented in the model are mass and energy conservation. Moreover, these physical laws are probably not satisfied at the resolution of the model. Any other equations used by the model are just simplifications of the truth.
Line 59. The degree of uncertainty of such parameters is huge when they are used in catchment-scale models (or at 1km2 resolution). This statement does not have support.
Line 125. The idea of “realism” is oversold. It is well known that parameters do not necessarily represent something real in the world, especially if we work at 1km2 resolution.
Line 143. It would be beneficial if more details about the variability were added. The variability in the CAMELS datasets for GB, US, and CL is very different.
Line 181. Does it mean you treated the catchment as lumped? Is the meteorological forcing lumped too?
Line 205. The first word in the title is about the autocalibration, so the reader will think that something novel is presented about that in the paper. However, the authors did not talk about that and sent it to the appendix. This method should be highlighted more.
Line 281. The paper must be general enough for a worldwide audience; therefore, referencing places must be avoided.
Line 282-286. A section cannot be referenced before it is introduced in the text.
Line 295. Where are the PBIAS results presented in the paper?
Line 297. Add reference to the BGS hydrogeological aquifer map.
Figure 4. This information is better represented by a scatter plot.
Line 319-322. This statement is not supported by the results, or at least by the presented figures. Add some reference, or you must clarify that this is just a hypothesis.
Figure 5. A consistent color convention must be used: reds for positive changes and blues for negative changes. Blue cannot be used for positive changes.
Line 335. Why are the CAMELS attributes not used?
Line 344. There are a few catchments with NSE lower than 0.6; therefore, there is not enough information to support this statement.
Line 358. This is good, but a figure is not needed to show that no relationship was found. Send the figure to the appendix.
Line 360. I think this analysis would be more relevant if the best ML and lumped model were included.
Line 378-380. Is this statement checked by changes in the model (sensitivity analysis), or is it just a statement assumed, given the simplifications of the calibration?
Line 317. Why simpler? The calibration method used is equally simple to many of the models presented.
Figure 8. This figure does not have a fair comparison between models because each study has different catchments (probably different training periods). However, only one autocalibrated result is presented. The result of the autocalibrated model for each study must be added.
Line 442. Why were these parameters selected? They are probably the most uncertain of all of them.
Line 443. The main agreement is in the southern areas; the rest of the catchments are not necessarily in agreement. Try to quantify the number of catchments in agreement or disagreement.
Figure 10. Color over color is not the best way to present the information. Try to incorporate a hatch for the catchments.
Figure 12. A comparison only with maps is not enough to show the similarities between the parameters and the external source. A scatter plot would be a better option. Try to incorporate the NSE as color to have another dimension for the differences.
Line 529. The term autocalibrated was used without describing it. What makes the model autocalibrated or just calibrated?
Line 536-537. I disagree. One analysis of performance and characteristics was presented, which was not conclusive. Later, only groundwater parameter analysis is presented without applying a direct comparison with performance.
Line 544-544. Where did you present these results?
Line 546-548. The authors are leaking information about results they did not present in the paper. It is nice that more results may become available, so the authors know the reason, but the discussion and conclusions must be associated only with the results presented in the paper.
Line 566. “measured data”. That is not true. The authors compared with external sources that used observations to create a spatial distribution of the parameters; however, this is far from observed data. The authors can calculate the density of observations used in such maps, and it will probably be less than one per catchment. Therefore, the map is just an interpolation product with high uncertainty that cannot be considered “measured data”.
Line 569. Several remote sensing products could be used to analyze Ae/Pe in the model, but they were not used in the analysis.
Line 578-579. The statement about the sub-daily scale is true; however, GR4J and LSTM models were able to predict better than SHETRAN using lumped data. Therefore, the sub-daily scale will not solve the problems in the architecture that SHETRAN could have.
Line 586. “approximate measured values”, “observed data”. This is not true. See comment in line 566.
Line 587-588. The authors did not test extended future scenarios or climate change; therefore, they cannot be confident about that.
Line 590-593. This is information that is not relevant to this paper. The authors should focus only on the results presented.
Line 605. “real world”. The authors are overselling their results. They only compared the results visually with maps generated from other sources, which is not enough to prove consistency. For example, they did not check for spatial consistency between catchments. How can the “real world” be invoked if the parameters change drastically between adjacent catchments?