On the importance of observation uncertainty when evaluating and comparing models: a hydrological example
Abstract. Model comparison in the geosciences involves either refining a single model or comparing different model structures. Such comparison studies are potentially invalid, however, if the uncertainty estimates of the observations are not considered when evaluating relative model performance. The temporal sampling of the observation and simulation time series is an additional source of uncertainty, as a few observation-simulation pairs, in the form of outliers, can have a disproportionate effect on the model skill score. In this study we highlight the importance of including observation uncertainty and temporal sampling uncertainty when evaluating and comparing hydrological models.
Large-sample hydrology datasets contain collections of catchments with hydro-meteorological time series, catchment boundaries, and catchment attributes, and provide an excellent test bed for model evaluation and comparison studies. Here, two model experiments serving different evaluation purposes are set up using 396 catchments from the CAMELS-GB dataset. The first, intra-model, experiment mimics a model refinement case by evaluating the streamflow estimates of the distributed wflow_sbm hydrological model with and without additional calibration. The second, inter-model, experiment compares the streamflow estimates of the distributed PCR-GLOBWB and wflow_sbm hydrological models.
The temporal sampling uncertainty, which results from outliers in observation-simulation pairs, is found to be substantial throughout the case study area. High temporal sampling uncertainty indicates that the model skill scores used to evaluate model performance are heavily influenced by only a few data points in the time series. This is the case for half of the simulations (210) in the intra-model experiment and for 53 catchment simulations in the inter-model experiment, where the sampling uncertainty is larger than the difference in the KGE-NP model skill score. These cases highlight the importance of reporting temporal sampling uncertainty, and determining its cause, before drawing conclusions on model performance from large-sample hydrology. The streamflow observation uncertainty analysis shows similar results. One third of the catchment simulations (123) in the intra-model experiment show smaller differences between streamflow simulations than the streamflow observation uncertainties, compared to only 4 catchment simulations in the inter-model experiment, owing to the larger differences between the streamflow simulations there. These catchment simulations should be excluded before drawing conclusions based on large samples of catchments. The results of this study demonstrate that benchmark efforts based on large samples of catchments must include streamflow observation uncertainty and temporal sampling uncertainty to obtain robust results.
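The sensitivity of a skill score to a few outlier pairs, as described above, can be probed with leave-one-out (jackknife) resampling. The following is a minimal sketch in Python/NumPy, using the standard KGE formula (Gupta et al., 2009) rather than the non-parametric KGE-NP variant used in the study, and synthetic flows rather than CAMELS-GB data; all names and values here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kge(obs, sim):
    # Standard Kling-Gupta efficiency: correlation, variability ratio,
    # and bias ratio combined into a single score (perfect score = 1).
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def jackknife_scores(obs, sim):
    # Recompute the score with each observation-simulation pair removed;
    # a wide spread means a few pairs dominate the skill score.
    n = len(obs)
    idx = np.arange(n)
    return np.array([kge(obs[idx != i], sim[idx != i]) for i in range(n)])

# Synthetic example: a generally good model that badly misses one flood peak.
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 5.0, 365)            # synthetic daily flows
sim = obs * rng.normal(1.0, 0.1, 365)     # ~10 % multiplicative noise
peak = np.argmax(obs)
sim[peak] = obs[peak] * 3.0               # one badly overestimated peak

full = kge(obs, sim)
scores = jackknife_scores(obs, sim)
lo, hi = scores.min(), scores.max()
# Leaving out the corrupted peak raises the score well above `full`,
# so the jackknife range [lo, hi] exposes the outlier's influence.
```

A large jackknife range relative to the score difference between two models is, in this sketch, the analogue of the paper's criterion that sampling uncertainty exceeds the difference in KGE-NP.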
Status: final response (author comments only)