This work is distributed under the Creative Commons Attribution 4.0 License.
On the need for physical constraints in deep learning rainfall-runoff projections under climate change
Abstract. Deep learning rainfall-runoff models have recently emerged as state-of-the-science tools for hydrologic prediction that outperform conventional, process-based models in a range of applications. However, it remains unclear whether deep learning models can produce physically plausible projections of streamflow under significant amounts of climate change. We investigate this question here, focusing specifically on modeled responses to increases in temperature and potential evapotranspiration (PET). Previous research has shown that temperature-based methods to estimate PET lead to overestimates of water loss in rainfall-runoff models under warming, as compared to energy budget-based PET methods. Consequently, we assess the reliability of streamflow projections under warming by comparing projections with both temperature-based and energy budget-based PET estimates, assuming that reliable streamflow projections should exhibit less water loss when forced with smaller (energy budget-based) projections of future PET. We conduct this assessment using three process-based rainfall-runoff models and three deep learning models, trained and tested across 212 watersheds in the Great Lakes basin. The deep learning models include a regional Long Short-Term Memory network (LSTM), a mass-conserving LSTM (MC-LSTM) that preserves the water balance, and a novel variant of the MC-LSTM that also respects the relationship between PET and water loss (MC-LSTM-PET). We first compare historical streamflow predictions from all models under spatial and temporal validation, and also assess model skill in estimating watershed-scale evapotranspiration. We then force all models with scenarios of warming, historical precipitation, and both temperature-based (Hamon) and energy budget-based (Priestley-Taylor) PET, and compare their projections for changes in average flow, as well as low flows, high flows, and streamflow timing. Finally, we also explore similar projections using a National LSTM fit to a broader set of 531 watersheds across the contiguous United States. The main results of this study are as follows:
1. The three Great Lakes deep learning models significantly outperform all process models in streamflow estimation under spatiotemporal validation, with only small differences among them. The MC-LSTM-PET also matches the best process models and outperforms the MC-LSTM in estimating evapotranspiration under spatiotemporal validation.
2. All process models show a downward shift in average flows under warming, but this shift is significantly larger under temperature-based PET estimates than energy budget-based PET. The MC-LSTM-PET model exhibits similar differences in water loss across the different PET forcings, consistent with the process models. However, the LSTM exhibits unrealistically large water losses under warming as compared to the process models using Priestley-Taylor PET, while the MC-LSTM is relatively insensitive to PET method.
3. All deep learning models exhibit smaller changes in high flows and streamflow timing as compared to the process models, while deep learning projections of low flows are all very consistent and within the range projected by process models.
4. Like the Great Lakes LSTM, the National LSTM also shows unrealistically large water losses under warming. However, when compared to the Great Lakes deep learning models, projections from the National LSTM were more stable when many inputs were changed under warming and better aligned with process model projections for streamflow timing. This suggests that the addition of more, diverse watersheds in training does help improve climate change projections from deep learning models, but this strategy alone may not guarantee reliable projections under unprecedented climate change.
Ultimately, the results of this work suggest that physical considerations regarding model architecture and input variables are necessary to promote the physical realism of deep learning-based hydrologic projections under climate change.
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'When the rooster crows the sun rises', Daniel Klotz, 22 Aug 2023
This study examines the behavior of different models under a hypothetical scenario where 4°C are added to the daily minimum and maximum temperatures. In doing so, the contribution finds that models with more explicit representations of hydrological processes are likely to exhibit more realistic behaviors under this shift.
I find this kind of study very important and very timely (as other discussion papers show; see, e.g., Reichert et al. 2023). On top of that, the execution is done well: the work is, by and large, well motivated; the idea is good; all tables and images are clear; and almost everything is documented. I therefore think that the study should definitely be published in HESS. In terms of critique, I have one point about the literature that I think is crucial, and some small questions/comments. The latter are, however, not so important.
Major Comment
The references are quite thorough with regard to the recent use of deep learning in hydrology. I compliment the authors for that. They do, however, ignore large amounts of work from outside the field. Normally this would not be a concern --- since one feeds into the other --- but here it does skew the motivation somewhat. As of now, the introduction/motivation of the work reads as if current researchers are not aware that one can increase the temperature by some degrees and then test what the model would do under such circumstances. This is, however, not the case. For example, the group I am involved with did not conduct such counterfactual experiments because we knew that deep learning models are not, out of the box, able to cope with arbitrary shifts in the covariance structures of the inputs. Statistical learning hinges on the idea that the future looks similar to the past --- and in a counterfactual setting this property is not given, by design.
I strongly believe that the paper should give a better overview of the current machine learning literature and use that to discuss the merits and limits of the study design. This would give readers a much richer picture of what the proposed evaluation can probe. Specifically, I am thinking that the paper should reference current work on (a) causality and (b) distribution shifts, and then use it to feed into the discussion of the limitations of the current study. The reason why I think of (a) and (b) is that both research branches are fundamental to understanding the study design: (a) Causality is important because the examination is a true counterfactual, in that the adopted input has not --- and will never be --- observed in reality (remember, the daily values of the min and max temperatures change by adding exactly 4°C in all basins, while inputs like radiation, wind, precipitation, and vapor pressure remain entirely the same). (b) The research on distribution shifts is important because adding 4°C to each day is a prime example of a covariate shift. Detecting, handling, "robustifying" and/or adapting to distribution shifts is an active area of research and should be seen as an open problem. Roughly speaking, results from (a) and (b) provide a counterpoint to the current motivation of the research, in that they suggest that data-driven models should, per se, not be able to withstand a counterfactual examination. I think this would help readers to understand that the "physically plausible" response of the catchment model is measured with a "physically implausible" counterfactual signal (which is not observed in any catchment, no matter what, and will force the models into a sort of "extrapolation regime"). I believe that only then will readers understand that this is a very special form of test --- and that it is very impressive that it is possible to design data-driven models that already show promising results in this setting, while having just a few more inductive biases than the current LSTM-based rainfall-runoff models. In this regard, I do not want to force the authors to cite any particular work, but beg them to align their work with these branches of research (even if it means that they need to relativize their a-priori expectations).
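To make the covariate-shift point concrete, here is a minimal sketch of the counterfactual construction the comment describes. All values and variable names are invented for illustration; only the +4°C shift to daily minimum and maximum temperature, with all other inputs held fixed, is taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily forcings for one basin (values purely illustrative).
tmin = rng.normal(2.0, 8.0, size=3650)            # deg C
tmax = tmin + rng.uniform(4.0, 12.0, size=3650)   # deg C
precip = rng.gamma(0.5, 4.0, size=3650)           # mm/day

# The study's counterfactual: add exactly 4 deg C to daily minimum and
# maximum temperature in every basin, leaving precipitation, radiation,
# wind, and vapor pressure untouched.
tmin_cf, tmax_cf = tmin + 4.0, tmax + 4.0

# The marginal temperature distribution, and its joint distribution with
# the unchanged inputs, is pushed outside anything seen in training:
# a textbook covariate shift.
print(f"historical tmax mean:     {tmax.mean():.1f} deg C")
print(f"counterfactual tmax mean: {tmax_cf.mean():.1f} deg C")
```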
Minor Comments
L. 85-86. Please add a reference to this sentence (or an explanation why no reference is given). You make the claim that "many argue" without even giving a single example.
L. 85-86. I think the meaning of "state-of-the-science" should be outlined. As far as I am aware, it is not common terminology in hydrology (I, for one, had to look it up and am still not sure what is meant by it in this context).
L.100-101. I disagree with the claim about the corollary. Maybe it is an implication? I am not sure, however: (a) Given the noise in the data, even without new climate conditions the predictions might be physically implausible. (b) Just because an ML model is "physically plausible" in an out-of-sample setting does not mean that it remains so under a shift setting. What do you think about writing something like "From these results one might think that ..." or "If we spin these results further one could think that ..."?
L.108ff. Is it correct that, from a hydrological perspective, this assumes that there are no glaciers/permanent snow in the basins (which, I think, is not true, e.g., for CAMELS US as used in Liu et al. 2022)? The mechanism would be that as long as there is more melting happening, we should see higher water levels with higher temperatures.
L. 334. Would you be so kind as to make a comparison of the (normal) LSTM performance with normalized streamflow and without it? I once made a similar test, where I trained an LSTM on CAMELS US without setting the standard deviations to one, and got very bad results... However, your performance seems to be comparable to the ones reported in Mai et al. (2022). To me it is not really clear how you did that (especially since you used a relatively small learning rate, and since the linear layer requires much bigger parameters in your setting). Maybe it is because the magnitude and behavior of the GRIP-GL rivers are much less diverse than those in CAMELS US?
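For readers unfamiliar with the normalization at issue, a minimal sketch, assuming "normalized streamflow" means scaling each basin's flows by its own training-period standard deviation (gauge names and values are invented):

```python
import numpy as np

def normalize_per_basin(flows):
    """Divide each basin's streamflow series by its training-period
    standard deviation so that no basin dominates the squared-error loss."""
    stds = {gid: q.std() for gid, q in flows.items()}
    return {gid: q / stds[gid] for gid, q in flows.items()}, stds

# Hypothetical gauges with very different flow magnitudes.
flows = {
    "big_river": np.array([120.0, 310.0, 75.0]),   # m3/s
    "small_creek": np.array([0.4, 1.1, 0.2]),      # m3/s
}
normed, stds = normalize_per_basin(flows)
# Training on raw flows instead lets "big_river" dominate the gradient,
# which is the failure mode the reviewer observed on CAMELS US.
```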
L. 165ff. I know this is a choice of style and I will not mention this for the other occurrences, but I would appreciate it if you could already sketch the outcome of the experiments here (and in the other instances where you hypothesize about properties that one actually already knows at the time of writing).
L. 176. Maybe adjust this sentence a bit. I am pretty sure that Frame et al. (2022) did not make the argument that physical constraints are not needed for generating plausible projections under climate change, and this sentence could easily be misread in that way.
L.268ff & L.350-351. It is probably an oversight on my side, but I cannot find the code for this analysis in the Zenodo repository.
L.344ff. Can you add a description or table of the grid over which you searched the hyperparameters to the supplement?
L.377. I would recommend explicitly writing about $\sigma$ and $\hat{\sigma}$ here so that readers know what you are referring to.
Table 2. I think the MC-LSTM KGE for "Testing Sites: Testing Period" should also be marked in bold since it is also 0.72 (the decimals that follow and are not shown should not be considered for a tie breaker here).
L.497ff. Please describe the actual changes that you made to the static attributes, either here or in the supplementary material. I can see the changes in the data, but that requires readers to reconstruct what you did.
L.497ff. I am probably missing something here, but it is not obvious to me why you changed the snow fraction of the precipitation with temperatures below 0°C. If the model gets an input of -3°C, it should not matter whether this value was the true input or the counterfactually modified one; no?
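A small sketch of a temperature-threshold rain/snow partition makes the point concrete. The 0°C threshold comes from the comment above; the manuscript's actual partitioning scheme may differ.

```python
import numpy as np

def snow_input(precip, tmean, threshold=0.0):
    # All precipitation on days with mean temperature below the threshold
    # is treated as snowfall; otherwise as rain.
    return np.where(tmean < threshold, precip, 0.0)

tmean = np.array([-6.0, -3.0, 1.0])    # deg C, three illustrative days
precip = np.array([5.0, 2.0, 4.0])     # mm/day

snow_hist = snow_input(precip, tmean)          # [5., 2., 0.]
snow_warm = snow_input(precip, tmean + 4.0)    # [5., 0., 0.]
# The -6 deg C day stays below freezing after +4 deg C of warming, so its
# snow input is unchanged; re-deriving the input only matters for days
# that cross the threshold (here, -3 deg C -> +1 deg C).
```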
L.656 consist -> consistent
L. 803ff. Is it really necessary to discuss short-wave radiation for so long here? You also did not consider that the thermal and dynamic behavior of the atmosphere, and hence, for example, the precipitation patterns, would change over the whole region. I think you could abbreviate this paragraph considerably by just stating that the input modification is pragmatic and intuitive, but does not reflect how the meteorological behavior would actually play out under climate change. This would then also connect to my proposed literature references, if you decide to include them.
References
- Reichert, P., Ma, K., Höge, M., Fenicia, F., Baity-Jesi, M., Feng, D., and Shen, C.: Metamorphic Testing of Machine Learning and Conceptual Hydrologic Models, Hydrol. Earth Syst. Sci. Discuss. [preprint], https://doi.org/10.5194/hess-2023-168, in review, 2023.
- Mai, J., Shen, H., Tolson, B. A., Gaborit, É., Arsenault, R., Craig, J. R., Fortin, V., Fry, L. M., Gauch, M., Klotz, D., Kratzert, F., O'Brien, N., Princz, D. G., Rasiya Koya, S., Roy, T., Seglenieks, F., Shrestha, N. K., Temgoua, A. G. T., Vionnet, V., and Waddell, J. W.: The Great Lakes Runoff Intercomparison Project Phase 4: the Great Lakes (GRIP-GL), Hydrol. Earth Syst. Sci., 26, 3537–3572, https://doi.org/10.5194/hess-26-3537-2022, 2022.
Citation: https://doi.org/10.5194/egusphere-2023-1744-RC1
RC4: 'Reply on RC1', Daniel Klotz, 03 Oct 2023
Upon reflection, I would like to add that I think it would be highly beneficial if you could add some representative hydrographs to an appendix. This is, for one, because I am interested to see some, given my personal experience with mass-conserving models; but secondly, I also genuinely believe that it would help readers to put the performance and interventions into perspective.
Citation: https://doi.org/10.5194/egusphere-2023-1744-RC4
AC4: 'Reply on RC4', Sungwook Wi, 30 Oct 2023
We now reference individual hydrographs for specific sites (at both daily and monthly timescales) in our revised supplemental information. For details, please refer to the pdf file attached to your first comments.
Thank you.
Citation: https://doi.org/10.5194/egusphere-2023-1744-AC4
- AC1: 'Reply on RC1', Sungwook Wi, 29 Oct 2023
RC2: 'Comment on egusphere-2023-1744', Shijie Jiang, 12 Sep 2023
General comments:
The study provides a comparison of various deep learning models with process-based models across a large number of catchments, offering insights into their strengths and weaknesses for prediction under "climate change" conditions (that is, a temperature mean shift). The general conclusion of the study is that careful consideration of model architecture and large-sample learning is important to ensure the physical plausibility of projections under different scenarios. I believe that the content and findings of the research are valuable and may be of interest to the HESS readership. However, there are key concerns that I believe detract from the overall quality of the manuscript. The primary issues are the heavy emphasis on the role of PET while omitting snowmelt, and the potential overstatement in labeling the scenarios as "climate change".
1) The study appears to rely too heavily on PET as the primary determinant in understanding the impacts of climate change on streamflow. While PET is undoubtedly critical, it is only one of many factors influencing hydrologic responses; snowmelt in particular may be important to streamflow generation in the Great Lakes region (https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2016GL068070). For example, lines 513-521 suggest that the assumption is primarily concerned with the model's ability to discriminate between differences in water loss based on different PET projections under similar warming conditions. However, in regions where snowmelt may play a critical role in determining streamflow, temperature sensitivity could have dual implications - one for PET and another for snowmelt dynamics. Ignoring the latter could bias the results. In particular, I think it could explain the results in Figure 7g and h that the authors did not explicitly explain (lines 660-664): for process-based models that rely on physical processes, early snowmelt can significantly shift the seasonal pattern of streamflow as temperature increases. The machine learning models, however, which mainly make predictions based on learned seasonal correlations, did not present such a significant shift, because the seasonality of T and PET does not change. Therefore, if some models, especially the process-based ones, inherently account for snowmelt while others do not, the comparison may not be apples to apples. That is to say, the observed differences between process-based and machine learning models could be due in part to the fact that some models capture snowmelt dynamics while others do not. Extrapolating the study's findings to broader climate change impacts may therefore be premature, especially if the full range of factors is not accounted for, which relates to my second concern.
2) While the study examines the sensitivity of hydrological models to temperature changes, it may be misleading to equate this solely with climate change. Climate change is multifaceted and includes more than just temperature changes. Although the authors attempted to indicate this as a limitation in constructing climate change scenarios, the use of the term "climate change" in the title, abstract, and elsewhere could inadvertently downplay the myriad ways in which climate change affects hydrologic systems. For example, factors such as land surface changes due to elevated CO2 have been shown to play a more dominant role in changing runoff (https://www.nature.com/articles/s41558-023-01659-8). Therefore, I think it may be more accurate to frame the study as a sensitivity analysis of hydrologic models to temperature and related PET shifts, rather than an examination of so-called climate change scenarios.
I would therefore suggest a major revision to explicitly state the assumptions made about snowmelt in each model, and to include snowmelt dynamics in the discussion of runoff differences. In addition, if the label "climate change" is to be retained, the study should consider a broader range of factors that might be influenced by climate change, not just uniformly increasing temperature.
Specific comments follow:
Abstract: I cannot find a word limit for abstracts in HESS, but as the submission guidelines say, "An abstract should be short, clear, concise...". The abstract in its current form is too long.
L105-124: I understand that this study may build on the previous WS22 study. However, the depth of detail should be balanced, as many of the methods, challenges, and conclusions of the earlier work are now repeated in the new paper. The introductory section should focus primarily on setting up the current study.
L141: Please specify what you mean by "this work". If it is what is shown in the current manuscript, it is strange to discuss the results in the introduction. If it is still from WS22, the opinions seem reiterated again.
L155: The assertion that temperature-based PET methods "significantly overestimate future projections of PET" compared to energy budget-based methods is a strong one. It might be beneficial to provide more evidence from the literature.
L170-172: As noted in my previous comments, the hypotheses may suggest that PET is the sole or overwhelming cause of declining streamflow.
L180-182: It would be beneficial to show how the correlation between temperature and PET shifts with different estimation methods.
Fig. 2: I am confused by the presentation of the timing, as it seems to suggest that the cumulative flux under warming stops increasing after a certain day of the water year.
L408-409: Is there any post-examination of the trash cell (whether it is constrained by PET or some other value) to support this assumption?
L412: I think it depends on how many catchments in your study are natural, without human intervention.
L494-495: But this adjustment is not reflected in Figure 2. Why does the correlation bias have to take effect here, while the possible correlation between temperature and other input variables is omitted in the previous experiments?
L499: Please justify the selection of static input features that need to be changed. Was this done by examining the dependence between mean temperature and the other static input features?
L510: Does the "year" here indicate the calendar year or the water year? It seems to be the water year, based on Figure 2.
L617-618: References are needed here to support the argument.
L651-653: The explanation is not convincing to me, since the process-based models show a more obvious change compared to the deep learning models, while no explanation of the difference between the two types of models is provided here.
L663-664: As I mentioned in the general comments, this may be due to the role of snow dynamics being treated differently between the process-based and deep learning models. Unfortunately, analysis of snowmelt was not included in the study.
L669-671: This finding is interesting: if we really want to use DL models for climate change impact analysis, considering changes in future surface climate and surface context variables is necessary, but the DL models do not seem to learn physically plausible relationships here when doing cross-region learning. It is also nice to see that a physically informed strategy can help mitigate the problem.
L672-673: Perhaps the additional implementation of test sites in the test period could be mentioned earlier, around L609.
L803-805: It is also important to note that the constructed climate change scenarios break Clausius-Clapeyron scaling, so I would suggest not calling them "climate change".
Citation: https://doi.org/10.5194/egusphere-2023-1744-RC2
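To make the snowmelt mechanism raised in general comment 1) concrete, here is a toy degree-day sketch. All parameters and forcings are invented for illustration; none of the study's process models is reproduced.

```python
import numpy as np

def degree_day_melt(tmean, precip, ddf=3.0, t_thresh=0.0):
    """Toy snow routine: precipitation on days below t_thresh accumulates
    as snow; melt is ddf (mm per deg C per day) times degrees above it."""
    swe = 0.0
    melt = np.zeros_like(tmean)
    for i, (t, p) in enumerate(zip(tmean, precip)):
        if t < t_thresh:
            swe += p
        melt[i] = min(swe, ddf * max(t - t_thresh, 0.0))
        swe -= melt[i]
    return melt

days = np.arange(365)
tmean = 5.0 + 15.0 * np.sin(2 * np.pi * (days - 80) / 365)  # idealized annual cycle
precip = np.full(365, 3.0)                                  # mm/day

melt_hist = degree_day_melt(tmean, precip)
melt_warm = degree_day_melt(tmean + 4.0, precip)

# Under +4 deg C, less snow accumulates and the melt pulse arrives earlier,
# shifting simulated streamflow seasonality in any model with an explicit
# snow store: a temperature response entirely separate from PET.
print(int(days[melt_hist.argmax()]), int(days[melt_warm.argmax()]))
```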
- AC2: 'Reply on RC2', Sungwook Wi, 29 Oct 2023
RC3: 'Comment on egusphere-2023-1744', Larisa Tarasova, 13 Sep 2023
Review of the manuscript "On the need for physical constraints in deep learning rainfall-runoff projections under climate change"
This manuscript compares the simulations of unconstrained and physically constrained deep learning models with the simulations of three conceptual hydrological models in the Great Lakes region under climate change scenarios, with the goal of investigating the versatility of the former under changing climatic conditions.
I find the premise of the experiment very interesting, with the potential to provide useful insights into the fitness of state-of-the-art models under changing climatic conditions. However, I find the experimental setup somewhat inconsistent: 1) the national LSTM model does not use the same input (i.e., not the same PET data) as the variants of the Great Lakes models, making it hard to disentangle the true reason behind its different behavior; 2) the implementation of the PET constraint in the LSTM model is rather crude and is based on the assumption that evapotranspiration and streamflow are the only ways water can leave the system, which might not hold universally. Moreover, there are some subjective and ambiguous choices (e.g., the choice of performance metrics, PET methods, and conceptual hydrological models as the baseline) in the experimental setup that need to be clarified. Finally, although I find it interesting to compare the behavior of the deep learning and conceptual hydrological models for future simulations, I find the results of the study rather inconclusive, because the differences between the simulations of the three conceptual hydrological models (e.g., Figure 8) seem to be very large (sometimes even larger than the differences with the deep learning models), making one question the reliability of these conceptual models as the baseline. Please find my detailed comments below.
General comments
Inconsistent setup of the national LSTM: The national LSTM model was driven by temperature, radiation, and vapor pressure (lines 270-271), but not by either the temperature-based or energy-based PET used for the Great Lakes LSTMs. When comparing its simulations with the simulations of the other LSTM variants, it is not possible to disentangle the origin of the observed differences, making one of the conclusions of the manuscript (that a more diverse set of catchments might to some extent support learning physically based processes) rather questionable, because the differences might just as well be due to the difference in the forcing data.
Choice of PET methods: I very much like the idea of comparing temperature-based vs. energy-based methods for PET estimation. However, no rationale is provided for the choice of the particular methods (Hamon and Priestley-Taylor, line 453). In my experience, the differences between temperature-based approaches can be huge. Likely, the same is true for the energy-based methods. I think using several methods of each type of PET estimation would strengthen the argument of the manuscript.
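For orientation, the two chosen methods differ structurally roughly as follows. The sketch uses one common published form of each equation; the manuscript's exact variants and coefficients may differ.

```python
import numpy as np

def esat_kpa(t):
    """Saturation vapor pressure (kPa) at air temperature t (deg C)."""
    return 0.611 * np.exp(17.27 * t / (t + 237.3))

def pet_hamon(t, daylen_hr):
    """Hamon (temperature-based) PET in mm/day, one common coefficient form."""
    return 29.8 * daylen_hr * esat_kpa(t) / (t + 273.2)

def pet_priestley_taylor(t, rn, alpha=1.26, gamma=0.066, lam=2.45):
    """Priestley-Taylor (energy budget-based) PET in mm/day.

    rn: net radiation in MJ m-2 day-1; soil heat flux neglected."""
    delta = 4098.0 * esat_kpa(t) / (t + 237.3) ** 2  # kPa per deg C
    return alpha * delta / (delta + gamma) * rn / lam

t, rn, daylen = 20.0, 12.0, 14.0
# Warming by 4 deg C inflates Hamon PET directly through e_s(T), while
# Priestley-Taylor responds only through the Delta/(Delta+gamma) weighting
# if net radiation is held fixed; this asymmetry is what the study exploits.
print(pet_hamon(t + 4, daylen) / pet_hamon(t, daylen))                 # ~1.26x
print(pet_priestley_taylor(t + 4, rn) / pet_priestley_taylor(t, rn))   # ~1.06x
```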
PET constraint for LSTM instead of hybrid models: The motivation for enforcing a PET constraint on the LSTM by means of a trash cell, which assumes exclusively evaporative water losses (which might not always be the case; Jasechko et al., 2021, https://doi.org/10.1038/s41586-021-03311-x), is not clear to me. Using hybrid models, which seem to be exactly the tool that combines the strengths of deep learning models with concepts of hydrological processes, and which are therefore potentially better suited for producing reliable future simulations under changing conditions, seems like a more straightforward choice to me. One sentence in the discussion merely mentioning the existence of such models is definitely not enough in my opinion, because their existence and comparable performance with deep learning models question the need to develop any constrained variants of the LSTM as done in this study.
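For readers unfamiliar with the trash-cell idea, here is a minimal numpy sketch of a mass-conserving cell in the spirit of MC-LSTM (Hoedt et al., 2021), with one assumed form of a PET cap on the trash cell's outflow. This illustrates the concept only; it is not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mc_cell_step(c, x, aux, Wi, Wr, Wo, pet=None, trash=0):
    """One step of a mass-conserving recurrent cell.

    The mass input x (e.g., precipitation) is split over cells by a simplex
    gate, storage is shuffled by a row-stochastic redistribution matrix, and
    mass can only leave through the output gate, so storage plus outflow
    always equals previous storage plus input."""
    n = c.size
    i = softmax(Wi @ aux)                            # input gate on the simplex
    R = softmax((Wr @ aux).reshape(n, n), axis=1)    # each row sums to 1
    c_tilde = R.T @ c + i * x
    o = 1.0 / (1.0 + np.exp(-(Wo @ aux)))            # fraction leaving each cell
    if pet is not None:
        # Assumed form of a PET constraint: the designated "trash" cell,
        # which absorbs non-streamflow losses, cannot release more water
        # than the day's PET.
        o[trash] = min(o[trash], pet / max(c_tilde[trash], 1e-8))
    h = o * c_tilde                                  # outflow (loss + streamflow)
    return (1.0 - o) * c_tilde, h                    # retained storage, outflow

rng = np.random.default_rng(1)
n_cells, n_aux = 4, 3
Wi = rng.normal(size=(n_cells, n_aux))
Wr = rng.normal(size=(n_cells * n_cells, n_aux))
Wo = rng.normal(size=(n_cells, n_aux))

c, h = mc_cell_step(np.zeros(n_cells), x=5.0, aux=rng.normal(size=n_aux),
                    Wi=Wi, Wr=Wr, Wo=Wo, pet=3.0)
assert abs(c.sum() + h.sum() - 5.0) < 1e-9           # the water balance closes
```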
Choice of conceptual hydrological models, their parametric uncertainties, and the discrepancy in their simulations: The choice of the three conceptual hydrological models used as a benchmark is not clear. What is the rationale for selecting exactly these three models? What are the major structural differences between them? Are there any studies indicating the fitness of these models for future simulations/ET simulations? These questions have to be addressed to justify the choice of the baseline. Moreover, although the authors have accounted for the training uncertainty of the deep learning models by running a 10-member ensemble, the parametric uncertainty of the conceptual models (which can be very substantial) is completely ignored by using only the single best simulation for each model, instead of using, e.g., the X% best-performing models (the so-called behavioral parameter sets; Beven and Freer, 2001, https://doi.org/10.1016/S0022-1694(01)00421-8). Accounting for the parametric uncertainty of the conceptual models might shed light on the large discrepancies between their simulations (e.g., Figure 8), which are sometimes even larger than the differences with the deep learning models.
Choice of model performance metrics: The choice of the performance metrics is also not very clear to me. I can imagine that inadequate partitioning of evaporative fluxes might especially affect the mean and the low flows, but what is the rationale for examining high flows and the seasonality of the flows? This needs to be clarified. There is also a discrepancy in how low and high flows are defined (the 98th percentile and the 30th percentile) that also needs clarification.
Detailed comments
Line 28 and elsewhere: I would not really call the hydrological models used in this study process-based. These are conceptual hydrological models that require extensive calibration and can be very physically unrealistic as well.
Line 32 and elsewhere: The term water loss is rather unclear. Please clarify and use a consistent term for that throughout the manuscript.
Line 42 and elsewhere: Is actual evapotranspiration meant here? Please clarify.
Line 45-47: At this point the application of national of LSTM in addition to the regional LSTM sounds rather inconsistent. Please clarify here the objective for that.
Line 48 and elsewhere: If the statistical test was not performed, omit term “significantly”. Use term “considerably” instead.
Line 52 and elsewhere: Average is a very ambiguous term. Is it mean or median? Is it daily or annual flows? Please use a clearer term throughout the manuscript.
Line 58: Smaller than what? It would be more helpful to include more quantitative results in the Abstract (e.g., Lines 63-64).
Introduction: I feel that the introduction is very one-sided, focusing purely on deep learning models and not paying enough attention to the problems that conceptual hydrological models have when simulating the future (e.g., Merz et al., 2011, https://doi.org/10.1029/2010WR009505; Wallner and Haberlandt, 2015, https://doi.org/10.1002/hyp.10430). It also completely omits the field of hybrid models (Jiang et al., 2020, https://doi.org/10.1029/2020GL088229; Höge et al., 2022, https://doi.org/10.5194/hess-26-5085-2022), which in my opinion might be more fit for future predictions than the deep learning models or even their constrained variants. Moreover, the introduction is very much focused on the authors' previous study, but fails to clearly distinguish the difference between the two.
Line 121: Does ET mean actual evapotranspiration here? This is not clear and I think this acronym is not used later anymore. Please revise.
Line 147-150: The energy-based methods (although indisputably more realistic) are also based on empirical relationships, are they not?
Line 172: Evaporative water loss instead? This term is unclear.
Line 204: I do not think that the reference to CAMELS-GB is appropriate here. It was not created with the sole purpose to benchmark deep learning models, nor does it actually benchmark them. Please revise.
Line 243: The acronym AET is never used later in the manuscript. Consider omitting it.
Line 243-245: I think it is worth noting here that GLEAM can also be associated with considerable uncertainties. Therefore, validation using this product might be questionable as well.
Line 253: It is not clear what is meant by hydrological losses here and if this term is different from the term “water losses” used earlier. Please clarify.
Figure 2: A much more comprehensive caption describing every step and every acronym presented in the Figure is needed.
Line 337: For the purity of the test, I suggest that all models (conceptual and deep learning models) should be trained on the same objective function.
Line 351: It would be helpful to mention around here how many of these catchments overlap with the Great Lakes catchment sample. Even better would be to indicate them in Figure 1.
Line 412: This statement requires a reference
Line 435: The rationale for using both KGE and NSE as performance metrics is unclear to me
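For reference, and since RC1's comment on $\sigma$ and $\hat{\sigma}$ (L.377) touches the same quantities, the standard forms of the two metrics are given below; the Gupta et al. (2009) form of KGE is assumed, and the manuscript may use a variant.

$$\mathrm{NSE} = 1 - \frac{\sum_t (\hat{q}_t - q_t)^2}{\sum_t (q_t - \bar{q})^2}, \qquad \mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + \left(\frac{\hat{\sigma}}{\sigma} - 1\right)^2 + \left(\frac{\hat{\mu}}{\mu} - 1\right)^2},$$

where $r$ is the correlation between simulated and observed flows, and $\hat{\sigma}/\sigma$ and $\hat{\mu}/\mu$ are the simulated-to-observed ratios of flow standard deviation and mean. NSE collapses skill into a single squared-error measure, while KGE separates correlation, variability, and bias, which is one common rationale for reporting both.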
Line 441: GLEAM estimates are not observations and can be associated with large uncertainties too.
Line 449: It is not clear how the fraction of snowfall was adjusted. Please clarify. Moreover, please use full terms and not acronyms that were not previously introduced.
Line 497-501: A table with the overview of all scenarios and setups would be helpful.
Line 507-511: It is not clear what is meant by “average” here. Please clarify. Consider avoiding using so many acronyms, the manuscript is oversaturated with them, making it hard to understand.
Table 2: The timing error metric introduced earlier is missing here.
Line 617-618: This statement requires a reference, and it would be helpful if it were presented in a more quantitative way.
Figure 5 and 6: Provide the names and the locations of the selected watersheds. It would also be helpful to indicate them on Figure 1 to show their geographical location.
Line 657: This is not really the timing of streamflow per se; it is rather the seasonality of the flow. Please clarify that.
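If timing is quantified as the half-flow date of the annual hydrograph (an assumption, since the manuscript's exact definition is not reproduced here), a sketch looks like this:

```python
import numpy as np

def half_flow_day(q):
    """Day of the (water) year by which 50% of the annual flow volume
    has passed: a common single-number timing/seasonality index."""
    cum = np.cumsum(q)
    return int(np.searchsorted(cum, 0.5 * cum[-1])) + 1

# Illustrative: a spring-peaked hydrograph vs. an earlier-peaked one.
days = np.arange(365)
q_late = np.exp(-0.5 * ((days - 200) / 30.0) ** 2)
q_early = np.exp(-0.5 * ((days - 170) / 30.0) ** 2)
print(half_flow_day(q_late), half_flow_day(q_early))  # later vs. earlier timing
```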
Line 681-690: This part is rather confusing and difficult to read. Please revise.
Line 696-698: This is rather vague. Please provide a more quantitative assessment. Moreover, nothing is mentioned about huge differences between the simulations of the conceptual models and how this affects the reliability of the baseline chosen in this experiment.
Figure 8: Please explain all the acronyms in the caption.
Line 720-721: This part is rather confusing; please revise.
Editorial comments
Line 27: state-of-the-art
Line 30: under exacerbating climate change
Line 32: overestimation
Line 170: similarly large
Line 334: by drainage area
Line 656: consistent
Line 657: changes in high flows
Line 828: considerable errors?
Citation: https://doi.org/10.5194/egusphere-2023-1744-RC3
- AC3: 'Reply on RC3', Sungwook Wi, 29 Oct 2023
Sungwook Wi and Scott Steinschneider