Better continental-scale streamflow predictions for Australia: LSTM as a land surface model post-processor and standalone hydrological model

Shokri, Ashkan; Bennett, James C.; Robertson, David E.; Perraud, Jean-Michel; Frost, Andrew J.; Lehmann, Eric A.

doi:10.5194/egusphere-2025-805

Preprints

https://doi.org/10.5194/egusphere-2025-805

07 Apr 2025

Abstract. Accurate large-scale hydrological predictions are essential for water resource planning. However, many land surface models encounter difficulties in capturing streamflow timing and magnitudes, particularly in large catchments and when calibrated across broad regions and multiple hydrological variables. In this study, two Long Short-Term Memory (LSTM)-based approaches are assessed to enhance streamflow predictions across Australia: (i) LSTM-QC, in which an LSTM post-processes runoff outputs from the Australian Water Resources Assessment–Landscape model (AWRA-L), and (ii) LSTM-C, a standalone rainfall–runoff LSTM that relies solely on precipitation and potential evapotranspiration as inputs. These approaches are tested in 218 minimally impacted catchments from the CAMELS-AUS dataset under three cross-validation strategies—temporally out-of-sample, spatially out-of-sample, and spatiotemporal out-of-sample—to evaluate their robustness for historical reconstructions, predictions in ungauged basins, and climate-projection scenarios. The results indicate that both LSTM-QC and LSTM-C consistently outperform AWRA-L runoff across nearly all catchments and exceed the predictive skill of a widely used conceptual model (GR4J) in most basins. Under a temporally out-of-sample framework, LSTM-QC demonstrates a performance advantage over LSTM-C by leveraging information embedded in AWRA-L, particularly when fine-tuned to local catchment observed data. This advantage is primarily attributed to the LSTM’s ability to correct systematic biases in AWRA-L and enhance channel-routing signals. However, under spatial and spatiotemporal cross-validation LSTM-C performs comparably well, suggesting that a purely data-driven approach can generalize effectively to ungauged or future conditions without reliance on AWRA-L.
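As a rough illustration of the three cross-validation strategies named in the abstract, the sketch below partitions a hypothetical set of (catchment, year) samples. The catchment IDs, year range, split year, and the helper name `split_samples` are assumptions made for illustration; they are not taken from the study's experimental design.

```python
from itertools import product

def split_samples(catchments, years, test_catchments, split_year, strategy):
    """Return (train, test) lists of (catchment, year) pairs.

    strategy: 'TooS'  -> same catchments, test on later years
              'SooS'  -> same years, test on held-out catchments
              'TSooS' -> test on held-out catchments in later years only
    """
    held_out = set(test_catchments)
    train, test = [], []
    for c, y in product(catchments, years):
        late = y >= split_year
        unseen = c in held_out
        if strategy == "TooS":
            (test if late else train).append((c, y))
        elif strategy == "SooS":
            (test if unseen else train).append((c, y))
        elif strategy == "TSooS":
            if unseen and late:
                test.append((c, y))
            elif not unseen and not late:
                train.append((c, y))
            # Samples sharing a catchment or a period with the test set are dropped.
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return train, test

# Hypothetical example: four catchments, four years, one held-out catchment
catchments = ["A", "B", "C", "D"]
years = range(2000, 2004)
for name in ("TooS", "SooS", "TSooS"):
    train, test = split_samples(catchments, years, ["D"], 2002, name)
    print(name, len(train), "training samples,", len(test), "test samples")
```

In this sketch the spatiotemporal case discards samples that overlap the test set in either space or time, so the test data are unseen in both dimensions.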

Received: 20 Feb 2025 – Discussion started: 07 Apr 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: closed

CC1:
'Comment on egusphere-2025-805', Ather Abbas, 14 Apr 2025

Dear Authors,
Is it possible to provide the list of selected 218 catchments from CAMELS_AUS dataset and the median values of performance metrics for all the models?
Thanks

Citation: https://doi.org/10.5194/egusphere-2025-805-CC1
- AC3: 'Reply on CC1', Ashkan Shokri, 07 Oct 2025
  
  We thank the commenter for their interest. We have provided a ZIP file containing the list of the 218 selected CAMELS-AUS catchments and the NSE results for all models (LSTM-C, LSTM-QC, GR4J, and AWRA-L) under the three cross-validation experiments (TooS, SooS, and TSooS).
  
  Citation: https://doi.org/10.5194/egusphere-2025-805-AC3
RC1:
'Comment on egusphere-2025-805', Anonymous Referee #1, 03 Jun 2025
In this paper, the authors evaluate the performance of various LSTM models for hydrological simulations across Australian catchments: a land surface model-LSTM hybrid based on climate data as well as runoff from the AWRA-L land surface model, and an LSTM model based on climate data. They show that both models outperform runoff simulation from AWRA-L as well as from the conceptual hydrological model GR4J across most catchments. They investigate the impacts of methodological decisions, namely the cross-validation strategies, on the results. They additionally discuss the relevance of the proposed approaches for three real-world applications: long-term historical simulations, predictions in ungauged basins, and climate projections. This application-focused framing is a welcomed perspective in a scientific paper. Overall, this is a well-executed, well-written study that addresses important research questions and contributes valuable insights to the hydrological modelling community. Below are some comments that will hopefully help review your paper for publication.
Main comments:
Clarification of terminology: The term “static features” or “static predictors” is somewhat misleading, especially since some of these predictors are recalculated for different time windows to demonstrate the impact on model performance. Please consider using a different term for the recalculated features (e.g., quasi-static) to remove any confusion.

Model design clarity: It is unclear whether the static predictors are used for both LSTM-C and LSTM-QC models presented in section 2.5.1 by default, or only for some versions of these models. You could clarify this by merging sections 2.5.1 and 2.5.2 into a single model design section, summarizing model inputs more clearly (e.g., table with two columns to separate predictors into dynamic and static for each model). If static predictors are not always used and I misunderstood this, consider renaming the various versions of each model to differentiate them clearly.

Catchment characteristics: It would be useful to know what the diversity of catchments is, for example using characteristics found in CAMELS-AUS. This would help contextualize model transferability results (especially with SooS and TSooS) for an audience that is not familiar with Australian geography/hydrology.

Uncertainty quantification: While comprehensive uncertainty analysis may be beyond the paper's scope, some quantification would enhance the robustness of the results, especially for hydroclimate projections or more specific applications such as regional-scale climate change vulnerability assessments (mentioned on L485-491). You could give an appreciation of the uncertainty for example by showing the spread in results from the cross-validation experiments (as mentioned on L267-268) in the Appendix. Additionally, you could apply bootstrapping when evaluating the model performance.
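The bootstrapping suggestion could be sketched as follows: resample catchments with replacement and recompute the median NSE to obtain a percentile interval. The NSE values, the number of resamples, and the function name `bootstrap_median_ci` are hypothetical; this is one common way to bootstrap a summary statistic, not a procedure described in the manuscript.

```python
import numpy as np

def bootstrap_median_ci(nse_values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median NSE across catchments."""
    rng = np.random.default_rng(seed)
    nse = np.asarray(nse_values, dtype=float)
    # Resample catchments with replacement: one row per bootstrap replicate
    idx = rng.integers(0, nse.size, size=(n_boot, nse.size))
    medians = np.median(nse[idx], axis=1)
    lower, upper = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(np.median(nse)), float(lower), float(upper)

# Hypothetical per-catchment NSE values, for illustration only
nse_example = [0.62, 0.71, 0.55, 0.80, 0.47, 0.66, 0.74, 0.59]
median, lower, upper = bootstrap_median_ci(nse_example)
print(f"median NSE = {median:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Resampling whole catchments, rather than individual time steps, keeps each catchment's score intact and reflects uncertainty arising from the choice of catchments.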

Model evaluation methodology: Please consider adding a dedicated subsection on model evaluation in the methods. This could include: i) an explanation of the NSE and what it measures (i.e., a measure of overall performance rather than an evaluation of extremes), and ii) the criterion for “best performing model” selection - was an ≥0.01 NSE difference sufficient or was there a more stringent measure (e.g., with a larger buffer)? My fear is that the results might be a bit noisy, and that using a more stringent measure or adding a statistical test to assess differences would be beneficial.
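For reference, the Nash–Sutcliffe efficiency (NSE) is one minus the ratio of the sum of squared simulation errors to the sum of squared deviations of the observations from their mean; a value of 1 is a perfect fit and 0 means the simulation is no better than the observed mean. The sketch below computes it and applies a margin when naming a best model. The helper `best_model`, the example scores, and the 0.01 default margin (which mirrors the question above) are assumptions, not a confirmed criterion of the study.

```python
import numpy as np

def nse(observed, simulated):
    """Nash–Sutcliffe efficiency of a simulated series against observations."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    valid = ~(np.isnan(obs) | np.isnan(sim))  # skip time steps with missing data
    obs, sim = obs[valid], sim[valid]
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def best_model(scores, margin=0.01):
    """Name the top-scoring model only if it leads the runner-up by at least `margin`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top), (_, second) = ranked[0], ranked[1]
    return top_name if top - second >= margin else "tie"

# Hypothetical observed and simulated flows
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.0, 2.0, 3.0, 5.0]
print(round(nse(obs, sim), 3))

# Hypothetical per-model NSE scores for one catchment
print(best_model({"LSTM-QC": 0.75, "LSTM-C": 0.70, "GR4J": 0.60}))
```

A larger `margin` makes the "best model" label more conservative, which is one simple way to reduce the influence of noise on model rankings.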

Terminology around climate projections: The term “climate projection capabilities” is somewhat misleading, as actual climate projections were not used here. Please consider reframing as “proxy for climate projection capabilities”.

Specific comments:
L96-98: The “worthwhile” type of modelling system likely depends on the use. For example, for climate change scenarios the hybrid method might be favoured. Please clarify this nuance in the paper.

L100-102: How can hybrid models help assess the dominant deficiencies? This warrants one to two more sentences in the paper.

L125: Is there any specific study of the application of GR4J in Australia that you could cite here?

L225: Please explicitly mention what the AWRA-L output is here to remind the reader.

L280-283: The differences seen using an increasing time window could also be impacted by the catchment size, with larger differences between the two performance measures expected in larger catchments. It would be interesting to compare the length of the sequence with the known response time of each catchment.

L312: “higher exceedance probabilities” might be misinterpreted as referring to flow exceedance. Please clarify that this refers to the distribution of values when introducing the first plot of this kind.

L312-314: It would be interesting to speculate why some catchments don’t benefit from finetuning. Are there commonalities in catchment type, data quality, hydro-meteorological processes, etc?

L388-389: Please consider spelling out what you mean by “possibly other hydrological processes” in the paper. For example, groundwater storage and processes, as well as lakes and reservoirs could be mentioned here. A side question to this: are the effects of lakes and reservoirs accounted for by the AWRA model?

L492: One advantage of land surface or conceptual hydrological models compared to LSTM models is that they can output various other hydrological variables in addition to streamflow. Please consider adding this to the list.

Typos:
L62: Some kind of sentence separation is needed between “streamflow” and “for instance”.

L114: Missing “in” or similar between “performance” and “218 catchments”.

L133: In the introduction it says that outputs are available from 1910.

L138: Missing closing parenthesis.

L152: Missing “for catchments” or similar between “)” and “that have been”.

L154: Rephrase “covering from”.

L157: “is produce” is missing a “d”.

L263: Missing “the” between “from” and “calibration”.

L469: Small “b” for “behavior”.

L472: “strongly performing”?
Citation: https://doi.org/10.5194/egusphere-2025-805-RC1
- AC1: 'Reply on RC1', Ashkan Shokri, 16 Jul 2025
  
  We thank the referee for their valuable comments. We have addressed each point in detail below and will incorporate the following changes:
  Summary
  Comment: In this paper, the authors evaluate the performance of various LSTM models for hydrological simulations across Australian catchments: a land surface model-LSTM hybrid based on climate data as well as runoff from the AWRA-L land surface model, and an LSTM model based on climate data. They show that both models outperform runoff simulation from AWRA-L as well as from the conceptual hydrological model GR4J across most catchments. They investigate the impacts of methodological decisions, namely the cross-validation strategies, on the results. They additionally discuss the relevance of the proposed approaches for three real-world applications: long-term historical simulations, predictions in ungauged basins, and climate projections. This application-focused framing is a welcomed perspective in a scientific paper. Overall, this is a well-executed, well-written study that addresses important research questions and contributes valuable insights to the hydrological modelling community. Below are some comments that will hopefully help review your paper for publication.
  Response: We sincerely appreciate your thorough and constructive review of our manuscript. Your positive assessment and insightful comments will help us improve both the clarity and depth of the study. In response, we will revise the manuscript to better articulate model design choices, clarify terminology, provide more context on catchment diversity, and address uncertainty and model evaluation procedures. Below, we address your comments point by point.
  Main Comments
  Comment: The term “static features” or “static predictors” is somewhat misleading, especially since some of these predictors are recalculated for different time windows to demonstrate the impact on model performance. Please consider using a different term for the recalculated features (e.g., quasi-static) to remove any confusion.
  Response: We agree with this observation. In the revised manuscript, we will adopt the term "quasi-static predictors".
  Comment: Model design clarity: It is unclear whether the static predictors are used for both LSTM-C and LSTM-QC models presented in section 2.5.1 by default, or only for some versions of these models. You could clarify this by merging sections 2.5.1 and 2.5.2 into a single model design section, summarizing model inputs more clearly (e.g., a table with two columns to separate predictors into dynamic and static for each model). If static predictors are not always used and I misunderstood this, consider renaming the various versions of each model to differentiate them clearly.
  Response: Thank you for pointing this out. To clarify, all "quasi-static" predictors are used in both LSTM-C and LSTM-QC models. We will explicitly state this in the revised text. As suggested, we will merge Sections 2.5.1 and 2.5.2 into a single "Model Design" section.
  Comment: Catchment characteristics: It would be useful to know what the diversity of catchments is, for example using characteristics found in CAMELS-AUS. This would help contextualize model transferability results (especially with SooS and TSooS) for an audience that is not familiar with Australian geography/hydrology.
  Response: We agree this would be helpful, especially for readers unfamiliar with Australian catchment diversity. We will include a map of catchments overlaid with the Köppen-Geiger climate classification and provide summary statistics of relevant catchment attributes (e.g. area, aridity, baseflow index) in the Supplement. This will provide context for understanding model generalisability across diverse hydroclimatic regions.
  Comment: Uncertainty quantification: While comprehensive uncertainty analysis may be beyond the paper's scope, some quantification would enhance the robustness of the results, especially for hydroclimate projections or more specific applications such as regional-scale climate change vulnerability assessments (mentioned on L485-491). You could give an appreciation of the uncertainty for example by showing the spread in results from the cross-validation experiments (as mentioned on L267-268) in the Appendix. Additionally, you could apply bootstrapping when evaluating the model performance.
  Response: We appreciate this recommendation. While our primary focus was on deterministic performance, we agree that presenting uncertainty helps contextualise the robustness of our findings. To address this comment, we will include the spread of results from the cross-validation experiments and multiple training trials in the supplementary material. When we implement these additions, we will also discuss the limitations of interpreting this variability.
  Comment: Model evaluation methodology: Please consider adding a dedicated subsection on model evaluation in the methods. This could include: i) an explanation of the NSE and what it measures (i.e., a measure of overall performance rather than an evaluation of extremes), and ii) the criterion for “best performing model” selection - was an ≥0.01 NSE difference sufficient or was there a more stringent measure (e.g., with a larger buffer)? My fear is that the results might be a bit noisy, and that using a more stringent measure or adding a statistical test to assess differences would be beneficial.
  Response: We will include a dedicated “Model Evaluation” subsection in the Methods section. This will include an explanation of the NSE and what it measures.
  Comment: Terminology around climate projections: The term “climate projection” capabilities is somewhat misleading, as actual climate projections were not used here. Please consider reframing as “proxy for climate projection capabilities”.
  Response: Agreed. We will revise this throughout the manuscript to refer instead to a “proxy for climate projection capabilities”.
  Specific comments:
  Comment: L96-98: The “worthwhile” type of modelling system likely depends on the use. For example, for climate change scenarios the hybrid method might be favoured. Please clarify this nuance in the paper.
  Response: Thank you for pointing this out. In the revised manuscript, we will clarify this by adding the following sentences:
  
  "The value of hybrid approaches may depend on the application context. For instance, hybrid models may be preferable in climate change scenario analysis, where maintaining physical consistency and leveraging land surface model outputs is important. In contrast, standalone LSTM models may be more suitable for ungauged basin prediction, where purely data-driven performance is prioritised."
  Comment: L100-102: How can hybrid models help assess the dominant deficiencies? This warrants one to two more sentences in the paper.
  Response: Thank you for the suggestion. We will revise the text to clarify how hybrid models can help assess dominant deficiencies in land surface models. Specifically, we explain that by comparing performance across different input sequence lengths, one can distinguish between improvements due to routing correction and those due to bias correction. This diagnostic insight can assist land surface model developers in targeting specific weaknesses. The revised paragraph now reads:
  
  "In addition, in cases where LSTMs improve predictions from land surface models, as we show in the current study, the source of these improvements can be diagnosed. For instance, land surface models often exhibit two main deficiencies: routing errors and systematic biases in specific catchments. By comparing hybrid models trained with short input sequences (i.e. one time step) to those trained with longer sequences, we can isolate the contribution of each deficiency. Short sequence lengths limit the LSTM’s capacity to correct routing errors, meaning improvements in this case are more likely due to bias correction."
  Comment: L125: Is there any specific study of the application of GR4J in Australia that you could cite here?
  Response: Thank you for the suggestion. We will add the following references to support the use of GR4J in the Australian context. Coron et al. (2012) provide a comprehensive evaluation of GR4J performance across 216 Australian catchments under diverse climate conditions. Hapuarachchi et al. (2022) describe the use of GR4J as part of the operational ensemble streamflow forecasting system for Australia. Zheng et al. (2024) further demonstrate the application of GR4J in projecting future streamflow under various climate change scenarios for Australia.
  
  - Coron, L., Andréassian, V., Perrin, C., Lerat, J., Vaze, J., Bourqui, M., & Hendrickx, F. (2012). Crash testing hydrological models in contrasted climate conditions: An experiment on 216 Australian catchments. Water Resources Research, 48, W05552.
  
  - Hapuarachchi, H. A. P., Bari, M. A., Kabir, A., Hasan, M. M., Woldemeskel, F. M., Gamage, N., Sunter, P. D., Zhang, X. S., Robertson, D. E., Bennett, J. C., & Feikema, P. M. (2022). Development of a national 7-day ensemble streamflow forecasting service for Australia. Hydrology and Earth System Sciences, 26, 4801–4821.
  
  - Zheng, H., Chiew, F. H. S., Post, D. A., Robertson, D. E., Charles, S. P., Grose, M. R., & Potter, N. J. (2024). Projections of future streamflow for Australia informed by CMIP6 and previous generations of global climate models. Journal of Hydrology, 636, 131286.
  Comment: L225: Please explicitly mention what the AWRA-L output is here to remind the reader.
  Response: Thank you for this recommendation. We will revise the text to clearly state that the AWRA-L output used in our study refers to gridded runoff (surface and subsurface) at a 5 km × 5 km resolution across Australia.
  Comment: L280-283: The differences seen using an increasing time window could also be impacted by the catchment size, with larger differences between the two performance measures expected in larger catchments. It would be interesting to compare the length of the sequence with the known response time of each catchment.
  Response: Thank you for this suggestion. We will use the sig_dur_RespTime attribute from the CAMELS-AUS dataset, which represents the response time of catchments, to analyse the relationship between sequence length, catchment response time, and model performance. This analysis will be included as a new figure in the revised manuscript.
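One way such an analysis might look is a rank correlation between catchment response time and the NSE gain obtained from a longer input sequence. The arrays below are made-up values, not CAMELS-AUS data; only the attribute name `sig_dur_RespTime` comes from the reply above, and the use of Spearman correlation is an assumption chosen for illustration.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation (assumes no tied values, for simplicity)."""
    rank_x = np.argsort(np.argsort(x))  # rank of each element within x
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical per-catchment values, for illustration only
response_time = np.array([1.2, 3.5, 0.8, 6.0, 2.1])        # stand-in for sig_dur_RespTime
nse_gain_long_seq = np.array([0.02, 0.10, 0.01, 0.15, 0.04])  # NSE(long sequence) - NSE(one step)

rho = spearman_rho(response_time, nse_gain_long_seq)
print(f"Spearman rho = {rho:.2f}")
```

A strongly positive value would be consistent with slower-responding catchments benefiting more from longer input sequences, although a real analysis would need to handle ties and test significance.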
  Comment: L312: “higher exceedance probabilities” might be misinterpreted as referring to flow exceedance. Please clarify that this refers to the distribution of values when introducing the first plot of this kind.
  Response: We will clarify that this refers to the distribution of NSE values across catchments, not to flow exceedance.
  Comment: L312-314: It would be interesting to speculate why some catchments don’t benefit from finetuning. Are there commonalities in catchment type, data quality, hydro-meteorological processes, etc?
  Response: Thanks for the suggestion. In our analysis, only four catchments showed reduced performance after fine-tuning. These catchments are either ephemeral rivers with highly variable and intermittent flow (meaning they can cease to flow for many consecutive years and experience very large, infrequent flow events when they do occur) or are characterised by limited data. Such conditions make them difficult to model, and due to their variability, local fine-tuning may overfit or underperform relative to a more generalized global model. We will expand on this point in the revised manuscript, providing additional detail on the characteristics of these catchments.
  Comment: L388-389: Please consider spelling out what you mean by “possibly other hydrological processes” in the paper. For example, groundwater storage and processes, as well as lakes and reservoirs could be mentioned here. A side question to this, are the effects of lakes and reservoirs accounted for by the AWRA model?
  Response: We acknowledge the need for greater clarity here. In the revised manuscript, we will specify the types of hydrological processes we are referring to, which include additional lag processes such as percolation, groundwater interactions, and human activities (e.g. farm dams).
  
  Regarding the side question: the AWRA model does not explicitly simulate lakes or large reservoirs. However, the catchments used in this study are not impounded by major reservoirs. Nonetheless, small farm dams are present and may impact local hydrology, particularly by modifying runoff and storage patterns. While not directly modelled, their effects are likely implicitly represented through calibration where data are available. These farm dams have a widespread and growing impact on water availability across Australian agricultural regions and can significantly reduce downstream flows, particularly during dry years (Peña-Arancibia et al., 2023; Malerba et al., 2021).
  
  - Peña-Arancibia, J. L., M. E. Malerba, N. Wright and D. E. Robertson (2023). "Characterising the regional growth of on-farm storages and their implications for water resources under a changing climate." Journal of Hydrology 625: 130097.
  
  - Malerba, M. E., N. Wright and P. I. Macreadie (2021). "A Continental-Scale Assessment of Density, Size, Distribution and Historical Trends of Farm Dams Using Deep Learning Convolutional Neural Networks." Remote Sensing 13(2): 319.
  Comment: L492: One advantage of land surface or conceptual hydrological models compared to LSTM models is that they can output various other hydrological variables in addition to streamflow. Please consider adding this to the list.
  Response: We agree and will add that conceptual and land surface models can output additional hydrological variables.
  Typos:
  L62: Some kind of sentence separation is needed between “streamflow” and “for instance”.
  L114: Missing “in” or similar between “performance” and “218 catchments”.
  L133: In the introduction it says that outputs are available from 1910.
  L138: Missing closing parenthesis.
  L152: Missing “for catchments” or similar between “)” and “that have been”.
  L154: Rephrase “covering from”.
  L157: “is produce” is missing a “d”.
  L263: Missing “the” between “from” and “calibration”.
  L469: Small “b” for “behavior”.
  L472: “strongly performing”?
  Response: Thank you for identifying these typographical issues. We will correct all of them in the revised manuscript.
  
  Citation: https://doi.org/10.5194/egusphere-2025-805-AC1
RC2:
'Comment on egusphere-2025-805', Anonymous Referee #2, 05 Jun 2025

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-805/egusphere-2025-805-RC2-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2025-805-RC2
- AC2: 'Reply on RC2', Ashkan Shokri, 16 Jul 2025
  
  We appreciate the reviewer’s careful reading and constructive feedback. We address each comment below and describe the revisions we will incorporate into the manuscript.
  Overview:
  Comment: This manuscript compares a standalone LSTM (LSTM-C) and an LSTM that includes simulated streamflow from a land surface model (AWRA-L) as an additional dynamic input (LSTM-QC). The two LSTM-based models are additionally benchmarked against AWRA-L and GR4J, a conceptual hydrological model widely used in an Australian context. They compare the models using three different cross-validation strategies, evaluating the ability of the models to predict temporally out-of-sample, spatially out-of-sample, and spatiotemporally out-of-sample. Overall, they show that the two LSTM models outperform AWRA-L and GR4J in most catchments under all three cross-validation strategies, although there are some notable exceptions. The authors discuss the potential relevance of their study in three real-world applications, namely historical reconstruction, predictions in ungauged basins, and simulating hydrological change under climate change projections.
  Response: We thank you for the clear and thoughtful summary of our manuscript. Your overview accurately captures the key elements of our study and reflects our intent to assess the relative performance of different modelling approaches under varying out-of-sample conditions. We hope the revisions we have made in response to your detailed comments further clarify the study’s motivations, methodology, and implications.
  Main comments
  Comment: 1. As far as I can tell, LSTM-C and LSTM-QC are identical in every respect except for the inclusion of streamflow from AWRA-L as an additional dynamic input in LSTM-QC. I therefore question whether this paper is really testing whether the LSTM can correct the AWRA-L output, or whether the AWRA-L output provides any additional information content that can be leveraged by the LSTM architecture. In effect these two possibilities amount to the same thing, but the paper would benefit from emphasising one over the other. In my opinion, the latter characterisation more accurately reflects what the LSTM is actually doing.
  Response: We agree with the reviewer that the distinction between correcting AWRA‑L output versus leveraging its additional information content is important, and we appreciate this opportunity to clarify the framing. In the revised manuscript, we will:
  
  (1) Recast our framing in the Introduction to emphasise that our primary goal is to assess the information content of AWRA‑L, rather than to correct its outputs. We will clearly state that LSTM‑QC and LSTM‑C differ only by the inclusion of AWRA‑L streamflow, and that our comparison is designed to quantify the added value of that information.
  
  (2) Update the Discussion by re-labelling Figures 4 and 7 to highlight “Information Gain from AWRA‑L” rather than simply “Performance Difference.”
  
  (3) Revise the Conclusions to reinforce that the contribution of this work is not in bias correction, but in evaluating the informational value of a process-based model (AWRA‑L) when used as an input to a data-driven model.
  Comment: 2. The results of the comparison with LSTM-C (i.e. the LSTM model that does not include the AWRA-L streamflow as a dynamic input) are undermined by the limited number of static catchment attributes that are supplied to the model (Table 1). A large number of studies have shown that LSTM models perform best when they are trained across many catchments at once using catchment descriptors that adequately describe the physical diversity of catchments in the training set. For example, Kratzert et al. (2019) train an LSTM on static catchment attributes that include soils, climate, vegetation, topography, and geology. Here, the authors have selected attributes that broadly cover climate and geomorphology, but discard a large number of attributes from CAMELS-AUS that are potentially highly influential in determining the hydrological behaviour of Australian catchments (e.g. geology, land cover). Presumably, in common with most land surface models, AWRA-L is parameterised using land cover and geological data. Therefore, I believe it is at least a possibility that LSTM-QC is utilising the information on catchment diversity that is encoded in the AWRA-L output but which has been arbitrarily excluded from LSTM-C. It seems to me that LSTM-C is trained in a way that is inconsistent with our current understanding of how best to use this class of model for hydrological simulation, raising doubts about whether it is a fair comparison.
  Response: Thank you for your thoughtful comment. We agree that the choice of static catchment attributes can significantly affect LSTM model performance. The selection of static features in our study was deliberate and guided by both performance considerations and data availability in operational or ungauged settings.
  
  Kratzert et al. (2019) also did not use the full set of static attributes available in CAMELS. Instead, they selected 27 features as a subset of those explored by Addor et al. (2017), focusing on variables derivable from remote sensing or nationally available datasets. Similarly, in our case, we explored a wide range of attributes available in CAMELS-AUS and found through systematic testing that a subset of 12 variables consistently improved model performance across catchments. These variables cover key aspects of climate and geomorphology and were chosen to ensure applicability in real-world, data-limited contexts.
  
  We also deliberately excluded static variables derived from streamflow to avoid highly correlated predictors, and we omitted attributes that are difficult to estimate reliably for ungauged basins. This aligns with our focus on creating a parsimonious and operationally feasible model.
  Minor comments
  Comment: 1. L27: “The ubiquity of these model predictions...”- are you referring to the spatial coverage or widespread use? Please clarify.
  Response: Thank you for pointing this out. In this context, "ubiquity" refers primarily to the widespread spatial coverage of land surface model predictions across large regions, often at continental or global scales. We will revise the sentence to clarify this and avoid ambiguity. For example:
  
  "The widespread spatial coverage of these model predictions often trades off against accuracy..."
  Comment: 2. L33: It’s worth pointing out that most land surface models were not originally designed to predict streamflow, but rather to provide the lower boundary condition to Earth system models.
  Response: We agree that many land surface models were originally developed to provide lower boundary conditions for Earth system models rather than for direct streamflow prediction. However, we would like to clarify that, unlike most other land surface models, AWRA-L was specifically developed for water balance estimation and runoff prediction across Australia, with a focus on hydrological applications rather than atmospheric coupling. We will clarify this distinction in the revised manuscript to avoid conflating AWRA-L with more typical land surface models used in Earth system modelling.
  Comment: 3. L39-48: I agree that the lack of channel routing and calibration scheme are weaknesses of AWRA-L with respect to streamflow simulation, but is a lack of process understanding not also a weakness?
  Response: We agree that understanding and simulating key processes adds to confidence in a model, and that the sometimes-imperfect representation of these processes in AWRA-L (as well as the total lack of such processes in LSTMs) may contribute to a lack of confidence in these models. We will update the manuscript to explicitly acknowledge the lack of process understanding as an additional limitation of AWRA-L, alongside the issues of channel routing and calibration.
  Comment: 4. L62: Punctuation needed.
  Response: Thank you for pointing this out. We will revise the sentence for clarity and correct punctuation as follows:
  
  “Apart from physically based approaches for representing routing, several methods have been developed applying machine learning to estimate streamflow. For instance, Nagesh Kumar et al. (2004) used a feedforward Artificial Neural Network to estimate monthly flow time series of a single river.”
  Comment: 5. L73: A third advantage is that they are unconstrained by physical laws such as mass balance, so they are better able to implicitly correct biases in the input data. In land surface models, uncertainty in the input will propagate to the output.
  Response: Thank you for the suggestion. We agree and will incorporate this point into the revised manuscript. The paragraph will be updated as follows:
  
  “… A third advantage is that they are not constrained by physical laws such as mass balance, which allows them to implicitly correct biases in the input data. In contrast, in land surface models, uncertainty in the inputs typically propagates directly to the outputs.”
  Comment: 6. L85-93: This passage is not particularly relevant to the topic in hand. As the introduction is already quite long it could be safely removed.
  Response: Thank you for the suggestion. We will revise the text to ensure the introduction remains focused and appropriately scoped.
  Comment: 7. L98: “...as we show in the current study...”- This would seem to pre-empt the results.
  Response: Thank you for pointing this out. We agree that the phrase pre-empts the results and will remove “as we show in the current study” to maintain a more neutral tone in the introductory text.
  Comment: 8. L99-100: Arguably deficiencies in routing and bias in individual catchments amount to the same thing. Perhaps you could clarify what you mean here?
  Response: While both routing deficiencies and catchment-specific biases contribute to model error, they stem from different sources and have distinct implications for model behaviour. Routing deficiencies primarily affect the timing and shape of the hydrograph (e.g., delayed or premature peak flows), whereas biases typically refer to systematic over- or underestimation of flow magnitude, independent of timing. This distinction is important for diagnosing model limitations. We will update the manuscript to more clearly articulate this difference.
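
The distinction drawn in this response can be illustrated with a minimal sketch. It uses two simple diagnostics that are not taken from the manuscript: a percent volume bias for magnitude errors and a peak lag for timing errors. The flow values and both function names are hypothetical and chosen only for illustration.

```python
def percent_bias(obs, sim):
    """Relative volume error in percent; positive means overestimation."""
    return 100.0 * (sum(sim) - sum(obs)) / sum(obs)


def peak_lag(obs, sim):
    """Number of time steps by which the simulated peak trails the observed peak."""
    return sim.index(max(sim)) - obs.index(max(obs))


# Hypothetical daily flows (mm/d) for one event.
obs = [1.0, 2.0, 8.0, 4.0, 2.0, 1.0, 1.0]

# Routing-type error: the volume is preserved, but the peak arrives one day late.
delayed = [1.0, 1.0, 2.0, 8.0, 4.0, 2.0, 1.0]

# Bias-type error: the timing is correct, but every flow is 20 % too high.
biased = [1.2 * q for q in obs]

for name, sim in (("delayed", delayed), ("biased", biased)):
    print(name, round(percent_bias(obs, sim), 1), peak_lag(obs, sim))
```

In this toy case the delayed series shows no volume bias but a one-step peak lag, while the scaled series shows a volume bias and no lag, which is the sense in which the two deficiencies are separable.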
  Comment: 9. L122: I’m not sure it is particularly easy to test the ability of the model to perform well under climate change projections, because it is likely that the range of input values in the climate projections will exceed those the LSTM would encounter in the training set. Thus, what you really ought to be testing is the ability of the model to extrapolate, but I’m not sure the experimental design achieves this at present.
  Response: Thank you for raising this important point. We agree that true testing under climate change conditions requires the model to extrapolate beyond the historical range of climate inputs, which is inherently challenging. While our experimental setup does not fully replicate future climate scenarios, the Temporally and Spatially out of Sample (TSooS) experiment partially addresses this concern: in two of the four folds, the model is trained on data from 1975–1995 and evaluated in 2000–2014, which introduces a degree of extrapolation in both space and time. However, we acknowledge that the future climate may involve more extreme conditions than those seen in our historical training period. To better reflect this limitation, we will revise the manuscript to refer to this analysis as a “proxy for climate projection capabilities,” rather than implying direct applicability to future climate conditions.
  Comment: 10. L156: You could acknowledge here that using multiple precipitation datasets can enhance LSTM performance (Kratzert et al., 2021).
  Response: Thank you for the suggestion. While previous studies such as Kratzert et al. (2021) have shown that using multiple precipitation datasets can enhance LSTM performance, in our case the AGCD (formerly known as AWAP) and SILO datasets share many common rain gauges and differ mainly in processing methods. As a result, incorporating both datasets does not provide additional independent information or improve model performance.
  Comment: 11. L175: Please clarify that you are referring to the hidden state size here.
  Response: Thank you for the comment. We will clarify in the text that the reference is specifically to the hidden state size.
  Comment: 12. L221: You describe the static and dynamic predictors, but not the target (i.e. streamflow). Please could you describe your treatment of the target variable (e.g. do you normalize by catchment area)?
  Response: Good observation. The target is the gauged daily streamflow provided in the CAMELS-AUS dataset, normalized by catchment area and expressed in mm. We will clarify this in the text.
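
For readers unfamiliar with area normalization, the arithmetic can be sketched as below. The response states only that the target is area-normalized and expressed in mm. The input unit (m³/s), the daily time step, and the function name are assumptions made for this illustration, not a description of the authors' processing.

```python
SECONDS_PER_DAY = 86_400


def m3s_to_mm_per_day(flow_m3s, area_km2):
    """Convert a mean daily discharge (m^3/s) to a runoff depth (mm/d).

    Hypothetical helper: divides the daily volume by the catchment area
    and converts the resulting depth from metres to millimetres.
    """
    if area_km2 <= 0:
        raise ValueError("catchment area must be positive")
    volume_m3 = flow_m3s * SECONDS_PER_DAY   # volume discharged in one day
    area_m2 = area_km2 * 1e6                 # km^2 -> m^2
    return volume_m3 / area_m2 * 1000.0      # m -> mm


# Example: 10 m^3/s from an 864 km^2 catchment is about 1 mm/d.
depth = m3s_to_mm_per_day(10.0, 864.0)
```

Expressing flow as a depth puts catchments of very different size on a comparable scale, which matters when one model is trained across many catchments.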
  Comment: 13. L225: Please could you confirm that the two LSTM models are identical in every respect except for the inclusion of AWRA-L streamflow in LSTM-QC?
  Response: Yes, we can confirm that the LSTM models are identical in every respect except for the use of AWRA-L runoff as a predictor in LSTM-QC. We will state this explicitly in the text.
  Comment: 14. L240: You say this is important but not that you actually do it. We later find out that you have, although this information is in the results section. Please consider moving 3.1.3 to 2.5.2.
  Response: Thank you for pointing this out. We will move the explanation about the method from Section 3.1.3 to Section 2.5.2.
  Comment: 15. L245: In general I think the training approach for LSTMs is well established and so you don’t need to go into so much detail here. The text could also be shortened by using scientific notation (e.g. Section 2.3 of (Lees et al., 2021)). Typically when training an LSTM there will be a training period, a validation period (that is used during training to test each parameter set) and a hold-out test period. However, Table 2 only details a training and validation period. Please could you clarify whether the model is tested on an unseen dataset?
  Response: We appreciate this useful suggestion.
  
  Regarding the first point: In the revised manuscript, we will shorten this section and adopt the scientific shorthand style recommended (e.g., as in Lees et al., 2021, Section 2.3).
  
  Regarding the hold‑out test period: You are correct that Table 2 currently lists only training and validation periods. In fact, we also employ a separate hold‑out test period, 2014–2022, and we will add a more explicit reference to this in the Methods.
  Comment: 16. L265: This needs some clarification. I think it is feasible (i.e. it could be done under the experimental setup) but not meaningful, because in a real out-of-sample situation you would not have any data to conduct fine-tuning.
  Response: Thank you for the comment. We agree that the term “feasible” may be misleading in this context. While fine-tuning is technically possible within the experimental setup, it would not be meaningful in a true out-of-sample scenario where no data from the target catchment would be available for adjustment. We will revise the manuscript accordingly and replace “feasible” with “realistic” to better reflect the intention. The revised sentence will read:
  
  “Fine-tuning for individual catchments would not be realistic in a true out-of-sample scenario, as no target catchment data would be available for adjustment.”
  Comment: 17. L285: I can see the argument for including GR4J in the model comparison, but I wonder whether it would be better to only use it in the TooS test. I would argue that by including GR4J in the SooS and TSooS tests you are really testing the parameter regionalization scheme, which is not really the main focus of the manuscript.
  Response: While we understand the concern, we maintain that testing GR4J under the SooS and TSooS setups aligns with one of our primary objectives, which is to evaluate model performance in regionalization scenarios. Including GR4J across all experimental setups allows for a consistent benchmark against a widely used conceptual model, noting that GR4J is also used in out-of-spatial-sample prediction in Australia. This helps illustrate the value of LSTM models in both interpolation and extrapolation contexts.
  Comment: 18. Figure 4/5/6: Your description of the results would benefit from using subplot labels, so the reader knows what they should be looking at.
  Response: Thank you for the suggestion. We will add subplot labels to Figures 4, 5, and 6 to improve clarity and help readers more easily follow the results.
  Comment: 19. L373: Notwithstanding my previous point about climate projections, I’m not sure why this is categorised as TSooS rather than TooS?
  Response: Thank you for your comment. The reason this is categorized as TSooS rather than TooS relates to the data partitioning strategy used for training and testing. In the TooS setup, the model is trained on all catchments but during one period, then tested on the same catchments during a different period. In contrast, TSooS is a stricter test where the model is trained on only half of the catchments for half of the overall time period and then tested on the remaining unseen catchments during the other half of the time. This means the model effectively trains on only about a quarter of the total data in TSooS, compared to about three-quarters in TooS. This distinction is important because TSooS better evaluates the model’s ability to generalize to completely new catchments and unseen periods, which may include extreme events like the millennium drought that are absent from the training data.
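
The TSooS partitioning described in this response can be sketched as follows. This is a minimal illustration, assuming two catchment groups of equal size and two periods, which gives four folds. The catchment identifiers are hypothetical, the period labels reuse the years quoted in the response to Comment 9, and the function name is not from the manuscript.

```python
from itertools import product


def tsoos_folds(catchments, periods):
    """Build four folds: train on one catchment half in one period,
    test on the other catchment half in the other period."""
    half = len(catchments) // 2
    groups = (catchments[:half], catchments[half:])
    folds = []
    for g, p in product((0, 1), (0, 1)):
        folds.append({
            "train_catchments": groups[g],
            "train_period": periods[p],
            "test_catchments": groups[1 - g],   # catchments unseen in training
            "test_period": periods[1 - p],      # period unseen in training
        })
    return folds


# Hypothetical identifiers; the real study uses 218 CAMELS-AUS catchments.
catchments = ["c01", "c02", "c03", "c04"]
periods = ((1975, 1995), (2000, 2014))

folds = tsoos_folds(catchments, periods)
```

With equal halves, each fold trains on half the catchments for half the record, which is the "about a quarter of the total data" mentioned in the response.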
  Comment: 20. L390: I’m not sure it is meaningful to compare with LSTM-C at short sequence lengths, as we already know that LSTMs require long sequence lengths to make good predictions.
  Response: We agree that it is already known that LSTM-C requires longer sequence lengths to perform well. However, we use LSTM-C here primarily as a baseline to demonstrate the added value of the routing component in LSTM-QC. Even at shorter sequence lengths, the difference in performance between LSTM-C and LSTM-QC highlights the extent to which AWRA-L is resolving routing processes. We believe this comparison offers important insights for AWRA-L users.
  Comment: 21. L481: This could arise because the LSTM training is suboptimal, as it has not been exposed to catchment attributes that may help it learn the hydrological behaviour in these regions.
  Response: We agree that suboptimal LSTM training due to limited exposure to catchment attributes is one possible explanation. However, we believe the poorer performance of LSTMs compared to GR4J in south-west Western Australia is more likely due to the more informative regionalisation scheme used by GR4J. This region is hydrologically distinct (Petrone et al., 2010; Hughes et al., 2012), and since GR4J’s regionalisation is weighted by inverse-distance, it places greater emphasis on local catchments, whereas the LSTM does not. That said, it is possible that training the LSTM on a more global dataset might improve its performance in this region. We consider a thorough investigation of these hypotheses outside the scope of the current paper and intend to address them in future research.
  
  Petrone, K. C., J. D. Hughes, T. G. Van Niel, and R. P. Silberstein (2010), Streamflow decline in southwestern Australia, 1950–2008, Geophys. Res. Lett., 37, L11401, doi:10.1029/2010GL043102.
  
  Hughes, J. D., K. C. Petrone, and R. P. Silberstein (2012), Drought, groundwater storage and stream flow decline in southwestern Australia, Geophys. Res. Lett., 39, L03408, doi:10.1029/2011GL050797.
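
The inverse-distance weighting mentioned in the response to Comment 21 can be sketched as below. The response does not state the distance exponent, the number of donor catchments, or whether the weights are applied to parameters or to simulated flows, so all of these are assumptions. For simplicity the sketch averages hypothetical donor parameter sets, and both function names are illustrative only.

```python
def idw_weights(distances_km, power=1.0):
    """Normalized inverse-distance weights; nearer donors receive larger weights."""
    if any(d <= 0 for d in distances_km):
        raise ValueError("distances must be positive")
    raw = [1.0 / d ** power for d in distances_km]
    total = sum(raw)
    return [w / total for w in raw]


def regionalise(donor_params, distances_km, power=1.0):
    """Weighted average of donor parameter sets for an ungauged target catchment."""
    weights = idw_weights(distances_km, power)
    n_params = len(donor_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, donor_params))
        for i in range(n_params)
    ]


# Two hypothetical donors, 10 km and 40 km from the target, one parameter each.
weights = idw_weights([10.0, 40.0])
estimate = regionalise([[100.0], [200.0]], [10.0, 40.0])
```

Because the weights fall off with distance, nearby catchments dominate the estimate, which is how such a scheme can favour local behaviour in a hydrologically distinct region.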
  
  Citation: https://doi.org/10.5194/egusphere-2025-805-AC2

Status: closed

CC1:
'Comment on egusphere-2025-805', Ather Abbas, 14 Apr 2025

Dear Authors,
Is it possible to provide the list of the 218 selected catchments from the CAMELS-AUS dataset and the median values of the performance metrics for all the models?
Thanks

Citation: https://doi.org/10.5194/egusphere-2025-805-CC1
- AC3: 'Reply on CC1', Ashkan Shokri, 07 Oct 2025
  
  We thank the commenter for their interest. We have provided a ZIP file containing the list of the 218 selected CAMELS-AUS catchments and the NSE results for all models (LSTM-C, LSTM-QC, GR4J, and AWRA-L) under the three cross-validation experiments (TooS, SooS, and TSooS).
  
  Citation: https://doi.org/10.5194/egusphere-2025-805-AC3
RC1:
'Comment on egusphere-2025-805', Anonymous Referee #1, 03 Jun 2025
In this paper, the authors evaluate the performance of various LSTM models for hydrological simulations across Australian catchments: a land surface model-LSTM hybrid based on climate data as well as runoff from the AWRA-L land surface model, and an LSTM model based on climate data. They show that both models outperform runoff simulation from AWRA-L as well as from the conceptual hydrological model GR4J across most catchments. They investigate the impacts of methodological decisions, namely the cross-validation strategies, on the results. They additionally discuss the relevance of the proposed approaches for three real-world applications: long-term historical simulations, predictions in ungauged basins, and climate projections. This application-focused framing is a welcome perspective in a scientific paper. Overall, this is a well-executed, well-written study that addresses important research questions and contributes valuable insights to the hydrological modelling community. Below are some comments that will hopefully help review your paper for publication.
Main comments:
Clarification of terminology: The term “static features” or “static predictors” is somewhat misleading, especially since some of these predictors are recalculated for different time windows to demonstrate the impact on model performance. Please consider using a different term for the recalculated features (e.g., quasi-static) to remove any confusion.

Model design clarity: It is unclear whether the static predictors are used for both LSTM-C and LSTM-QC models presented in section 2.5.1 by default, or only for some versions of these models. You could clarify this by merging sections 2.5.1 and 2.5.2 into a single model design section, summarizing model inputs more clearly (e.g., table with two columns to separate predictors into dynamic and static for each model). If static predictors are not always used and I misunderstood this, consider renaming the various versions of each model to differentiate them clearly.

Catchment characteristics: It would be useful to know what the diversity of catchments is, for example using characteristics found in CAMELS-AUS. This would help contextualize model transferability results (especially with SooS and TSooS) for an audience that is not familiar with Australian geography/hydrology.

Uncertainty quantification: While comprehensive uncertainty analysis may be beyond the paper's scope, some quantification would enhance the robustness of the results, especially for hydroclimate projections or more specific applications such as regional-scale climate change vulnerability assessments (mentioned on L485-491). You could give an appreciation of the uncertainty for example by showing the spread in results from the cross-validation experiments (as mentioned on L267-268) in the Appendix. Additionally, you could apply bootstrapping when evaluating the model performance.

Model evaluation methodology: Please consider adding a dedicated subsection on model evaluation in the methods. This could include: i) an explanation of the NSE and what it measures (i.e., a measure of overall performance rather than an evaluation of extremes), and ii) the criterion for “best performing model” selection - was an ≥0.01 NSE difference sufficient or was there a more stringent measure (e.g., with a larger buffer)? My fear is that the results might be a bit noisy, and that using a more stringent measure or adding a statistical test to assess differences would be beneficial.

Terminology around climate projections: The term “climate projection capabilities” is somewhat misleading, as actual climate projections were not used here. Please consider reframing as “proxy for climate projection capabilities”.

Specific comments:
L96-98: The “worthwhile” type of modelling system likely depends on the use. For example, for climate change scenarios the hybrid method might be favoured. Please clarify this nuance in the paper.

L100-102: How can hybrid models help assess the dominant deficiencies? This warrants one to two more sentences in the paper.

L125: Is there any specific study of the application of GR4J in Australia that you could cite here?

L225: Please explicitly mention what the AWRA-L output is here to remind the reader.

L280-283: The differences seen using an increasing time window could also be impacted by the catchment size, with larger differences between the two performance measures expected in larger catchments. It would be interesting to compare the length of the sequence with the known response time of each catchment.

L312: “higher exceedance probabilities” might be misinterpreted as referring to flow exceedance. Please clarify that this refers to the distribution of values when introducing the first plot of this kind.

L312-314: It would be interesting to speculate why some catchments don’t benefit from finetuning. Are there commonalities in catchment type, data quality, hydro-meteorological processes, etc?

L388-389: Please consider spelling out what you mean by “possibly other hydrological processes” in the paper. For example, groundwater storage and processes, as well as lakes and reservoirs could be mentioned here. A side question to this, are the effects of lakes and reservoirs accounted for by the AWRA model?

L492: One advantage of land surface or conceptual hydrological models compared to LSTM models is that they can output various other hydrological variables in addition to streamflow. Please consider adding this to the list.

Typos:
L62: Some kind of sentence separation is needed between “streamflow” and “for instance”.

L114: Missing “in” or similar between “performance” and “218 catchments”.

L133: In the introduction it says that outputs are available from 1910.

L138: Missing closing parenthesis.

L152: Missing “for catchments” or similar between “)” and “that have been”.

L154: Rephrase “covering from”.

L157: “is produce” is missing a “d”.

L263: Missing “the” between “from” and “calibration”.

L469: Small “b” for “behavior”.

L472: “strongly performing”?
Citation: https://doi.org/10.5194/egusphere-2025-805-RC1
- AC1: 'Reply on RC1', Ashkan Shokri, 16 Jul 2025
  
  We thank the referee for their valuable comments. We have addressed each point in detail below and will incorporate the following changes:
  Summary
  Comment: In this paper, the authors evaluate the performance of various LSTM models for hydrological simulations across Australian catchments: a land surface model-LSTM hybrid based on climate data as well as runoff from the AWRA-L land surface model, and an LSTM model based on climate data. They show that both models outperform runoff simulation from AWRA-L as well as from the conceptual hydrological model GR4J across most catchments. They investigate the impacts of methodological decisions, namely the cross-validation strategies, on the results. They additionally discuss the relevance of the proposed approaches for three real-world applications: long-term historical simulations, predictions in ungauged basins, and climate projections. This application-focused framing is a welcome perspective in a scientific paper. Overall, this is a well-executed, well-written study that addresses important research questions and contributes valuable insights to the hydrological modelling community. Below are some comments that will hopefully help review your paper for publication.
  Response: We sincerely appreciate your thorough and constructive review of our manuscript. Your positive assessment and insightful comments will help us improve both the clarity and depth of the study. In response, we will revise the manuscript to better articulate model design choices, clarify terminology, provide more context on catchment diversity, and address uncertainty and model evaluation procedures. Below, we address your comments point by point.
  Main Comments
  Comment: The term “static features” or “static predictors” is somewhat misleading, especially since some of these predictors are recalculated for different time windows to demonstrate the impact on model performance. Please consider using a different term for the recalculated features (e.g., quasi-static) to remove any confusion.
  Response: We agree with this observation. In the revised manuscript, we will adopt the term "quasi-static predictors".
  Comment: Model design clarity: It is unclear whether the static predictors are used for both LSTM-C and LSTM-QC models presented in section 2.5.1 by default, or only for some versions of these models. You could clarify this by merging sections 2.5.1 and 2.5.2 into a single model design section, summarizing model inputs more clearly (e.g., a table with two columns to separate predictors into dynamic and static for each model). If static predictors are not always used and I misunderstood this, consider renaming the various versions of each model to differentiate them clearly.
  Response: Thank you for pointing this out. To clarify, all "quasi-static" predictors are used in both LSTM-C and LSTM-QC models. We will explicitly state this in the revised text. As suggested, we will merge Sections 2.5.1 and 2.5.2 into a single "Model Design" section.
  Comment: Catchment characteristics: It would be useful to know what the diversity of catchments is, for example using characteristics found in CAMELS-AUS. This would help contextualize model transferability results (especially with SooS and TSooS) for an audience that is not familiar with Australian geography/hydrology.
  Response: We agree this would be helpful, especially for readers unfamiliar with Australian catchment diversity. We will include a map of catchments overlaid with the Köppen-Geiger climate classification and provide summary statistics of relevant catchment attributes (e.g. area, aridity, baseflow index) in the Supplement. This will provide context for understanding model generalisability across diverse hydroclimatic regions.
  Comment: Uncertainty quantification: While comprehensive uncertainty analysis may be beyond the paper's scope, some quantification would enhance the robustness of the results, especially for hydroclimate projections or more specific applications such as regional-scale climate change vulnerability assessments (mentioned on L485-491). You could give an appreciation of the uncertainty for example by showing the spread in results from the cross-validation experiments (as mentioned on L267-268) in the Appendix. Additionally, you could apply bootstrapping when evaluating the model performance.
  Response: We appreciate this recommendation. While our primary focus was on deterministic performance, we agree that presenting uncertainty helps contextualise the robustness of our findings. To address this comment, we will include the spread of results from the cross-validation experiments and multiple training trials in the supplementary material. We will also discuss the limitations of interpreting this variability.
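
One way the bootstrapping suggested by the referee could be realised is sketched below. The response does not commit to this analysis. The sketch assumes a paired resampling of catchments with replacement and a percentile confidence interval on the median NSE difference between two models. The NSE values, the default settings, and the function name are hypothetical.

```python
import random
import statistics


def bootstrap_median_diff(nse_a, nse_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median per-catchment NSE difference.

    Catchments are resampled with replacement; the pairing between the
    two models is preserved by resampling the differences directly.
    """
    if len(nse_a) != len(nse_b):
        raise ValueError("both models need one NSE value per catchment")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(nse_a, nse_b)]
    n = len(diffs)
    medians = sorted(
        statistics.median(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(diffs), lower, upper


# Hypothetical per-catchment NSE values for two models.
model_a = [0.70, 0.80, 0.60, 0.75, 0.90]
model_b = [0.50, 0.60, 0.55, 0.60, 0.70]

median_diff, ci_low, ci_high = bootstrap_median_diff(model_a, model_b)
```

An interval that excludes zero would suggest the difference between models is not an artefact of which catchments happen to be in the sample, which speaks to the referee's concern about noisy comparisons.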
  Comment: Model evaluation methodology: Please consider adding a dedicated subsection on model evaluation in the methods. This could include: i) an explanation of the NSE and what it measures (i.e., a measure of overall performance rather than an evaluation of extremes), and ii) the criterion for “best performing model” selection - was an ≥0.01 NSE difference sufficient or was there a more stringent measure (e.g., with a larger buffer)? My fear is that the results might be a bit noisy, and that using a more stringent measure or adding a statistical test to assess differences would be beneficial.
  Response: We will include a dedicated “Model Evaluation” subsection in the Methods section. This will include an explanation of the NSE.
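
For reference, the Nash–Sutcliffe efficiency (NSE) discussed in this comment is one minus the ratio of the sum of squared errors to the sum of squared deviations of the observations from their mean. A minimal sketch follows, assuming complete series of equal length with no missing values. The function name and the example flows are illustrative.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 for a perfect fit, 0 when the simulation
    is no better than predicting the observed mean, negative when worse."""
    if len(obs) != len(sim) or not obs:
        raise ValueError("obs and sim must be non-empty and of equal length")
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    if variance_sum == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - squared_error / variance_sum


# Hypothetical daily flows (mm/d).
observed = [1.0, 2.0, 8.0, 4.0, 2.0, 1.0]
mean_only = [sum(observed) / len(observed)] * len(observed)

perfect_score = nse(observed, observed)
benchmark_score = nse(observed, mean_only)
```

Because errors are squared, the score summarises overall agreement and is dominated by the larger flows, so it is not a dedicated measure of performance on extremes.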
  Comment: Terminology around climate projections: The term “climate projection” capabilities is somewhat misleading, as actual climate projections were not used here. Please consider reframing as “proxy for climate projection capabilities”.
  Response: Agreed. We will revise this throughout the manuscript to refer instead to a “proxy for climate projection capabilities”.
  Specific comments:
  Comment: L96-98: The “worthwhile” type of modelling system likely depends on the use. For example, for climate change scenarios the hybrid method might be favoured. Please clarify this nuance in the paper.
  Response: Thank you for pointing this out. In the revised manuscript, we will clarify this by adding the following sentences:
  
  "The value of hybrid approaches may depend on the application context. For instance, hybrid models may be preferable in climate change scenario analysis, where maintaining physical consistency and leveraging land surface model outputs is important. In contrast, standalone LSTM models may be more suitable for ungauged basin prediction, where purely data-driven performance is prioritised."
  Comment: L100-102: How can hybrid models help assess the dominant deficiencies? This warrants one to two more sentences in the paper.
  Response: Thank you for the suggestion. We will revise the text to clarify how hybrid models can help assess dominant deficiencies in land surface models. Specifically, we explain that by comparing performance across different input sequence lengths, one can distinguish between improvements due to routing correction and those due to bias correction. This diagnostic insight can assist land surface model developers in targeting specific weaknesses. The revised paragraph now reads:
  
  "In addition, in cases where LSTMs improve predictions from land surface models, as we show in the current study, the source of these improvements can be diagnosed. For instance, land surface models often exhibit two main deficiencies: routing errors and systematic biases in specific catchments. By comparing hybrid models trained with short input sequences (i.e. one time step) to those trained with longer sequences, we can isolate the contribution of each deficiency. Short sequence lengths limit the LSTM’s capacity to correct routing errors, meaning improvements in this case are more likely due to bias correction."
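
The diagnostic idea in the quoted paragraph can be sketched as simple arithmetic on skill scores. This is an illustrative reading, not the authors' procedure. It assumes the gain from a one-time-step hybrid is attributed mainly to bias correction and the further gain from a long-sequence hybrid mainly to routing and other lagged processes. The NSE values and the function name are hypothetical.

```python
def attribute_gain(nse_lsm, nse_hybrid_short, nse_hybrid_long):
    """Split the total NSE gain of a hybrid model over a land surface model
    into a short-sequence part and an additional long-sequence part."""
    bias_gain = nse_hybrid_short - nse_lsm              # no memory available
    routing_gain = nse_hybrid_long - nse_hybrid_short   # extra gain from memory
    return {
        "bias_correction": bias_gain,
        "routing_and_lags": routing_gain,
        "total": nse_hybrid_long - nse_lsm,
    }


# Hypothetical scores for a single catchment.
gains = attribute_gain(nse_lsm=0.40, nse_hybrid_short=0.55, nse_hybrid_long=0.75)
```

The split is only indicative, since a short-sequence model may still capture some timing information and the two error sources can interact.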
  Comment: L125: Is there any specific study of the application of GR4J in Australia that you could cite here?
  Response: Thank you for the suggestion. We will add the following references to support the use of GR4J in the Australian context. Coron et al. (2012) provide a comprehensive evaluation of GR4J performance across 216 Australian catchments under diverse climate conditions. Hapuarachchi et al. (2022) describe the use of GR4J as part of the operational ensemble streamflow forecasting system for Australia. Zheng et al. (2024) further demonstrate the application of GR4J in projecting future streamflow under various climate change scenarios for Australia.
  
  - Coron, L., Andréassian, V., Perrin, C., Lerat, J., Vaze, J., Bourqui, M., & Hendrickx, F. (2012). Crash testing hydrological models in contrasted climate conditions: An experiment on 216 Australian catchments. Water Resources Research, 48, W05552.
  
  - Hapuarachchi, H. A. P., Bari, M. A., Kabir, A., Hasan, M. M., Woldemeskel, F. M., Gamage, N., Sunter, P. D., Zhang, X. S., Robertson, D. E., Bennett, J. C., & Feikema, P. M. (2022). Development of a national 7-day ensemble streamflow forecasting service for Australia. Hydrology and Earth System Sciences, 26, 4801–4821.
  
  - Zheng, H., Chiew, F. H. S., Post, D. A., Robertson, D. E., Charles, S. P., Grose, M. R., & Potter, N. J. (2024). Projections of future streamflow for Australia informed by CMIP6 and previous generations of global climate models. Journal of Hydrology, 636, 131286.
  Comment: L225: Please explicitly mention what the AWRA-L output is here to remind the reader.
  Response: Thank you for this recommendation. We will revise the text to clearly state that the AWRA-L output used in our study refers to gridded runoff (surface and subsurface) at a 5 km × 5 km resolution across Australia.
  Comment: L280-283: The differences seen using an increasing time window could also be impacted by the catchment size, with larger differences between the two performance measures expected in larger catchments. It would be interesting to compare the length of the sequence with the known response time of each catchment.
  Response: Thank you for this suggestion. We will use the sig_dur_RespTime attribute from the CAMELS-AUS dataset, which represents the response time of catchments, to analyse the relationship between sequence length, catchment response time, and model performance. This analysis will be included as a new figure in the revised manuscript.
  Comment: L312: “higher exceedance probabilities” might be misinterpreted as referring to flow exceedance. Please clarify that this refers to the distribution of values when introducing the first plot of this kind.
  Response: We will clarify that this refers to the distribution of NSE values across catchments, not to flow exceedance.
  Comment: L312-314: It would be interesting to speculate why some catchments don’t benefit from finetuning. Are there commonalities in catchment type, data quality, hydro-meteorological processes, etc?
  Response: Thanks for the suggestion. In our analysis, only four catchments showed reduced performance after fine-tuning. These catchments are either ephemeral rivers with highly variable and intermittent flow (meaning they can cease to flow for many consecutive years and experience very large, infrequent flow events when they do occur) or are characterised by limited data. Such conditions make them difficult to model, and due to their variability, local fine-tuning may overfit or underperform relative to a more generalized global model. We will expand on this point in the revised manuscript, providing additional detail on the characteristics of these catchments.
  Comment: L388-389: Please consider spelling out what you mean by “possibly other hydrological processes” in the paper. For example, groundwater storage and processes, as well as lakes and reservoirs could be mentioned here. A side question to this, are the effects of lakes and reservoirs accounted for by the AWRA model?
  Response: We acknowledge the need for greater clarity here. In the revised manuscript, we will specify the types of hydrological processes we are referring to, which include additional lag processes such as percolation, groundwater interactions, and human activities (e.g. farm dams).
  
  Regarding the side question: the AWRA model does not explicitly simulate lakes or large reservoirs. However, the catchments used in this study are not impounded by major reservoirs. Nonetheless, small farm dams are present and may impact local hydrology, particularly by modifying runoff and storage patterns. While not directly modelled, their effects are likely implicitly represented through calibration where data are available. These farm dams have a widespread and growing impact on water availability across Australian agricultural regions and can significantly reduce downstream flows, particularly during dry years (Peña-Arancibia et al., 2023; Malerba et al., 2021).
  
  - Peña-Arancibia, J. L., M. E. Malerba, N. Wright and D. E. Robertson (2023). "Characterising the regional growth of on-farm storages and their implications for water resources under a changing climate." Journal of Hydrology 625: 130097.
  
  - Malerba, M. E., N. Wright and P. I. Macreadie (2021). "A Continental-Scale Assessment of Density, Size, Distribution and Historical Trends of Farm Dams Using Deep Learning Convolutional Neural Networks." Remote Sensing 13(2): 319.
  Comment: L492: One advantage of land surface or conceptual hydrological models compared to LSTM models is that they can output various other hydrological variables in addition to streamflow. Please consider adding this to the list.
  Response: We agree and will add that conceptual and land surface models can output additional hydrological variables.
  Typos:
  L62: Some kind of sentence separation is needed between “streamflow” and “for instance”.
  L114: Missing “in” or similar between “performance” and ”218 catchments”.
  L133: In the introduction it says that outputs are available from 1910.
  L138: Missing closing parenthesis.
  L152: Missing “for catchments” or similar between “)” and “that have been”.
  L154: Rephrase “covering from”.
  L157: “is produce” is missing a “d”.
  L263: Missing “the” between “from” and “calibration”.
  L469: Small “b” for “behavior”.
  L472: “strongly performing”?
  Response: Thank you for identifying these typographical issues. We will correct all of them in the revised manuscript.
  
  Citation: https://doi.org/10.5194/egusphere-2025-805-AC1
RC2:
'Comment on egusphere-2025-805', Anonymous Referee #2, 05 Jun 2025

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-805/egusphere-2025-805-RC2-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2025-805-RC2
- AC2: 'Reply on RC2', Ashkan Shokri, 16 Jul 2025
  
  We appreciate the reviewer’s careful reading and constructive feedback. We address each comment below and describe the revisions we will incorporate into the manuscript.
  Overview:
  Comment: This manuscript compares a standalone LSTM (LSTM-C) and an LSTM that includes simulated streamflow from a land surface model (AWRA-L) as an additional dynamic input (LSTM-QC). The two LSTM-based models are additionally benchmarked against AWRA-L and GR4J, a conceptual hydrological model widely used in an Australian context. They compare the models using three different cross-validation strategies, evaluating the ability of the models to predict temporally out-of-sample, spatially out-of-sample, and spatiotemporally out-of-sample. Overall, they show that the two LSTM models outperform AWRA-L and GR4J in most catchments under all three cross-validation strategies, although there are some notable exceptions. The authors discuss the potential relevance of their study in three real-world applications, namely historical reconstruction, predictions in ungauged basins, and simulating hydrological change under climate change projections.
  Response: We thank you for the clear and thoughtful summary of our manuscript. Your overview accurately captures the key elements of our study and reflects our intent to assess the relative performance of different modelling approaches under varying out-of-sample conditions. We hope the revisions we have made in response to your detailed comments further clarify the study’s motivations, methodology, and implications.
  Main comments
  Comment: 1. As far as I can tell, LSTM-C and LSTM-QC are identical in every respect except for the inclusion of streamflow from AWRA-L as an additional dynamic input in LSTM-QC. I therefore question whether this paper is really testing whether the LSTM can correct the AWRA-L output, or whether the AWRA-L output provides any additional information content that can be leveraged by the LSTM architecture. In effect these two possibilities amount to the same thing, but the paper would benefit from emphasising one over the other. In my opinion, the latter characterisation more accurately reflects what the LSTM is actually doing.
  Response: We agree with the reviewer that the distinction between correcting AWRA‑L output versus leveraging its additional information content is important, and we appreciate this opportunity to clarify the framing. In the revised manuscript, we will:
  
  (1) Recast our framing in the Introduction to emphasise that our primary goal is to assess the information content of AWRA‑L, rather than to correct its outputs. We will clearly state that LSTM‑QC and LSTM‑C differ only by the inclusion of AWRA‑L streamflow, and that our comparison is designed to quantify the added value of that information.
  
  (2) Update the Discussion by re-labelling Figures 4 and 7 to highlight “Information Gain from AWRA‑L” rather than simply “Performance Difference.”
  
  (3) Revise the Conclusions to reinforce that the contribution of this work is not in bias correction, but in evaluating the informational value of a process-based model (AWRA‑L) when used as an input to a data-driven model.
  Comment: 2. The results of the comparison with LSTM-C (i.e. the LSTM model that does not include the AWRA-L streamflow as a dynamic input) are undermined by the limited number of static catchment attributes that are supplied to the model (Table 1). A large number of studies have shown that LSTM models perform best when they are trained across many catchments at once using catchment descriptors that adequately describe the physical diversity of catchments in the training set. For example, Kratzert et al. (2019) train an LSTM on static catchment attributes that include soils, climate, vegetation, topography, and geology. Here, the authors have selected attributes that broadly cover climate and geomorphology, but discard a large number of attributes from CAMELS-AUS that are potentially highly influential in determining the hydrological behaviour of Australian catchments (e.g. geology, land cover). Presumably, in common with most land surface models, AWRA-L is parameterised using land cover and geological data. Therefore, I believe it is at least a possibility that LSTM-QC is utilising the information on catchment diversity that is encoded in the AWRA-L output but which has been arbitrarily excluded from LSTM-C. It seems to me that LSTM-C is trained in a way that is inconsistent with our current understanding of how best to use this class of model for hydrological simulation, raising doubts about whether it is a fair comparison.
  Response: Thank you for your thoughtful comment. We agree that the choice of static catchment attributes can significantly affect LSTM model performance. The selection of static features in our study was deliberate and guided by both performance considerations and data availability in operational or ungauged settings.
  
  As noted in Kratzert et al. (2019), they also did not use the full set of static attributes available in CAMELS. Instead, they selected 27 features as a subset of those explored by Addor et al. (2017), focusing on variables derivable from remote sensing or nationally available datasets. Similarly, in our case, we explored a wide range of attributes available in CAMELS-AUS and found through systematic testing that a subset of 12 variables consistently improved model performance across catchments. These variables cover key aspects of climate and geomorphology and were chosen to ensure applicability in real-world, data-limited contexts.
  
  We also deliberately excluded static variables derived from streamflow to avoid highly correlated predictors, and we omitted attributes that are difficult to estimate reliably for ungauged basins. This aligns with our focus on creating a parsimonious and operationally feasible model.
  Minor comments
  Comment: 1. L27: “The ubiquity of these model predictions...”- are you referring to the spatial coverage or widespread use? Please clarify.
  Response: Thank you for pointing this out. In this context, "ubiquity" refers primarily to the widespread spatial coverage of land surface model predictions across large regions, often at continental or global scales. We will revise the sentence to clarify this and avoid ambiguity. For example:
  
  "The widespread spatial coverage of these model predictions often trades off against accuracy..."
  Comment: 2. L33: It’s worth pointing out that most land surface models were not originally designed to predict streamflow, but rather to provide the lower boundary condition to Earth system models.
  Response: We agree that many land surface models were originally developed to provide lower boundary conditions for Earth system models rather than for direct streamflow prediction. However, we would like to clarify that, unlike most other land surface models, AWRA-L was specifically developed for water balance estimation and runoff prediction across Australia, with a focus on hydrological applications rather than atmospheric coupling. We will clarify this distinction in the revised manuscript to avoid conflating AWRA-L with more typical land surface models used in Earth system modelling.
  Comment: 3. L39-48: I agree that the lack of channel routing and calibration scheme are weaknesses of AWRA-L with respect to streamflow simulation, but is a lack of process understanding not also a weakness?
  Response: We agree that understanding and simulating key processes adds to confidence in a model, and that the sometimes-imperfect representation of these processes in AWRA-L (as well as the total lack of such processes in LSTMs) may contribute to a lack of confidence in these models. We will update the manuscript to explicitly acknowledge the lack of process understanding as an additional limitation of AWRA-L, alongside the issues of channel routing and calibration.
  Comment: 4. L62: Punctuation needed.
  Response: Thank you for pointing this out. We will revise the sentence for clarity and correct punctuation as follows:
  
  “Apart from physically based approaches for representing routing, several methods have been developed applying machine learning to estimate streamflow. For instance, Nagesh Kumar et al. (2004) used a feedforward Artificial Neural Network to estimate monthly flow time series of a single river.”
  Comment: 5. L73: A third advantage is that they are unconstrained by physical laws such as mass balance, so they are better able to implicitly correct biases in the input data. In land surface models, uncertainty in the input will propagate to the output.
  Response: Thank you for the suggestion. We agree and will incorporate this point into the revised manuscript. The paragraph will be updated as follows:
  
  “… A third advantage is that they are not constrained by physical laws such as mass balance, which allows them to implicitly correct biases in the input data. In contrast, in land surface models, uncertainty in the inputs typically propagates directly to the outputs.”
  Comment: 6. L85-93: This passage is not particularly relevant to the topic in hand. As the introduction is already quite long it could be safely removed.
  Response: Thank you for the suggestion. We will revise the text to ensure the introduction remains focused and appropriately scoped.
  Comment: 7. L98: “...as we show in the current study...”- This would seem to pre-empt the results.
  Response: Thank you for pointing this out. We agree that the phrase pre-empts the results and will remove “as we show in the current study” to maintain a more neutral tone in the introductory text.
  Comment: 8. L99-100: Arguably deficiencies in routing and bias in individual catchments amount to the same thing. Perhaps you could clarify what you mean here?
  Response: While both routing deficiencies and catchment-specific biases contribute to model error, they stem from different sources and have distinct implications for model behaviour. Routing deficiencies primarily affect the timing and shape of the hydrograph (e.g., delayed or premature peak flows), whereas biases typically refer to systematic over- or underestimation of flow magnitude, independent of timing. This distinction is important for diagnosing model limitations. We will update the manuscript to more clearly articulate this difference.
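  The distinction drawn in this response can be illustrated with a minimal sketch. It is not the diagnostic used in the manuscript: the function names `percent_bias` and `best_lag`, the toy hydrograph, and the use of lagged Pearson correlation as a timing measure are all assumptions made for illustration.

```python
def percent_bias(sim, obs):
    """Systematic over- or underestimation of total flow, in percent."""
    return 100.0 * (sum(sim) - sum(obs)) / sum(obs)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lag(sim, obs, max_lag=3):
    """Lag k that maximises the correlation between sim[t] and obs[t + k].

    A positive k means the simulated hydrograph leads the observations
    (peaks arrive too early); a negative k means it lags behind them.
    """
    best_k, best_r = 0, -2.0
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            s, o = sim[: len(sim) - k], obs[k:]
        else:
            s, o = sim[-k:], obs[:k]
        r = pearson(s, o)
        if r > best_r:
            best_k, best_r = k, r
    return best_k


# Toy daily hydrograph (hypothetical values, arbitrary units).
obs = [1, 1, 2, 8, 4, 2, 1, 1, 1, 1]
# Routing-type error: same volume, but the peak arrives two days early.
sim_early = [2, 8, 4, 2, 1, 1, 1, 1, 1, 1]
# Magnitude bias: correct timing, flows overestimated by 50 %.
sim_biased = [1.5 * q for q in obs]
```

  In this toy case, `sim_early` has essentially zero bias but a non-zero best lag, whereas `sim_biased` has a best lag of zero but a large bias. That is the sense in which the two error types are treated as separate.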
  Comment: 9. L122: I’m not sure it is particularly easy to test the ability of the model to perform well under climate change projections, because it is likely that the range of input values in the climate projections will exceed those the LSTM would encounter in the training set. Thus, what you really ought to be testing is the ability of the model to extrapolate, but I’m not sure the experimental design achieves this at present.
  Response: Thank you for raising this important point. We agree that true testing under climate change conditions requires the model to extrapolate beyond the historical range of climate inputs, which is inherently challenging. While our experimental setup does not fully replicate future climate scenarios, the Temporally and Spatially out of Sample (TSooS) experiment partially addresses this concern: in two of the four folds, the model is trained on data from 1975–1995 and evaluated in 2000–2014, which introduces a degree of extrapolation in both space and time. However, we acknowledge that the future climate may involve more extreme conditions than those seen in our historical training period. To better reflect this limitation, we will revise the manuscript to refer to this analysis as a “proxy for climate projection capabilities,” rather than implying direct applicability to future climate conditions.
  Comment: 10. L156: You could acknowledge here that using multiple precipitation datasets can enhance LSTM performance (Kratzert et al., 2021).
  Response: Thank you for the suggestion. While previous studies such as Kratzert et al. (2021) have shown that using multiple precipitation datasets can enhance LSTM performance, in our case the AGCD (formerly known as AWAP) and SILO datasets share many common rain gauges and differ mainly in processing methods. As a result, incorporating both datasets does not provide additional independent information or improve model performance.
  Comment: 11. L175: Please clarify that you are referring to the hidden state size here.
  Response: Thank you for the comment. We will clarify in the text that the reference is specifically to the hidden state size.
  Comment: 12. L221: You describe the static and dynamic predictors, but not the target (i.e. streamflow). Please could you describe your treatment of the target variable (e.g. do you normalize by catchment area)?
  Response: Good observation. The target is the gauged daily streamflow observation provided in the CAMELS-AUS dataset, normalized by catchment area and expressed in mm. We will clarify this in the text.
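  As a hedged illustration of what normalizing streamflow by catchment area involves, the sketch below converts a daily flow volume or a mean discharge into a depth in mm over the catchment. The function names and the input units (ML/day and m³/s) are assumptions for this example; the response only states that the target is area-normalized and expressed in mm.

```python
def ml_per_day_to_mm(volume_ml_per_day, area_km2):
    """Convert a daily flow volume in megalitres to a depth in mm per day.

    1 ML = 1000 m^3, 1 km^2 = 1e6 m^2, 1 m = 1000 mm, so the conversion
    reduces to volume_ml_per_day / area_km2.
    """
    volume_m3 = volume_ml_per_day * 1000.0
    area_m2 = area_km2 * 1e6
    return volume_m3 / area_m2 * 1000.0


def m3_per_s_to_mm(discharge_m3_per_s, area_km2):
    """Convert a mean daily discharge in m^3/s to a depth in mm per day."""
    seconds_per_day = 86400.0
    volume_m3 = discharge_m3_per_s * seconds_per_day
    area_m2 = area_km2 * 1e6
    return volume_m3 / area_m2 * 1000.0


# Hypothetical catchments, used only to exercise the functions.
depth_a = ml_per_day_to_mm(500.0, 250.0)  # 500 ML/day over 250 km^2
depth_b = m3_per_s_to_mm(10.0, 864.0)     # 10 m^3/s over 864 km^2
```

  Expressing the target as a depth puts large and small catchments on a comparable scale, which matters when a single model is trained across many catchments at once.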
  Comment: 13. L225: Please could you confirm that the two LSTM models are identical in every respect except for the inclusion of AWRA-L streamflow in LSTM-QC?
  Response: Yes, we can confirm that the LSTM models are identical in every respect except for the use of AWRA-L runoff as a predictor in LSTM-QC. We will state this explicitly in the text.
  Comment: 14. L240: You say this is important but not that you actually do it. We later find out that you have, although this information is in the results section. Please consider moving 3.1.3 to 2.5.2.
  Response: Thank you for pointing this out. We will move the explanation about the method from Section 3.1.3 to Section 2.5.2.
  Comment: 15. L245: In general I think the training approach for LSTMs is well established and so you don’t need to go into so much detail here. The text could also be shortened by using scientific notation (e.g. Section 2.3 of (Lees et al., 2021)). Typically when training an LSTM there will be a training period, a validation period (that is used during training to test each parameter set) and a hold-out test period. However, Table 2 only details a training and validation period. Please could you clarify whether the model is tested on an unseen dataset?
  Response: We appreciate this useful suggestion.
  
  Regarding the first point: In the revised manuscript, we will shorten this section and adopt the scientific shorthand style recommended (e.g., as in Lees et al., 2021, Section 2.3).
  
  Regarding the hold‑out test period: You are correct that Table 2 currently lists only training and validation periods. In fact, we also employ a separate hold‑out test period. We use the period 2014–2022 as our hold‑out period and will add a more explicit reference to this in the Methods.
  Comment: 16. L265: This needs some clarification. I think it is feasible (i.e. it could be done under the experimental setup) but not meaningful, because in a real out-of-sample situation you would not have any data to conduct fine-tuning.
  Response: Thank you for the comment. We agree that the term “feasible” may be misleading in this context. While fine-tuning is technically possible within the experimental setup, it would not be meaningful in a true out-of-sample scenario where no data from the target catchment would be available for adjustment. We will revise the manuscript accordingly and replace “feasible” with “realistic” to better reflect the intention. The revised sentence will read:
  
  “Fine-tuning for individual catchments would not be realistic in a true out-of-sample scenario, as no target catchment data would be available for adjustment.”
  Comment: 17. L285: I can see the argument for including GR4J in the model comparison, but I wonder whether it would be better to only use it in the TooS test. I would argue that by including GR4J in the SooS and TSooS tests you are really testing the parameter regionalization scheme, which is not really the main focus of the manuscript.
  Response: While we understand the concern, we maintain that testing GR4J under the SooS and TSooS setups aligns with one of our primary objectives, which is to evaluate model performance in regionalization scenarios. Including GR4J across all experimental setups allows for a consistent benchmark against a widely used conceptual model, noting that GR4J is also used in out-of-spatial-sample prediction in Australia. This helps illustrate the value of LSTM models in both interpolation and extrapolation contexts.
  Comment: 18. Figure 4/5/6: Your description of the results would benefit from using subplot labels, so the reader knows what they should be looking at.
  Response: Thank you for the suggestion. We will add subplot labels to Figures 4, 5, and 6 to improve clarity and help readers more easily follow the results.
  Comment: 19. L373: Notwithstanding my previous point about climate projections, I’m not sure why this is categorised as TSooS rather than TooS?
  Response: Thank you for your comment. The reason this is categorized as TSooS rather than TooS relates to the data partitioning strategy used for training and testing. In the TooS setup, the model is trained on all catchments but during one period, then tested on the same catchments during a different period. In contrast, TSooS is a stricter test where the model is trained on only half of the catchments for half of the overall time period and then tested on the remaining unseen catchments during the other half of the time. This means the model effectively trains on only about a quarter of the total data in TSooS, compared to about three-quarters in TooS. This distinction is important because TSooS better evaluates the model’s ability to generalize to completely new catchments and unseen periods, which may include extreme events like the millennium drought that are absent from the training data.
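  The partitioning described in this response can be sketched as follows. This is an assumed reconstruction, not the authors' code: the function name `tsoos_folds`, the deterministic split of the catchment list into two halves, and the hypothetical catchment IDs are illustrative. The two periods (1975–1995 and 2000–2014) are those given in the response to comment 9.

```python
from itertools import product


def tsoos_folds(catchments, early, late):
    """Build four spatiotemporal out-of-sample folds.

    Each fold trains on one half of the catchments during one period and
    tests on the other half during the other period, so roughly a quarter
    of the catchment-period data is used for training in any fold.
    """
    half = len(catchments) // 2
    group_a, group_b = catchments[:half], catchments[half:]
    catchment_splits = [(group_a, group_b), (group_b, group_a)]
    period_splits = [(early, late), (late, early)]

    folds = []
    for (train_c, test_c), (train_p, test_p) in product(
        catchment_splits, period_splits
    ):
        folds.append(
            {
                "train_catchments": train_c,
                "train_period": train_p,
                "test_catchments": test_c,
                "test_period": test_p,
            }
        )
    return folds


# Hypothetical catchment IDs; the study itself uses 218 catchments.
catchment_ids = ["c01", "c02", "c03", "c04", "c05", "c06"]
folds = tsoos_folds(catchment_ids, early=(1975, 1995), late=(2000, 2014))
```

  Two of the four folds train on the earlier period and test on the later one, which matches the description given in the response to comment 9. How the catchments are actually assigned to the two groups is not stated here, so the simple list split above is only a placeholder.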
  Comment: 20. L390: I’m not sure it is meaningful to compare with LSTM-C at short sequence lengths, as we already know that LSTMs require long sequence lengths to make good predictions.
  Response: We agree that it is already known that LSTM-C requires longer sequence lengths to perform well. However, we use LSTM-C here primarily as a baseline to demonstrate the added value of the routing component in LSTM-QC. Even at shorter sequence lengths, the difference in performance between LSTM-C and LSTM-QC highlights the extent to which AWRA-L is resolving routing processes. We believe this comparison offers important insights for AWRA-L users.
  Comment: 21. L481: This could arise because the LSTM training is suboptimal, as it has not been exposed to catchment attributes that may help it learn the hydrological behaviour in these regions.
  Response: We agree that suboptimal LSTM training due to limited exposure to catchment attributes is one possible explanation. However, we believe the poorer performance of LSTMs compared to GR4J in south-west Western Australia is more likely due to the more informative regionalisation scheme used by GR4J. This region is hydrologically distinct (Petrone et al., 2010; Hughes et al., 2012), and since GR4J’s regionalisation is weighted by inverse distance, it places greater emphasis on local catchments, whereas the LSTM does not. That said, it is possible that training the LSTM on a more global dataset might improve its performance in this region. We consider a thorough investigation of these hypotheses outside the scope of the current paper and intend to address them in future research.
  
  Petrone, K. C., J. D. Hughes, T. G. Van Niel, and R. P. Silberstein (2010), Streamflow decline in southwestern Australia, 1950–2008, Geophys. Res. Lett., 37, L11401, doi:10.1029/2010GL043102.
  
  Hughes, J. D., K. C. Petrone, and R. P. Silberstein (2012), Drought, groundwater storage and stream flow decline in southwestern Australia, Geophys. Res. Lett., 39, L03408, doi:10.1029/2011GL050797.
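
  The response to comment 21 attributes part of GR4J's regional skill to inverse-distance weighting. The sketch below shows generic inverse-distance weighting of donor-catchment values. It is not the regionalisation scheme used in the manuscript: the function names, the `power` exponent, and the donor distances are assumptions, and the response does not say whether parameters or simulated flows are weighted.

```python
def idw_weights(distances_km, power=1.0):
    """Normalised inverse-distance weights for a set of donor catchments.

    Distances must be strictly positive; closer donors get larger weights.
    """
    if any(d <= 0 for d in distances_km):
        raise ValueError("distances must be strictly positive")
    inverse = [1.0 / d ** power for d in distances_km]
    total = sum(inverse)
    return [w / total for w in inverse]


def idw_average(values, distances_km, power=1.0):
    """Inverse-distance weighted average of donor values."""
    weights = idw_weights(distances_km, power)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical donors at 10, 20 and 40 km from an ungauged target.
weights = idw_weights([10.0, 20.0, 40.0])
estimate = idw_average([1.0, 2.0, 3.0], [10.0, 20.0, 40.0])
```

  Because the nearest donor dominates the weighted average, a scheme of this kind naturally emphasises local behaviour in a hydrologically distinct region, which is the mechanism the response suggests.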
  
  Citation: https://doi.org/10.5194/egusphere-2025-805-AC2


Viewed

Total article views: 1,301 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,019	247	35	1,301	25	43


Views and downloads (calculated since 07 Apr 2025)

Month	HTML	PDF	XML	Total
Apr 2025	159	41	6	206
May 2025	46	15	0	61
Jun 2025	64	15	9	88
Jul 2025	49	25	4	78
Aug 2025	114	27	2	143
Sep 2025	429	38	2	469
Oct 2025	58	31	3	92
Nov 2025	48	20	5	73
Dec 2025	48	32	4	84
Jan 2026	4	3	0	7


Viewed (geographical distribution)

Total article views: 1,304 (including HTML, PDF, and XML) Thereof 1,304 with geography defined and 0 with unknown origin.

Latest update: 09 Jan 2026

Short summary

Predicting river flow accurately is crucial for managing water resources, especially in a changing climate. This study used deep learning to improve streamflow predictions across Australia. By either enhancing existing models or working independently with climate data, the deep learning approaches provided more reliable results than traditional methods. These findings can help water managers better plan for floods, droughts, and long-term water availability.

