Which strategy to improve the performances of an LSTM-based model for extreme stream temperature values?

Saadi, Mohamed; Guichard, Louis; Cognot, Gabrielle; Labbouz, Laurent; Roux, Hélène

doi:10.5194/egusphere-2025-3393

Preprints

https://doi.org/10.5194/egusphere-2025-3393

30 Jul 2025

Abstract. Deep-learning models have demonstrated strong performance in reproducing stream temperature dynamics, which is promising for the reconstruction of missing stream temperature records at ungauged locations. However, model accuracy over the range of high summer stream temperatures has usually been overlooked, raising the question of whether deep-learning methods are suitable during this crucial season. In this study, we investigated strategies to improve the performance of a stream-temperature model based on LSTM (Long Short-Term Memory) cells over the highest 10 % of observed values at 21 stations located in the Garonne river catchment. We quantified the gain in model performance obtained from regional multi-catchment training with static attributes, from exploiting hydrologically relevant variables, and from further penalizing the errors at extreme temperature values using custom loss functions. Our key results are: (1) Regional multi-catchment training is the best strategy to improve the performance of LSTM models not only over the top 10 % values but also over the whole range of observations. (2) The gain in performance came mainly from the use of static catchment and reach attributes. (3) Customizing the loss function to emphasize the model errors on extreme temperature values did not lead to significant gains in test performance. This study further confirms the suitability of well-trained LSTM models for extreme stream temperature values, offering significant advantages for water management in data-sparse regions during summer periods.

Received: 14 Jul 2025 – Discussion started: 30 Jul 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-3393', Anonymous Referee #1, 25 Aug 2025
General Comments
This manuscript addresses the important question of how to improve the performance of LSTM-based models in reproducing extreme stream temperature values. The study focuses on the Garonne river catchment and evaluates three strategies: (i) regional multi-catchment training, (ii) inclusion of static and hydrological variables, and (iii) adaptation of the loss function. The topic is timely and relevant, as accurate modelling of high stream temperatures is critical for ecological and water-management applications.
The paper is ambitious in scope, draws on a substantial dataset, and tests multiple modelling configurations. It has the potential to contribute meaningfully to the hydrological community by clarifying the role of regionalization and input design for extreme value prediction. However, the manuscript in its current form requires major revision before it can be considered for publication.
Key limitations include the exclusion of essential predictors (notably catchment air temperature and simple temporal features such as day of year or seasonality), an insufficiently clear description of how static variables are incorporated into the LSTM setup, and a narrow framing of the loss-function evaluation that limits the robustness of the conclusions. Together with issues of presentation and readability, these aspects reduce the impact and clarity of the work.
I therefore recommend major revisions. Addressing these issues (by streamlining presentation, clarifying the study’s novelty, incorporating or justifying the omission of key predictors, benchmarking against established methods, and refining both methodological detail and evaluation metrics) would substantially strengthen the manuscript and increase its value for the hydrological community.

Specific Comments
1. Presentation and readability
The manuscript is currently too long and dense, which makes it difficult to follow the main arguments.

The introduction could be reduced substantially (perhaps to a quarter of its current length), while clearly highlighting the novelty of this work relative to existing literature.

The description of data collection and preprocessing (e.g. GR6J modelling for discharge) is overly detailed and would be better placed in supplementary material.

Results sections 4.2 and 4.3 repeat exhaustive comparisons of all loss functions, which add little beyond the conclusion already drawn in section 4.1. This makes the results harder to interpret.

2. Input variables and methodological choices
A key omission is the absence of catchment-scale air temperature as a predictor. Station air temperature is a proxy that may suffice for small basins but is not adequate for larger catchments where thermal dynamics evolve along the river. This limitation likely explains why models using potential evapotranspiration perform comparatively well, as it implicitly represents catchment-scale air temperature.

No time-based features (e.g. day of year, seasonality) are included, even though prior work (e.g. Feigl et al., 2021, doi.org/10.5194/hess-25-2951-2021) has demonstrated their strong predictive value for stream temperature modelling. These features are straightforward to compute and do not require any additional external datasets. If the authors choose not to include them, it is important to provide a clear justification and to explain why the validity of their results and comparisons is not compromised.
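
As a minimal sketch of the kind of time-based feature this comment refers to, the Python snippet below encodes the day of year as a sine/cosine pair. The function name `day_of_year_features` and the choice of a cyclic encoding are illustrative assumptions; they are not taken from the manuscript or from Feigl et al. (2021).

```python
import math
from datetime import date


def day_of_year_features(d):
    """Encode the day of year of `d` as a (sin, cos) pair on one annual cycle."""
    # Use the actual year length so that leap years also map onto one full cycle.
    days_in_year = date(d.year, 12, 31).timetuple().tm_yday
    # Day 1 (1 January) maps to angle 0.
    angle = 2.0 * math.pi * (d.timetuple().tm_yday - 1) / days_in_year
    return math.sin(angle), math.cos(angle)


# Hypothetical usage: features for one summer day.
sin_doy, cos_doy = day_of_year_features(date(2025, 7, 30))
```

The two values could be appended to the dynamic inputs of each day. The cyclic form keeps 31 December and 1 January close together, which a raw day count does not.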

The rationale for testing so many sets of input variables is unclear, as this is not aligned with the stated research questions. Either the scope should be reduced or the research questions reframed.

Why are you predicting daily mean stream temperature values if the stated aim is to model extremes? Since extreme ecological and management impacts are often driven by peak daily temperatures, it would arguably be more appropriate to predict daily maxima rather than means. Please clarify the rationale for focusing on daily mean values, and discuss whether modelling daily maxima might be a more suitable target for assessing extreme conditions.

3. LSTM architecture and training details
The manuscript states that model performance was insensitive to the number of layers, cells, and batch sizes. This may be an artefact of using a very low learning rate (1e-4) combined with a high dropout rate (0.4). At minimum, additional tests with higher learning rates (e.g. 1e-3) and lower dropout values should be provided.

The description of how static attributes are incorporated into the LSTM is insufficient. Is it via concatenation at each timestep, embeddings, or as additional inputs to the final dense layer? Without this clarity, it is difficult to interpret results.
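
To make the first of these options concrete, the sketch below shows concatenation at each timestep in plain Python: the same static vector is appended to the dynamic inputs of every day in a sequence. The function name and the example values are hypothetical, and nothing here indicates which option the authors actually used.

```python
def concat_static_per_timestep(dynamic, static):
    """Append the same static attribute vector to every timestep of one sequence.

    dynamic: list of timesteps, each a list of time-varying inputs.
    static:  list of catchment/reach attributes that are constant in time.
    """
    # Build new lists so that the original inputs are left unchanged.
    return [list(step) + list(static) for step in dynamic]


# Hypothetical 3-day sequence with 2 dynamic inputs and 2 static attributes.
dynamic = [[18.2, 0.0], [19.5, 1.4], [21.1, 0.0]]
static = [152.0, 0.35]
lstm_input = concat_static_per_timestep(dynamic, static)
# lstm_input now has 3 timesteps with 4 features each.
```

With this option the LSTM sees the static attributes at every step. The alternatives named in the comment (embeddings, or feeding the attributes to the final dense layer only) would keep them out of the recurrent input.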

4. Loss function evaluation
The study introduces several “regional” loss functions but does not benchmark them against published alternatives (e.g. Kratzert et al., 2019, doi.org/10.5194/hess-23-5089-2019) or against a standard MSE baseline. The MSE baseline would be especially interesting, because stream temperature ranges differ much less between catchments than runoff does, so it is unclear how these differences affect regional training. The lack of such benchmarks limits the significance of the results.

Evaluation relies on MAE, which biases the study towards MAE-based loss functions and does not adequately reflect extreme-value performance. Since the research objective is specifically focused on extremes, an evaluation metric more sensitive to high values (e.g. RMSE, quantile-based metrics, or extreme value scores) would be more appropriate.

As currently presented, the main result is that custom loss functions did not improve performance. However, this may reflect the design of the evaluation rather than a fundamental limitation.
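
The difference between MAE and RMSE over the highest observed values, as raised in this comment, can be illustrated with a short Python sketch. The function `top_fraction_errors`, the rounding rule used to pick the top 10 %, and the example data are assumptions made for illustration; the manuscript's exact evaluation procedure is not described here.

```python
import math


def top_fraction_errors(obs, sim, fraction=0.10):
    """Return (MAE, RMSE) over the days with the highest observed values."""
    if len(obs) != len(sim) or not obs:
        raise ValueError("obs and sim must be non-empty and of equal length")
    # Number of days in the top fraction, rounded to the nearest integer (at least 1).
    n_top = max(1, int(round(fraction * len(obs))))
    # Rank days by observed value and keep the n_top warmest ones.
    top_idx = sorted(range(len(obs)), key=lambda i: obs[i], reverse=True)[:n_top]
    errors = [sim[i] - obs[i] for i in top_idx]
    mae = sum(abs(e) for e in errors) / n_top
    rmse = math.sqrt(sum(e * e for e in errors) / n_top)
    return mae, rmse


# Hypothetical data: 20 daily values, with the two warmest days underestimated.
obs = [float(t) for t in range(1, 21)]
sim = list(obs)
sim[-1] -= 3.0  # warmest day underestimated by 3 degC
sim[-2] -= 1.0  # second warmest day underestimated by 1 degC
mae, rmse = top_fraction_errors(obs, sim)
```

In this example the MAE is 2.0 °C while the RMSE is about 2.24 °C, because squaring gives the single 3 °C miss more weight than the 1 °C miss.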

Technical corrections
Abstract: Key results (1) and (2) appear redundant since regional modelling is inherently linked to extended static inputs. Please clarify.

Abstract: The phrase “well-trained LSTM” is vague; it would be better to define it relative to baseline approaches.

Line 126: Why do you need exactly a minimum of 2434 daily observations for 1 test year?

Line 127: “We call these 21 stations test station” – please clarify whether this refers to an ML-style train/validation/test split. Overall, it is not entirely clear to me how the data were split, in particular when the split is made along the time series and when it is made by station. Please state this more clearly.

Table 1: Consider presenting mean and range (min–max) values per train/test group instead of medians only, which are not necessarily more robust here.
Citation: https://doi.org/10.5194/egusphere-2025-3393-RC1
- AC1: 'Reply on RC1', Mohamed Saadi, 08 Nov 2025
  
  Dear Anonymous Referee #1,
  thank you very much for your comments and suggestions. Please find attached our detailed response.
  Best regards,
  Mohamed Saadi, on behalf of all co-authors.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3393-AC1
RC2:
'Comment on egusphere-2025-3393', Anonymous Referee #2, 25 Sep 2025

Please find my comments attached.

Citation: https://doi.org/10.5194/egusphere-2025-3393-RC2
- AC2: 'Reply on RC2', Mohamed Saadi, 08 Nov 2025
  
  Dear Anonymous Referee #2,
  thank you very much for your comments and suggestions. Please find attached our detailed answer.
  Best regards,
  Mohamed Saadi, on behalf of all co-authors.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3393-AC2

Viewed

Total article views: 1,408 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,238	127	43	1,408	43	51

Views and downloads (calculated since 30 Jul 2025)

Month	HTML	PDF	XML	Total
Jul 2025	35	3	3	41
Aug 2025	502	13	8	523
Sep 2025	429	8	6	443
Oct 2025	67	6	3	76
Nov 2025	64	12	6	82
Dec 2025	40	15	3	58
Jan 2026	33	30	10	73
Feb 2026	31	16	3	50
Mar 2026	37	24	1	62

Viewed (geographical distribution)

Total article views: 1,405 (including HTML, PDF, and XML). Thereof 1,405 with geography defined and 0 with unknown origin.

Latest update: 21 Mar 2026

Short summary

LSTM networks are excellent deep-learning tools to reproduce stream temperature observations, but their performances over the range of extreme (summer) stream temperature values have been overlooked. We close this gap by looking at strategies to improve the LSTM performances over the highest 10 % values of stream temperature observations. We found that the best strategy is to train the LSTM models at several locations with input variables that include static catchment and reach attributes.

