Four decades of full-depth profiles reveal layer-resolved drivers of reservoir thermal regimes and event-scale hypolimnetic warming

Mi, Chenxi; Gai, Bo; Kong, Xiangzhen; Jiang, Yuzhe; Chan, Chun Ngai; Rinke, Karsten

doi:10.5194/egusphere-2025-6442

Preprints

https://doi.org/10.5194/egusphere-2025-6442

Preprints

20 Feb 2026

| 20 Feb 2026

Four decades of full-depth profiles reveal layer-resolved drivers of reservoir thermal regimes and event-scale hypolimnetic warming

Chenxi Mi, Bo Gai, Xiangzhen Kong, Yuzhe Jiang, Chun Ngai Chan, and Karsten Rinke

Abstract. Thermal structure shapes ecological dynamics in lakes and reservoirs. Yet full-profile temperature records over multi-decades remain scarce, constraining mechanistic understanding of depth-resolved thermal changes and subseasonal extremes (e.g., surface heat waves and late-season hypolimnetic warming). In this study, we focused on Rappbode Reservoir—Germany’s largest drinking-water reservoir—and compiled four decades of high-resolution, full-depth temperature profiles with concurrent hydro-meteorological records that are rarely available for stratified systems. Building on these data, we developed a novel two-step analytical framework that integrates long-term monitoring and process-based modelling to yield a high-resolution, internally consistent dataset of spatiotemporal temperature dynamics. We then applied interpretable machine learning to quantify dominant external controls on depth-specific stratification dynamics and determine causal mechanisms governing late-stratification hypolimnetic warming. Our results suggested that influence of external drivers on the thermodynamic structure varied markedly with depth and stratification phase: stratification-strength metrics governed by atmospheric heat fluxes (i.e., surface temperature, vertical temperature difference, Schmidt stability) were controlled mainly by 30-day antecedent shortwave radiation and air temperature. For hypolimnetic temperatures and mixed-layer depth, outflow discharge turned out to be the primary driver during late stratification. Further analysis indicated that episodic hypolimnetic warming up to 10 °C in four specific years was mainly triggered by intensified deep withdrawals that weakened the density gradient and shortened the compensatory-flow pathway. The dual-perspective framework developed here—integrating process-based and machine-learning approaches—is broadly transferable for analyzing ecological processes and supporting evidence-based management in stratified waters.

Received: 23 Dec 2025 – Discussion started: 20 Feb 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 2792 KB)

Supplement (740 KB)

Download & links

Chenxi Mi, Bo Gai, Xiangzhen Kong, Yuzhe Jiang, Chun Ngai Chan, and Karsten Rinke

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-6442', Salim Heddam, 17 Mar 2026
The present study applies two types of modelling approaches namely the CE-QUAL-W2 numerical model and a machine learning model based on XGBoost to predict the thermal structure of a reservoir. The machine learning model is trained using data generated from the CE-QUAL-W2 simulations. The modelling framework is clearly structured, with the ML model used to predict temperature-related variables at several depths, including epilimnetic (5 m), metalimnetic (15 m), hypolimnetic (30 m), and bottom (50 m) temperatures, as well as key thermal structure indicators such as Schmidt stability, bottom-to-surface temperature difference, and mixed layer depth. Based on certain assumptions, the authors restricted the analysis to the May-October period, which corresponds to the stratified season of the reservoir. A set of meteorological variables was used as predictive features, and 30-day moving average predictors were introduced to represent the cumulative influence of atmospheric and hydraulic forcing on thermal conditions across different depths.

Although the topic of the manuscript is relevant and demonstrates a satisfactory degree of originality, the overall organization of the paper particularly the Results section does not yet meet the scientific standards expected for publication in a high-quality journal. Substantial improvements are required before the manuscript can be considered for publication. In its current form, the study requires major revision. The following issues should be addressed carefully by the authors:

Lack of baseline comparison. It is neither appropriate nor convincing to present a machine learning prediction framework without a solid baseline for comparison. The present study relies essentially on a single ML model, which makes the analysis incomplete and weak from a methodological standpoint. Additional models (e.g., classical regression models or alternative ML approaches) should be included to provide a meaningful benchmark.

Absence of comprehensive result tables. The manuscript does not contain adequate tables summarizing the performance of the models. For a study focused on machine learning applications, detailed quantitative comparisons presented in well-structured tables are essential.

Insufficient description of the dataset. A much clearer presentation of the dataset used for model development is required. This should include:

Detailed description of input features and output variables

Descriptive statistical analysis

Explanation of the data splitting strategy (training and testing datasets)

Distributional comparisons between training and testing sets using Kolmogorov-Smirnov tests (empirical CDF comparison)

Jensen–Shannon distance comparisons between training and testing sets

Energy distance comparisons to further evaluate distributional consistency.

Hyperparameter specification. The hyperparameters used in the ML models should be clearly reported and justified to ensure reproducibility.

Clear separation of training and testing results. Model performance should be presented distinctly for both training and testing datasets, using appropriate evaluation metrics and visualizations.

Need for a methodological flowchart. A clear flowchart summarizing the overall framework of the study from data preparation to model development, validation, and interpretation should be provided.

Incomplete interpretability analysis. Since the authors devoted a significant portion of the manuscript to SHAP-based analysis (SHAP), both global and local interpretability analyses should be conducted. Currently, the local interpretability component is missing.

Additional explainability analyses. To improve the understanding of feature importance and interactions, the study should incorporate complementary explainability techniques, including Permutation Feature Importance (PFI) for global ranking, Partial Dependence Plots (PDP) for marginal effects, and Individual Conditional Expectation (ICE) plots to capture response heterogeneity. Including these analyses would significantly strengthen the scientific contribution of the work.

Statistical comparison of models. Once a robust baseline comparison is established among several models, the authors should perform appropriate statistical tests such as the Kruskal-Walli’s test and the Diebold-Mariano test to assess whether differences in predictive performance between models are statistically significant.

In summary, while the research idea is interesting and relevant, the manuscript requires substantial methodological, analytical, and presentation improvements before it can be considered suitable for publication. A thorough revision addressing the points outlined above is strongly recommended.
Citation: https://doi.org/10.5194/egusphere-2025-6442-RC1
- CC1: 'Reply on RC1', Chenxi Mi, 17 Mar 2026
  
  We highly appreciate the reviewer’s important comments regarding benchmarking, dataset description, result presentation, and reproducibility. We are currently preparing a comprehensive point-by-point response and corresponding manuscript revisions, and we will address these issues in detail in a structured reply together with the remaining comments.
  Chenxi
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-CC1
- AC1:
  'Reply on RC1', Xiangzhen Kong, 20 Apr 2026
  Reply on RC1
  The present study applies two types of modelling approaches namely the CE-QUAL-W2 numerical model and a machine learning model based on XGBoost to predict the thermal structure of a reservoir. The machine learning model is trained using data generated from the CE-QUAL-W2 simulations. The modelling framework is clearly structured, with the ML model used to predict temperature-related variables at several depths, including epilimnetic (5 m), metalimnetic (15 m), hypolimnetic (30 m), and bottom (50 m) temperatures, as well as key thermal structure indicators such as Schmidt stability, bottom-to-surface temperature difference, and mixed layer depth. Based on certain assumptions, the authors restricted the analysis to the May-October period, which corresponds to the stratified season of the reservoir. A set of meteorological variables was used as predictive features, and 30-day moving average predictors were introduced to represent the cumulative influence of atmospheric and hydraulic forcing on thermal conditions across different depths. Although the topic of the manuscript is relevant and demonstrates a satisfactory degree of originality, the overall organization of the paper particularly the Results section does not yet meet the scientific standards expected for publication in a high-quality journal. Substantial improvements are required before the manuscript can be considered for publication. In its current form, the study requires major revision. The following issues should be addressed carefully by the authors:
  Response: We sincerely thank the reviewer for the careful and constructive evaluation of our manuscript. We appreciate the reviewer’s positive assessment of the relevance and originality of the study, as well as the clear identification of aspects that need to be strengthened. The reviewer’s comments mainly concern methodological benchmarking, transparency of the dataset and model setup, and the depth of the interpretability analysis. These are valuable points, and we agree that the current version can be substantially improved in these respects. In the revised manuscript, we will address these concerns through targeted methodological revisions and clearer presentation of the results.
  
  Lack of baseline comparison. It is neither appropriate nor convincing to present a machine learning prediction framework without a solid baseline for comparison. The present study relies essentially on a single ML model, which makes the analysis incomplete and weak from a methodological standpoint. Additional models (e.g., classical regression models or alternative ML approaches) should be included to provide a meaningful benchmark.
  
  Response: We agree that the current manuscript does not yet provide an adequately developed benchmark for the XGBoost framework. Our original intention was not to conduct an exhaustive algorithm-comparison study, but rather to build an interpretable process-ML attribution framework on top of the CE-QUAL-W2 outputs. We fully agree that the performance of XGBoost should be better contextualized against alternative baseline models.
  In the revised manuscript, we will add a benchmark suite comprising a regression-based baseline together with selected nonlinear machine-learning baselines, such as Multivariate Adaptive Regression Splines (MARS), Random Forest (RF), and Support Vector Machine (SVM). These models are selected because they represent complementary nonlinear modeling strategies and provide a balanced benchmark for tabular environmental predictors. They will be trained and evaluated using the same predictor set, the same data partitioning scheme, and the same overall evaluation framework as applied to the XGBoost. This model suite will provide a representative and methodologically balanced basis for evaluating water temperature dynamics and stratification metrics, while retaining the primary focus of this manuscript on interpretable driver attribution of hypolimnetic warming.
  
  Absence of comprehensive result tables. The manuscript does not contain adequate tables summarizing the performance of the models. For a study focused on machine learning applications, detailed quantitative comparisons presented in well-structured tables are essential.
  
  Response: Agree! The current manuscript relies heavily on figures and does not yet provide sufficiently compact quantitative summaries of model performance. In the revised manuscript, we will add structured summary tables reporting the main performance metrics for both the process-based and machine-learning components. Specifically, we will add structured summary tables reporting (i) CE-QUAL-W2 performance against observations and (ii) the performance of XGBoost and the selected benchmark models for each target variable, with training and testing results presented separately. We will also include a dedicated table summarizing the optimized XGBoost hyperparameters to improve methodological transparency.
  
  Insufficient description of the dataset. A much clearer presentation of the dataset used for model development is required. This should include:
  
  3.1. Detailed description of input features and output variables
  Descriptive statistical analysis
  
  Explanation of the data splitting strategy (training and testing datasets)
  
  Distributional comparisons between training and testing sets using Kolmogorov-Smirnov tests (empirical CDF comparison)
  
  Jensen–Shannon distance comparisons between training and testing sets
  
  Energy distance comparisons to further evaluate distributional consistency.
  
  Response: We fully agree with the reviewer that the description of the machine-learning dataset should be more systematic and transparent. Although the current manuscript defines the main predictors and targets, it does not yet present them in a sufficiently structured way. In the revised manuscript, we will expand the dataset description in the Methods and Supplement. This revision will include: 1) descriptive statistics of the variables used for model development; 2) an explicit explanation of the train/test splitting strategy; and 3) supplementary diagnostics comparing the distributions of the training and testing subsets. In addition, to address the reviewer's specific concern about distributional consistency, we will add formal train/test comparisons using complementary metrics, including a Kolmogorov-Smirnov-based comparison and information-distance metrics such as Jensen-Shannon divergence and energy distance. These diagnostics will be presented in the Supplementary Materials to support the transparency and robustness assessment of the ML models.
  
  3.2. Hyperparameter specification. The hyperparameters used in the ML models should be clearly reported and justified to ensure reproducibility.
  Response: Agree! The original manuscript reports the Bayesian optimization and the parameters for early stopping, but the hyperparameter settings are not yet clearly reported. Following the suggestion, in the revised manuscript we will add a table listing the final hyperparameter values used for the XGBoost models.
  
  3.3. Clear separation of training and testing results. Model performance should be presented distinctly for both training and testing datasets, using appropriate evaluation metrics and visualizations.
  Response: We agree. This distinction is important, especially in a manuscript centered on machine-learning performance and interpretation. Following the comment, we will report training and testing results separately for the XGBoost and the additional benchmark models, and the revised text and tables will clearly distinguish between training-set and testing-set performance.
  
  3.4. Need for a methodological flowchart. A clear flowchart summarizing the overall framework of the study from data preparation to model development, validation, and interpretation should be provided.
  Response: We agree that the methodological workflow can be made more explicit. The current Fig. 1 already illustrates the coupling between long-term observations, CE-QUAL-W2, and XGBoost/SHAP, but its role as a stepwise workflow is not yet fully evident. In the revised version, we will redesign and clarify the workflow figure so that the sequence from long-term monitoring and process-based simulation to data preparation, model training/testing, and SHAP-based interpretation is more directly visible to the reader.
  
  3.5. Incomplete interpretability analysis. Since the authors devoted a significant portion of the manuscript to SHAP-based analysis (SHAP), both global and local interpretability analyses should be conducted. Currently, the local interpretability component is missing.
  Response: We agree that the interpretability analysis should more clearly include both global and local perspectives. In the current manuscript, the emphasis is mainly on global SHAP rankings, although Fig. 9d already moves toward sample-specific behavior for key late-stratification drivers. To strengthen the local interpretability component, we will add explicit local SHAP analyses for representative cases, focusing on the late-stratification bottom-water temperature model. These examples will help demonstrate how specific predictor values contribute to individual model outputs at the event scale.
  
  Additional explainability analyses. To improve the understanding of feature importance and interactions, the study should incorporate complementary explainability techniques, including Permutation Feature Importance (PFI) for global ranking, Partial Dependence Plots (PDP) for marginal effects, and Individual Conditional Expectation (ICE) plots to capture response heterogeneity. Including these analyses would significantly strengthen the scientific contribution of the work.
  
  Response: We appreciate this constructive suggestion and agree that complementary explainability analyses can further strengthen the interpretation of the model results. In the revised manuscript, SHAP will remain the primary interpretability framework because it provides a coherent basis for both global and local attribution, which is central to the mechanistic questions addressed in this study. To complement the SHAP-based analysis, we will incorporate selected complementary explainability techniques, such as permutation feature importance (PFI) as an additional global-importance diagnostic, together with selected partial dependence plots (PDPs) and individual conditional expectation (ICE) plots for the dominant predictors and the most mechanistically relevant target variables. At the same time, we note that permutation-based and SHAP-based importance measures characterize feature relevance from different perspectives and therefore are not expected to produce identical rank orders, particularly for partially redundant predictors such as same-day and antecedent forcing variables. Accordingly, PFI will be used here as a complementary robustness check, and interpreted primarily at the level of dominant drivers and process categories rather than as a strict one-to-one validation of the SHAP ranking.
  To maintain the manuscript’s focus, these additional analyses will be conducted selectively on the models and drivers most central to our conclusions, especially those concerning depth‑specific thermal dynamics and late‑stratification hypolimnetic warming. Our aim is to assess whether the main physical conclusions remain consistent across explainability frameworks, especially the contrast between atmospheric control of upper-layer thermal structure and hydraulic control of late-stratification bottom-water warming. The main findings from these additional analyses will be summarized in the revised manuscript, with the full set of supporting plots provided in the supplementary materials.
  
  Statistical comparison of models. Once a robust baseline comparison is established among several models, the authors should perform appropriate statistical tests such as the Kruskal-Walli’s test and the Diebold-Mariano test to assess whether differences in predictive performance between models are statistically significant.
  
  Response: We fully agree with the reviewer's comment that, once benchmark models are added, the comparison should not rely solely on visual inspection or point estimates of performance metrics. After adding the benchmark models, we will formally compare predicated errors using paired statistical tests based on matched predictions. We will describe the selected testing framework clearly and use it to evaluate whether the differences in predictive skill are statistically meaningful. Our goal herein is not simply to identify the optimal algorithm, but to provide a transparent evaluation of whether XGBoost delivers a substantively and statistically advantage for the specific application addressed in this study.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-AC1
RC2:
'Comment on egusphere-2025-6442', Anonymous Referee #2, 18 Mar 2026
Mi and colleagues used a long data record of thermal profiles from Rappbode reservoir to fit a two-dimensional hydrodynamic model, and then used a machine learning to link patterns in temperature and stratification dynamics to potential causes. The study is well-written, clear, and the methodology is reasonably novel and could be applied outside the presented study case. The study suits the audience well and could eventually be published, but I have two major issues that need to be addressed first. I list them below, including some additional minor comments further down.
My first main comment relates to the aim of coupling the machine learning (ML) model XGBoost to the process-based (PB) model CE-QUAL-W2. The ML model is trained on the PB model and is employed specifically to investigate external drivers and causal mechanisms (e.g. L. 32-33). Using data-driven approaches to investigate causality can absolutely be done, but in the present paper, there is no evidence presented on the accuracy of this, although the authors do try to link the outputs from the ML model to processes in the Discussion. Moreover, this could be tested even better than in usual cases, because the PB model that the ML model tries to emulate, has the processes included by design. The authors rightfully argue that the ML-approach primarily gains upon PB scenario runs in terms of time (L. 465-466), but because the novelty of the approach, I argue that the conclusions from the ML model should be validated using PB scenarios first. Some of the causalities highlighted by the ML model (e.g. increasing importance of shortwave radiation with depth; importance of withdrawal volume for end-of-stratification deep-water temperature increase) can be explored with the PB model as well. This should be done in order to build trust in the accuracy of the presented method. If the conclusions between the two methods deviate, the underlying reason should be discussed. The current method has a strong risk of mixing correlation and causality (for instance in telling apart the influence of air temperature and shortwave radiation), and this risk should be mitigated by validation using the PB model. For example, the increasing importance of shortwave radiation with depth is surprising, and this validation could shed light on the reliability. This validation exercise could be presented in the supplementary material and referred to in the main text.
My second main comment relates to the lack of information on the calibration and validation, primarily of CE-QUAL-W2. What periods were used for training and validation, what parameters were adjusted, what were the calibration targets, what method was used? The terms “site-validated” (L. 124) and “site-calibrated CE-QUAL-W2 model” (L. 457) suggest that this was done before, but it is not clearly indicated where this information can be found. If a model from another publication is used, this should be clearly stated in the data availability statement.
Minor comments:
Title: “event-scale hypolimnetic warming” is not clear. Suggest to change to “episodic hypolimnetic warming” as you did in the abstract (or another term that you find more suitable).

L. 37-38: change to “by the 30-day antecedent moving average of shortwave radiation and air temperature” (or similar)

L. 40: change to ”episodic hypolimnetic warming by up to 10 °C…”

L. 42: use of “compensatory-flow pathway” not clear in this context

L. 56: I could not find information on the effect of vertical temperature structure on temperature-sensitive organisms in Carr et al., although some of the references in there might have looked at this. Please assess if this citation is appropriate or cite the original source if possible.

L. 72-73: The GLTC initiative is mentioned but not used further in the manuscript. Suggest to remove this (but keep the Sharma et al. reference and rest of the sentence).

L. 97-100: I think that “temperate” is a too broad term here to say that hypolimnetic temperatures are around 4-6 degrees. Boehrer & Schultze also do not support the generality of this rule for the temperate climate zone. Suggest to change to lakes that cool to the temperature of maximum density.

L. 108-110: Bouffard et al. (2013) does not seem to provide support for the general statement that biological and chemical reaction rates roughly double per 10 °C. Please find a more appropriate reference or change the statement to be more representative of the findings in that paper.

L. 112 (and elsewhere, such as L. 375): General editing comment that there should be spaces around the dash

L. 147-150: What data did you use for the depth-varying outflow discharge? Was this based on data from the reservoir operator?

L. 185: “an monitoring buoy” -> “a monitoring buoy”

L. 262-266 (and links to comment on L. 147-150): In this part you show an impressive R2 value for water level (0.99), but without knowing more about the data sources (see previous comment on L. 147-150) and optional calibration/validation (e.g. added or scaled inflows/outflows to better fit the water level) (see 2^nd main comment), it is difficult to judge the extent to which this builds trust in the model.

L. 276-279: Add something like “…at increasing depth” at the end of the sentence

L. 292: It is very important here that XGBoost was compared to W2, NOT to observations. The term “error” might be a confusing concept here, so I would underline this by explicitly saying “the RMSE between W2 and XGBoost”

L. 301: I wouldn’t describe the mixed depth as a “tight clustering along the 1:1 line”. It is better to quantify this by showing the R2 value (both in the figure and the text).

L. 311: I think “decreases”, “reduces” or “diminishes” are clearer terms than “attenuates” in this context.

L. 367: “combinng” -> “combining”

L. 371: “bridges an inter-decadal gap” could suggest a gap-filling exercise. I suggest to change the wording.

L. 376-377: the part “and the pan-European synthesis of temperate lakes” in not integrated with the rest of the sentence; please rewrite

L. 383: “enhances” -> “enhance”

L. 400: According to the equation and the provided units, shouldn’t the unit of thermal inertia be time per Kelvin, instead of only time? Currently the units do not check out (though I guess a ∆T of 1 K is assumed). I did a quick scan of Imberger & Patterson but could not find this formula; please reply with the equation number in Imberger & Patterson or alternatively how this equation and the units were derived from the reference.

L. 424-427: Can you please clarify how increased vulnerability to mixing events relate to the importance of deep-layer withdrawals?

L. 427: “event” -> “events”

L. 428-435: I missed the argument that a significantly lowered hypolimnetic volume if the mixed layer deepens. Water withdrawn from a smaller hypolimnion will more quickly lower the thermocline. This could be quantified using hypsographic information.

Figure 8: the x-axes are not aligned
Citation: https://doi.org/10.5194/egusphere-2025-6442-RC2
- CC2: 'Reply on RC2', Chenxi Mi, 19 Mar 2026
  
  We sincerely thank the referee for the careful and constructive assessment of our manuscript. We greatly appreciate the positive evaluation of the manuscript’s clarity, methodological novelty, and potential relevance to the HESS readership. We are also grateful for the two major comments, which highlight important aspects that require further clarification and strengthening.
  In particular, we acknowledge the reviewer’s concern regarding the interpretation of the XGBoost results in relation to external drivers and underlying mechanisms. We agree that additional analysis is needed to better demonstrate the extent to which the machine-learning-based interpretations are supported by the process-based model framework. We also appreciate the reviewer’s request for a more complete and transparent description of the calibration and validation of the CE-QUAL-W2 model, including the calibration/validation periods, target variables, adjusted parameters, and evaluation approach.
  We are carefully considering these points and will address them thoroughly in a revised version of the manuscript, together with the reviewer’s many helpful minor comments on terminology, wording, references, and figure presentation. A detailed point-by-point response will be provided during revision.
  We thank the referee again for the thoughtful and helpful review.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-CC2
- AC1:
  'Reply on RC1', Xiangzhen Kong, 20 Apr 2026
  Reply on RC1
  The present study applies two types of modelling approaches namely the CE-QUAL-W2 numerical model and a machine learning model based on XGBoost to predict the thermal structure of a reservoir. The machine learning model is trained using data generated from the CE-QUAL-W2 simulations. The modelling framework is clearly structured, with the ML model used to predict temperature-related variables at several depths, including epilimnetic (5 m), metalimnetic (15 m), hypolimnetic (30 m), and bottom (50 m) temperatures, as well as key thermal structure indicators such as Schmidt stability, bottom-to-surface temperature difference, and mixed layer depth. Based on certain assumptions, the authors restricted the analysis to the May-October period, which corresponds to the stratified season of the reservoir. A set of meteorological variables was used as predictive features, and 30-day moving average predictors were introduced to represent the cumulative influence of atmospheric and hydraulic forcing on thermal conditions across different depths. Although the topic of the manuscript is relevant and demonstrates a satisfactory degree of originality, the overall organization of the paper particularly the Results section does not yet meet the scientific standards expected for publication in a high-quality journal. Substantial improvements are required before the manuscript can be considered for publication. In its current form, the study requires major revision. The following issues should be addressed carefully by the authors:
  Response: We sincerely thank the reviewer for the careful and constructive evaluation of our manuscript. We appreciate the reviewer’s positive assessment of the relevance and originality of the study, as well as the clear identification of aspects that need to be strengthened. The reviewer’s comments mainly concern methodological benchmarking, transparency of the dataset and model setup, and the depth of the interpretability analysis. These are valuable points, and we agree that the current version can be substantially improved in these respects. In the revised manuscript, we will address these concerns through targeted methodological revisions and clearer presentation of the results.
  
  Lack of baseline comparison. It is neither appropriate nor convincing to present a machine learning prediction framework without a solid baseline for comparison. The present study relies essentially on a single ML model, which makes the analysis incomplete and weak from a methodological standpoint. Additional models (e.g., classical regression models or alternative ML approaches) should be included to provide a meaningful benchmark.
  
  Response: We agree that the current manuscript does not yet provide an adequately developed benchmark for the XGBoost framework. Our original intention was not to conduct an exhaustive algorithm-comparison study, but rather to build an interpretable process-ML attribution framework on top of the CE-QUAL-W2 outputs. We fully agree that the performance of XGBoost should be better contextualized against alternative baseline models.
  In the revised manuscript, we will add a benchmark suite comprising a regression-based baseline together with selected nonlinear machine-learning baselines, such as Multivariate Adaptive Regression Splines (MARS), Random Forest (RF), and Support Vector Machine (SVM). These models are selected because they represent complementary nonlinear modeling strategies and provide a balanced benchmark for tabular environmental predictors. They will be trained and evaluated using the same predictor set, the same data partitioning scheme, and the same overall evaluation framework as applied to the XGBoost. This model suite will provide a representative and methodologically balanced basis for evaluating water temperature dynamics and stratification metrics, while retaining the primary focus of this manuscript on interpretable driver attribution of hypolimnetic warming.
  
  Absence of comprehensive result tables. The manuscript does not contain adequate tables summarizing the performance of the models. For a study focused on machine learning applications, detailed quantitative comparisons presented in well-structured tables are essential.
  
  Response: Agree! The current manuscript relies heavily on figures and does not yet provide sufficiently compact quantitative summaries of model performance. In the revised manuscript, we will add structured summary tables reporting the main performance metrics for both the process-based and machine-learning components. Specifically, we will add structured summary tables reporting (i) CE-QUAL-W2 performance against observations and (ii) the performance of XGBoost and the selected benchmark models for each target variable, with training and testing results presented separately. We will also include a dedicated table summarizing the optimized XGBoost hyperparameters to improve methodological transparency.
  
  Insufficient description of the dataset. A much clearer presentation of the dataset used for model development is required. This should include:
  
  3.1. Detailed description of input features and output variables
  Descriptive statistical analysis
  
  Explanation of the data splitting strategy (training and testing datasets)
  
  Distributional comparisons between training and testing sets using Kolmogorov-Smirnov tests (empirical CDF comparison)
  
  Jensen–Shannon distance comparisons between training and testing sets
  
  Energy distance comparisons to further evaluate distributional consistency.
  
  Response: We fully agree with the reviewer that the description of the machine-learning dataset should be more systematic and transparent. Although the current manuscript defines the main predictors and targets, it does not yet present them in a sufficiently structured way. In the revised manuscript, we will expand the dataset description in the Methods and Supplement. This revision will include: 1) descriptive statistics of the variables used for model development; 2) an explicit explanation of the train/test splitting strategy; and 3) supplementary diagnostics comparing the distributions of the training and testing subsets. In addition, to address the reviewer's specific concern about distributional consistency, we will add formal train/test comparisons using complementary metrics, including a Kolmogorov-Smirnov-based comparison and information-distance metrics such as Jensen-Shannon divergence and energy distance. These diagnostics will be presented in the Supplementary Materials to support the transparency and robustness assessment of the ML models.
  
  3.2. Hyperparameter specification. The hyperparameters used in the ML models should be clearly reported and justified to ensure reproducibility.
  Response: Agree! The original manuscript reports the Bayesian optimization and the parameters for early stopping, but the hyperparameter settings are not yet clearly reported. Following the suggestion, in the revised manuscript we will add a table listing the final hyperparameter values used for the XGBoost models.
  
  3.3. Clear separation of training and testing results. Model performance should be presented distinctly for both training and testing datasets, using appropriate evaluation metrics and visualizations.
  Response: We agree. This distinction is important, especially in a manuscript centered on machine-learning performance and interpretation. Following the comment, we will report training and testing results separately for the XGBoost and the additional benchmark models, and the revised text and tables will clearly distinguish between training-set and testing-set performance.
  
  3.4. Need for a methodological flowchart. A clear flowchart summarizing the overall framework of the study from data preparation to model development, validation, and interpretation should be provided.
  Response: We agree that the methodological workflow can be made more explicit. The current Fig. 1 already illustrates the coupling between long-term observations, CE-QUAL-W2, and XGBoost/SHAP, but its role as a stepwise workflow is not yet fully evident. In the revised version, we will redesign and clarify the workflow figure so that the sequence from long-term monitoring and process-based simulation to data preparation, model training/testing, and SHAP-based interpretation is more directly visible to the reader.
  
  3.5. Incomplete interpretability analysis. Since the authors devoted a significant portion of the manuscript to SHAP-based analysis (SHAP), both global and local interpretability analyses should be conducted. Currently, the local interpretability component is missing.
  Response: We agree that the interpretability analysis should more clearly include both global and local perspectives. In the current manuscript, the emphasis is mainly on global SHAP rankings, although Fig. 9d already moves toward sample-specific behavior for key late-stratification drivers. To strengthen the local interpretability component, we will add explicit local SHAP analyses for representative cases, focusing on the late-stratification bottom-water temperature model. These examples will help demonstrate how specific predictor values contribute to individual model outputs at the event scale.
  
  Additional explainability analyses. To improve the understanding of feature importance and interactions, the study should incorporate complementary explainability techniques, including Permutation Feature Importance (PFI) for global ranking, Partial Dependence Plots (PDP) for marginal effects, and Individual Conditional Expectation (ICE) plots to capture response heterogeneity. Including these analyses would significantly strengthen the scientific contribution of the work.
  
  Response: We appreciate this constructive suggestion and agree that complementary explainability analyses can further strengthen the interpretation of the model results. In the revised manuscript, SHAP will remain the primary interpretability framework because it provides a coherent basis for both global and local attribution, which is central to the mechanistic questions addressed in this study. To complement the SHAP-based analysis, we will incorporate selected complementary explainability techniques, such as permutation feature importance (PFI) as an additional global-importance diagnostic, together with selected partial dependence plots (PDPs) and individual conditional expectation (ICE) plots for the dominant predictors and the most mechanistically relevant target variables. At the same time, we note that permutation-based and SHAP-based importance measures characterize feature relevance from different perspectives and therefore are not expected to produce identical rank orders, particularly for partially redundant predictors such as same-day and antecedent forcing variables. Accordingly, PFI will be used here as a complementary robustness check, and interpreted primarily at the level of dominant drivers and process categories rather than as a strict one-to-one validation of the SHAP ranking.
  To maintain the manuscript’s focus, these additional analyses will be conducted selectively on the models and drivers most central to our conclusions, especially those concerning depth‑specific thermal dynamics and late‑stratification hypolimnetic warming. Our aim is to assess whether the main physical conclusions remain consistent across explainability frameworks, especially the contrast between atmospheric control of upper-layer thermal structure and hydraulic control of late-stratification bottom-water warming. The main findings from these additional analyses will be summarized in the revised manuscript, with the full set of supporting plots provided in the supplementary materials.
  
  Statistical comparison of models. Once a robust baseline comparison is established among several models, the authors should perform appropriate statistical tests such as the Kruskal-Walli’s test and the Diebold-Mariano test to assess whether differences in predictive performance between models are statistically significant.
  
  Response: We fully agree with the reviewer's comment that, once benchmark models are added, the comparison should not rely solely on visual inspection or point estimates of performance metrics. After adding the benchmark models, we will formally compare predicated errors using paired statistical tests based on matched predictions. We will describe the selected testing framework clearly and use it to evaluate whether the differences in predictive skill are statistically meaningful. Our goal herein is not simply to identify the optimal algorithm, but to provide a transparent evaluation of whether XGBoost delivers a substantively and statistically advantage for the specific application addressed in this study.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-AC1
- AC2: 'Reply on RC2', Xiangzhen Kong, 20 Apr 2026
  
  Mi and colleagues used a long data record of thermal profiles from Rappbode reservoir to fit a two-dimensional hydrodynamic model, and then used a machine learning to link patterns in temperature and stratification dynamics to potential causes. The study is well-written, clear, and the methodology is reasonably novel and could be applied outside the presented study case. The study suits the audience well and could eventually be published, but I have two major issues that need to be addressed first. I list them below, including some additional minor comments further down.
  
  Response: We sincerely thank the reviewer for the constructive assessment of our manuscript. We greatly appreciate the reviewer's positive evaluation of the manuscript's novelty and broader applicability, as well as the identification of two key issues that require strengthening before the study can be considered further. We fully agree that these points are central to the credibility of the manuscript, particularly with respect to the physical grounding of the ML-based attribution. In the revised manuscript, we will address these concerns through targeted additional analyses, clearer methodological documentation, and more cautious wording where needed. Please find below our detailed point-by-point responses and the corresponding planned revisions.
  My first main comment relates to the aim of coupling the machine learning (ML) model XGBoost to the process-based (PB) model CE-QUAL-W2. The ML model is trained on the PB model and is employed specifically to investigate external drivers and causal mechanisms (e.g. L. 32-33). Using data-driven approaches to investigate causality can absolutely be done, but in the present paper, there is no evidence presented on the accuracy of this, although the authors do try to link the outputs from the ML model to processes in the Discussion. Moreover, this could be tested even better than in usual cases, because the PB model that the ML model tries to emulate, has the processes included by design. The authors rightfully argue that the ML-approach primarily gains upon PB scenario runs in terms of time (L. 465-466), but because the novelty of the approach, I argue that the conclusions from the ML model should be validated using PB scenarios first. Some of the causalities highlighted by the ML model (e.g. increasing importance of shortwave radiation with depth; importance of withdrawal volume for end-of-stratification deep-water temperature increase) can be explored with the PB model as well. This should be done in order to build trust in the accuracy of the presented method. If the conclusions between the two methods deviate, the underlying reason should be discussed. The current method has a strong risk of mixing correlation and causality (for instance in telling apart the influence of air temperature and shortwave radiation), and this risk should be mitigated by validation using the PB model. For example, the increasing importance of shortwave radiation with depth is surprising, and this validation could shed light on the reliability. This validation exercise could be presented in the supplementary material and referred to in the main text.
  Response: We agree that the current manuscript does not yet provide sufficiently explicit physical support for the machine learning-derived attributions. This is especially critical, as the novelty of the framework rests on connecting interpretable machine learning with process-based models, rather than employing ML as a purely predictive black box. In addition, we agree with the reviewer that the manuscript does not yet distinguish sufficiently clearly between statistical attribution and causal inference. In this study, XGBoost/SHAP was intended as a fast, depth-resolved attribution layer built on a physically consistent CE-QUAL-W2 reconstruction, rather than as a stand-alone tool for causal demonstration. The reviewer is therefore correct that the strongest ML-derived inferences should be checked more directly against the process-based framework.
  
  To address this concern, we will strengthen the manuscript in two ways. First, we will moderate causal wording throughout the manuscript, particularly in the Abstract and Discussion, so that the manuscript more clearly distinguishes between statistical attribution, mechanistic interpretation, and direct process-based validation. Second, we will add a targeted CE-QUAL-W2 validation exercise in the Supplement, focused on the two mechanisms most central to our conclusions, as follows:
  
  (i) the depth-dependent role of shortwave radiation in shaping subsurface thermal dynamics, and
  
  (ii) the role of intensified late-season deep withdrawals in triggering bottom-water warming.
  
  Specifically, we will use a targeted set of CE-QUAL-W2 sensitivity experiments to test whether changes in shortwave forcing and late-season withdrawal intensity reproduce the depth-specific thermal responses highlighted by SHAP. Their purpose is to evaluate whether the central patterns highlighted by XGBoost/SHAP are consistent with the physical behavior of the reservoir as represented by CE-QUAL-W2. If discrepancies emerge between the PB and ML perspectives, we will discuss them explicitly and frame the corresponding ML-derived signals more cautiously as statistically supported associations rather than direct causal evidence. We believe this targeted approach will substantially improve confidence in the coupled framework while directly addressing the reviewer's concern.
  My second main comment relates to the lack of information on the calibration and validation, primarily of CE-QUAL-W2. What periods were used for training and validation, what parameters were adjusted, what were the calibration targets, what method was used? The terms “site-validated” (L. 124) and “site-calibrated CE-QUAL-W2 model” (L. 457) suggest that this was done before, but it is not clearly indicated where this information can be found. If a model from another publication is used, this should be clearly stated in the data availability statement.
  Response: We fully agree with the reviewer that the present manuscript does not describe the CE-QUAL-W2 calibration and evaluation procedure with sufficient transparency in the main text. A brief description of the calibration variables and the full parameter set is provided in Section S1 of the supplementary materials, and we agree that these details should be stated explicitly in the Methods section of the main text. To clarify here, the model was calibrated manually using a trial-and-error approach. Three site-specific parameters were adjusted, namely the shading coefficient (SHADE), the wind-sheltering coefficient (WSC), and the pure-water light-extinction coefficient (EXH2O), while all remaining parameters were retained at standard CE-QUAL-W2 settings because they are physically based and not typically subject to site calibration[KX1.1][A1.2] (DOI: 10.1016/j.jclepro.2024.142347). No separate split-sample validation period was defined. Instead, the full 1981-2019 observational record was used for continuous calibration/evaluation across a wide range of hydroclimatic conditions. This choice follows the calibration philosophy described in the CE-QUAL-W2 manual for long-term continuous applications, which emphasizes evaluating model behavior across broad observed conditions rather than through an arbitrary calibration/verification split. The main calibration targets were the observed thermal structure and seasonal temperature profiles of the reservoir, with water level used as an additional system-scale evaluation variable. In the revised manuscript, we will move these key calibration/evaluation details from the supplementary materials to the main Methods section, clarify the relationship between the present model application and our earlier Rappbode studies, and revise the wording around “site-calibrated” and “site-validated” so that it more accurately reflects the modeling strategy used here.
  Minor comments:
  
  Title: “event-scale hypolimnetic warming” is not clear. Suggest to change to “episodic hypolimnetic warming” as you did in the abstract (or another term that you find more suitable).
  
  Response: We agree and will revise the title to use “episodic hypolimnetic warming”, which is clearer and more consistent with the terminology used in the manuscript.
  L. 37-38: change to “by the 30-day antecedent moving average of shortwave radiation and air temperature” (or similar).
  
  Response: We agree and will revise this sentence for greater precision, following the reviewer’s suggestion.
  L. 40: change to ”episodic hypolimnetic warming by up to 10 °C…”
  
  Response: Agree! We will revise the text following this suggestion.
  L. 42: use of “compensatory-flow pathway” not clear in this context.
  
  Response: We agree that “compensatory-flow pathway” is not clear enough in its current form. We will replace it with a more explicit description of the shortened downward transport pathway associated with deep withdrawals.
  L. 56: I could not find information on the effect of vertical temperature structure on temperature-sensitive organisms in Carr et al., although some of the references in there might have looked at this. Please assess if this citation is appropriate or cite the original source if possible.
  
  Response: Good point! We will re-check the citation to Carr et al. (2019), and if it does not directly support the statement, we will replace it with a more appropriate original source.
  L. 72-73: The GLTC initiative is mentioned but not used further in the manuscript. Suggest to remove this (but keep the Sharma et al. reference and rest of the sentence).
  
  Response: We agree with the reviewer and will remove the mention of the GLTC initiative while retaining the relevant reference and sentence content.
  L. 97-100: I think that “temperate” is a too broad term here to say that hypolimnetic temperatures are around 4-6 degrees. Boehrer & Schultze also do not support the generality of this rule for the temperate climate zone. Suggest to change to lakes that cool to the temperature of maximum density.
  
  Response: Correct! We agree and will revise this sentence to avoid overgeneralizing across all temperate lakes and reservoirs. We will instead refer more specifically to systems that cool toward the temperature of maximum freshwater density.
  L. 108-110: Bouffard et al. (2013) does not seem to provide support for the general statement that biological and chemical reaction rates roughly double per 10 °C. Please find a more appropriate reference or change the statement to be more representative of the findings in that paper.
  
  Response: Agree! We will revise the sentence and citation so that it more accurately reflects the evidence.
  L. 112 (and elsewhere, such as L. 375): General editing comment that there should be spaces around the dash
  
  Response: We agree and will correct spacing around dashes throughout the manuscript.
  L. 147-150: What data did you use for the depth-varying outflow discharge? Was this based on data from the reservoir operator?
  
  Response: The depth-varying outflow discharge is provided by the Rappbode Reservoir authority (Talsperrenbetrieb Sachsen-Anhalt). We will further clarify this point in the Methods.
  L. 185: “an monitoring buoy” -> “a monitoring buoy”
  
  Response: We will correct this in the revised version.
  L. 262-266 (and links to comment on L. 147-150): In this part you show an impressive R2 value for water level (0.99), but without knowing more about the data sources (see previous comment on L. 147-150) and optional calibration/validation (e.g. added or scaled inflows/outflows to better fit the water level) (see 2nd main comment), it is difficult to judge the extent to which this builds trust in the model.
  
  Response: We agree that the water-level result requires clearer contextualization. In the study, we closed the water budget by incorporating a distributed tributary, as it was recommended in the manual of W2. In the revised manuscript, we will clarify this calibration procedure relevant to the result, so that the reported R² can be interpreted more transparently.
  L. 276-279: Add something like “…at increasing depth” at the end of the sentence.
  
  Response: We agree and will revise the sentence to clarify that the pattern refers to changes at increasing depth.
  L. 292: It is very important here that XGBoost was compared to W2, NOT to observations. The term “error” might be a confusing concept here, so I would underline this by explicitly saying “the RMSE between W2 and XGBoost”.
  
  Response: Good point! We will clarify explicitly in the revised manuscript that the reported RMSE here refers to the difference between CE-QUAL-W2 and XGBoost, not between XGBoost and observations.
  L. 301: I wouldn’t describe the mixed depth as a “tight clustering along the 1:1 line”. It is better to quantify this by showing the R2 value (both in the figure and the text).cxe
  
  Response: We agree and will report the corresponding R² values explicitly in both the text and figure.
  L. 311: I think “decreases”, “reduces” or “diminishes” are clearer terms than “attenuates” in this context.
  
  Response: We agree and will replace “attenuates” with a more appropriate word.
  L. 367: “combinng” -> “combining”
  
  Response: We will correct the word in the revised manuscript.
  L. 371: “bridges an inter-decadal gap” could suggest a gap-filling exercise. I suggest to change the wording.
  
  Response: Agree! We will revise the wording to avoid implying a gap-filling exercise.
  L. 376-377: the part “and the pan-European synthesis of temperate lakes” in not integrated with the rest of the sentence; please rewrite.
  
  Response: Agree! We will rewrite this sentence for better integration and clarity.
  L. 383: “enhances” -> “enhance”
  
  Response: We will make the correction in the revised manuscript.
  L. 400: According to the equation and the provided units, shouldn’t the unit of thermal inertia be time per Kelvin, instead of only time? Currently the units do not check out (though I guess a ∆T of 1 K is assumed). I did a quick scan of Imberger & Patterson but could not find this formula; please reply with the equation number in Imberger & Patterson or alternatively how this equation and the units were derived from the reference.
  
  Response: We highly appreciate the reviewer for this important observation. The current formulation indeed requires clearer dimensional interpretation. Here, τ is used as an index of epilimnetic thermal inertia, defined as the areal heat capacity divided by the magnitude of the net heat flux, following the mixed-layer heat budget. The formula is based on the zero-dimensional heat-budget framework of Piccolroaz et al. (2013; DOI: 10.5194/hess-17-3323-2013) instead of Imberger & Patterson (1989). We will revise the text to make explicit that the estimate corresponds to the characteristic timescale associated with a 1 K temperature response under the given net heat flux, and we will clarify the derivation and units in the revised manuscript accordingly.
  L. 424-427: Can you please clarify how increased vulnerability to mixing events relate to the importance of deep-layer withdrawals?
  
  Response: We thank the reviewer for noting this ambiguity. Our intention was not to treat mixing events as a mechanism independent of deep-layer withdrawals. Rather, late-season weakening of stratification lowers the buoyancy resistance of the density interface, allowing a given deep-withdrawal forcing to erode the thermocline more effectively and enhance compensatory downward transport of warm upper-layer water. Thus, the increased vulnerability to mixing refers to the late-season background state that amplifies the thermal effect of deep-layer withdrawals, rather than to a separate driver. We will revise this paragraph accordingly to make this linkage explicit.
  L. 427: “event” -> “events”
  
  Response: We will correct “event” to “events” in the revised manuscript.
  L. 428-435: I missed the argument that a significantly lowered hypolimnetic volume if the mixed layer deepens. Water withdrawn from a smaller hypolimnion will more quickly lower the thermocline. This could be quantified using hypsographic information.
  
  Response: That is a good suggestion! We fully agree that the reduction in hypolimnetic volume during mixed-layer deepening is an important part of the mechanism and should be made more explicit. We will strengthen this argument quantitatively in the revised manuscript.
  Figure 8: the x-axes are not aligned.
  
  Response: We will align the x-axes in the revised manuscript.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-AC2
RC3:
'Comment on egusphere-2025-6442', Anonymous Referee #3, 19 Mar 2026
Greetings. I have revised the manuscript entitled ‘Four decades of full-depth profiles reveal layer-resolved drivers of reservoir thermal regimes and event-scale hypolimnetic warming ’. The paper deals with Germany’s largest drinking-water reservoir. By comparing two modelling approaches: the CE-QUAL-W2 numerical model and a machine learning XGBoost-based model, the goal is to predict the thermal structure of a reservoir, assessing the performance of each method. The paper is suitable for publication in the HESS journal, whereas it presents some lacks to be fixed before going further down the publication way. These can be listed as follows:
The paper only presents one type of Machine Learning approach and one kind of model. There is a need for at least a small paragraph that frames, in a general way, machine learning approaches. Moreover, there is no presence of reference to Genetic Algorithms and Metaheuristics ones, that have been proved to outperform other ML methods (see Rajwar et al., 2023; Schiavo and Pedretti, 2026). This does not mean that these approaches should be implemented, nor that the paper should compare these results with the offered ones, but at least citing their existence and weight in the literature framework, and clearly explaining why the XGBoost method has been preferred to Genetic algorithms.

Some recent literature works assessed how, for some earth sciences applications, ‘usual’ geostatistical methods (e.g. kriging or co-kriging-based ones) still perform better than Neural Network algorithms or XGBoost methods (Brckovic et al., 2025), or at least they are good enough at predicting the target variable that they should be involved in ML algorithms (Grey et al., 2025) to achieve reliable predictions. I suggest discussing this point further.

I am not sure that a strong enough system conceptualization is given. Thus, I am a bit reluctant about the goodness of the results without grounding them to physically-based description, meshing, and imposition of boundary conditions. Maybe the Authors can clarify this point, underlining the (i) assumptions they have made, (ii) giving more info about the geological structure, and (iii) providing hydrogeological or geological sections to support their claims. Indeed, a max depth of 80 m (Figure 2) requires strong geological proof of concepts.

Temperature fluctuations seem extremely noisy and subject to historical trends. Why have these not been performed and offered in the results?

I think that the revision could proceed further only once these observation have been clarified.
Best regards.
Suggested references:
Brcković, A., Malvić, T., Orešković, J., & Kapuralić, J. (2025). Comparison of Neural Network, Ordinary Kriging, and Inverse Distance Weighting Algorithms for Seismic and Well-Derived Depth Data: A Case Study in the Bjelovar Subdepression, Croatia. Geosciences, 15(6), 206. https://doi.org/10.3390/geosciences15060206
Schiavo, M., & Pedretti, D. (2026). Genetic and Iterative Metaheuristics-Informed Algorithms for Precision Shallow Groundwater Modeling and Drought Inference. Journal of Geophysical Research: Machine Learning and Computation, 3(1), e2025JH000854. https://doi.org/10.1029/2025JH000854
Rajwar, K., Deep, K., & Das, S. (2023). An exhaustive review of the metaheuristic algorithms for search and optimization: Taxonomy, applications, and open challenges. Artificial Intelligence Review, 56(11), 13187–13257. https://doi.org/10.1007/s10462-023-10470-y
Grey, V., Fletcher, T. D., Smith-Miles, K., Hatt, B. E., & Coleman, R. A. (2025). Harnessing the strengths of machine learning and geostatistics to improve streamflow prediction in ungauged basins; the best of both worlds. Journal of Hydrology, 662, 133936. https://doi.org/10.1016/j.jhydrol.2025.133936
Citation: https://doi.org/10.5194/egusphere-2025-6442-RC3
- CC3: 'Reply on RC3', Chenxi Mi, 19 Mar 2026
  
  We highly appreciated the reviewer for the constructive and detailed comments and for considering the manuscript suitable for publication after revision! We will prepare a point-by-point response and revise the manuscript accordingly. In particular, we will (i) add a brief overview positioning our XGBoost–SHAP approach within the broader landscape of machine-learning and optimization methods, including metaheuristics, and clarify our method choice; (ii) expand the discussion on alternative approaches where relevant and delineate the scope of our attribution objective; (iii) strengthen the system description of the CE-QUAL-W2 setup, including bathymetry, grid design, boundary conditions, and withdrawal representation; and (iv) further quantify and present long-term trends versus event-scale anomalies in the temperature record. We appreciate the reviewer’s suggestions and believe these revisions will substantially improve the clarity and robustness of the study.
  Best Cheers,
  Chenxi
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-CC3
- AC3: 'Reply on RC3', Xiangzhen Kong, 20 Apr 2026
  
  Greetings. I have revised the manuscript entitled ‘Four decades of full-depth profiles reveal layer-resolved drivers of reservoir thermal regimes and event-scale hypolimnetic warming. The paper deals with Germany’s largest drinking-water reservoir. By comparing two modelling approaches: the CE-QUAL-W2 numerical model and a machine learning XGBoost-based model, the goal is to predict the thermal structure of a reservoir, assessing the performance of each method. The paper is suitable for publication in the HESS journal, whereas it presents some lacks to be fixed before going further down the publication way. These can be listed as follows:
  Response: We sincerely thank the reviewer for the careful assessment of our manuscript and for considering the study potentially suitable for publication in HESS after clarification and revision. We appreciate the reviewer’s comments on methodological framing, physical system description, and the presentation of long-term variability. These points are helpful, and we agree that the manuscript can be strengthened by better positioning the XGBoost-SHAP approach within the broader methodological literature, and by making the CE-QUAL-W2 system setup more explicit in the main text. Please find below our detailed point-by-point responses and the corresponding planned revisions.
  
  1. The paper only presents one type of Machine Learning approach and one kind of model. There is a need for at least a small paragraph that frames, in a general way, machine learning approaches. Moreover, there is no presence of reference to Genetic Algorithms and Metaheuristics ones, that have been proved to outperform other ML methods (see Rajwar et al., 2023; Schiavo and Pedretti, 2026). This does not mean that these approaches should be implemented, nor that the paper should compare these results with the offered ones, but at least citing their existence and weight in the literature framework, and clearly explaining why the XGBoost method has been preferred to Genetic algorithms.
  Response: We agree that the manuscript would benefit from a broader methodological framing of machine-learning and optimization approaches. In the revised manuscript, we will add a concise paragraph that situates XGBoost within a wider family of regression, machine-learning, and metaheuristic approaches, and we will cite the reviewer-suggested references where relevant.
  
  In addition, the revised manuscript will now include benchmark comparisons against several representative baseline models, such as MARS, Random Forest, and SVM, which will provide a clearer context for the predictive performance of XGBoost. We will also clarify more explicitly why XGBoost was selected in the present study. Specifically, our aim is not exhaustive optimizer benchmarking, but interpretable, depth- and phase-resolved attribution of reservoir thermal drivers from tabular environmental predictors. For this purpose, XGBoost provides a strong balance of predictive skill, computational efficiency, and direct compatibility with SHAP, which is central to our attribution-focused framework. At the same time, we will acknowledge in the Discussion that genetic algorithms and other metaheuristic approaches have shown considerable value in a range of optimization-oriented Earth-science applications and represent promising directions for future work, particularly for parameter optimization and model-structure search.
  2. Some recent literature works assessed how, for some earth sciences applications, ‘usual’ geostatistical methods (e.g. kriging or co-kriging-based ones) still perform better than Neural Network algorithms or XGBoost methods (Brckovic et al., 2025), or at least they are good enough at predicting the target variable that they should be involved in ML algorithms (Grey et al., 2025) to achieve reliable predictions. I suggest discussing this point further.
  Response: We highly appreciate this helpful suggestion and will incorporate the reviewer-suggested references into the revised manuscript. We will add a dedicated paragraph in the Discussion section, incorporating the suggested references, to acknowledge that geostatistical and hybrid geostatistical-machine learning approaches have demonstrated strong performance in related Earth science applications, especially in contexts where spatial interpolation or extrapolation represents a core objective. We will also clarify that the present study is centered on depth-resolved temporal attribution at a fixed reservoir monitoring location, so geostatistical methods were not the primary focus of the present methodological framework.
  3. I am not sure that a strong enough system conceptualization is given. Thus, I am a bit reluctant about the goodness of the results without grounding them to physically-based description, meshing, and imposition of boundary conditions. Maybe the Authors can clarify this point, underlining the (i) assumptions they have made, (ii) giving more info about the geological structure, and (iii) providing hydrogeological or geological sections to support their claims. Indeed, a max depth of 80 m (Figure 2) requires strong geological proof of concepts.
  Response: We thank the reviewer for this helpful comment and agree that the physical system conceptualization of the CE-QUAL-W2 application should be made more explicit in the main text. In the revised manuscript, we will strengthen the Methods section by expanding the description of reservoir morphology and bathymetry, grid structure, key modeling assumptions, and the hydrological and meteorological boundary conditions used to drive the simulations. We will also clarify the source and meaning of the reported maximum depth and distinguish more explicitly between the full reservoir depth represented in the model domain and the depth range of the routine temperature profiles. Our intention is to make those aspects of the physical setting that are directly relevant to the CE-QUAL-W2 simulations clearer and easier to assess, thereby providing a more transparent physical basis for the model application.
  4. Temperature fluctuations seem extremely noisy and subject to historical trends. Why have these not been performed and offered in the results?
  Response: We agree that the long-term trend signal should be presented more explicitly. Trend analyses are partly included in the current manuscript, including the significant warming at 5 m and the absence of significant monotonic trends at 15-50 m, but these results are not yet sufficiently prominent relative to the event-scale analyses. In the revised manuscript, we will strengthen the presentation of long-term thermal trends in the main Results and Discussion. Particularly, in Discussion, we will make the long-term warming of the epilimnion and the lack of a significant monotonic warming trend below 15 m more explicit, while clarifying that the four anomalous bottom-warming years reflect event-scale late-stratification departures from the background deep-water variability rather than a persistent long-term trend. This will help distinguish event-scale anomalies from the broader multidecadal thermal context.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6442-AC3

Chenxi Mi, Bo Gai, Xiangzhen Kong, Yuzhe Jiang, Chun Ngai Chan, and Karsten Rinke

Supplement

https://doi.org/10.5194/egusphere-2025-6442-supplement

Chenxi Mi, Bo Gai, Xiangzhen Kong, Yuzhe Jiang, Chun Ngai Chan, and Karsten Rinke

Viewed

Total article views: 1,515 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
894	477	144	1,515	180	64	119

HTML: 894
PDF: 477
XML: 144
Total: 1,515
Supplement: 180
BibTeX: 64
EndNote: 119

Views and downloads (calculated since 20 Feb 2026)

Month	HTML	PDF	XML	Total
Feb 2026	290	136	39	465
Mar 2026	482	280	92	854
Apr 2026	109	55	12	176
May 2026	13	6	1	20

Cumulative views and downloads (calculated since 20 Feb 2026)

Month	HTML	PDF	XML	Total
Feb 2026	290	136	39	465
Mar 2026	482	280	92	854
Apr 2026	109	55	12	176
May 2026	13	6	1	20

Viewed (geographical distribution)

Total article views: 1,548 (including HTML, PDF, and XML) Thereof 1,548 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 17 May 2026

Short summary

This study examines temperature changes in Rappbode Reservoir, Germany’s largest drinking-water reservoir, over four decades. By analyzing detailed temperature data and weather records, the research shows how factors like sunlight and air temperature impact water temperature at different depths. It also finds that late-season warming in deep waters is mainly caused by water withdrawals. These findings help improve understanding of lake ecosystems and can guide better water resource management.


Total:	0
HTML:	0
PDF:	0
XML:	0