Never Train a Deep Learning Model on a Single Well? Revisiting Training Strategies for Groundwater Level Prediction
Abstract. Deep learning (DL) models are increasingly used for hydrological forecasting, with a growing shift from site-specific to globally trained architectures. This study tests whether the widely held assumption that global models consistently outperform local ones also applies to groundwater systems, which differ substantially from surface water due to slow response dynamics, data scarcity, and strong site heterogeneity. Using a benchmark dataset of nearly 3000 monitoring wells across Germany, we systematically compare global Long Short-Term Memory (LSTM) models with locally trained single-well models in terms of overall performance, training data characteristics, prediction of extremes, and spatial generalization.
For groundwater level prediction, we find that global models provide no systematic accuracy advantage over local models. Local models more often capture site-specific behavior, while global models yield more robust but less specialized predictions across diverse wells. Performance gains arise primarily from dynamically coherent training data, whereas random data reduction has little effect, indicating that similarity matters more than quantity in this setting. Both model types struggle with extreme groundwater conditions, and global models generalize reliably only to wells with comparable dynamics.
These findings qualify the assumption of global model superiority and highlight the need to align modeling strategies with groundwater-specific constraints and application goals.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4055', Anonymous Referee #1, 26 Dec 2025
- RC2: 'Comment on egusphere-2025-4055', Anonymous Referee #2, 06 Jan 2026
“Never Train a Deep Learning Model on a Single Well? Revisiting Training Strategies for Groundwater Level Prediction” by Ohmer and Liesch presents an interesting study on the design of DL models for groundwater timeseries modelling. Even though a substantial body of DL applications to groundwater timeseries modelling already exists, I believe that the study design and the obtained results add novelty to the existing work. I have several points that I wish to see addressed prior to publication.
In the introduction, the authors give a quite broad overview of DL applications for both surface water and groundwater timeseries modelling. The introduction would benefit from clearly stating which studies focus on groundwater and which on surface water. Since this is a groundwater study, I wonder how many surface water references are required – maybe some of them can be removed and replaced by groundwater references. I agree that there is more DL experience in the surface water domain, especially when it comes to spatial transferability, but this could also be an additional point to highlight in the introduction. The intercomparison study by Collenteur et al. (https://doi.org/10.5194/hess-28-5193-2024) would be a good addition to the introduction.
The authors carry out a spatial transferability study (objective 4), which I have not seen in the groundwater literature, and the presented references (l. 50) are all surface water studies. If this is the first spatial transferability study in the groundwater domain, the authors should state this clearly; if other studies exist, they should be mentioned in the introduction.
What is the reasoning behind using a CNN for the single-well models and an LSTM for the global models?
Section 2.2: Are any of the climate variables aggregated in time, for example as a running sum of net precipitation or as SPI at different aggregation windows?
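For illustration, a minimal pandas sketch of the kind of temporal aggregation this question refers to; the column names, window lengths, and values are assumptions, not taken from the manuscript.

```python
import pandas as pd

# Hypothetical daily forcing for one well; column names and values are assumptions.
idx = pd.date_range("2020-01-01", periods=10, freq="D")
df = pd.DataFrame({
    "precip": [2.0, 0.0, 5.1, 0.3, 0.0, 12.4, 1.1, 0.0, 3.2, 0.6],
    "pet":    [1.0, 1.2, 0.9, 1.4, 1.6, 0.8, 1.1, 1.3, 1.0, 1.2],
}, index=idx)

df["net_precip"] = df["precip"] - df["pet"]

# Running sums of net precipitation over several aggregation windows, analogous
# to SPI-style accumulation periods (the window lengths here are illustrative).
for window in (3, 7):
    df[f"net_precip_sum_{window}d"] = df["net_precip"].rolling(window).sum()

print(df)
```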
Section 2.3: Just to be clear, timeseries statistics such as the mean head or the standard deviation are not part of the static attributes?
How sensitive are the results to the choice of architecture and to the hyperparameter values presented in Section 3.2?
Section 3.2: Were the head timeseries normalized in any way? If yes, how can the authors argue for testing spatial extrapolation if knowledge of the mean and standard deviation is required for the back-transformation?
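To make this concern concrete, a small sketch assuming per-well z-score normalization (whether the manuscript uses this is exactly the question): the back-transformation needs each well's own mean and standard deviation, which are not available for a spatially unseen well.

```python
import numpy as np

rng = np.random.default_rng(0)
heads = rng.normal(loc=312.5, scale=0.8, size=365)   # synthetic head series [m a.s.l.]

# Per-well z-score normalization, with statistics from the training period.
mu, sigma = heads.mean(), heads.std()
heads_norm = (heads - mu) / sigma

# A stand-in "prediction" in normalized space.
pred_norm = heads_norm + rng.normal(scale=0.05, size=heads.size)

# The back-transformation needs the SAME per-well statistics, which is
# precisely what is unavailable for a spatially unseen well.
pred_heads = pred_norm * sigma + mu
```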
What do P1, P2, …, P5 mean? P1 excludes 500 wells, P2 excludes 1000 wells, and so on? To me this only became clear when reading the results sections. It would be good to state the number of wells in each stage already in Section 3.1. The testing strategy is also not stated: are all wells used for testing in 2013-2022, or only the ones left after the stagewise removal? From Figure 2 I get the impression that the testing dataset varies per stage – can the performances be compared in a meaningful way across the stages? I would suggest making an additional test using the P5 wells for all stages.
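For illustration, a minimal sketch of this suggestion, assuming stages that drop 500 wells each from a pool of 2951 and a random removal order as a stand-in for the paper's actual criterion; all identifiers are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)
order = rng.permutation(2951)        # stand-in for the paper's removal criterion

# Stagewise training sets: P0 keeps all wells, each further stage drops 500 more.
stages = {f"P{k}": order[k * 500:] for k in range(6)}

# Suggested fixed evaluation set: the wells remaining at P5, used for ALL stages,
# so that performance becomes comparable across stages.
test_wells = stages["P5"]
for name, train_wells in stages.items():
    print(f"{name}: train on {len(train_wells)} wells, evaluate on {len(test_wells)} P5 wells")
```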
The fact that global models do not outperform single-well models at the P0 stage, and that an advantage of the global models only becomes tangible at P4 and P5, makes me wonder whether the chosen LSTM architecture can exploit the static features in a meaningful way. Along these lines, when removing wells based on their correlation, do the static features also become more homogeneous? In other words, is the similarity of the timeseries reflected in the static features?
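One way to check this empirically, as a sketch with purely hypothetical static attributes (the names below are not the paper's features): if time-series similarity is mirrored in the static features, their dispersion should shrink across the stages.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical static attributes per well; the names are NOT the paper's features.
static = pd.DataFrame({
    "depth_to_water_m": rng.gamma(2.0, 5.0, 2951),
    "log_transmissivity": rng.normal(0.0, 1.0, 2951),
})

order = rng.permutation(len(static))        # stand-in for the correlation ranking
for k in range(6):
    subset = static.iloc[order[k * 500:]]
    # If time-series similarity is mirrored in the static features, their spread
    # should shrink as more dissimilar wells are removed.
    print(f"P{k}:", subset.std().round(2).to_dict())
```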
Another very relevant question, in my opinion, is the length of the timeseries. The authors make use of an extensive German database with full coverage for the period 1991 to 2022 – a coverage that is not available in many other countries. Therefore, an alternative modelling experiment with stages in which, e.g., 2 years at a time are removed from the training dataset would be very insightful. P1 starting in 1996, P2 in 1998, etc. would be extremely relevant for similar applications in countries with shorter groundwater records.
Sections 4.4 and 4.5 contain the same text.
I am puzzled why the performance for the correlation-wise stages increases in Figure 6 for the out-of-sample wells. For P5cor the model is trained on 451 wells and tested on 2500, and for P1cor the model is trained on 2451 and tested on 500 – is this correct? Again, the varying testing datasets make it difficult, in my opinion, to compare performance across the stages. Nevertheless, for the P5cor training you are using very homogeneous timeseries and testing across very heterogeneous timeseries. Why should this work better than P1cor, where you are training on heterogeneous timeseries and also using heterogeneous timeseries for testing?
Citation: https://doi.org/10.5194/egusphere-2025-4055-RC2
Summary
The paper compares single-well models with global models for groundwater level forecasting, focusing on robustness and predictive performance. The comparison is well motivated by earlier work suggesting that global approaches often perform better in surface water modelling. The authors also examine how global model performance depends on training-set size and evaluate the influence of dynamic similarity across sites. The study further investigates how well global models generalize to unseen wells. Overall, the manuscript is clearly structured and the analysis is presented in a careful and transparent way.
Evaluation and Recommendations
Model choice may influence the conclusions, but it is currently unclear to what extent. For single-well models, performance can vary across sites depending on the selected model structure. Global model performance may also be sensitive to model choice, which could affect the resulting predictions and the strength of the conclusions. Expanding the set of tested models may be beyond the scope of this paper, but I recommend explicitly discussing how sensitive the main findings are to the chosen model(s), and under which conditions the conclusions might change.
As an additional diagnostic, a map showing the spatial distribution of performance differences (e.g., ΔNSE = NSE_global − NSE_local) would be informative to assess whether the largest deltas follow any geographic or hydrogeological patterns.
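A minimal matplotlib sketch of such a map, on entirely synthetic coordinates and scores; the plotting ranges roughly cover Germany, and all values are placeholders, not real results.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 2951                                  # roughly the number of wells in the study
lon = rng.uniform(6.0, 15.0, n)           # approximate longitude range of Germany
lat = rng.uniform(47.3, 55.0, n)          # approximate latitude range of Germany
nse_global = rng.uniform(0.2, 0.95, n)    # placeholder scores, not real results
nse_local = rng.uniform(0.2, 0.95, n)

delta = nse_global - nse_local            # > 0: global model better at that well

fig, ax = plt.subplots(figsize=(5, 6))
sc = ax.scatter(lon, lat, c=delta, cmap="coolwarm", vmin=-0.5, vmax=0.5, s=8)
fig.colorbar(sc, ax=ax, label="ΔNSE = NSE_global - NSE_local")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("ΔNSE per well (synthetic illustration)")
plt.show()
```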
The methodology of filtering out a subset of wells is clear and coherent, and the correlation-based selection is easy to follow. However, I wonder how the results might change if a spatio-dynamic clustering were used instead. In this context, it would help to justify why a correlation-based approach was preferred over other clustering methods. A useful discussion point is whether adding hydrogeological classifications (in addition to the dynamic similarity) could provide meaningful context before applying the global model, and whether longer time series (where available) would be expected to improve model performance.
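To make the contrast concrete, a small sketch of both strategies on synthetic series – a correlation-based filter in the spirit of the manuscript, and a k-means clustering of the dynamics as one possible alternative; the threshold and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_wells, n_steps = 200, 520
series = rng.normal(size=(n_wells, n_steps))        # synthetic weekly head anomalies
series += np.sin(np.linspace(0.0, 20.0, n_steps))   # shared seasonal signal

# Correlation-based filtering: keep wells whose mean correlation with all
# other wells exceeds a threshold (the threshold here is an assumption).
corr = np.corrcoef(series)
mean_corr = (corr.sum(axis=1) - 1.0) / (n_wells - 1)
keep = mean_corr > 0.3

# Clustering alternative: group wells by their dynamics and train one global
# model per cluster instead of discarding dissimilar wells outright.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(series)
print(keep.sum(), "wells kept by correlation filter;", np.bincount(labels), "cluster sizes")
```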
Specific comments:
Line 16: missing reference.
Line 17: are often slower (not always, as in the case of karst).
Lines 104-107: Please rephrase for clarity. In the context of this sentence, it is not clear what “unseen location” means.
Line 122: was HYRAS or ERA5-Land used in this case?
Lines 193-195: “Groundwater drought” is defined and interpreted in different ways across the literature. In this manuscript, it appears to be implicitly defined as periods when groundwater levels fall below the 10th percentile (“the 10th and 90th percentiles of the observed distribution in the test set.”), but this threshold is not stated clearly or justified. Please explicitly define the drought criterion, provide a reference (or brief background) for the use of the 10th-percentile threshold, and clarify your terminology. How do these lines relate to line 305: “For each well, low extremes were defined as values in the test period below the 1st percentile of its training distribution, and high extremes as values above the 99th percentile”?
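A short sketch of the two threshold definitions this comment contrasts, on synthetic data, to show that they reference different distributions and different percentiles:

```python
import numpy as np

rng = np.random.default_rng(5)
train = rng.normal(100.0, 2.0, 3000)   # synthetic training-period heads [m]
test = rng.normal(99.5, 2.5, 1000)     # synthetic test-period heads [m]

# Drought-style criterion (Sect. 3): test values below the TEST-set 10th percentile.
drought = test < np.percentile(test, 10)

# Extremes criterion (line 305): test values outside the 1st/99th percentiles of
# the TRAINING distribution - a different reference distribution and threshold.
low_extreme = test < np.percentile(train, 1)
high_extreme = test > np.percentile(train, 99)
print(drought.sum(), low_extreme.sum(), high_extreme.sum())
```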
Section 4.4 is duplicated in Section 4.5.