Never Train a Deep Learning Model on a Single Well? Revisiting Training Strategies for Groundwater Level Prediction
Abstract. Deep learning (DL) models are increasingly used for hydrological forecasting, with a growing shift from site-specific to globally trained architectures. This study tests whether the widely held assumption that global models consistently outperform local ones also applies to groundwater systems, which differ substantially from surface water due to slow response dynamics, data scarcity, and strong site heterogeneity. Using a benchmark dataset of nearly 3000 monitoring wells across Germany, we systematically compare global Long Short-Term Memory (LSTM) models with locally trained single-well models in terms of overall performance, training data characteristics, prediction of extremes, and spatial generalization.
For groundwater level prediction, we find that global models provide no systematic accuracy advantage over local models. Local models more often capture site-specific behavior, while global models yield more robust but less specialized predictions across diverse wells. Performance gains arise primarily from dynamically coherent training data, whereas random data reduction has little effect, indicating that similarity matters more than quantity in this setting. Both model types struggle with extreme groundwater conditions, and global models generalize reliably only to wells with comparable dynamics.
These findings qualify the assumption of global model superiority and highlight the need to align modeling strategies with groundwater-specific constraints and application goals.
Summary
The paper compares single-well models with global models for groundwater level forecasting, focusing on robustness and predictive performance. The comparison is well motivated by earlier work suggesting that global approaches often perform better in surface water modelling. The authors also examine how global model performance depends on training-set size and on dynamic similarity across sites, and how well global models generalize to unseen wells. Overall, the manuscript is clearly structured and the analysis is presented in a careful and transparent way.
Evaluation and Recommendations
Model choice may influence the conclusions, but it is currently unclear to what extent. For single-well models, performance can vary across sites depending on the selected model structure. Global model performance may also be sensitive to model choice, which could affect the resulting predictions and the strength of the conclusions. Expanding the set of tested models may be beyond the scope of this paper, but I recommend explicitly discussing how sensitive the main findings are to the chosen model(s), and under which conditions the conclusions might change.
As an additional diagnostic, a map showing the spatial distribution of performance differences (e.g., ΔNSE = NSE_global − NSE_local) would be informative to assess whether the largest deltas follow any geographic or hydrogeological patterns.
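To make the suggested diagnostic concrete, the following is a minimal Python sketch of the per-well ΔNSE computation, assuming the standard Nash-Sutcliffe efficiency definition. The function names `nse` and `delta_nse` and the array inputs are hypothetical and are not taken from the manuscript or its code.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of obs about its mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # Keep only time steps where both observation and simulation exist.
    mask = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[mask], sim[mask]
    if obs.size == 0:
        return np.nan
    denom = np.sum((obs - obs.mean()) ** 2)
    if denom == 0:
        # NSE is undefined for a constant observed series.
        return np.nan
    return 1.0 - np.sum((obs - sim) ** 2) / denom

def delta_nse(obs, sim_global, sim_local):
    """Delta NSE = NSE_global - NSE_local for one well (positive favors global)."""
    return nse(obs, sim_global) - nse(obs, sim_local)

# Toy example (illustrative values only): a perfect "global" simulation
# versus a "local" simulation that always predicts the observed mean.
obs = [1.0, 2.0, 3.0, 4.0]
example_delta = delta_nse(obs, sim_global=obs, sim_local=[2.5, 2.5, 2.5, 2.5])
```

Computing this value for every well and plotting it at the well coordinates, colored by sign and magnitude, would show whether the largest differences cluster geographically or by hydrogeological setting.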
The methodology of filtering out a subset of wells is clear and coherent, and the correlation-based selection is easy to follow. However, I wonder how the results might change if a spatio-dynamic clustering were used instead. In this context, it would help to justify why a correlation-based approach was preferred over other clustering methods. A useful discussion point is whether adding hydrogeological classifications (in addition to the dynamic similarity) could provide meaningful context before applying the global model, and whether longer time series (where available) would be expected to improve model performance.
Specific comments:
Line 16: missing reference.
Line 17: please write “are often slower”; this is not always the case (e.g., in karst systems).
Lines 104-107: Please rephrase for clarity. In the context of this sentence, it is not clear what “unseen location” means.
Line 122: Was HYRAS or ERA5-Land used in this case?
Lines 193-195: “Groundwater drought” is defined and interpreted in different ways across the literature. In this manuscript, it appears to be implicitly defined as periods when groundwater levels fall below the 10th percentile (“the 10th and 90th percentiles of the observed distribution in the test set”), but this threshold is neither stated clearly nor justified. Please explicitly define the drought criterion, provide a reference (or brief background) for the use of the 10th-percentile threshold, and clarify your terminology. How do these lines relate to line 305: “For each well, low extremes were defined as values in the test period below the 1st percentile of its training distribution, and high extremes as values above the 99th percentile”?
Section 4.4 appears to be duplicated as Section 4.5.
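Regarding the comment on lines 193-195 and line 305: the two quoted threshold definitions can flag different test values as extremes. The minimal Python sketch below illustrates this, under the assumption that thresholds are simple empirical percentiles of either the training or the test series. The function `extreme_masks`, its parameters, and the toy data are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

def extreme_masks(train, test, low_q, high_q, reference="train"):
    """Flag low/high extremes in the test series.

    Thresholds are the low_q-th and high_q-th percentiles of the reference
    series, which is either the training series or the test series itself.
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    ref = train if reference == "train" else test
    low_thr, high_thr = np.nanpercentile(ref, [low_q, high_q])
    return test < low_thr, test > high_thr

# Toy data (illustrative values only, not from the benchmark dataset).
train = np.arange(0.0, 101.0)                   # values 0, 1, ..., 100
test = np.array([-5.0, 0.5, 50.0, 99.5, 120.0])

# Definition as quoted from line 305: 1st/99th percentiles of the training data.
low_a, high_a = extreme_masks(train, test, 1, 99, reference="train")

# Definition as quoted from lines 193-195: 10th/90th percentiles of the test data.
low_b, high_b = extreme_masks(train, test, 10, 90, reference="test")
```

In this toy case the training-based 1st/99th percentile thresholds flag two low and two high values, whereas the test-based 10th/90th percentile thresholds flag one of each, which is why an explicit statement of the definition used in each analysis would help readers.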