Validation Strategies for Deep Learning-Based Groundwater Level Time Series Prediction Using Exogenous Meteorological Input Features
Abstract. Due to the growing reliance on machine learning (ML) approaches for predicting groundwater levels (GWL), it is important to examine the methods used for performance estimation. A suitable performance estimation method yields the most accurate estimate of the accuracy a model would achieve on completely unseen test data and thus provides a solid basis for model selection decisions. This paper investigates the suitability of different performance evaluation strategies, namely blocked cross-validation (bl-CV), repeated out-of-sample validation (repOOS), and out-of-sample validation (OOS), for evaluating one-dimensional convolutional neural network (1D-CNN) models that predict GWL from exogenous meteorological input data. Unlike previous comparative studies, which mainly focused on autoregressive models, this work uses a non-autoregressive approach based solely on exogenous meteorological input features, without incorporating past groundwater levels. A dataset of 100 GWL time series was used to evaluate the performance of the different validation methods. The study concludes that bl-CV provides the most representative estimates of actual model performance among the three methods examined. The most commonly used OOS validation yielded the most uncertain performance estimates in this study. The results underscore the importance of carefully selecting a performance estimation strategy to ensure that model comparisons and adjustments are made on a reliable basis.
0. This study aims to examine performance evaluation methods for ML-based groundwater-level prediction. Based on the abstract, the main contributions of the study appear to be twofold: (1) the use of time-lagged meteorological variables together with non-time-lagged groundwater data for groundwater-level prediction, and (2) the application of different performance evaluation strategies, including blocked 5-fold cross-validation (bl-CV), repeated out-of-sample validation (repOOS), and out-of-sample validation (OOS). Having read through the manuscript, I find this to be a commendable effort that provides a thoughtful and well-executed case study on groundwater-level prediction.
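To check that I have understood the three evaluation strategies correctly, the sketch below shows how I read the splitting logic on a toy weekly index (this is my own illustration in Python; the fold count, split points, and the 80/20 ratio are assumptions, not values taken from the manuscript):

import numpy as np

n = 520                                        # ten hypothetical years of weekly values
idx = np.arange(n)

# Out-of-sample validation (OOS): a single chronological split, e.g. 80 % / 20 %.
split = int(0.8 * n)
oos = [(idx[:split], idx[split:])]

# Repeated out-of-sample validation (repOOS): several chronological splits with
# different split points; the test block always follows the training block.
repoos = [(idx[:s], idx[s:]) for s in (int(0.6 * n), int(0.7 * n), int(0.8 * n))]

# Blocked 5-fold cross-validation (bl-CV): contiguous, non-shuffled blocks,
# each block serving once as the validation fold.
folds = np.array_split(idx, 5)
blcv = [(np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i])
        for i in range(5)]

for name, splits in [("OOS", oos), ("repOOS", repoos), ("bl-CV", blcv)]:
    tr, te = splits[0]
    print(f"{name}: {len(splits)} split(s); first split trains on {len(tr)} "
          f"and tests on {len(te)} weeks")

If this reading is inaccurate (for instance, regarding how the repOOS split points are chosen), a small schematic in the Theory and Background section would help readers such as myself.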
1. However, my main concern, and hence the principal weakness of the study, is that although the manuscript persuasively situates itself within the existing literature through comparisons and contrasts, its scope is constrained by the exclusive reliance on a single 1D-CNN model. This raises the question of whether the conclusions regarding evaluation methods are fully justifiable, transferable, and robust across other ML/DL approaches. In addition, the absence of a benchmark comparison for the proposed approach further weakens the study.
2. I appreciate that the authors have already defined a clear research question for the current study. However, I would still encourage including more discussion of related research to better inform and engage GMD's broad and sophisticated readership. For example, it would be valuable to compare your findings with those of Shen et al. (2022), highlighting in what ways your results are consistent or divergent. In addition, while the manuscript takes a strong stance on the time-consecutive hypothesis, it would also be helpful to discuss the perspective proposed by Zheng et al. (2022), who emphasize the importance of distributional representativeness across train/validation/test sets and suggest that this consideration may reduce the need for k-fold cross-validation. Including such comparisons would strengthen the manuscript by situating your contribution more clearly within the broader context of current research.
3. Given that I do not have prior experience in groundwater-level modeling, I find that the opening paragraph of the Introduction does not clearly convey the nature of the problem. It remains unclear whether the study addresses point prediction, area-averaged prediction, or image-type prediction on structured or unstructured grids. In addition, it would be helpful to explain the common problem setups used in previous research along these lines, so that readers without domain expertise can better situate the present study.
4. At the end of the Introduction, please provide a clear definition of what is meant by stationary and non-stationary conditions in the context of this study. Since these concepts can depend on the choice of window size, it would be helpful if you could explicitly state how they are defined here.
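For example, a definition along the following lines, with the window length w stated explicitly, would remove the ambiguity (this is only my suggestion, not wording taken from the manuscript): a GWL series y_t is treated as weakly stationary over a window of length w starting at t_0 if

E[y_t] = μ,   Var(y_t) = σ²,   Cov(y_t, y_{t+τ}) = γ(τ)   for all t and t+τ in {t_0, …, t_0 + w − 1},

i.e. the mean and variance are constant within the window and the autocovariance depends only on the lag τ, not on absolute time.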
5. In the Theory and Background section, you discuss each evaluation method individually. However, some of this information is already given in the Introduction. While the content itself is sound, I recommend further polishing to reduce redundancy and to sharpen the delineation between the two sections, which would improve readability.
6. Around line 110, the manuscript states: “…For time series prediction, random shuffling of the data is often considered problematic as it can break the temporal dependency of the data…”. I would like to ask whether this claim is model-dependent. For instance, is it necessarily true for tree-based algorithms such as Random Forests, which do not rely on temporal (Markovian) state updates? While sequential models that rely on temporal dependencies (e.g., autoregressive or state-space models) may indeed be affected, models that only map input–output relationships may not experience the same issue. Could you clarify whether this limitation arises primarily from the choice of model, rather than from the evaluation method itself?
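To make this question concrete, the following sketch (my own construction on synthetic autocorrelated data with lagged features, not the authors' setup) contrasts shuffled folds with non-shuffled, contiguous folds for a model without internal temporal state:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n, lags = 600, 8
series = np.cumsum(rng.normal(size=n))          # synthetic, strongly autocorrelated series
X = np.column_stack([np.roll(series, k) for k in range(1, lags + 1)])[lags:]
y = series[lags:]

rf = RandomForestRegressor(n_estimators=100, random_state=0)
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
blocked = KFold(n_splits=5, shuffle=False)      # contiguous blocks in temporal order

for name, cv in [("shuffled", shuffled), ("blocked", blocked)]:
    scores = cross_val_score(rf, X, y, cv=cv, scoring="r2")
    print(f"{name:8s}: mean R² = {scores.mean():.3f} ± {scores.std():.3f}")

In experiments of this kind, a large gap between the two rows would suggest that the optimism of shuffled folds stems from leakage between highly correlated neighbouring samples rather than from any temporal state inside the model; it would be helpful if the authors could state which of these mechanisms they have in mind.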
7. Around line 175, you refer to the concept of weak stationarity. Could you please clarify what window size is being used to assess this property? Since the definition of weak stationarity can depend on the temporal window considered, this specification would help readers interpret the results correctly.
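As a minimal illustration of why the window matters (synthetic weekly data, not the study's series), the apparent drift of the mean changes strongly with the window length:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
t = np.arange(520)                              # ten years of weekly time steps
y = pd.Series(np.sin(2 * np.pi * t / 52) + 0.002 * t + rng.normal(0, 0.2, t.size))

for w in (26, 104):                             # half-year vs. two-year window
    drift = y.rolling(w).mean().std()           # how much the windowed mean wanders
    print(f"window = {w:3d} weeks: std of rolling mean = {drift:.3f}")

With a half-year window, the seasonal cycle alone makes the rolling mean drift considerably, whereas a two-year window largely averages it out; stating the window used in the manuscript would therefore make the stationarity labels reproducible.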
8. For Section 3.3, I wonder whether the use of dropout would affect the results obtained with the different evaluation methods.
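One way to check this would be to repeat training on a fixed split with different random seeds, which change the weight initialization and the dropout masks, and to compare the resulting score spread with the differences observed between evaluation methods. A rough sketch (synthetic data and a toy 1D-CNN, not the authors' architecture or hyperparameters):

import numpy as np
import torch
import torch.nn as nn

def make_data(n=400, n_feat=3, seq_len=52, seed=0):
    g = np.random.default_rng(seed)
    X = g.normal(size=(n, n_feat, seq_len)).astype("float32")
    y = X.mean(axis=(1, 2)).astype("float32") + g.normal(0, 0.1, n).astype("float32")
    return torch.from_numpy(X), torch.from_numpy(y)

def train_and_score(seed, p_drop=0.5):
    torch.manual_seed(seed)                      # controls init and dropout masks
    X, y = make_data()
    Xtr, ytr, Xte, yte = X[:320], y[:320], X[320:], y[320:]
    model = nn.Sequential(
        nn.Conv1d(3, 8, kernel_size=3), nn.ReLU(), nn.Dropout(p_drop),
        nn.Flatten(), nn.Linear(8 * 50, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(50):                          # short full-batch training
        opt.zero_grad()
        loss = loss_fn(model(Xtr).squeeze(-1), ytr)
        loss.backward()
        opt.step()
    model.eval()                                 # dropout off at test time
    with torch.no_grad():
        return torch.sqrt(loss_fn(model(Xte).squeeze(-1), yte)).item()

scores = [train_and_score(s) for s in range(5)]
print(f"test RMSE over 5 seeds: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

If the seed-to-seed spread turns out to be of the same order as the differences between OOS, repOOS, and bl-CV, dropout-related training noise could partly mask the effect the study aims to isolate.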
9. If I understand correctly, the authors use 80% of the in-set data for model development. Have you evaluated whether a smaller subset of this 80% could achieve comparable accuracy and robustness, and, if so, what the minimum percentage might be? Additionally, would reducing the total amount of data alter the study’s conclusions?
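A chronological learning-curve experiment would answer this directly; a sketch of what I have in mind (a generic regressor on synthetic lag features as a stand-in for the 1D-CNN, with arbitrary fractions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, lags = 800, 8
series = np.cumsum(rng.normal(size=n))
X = np.column_stack([np.roll(series, k) for k in range(1, lags + 1)])[lags:]
y = series[lags:]

split = int(0.8 * len(y))                       # 80 % in-set, 20 % held-out test
Xin, yin, Xte, yte = X[:split], y[:split], X[split:], y[split:]

for frac in (0.25, 0.5, 0.75, 1.0):             # keep only the most recent part of the in-set data
    k = int(frac * len(yin))
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xin[-k:], yin[-k:])
    print(f"{int(frac * 100):3d} % of in-set data: test R² = {r2_score(yte, model.predict(Xte)):.3f}")

Repeating such a curve for a few representative wells would show whether the study's conclusions are sensitive to the amount of data.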
10. Figures 5–7 provide a reasonable and effective way of summarizing the results. That said, are there additional quantitative approaches that could be used to present the findings on spatial maps? Moreover, beyond the stationarity perspective, could further insights be derived in terms of predictive accuracy that would enrich the interpretation of the results?
11. The manuscript fixes the input meteorological sequence length at 52 weeks. Please clarify the basis for this choice. Have alternative horizons been tested? Should the optimal horizon be constant across sites, or might it vary with hydro-geo-climatic setting? If not constant, what insights can be derived from treating this as a site-specific (or region-specific) hyperparameter?
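A site-wise sensitivity check over candidate input lengths would make the choice transparent; something along these lines (my sketch with a synthetic driver-response pair and a linear stand-in model, candidate lengths chosen arbitrarily):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n = 800
met = np.cumsum(rng.normal(size=n))             # synthetic weekly meteorological driver
gwl = np.convolve(met, np.ones(30) / 30, mode="same") + rng.normal(0, 0.1, n)

split = int(0.8 * n)                            # fixed chronological train/test boundary
for seq_len in (13, 26, 52, 104):               # quarter, half, one, and two years of input
    X = np.column_stack([np.roll(met, k) for k in range(seq_len)])[seq_len:]
    y = gwl[seq_len:]
    s = split - seq_len                         # same calendar boundary for every seq_len
    model = Ridge().fit(X[:s], y[:s])
    rmse = mean_squared_error(y[s:], model.predict(X[s:])) ** 0.5
    print(f"input length {seq_len:3d} weeks: test RMSE = {rmse:.3f}")

Plotting the best length per well against site characteristics could then show whether a single 52-week horizon is adequate everywhere, which would directly address the question above.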
References:
Shen, H., Tolson, B.A. and Mai, J., 2022. Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58(3), p.e2021WR031523.
Zheng, F., Chen, J., Maier, H.R. and Gupta, H., 2022. Achieving robust and transferable performance for conservation-based models of dynamical physical systems. Water Resources Research, 58(5), p.e2021WR031818.