Validation Strategies for Deep Learning-Based Groundwater Level Time Series Prediction Using Exogenous Meteorological Input Features
Abstract. Due to the growing reliance on machine learning (ML) approaches for predicting groundwater levels (GWL), it is important to examine the methods used for performance estimation. A suitable performance estimation method yields the most accurate estimate of the accuracy a model would achieve on completely unseen test data and thus provides a solid basis for model selection decisions. This paper investigates the suitability of different performance evaluation strategies, namely blocked cross-validation (bl-CV), repeated out-of-sample validation (repOOS), and out-of-sample validation (OOS), for evaluating one-dimensional convolutional neural network (1D-CNN) models that predict GWL from exogenous meteorological input data. Unlike previous comparative studies, which mainly focused on autoregressive models, this work uses a non-autoregressive approach based solely on exogenous meteorological input features, without incorporating past groundwater levels. A dataset of 100 GWL time series was used to evaluate the performance of the different validation methods. The study concludes that bl-CV provides the most representative estimates of actual model performance among the three methods examined. The most commonly used OOS validation yielded the most uncertain performance estimates in this study. The results underscore the importance of carefully selecting a performance estimation strategy to ensure that model comparisons and adjustments are made on a reliable basis.
0. This study aims to examine performance evaluation methods for ML-based groundwater-level prediction. Based on the abstract, the main contributions of the study appear to be twofold: (1) the use of time-lagged meteorological variables together with non-time-lagged groundwater data for groundwater-level prediction, and (2) the application of different performance evaluation strategies, including blocked 5-fold cross-validation (bl-CV), repeated out-of-sample validation (repOOS), and out-of-sample validation (OOS). Having read through the manuscript, I find this to be a commendable effort that provides a thoughtful and well-executed case study on groundwater-level prediction.
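To check that I have understood the three evaluation strategies correctly, the sketch below shows how I read the splitting logic on a toy weekly index (this is my own illustration in Python; the fold count, split points, and the 80/20 ratio are assumptions, not values taken from the manuscript):

import numpy as np

n = 520                                        # ten hypothetical years of weekly values
idx = np.arange(n)

# Out-of-sample validation (OOS): a single chronological split, e.g. 80 % / 20 %.
split = int(0.8 * n)
oos = [(idx[:split], idx[split:])]

# Repeated out-of-sample validation (repOOS): several chronological splits with
# different split points; the test block always follows the training block.
repoos = [(idx[:s], idx[s:]) for s in (int(0.6 * n), int(0.7 * n), int(0.8 * n))]

# Blocked 5-fold cross-validation (bl-CV): contiguous, non-shuffled blocks,
# each block serving once as the validation fold.
folds = np.array_split(idx, 5)
blcv = [(np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i])
        for i in range(5)]

for name, splits in [("OOS", oos), ("repOOS", repoos), ("bl-CV", blcv)]:
    tr, te = splits[0]
    print(f"{name}: {len(splits)} split(s); first split trains on {len(tr)} "
          f"and tests on {len(te)} weeks")

If this reading is inaccurate (for instance, regarding how the repOOS split points are chosen), a small schematic in the Theory and Background section would help readers such as myself.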
1. However, my main concern, and hence the principal weakness of the study, is that although the manuscript persuasively situates itself within the existing literature through comparisons and contrasts, its scope is constrained by the exclusive reliance on a single 1D-CNN model. This raises the question of whether the conclusions regarding evaluation methods are fully justifiable, transferable, and robust across other ML/DL approaches. In addition, the absence of a benchmark comparison for the proposed approach further weakens the study.
2. I appreciate that the authors have already defined a clear research question for the current study. However, I would still encourage including more discussion of related research to better inform and engage GMD's broad and sophisticated readership. For example, it would be valuable to compare your findings with those of Shen et al. (2022), highlighting in what ways your results are consistent or divergent. In addition, while the manuscript takes a strong stance on the time-consecutive hypothesis, it would also be helpful to discuss the perspective proposed by Zheng et al. (2022), who emphasize the importance of distributional representativeness across train/validation/test sets and suggest that this consideration may reduce the need for k-fold cross-validation. Including such comparisons would strengthen the manuscript by situating your contribution more clearly within the broader context of current research.
3. Given that I do not have prior experience in groundwater-level modeling, I find that the opening paragraph of the Introduction does not clearly convey the nature of the problem. It remains unclear whether the study addresses point prediction, area-averaged prediction, or image-type prediction on structured or unstructured grids. In addition, it would be helpful to explain the common problem setups used in previous research along these lines, so that readers without domain expertise can better situate the present study.
4. At the end of the Introduction, please provide a clear definition of what is meant by stationary and non-stationary conditions in the context of this study. Since these concepts can depend on the choice of window size, it would be helpful if you could explicitly state how they are defined here.
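For example, a definition along the following lines, with the window length w stated explicitly, would remove the ambiguity (this is only my suggestion, not wording taken from the manuscript): a GWL series y_t is treated as weakly stationary over a window of length w starting at t_0 if

E[y_t] = μ,   Var(y_t) = σ²,   Cov(y_t, y_{t+τ}) = γ(τ)   for all t and t+τ in {t_0, …, t_0 + w − 1},

i.e. the mean and variance are constant within the window and the autocovariance depends only on the lag τ, not on absolute time.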
5. In the Theory and Background section, you discuss each evaluation method individually. However, some of this information is already given in the Introduction. While the content itself is sound, I recommend further polishing to reduce redundancy and to sharpen the delineation between the two sections, which would improve readability.
6. Around line 110, the manuscript states: “…For time series prediction, random shuffling of the data is often considered problematic as it can break the temporal dependency of the data…”. I would like to ask whether this claim is model-dependent. For instance, is it necessarily true for tree-based algorithms such as Random Forests, which do not rely on temporal (Markovian) state updates? While sequential models that rely on temporal dependencies (e.g., autoregressive or state-space models) may indeed be affected, models that only map input–output relationships may not experience the same issue. Could you clarify whether this limitation arises primarily from the choice of model, rather than from the evaluation method itself?
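To make this question concrete, the following sketch (my own construction on synthetic autocorrelated data with lagged features, not the authors' setup) contrasts shuffled folds with non-shuffled, contiguous folds for a model without internal temporal state:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n, lags = 600, 8
series = np.cumsum(rng.normal(size=n))          # synthetic, strongly autocorrelated series
X = np.column_stack([np.roll(series, k) for k in range(1, lags + 1)])[lags:]
y = series[lags:]

rf = RandomForestRegressor(n_estimators=100, random_state=0)
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
blocked = KFold(n_splits=5, shuffle=False)      # contiguous blocks in temporal order

for name, cv in [("shuffled", shuffled), ("blocked", blocked)]:
    scores = cross_val_score(rf, X, y, cv=cv, scoring="r2")
    print(f"{name:8s}: mean R² = {scores.mean():.3f} ± {scores.std():.3f}")

In experiments of this kind, a large gap between the two rows would suggest that the optimism of shuffled folds stems from leakage between highly correlated neighbouring samples rather than from any temporal state inside the model; it would be helpful if the authors could state which of these mechanisms they have in mind.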
7. Around line 175, you refer to the concept of weak stationarity. Could you please clarify what window size is being used to assess this property? Since the definition of weak stationarity can depend on the temporal window considered, this specification would help readers interpret the results correctly.
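As a minimal illustration of why the window matters (synthetic weekly data, not the study's series), the apparent drift of the mean changes strongly with the window length:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
t = np.arange(520)                              # ten years of weekly time steps
y = pd.Series(np.sin(2 * np.pi * t / 52) + 0.002 * t + rng.normal(0, 0.2, t.size))

for w in (26, 104):                             # half-year vs. two-year window
    drift = y.rolling(w).mean().std()           # how much the windowed mean wanders
    print(f"window = {w:3d} weeks: std of rolling mean = {drift:.3f}")

With a half-year window, the seasonal cycle alone makes the rolling mean drift considerably, whereas a two-year window largely averages it out; stating the window used in the manuscript would therefore make the stationarity labels reproducible.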
8. For Section 3.3, I wonder whether the use of dropout would affect the results obtained with the different evaluation methods.
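One way to check this would be to repeat training on a fixed split with different random seeds, which change the weight initialization and the dropout masks, and to compare the resulting score spread with the differences observed between evaluation methods. A rough sketch (synthetic data and a toy 1D-CNN, not the authors' architecture or hyperparameters):

import numpy as np
import torch
import torch.nn as nn

def make_data(n=400, n_feat=3, seq_len=52, seed=0):
    g = np.random.default_rng(seed)
    X = g.normal(size=(n, n_feat, seq_len)).astype("float32")
    y = X.mean(axis=(1, 2)).astype("float32") + g.normal(0, 0.1, n).astype("float32")
    return torch.from_numpy(X), torch.from_numpy(y)

def train_and_score(seed, p_drop=0.5):
    torch.manual_seed(seed)                      # controls init and dropout masks
    X, y = make_data()
    Xtr, ytr, Xte, yte = X[:320], y[:320], X[320:], y[320:]
    model = nn.Sequential(
        nn.Conv1d(3, 8, kernel_size=3), nn.ReLU(), nn.Dropout(p_drop),
        nn.Flatten(), nn.Linear(8 * 50, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(50):                          # short full-batch training
        opt.zero_grad()
        loss = loss_fn(model(Xtr).squeeze(-1), ytr)
        loss.backward()
        opt.step()
    model.eval()                                 # dropout off at test time
    with torch.no_grad():
        return torch.sqrt(loss_fn(model(Xte).squeeze(-1), yte)).item()

scores = [train_and_score(s) for s in range(5)]
print(f"test RMSE over 5 seeds: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

If the seed-to-seed spread turns out to be of the same order as the differences between OOS, repOOS, and bl-CV, dropout-related training noise could partly mask the effect the study aims to isolate.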
9. If I understand correctly, the authors use 80% of the in-set data for model development. Have you evaluated whether a smaller subset of this 80% could achieve comparable accuracy and robustness, and, if so, what the minimum percentage might be? Additionally, would reducing the total amount of data alter the study’s conclusions?
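A chronological learning-curve experiment would answer this directly; a sketch of what I have in mind (a generic regressor on synthetic lag features as a stand-in for the 1D-CNN, with arbitrary fractions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, lags = 800, 8
series = np.cumsum(rng.normal(size=n))
X = np.column_stack([np.roll(series, k) for k in range(1, lags + 1)])[lags:]
y = series[lags:]

split = int(0.8 * len(y))                       # 80 % in-set, 20 % held-out test
Xin, yin, Xte, yte = X[:split], y[:split], X[split:], y[split:]

for frac in (0.25, 0.5, 0.75, 1.0):             # keep only the most recent part of the in-set data
    k = int(frac * len(yin))
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xin[-k:], yin[-k:])
    print(f"{int(frac * 100):3d} % of in-set data: test R² = {r2_score(yte, model.predict(Xte)):.3f}")

Repeating such a curve for a few representative wells would show whether the study's conclusions are sensitive to the amount of data.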
10. Figures 5–7 provide a reasonable and effective way of summarizing the results. That said, are there additional quantitative approaches that could be used to present the findings on spatial maps? Moreover, beyond the stationarity perspective, could further insights be derived in terms of predictive accuracy that would enrich the interpretation of the results?
11. The manuscript fixes the input meteorological sequence length at 52 weeks. Please clarify the basis for this choice. Have alternative horizons been tested? Should the optimal horizon be constant across sites, or might it vary with hydro-geo-climatic setting? If not constant, what insights can be derived from treating this as a site-specific (or region-specific) hyperparameter?
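A site-wise sensitivity check over candidate input lengths would make the choice transparent; something along these lines (my sketch with a synthetic driver-response pair and a linear stand-in model, candidate lengths chosen arbitrarily):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n = 800
met = np.cumsum(rng.normal(size=n))             # synthetic weekly meteorological driver
gwl = np.convolve(met, np.ones(30) / 30, mode="same") + rng.normal(0, 0.1, n)

split = int(0.8 * n)                            # fixed chronological train/test boundary
for seq_len in (13, 26, 52, 104):               # quarter, half, one, and two years of input
    X = np.column_stack([np.roll(met, k) for k in range(seq_len)])[seq_len:]
    y = gwl[seq_len:]
    s = split - seq_len                         # same calendar boundary for every seq_len
    model = Ridge().fit(X[:s], y[:s])
    rmse = mean_squared_error(y[s:], model.predict(X[s:])) ** 0.5
    print(f"input length {seq_len:3d} weeks: test RMSE = {rmse:.3f}")

Plotting the best length per well against site characteristics could then show whether a single 52-week horizon is adequate everywhere, which would directly address the question above.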
References:
Shen, H., Tolson, B.A. and Mai, J., 2022. Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58(3), p.e2021WR031523.
Zheng, F., Chen, J., Maier, H.R. and Gupta, H., 2022. Achieving robust and transferable performance for conservation-based models of dynamical physical systems. Water Resources Research, 58(5), p.e2021WR031818.