A Hierarchical Hydrological Knowledge-guided Attention Network for Groundwater Depth Prediction: Insights from Multi-regional Model Interpretation

Xu, Jing; Mo, Yuming; Zhu, Senlin; Shen, Chengji; Zhu, Xinli; Zhang, Chenming; Jiang, Qihao; Li, Ling

doi:10.5194/egusphere-2026-2379

Preprints

https://doi.org/10.5194/egusphere-2026-2379

02 Jun 2026

Abstract. Given the intensive influence of climate change and anthropogenic activities, accurate groundwater depth (GWD) prediction is essential for sustainable groundwater management. However, existing models struggle to capture spatiotemporal dependencies from complex factors. This study develops a novel Hierarchical Hydrological knowledge-guided Attention Network (HHA-Net) that processes multi-source heterogeneous data through physics-guided encoders, employs adaptive weight allocation and spatiotemporal attention to achieve fourteen-step GWD prediction, and provides insights into groundwater dynamics. Three distinct hydroclimatic and geographical regions in China (128 sites with 233,728 observations) serve as case studies, including the Yanshan-Taihang Mountain Region (YTMR), North China Plain (NCP), and North Jiangsu Plain (NJP). Results show that HHA-Net outperforms baseline models across different sites (natural, agricultural, and urban), with MAPE ranging from 1.02 % to 5.95 % and R² ranging from 0.71 to 0.98. The model demonstrates improved performance under droughts but slightly weaker predictive capability during rainfall events, particularly at natural sites in the YTMR. The geographical encoder dominates GWD in the mountainous YTMR (35.6 %), while the human activity encoder and historical encoder control it in the NCP (32.5 %) and the NJP (36.7 %), respectively. The GWD exhibits prolonged memory effects (25 days) and delayed responses to rainfall (7.5 days) in the YTMR, whereas the over-exploited NCP shows rapid decay (3 days) with negative rainfall thresholds (-0.16) and anthropogenic-dominated patterns. The humid NJP demonstrates low-positive thresholds (0.07) and balanced natural-anthropogenic effects. These findings demonstrate the broad applicability of HHA-Net for GWD prediction and response pattern interpretation across diverse regions, providing scientific support for groundwater management.

Received: 25 Apr 2026 – Discussion started: 02 Jun 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

CC1: 'Comment on egusphere-2026-2379', Marc Ohmer, 18 Jun 2026

The GEMS-GER results are very interesting, but I think the comparison to the original benchmark models should be clarified. Although the temporal setup is comparable, the proposed model incorporates historical groundwater depth values up to t−1 as an autoregressive input, whereas the CNN and LSTM benchmarks in the original GEMS-GER study did not include groundwater level or depth as a dynamic input feature. This provides a meaningful informational advantage, particularly for a one-week-ahead prediction task, and should be explicitly acknowledged when reporting NSE and R² comparisons.
Without accounting for this difference, the results are not directly comparable to the GEMS-GER benchmark models. A fair comparison would require either an ablation experiment excluding the historical GWD/GWL encoder or a re-evaluation of the benchmark models using the same autoregressive input information. Additionally, reporting median performance scores alongside mean values would be useful, as the median is less sensitive to outliers and provides a more robust summary of model performance across heterogeneous sites.
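The point about outlier sensitivity can be illustrated with a minimal sketch. The per-site series below are hypothetical (they are not taken from GEMS-GER), and the helper name `nse` is chosen here for illustration. One nearly flat site with a small constant bias is enough to pull the mean NSE far below zero, while the median stays close to the typical site.

```python
from statistics import mean, median


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of the observations."""
    obs_mean = mean(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical (observed, simulated) pairs, invented for illustration only.
sites = {
    "site_a": ([2.0, 2.5, 3.0, 2.5, 2.0], [2.1, 2.4, 2.9, 2.6, 2.0]),
    "site_b": ([10.0, 11.0, 12.0, 11.0, 10.0], [10.2, 10.9, 11.8, 11.1, 10.1]),
    "site_c": ([1.0, 1.5, 2.0, 2.5, 3.0], [1.2, 1.4, 2.1, 2.3, 3.1]),
    "site_d": ([7.0, 6.0, 5.0, 6.0, 7.0], [6.8, 6.3, 5.4, 5.9, 6.7]),
    # Nearly constant observations: tiny variance, so a 0.3 bias gives a huge negative NSE.
    "site_e": ([5.00, 5.01, 5.00, 4.99, 5.00], [5.3, 5.3, 5.3, 5.3, 5.3]),
}

site_nse = {name: nse(obs, sim) for name, (obs, sim) in sites.items()}
mean_nse = mean(list(site_nse.values()))
median_nse = median(list(site_nse.values()))

print(f"mean NSE   = {mean_nse:.3f}")
print(f"median NSE = {median_nse:.3f}")
```

Four of the five hypothetical sites score above 0.8, yet the mean is negative because NSE is unbounded below; the median is unaffected by how extreme the single outlier is.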

Citation: https://doi.org/10.5194/egusphere-2026-2379-CC1
- AC1: 'Reply on CC1', Jing Xu, 21 Jun 2026
  
  We sincerely thank Dr. Marc Ohmer for making the GEMS-GER benchmark dataset publicly available and for his constructive comments on our manuscript (egusphere-2026-2379), which have greatly improved the rigor of our model evaluation.
  
  We agree with the reviewer’s insight that introducing historical groundwater depth (GWD) as an autoregressive input provides a significant informational advantage for short-term (1-week-ahead) predictions. To address this concern and ensure a rigorous, fair, and direct comparison, we have re-evaluated both the CNN and LSTM benchmark models using the same input configuration as HHA-Net. Specifically, we rebuilt the CNN and LSTM baselines to incorporate the historical GWD sequence (up to t−1) as a dynamic input feature alongside meteorological and static environmental features. As suggested, we now report both the mean and median performance metrics across all 3,207 highly heterogeneous sites in the GEMS-GER dataset to provide a more robust and comprehensive summary. To maintain full transparency and reproducibility, we have uploaded the baseline code, prediction results, and evaluation metrics to Zenodo (https://doi.org/10.5281/zenodo.20774213).
  
  The results show that HHA-Net consistently and significantly outperforms the baselines across all evaluated metrics. Specifically, HHA-Net achieves a mean R² of 0.901 (median: 0.954), outperforming the CNN baseline (mean R²: 0.834, median: 0.908) and the LSTM baseline (mean R²: 0.877, median: 0.941). Owing to the spatial heterogeneity across the 3,207 sites, the mean NSE of the CNN and LSTM baselines remains negative (-2.299 and -1.806, respectively), while their median NSE values are 0.763 and 0.826. In contrast, HHA-Net maintains a positive mean NSE of 0.348 alongside a superior median NSE of 0.863, demonstrating its capacity to generalize robustly across highly heterogeneous geographic regions.
  
  We have revised the corresponding paragraph in the manuscript as follows:
  
  “Moreover, we have further validated our model on the benchmark dataset GEMS-GER, which includes GWD from 3,207 groundwater sites in Germany (Ohmer et al., 2026). To provide a rigorous and fair comparison, the CNN and LSTM benchmark models were evaluated using the same input configuration as HHA-Net (i.e., input of historical GWD). Using 7 weeks of historical GWD and environmental features to predict 1 week ahead, HHA-Net achieved robust predictive performance (mean R² = 0.901, median R² = 0.954), consistently outperforming the CNN baseline (mean R² = 0.834, median R² = 0.908) and the LSTM baseline (mean R² = 0.877, median R² = 0.941) (Table S6). Notably, due to the high spatial heterogeneity across the 3,207 sites, the mean NSE of the baselines remains negative (-2.299 for CNN and -1.806 for LSTM), whereas their median NSE values are 0.763 and 0.826, respectively. In contrast, HHA-Net maintains a positive mean NSE of 0.348 and a superior median NSE of 0.863. This demonstrates the enhanced robustness and spatial generalization capability of HHA-Net.”
  
  The revised performance comparison has been added as Table S6 in the Supplementary Material.
  
  Citation: https://doi.org/10.5194/egusphere-2026-2379-AC1
  - CC3: 'Reply on AC1', Marc Ohmer, 25 Jun 2026
    
    Thank you very much for the prompt and comprehensive revision. I appreciate that the CNN and LSTM baselines were re-evaluated using the same autoregressive input configuration as HHA-Net and that both mean and median performance metrics are now reported. This substantially improves the transparency and fairness of the benchmark comparison.
    The updated results look promising and, from my perspective, adequately address my main concern regarding the comparability of the benchmark models.
    
    Citation: https://doi.org/10.5194/egusphere-2026-2379-CC3
    
    AC3: 'Reply on CC3', Jing Xu, 29 Jun 2026
    
    We sincerely thank Dr. Marc Ohmer for his positive feedback and validation of our revisions. Once again, we appreciate Dr. Marc Ohmer’s constructive comments, which have greatly improved the fairness and rigor of our model comparison.
    
    Citation: https://doi.org/10.5194/egusphere-2026-2379-AC3
RC1: 'Comment on egusphere-2026-2379', Anonymous Referee #1, 25 Jul 2026

In this study, the authors propose a novel deep learning architecture for groundwater level (GWL) forecasting. In a broad and comprehensive empirical evaluation across multiple geographical, meteorological and other conditions, the authors find that the proposed model performs better than a number of baseline models.

The study investigates a number of very interesting angles on ML time series models for GWL forecasting.

Probably most importantly, the authors take a close look at how different predictive variables, such as human influences, meteorological forcings or geological features, can be combined to improve GWL forecasts, and at how these influences vary over time, over space, and across different meteorological regimes.

The authors investigate and report a number of different metrics. The selection is very sensible and provides a good intuition across the different aspects of what the models capture.

The evaluation and experimental setup is very well organised and conducted. Especially the analysis and discussion of the impact of various geological, hydrological and other factors on the forecasting quality is interesting for a broad community of researchers and beyond.

In the evaluation, the authors cover a broad spectrum of geological, meteorological and societal conditions. Such a comprehensive account of GWL forecasting challenges is an important contribution. Could the authors share the data set publicly for research purposes?

A number of ideas in the modelling seem novel and helpful. This includes for instance the combination of multiple loss functions, the ideas of the multiple encoders and the combination of data sources. In some cases (e.g. the loss functions or the model architecture), it could be helpful to investigate some more aspects that could clarify the benefits of those methodological contributions.

Areas for improvement

Generally, it would be great if all reported metrics included percentiles (for example 25 %/50 %/75 %) rather than average values and standard deviations. For the ablation studies in particular, but also elsewhere (for instance Table 2), this would give a better understanding of the range of the metrics.

It can be difficult to avoid leakage from test data into validation or training data, especially when data cleaning or preprocessing (such as imputation, normalisation and other measures) is performed. It would be good if the authors stated explicitly that none of the test data was used in any preprocessing or data engineering step. This seems like a minor detail, but getting the mean right for the test period (which is a lot easier when the test data are used in a normalisation step) can have a large impact on the models, as the authors nicely show in the ablation studies on the temporal encoder.
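A minimal sketch of leakage-free normalisation, assuming a chronological train/test split and z-score scaling (the manuscript's actual preprocessing is not described here, and the series and function names are hypothetical): the statistics are estimated on the training period only and then reused unchanged for the test period.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Estimate mean and standard deviation from the training period only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # guard against constant series
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously fitted statistics; never refit on the data being transformed."""
    return [(v - mu) / sigma for v in values]


# Hypothetical series with a trend, so the test period has a different mean.
series = [float(v) for v in range(1, 11)]
split = 7  # chronological split: first 7 steps train, last 3 test
train, test = series[:split], series[split:]

mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)

# Leaky variant, shown for contrast only: statistics computed on train + test.
leaky_mu, leaky_sigma = fit_standardizer(series)

print("train-only mean:", mu, "leaky mean:", leaky_mu)
print("test (train-only scaling):", test_z)
```

In the leaky variant the fitted mean shifts towards the test period, which is exactly the information a forecasting model should not have at training time.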

The results in Table 2 could be discussed in more detail with respect to the surprisingly low NSE for all baseline models in YTMR and 4 out of 6 models in NJP. This also relates to the point on the hyperparameter optimization below; it seems like with a bit more optimization (of hyperparameters or the model parameters themselves) for the other models, they would probably reach better NSEs.

For the comparisons with the baseline models, it would be helpful to elaborate on the hyperparameters. The authors write that “All baseline models adopt identical training strategies and hyperparameter settings as HHA-Net to ensure fair comparative experiments.” However, different models usually require different hyperparameters. It is good practice to conduct some form of hyperparameter optimization to ensure fair comparisons, rather than using the same hyperparameters for all models. While hyperparameters often have a limited impact on the outcome, in this case I would expect that the individual architectures, dropout rates and optimization parameters could change the results.
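One simple way to give each baseline its own tuning budget is a random search per model. The sketch below is illustrative only: the search spaces, the model names, and `toy_validation_rmse` (a stand-in for "train on the training period, return RMSE on the validation period") are all assumptions, not settings from the manuscript.

```python
import random


def random_search(score_fn, space, n_trials, seed=0):
    """Sample configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_validation_rmse(cfg):
    """Hypothetical objective; a real one would train a model and score it on validation data."""
    return (
        abs(cfg["learning_rate"] - 1e-3) * 100
        + abs(cfg["dropout"] - 0.2)
        + 1.0 / cfg["hidden_size"]
    )


# Hypothetical per-model search spaces: each model gets its own space and budget.
search_spaces = {
    "lstm": {"learning_rate": [1e-4, 1e-3, 1e-2], "dropout": [0.0, 0.2, 0.5], "hidden_size": [32, 64, 128]},
    "cnn": {"learning_rate": [1e-4, 1e-3], "dropout": [0.0, 0.2], "hidden_size": [16, 32, 64]},
}

best = {
    model: random_search(toy_validation_rmse, space, n_trials=20, seed=42)
    for model, space in search_spaces.items()
}

for model, (cfg, score) in best.items():
    print(model, cfg, round(score, 4))
```

The selection uses only a validation score, so the test period stays untouched until the final comparison.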

For the comparison with the GEMS-GER data, it would be helpful to see some more details on how the different encoders were used and how that relates to or differs from the models used on those data sets - that would help to assess whether the performance increase comes from the model architecture or the input data. Also in that comparison, percentiles would be better than averages.

The design decision to predict a 14-day window at once could be motivated better: other approaches predict a single time step in a rolling-window fashion, and some experiments in this study do so as well.

The idea to combine multiple loss functions is interesting and could be investigated in a bit more detail. Did the authors observe that the results changed significantly without that multi-objective optimization, or with different weights of the individual loss functions? How were the weights optimized? Do the relative differences between models change with the weightings and without the multiple losses?
The large differences between the proposed model and all baselines seem at odds with existing literature and prior empirical evidence from other geographical regions. The very low NSE values for some geographical regions also appear puzzling, as do the differences between NSE and R² (which are often used synonymously and are considered equivalent under some assumptions; it is good that the authors comment on this). Presenting quantiles and conducting hyperparameter optimization (HPO) should address this and make it easier to assess the results and their validity.
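The gap between NSE and R² can be reproduced in a small sketch, assuming R² is computed as the squared Pearson correlation (which definition the manuscript uses is not stated here; if R² is instead defined as 1 − SSE/SST against the observed mean, it coincides with NSE). The data are hypothetical: a prediction with a constant offset is perfectly correlated with the observations but has a strongly negative NSE.

```python
from statistics import mean


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency; penalizes bias as well as timing errors."""
    obs_mean = mean(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - sse / sst


def r_squared(observed, simulated):
    """Squared Pearson correlation; invariant to constant offsets and rescaling."""
    mo, ms = mean(observed), mean(simulated)
    cov = sum((o - mo) * (s - ms) for o, s in zip(observed, simulated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_s = sum((s - ms) ** 2 for s in simulated)
    return cov ** 2 / (var_o * var_s)


# Hypothetical groundwater depth series and a prediction offset by a constant 2.0.
observed = [4.0, 5.0, 6.0, 5.0, 4.0]
biased = [o + 2.0 for o in observed]

r2_biased = r_squared(observed, biased)
nse_biased = nse(observed, biased)

print(f"R^2 = {r2_biased:.3f}, NSE = {nse_biased:.3f}")
```

Under this definition a high R² with a low NSE points to systematic bias or a scale mismatch at a site rather than to poorly captured dynamics.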

Minor remarks:

Line 268: “Euclidean data”. This term could be explained in more detail. Neural networks, and in fact all of the baseline methods, are able to model data on nonlinear manifolds too, if that is what the authors mean by Euclidean data.

How were the architectural parameters chosen? Was there a neural architecture search or hyperparameter optimization, or were the choices motivated empirically or by prior work?

Table 1. Summary of datasets across the three study regions.

It would be great if the authors could report quantiles/percentiles instead of mean and standard deviations, that would give a much better overview of the sites.

Fig 6: I think this is a great comparison with interesting insights, but especially in the top left panel it would be helpful to maybe present it as a box plot across groups, rather than a line plot.

Citation: https://doi.org/10.5194/egusphere-2026-2379-RC1
RC2: 'Comment on egusphere-2026-2379', Anonymous Referee #2, 04 Aug 2026

This manuscript develops a Hierarchical Hydrological Knowledge-guided Attention Network (HHA-Net) for 14-day groundwater-depth forecasting across three hydroclimatically distinct regions in China. The model integrates historical groundwater depth, meteorological conditions, geographical characteristics, and human activities through specialized encoders, site-type-aware feature fusion, and spatiotemporal attention mechanisms. A key methodological innovation is the four-dimensional spatial-similarity representation, which accounts for geographic distance, aquifer type, land-use-based site type, and watershed affiliation rather than relying solely on geographic proximity. The model is evaluated using 128 monitoring sites and more than 230,000 daily observations and is compared with several baseline models, including MLP, CNN-LSTM, Transformer, GNN, and ST-GNN. HHA-Net generally outperforms the selected baseline implementations. The main scientific finding is that the dominant predictive controls vary among regions, with geographical factors being most important in the mountainous region, human activities in the intensively exploited plain, and historical groundwater conditions in the humid coastal plain. Overall, the framework is methodologically innovative, and the large multi-regional dataset provides a valuable basis for evaluating groundwater prediction under contrasting hydrogeological and anthropogenic conditions.
My main comment is:

Line 266: ‘All baseline models adopt identical training strategies and hyperparameter settings as HHA-Net to ensure fair comparative experiments.’ This raises the concern that the baseline models may not have been adequately optimized. The authors should clarify how each baseline was trained and tuned to demonstrate that HHA-Net’s improved performance results from its framework rather than from comparison with poorly optimized baseline models.
Figures:

Figure 5 presents four performance metrics, but the discussion mainly highlights a few poorly performing sites without explaining differences among the spatial patterns. Some sites show low MAPE or RMSE but weak NSE or R², suggesting that the model captures mean groundwater depth better than temporal variability. The authors should discuss these discrepancies and relate site-level performance to hydrogeological and land-use characteristics.
Figure 6 lacks panel labels (a–d) and clear legends for panels (b–d), making it difficult to interpret. A simplified legend, direct point labels, or clearer panel separation would improve readability. Please also clarify whether all region–site-type combinations are shown and identify any excluded combinations.
Panels 7a and 7b present largely the same region- and site-type-specific bootstrap uncertainty information, with panel 7b showing a normalized radar-chart version of panel 7a. The authors should consider removing one panel or replacing it with an analysis that provides additional insight.
Figure 9 reports rainfall thresholds of 0.70, −0.16, and 0.07, but the manuscript does not clearly explain the rainfall normalization, the corresponding values in physical units, whether these thresholds are site-specific or regional averages, or how a negative normalized threshold should be interpreted. Referring to −0.16 as a “negative rainfall threshold” may therefore be misleading, as it could be mistaken for physically negative rainfall.
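The interpretation issue can be made concrete with a minimal sketch, assuming the thresholds are expressed in z-score units (the manuscript's actual normalization is not known here, and the rainfall mean and standard deviation below are hypothetical). Under that assumption, a negative normalized threshold maps back to a positive rainfall amount that is simply below the mean.

```python
def denormalize(z, mu, sigma):
    """Invert z-score scaling: physical value = mean + z * standard deviation."""
    return mu + z * sigma


# Normalized thresholds as quoted for Figure 9.
normalized_thresholds = [0.70, -0.16, 0.07]

# Hypothetical rainfall statistics in mm/day, for illustration only.
rain_mean = 2.0
rain_std = 6.0

physical_thresholds = [denormalize(z, rain_mean, rain_std) for z in normalized_thresholds]

for z, x in zip(normalized_thresholds, physical_thresholds):
    print(f"normalized {z:+.2f} -> {x:.2f} mm/day")
```

If min-max scaling to [0, 1] had been used instead, no threshold could be negative, so stating the normalization and the corresponding physical values would remove the ambiguity.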
Minor comments:

Lines 435-440: the statement that removing the geographical encoder caused the largest performance degradation in the YTMR is inconsistent with the reported ablation results. Removing the historical encoder produced a 2213.2% increase in RMSE, compared with 581.2% for the geographical encoder. Please correct the inconsistency.
Lines 449-450: the statement is incorrect for the NJP. Removing the CNN branch caused greater degradation than removing the LSTM branch only in the YTMR (1639.4% versus 48.9%). In the NJP, removing the LSTM branch caused the larger degradation (540.3% versus 97.7%).
Lines 682 and 752, please be consistent on whether the maximum lag is 5 days or 15 days.

Citation: https://doi.org/10.5194/egusphere-2026-2379-RC2

Supplement

https://doi.org/10.5194/egusphere-2026-2379-supplement

Model code and software

HHA-Net for groundwater depth prediction, Jing Xu, https://doi.org/10.5281/zenodo.18130111

Short summary

Groundwater is a vital resource, yet predicting its depth is challenging due to climate and human impacts. We developed a hydrology-guided deep learning model to forecast groundwater depth (GWD) and reveal its drivers. The model demonstrated excellent performance across different site types. We found that geography controls mountain groundwater, human activities dominate inland plains, and coastal areas show balanced influences. Our model provides a scientific foundation for sustainable groundwater management worldwide.