the Creative Commons Attribution 4.0 License.
A Hierarchical Hydrological Knowledge-guided Attention Network for Groundwater Depth Prediction: Insights from Multi-regional Model Interpretation
Abstract. Given the intensive influence of climate change and anthropogenic activities, accurate groundwater depth (GWD) prediction is essential for sustainable groundwater management. However, existing models struggle to capture spatiotemporal dependencies from complex factors. This study develops a novel Hierarchical Hydrological knowledge-guided Attention Network (HHA-Net) that processes multi-source heterogeneous data through physics-guided encoders, employs adaptive weight allocation and spatiotemporal attention to achieve fourteen-step GWD prediction, and provides insights into groundwater dynamics. Three distinct hydroclimatic and geographical regions in China (128 sites with 233,728 observations) serve as case studies, including the Yanshan-Taihang Mountain Region (YTMR), North China Plain (NCP), and North Jiangsu Plain (NJP). Results show that HHA-Net outperforms baseline models across different sites (natural, agricultural, and urban), with MAPE ranging from 1.02 % to 5.95 % and R² ranging from 0.71 to 0.98. The model demonstrates improved performance under droughts but slightly weaker predictive capability during rainfall events, particularly at natural sites in the YTMR. The geographical encoder dominates GWD in the mountainous YTMR (35.6 %), while the human activity encoder and historical encoder control it in the NCP (32.5 %) and the NJP (36.7 %), respectively. The GWD exhibits prolonged memory effects (25 days) and delayed responses to rainfall (7.5 days) in the YTMR, whereas the over-exploited NCP shows rapid decay (3 days) with negative rainfall thresholds (-0.16) and anthropogenic-dominated patterns. The humid NJP demonstrates low-positive thresholds (0.07) and balanced natural-anthropogenic effects. These findings demonstrate the broad applicability of HHA-Net for GWD prediction and response pattern interpretation across diverse regions, providing scientific support for groundwater management.
Status: open (until 14 Jul 2026)
CC1: 'Comment on egusphere-2026-2379', Marc Ohmer, 18 Jun 2026
AC1: 'Reply on CC1', Jing Xu, 21 Jun 2026
We sincerely thank Dr. Marc Ohmer for making the GEMS-GER benchmark dataset publicly available and for his constructive comments on our manuscript (egusphere-2026-2379), which have greatly improved the rigor of our model evaluation.
We agree with the reviewer’s insight that introducing historical groundwater depth (GWD) as an autoregressive input provides a significant informational advantage for short-term (1-week-ahead) predictions. To address this concern and ensure a fair, direct comparison, we re-evaluated both the CNN and LSTM benchmark models using the same input configuration as HHA-Net. Specifically, we rebuilt the CNN and LSTM baselines to take the historical GWD sequence (up to t-1) as a dynamic input feature alongside the meteorological and static environmental features. As suggested, we now report both the mean and median performance metrics across all 3,207 sites in the GEMS-GER dataset to provide a more robust summary. For transparency and reproducibility, we have uploaded the baseline code, prediction results, and evaluation metrics to Zenodo (https://doi.org/10.5281/zenodo.20774213).
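The input configuration described above can be illustrated with a minimal sketch of how autoregressive samples might be assembled for one site. The function name `make_windows`, the array layout, and the choice to align meteorological features with the same history window are assumptions made for illustration; the 7-week history and 1-week horizon are taken from the revised manuscript paragraph, and the actual baseline code is the one archived on Zenodo.

```python
import numpy as np

def make_windows(gwd, met, static, history=7, horizon=1):
    """Build autoregressive samples for a single site (illustrative only).

    gwd:    (T,)   weekly groundwater depth
    met:    (T, F) weekly meteorological features
    static: (S,)   time-invariant site attributes
    """
    gwd = np.asarray(gwd, dtype=float)
    met = np.asarray(met, dtype=float)
    static = np.asarray(static, dtype=float)

    x_dyn, x_static, y = [], [], []
    # t is the first week to be predicted; the history covers t-history .. t-1,
    # so no observation from week t or later leaks into the inputs.
    for t in range(history, len(gwd) - horizon + 1):
        past_gwd = gwd[t - history:t, None]   # (history, 1), last row is t-1
        past_met = met[t - history:t]         # (history, F)
        x_dyn.append(np.concatenate([past_gwd, past_met], axis=1))
        x_static.append(static)
        y.append(gwd[t + horizon - 1])        # target `horizon` weeks ahead
    return np.stack(x_dyn), np.stack(x_static), np.array(y)


# Toy data: 20 weeks, 3 meteorological features, 2 static attributes.
gwd_demo = np.arange(20.0)
met_demo = np.ones((20, 3))
static_demo = np.array([1.0, 2.0])
Xd, Xs, yd = make_windows(gwd_demo, met_demo, static_demo)
```

Feeding the same `Xd` and `Xs` arrays to every model is what makes the comparison like-for-like: any remaining performance gap then reflects the architecture rather than the information available to it.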
The results show that HHA-Net outperforms both baselines on every reported metric. HHA-Net achieves a mean R² of 0.901 (median: 0.954), compared with the CNN baseline (mean R²: 0.834, median: 0.908) and the LSTM baseline (mean R²: 0.877, median: 0.941). Because of the spatial heterogeneity across the 3,207 sites, the mean NSE of the CNN and LSTM baselines remains negative (-2.299 and -1.806, respectively), while their median NSE values are 0.763 and 0.826. In contrast, HHA-Net maintains a positive mean NSE of 0.348 and a higher median NSE of 0.863, which indicates that it generalizes more robustly across highly heterogeneous sites.
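The contrast between a high mean R² and a negative mean NSE can be made concrete with a short sketch. It assumes that R² is computed as the squared Pearson correlation, which the reply does not state; if R² were instead the coefficient of determination relative to the observed mean, it would coincide with NSE. The site values below are invented purely for illustration.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SS_res / SS_tot (unbounded below)."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def r2_pearson(obs, sim):
    """Squared Pearson correlation (assumed R² definition, bounded in [0, 1])."""
    r = np.corrcoef(obs, sim)[0, 1]
    return r ** 2

# Hypothetical site: the simulation tracks the dynamics perfectly
# but carries a constant 1 m offset.
obs = np.array([10.0, 10.2, 10.1, 10.4, 10.3, 10.6])
sim = obs + 1.0
site_r2 = r2_pearson(obs, sim)   # close to 1: correlation ignores the bias
site_nse_value = nse(obs, sim)   # strongly negative: the bias is penalized

# Hypothetical per-site NSE scores with a single badly biased site.
site_nse = np.array([0.90, 0.85, 0.80, 0.88, -25.0])
mean_nse = site_nse.mean()        # dragged below zero by one outlier
median_nse = np.median(site_nse)  # stays at a typical site's value
```

Under this reading, a handful of sites with large bias or low variance is enough to pull the mean NSE negative while the median and the correlation-based R² remain high, which is why reporting both summaries is informative.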
We have revised the corresponding paragraph in the manuscript as follows:
“Moreover, we have further validated our model on the benchmark dataset GEMS-GER, which includes GWD from 3,207 groundwater sites in Germany (Ohmer et al., 2026). To provide a rigorous and fair comparison, the CNN and LSTM benchmark models were evaluated using the same input configuration as HHA-Net (i.e., input of historical GWD). Using 7 weeks of historical GWD and environmental features to predict 1 week ahead, HHA-Net achieved robust predictive performance (mean R² = 0.901, median R² = 0.954), consistently outperforming the CNN baseline (mean R² = 0.834, median R² = 0.908) and LSTM baseline (mean R² = 0.877, median R² = 0.941) (Table S6). Notably, due to the high spatial heterogeneity across the 3,207 sites, the mean NSE of the baselines remains negative (mean NSE of -2.299 for CNN and -1.806 for LSTM), whereas their median NSE values are 0.763 and 0.826, respectively. In contrast, HHA-Net maintains a positive mean NSE of 0.348 and a superior median NSE of 0.863. This demonstrates the enhanced robustness and spatial generalization capability of HHA-Net.”
The revised performance comparison has been added as Table S6 in the Supplementary Material.
CC2: 'Comment on egusphere-2026-2379', Marc Ohmer, 18 Jun 2026
The GEMS-GER results are very interesting, but I think the comparison to the original benchmark models should be clarified. Although the temporal setup is comparable, the proposed model incorporates historical groundwater depth values up to t−1 as an autoregressive input, whereas the CNN and LSTM benchmarks in the original GEMS-GER study did not include groundwater level or depth as a dynamic input feature. This provides a meaningful informational advantage, particularly for a one-week-ahead prediction task, and should be explicitly acknowledged when reporting NSE and R² comparisons.
Without accounting for this difference, the results are not directly comparable to the GEMS-GER benchmark models. A fair comparison would require either an ablation experiment excluding the historical GWD/GWL encoder or a re-evaluation of the benchmark models using the same autoregressive input information. Additionally, reporting median performance scores alongside mean values would be useful, as the median is less sensitive to outliers and provides a more robust summary of model performance across heterogeneous sites.
Citation: https://doi.org/10.5194/egusphere-2026-2379-CC2
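The informational advantage of an autoregressive input can be illustrated with a naive persistence forecast, which simply repeats the last observed value. This is a sketch on a synthetic, slowly varying weekly series, not an analysis of GEMS-GER data; the series parameters and the helper `persistence_forecast` are assumptions chosen for illustration.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SS_res / SS_tot."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def persistence_forecast(gwd, horizon=1):
    """Predict each value with the observation `horizon` steps earlier."""
    gwd = np.asarray(gwd, dtype=float)
    return gwd[horizon:], gwd[:-horizon]   # (observed, predicted)

# Synthetic 10-year weekly series: annual cycle plus small noise (assumed values).
rng = np.random.default_rng(0)
weeks = np.arange(520)
gwd_series = 5.0 + np.sin(2 * np.pi * weeks / 52) + 0.05 * rng.standard_normal(520)

obs_p, sim_p = persistence_forecast(gwd_series, horizon=1)
persistence_nse = nse(obs_p, sim_p)
```

On a smooth series like this one, the previous week's value alone already yields a high NSE at a one-week horizon, so a model that receives GWD up to t-1 starts from a much stronger position than one driven only by meteorological and static inputs. A persistence score of this kind can serve as a lower reference when interpreting autoregressive results.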
AC2: 'Reply on CC2', Jing Xu, 21 Jun 2026
Model code and software
HHA-Net for groundwater depth prediction Jing Xu https://doi.org/10.5281/zenodo.18130111
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 98 | 28 | 8 | 134 | 18 | 6 | 4 |