Towards a Global Spatial Machine Learning Model for Seasonal Groundwater Level Predictions in Germany

Kunz, Stefan; Schulz, Alexander; Wetzel, Maria; Nölscher, Maximilian; Chiaburu, Teodor; Biessmann, Felix; Broda, Stefan

doi:10.5194/egusphere-2024-3484

Preprints

19 Nov 2024

Abstract. Reliable predictions of groundwater levels are crucial for sustainable groundwater resource management, which needs to balance diverse water needs and to address potential ecological consequences of groundwater depletion. Machine Learning (ML) approaches for time series prediction, in particular, have shown promising predictive accuracy for groundwater level prediction and have scalability advantages over traditional numerical methods when sufficient data is available. Global ML architectures enable predictions across numerous monitoring wells concurrently using a single model, allowing predictions for monitoring wells over a broad range of hydrogeological and meteorological conditions and simplifying model management. In this contribution, groundwater levels were predicted up to 12 weeks ahead for 5,288 monitoring wells across Germany using two state-of-the-art ML approaches, the Temporal Fusion Transformer (TFT) and the Neural Hierarchical Forecasting for Time Series (N-HiTS) algorithm. The models were provided with historical groundwater levels, meteorological features and a wide range of static features describing hydrogeological and soil properties at the wells. To determine the conditions under which the models achieve good performance and whether this aligns with hydrogeological system understanding, model performance was evaluated spatially, and correlations with both static input features and time-series features from hydrograph data were examined.

The N-HiTS model outperformed the TFT model, achieving a median NSE of 0.5 for the 12-week prediction over all 5,288 monitoring wells. Performance varied widely: 25 % of wells achieved an NSE > 0.68, while 15 % had an NSE < 0 with the best N-HiTS model. A tendency for better predictions in areas with high data density was observed. Moreover, the models achieved higher performance in lowland areas with distinct seasonal groundwater dynamics, in monitoring wells located in porous aquifers, and at sites with moderate permeabilities, which aligns with theoretical expectations. Overall, the findings highlight that global ML models can facilitate accurate seasonal groundwater predictions over large, hydrogeologically diverse areas, potentially informing future groundwater management practices at a national scale.
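For readers less familiar with the metric, the Nash-Sutcliffe efficiency (NSE) quoted above compares the squared prediction error with the squared deviations of the observations from their mean. The following is a minimal sketch of the standard definition; the function name `nse` and the sample values are illustrative assumptions and are not taken from the authors' code.

```python
def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations from the observed mean."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - sse / spread


# Made-up groundwater levels in metres (illustration only)
obs = [10.0, 10.5, 11.0, 10.5, 10.0]
perfect = nse(obs, obs)            # a perfect prediction
mean_only = nse(obs, [10.4] * 5)   # always predicting the observed mean
too_high = nse(obs, [12.0] * 5)    # a constant prediction far from the data
```

Under this definition a value of 1 indicates a perfect fit, 0 corresponds to always predicting the observed mean, and negative values indicate predictions worse than that mean.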

Received: 07 Nov 2024 – Discussion started: 19 Nov 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

01 Aug 2025

Towards a global spatial machine learning model for seasonal groundwater level predictions in Germany

Stefan Kunz, Alexander Schulz, Maria Wetzel, Maximilian Nölscher, Teodor Chiaburu, Felix Biessmann, and Stefan Broda

Hydrol. Earth Syst. Sci., 29, 3405–3433, https://doi.org/10.5194/hess-29-3405-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-3484', Anonymous Referee #1, 05 Dec 2024
This manuscript evaluates the performance of two machine learning models in predicting groundwater (GW) levels across a dataset of ~5000 wells in Germany. The study examines the influence of both dynamic and static input features on the accuracy of GW level predictions and seeks to enhance the understanding of hydrogeological systems.
The objectives, methodology, results, and discussion are clear, well-structured, and thoroughly explained. The study aligns well with the scope of the journal, and HESS readers would benefit from and appreciate its findings. In my opinion, the manuscript is close to its final form. However, I would like to raise the following points for consideration:
The manuscript specifies particular values for hyperparameters (e.g., dropout rate, batch size, etc.). Are these values based on specific rules or conventions? Did you test alternative values? While this may not significantly affect the overall conclusions, I believe it would be helpful to clarify this for the reader.

Is there a specific reason for setting the prediction horizon to a maximum of 12 weeks?

Figures B9 and B10 indicate that attention is higher one year before the prediction than at times closer to it. Could you elaborate on why this happens?

In figure B10, why are attention values not zero in the interval of 0–10 weeks? Does this imply that the algorithm is somehow using inputs from these time steps? I suggest including a diagram to illustrate how inputs and outputs operate in the ML algorithms (e.g., similar to Figure 1 in Kratzert et al., 2018). This would help clarify which specific information is being utilized and when.
Citation: https://doi.org/10.5194/egusphere-2024-3484-RC1
- AC1: 'Reply on RC1', Stefan Kunz, 17 Jan 2025
  
  We thank Reviewer 1 for the positive assessment of our manuscript. The suggestions have been addressed by us in the attached PDF (marked in blue).
  
  Citation: https://doi.org/10.5194/egusphere-2024-3484-AC1
RC2:
'Comment on egusphere-2024-3484', Abel Henriot, 15 Dec 2024

General comments :
The paper addresses the need for groundwater modeling using exclusively machine learning approaches. Although this field of research is not completely new, the use of two specific model architectures (TFT and N-HiTS) has never been published in hydrogeology, which represents a novelty. In addition, the study combines both static and dynamic features in the dataset. Static features have been carefully selected to represent hydrogeological properties or links with groundwater levels that hydrogeologists can easily grasp.
The very large number of wells used for training is also worth noting as a novelty and represents a clear advance in the field of hydrogeology, as it suggests that this model architecture could be suitable for even larger datasets. On top of this, the results are confronted with hydrogeological expertise, which is both rarely found in the literature and very insightful.
The overall structure is clear and leads the reader to a detailed understanding of methods, workflow, results and main outcomes.
The objectives or motivation are explained slightly less clearly. I think a paragraph of two to three sentences at the end of the introduction that states all the motivations more clearly could be useful. From my understanding:

* more efficient models for an enlarged dataset compared to the previous study (Heudorfer et al.)

* more capabilities to handle static and dynamic features

* need to enhance the overall performance (NSE of 0.8 in Heudorfer et al.)

* need to understand/evaluate the impact of introducing static features

* capability of the models to stick to plausible hydrogeological concepts
The results section has been kept short, with the help of complementary material, and only the key points have been reported. This is one of the good points of this paper.
The discussion highlights very crucial considerations, in particular regarding the generalization capacities of these models and the availability of information for static features.
I have a minor issue with the fact that the water level itself has been included in the input features. While this is a common way to proceed and is perfectly grounded, a comparison is made with competing models that did not use this feature. I wonder about the reason for this choice and its overall impact.

Lastly, most of the references are to German works, and I missed a more general overview of ML for groundwater when it comes to the overall performance of the models.

All in all, this paper is, to me, an important step for the community, and the code has been made available, which should encourage others to try it on their own datasets.
Specific comments
2.1.1: the time step of the measurements is not reported, and there is no mention of re-scaling where needed (harmonisation of the time resolution)
line 88: 50 times the average: on what basis was this figure chosen? Trial and error? User expertise? It is surprisingly far from commonly seen methods such as '3 standard deviations', 1.5 IQR, z-scores, etc.
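For comparison, the conventional rules named in this comment can be sketched as follows. This is an illustration only: the function names, thresholds and sample series are assumptions, and the sketch does not reproduce the manuscript's '50 times the average' criterion, whose exact definition is not given here.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up weekly groundwater levels (m) with one implausible spike at the end
levels = [10.0, 10.1, 9.9, 10.2, 10.0, 9.8, 10.1, 10.0, 9.9, 10.1] * 2 + [25.0]
flagged_by_zscore = zscore_outliers(levels)
flagged_by_iqr = iqr_outliers(levels)
```

Both rules flag only the final spike in this toy series. On real hydrographs with strong seasonality or trends, such global thresholds may behave quite differently, which is one reason the choice of criterion deserves justification.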
line 89: 4 weeks: this could be challenged, as hydraulic head variability could be low or high within this time period depending on the influences to which the well reacts
line 91: it would be good to justify that the time series are in steady state and that 1996-2010, 2010-2013 and 2013-2016 exhibit no major change in time series characteristics, so that the train/validation/test datasets share similar (if not equal) characteristics
end of § 2.1.1: it is hard to figure out, with no prior knowledge of Germany, whether the few missing data would lead to ignoring some specific climatic, geological or hydrogeological contexts. I encourage the authors to add a short sentence in this direction.
2.1.2 dynamic features:
Snow, wind and solar radiation have not been used. I would expect snow cover and snow melt to play a role in some years and in some parts of Germany, and solar radiation and wind to play a role in the evapotranspiration process. Even though I agree that not all meteorological features can be considered, it could be insightful for the reader to know whether these variables are available in HYRAS and why they have not been included in the dataset (redundancy, for example?).
line 101: 'which have a strong influence on groundwater' does not seem to be grounded in any scientific argument. I would suggest rephrasing: (meteorological datasets that are regularly used in modeling (reservoir or distributed models), or in similar studies (add some international references))
102, 103: units are written in parentheses; it may be easier or better to integrate them in Table 1, and for all variables
105: the grid size is not given, so the impact of the weighted average step is unclear (averaging several pixels, i.e. the 1 km buffer includes several (how many?) smaller pixels, or downsampling, i.e. one pixel is much larger than the 1 km buffer)
line 109, 110: "The meteorological input features and the LAI were extracted within a 1 km buffer around the groundwater wells. Thereby, a weighted average was calculated based on the area covered by the pixels within the buffer." This should be challenged or at least discussed in light of my previous remark; one could argue that averaging is not always the best method
line 111: I think the last sentence, "Dynamic features were divided [...] groundwater time series", would be better moved to Section 2.3 (experimental design and training)
line 113: "The static features used in the study are environmental characteristics from the domains hydrogeology, soil, topography and land cover (see Table 2)." The sentence sounds weird to me. Maybe the verb 'are' could be changed to 'cover'
table 1: would it be possible to describe which variables are considered 'meteorological input features' (see lines 102, 103)?
line 125: 'vanilla LSTM'. Considering that this paper could be read by hydrogeologists with no prior knowledge of computer science, I suggest avoiding this term, which sounds like technical jargon
2.2
The model architectures are well described, even if it remains unclear whether some steps have been specifically designed for the study or are part of the overall TFT model (e.g. "The importance of each feature is then represented by the average of the variable selection weights over all time steps.")
I would suggest splitting 2.2.1 into two parts: 1) overall description of TFT and 2) implementation within the frame of the study
However, I still think that the motivation for the new architectures is not sufficiently justified, even though one can almost infer some of it:
* more efficient models for an enlarged dataset compared to the previous study (Heudorfer et al.)
* more capabilities to handle static and dynamic features
* need to enhance the overall performance (NSE of 0.8 in Heudorfer et al.)
The strategy for model evaluation is clever and robust, with consideration for the hydrogeological context.

§2.3 - experimental design and training
I would expect to find the global strategy for train/validation/test here and not in 'data (§2.1.3)'. Also, a complementary strategy could have been to split on the wells themselves, leaving aside a given proportion of wells and their related data. I think explaining why this was not done would improve the paper.
The assessment of the impact of static features is obviously a good strategy, but it is not justified in the previous sections of the paper. In my opinion this is also one of the objectives of this paper and, if so, should be stated as such. I suggest adding a sentence in the last part of the introduction that explains more clearly that evaluating the impact of static features is one of the objectives.
line 159: Groundwater levels were predicted from one up to 12 weeks. For every time step, [...]
This is unclear to me. Does it mean that the models predict several sequences of growing lengths ranging from 1 to 12 weeks? Or is it a sequence [52 weeks] to value [horizon = 1 week] procedure with recursion to achieve the desired 12-week horizon?
I think it would be preferable to clearly highlight the prediction horizon, the time step, and the sequence-to-sequence or sequence-to-value prediction strategy.
Are the dynamic variables (covariates) known/given to the model for each predicted time step? I.e., in the case of a 2-week-ahead prediction, is the water level predicted from the past 52 weeks of water level, rainfall, etc. plus 2 past weeks of rainfall, evaporation, etc.?
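The two strategies distinguished in this comment can be sketched with toy stand-in models. This is a schematic sketch only: `mean_of_last_four` and `repeat_last_value` are hypothetical placeholders for a trained network, and the sketch makes no claim about which strategy the manuscript uses.

```python
def recursive_forecast(history, one_step_model, horizon=12):
    """Sequence-to-value with recursion: each prediction is fed back as an input."""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        next_value = one_step_model(window)
        predictions.append(next_value)
        window = window[1:] + [next_value]  # slide the input window forward by one week
    return predictions


def direct_forecast(history, multi_step_model, horizon=12):
    """Sequence-to-sequence: the whole horizon is produced in a single model call."""
    predictions = multi_step_model(list(history), horizon)
    if len(predictions) != horizon:
        raise ValueError("model must return one value per forecast step")
    return predictions


# Hypothetical placeholder models (not real forecasting networks)
def mean_of_last_four(window):
    return sum(window[-4:]) / 4


def repeat_last_value(window, horizon):
    return [window[-1]] * horizon


history = [10.0 + 0.01 * week for week in range(52)]  # made-up 52-week input sequence
recursive = recursive_forecast(history, mean_of_last_four)
direct = direct_forecast(history, repeat_last_value)
```

In the recursive variant, errors made at early steps propagate into later inputs, whereas the direct variant never sees its own predictions. This is why stating the strategy explicitly matters for interpreting how performance decays with the horizon.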
lines 160-167: a short discussion of the hyperparameter values could be interesting: how were the 10 epochs and the dropout rate of 0.2 chosen? Did you run any tests on these values to track the gain/loss in overall model performance?
"The quantile loss was chosen because it is more robust towards outliers than for example the root mean squared error (RMSE), e.g., caused by extreme precipitation events.": either add a reference, or state that this is based on your expertise or on trials against other loss functions.
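The robustness argument quoted here can be illustrated with the standard quantile (pinball) loss, which grows linearly with the error, whereas RMSE is based on squared errors. The following is a minimal sketch with made-up numbers; the function names are illustrative and not taken from the authors' code.

```python
import math


def quantile_loss(observed, predicted, q):
    """Mean pinball loss for quantile q (0 < q < 1)."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    errors = [o - p for o, p in zip(observed, predicted)]
    return sum(max(q * e, (q - 1) * e) for e in errors) / len(errors)


def rmse(observed, predicted):
    """Root mean squared error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example: three exact predictions and one large outlier error of 10
observed = [0.0, 0.0, 0.0, 10.0]
predicted = [0.0, 0.0, 0.0, 0.0]
median_loss = quantile_loss(observed, predicted, q=0.5)  # 0.5 * 10 / 4
root_mse = rmse(observed, predicted)                     # sqrt(100 / 4)
```

For q = 0.5 the pinball loss equals half the mean absolute error, so the single outlier contributes in proportion to its size rather than its square.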

194: Since the paper deals with two architectures (TFT and N-HiTS), something needs to be said about both of them. Does N-HiTS offer similar capabilities?
Results:
198: one-week prediction ... Depending on the dataset, this may not be completely relevant. There is no indication in the paper of the proportion of wells that exhibit low-frequency variations (i.e. inertial or very inertial). For such cases, even simple models (persistence, or even exponential smoothing) already perform very well
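The point about simple baselines can be made concrete with a persistence forecast, which predicts each value by the observation a fixed number of weeks earlier. The sketch below uses a synthetic, purely seasonal sine series as a stand-in for an inertial hydrograph; the series and function names are assumptions for illustration, not data from the study.

```python
import math


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency of a prediction against observations."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


def persistence_nse(series, lead):
    """NSE of predicting each value by the observation `lead` steps earlier."""
    if lead < 1 or lead >= len(series):
        raise ValueError("lead must be between 1 and len(series) - 1")
    observed = series[lead:]
    predicted = series[:-lead]
    return nse(observed, predicted)


# Synthetic weekly series: three years of a smooth annual cycle (illustration only)
levels = [math.sin(2 * math.pi * week / 52) for week in range(156)]
nse_1_week = persistence_nse(levels, lead=1)
nse_12_weeks = persistence_nse(levels, lead=12)
```

On this smooth series the one-week persistence score is close to 1, while the 12-week score is far lower. A benchmark of this kind would show how much of the reported one-week skill goes beyond what a trivial model already achieves.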
line 201: (TFT 0.34....) -> I propose writing (whereas the median NSE is 0.34 for TFT...); otherwise it is hard to understand where the 81 % comes from.
205: the term 'ground truth', although known to advanced ML users, could be hard to understand for other hydrogeologists. If ground truth here is the observed hydraulic head, it would be helpful to say so more clearly.
fig 2: The only axis that does not share the same (xmin, xmax) is 2B) bottom right, which makes the comparison harder.
fig 3 is underrated. Very little is said about it. That's a shame because I find it really interesting: the decrease in performance appears almost linear over the interval of 1 to 12 weeks. It also supports the choice to show only the extreme prediction horizons of 1 and 12 weeks and to ignore all the others. Since there is no sharp change in performance, there is no prediction horizon at which the model performance gets markedly worse, and that is also an interesting point. TFT exhibits a difference compared to N-HiTS: the medians for static + dynamic and purely dynamic are almost the same, while for N-HiTS there is a visible difference of about 0.2 points at 12 weeks.
line 227-228: again, at 1 week, I agree that the model performances are high, but I would find it interesting to qualify this claim, as it is highly plausible that any model, even a simple one, would perform well.
"Poor performances with a median NSE below zero" -> is this for the 1-week horizon only? If yes, a word like 'However' at the beginning of the sentence would help to clarify.
line 230: references to regional terms (Upper Rhine Graben, Central German Unconsolidated Rock District, Alpine Foreland) are confusing because they have not been described before, and no reference is made to Figure B5 in the supplementary material. I would recommend adding a short paragraph (1 to 2 sentences) in Section 2 (data) to briefly explain the geology of Germany and refer to Figure B5.
figure 5: KDE density varies from 0 to 1, right? If so, maybe add it to the legend. Hydrogeological units are hard to read. Here again, a reference to Figure B5 would be helpful, as I don't see any easy way to improve readability (maybe try to thicken the white line, or use a medium grey?).
line 244: does the term correlation refer to the Spearman correlation coefficient?
line 245-248: Why is the 1-week horizon no longer considered, as it was at the beginning of the paper?

Discussion - 4.1
line 281: an NSE of 0.5 is not really high. This should be mentioned somewhere, so that the reader sees that the authors know there is room for improvement.
line 291: "The single-well models solely used meteorological input features (Temperature and precipitation), while the LSTM approach included static features"
line 294: "The wells in these studies were preselected on the basis that their groundwater dynamics were primarily influenced by climatic processes [...]"
line 301: "However, N-HiTS in its current implementation requires the target feature as input feature, and is for this task inferior to the single-well CNNs or the global LSTM."
and line 274: "The most important past time steps, according to the attention scores, were often at the beginning of the input sequences (52 weeks, i.e. the week a year ago) and recent time points." Together with the feature importance of the 'groundwater level' (Figure 6), this suggests that the vast majority of wells exhibit an annual regime with very little variation around it. It also suggests that TFT or N-HiTS are very capable of replicating past patterns of groundwater heads (low flow in summer, high flow in winter), but not very capable of understanding the transformation of rain into groundwater level evolution (through infiltration/recharge and possibly delay in the unsaturated zone, up to the top of the aquifer).
These 4 parts of the paper are strongly inconsistent. While the objectives (from what I can guess) are probably to compete with the Heudorfer implementation (LSTM) and to perform groundwater prediction, the case where the groundwater level itself is left out does not appear to be considered.
In short: why has a model without groundwater level as an input feature not been evaluated? What happens to the overall model performance when this feature is left out? I did not find any justification for this choice anywhere, and the comparison with previous work makes this justification unavoidable.
4.2
line 326-328: Here again, since the water level (WL) appears to be such an important feature in comparison to the static features, I'm curious what this correlation analysis would look like if WL were removed from the input features.
line 331: what is the 'expected seasonality'?
line 331: "The highest identified [...] a lower flashiness." Here again, it sounds like the model performs well when the WL evolution is 'simple': sinusoidal variation with low flow/high flow in summer/winter. But I would expect so much more from advanced DL models! I suggest qualifying this or discussing this case further, from the perspective of the added value of 'complicated' models compared to LSTM or even simpler models (exponential smoothing, VAR, ...).
line 335-337: One could also challenge the 52-week sequence here. The variability of these wells could be at a lower frequency, i.e. require a longer sequence.
line 340: 'Porous aquifer' here seems to also denote 'homogeneous'. Would a highly compartmentalized aquifer made of porous sediment still be referred to as 'porous'?
line 361: Still the problem of WL as an input feature or not. This should be distinguished among the cited references.
line 362: "However, these studies were conducted for a much smaller number of monitoring wells and the authors suggest that their models used the static features primarily as unique identifiers (Heudorfer et al., 2024; Li et al., 2022)." -> this is not the only difference. Here again, I think that WL as an input feature plays a major role.
Title 4.3, The Role of Static Features in Global Machine Learning Models: With the exception of the last 4 sentences, this whole part is dedicated to the usage of ML for groundwater level prediction. The title suggests that the general case will be discussed (which is not strictly the case).
line 366: "It is important to note that by using a validation set and various techniques such as dropout and early stopping to avoid overfitting the models were prevented from simply replicating historical groundwater levels." This is maybe part of an explanation. But i) it comes too late in the paper, and ii) dropout is fine, and one could also think of pruning to achieve such a goal, but I still believe that the fundamentally autoregressive behavior of groundwater makes the water level itself as an input feature a big game changer, and the comparison with models that do not take WL as input is then biased.
Conclusion:
Still the problem with the groundwater level itself in the input features, which could lower the effect of the static features...
Data availability for static features, the adequacy of the static features used here (mainly concerning soil/surface cover) and the effect of the 1 km radius are missing here, despite being written about and cleverly discussed earlier. They should be added to the conclusion.

Citation: https://doi.org/10.5194/egusphere-2024-3484-RC2
- AC3: 'Reply on RC2', Stefan Kunz, 17 Jan 2025
  
  We thank Abel Henriot for his positive assessment of our manuscript. We address his suggestions in the attached pdf (marked in blue).
  
  Citation: https://doi.org/10.5194/egusphere-2024-3484-AC3
RC3:
'Comment on egusphere-2024-3484', Anonymous Referee #3, 17 Dec 2024
The submission by Kunz et al. presents the development and application of a machine learning model for groundwater level prediction in Germany. The models are referred to as "global" since they are trained against a multitude of wells simultaneously. The study entails several novel aspects which make the submission highly relevant for publication in HESS: 1) the applied models have not previously been applied in the groundwater domain and go beyond the state of the art, 2) such a large number of monitoring wells with time series data has not been used for model development before, and 3) the thorough investigation of the effect of static features in the models.
I only have a few comments that I wish to see addressed before publication:
Introduction: The cited literature in the introduction could be diversified. Here are two suggested references that could be included:
Collenteur, R. A., Haaf, E., Bakker, M., Liesch, T., Wunsch, A., Soonthornrangsan, J., ... & Meysami, R. (2024). Data-driven modelling of hydraulic-head time series: results and lessons learned from the 2022 Groundwater Time Series Modelling Challenge. Hydrology and Earth System Sciences, 28(23), 5193-5208.

Chidepudi, S. K. R., Massei, N., Jardani, A., & Henriot, A. (2024). Groundwater level reconstruction using long-term climate reanalysis data and deep neural networks. Journal of Hydrology: Regional Studies, 51, 101632.

Section 2.1.1: This section is missing information on the temporal resolution of the data. What is the frequency of the measurements, and were the measurements aggregated in time?

Section 2.3: Please clarify if the models are run in an autoregressive manner, simulating one timestep at a time (i.e., prediction at t1 is added to the dynamic inputs to predict t2), or if a sequence for the entire forecast horizon is outputted directly.

Section 2.4: Please clarify how the prediction intervals have been utilized. Were three separate models trained for the 0.1, 0.5, and 0.9 quantiles?

Discussion: Given the data presented in this paper, I was hoping the authors would attempt predictions at ungauged wells. Currently, groundwater level observations are used both in the dynamic and static features, making predictions at ungauged wells impossible with the existing model setup. I encourage the authors to add a discussion section outlining a path towards predicting groundwater levels at ungauged wells. This could be supported with an additional model experiment that excludes observed groundwater level data from the input features and is based on a spatial hold-out of monitoring wells for model testing. Even a poor test performance of such a spatio-temporal holdout experiment would be relevant to publish to underline the need for future research. To my knowledge such an experiment has not been published yet.
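The spatial hold-out suggested here can be sketched as a split over well identifiers, so that test wells contribute no data to training. This is a minimal sketch under stated assumptions: the well identifiers, the 20 % test fraction and the function name are hypothetical, and a purely random split ignores spatial clustering, so a real experiment might instead hold out whole regions.

```python
import random


def spatial_holdout(well_ids, test_fraction=0.2, seed=0):
    """Split unique well identifiers into disjoint training and test sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    ids = sorted(set(well_ids))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Hypothetical well identifiers (illustration only)
wells = [f"well_{i:04d}" for i in range(100)]
train_wells, test_wells = spatial_holdout(wells)
```

Combining such a well-level split with the temporal split already used in the study would give the spatio-temporal hold-out described in this comment.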
Citation: https://doi.org/10.5194/egusphere-2024-3484-RC3
- AC2: 'Reply on RC3', Stefan Kunz, 17 Jan 2025
  
  We thank Reviewer 3 for the positive assessment of our manuscript. The suggestions have been addressed by us in the attached PDF (marked in blue).
  
  Citation: https://doi.org/10.5194/egusphere-2024-3484-AC2

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-3484', Anonymous Referee #1, 05 Dec 2024
This manuscript evaluates the performance of two machine learning models in predicting groundwater (GW) levels across a dataset of ~5000 wells in Germany. The study examines the influence of both dynamic and static input features on the accuracy of GW level predictions and seeks to enhance the understanding of hydrogeological systems.
The objectives, methodology, results, and discussion are clear, well-structured, and thoroughly explained. The study aligns well with the scope of the journal, and HESS readers would benefit from and appreciate its findings. In my opinion, the manuscript is close to its final form. However, I would like to raise the following points for consideration:
The manuscript specifies particular values for hyperparameters (e.g., dropout rate, batch size, etc.). Are these values based on specific rules or conventions? Did you test alternative values? While this may not significantly affect the overall conclusions, I believe it would be helpful to clarify this for the reader.

Is there a specific reason for setting the prediction horizon to a maximum of 12 weeks?

Figures B9 and B10 indicate that attention is higher one year before the prediction than at times closer to it. Could you elaborate on why this happens?

In figure B10, why are attention values not zero in the interval of 0–10 weeks? Does this imply that the algorithm is somehow using inputs from these time steps? I suggest including a diagram to illustrate how inputs and outputs operate in the ML algorithms (e.g., similar to Figure 1 in Kratzert et al., 2018). This would help clarify which specific information is being utilized and when.
Citation: https://doi.org/10.5194/egusphere-2024-3484-RC1
- AC1: 'Reply on RC1', Stefan Kunz, 17 Jan 2025
  
  We thank Reviewer 1 for the positive assessment of our manuscript. The suggestions have been addressed by us in the attached PDF (marked in blue).
  
  Citation: https://doi.org/10.5194/egusphere-2024-3484-AC1
RC2:
'Comment on egusphere-2024-3484', Abel Henriot, 15 Dec 2024

General comments :
The paper is related to the needs of groundwater modeling using exlclusively machine learning approaches. Whereas this field of research is not complety new, the use of two specific model architecture (TFT and N-HiTS) has never been published in hydrogeology which represents a novelty. In addition, the study combines both static and dynamic features in the dataset. Static features have been carefully selected to represents hydrogeological properties, or representing links with groundwater levels that hydrogeologist can easily figure out.
The very large number of wells for training worth also to be noted as a novelty, and represents a clear advance in the field of hydrogeology, as it suggest that this model architecture could be suitable for even larger datasets. On top of this, confrontation with hydrogeologists expertise is made, which is both hardly found in litterature and very insightfull.
The overall structure is clear and leads the reader to a detailed understanding of methods, workflow, results and main outcomes.
Objectivs or motivation are slightly less clearly explained. And I think a 2 to 3 sentences paragraph at the end of the introduction that exposes more clearly all the motivations could be usefull. From my understanding :

* more efficient models for enlarged dataset compared to previous study (Heudorfer et.al.)

* more capabilities to handle static and dynamic features

* need to enhance the overall performance (NSE of 0.8 in Heudorfer et.al.)

* need to understand/evaluate impact of introduction static features

* capability of models to stick to plausible hydrogeological concepts
Results part has been kept short, with help of complementary material, and only the key points have been reported. This is one of the good point of this paper.
Discussion highlight very crucial consideration, in particular in the field of generalization capacities for this models, and information availabilty for static features.
I have a minor issue with the fact that waterlevel itself has been included in the input features. While this is a common way to proceed and is perfectly grounded, comparison with competing models that did not use this feature is made. I'm questionning myself on the reason of this choice and the overall impact of it.

Last, most of the references are for German works, and I missed a more general overview of ML for groundwater when it comes to deals to overall performance of the models.

All in all, this paper is, to me, an important step for the community, and the code hase been made available which should foster competitors to try for their dataset.
Specific comments
2.1.1 : time step of measurement are not reported, and no mention to re-scaling where needed (time resolution harmonisation)
line 88 : 50 times the average : on what basis is this figure based ? trial and error ? user expertise ? it's surprisingly far from often seen methods such as '3 standard deviation', or 1.5 IQR, z-scores, etc.
line 89: 4 weeks: this could be challenged, as hydraulic head variability could be low or high within this time period, depending on the influences to which the well reacts.
line 91: it would be good to justify that the time series are in a steady state and that 1996-2010, 2010-2013 and 2013-2016 exhibit no major change in time series characteristics, so that the train/validation/test datasets share similar (if not equal) characteristics.
end of § 2.1.1: it is hard to figure out, with no prior knowledge of Germany, whether the few missing data would lead to ignoring some specific climatic, geological or hydrogeological contexts. I encourage the authors to add a short sentence in this direction.
2.1.2 dynamic features:
Snow, wind and solar radiation have not been used. I would expect snow cover and snow melt to play a role in some years and in some parts of Germany, and solar radiation and wind to play a role in the evapotranspiration process. Even though I do believe that not all meteorological features can be considered, it would be insightful for the reader to know whether these variables are available in HYRAS and why they have not been included in the dataset (redundancy, for example?).
line 101: 'which have a strong influence on groundwater' does not seem to be grounded in any scientific argument. I would suggest rephrasing: (meteorological datasets that are regularly used in modeling (reservoir or distributed models), or in similar studies (add some international references)).
lines 102, 103: units are written in parentheses; it may be easier or better to integrate them in Table 1, for all variables.
line 105: the grid size is not given, so the impact of the weighted average step is unclear (averaging several pixels, i.e. the 1 km buffer includes several (how many?) smaller pixels, or downsampling, i.e. one pixel is much larger than the 1 km buffer).
lines 109, 110: "The meteorological input features and the LAI were extracted within a 1 km buffer around the groundwater wells. Thereby, a weighted average was calculated based on the area covered by the pixels within the buffer." This should be challenged, or at least discussed, in light of my previous remark; averaging may not always be the best method.
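The two situations distinguished in the grid-size remark can be made concrete with a small sketch of an area-weighted mean. It assumes that the overlap area between each pixel and the buffer has already been computed; the function name and all numbers are hypothetical and are not taken from the manuscript.

```python
def area_weighted_mean(pixel_values, overlap_areas):
    """Average pixel values, weighted by the area each pixel shares with the buffer."""
    total_area = sum(overlap_areas)
    if total_area <= 0:
        raise ValueError("buffer does not overlap any pixel")
    weighted_sum = sum(v * a for v, a in zip(pixel_values, overlap_areas))
    return weighted_sum / total_area

# Case 1 (hypothetical): fine grid, the buffer covers several pixels -> smoothing.
precip_fine = [2.0, 4.0, 6.0, 8.0]   # mm/day in four pixels
areas_fine = [1.0, 1.0, 0.5, 0.5]    # km^2 of each pixel inside the buffer
print(area_weighted_mean(precip_fine, areas_fine))

# Case 2 (hypothetical): coarse grid, the buffer lies inside one pixel -> no averaging.
print(area_weighted_mean([5.0], [3.14]))
```

In the second case the weighting has no effect at all, which is why the grid size matters for interpreting this step.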
line 111: I think the last sentence, "Dynamic features were divided [...] groundwater time series", would be better placed in section 2.3 (experimental design and training).
line 113: "The static features used in the study are environmental characteristics from the domains hydrogeology, soil, topography and land cover (see Table 2)." The sentence sounds odd to me. Maybe the verb 'are' could be changed to 'cover'.
Table 1: would it be possible to indicate which variables are considered 'meteorological input features' (see lines 102, 103)?
line 125: 'vanilla LSTM'. Considering that this paper could be read by hydrogeologists with no prior knowledge of computer science, I suggest avoiding this term, which sounds like technical jargon.
2.2
The model architectures are well described, even if it remains unclear whether some steps have been specifically designed for the study or are part of the general TFT model (e.g. "The importance of each feature is then represented by the average of the variable selection weights over all time steps.").
I would suggest splitting 2.2.1 into two parts: 1) an overall description of TFT and 2) its implementation within the frame of the study.
However, I still think that the motivation for a new architecture is not justified enough, even though one can more or less infer some of the reasons:
* more efficient models for an enlarged dataset compared to the previous study (Heudorfer et al.)
* more capabilities to handle static and dynamic features
* need to enhance the overall performance (NSE of 0.8 in Heudorfer et al.)
The strategy for model evaluation is clever and robust, with consideration for the hydrogeological context.

§2.3 - experimental design and training
I would expect to find the global strategy for train/validation/test here and not in 'data (§2.1.3)'. Also, a complementary strategy could have been to split on the wells themselves, leaving aside a given proportion of wells and their related data. I think the paper would be improved by explaining why this has not been used.
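A split on the wells themselves, as opposed to a split in time, could look like the sketch below. It is an illustration under assumptions only: the record layout, the 20 % hold-out fraction and the function name are hypothetical and do not describe the authors' setup.

```python
import random

def split_wells(well_ids, test_fraction=0.2, seed=0):
    """Hold out whole wells so that no test well contributes any training data."""
    ids = sorted(set(well_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])

# Hypothetical records: (well_id, week, groundwater_level)
records = [(f"well_{w}", week, 100.0 + w + 0.1 * week)
           for w in range(10) for week in range(5)]

train_wells, test_wells = split_wells([r[0] for r in records])
train = [r for r in records if r[0] in train_wells]
test = [r for r in records if r[0] in test_wells]
print(len(train_wells), len(test_wells), len(train), len(test))
```

With such a split, test performance measures generalization to unseen locations rather than to unseen time periods at known wells.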
The assessment of the impact of static features is obviously a good strategy, but it is not justified in the previous sections of the paper. In my opinion this is also one of the objectives of this paper and, if so, it should be stated as such. I suggest adding a sentence in the last part of the introduction that more clearly presents the evaluation of the impact of static features as an objective.
line 159: "Groundwater levels were predicted from one up to 12 weeks. For every time step, [...]"
This is unclear to me. Does it mean that the models predict several sequences of growing length, ranging from 1 to 12 weeks? Or is it a sequence [52 weeks] to value [horizon = 1 week] procedure, with recursion to reach the desired 12-week horizon?
I think it would be preferable to clearly state the prediction horizon, the time step, and whether the prediction strategy is sequence-to-sequence or sequence-to-value.
Are the dynamic variables (covariates) known/given to the model for each predicted time step? I.e., in the case of a 2-week-ahead prediction, is the water level predicted from the past 52 weeks of water level, rainfall, etc., plus 2 additional weeks of rainfall, evaporation, etc.?
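The difference between the two strategies in question can be sketched with toy models. Everything here is hypothetical: the one-step rule, the function names and the input series are invented for illustration, covariates are left out, and neither function represents TFT or N-HiTS. The direct model is simply the closed form of the recursive one, so that both strategies return the same 12 values and only the calling pattern differs.

```python
def one_step_model(window):
    """Hypothetical one-step model: continue the last weekly change, damped by half."""
    return window[-1] + 0.5 * (window[-1] - window[-2])

def recursive_forecast(history, horizon=12):
    """Sequence-to-value: predict one week, feed it back as input, repeat."""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        next_value = one_step_model(window)
        predictions.append(next_value)
        window = window[1:] + [next_value]  # slide the 52-week input window
    return predictions

def direct_forecast(history, horizon=12):
    """Sequence-to-sequence: emit all `horizon` values in a single call."""
    last, change = history[-1], history[-1] - history[-2]
    return [last + change * (1.0 - 0.5 ** h) for h in range(1, horizon + 1)]

history = [100.0 + 0.01 * week for week in range(52)]  # 52 weeks of input
recursive = recursive_forecast(history)
direct = direct_forecast(history)
print(len(recursive), len(direct))
```

In the recursive case, errors of early steps propagate into later inputs; in the direct case they do not, which is why stating the strategy matters for interpreting the 12-week results.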
lines 160-167: a short discussion of the hyperparameter values could be interesting: how were the 10 epochs and the dropout rate of 0.2 chosen? Did you run any tests on these values to track the gain/loss in overall model performance?
"The quantile loss was chosen because it is more robust towards outliers than for example the root mean squared error (RMSE), e.g., caused by extreme precipitation events.": please add either a reference, or a few words stating that this reflects your expertise or results from trials against other loss functions.
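One standard way to see the robustness argument is through the constant prediction that minimizes each loss: the quantile (pinball) loss at q = 0.5 is minimized by the median, the squared error by the mean. The sketch below uses invented values and the mean squared error in place of RMSE (both have the same minimizer); it illustrates the general property only and is not a test of the authors' models.

```python
import statistics

def quantile_loss(y_true, y_pred, q=0.5):
    """Mean pinball loss at quantile level q (0 < q < 1)."""
    return sum(max(q * (y - p), (q - 1.0) * (y - p))
               for y, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observations with one extreme value.
observed = [1.0, 1.1, 0.9, 1.0, 11.0]
median_pred = [statistics.median(observed)] * len(observed)
mean_pred = [statistics.fmean(observed)] * len(observed)

print(quantile_loss(observed, median_pred), quantile_loss(observed, mean_pred))
print(mse(observed, median_pred), mse(observed, mean_pred))
```

The quantile loss prefers the median, which the single extreme value does not move, while the squared error prefers the mean, which is pulled towards it.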

line 194: Since the paper deals with two architectures (TFT and NHiTS), something needs to be said about both of them. Does NHiTS offer similar capabilities?
Results :
line 198: one-week prediction ... Depending on the dataset, this may not be entirely relevant. The paper gives no indication of the proportion of wells that exhibit low-frequency variations (i.e. inertial or very inertial). For such cases, even simple models (persistence, or even exponential smoothing) already perform very well.
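The point about inertial wells can be illustrated with a persistence baseline scored by the NSE. The series below is a synthetic annual sinusoid, chosen as a hypothetical stand-in for a slowly varying well; the function names are invented and no value here comes from the manuscript.

```python
import math

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - squared error / variance of the observations."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

def persistence_forecast(series, horizon):
    """Predict that the level `horizon` weeks ahead equals the current level."""
    return series[:-horizon]  # aligned with the observations series[horizon:]

# Hypothetical inertial well: slow annual cycle, weekly values over four years.
series = [100.0 + math.sin(2 * math.pi * t / 52) for t in range(208)]
nse_by_horizon = {h: nse(series[h:], persistence_forecast(series, h))
                  for h in (1, 12)}
print(nse_by_horizon)
```

On such a series the one-week persistence score is very high and drops sharply at 12 weeks, so a baseline of this kind gives the reported one-week NSE values a reference point.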
line 201: (TFT 0.34....) -> suggestion: write (whereas the median NSE is 0.34 for TFT...); otherwise it is hard to understand where the 81 % comes from.
line 205: the term 'ground truth', although known to advanced ML users, could be hard to understand for other hydrogeologists. If ground truth here is the observed hydraulic head, it would be helpful to say so more clearly.
fig 2. The only panel that does not share the same (xmin, xmax) is 2B) bottom right, which makes the comparison harder.
fig 3. is under-discussed. Very little is said about it. That's a shame because I find it really interesting: the decrease in performance appears almost linear over the interval of 1 - 12 weeks. It also supports the choice to show only the extreme prediction horizons (1 and 12 weeks) and to ignore all the others. Since there is no sharp change in performance, there is no prediction horizon at which the model performance gets markedly worse, and that is also an interesting point. TFT exhibits a difference compared to NHiTS: the medians for static + dynamic and purely dynamic are almost the same, while for NHiTS there is a visible difference of about 0.2 points at 12 weeks.
lines 227-228: again, at 1 week, I do agree that the model performances are high, but I would temper this claim, as it is highly plausible that any model, even a simple one, would perform well.
"Poor performances with a median NSE below zero" -> is this for the 1-week horizon only? If yes, a word like 'However' at the beginning of the sentence would help to clarify.
line 230: references to regional terms (Upper Rhine Graben, Central German Unconsolidated Rock District, Alpine Foreland) are confusing because they have not been described before, and no reference is made to figure B5 in the supplementary material. I would recommend adding a short paragraph (1 to 2 sentences) in section 2 (data) to briefly explain the geology of Germany and refer to figure B5.
figure 5: KDE density varies from 0 to 1, right? If so, maybe add it to the legend. Hydrogeological units are hard to read. Here again, a reference to figure B5 would be helpful, as I don't see any easy way to improve readability (maybe try to thicken the white line, or use a medium grey?).
line 244: does the term 'correlation' refer to the Spearman correlation coefficient?
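Why the distinction matters can be shown in a few lines: the Spearman coefficient is the Pearson coefficient computed on ranks, so it rewards any monotonic relation, not only a linear one. The sketch below is a plain implementation with average ranks for ties; the feature values and per-well NSE scores are invented for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical static feature and per-well NSE: monotonic but not linear.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
nse_scores = [0.10, 0.40, 0.45, 0.47, 0.48]
print(spearman(feature, nse_scores), pearson(feature, nse_scores))
```

The two coefficients can differ noticeably on the same data, so naming the one used helps the reader interpret the reported values.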
lines 245-248: why is the 1-week horizon no longer considered, as it was at the beginning of the paper?

Discussion - 4.1
line 281: an NSE of 0.5 is not really high. This should be mentioned somewhere, so that the reader can see that the authors know there is room for improvement.
line 291: "The single-well models solely used meteorological input features (Temperature and precipitation), while the LSTM approach included static features"
line 294: "The wells in these studies were preselected on the basis that their groundwater dynamics were primarily influenced by climatic processes [...]"
line 301: "However, N-HiTS in its current implementation requires the target feature as input feature, and is for this task inferior to the single-well CNNs or the global LSTM."
and line 274: "The most important past time steps, according to the attention scores, were often at the beginning of the input sequences (52 weeks, i.e. the week a year ago) and recent time points." Together with the feature importance of the 'groundwater level' (figure 6), this suggests that the vast majority of wells exhibit an annual regime with very little variation around it. It suggests that TFT or NHiTS are very capable of replicating the past patterns of groundwater heads (low flow in summer, high flow in winter), but not very capable of capturing the transformation of rain into groundwater level evolution (through infiltration/recharge, possibly with a delay in the unsaturated zone, up to the top of the aquifer).
These four passages of the paper reveal a strong inconsistency. While the objectives (from what I can guess) are probably to compete with the Heudorfer implementation (LSTM) and to do groundwater prediction, the case where the groundwater level itself is left aside appears not to be considered.
In short: why has a model without groundwater level as an input feature not been evaluated? What happens to the overall model performance when this feature is left out? I did not find a justification for this choice anywhere, and the comparison with previous work makes this justification unavoidable.
4.2
lines 326-328: here again, since the water level (WL) appears to be such an important feature in comparison to the static features, I'm curious what this correlation analysis would look like if WL were removed from the input features.
line 331: what is the 'expected seasonality'?
line 331: "The highest identified [...] a lower flashiness." Here again, it sounds as if the model performs well when the WL evolution is 'simple': sinusoidal variation with low flow/high flow in summer/winter. But I would expect so much more from advanced DL models! I suggest tempering this or discussing this case further, with regard to the added value of 'complicated' models compared to LSTM or even simpler models (exponential smoothing, VAR, ...).
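To give a sense of how little machinery the simpler alternatives require, here is a sketch of simple exponential smoothing and of a seasonal-naive forecast that repeats the value from 52 weeks earlier. Both functions, their names and the synthetic series are illustrative assumptions, not baselines reported in the manuscript.

```python
import math

def simple_exponential_smoothing(series, alpha=0.3):
    """Smoothed level after each observation; the last level is the flat forecast."""
    level = series[0]
    levels = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels

def seasonal_naive(series, horizon, period=52):
    """Forecast each future week with the value observed one period earlier."""
    if len(series) < period:
        raise ValueError("need at least one full period of history")
    start = len(series) - period
    return [series[start + (h % period)] for h in range(horizon)]

# Hypothetical well with a purely annual cycle, two years of weekly values.
series = [math.sin(2 * math.pi * t / 52) for t in range(104)]
forecast = seasonal_naive(series, horizon=12)
print(forecast[:3])
print(simple_exponential_smoothing([0.0, 2.0, 2.0], alpha=0.5))
```

For a strictly periodic series the seasonal-naive forecast is exact, which is the situation in which a deep model has the least room to add value.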
lines 335-337: one could also challenge the 52-week sequence here. The variability of these wells could be at a lower frequency, i.e. require a longer input sequence.
line 340: 'porous aquifer' here also seems to imply 'homogeneous'. Would a highly compartmentalized aquifer made of porous sediment still be referred to as 'porous'?
line 361: still the problem of WL as an input feature or not. This should be distinguished among the cited references.
line 362: "However, these studies were conducted for a much smaller number of monitoring wells and the authors suggest that their models used the static features primarily as unique identifiers (Heudorfer et al., 2024; Li et al., 2022)." -> this is not the only difference. Here again I think that WL as an input feature plays a major role.
Title 4.3 The Role of Static Features in Global Machine Learning Models. With the exception of the last 4 sentences, this whole part is dedicated to the usage of ML for groundwater level prediction. The title suggests that the general case will be discussed (which is not strictly the case).
line 366: "It is important to note that by using a validation set and various techniques such as dropout and early stopping to avoid overfitting the models were prevented from simply replicating historical groundwater levels." This is maybe part of an explanation. But i) it comes too late in the paper, and ii) dropout is fine, but one could also think of pruning to achieve such a goal. Still, I believe that the fundamentally autoregressive behavior of groundwater makes the water level itself as an input feature a big game changer, and that the comparison with models that do not take WL as input is therefore biased.
Conclusion :
Still the problem with groundwater level itself in the input feature that could lower the effect of dealing with static features...
Data availability for static features, the adequacy of the static features used here (mainly concerning soil/surface cover), and the effect of the 1 km radius are missing here, despite having been written about and cleverly discussed before. They should be added to the conclusion.

Citation: https://doi.org/10.5194/egusphere-2024-3484-RC2
- AC3: 'Reply on RC2', Stefan Kunz, 17 Jan 2025
  
  We thank Abel Henriot for his positive assessment of our manuscript. We address his suggestions in the attached pdf (marked in blue).
  
  Citation: https://doi.org/10.5194/egusphere-2024-3484-AC3
RC3: 'Comment on egusphere-2024-3484', Anonymous Referee #3, 17 Dec 2024
The submission by Kunz et al. presents the development and application of a machine learning model for groundwater level prediction in Germany. The models are referred to as "global" since they are trained against a multitude of wells simultaneously. The study entails several novel aspects which make the submission highly relevant for publication in HESS: 1) the applied models have not previously been applied in the groundwater domain and go beyond the state of the art, 2) such a large number of monitoring wells with time series data has not been used for model development before, and 3) the thorough investigation of the effect of static features in the models.
I only have a few comments that I wish to see addressed before publication:
Introduction: The cited literature in the introduction could be diversified. Here are two suggested references that could be included:
Collenteur, R. A., Haaf, E., Bakker, M., Liesch, T., Wunsch, A., Soonthornrangsan, J., ... & Meysami, R. (2024). Data-driven modelling of hydraulic-head time series: results and lessons learned from the 2022 Groundwater Time Series Modelling Challenge. Hydrology and Earth System Sciences, 28(23), 5193-5208.

Chidepudi, S. K. R., Massei, N., Jardani, A., & Henriot, A. (2024). Groundwater level reconstruction using long-term climate reanalysis data and deep neural networks. Journal of Hydrology: Regional Studies, 51, 101632.

Section 2.1.1: This section is missing information on the temporal resolution of the data. What is the frequency of the measurements, and were the measurements aggregated in time?

Section 2.3: Please clarify if the models are run in an autoregressive manner, simulating one timestep at a time (i.e., prediction at t1 is added to the dynamic inputs to predict t2), or if a sequence for the entire forecast horizon is outputted directly.

Section 2.4: Please clarify how the prediction intervals have been utilized. Were three separate models trained for the 0.1, 0.5, and 0.9 quantiles?

Discussion: Given the data presented in this paper, I was hoping the authors would attempt predictions at ungauged wells. Currently, groundwater level observations are used both in the dynamic and static features, making predictions at ungauged wells impossible with the existing model setup. I encourage the authors to add a discussion section outlining a path towards predicting groundwater levels at ungauged wells. This could be supported with an additional model experiment that excludes observed groundwater level data from the input features and is based on a spatial hold-out of monitoring wells for model testing. Even a poor test performance of such a spatio-temporal holdout experiment would be relevant to publish to underline the need for future research. To my knowledge such an experiment has not been published yet.
Citation: https://doi.org/10.5194/egusphere-2024-3484-RC3
- AC2: 'Reply on RC3', Stefan Kunz, 17 Jan 2025
  
  We thank Reviewer 3 for the positive assessment of our manuscript. The suggestions have been addressed by us in the attached PDF (marked in blue).
  
  Citation: https://doi.org/10.5194/egusphere-2024-3484-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Publish subject to revisions (further review by editor and referees) (12 Feb 2025) by Alberto Guadagnini

AR by Stefan Kunz on behalf of the Authors (18 Mar 2025): Author's response, Author's tracked changes, Manuscript

ED: Referee Nomination & Report Request started (18 Mar 2025) by Alberto Guadagnini

RR by Leonardo Sandoval (28 Mar 2025)

ED: Publish as is (29 Mar 2025) by Alberto Guadagnini

AR by Stefan Kunz on behalf of the Authors (31 Mar 2025)

Journal article(s) based on this preprint

01 Aug 2025

Towards a global spatial machine learning model for seasonal groundwater level predictions in Germany

Stefan Kunz, Alexander Schulz, Maria Wetzel, Maximilian Nölscher, Teodor Chiaburu, Felix Biessmann, and Stefan Broda

Hydrol. Earth Syst. Sci., 29, 3405–3433, https://doi.org/10.5194/hess-29-3405-2025, 2025

Viewed

Total article views: 699 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
489	186	24	699	17	25

Views and downloads (calculated since 19 Nov 2024)

Month	HTML	PDF	XML	Total
Nov 2024	99	18	6	123
Dec 2024	100	54	3	157
Jan 2025	53	21	4	78
Feb 2025	42	10	1	53
Mar 2025	35	13	3	51
Apr 2025	35	12	1	48
May 2025	32	13	1	46
Jun 2025	41	34	5	80
Jul 2025	50	10	0	60
Aug 2025	0
Sep 2025	0
Oct 2025	0
Nov 2025	0
Dec 2025	1	0	1
Jan 2026	0
Feb 2026	0
Mar 2026	1	1	0	2
Apr 2026	0

Viewed (geographical distribution)

Total article views: 687 (including HTML, PDF, and XML) Thereof 687 with geography defined and 0 with unknown origin.

Latest update: 11 Apr 2026

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary

Accurate groundwater level predictions are essential for sustainable groundwater management. This study applies two machine learning (ML) models, N-HiTS and TFT, to seasonally predict groundwater levels for 5,288 monitoring wells across Germany. Both approaches provided good predictions across diverse hydrogeological conditions, with N-HiTS outperforming the TFT. Both models showed better performance in areas with high data density, in lowlands, and where distinct seasonal dynamics occurred.

