Performance assessment of geospatial and time series features on groundwater level forecasting with deep learning

Gomez, Mariana; Noelscher, Maximilian; Hartmann, Andreas; Broda, Stefan

doi:https://doi.org/10.5194/egusphere-2023-1836

Preprints

13 Sep 2023

Abstract. Groundwater level (GWL) forecasting with machine learning has been widely studied because it generally yields accurate results with low input data requirements. Furthermore, machine learning models for this purpose can be set up and trained quickly compared with the effort required for process-based numerical models. Despite the high performance of models obtained at specific locations, applying the same model architecture to multiple sites across a region might lead to contrasting accuracies. The likely causes of this discrepancy in model performance have barely been examined in previous studies. Here, we investigate the link between model performance and geospatial site and time series features. Using precipitation (P) and temperature (T) as predictors, we model groundwater levels at approximately 500 observation wells in Lower Saxony, Germany, applying a 1-D convolutional neural network (CNN) with a fixed architecture and hyperparameters tuned for each time series individually. The GWL records span 21 to 71 years, so the time ranges of the training and test datasets vary. The performances are evaluated against relevant geospatial characteristics (e.g. land cover, distance to waterworks, and leaf area index) and time series features (e.g. autocorrelation, flat spots, and number of peaks) using Pearson correlation coefficients. We found that model performance is negatively influenced at sites near waterworks and in densely vegetated areas. Longer subsequences of GWL measurements above or below the mean negatively impact the metrics and might be associated with anthropogenic influence or with wetter and drier periods. In addition, complex GWL time series exhibit better metrics, possibly due to a closer link with precipitation dynamics.
As deep learning models are known to be black-box models that lack an understanding of physical processes, our work provides new insights into the degree to which external physical factors affect the input-output relation of a GWL forecasting model.
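
The evaluation step described in the abstract, relating a per-well performance score to a site feature through a Pearson correlation coefficient, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `pearson_r` and all numbers are hypothetical and were invented for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-well values, invented for illustration only.
nse_per_well = [0.82, 0.75, 0.40, 0.55, 0.10]
distance_to_waterworks_km = [12.0, 9.5, 2.0, 4.0, 0.5]

r = pearson_r(distance_to_waterworks_km, nse_per_well)
print(f"r = {r:.2f}")
```

In this invented example the score rises with distance, so `r` is positive. With real data, one such coefficient would be computed per feature and per metric.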

Received: 21 Aug 2023 – Discussion started: 13 Sep 2023

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2023-1836', Marvin Höge, 06 Oct 2023

Review of “Performance assessment of geospatial and time series features on groundwater level forecasting with deep learning.” by Mariana Gomez et al., 2023
Summary
The presented study addresses the important topic of groundwater level forecasting, demonstrated for several hundred wells in Lower Saxony, Germany. Specifically, it covers the performance of a convolutional neural network (CNN) in accomplishing this task, as an increasingly used alternative to the physics-based models typically employed. The study analyses how the CNN performance relates to geospatial features and time series features of the respective sites.
Evaluation and Recommendation
The study covers an interesting topic especially since groundwater in Germany (and in a global scope) poses a fresh water resource of already high and even growing importance. The topic of this manuscript is suitable for the journal.
The manuscript is overall well written and referenced. The codes that were used are freely available and data sources are referenced. The figures and maps are of high quality. With deep learning approaches being increasingly used for groundwater level forecasting, the investigation of such models’ capability is important and meets the community’s interest. Yet, the current manuscript requires some restructuring and additional investigations.
In the introduction, it is mentioned that some consider machine learning methods as “black boxes”. Explainable artificial intelligence (xAI) tools or similarly called methods are supposed to help here. Therefore, a brief coverage of advances in this field – ideally in the field of groundwater research if available - would be worth mentioning.
The main point of concern, however, is as follows: Overall, the performance of the employed CNN model, as presented, e.g., in Figure 6, is not fully convincing. Even the well-performing models - according to NSE and R2 - show mainly a sinusoidal pattern with only slight variations – yet, it is these variations that would be interesting to be modelled. Otherwise, a sine function-based model with a mean trend might often be sufficient and provide the same goodness-of-fit values. Therefore, one can assume that the used model architecture (together with only precipitation and temperature as inputs) is not complex enough to capture more of the dynamics. The subsequent analysis of performance with respect to geospatial and timeseries features therefore appears to be weaker than it could be. It appears to be difficult to deduce relations between features and model performance if the model does not perform convincingly in the first place. All correlations reported are rather weak with the strongest anti-correlation being -0.62.
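
The reviewer's point can be made concrete with a baseline of the kind mentioned above: a linear trend plus an annual sine and cosine term, fitted by least squares and scored with the NSE. The sketch below is a hedged illustration only. The function names, the synthetic series and the train/test split are assumptions and do not come from the manuscript.

```python
import numpy as np


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of obs."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def design_matrix(t_months, period=12.0):
    """Columns: intercept, linear trend, sine and cosine of the annual cycle."""
    t = np.asarray(t_months, dtype=float)
    omega = 2.0 * np.pi / period
    return np.column_stack(
        [np.ones_like(t), t, np.sin(omega * t), np.cos(omega * t)]
    )


def fit_sine_baseline(t_train, y_train, period=12.0):
    """Least-squares coefficients of the trend-plus-harmonic baseline."""
    y = np.asarray(y_train, dtype=float)
    coef, *_ = np.linalg.lstsq(design_matrix(t_train, period), y, rcond=None)
    return coef


def predict_sine_baseline(coef, t_months, period=12.0):
    """Evaluate the fitted baseline at the given time steps."""
    return design_matrix(t_months, period) @ coef


# Synthetic monthly series (invented): trend + annual cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(240)
gwl = (
    50.0
    - 0.002 * t
    + 0.4 * np.sin(2.0 * np.pi * t / 12.0 + 0.7)
    + rng.normal(0.0, 0.1, t.size)
)

split = 192  # first 16 years for fitting, last 4 years for testing
coef = fit_sine_baseline(t[:split], gwl[:split])
baseline_nse = nse(gwl[split:], predict_sine_baseline(coef, t[split:]))
print(f"baseline NSE on the test period: {baseline_nse:.2f}")
```

Comparing a model's NSE against such a baseline on the same test period indicates how much skill goes beyond seasonality and mean trend.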
Along these lines, the geospatial features used are all interesting, but the time series features are too many and some are hardly tangible. A thorough explanation of their meaning, range of values, etc. would be beneficial. If time series features are to be part of the analysis, I recommend focusing on only a few of the ones provided. Yet, in this case, the question remains: What exactly is gained from relating these time series features to model performance? For instance, a time series can be rated as complex while an adequate model could still be able to predict it.
It could be helpful to have a leaner story, e.g. presenting another (maybe more complex) model that captures more of the dynamics and then to analyze whether strong relations between performance of that model and geospatial features can be elicited - and whether they are more indicative than the ones that correspond to the current model. If there were clear correlations to timeseries features, it might be an option to keep some in the main analysis. Overall, it would be better to place them in the appendix for the reasons discussed above.
I recommend major revisions and, at the same time, due to its interesting core topic and the different aspects of modelling, feature analysis, etc. I think this study bears quite some potential.

Specific comments
Please see the attached manuscript.
Tables and Figures
Please see the attached manuscript.
Language
Please see the attached manuscript.

Citation: https://doi.org/10.5194/egusphere-2023-1836-RC1
- AC1: 'Reply on RC1', Mariana Gomez, 31 Oct 2023
  
  Dear Marvin Höge,
  We appreciate your suggestions on our manuscript, all of which we consider valuable for enhancing its quality. In response to your comments, we aim to address your discussion points as follows.
  In the introduction, it is mentioned that some consider machine learning methods as “black boxes”. Explainable artificial intelligence (xAI) tools or similarly called methods are supposed to help here. Therefore, a brief coverage of advances in this field – ideally in the field of groundwater research if available - would be worth mentioning.
  We acknowledge the importance of addressing the implementation of Explainable AI techniques in our research, and we intend to integrate a discussion on this topic into our manuscript. This addition will contribute to a more thorough understanding of the interpretability and transparency of our modeling approach.
  The main point of concern, however, is as follows: Overall, the performance of the employed CNN model, as presented, e.g., in Figure 6, is not fully convincing. Even the well-performing models - according to NSE and R2 - show mainly a sinusoidal pattern with only slight variations – yet, it is these variations that would be interesting to be modelled. Otherwise, a sine function-based model with a mean trend might often be sufficient and provide the same goodness-of-fit values. Therefore, one can assume that the used model architecture (together with only precipitation and temperature as inputs) is not complex enough to capture more of the dynamics. The subsequent analysis of performance with respect to geospatial and time series features therefore appears to be weaker than it could be. It appears to be difficult to deduce relations between features and model performance if the model does not perform convincingly in the first place. All correlations reported are rather weak with the strongest anti-correlation being -0.62.
  Regarding your main concern about the model's performance, we hypothesize that the monthly resolution used in this study may contribute to the sinusoidal pattern seen in the well-performing model example. The seasonality evident in the monthly temperature and precipitation used as inputs can certainly affect the model behaviour. It is worth noting that the CNN model has previously proved effective in predicting weekly groundwater level time series with high overall accuracy, as shown by Wunsch et al. (2022), "Deep learning shows declining groundwater levels in Germany until 2100 due to climate change". Therefore, we are confident that our approach works well at higher temporal resolutions, and we propose a shift to a weekly temporal resolution for modeling the time series. This adjustment is intended to mitigate the potential seasonality introduced by the monthly resolution. By evaluating the model's performance at weekly resolution, we seek to address this issue and establish a more robust foundation for examining geospatial and time series correlations, which may also turn out to be stronger.
  It is important to emphasize that while adopting a different model may yield improvements in overall performance, we have consistently applied the same model to all monitoring stations. Consequently, local or more specific variations in performance metrics across stations are primarily attributed to external factors rather than model selection. Certain models may better adapt to specific locations while displaying lower performance when applied to the entire dataset. In this study, our primary objective is to analyze these relative performance differences in the presence of external influences. Therefore, even if an alternative model were to enhance overall performance, our analysis will retain its current form, focusing on the examination of these external influences.
  Along these lines, the geospatial features used are all interesting, but the time series features are too many and some are hardly tangible. A thorough explanation of their meaning, range of values, etc. would be beneficial. If time series features are to be part of the analysis, I recommend focusing on only a few of the ones provided. Yet, in this case, the question remains: What exactly is gained from relating these time series features to model performance? For instance, a time series can be rated as complex while an adequate model could still be able to predict it.
  The selection of time series features represents the outcome of a pre-evaluation aimed at elucidating their physical significance in the context of groundwater level time series. We agree with your suggestion that a more extensive interpretation of these features should be incorporated into our manuscript. We will augment our discussion with a detailed explanation of each feature, emphasizing their relevance to the physical aspects of groundwater level dynamics.
  Regarding the detailed comments in the attached PDF, we will go through them and incorporate those with which we agree.
  In conclusion, we are committed to addressing your comments as comprehensively as possible.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1836-AC1
RC2:
'Comment on egusphere-2023-1836', Jonathan Frame, 15 Oct 2023

This paper presents a CNN developed for groundwater levels across a region of Germany. The CNN was developed for each well time series individually, meaning this is not a model that should be applied to locations without well data. The paper’s main contribution is an attribution of model performance to geospatial characteristics and time series features. This analysis should be relatively interesting to modelers of groundwater systems, particularly those using data-driven methods. This paper would greatly benefit from additional description of the training procedure and evaluation when dealing with gap-filled or processed data. Another benefit would be further evaluation of the sensitivity of model performance to the gap filling and data processing measures.
Below are some specific line comments.
Line 44: These claims should be cited: “In terms of accuracy and calculation speed, the CNN models outperform the LSTM. NARX models performed, on average, better than CNN.”
Line 47: “Most studies have successfully applied these techniques for GWL forecasting using only meteorological variables as inputs.” You might be interested in this paper: Gholizadeh et al., “Long short-term memory models to quantify long-term evolution of streamflow discharge and groundwater depth in Alabama”, Science of The Total Environment, Volume 901, 25 November 2023, 165884, where they did in fact include site geospatial characteristics to make predictions of wells that were held out from the training set (ungauged).
Line 116: Can you please provide the total number (and percentage) of gap filled values referred to here: “ To provide the CNN model with continuous time series, we performed a data imputation process through a Multiple Linear Regression (if enough dynamically similar wells based on the Euclidean Distance”. Can you also explain if these values were removed, or should be removed, from the loss during training, and also removed from the evaluation?
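
The two requests above, counting gap-filled values and excluding them from the evaluation, can be sketched as follows. This is an illustrative assumption of how such bookkeeping might look, not the authors' preprocessing: the helper names and the sample values are hypothetical, and missing observations are assumed to be encoded as NaN.

```python
import math
from itertools import groupby


def gap_lengths(series):
    """Lengths of each run of consecutive missing (NaN) values."""
    return [
        len(list(run))
        for is_missing, run in groupby(series, key=math.isnan)
        if is_missing
    ]


def gap_summary(series):
    """Longest gap, number of missing values and their share in percent."""
    gaps = gap_lengths(series)
    n_missing = sum(gaps)
    return {
        "longest_gap": max(gaps, default=0),
        "n_missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(series),
    }


def drop_imputed(obs, sim, was_missing):
    """Keep only (obs, sim) pairs whose observation was measured, not imputed."""
    return [(o, s) for o, s, m in zip(obs, sim, was_missing) if not m]


# Hypothetical monthly record with two gaps, invented for illustration.
nan = float("nan")
raw = [3.1, 3.0, nan, nan, 2.8, 2.9, nan, 3.2, 3.3, 3.1]
summary = gap_summary(raw)
was_missing = [math.isnan(v) for v in raw]

# Hypothetical gap-filled record and model output of the same length.
filled = [3.1, 3.0, 2.9, 2.9, 2.8, 2.9, 3.0, 3.2, 3.3, 3.1]
simulated = [3.0, 3.0, 3.0, 2.9, 2.9, 2.9, 3.1, 3.1, 3.2, 3.2]
measured_pairs = drop_imputed(filled, simulated, was_missing)
print(summary, len(measured_pairs))
```

Keeping the `was_missing` mask alongside the filled series allows the same metrics to be reported with and without imputed values.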
Line 120: Similarly can you provide the total number (and percent) of data points that were modified as outliers described here: “We removed these anomalies by finding the highest slope in the cumulative sum”? Is this a standard approach? I don’t think this description is satisfactory. I see from your code that you identify these based on “initial point where the values increase by 0.5 of the standard deviation“. This is an important point that should be explained in the paper, as well as the decision to use this processing method.
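
For readers unfamiliar with cumulative-sum (CUSUM) step detection, a generic sketch follows. It is not the authors' implementation, which is described only briefly here. The function names, the `min_segment` guard and the way the 0.5 standard deviation threshold is applied are assumptions made for illustration.

```python
import statistics


def cusum(series):
    """Cumulative sum of deviations from the series mean."""
    mean = statistics.fmean(series)
    total, out = 0.0, []
    for value in series:
        total += value - mean
        out.append(total)
    return out


def detect_step(series, min_shift_in_std=0.5, min_segment=12):
    """Index of the first value after the most likely mean shift, or None.

    The CUSUM curve has its largest absolute value at the last point before
    a shift in the mean. The candidate is accepted only if the means of the
    two segments differ by at least min_shift_in_std standard deviations.
    """
    n = len(series)
    if n < 2 * min_segment:
        return None
    s = cusum(series)
    # Require at least min_segment values on each side of the candidate.
    candidates = range(min_segment - 1, n - min_segment)
    k = max(candidates, key=lambda i: abs(s[i]))
    shift = abs(statistics.fmean(series[k + 1:]) - statistics.fmean(series[:k + 1]))
    std = statistics.pstdev(series)
    if std == 0 or shift < min_shift_in_std * std:
        return None
    return k + 1


# Invented series: small oscillation around 10.0, then around 11.0.
wiggle = [0.1 if i % 2 == 0 else -0.1 for i in range(40)]
with_step = [(10.0 if i < 20 else 11.0) + w for i, w in enumerate(wiggle)]

print(detect_step(with_step))  # position of the detected step
print(detect_step(wiggle))     # no step expected in the plain oscillation
```

A strongly seasonal series can by itself produce a large CUSUM excursion, so a detector like this one would likely need deseasonalised input or a visual check.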

Line 135: Is it really necessary to give the equations for r-squared and NSE, as you don’t provide the equations for MSE or BIAS? There are also more unfamiliar calculations made in Tables 2 and 3 with no equations provided, and also the main CNN model is not described with equations. I guess I would suggest just removing equations 1 and 2, avoiding an asymmetry in descriptions, rather than adding equations for all the rest of the calculations.
Line 176: “Occasionally, in poorly performing models, the pattern of the GWL observations has been generally learned but with a strong Bias.” This is a little concerning, and I think it would be worth describing in more detail. Similar to your NSE/r-squared cutoffs above, can you provide a quantification of these problematic BIAS wells, something like in how many wells does the prediction not intersect the observation? What causes this BIAS, is it an unusually high section in the training period? I wonder if there is anything in the data preprocessing that plays into this issue.
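
One way to obtain the quantification requested in the last comment is to compute the mean bias per well and to flag wells where the simulated series never crosses the observed one. The sketch below is a hedged illustration only. The function names, the sign convention (simulated minus observed) and the data are assumptions, not taken from the manuscript or its code.

```python
def mean_bias(obs, sim):
    """Mean of (simulated - observed); positive values mean overestimation."""
    return sum(s - o for o, s in zip(obs, sim)) / len(obs)


def never_intersects(obs, sim):
    """True if the simulation stays strictly above or below the observations."""
    residuals = [s - o for o, s in zip(obs, sim)]
    return all(r > 0 for r in residuals) or all(r < 0 for r in residuals)


# Hypothetical test-period series for two wells, invented for illustration.
wells = {
    "well_A": ([10.0, 10.2, 10.1, 9.9], [10.1, 10.1, 10.2, 9.8]),   # crosses
    "well_B": ([10.0, 10.2, 10.1, 9.9], [10.6, 10.8, 10.7, 10.5]),  # offset
}

flagged = [
    name for name, (obs, sim) in wells.items() if never_intersects(obs, sim)
]
for name, (obs, sim) in wells.items():
    print(name, round(mean_bias(obs, sim), 2), name in flagged)
```

Counting the entries in `flagged` over all wells gives the number of wells whose prediction does not intersect the observation.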

Citation: https://doi.org/10.5194/egusphere-2023-1836-RC2
- AC2: 'Reply on RC2', Mariana Gomez, 31 Oct 2023
  
  Dear Jonathan Frame,
  Thank you very much for your comments on the manuscript. We value your input and would like to incorporate your suggestions where suitable and in accordance with the current objective of the paper.
  “this paper would greatly benefit from additional description of the training procedure and evaluation when dealing with gap-filled or processed data. Another benefit would be further evaluation of the sensitivity of model performance to the gap filling and data processing measures.”
  We will certainly elaborate on the data pre-processing by including a more detailed description of the data exploration and gap-filling methods used before applying the CNN model. Regarding the training procedure and the sensitivity of model performance to the gap filling, we believe that such an analysis would be valuable as future research. However, by using only time series with good data quality in terms of gap length and frequency, we seek to avoid a major influence of the data imputation approach. Therefore, we think that a sensitivity analysis is somewhat outside the scope of this study and could be the focus of a follow-up analysis.
  Line 44: These claims should be cited: “In terms of accuracy and calculation speed, the CNN models outperform the LSTM. NARX models performed, on average, better than CNN.”
  We agree that this needs to be cited.
  Line 47: “Most studies have successfully applied these techniques for GWL forecasting using only meteorological variables as inputs.” You might be interested in this paper: Gholizadeh et al., “Long short-term memory models to quantify long-term evolution of streamflow discharge and groundwater depth in Alabama”, Science of The Total Environment, Volume 901, 25 November 2023, 165884, where they did in fact include site geospatial characteristics to make predictions of wells that were held out from the training set (ungauged).
  Gholizadeh et al. 2023 applied an LSTM model including static inputs that refer to the aquifers' hydrogeology as an attempt to model ungauged locations. As the model does not include groundwater levels as inputs, the authors attribute the satisfactory model performance to input features such as hydraulic conductivity, soil depth, soil porosity, and maximum water content. These findings can contribute to the central discourse of the paper.
  Line 116: Can you please provide the total number (and percentage) of gap filled values referred to here: “To provide the CNN model with continuous time series, we performed a data imputation process through a Multiple Linear Regression (if enough dynamically similar wells based on the Euclidean Distance”. Can you also explain if these values were removed, or should be removed, from the loss during training, and also removed from the evaluation?
  Thank you for raising this point. We will include precise numbers regarding the data gaps and imputation. From the 505 groundwater level time series, 241 (48%) are complete, 254 (50%) have gaps of 2 consecutive values and 10 (2%) have gaps of 3 consecutive values. Overall, the time series have less than 5% gap-filled values. We did not remove them from the training phase since the number of filled values is not considerably high.
  Line 120: Similarly can you provide the total number (and percent) of data points that were modified as outliers described here: “We removed these anomalies by finding the highest slope in the cumulative sum”? Is this a standard approach? I don’t think this description is satisfactory. I see from your code that you identify these based on “initial point where the values increase by 0.5 of the standard deviation“. This is an important point that should be explained in the paper, as well as the decision to use this processing method.
  Only 28 wells were identified as having jumps/steps in their temporal record. The cumulative sum is commonly used to detect changes in the mean or variance along a time series; the features it detects here are not outliers. We intended to detect jumps/steps in the observed values that could hinder model training or confuse the model because of potential changes in the dynamics of the groundwater levels. The optimal fraction of the standard deviation was determined through trial and error by visually inspecting the detections and selecting the value that best fits most of the jumps. We will include these explanations in the revised manuscript. Thank you very much for pointing this out.
  Line 135: Is it really necessary to give the equations for r-squared and NSE, as you don’t provide the equations for MSE or BIAS? There is also more unfamiliar calculations made in Tables 2 and 3 with no equations provided, and also the main CNN model is not described with equations. I guess I would suggest just removing equations 1 and 2, avoiding an asymmetry in descriptions, rather than adding equations for all the rest of the calculations.
  We agree to remove the equations to make the manuscript more consistent regarding the inclusion of equations.
  Line 176: “Occasionally, in poorly performing models, the pattern of the GWL observations has been generally learned but with a strong Bias.” This is a little concerning, and I think it would be worth describing in more detail. Similar to your NSE/r-squared cutoffs above, can you provide a quantification of these problematic BIAS wells, something like in how many wells does the prediction not intersect the observation? What causes this BIAS, is it an unusually high section in the training period? I wonder if there is anything in the data preprocessing that plays into this issue.
  This comment relates a lot to the main concern, raised in the first reviewer comment (RC1). We will address this issue more in detail through re-running the model on a weekly resolution, which we expect will improve model performance.
  
  Citation: https://doi.org/10.5194/egusphere-2023-1836-AC2

Viewed

Total article views: 703 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
438	226	39	703	28	28

Views and downloads (calculated since 13 Sep 2023)

Month	HTML	PDF	XML	Total
Sep 2023	164	58	5	227
Oct 2023	79	24	7	110
Nov 2023	43	21	7	71
Dec 2023	22	17	1	40
Jan 2024	16	20	1	37
Feb 2024	15	20	3	38
Mar 2024	24	15	4	43
Apr 2024	18	9	5	32
May 2024	13	19	1	33
Jun 2024	23	14	4	41
Jul 2024	21	9	1	31

Latest update: 26 Jul 2024

Short summary

To understand how external factors affect groundwater level modelling with deep learning, we trained, validated, and tuned a CNN model individually at 505 wells distributed throughout the state of Lower Saxony, Germany. We then evaluated the performance against available geospatial features and time series features. The results provide new insights into the complexity of the factors controlling groundwater dynamics.


Total:	0
HTML:	0
PDF:	0
XML:	0