the Creative Commons Attribution 4.0 License.
High-resolution long-term average groundwater recharge in Africa estimated using random forest regression and residual interpolation
Abstract. Groundwater recharge is a key hydrogeological variable that informs the renewability of groundwater resources. Long-term average (LTA) groundwater recharge provides a measure of replenishment under the prevailing climatic and land-use conditions and is therefore of considerable interest in assessing the sustainability of groundwater withdrawals globally. This study builds on the modelling results of MacDonald et al. (2021), who produced the first LTA groundwater recharge map across Africa using a linear mixed model (LMM) rooted in 134 ground-based studies. Here, continent-wide predictions of groundwater recharge were generated using Random Forest (RF) regression employing five variables (precipitation, potential evapotranspiration, soil moisture, NDVI and aridity index) at a higher spatial resolution (0.1°) to explore whether an improved model might be achieved through machine learning. Through the development of a series of RF models, we confirm that an RF model is able to generate maps of higher spatial variability than the LMM; the performance of the final RF models in terms of goodness of fit (R² = 0.83; 0.88 with residual kriging) is comparable to that of the LMM (R² = 0.86). The higher spatial resolution of the predictor data (0.1°) in the RF models better preserves small-scale variability in the predictor data than the values provided via the interpolated LMM; these may prove useful in testing global-to-local scale models. The RF model remains, nevertheless, constrained by its representation of focused recharge and by the limited range of recharge studies in tropical Africa, especially in areas of high precipitation. This confers substantial uncertainty on model estimates.
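The residual-interpolation step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: inverse-distance weighting (IDW) is swapped in for kriging to keep the example dependency-free, the `trend` function is a hypothetical placeholder for a fitted RF model, and all coordinates and recharge values are invented for illustration.

```python
import math


def idw(x, y, sites, values, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from values at known sites."""
    num = den = 0.0
    for (sx, sy), v in zip(sites, values):
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return v  # exactly on an observation site: return its value
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den


def corrected_prediction(x, y, trend, sites, observed):
    """Trend prediction at (x, y) plus the interpolated residual (observed - trend)."""
    residuals = [obs - trend(sx, sy) for (sx, sy), obs in zip(sites, observed)]
    return trend(x, y) + idw(x, y, sites, residuals)


def trend(x, y):
    """Hypothetical placeholder for a fitted regression model (e.g. an RF)."""
    return 40.0 + 10.0 * x


# Invented toy data: coordinates in degrees, recharge in mm/yr (not from the study).
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
observed = [50.0, 80.0, 20.0]

at_site = corrected_prediction(1.0, 0.0, trend, sites, observed)
between = corrected_prediction(0.5, 0.5, trend, sites, observed)
print(at_site, between)
```

Because IDW is an exact interpolator, the corrected prediction reproduces the observation at each site; away from the sites, the correction is a distance-weighted blend of the residuals. Kriging would additionally model the spatial covariance of the residuals and supply an estimation variance.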
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-1898', Anonymous Referee #1, 06 Nov 2023
The authors use an RF approach to generate spatial long-term average groundwater recharge for Africa based on 134 recharge values from the literature and compare their results with field observations and with a previous publication using an LMM (linear mixed model). The results are generated and compared for two spatial resolutions. The RF approach performs very similarly to the LMM but offers higher spatial variability and therefore also shows small-scale trends.
Even though the approach is generally OK, the manuscript is very well written, and the workflow and code are available through GitHub (which I really appreciate), I still have some critical points that should be considered and discussed in detail in a revised version.
I'm somewhat unsure about the better spatial resolution of the results. Just because the resolution is better doesn't mean the results are more reliable. There is a very large uncertainty due to the few observations and their distribution, but the maps suggest a much better and more robust result, and this is dangerous. What would be the next step with the results, or what can the better spatial resolution be used for? If the data are extracted directly from the maps (for water budget calculations, for example), this can lead to very distorted results, as the simulated recharge values are very uncertain for many areas. I believe the uncertainty as a whole should be better discussed and the maps must better highlight the uncertainties (maybe with transparent colors, see my comment below).
I wonder why, for example, seasonality in precipitation is not present in the climatic input data. In some regions, precipitation only falls in a few months, and therefore the recharge processes are significantly different from conditions in which precipitation is distributed throughout the year. Yes, LMM or RF show a good fit/regression, but certain parameters may compensate for the missing input. Also, of course, the relative importance analysis does not show the importance of seasonality, but only because it has not been tested in the RF (although it was in the previous work using LMM, this is not directly transferable to the RF approach). The same applies to depth to the groundwater table (or unsaturated zone thickness), which is important for recharge processes, rates and timing. How important is this input for the RF algorithm and for the process description? I also wonder why distance to rivers is not included as a (raster) input, perhaps paired with discharge rates. This would help to better capture the important process of groundwater-surface water interaction and bank filtration, which many of the authors know better than I do.
Of course there is a large uncertainty in the precipitation data sets and in the timing of recharge, but wouldn't it be possible to significantly reduce these uncertainties, and also the scaling problem (the regression is dominated by the high recharge values), by using the recharge/precipitation ratio, and thereby obtain more robust results? It would be good if this could be discussed and tested further.
How does the spatially uneven distribution of the observations affect the results? Wouldn't it make more sense to show only the more robust areas and show the very uncertain ones transparently? Since not all climatic conditions have been covered, would clustering be useful to minimize the spatial discrepancy and influence?
Is the correlation of the aridity index with precipitation and ET not a problem for parameter estimation and generally with all estimation methods? Aridity is based on P and ET, and I wonder what the advantage is of using all three parameters. Looking at the SI, precipitation and aridity are the most important parameters, and I wonder what the results would look like if only aridity were used. When I see Table S4, I wonder why the results look almost the same for training and test, even if only P is used.
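As an illustration of the collinearity concern raised here, the sketch below computes an aridity index from precipitation and potential evapotranspiration and checks its Pearson correlation with both. It assumes the common definition AI = P / PET, which may differ from the definition used in the manuscript, and the values are invented for illustration rather than taken from the study.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented values in mm/yr, ordered from arid to humid (not from the study).
precip = [100.0, 300.0, 600.0, 1000.0, 1500.0]
pet = [2200.0, 2000.0, 1700.0, 1500.0, 1300.0]

# Assumed definition: aridity index = precipitation / potential evapotranspiration.
aridity = [p / e for p, e in zip(precip, pet)]

r_p_ai = pearson(precip, aridity)    # strongly positive for these toy values
r_pet_ai = pearson(pet, aridity)     # strongly negative for these toy values
print(round(r_p_ai, 3), round(r_pet_ai, 3))
```

When predictors are this strongly correlated, variable-importance scores in tree ensembles tend to be shared among them, so the ranking of P, PET and aridity should be read with caution.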
I'm not an expert on RF, but aren't the results usually validated using the ROC curve and sensitivity, specificity and accuracy rather than just the regression? That would be more informative about the model results and robustness than using only a regression.
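For context on this point: ROC curves, sensitivity and specificity are defined for classifiers, whereas a regression model with continuous output is typically scored with measures such as R² and RMSE. A minimal sketch of both follows, assuming R² is the coefficient of determination (the manuscript may define it differently, for example as a squared correlation) and using invented recharge values.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Invented recharge values in mm/yr (not from the study).
observed = [10.0, 50.0, 120.0, 300.0]
predicted = [20.0, 40.0, 130.0, 290.0]

r2 = r_squared(observed, predicted)
err = rmse(observed, predicted)
print(round(r2, 4), err)
```

Reporting these on held-out data (or via cross-validation) rather than on the training set is what makes them informative about robustness.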
Line 451: Process-based models also require careful input selection and quantification of uncertainties in the input dataset.
Citation: https://doi.org/10.5194/egusphere-2023-1898-RC1
AC1: 'Reply on RC1', Anna Pazola, 08 Feb 2024
RC2: 'Comment on egusphere-2023-1898', Anonymous Referee #2, 21 Nov 2023
AC2: 'Reply on RC2', Anna Pazola, 08 Feb 2024
Data sets
High-resolution long-term average groundwater recharge in Africa estimated using random forest regression and residual interpolation, Anna Pazola, https://doi.org/10.6084/m9.figshare.22591375.v1
Model code and software
Application of random forest regression in modelling long-term average groundwater recharge in Africa, Anna Pazola, https://github.com/pazolka/rf-groundwater-recharge-africa
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
---|---|---|---|---|---|---|
375 | 192 | 29 | 596 | 51 | 19 | 24 |
Anna Pazola
Richard G. Taylor
Mohammad Shamsudduha
Jon French
Alan M. MacDonald
Tamiru Abiye
Ibrahim Baba Goni