the Creative Commons Attribution 4.0 License.
Skills in sub-seasonal to seasonal terrestrial water storage forecasting: insights from the FEWS NET land data assimilation system
Abstract. Accurate prediction of terrestrial water storage (TWS), the sum of soil moisture, groundwater, snow/ice, and surface water, is critical for informing water resource management and disaster responses. In this study, we evaluated subseasonal to seasonal (S2S) TWS forecasts, produced by the FEWS NET land data assimilation system (FLDAS), over Africa using observations from the Gravity Recovery and Climate Experiment (GRACE) and its Follow-On (GRACE/FO) mission. FLDAS consists of two advanced land surface models, Noah-MP and the NASA Catchment Land Surface Model (CLSM), both of which simulate key TWS components including groundwater. Results show that CLSM is more skillful in forecasting TWS anomalies at S2S scales than Noah-MP, with >0.6 relative operating characteristics (ROC) scores over more than half of the study domain across the 1–6 month lead times. CLSM forecasts also maintain stronger correlations with GRACE/FO data than Noah-MP, particularly at longer lead times, owing to more skillful reanalysis-based initial conditions and stronger persistence in simulated TWS. In contrast, Noah-MP forecasts show weaker skill, especially in central Africa where the skill also declines rapidly with lead time.
Evaluation results show that accuracy of TWS forecasts is strongly influenced by precipitation interannual variability: forecasts driven by precipitation products with lower interannual variability are generally more accurate than those driven by products with higher variability. The performance gap between Noah-MP and CLSM is also more pronounced in regions with higher precipitation variability such as central Africa. This sensitivity arises because TWS often exhibits strong multi-year variability in response to interannual precipitation, making realistic simulation of long-term variability critical for skillful TWS forecasts. The superior performance of CLSM is attributed to its strong representation of upward groundwater movement, especially during prolonged droughts, which enhances TWS interannual variability. In contrast, the weak representation of capillary rise in Noah-MP limits its ability to capture effects of long-term precipitation variability on TWS. Both models exhibit lower correlation and higher RMSEs when evaluated against GRACE/FO data than relative to reanalysis, further underscoring substantial uncertainty in model physics.
Autocorrelation analyses show that TWS persistence is closely linked to groundwater persistence. CLSM groundwater exhibits stronger persistence than that of Noah-MP, owing to its ability to simulate groundwater responses to long-term precipitation variability. While persistence provides an important source of predictability, our results also show that inaccurate persistence, such as that associated with anthropogenically induced trends and changes in precipitation that are often inadequately captured by land surface models, can degrade forecast skill. These findings underscore the importance of using independent datasets such as GRACE/FO observations to evaluate TWS forecasts.
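The persistence discussed in the abstract can be illustrated with a lag autocorrelation. The following is a minimal sketch and not the authors' implementation: the function name `lag_autocorrelation` and both sample series are hypothetical, and it assumes persistence is measured as the Pearson correlation between a monthly anomaly series and a copy of itself shifted by the lead time.

```python
from statistics import mean


def lag_autocorrelation(series, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    if lag < 1 or lag >= len(series) - 1:
        raise ValueError("lag must be between 1 and len(series) - 2")
    x = series[:-lag]   # values at time t
    y = series[lag:]    # values at time t + lag
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / (var_x * var_y) ** 0.5


# Hypothetical monthly anomalies (arbitrary units), for illustration only.
slow = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]       # drifts steadily
fast = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # flips sign every month

for name, series in (("slow", slow), ("fast", fast)):
    print(name, round(lag_autocorrelation(series, 1), 3))
```

In this toy case the steadily drifting series has a high one-month autocorrelation and the alternating series a strongly negative one, which is the sense in which a slowly varying store such as groundwater would add persistence to TWS.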
Competing interests: At least one of the (co-)authors is a member of the editorial board of Hydrology and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4198', Anonymous Referee #1, 05 Nov 2025
AC2: 'Reply on RC1', Bailing Li, 08 Jan 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4198/egusphere-2025-4198-AC2-supplement.pdf
RC2: 'Comment on egusphere-2025-4198', Anonymous Referee #2, 11 Nov 2025
Summary:
The authors evaluate the skill of FEWS NET S2S terrestrial water storage (TWS) hindcasts over Africa. This study is relevant for improving forecasts and the management of extreme conditions across the continent, contributing to enhanced climate resilience. The focus lies on the differing performance of two land surface models—CLSM and Noah-MP—in comparison with GRACE observations. The analysis also includes the decomposition of TWS into its individual storage components and an assessment of TWS persistence. Evaluation metrics used are RMSE, correlation, and ROC.
Overall, the insights presented in the paper are valuable and merit publication. However, three major points should be addressed:
- Some parts of the manuscript are difficult to follow. This can likely be resolved by adding a few clarifying explanations; specific suggestions are provided in the detailed comments.
- The manuscript contains contradictory statements regarding the role of surface water bodies in Africa. In the data section, they are described as small, yet later results indicate that they are relevant. Given that surface water bodies can substantially influence TWS variability in many African regions—and considering the existing literature on this topic—it may be advisable to remove their contribution from the GRACE data before performing the analysis.
- The evaluations in Section 3.1 are based on time series averaged over the entire African continent. In my view, such continental-scale averages offer limited interpretive value due to the large diversity of climate zones and hydrological regimes. The authors also appear to have difficulties interpreting the signal they observe. It would be more informative to present time series for a selection of representative regions.
Specific comments:
Abstract:
- l. 18: “> 0.6 relative operating characteristics” → this measure is not clearly defined. Do you think that it is obvious what it represents?
- L 27: you talk about multi-year variability, but before you said that you do S2S forecasts. This is confusing… so what is the aim here?
- L 32: “relative to reanalysis” → which reanalysis are you referring to?
- L.35: Would it make sense to explain the term “persistence”? Maybe make it clearer by using the term “temporal persistence”.
1 Introduction:
- L 56/57: I could not understand what you want to say with this sentence.
- L 60: I would say that groundwater being a potential source of predictability for TWS depends mainly on its variability?
- L 64: the sensitivity to biases in the meteorological forcing depends much on the response time, which might be much longer than the S2S scale.
- L 76: you may add the following two studies to your discussion (if you think it is fitting!). They assimilated GRACE-based forecasts of TWS into hydrological models in order to improve the forecast skill of the models:
- Li, F., Springer, A., Kusche, J., Gutknecht, B., Ewerdwalbesloh, Y. (2025). Reanalysis and Forecasting of Total Water Storage and Hydrological States by Combining Machine Learning With CLM Model Simulations and GRACE Data Assimilation. Water Resources Research, e2024WR037926, https://doi.org/10.1029/2024WR037926
- Li, F., Kusche, J., Sneeuw, N., Siebert, S., Gerdener, H., Wang, Z., ... & Tian, K. (2024). Forecasting next year's global land water storage using GRACE data. Geophysical Research Letters, 51(17), e2024GL109101. https://doi.org/10.1029/2024GL109101
- L 90: Here you talk about hindcast for the first time. Before you only talk about forecast. You may introduce this.
- L 91: multi-model → are there more models involved besides CLSM and Noah?
- L. 95/96: please provide references for the past studies
- L. 97: It is not clear to me how autocorrelation analysis can be applied to processes. I would say it can be applied to variables.
2 Data and evaluation metrics
- L101: how do you make sure that CHIRPS-based precipitation and the other fields are consistent? And why do you use different fields for generating the initial conditions than those used afterwards for the hindcasts?
- L. 147: in the long term I agree that groundwater variability is balanced by P and ET, but at the S2S scale I have doubts. The papers you cite both refer to the US, which has very different climate regimes and soils. Do you have evidence over Africa that groundwater is influenced by ET at the S2S scale?
- L. 154: It was shown by Ndehedehe et al. (2017) that Lake Volta contributes to 40% of the TWS trend in the Volta basin. So I think that in particular over West Africa you cannot neglect surface water bodies when comparing to GRACE.
- Ndehedehe, C. E., Awange, J. L., Kuhn, M., Agutu, N. O., Fukuda, Y. (2017). Analysis of hydrological variability over the Volta river basin using in-situ data and satellite observations. Journal of Hydrology: Regional Studies, 12, 88-110. https://doi.org/10.1016/j.ejrh.2017.04.005
- L 161: Could you clarify: do you evaluate in the following the ensemble mean?
- L 170: why do you remove only 5 years as temporal mean? What about regions that have strong interannual variability? I would expect that in particular the percentile maps shown in Fig. 8 can be significantly affected by the choice of the time span for the temporal mean.
- L. 178: do you mean that you removed the climatology?
- L. 180: you could explain at some point the differences between reanalysis and hindcasts.
- L. 185: you used white color in the figures to highlight regions that were masked out. This is not clear, better use gray color.
- L. 186: you could make the percentiles clearer by an equation?
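The kind of equation requested in the comment on L. 186 might look like the following empirical percentile. This is a sketch under assumed notation, not the definition used in the manuscript: the symbols `x`, `x_i`, and `N`, and the choice of a per-cell, per-calendar-month climatological sample, are all assumptions made for illustration.

```latex
% Assumed notation, not taken from the manuscript:
%   x     forecast TWS anomaly at one grid cell for a given calendar month
%   x_i   the N climatological values for the same cell and calendar month
%   1[.]  indicator function (1 if the condition holds, 0 otherwise)
P(x) = \frac{100}{N} \sum_{i=1}^{N} \mathbf{1}\left[ x_i \le x \right]
```

Under this form, the percentile depends directly on which years make up the climatological sample, which is why the choice of reference period raised in the comment on L 170 matters for the percentile maps.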
3 Results
- L 197 ff: The structure is difficult to understand. You introduce the three influencing factors, and then say how each of them is isolated. However, for the first two factors I cannot understand how they are connected to what you analyze. Could you make this clearer? How are initial conditions connected to temporal variability of reanalysis? How are meteorological forecasts connected to different lead times? And to isolate model physics shouldn’t you compare to GRACE?
- L 203: Are you sure that it makes sense to average over such a huge study area? Wouldn’t it be more appropriate to look at regions of interest? For instance, you are not able to interpret the interesting signals that you highlight in L 211, because you do not know from which part of your region they come.
- L. 224: Why are there two numbers here? I am skeptical that trends of a few micrometers / month are significant.
- L 226: But I think you masked out regions with large groundwater withdrawals. And their impact cannot be that big when averaged over the entire study region.
- L. 235: You also clearly see Lake Volta here, which shows that you cannot neglect surface water in general.
- L. 240: you do not show discrepancies among NMME models in S1,2 directly, maybe compute the ensemble spread?
- L. 242: what about Lake Volta? It is clearly visible (however, before in the manuscript you say that surface water can be neglected).
- L. 263: on average,… → this is a repetition.
- L. 283 – 286: this seems to be a central insight of your experiments. Could you highlight it a bit more?
- L. 289-291: I do not understand the relationship between decreased RMSE and overestimation of TWS interannual variability. Could you please clarify this?
- L. 294 – 295: model physics have stronger influences than meteorological forecasts: didn’t you show that this is different for NOAH and CLSM?
- L.305: So would it be better to use the forcings that lead to the best results for forecasts and not the ensemble mean?
- L 326: Could you define persistence somewhere in one sentence?
- L 358: Could you add an introductory sentence to remind the reader which kind of percentiles you are referring to?
- L 358ff: would it make sense to involve percentiles from GRACE into the extreme event discussion?
- Fig. 8: Why do you show and discuss percentiles only for CLSM and not for NOAH?
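The ensemble spread suggested in the comment on L. 240 could be computed along the following lines. This is a minimal sketch with made-up numbers: the function name `ensemble_mean_and_spread` and the three member series are hypothetical, and it assumes spread is taken as the sample standard deviation across members at each time step.

```python
from statistics import mean, stdev


def ensemble_mean_and_spread(members):
    """Return (mean, spread) per time step for a list of equal-length series.

    Spread is the sample standard deviation across ensemble members.
    """
    if len(members) < 2:
        raise ValueError("need at least two ensemble members")
    if len({len(m) for m in members}) != 1:
        raise ValueError("all members must have the same length")
    by_time = list(zip(*members))  # regroup values by time step
    means = [mean(values) for values in by_time]
    spreads = [stdev(values) for values in by_time]
    return means, spreads


# Hypothetical precipitation forecasts from three models (arbitrary units).
members = [
    [1.0, 2.0, 4.0],
    [3.0, 2.0, 5.0],
    [5.0, 2.0, 9.0],
]
means, spreads = ensemble_mean_and_spread(members)
for step, (m, s) in enumerate(zip(means, spreads), start=1):
    print(f"step {step}: mean={m:.2f} spread={s:.2f}")
```

A map or time series of such a spread would show directly where and when the models disagree, rather than leaving the reader to infer it from separate panels.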
4 Summary and discussion
In this section many aspects from the results section are repeated. It would be great if you could add to each paragraph some more interpretation or insights into what this means for future research or applications.
Minor comments:
Abstract:
- Abbreviation FEWS NET not defined.
Data and evaluation metrics:
- Table 1: abbreviations are not defined
Results
- L 225: show a statistical… → add “a”
Citation: https://doi.org/10.5194/egusphere-2025-4198-RC2
AC1: 'Reply on RC2', Bailing Li, 08 Jan 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4198/egusphere-2025-4198-AC1-supplement.pdf
RC3: 'Comment on egusphere-2025-4198', Anonymous Referee #3, 15 Dec 2025
The article evaluates S2S TWS forecasts produced from FLDAS over Africa using gravity observations. I think the article reads well and I think it scratches the surface of a relatively unexplored area. That is, it highlights the importance of improving model physics of groundwater as well as its relevance for S2S forecasts. I want to also say that the authors motivate the GRACE community to reduce latency on their products, as GRACE-DA could be beneficial to improve the forecast. I’d encourage the authors to add some of these caveats in the conclusions section. Besides this “major” comment, I list here only minor suggestions for the authors.
2.2 > It is unclear to me what NMME models are. At around line 111, please add a brief, broad description of why they are needed here. It is also unclear what the downscaling is and why it is needed. Table 1 > what variables of these models are used?
Line 138: Typo CLMS
Line 186: are the percentiles computed using seasonal mean? Please clarify in manuscript
Fig 4. Why is the correlation so small already at 1-month lag time? How significant are these statistics?
Line 308 – 324: it is unclear in my opinion what this analysis is really telling us. What is ROC and why is it computed only on the lower tercile (drier forecasts?)? Please add some general background on the metric and its interpretation.
Fig 6 > wouldn’t it be useful to also show difference maps of the first two rows wrt GRACE (bottom)?
Fig 8. Can the authors make it clear that the top left figure is the IC and everything else is forecasts? At first I thought the top row was initialization, while the bottom was the forecasts.
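As general background to the question about ROC above, the score is commonly summarized as the area under the ROC curve, which equals the probability that a case where the event occurred received a higher forecast probability than a case where it did not. The sketch below illustrates that rank-based form for a lower-tercile (dry) event. It is not the manuscript's procedure: the function names, the tercile cutoff rule, and all numbers are hypothetical.

```python
def lower_tercile_flags(values):
    """Flag the lowest third of values (ties at the cutoff are included)."""
    cutoff = sorted(values)[max(len(values) // 3 - 1, 0)]
    return [v <= cutoff for v in values]


def roc_area(probabilities, occurred):
    """Area under the ROC curve via pairwise comparison.

    Counts how often an event case has a higher forecast probability than a
    non-event case; ties count as half. 0.5 means no skill, 1.0 is perfect.
    """
    event = [p for p, o in zip(probabilities, occurred) if o]
    non_event = [p for p, o in zip(probabilities, occurred) if not o]
    if not event or not non_event:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for pe in event:
        for pn in non_event:
            if pe > pn:
                wins += 1.0
            elif pe == pn:
                wins += 0.5
    return wins / (len(event) * len(non_event))


# Hypothetical observed anomalies and forecast probabilities of a dry event
# (e.g. the fraction of ensemble members below the lower-tercile threshold).
observed = [5.0, -3.0, 2.0, -4.0, 1.0, 0.0]
dry_probability = [0.1, 0.8, 0.2, 0.6, 0.7, 0.3]

dry_event = lower_tercile_flags(observed)
print(roc_area(dry_probability, dry_event))
```

Read this way, a score above 0.5 means the forecasts tend to assign higher dry-event probabilities to the cases that actually turned out dry; restricting the event to the lower tercile focuses the evaluation on drought-like conditions.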
Citation: https://doi.org/10.5194/egusphere-2025-4198-RC3
AC3: 'Reply on RC3', Bailing Li, 08 Jan 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4198/egusphere-2025-4198-AC3-supplement.pdf
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 1,430 | 145 | 33 | 1,608 | 110 | 20 | 29 |
Summary: the authors offer an evaluation of FEWSNET S2S terrestrial water storage forecasts for Africa. The manuscript focuses on differences between the two land surface models included in the FEWSNET forecast ensemble--CLSM and Noah-MP--and offers commentary on the performance of each. Overall, they conclude that CLSM offers advantages when simulating and forecasting TWS. Results also show how various NMME meteorological S2S forecasts compare, but these results are not emphasized in the discussion. The primary source of evaluation data in the main text is GRACE, while information on precipitation forecasts is contained in supplementary material and is addressed only briefly in the text.
I find the results presented in the manuscript to be interesting, and the explanation of these results is generally quite clear and useful. I did find myself a bit confused at times, when the authors bounced between comparing hindcasts to reanalysis and comparing hindcasts to GRACE observations, and when some of the explanation of geographic patterns seemed to me to be speculative. But these are minor points, and I have only a few questions that I would like to see addressed before the paper is published in final form.
Specific comments:
Line 204: isn't the 1m CLSM "soil depth" a choice that was made by the authors? This implementation of the model might output 1m soil moisture, but the model also has an implicit soil water profile that could be used to extract an estimate of total soil moisture integrated to any depth. Similarly (and maybe more easily) the authors could have used 1m soil moisture from Noah-MP rather than the full 2m column. Why not compare 1m CLSM to 1m Noah-MP, or 2m CLSM to 2m Noah-MP?
Lines 234-249: In Figure 2, the reanalysis errors look almost identical to the forecast errors for both Noah-MP and CLSM. Yet the authors invoke NMME uncertainties when explaining some aspects of model errors. Given that the patterns and magnitude of error appear to be very similar in reanalysis and in forecasts at all lead times, aren't these errors more about model bias than about forecasts? Even the explanations that invoke interannual climate variability seem like they'd need more evidence in their support, since we'd want to know that errors in interannual meteorological variability are seen in a similar way in both CHIRPS (or MERRA-2) and in the NMME models.
Line 285: If these results compare model forecasts to their own reanalysis, can we really say that degradation of Noah-MP forecasts is due to an "inability" to simulate long-term TWS variability? Couldn't we just as easily say that the persistence of CLSM forecasts is due to that model's "inability" to simulate rapid runoff and drainage? Without an independent evaluation dataset (for this specific result) it's not possible to know which model's behavior is better. That said, the subsequent results that *do* offer comparison with GRACE make a more convincing case. I would recommend that the authors avoid making statements about the quality of model performance when using the retrospective simulations as the truth. (In fact, they might consider moving these statements out of this section, as I admit that I was confused on my first reading about which statements had an observational basis and which were about simulation comparisons.)
Section 3.4: Why aren't any GRACE comparisons offered in this section? It seems odd to show the forecast without any evaluation.