the Creative Commons Attribution 4.0 License.
Filling Data Gaps in Soil Moisture Monitoring Networks via Integrating Spatio-temporal Contextual Information
Abstract. As critical inputs for global climate studies, watershed hydrologic modeling, and satellite soil moisture product validation, in situ soil moisture measurements are frequently compromised by sensor-derived data gaps that disrupt hydrological continuity. To overcome this challenge, we develop ST-GapFill, a novel spatiotemporal reconstruction framework integrating multi-source contextual information through two key innovations: (1) spatial correlation-guided neighbor selection that identifies optimal auxiliary stations; (2) a long short-term memory (LSTM) network that captures the complex temporal dependencies within the soil moisture time series. Validation on in situ networks demonstrates that ST-GapFill successfully reconstructs soil moisture dynamics with preserved diurnal-phase fluctuations, achieving a correlation coefficient of 0.91 with ground truth under low missing-rate conditions (<50 %). Comparative analysis reveals ST-GapFill's statistically superior performance (RMSE reduction: 27.0 % vs IDW, 67.8 % vs ARIMA). This method establishes a robust spatiotemporal imputation paradigm for environmental sensor networks, effectively bridging observation gaps to support precision agriculture and climate change impact assessments.
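To make the first component concrete, the following is a minimal sketch of correlation-guided neighbor selection. It is not the authors' implementation: the function names, the use of Pearson correlation over jointly observed timestamps, the top-k rule, and the station readings are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation over timestamps where both series are observed (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def select_neighbors(target, candidates, k=2):
    """Return the names of the k candidate stations most correlated with the target."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    scored = [(name, r) for name, r in scored if not math.isnan(r)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical volumetric soil moisture readings; None marks a gap.
target = [0.20, 0.22, None, 0.27, 0.25, 0.23]
candidates = {
    "A": [0.21, 0.23, 0.26, 0.28, 0.26, 0.24],  # tracks the target closely
    "B": [0.30, 0.28, 0.27, 0.25, 0.26, 0.29],  # moves opposite to the target
    "C": [0.18, 0.21, 0.22, None, 0.24, 0.20],  # similar, with its own gap
}
neighbors = select_neighbors(target, candidates, k=2)
```

In this toy case the selected stations would serve as auxiliary inputs to the temporal model; how the manuscript actually weights or thresholds the correlations is not specified in the abstract.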
Status: open (until 27 Nov 2025)
- RC1: 'Comment on egusphere-2025-1900. First review, M.', Mikhail Sarafanov, 31 Aug 2025
- AC1: 'Reply on RC1', weixuan wang, 28 Sep 2025 (the comment was uploaded in the form of a supplement)
- CC1: 'Comment on egusphere-2025-1900', Huizhen Cui, 21 Oct 2025
The manuscript presents a novel spatiotemporal reconstruction method called ST-GapFill, which combines spatially-optimized neighbor selection with an LSTM model to effectively capture both spatial and temporal dependencies for reconstructing missing data in soil moisture (SM) sensor networks. The results demonstrate ST-GapFill outperforms the traditional SVR and ARIMA models, particularly for block missing data (NMR). The results provide a reference for supplementing long-term soil moisture observation network data. However, the manuscript could be improved by enhancing the clarity of its structure, the quality of the figures, and the precision of the description. Several scientific or presentation issues need to be addressed.
General Comments:
- The Introduction currently presents gap-filling methods directly. It would benefit from first providing a comprehensive overview of standard methodologies for handling missing data in soil moisture observation networks, thereby creating a smoother transition to the presented methodology. Furthermore, the literature review is comprehensive but could be more focused. The existing literature review in the Introduction needs to be better synthesized to refine the logical connections and significance of each reference.
- It is suggested that the article structure be slightly reorganized by presenting the gap-filling results first, followed by the result analysis. This arrangement would improve the logical flow and enhance reader comprehension.
- There is a critical contradiction in the description of ST-GapFill's performance. The text states: "For other types of missing patterns, ST-GapFill performs best" (Line ~341), but later concludes "ST-GapFill is not applicable to long interval time series missing patterns of Type 2" (Line ~353). This creates confusion. The results in Figure 9(b) clearly show that IDW outperforms ST-GapFill for Type 2 (MR at the same sensor). The authors should rephrase this to accurately reflect the findings: ST-GapFill is the best-performing model for Type 1, 3, and 4, but for Type 2 (consecutive gaps at a single sensor), the spatial interpolation of IDW is more robust. Please provide clear conclusions based on a comprehensive analysis of both the gap-filling results and algorithm performance.
- In Section 4.2, how is the model's performance evaluated specifically on the filled gaps in Figure 8? The text states that sites with real missing values (such as L7 and M8) were used. It must be clarified that the MAE and RMSE are calculated only at the timestamps where data were originally missing and then filled, comparing the filled value to the held-out "ground truth."
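The evaluation asked for in the last comment can be sketched as follows. This is an illustrative assumption about how gap-only scoring might be done, not the manuscript's code: the function name `gap_errors`, the boolean mask, and the sample values are hypothetical.

```python
import math

def gap_errors(truth, filled, gap_mask):
    """MAE and RMSE computed only where gap_mask is True (the held-out points)."""
    diffs = [f - t for t, f, m in zip(truth, filled, gap_mask) if m]
    if not diffs:
        raise ValueError("no gap timestamps to evaluate")
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Hypothetical series: two values were withheld and then reconstructed.
truth = [0.20, 0.22, 0.24, 0.26, 0.25]
gap_mask = [False, True, False, True, False]
filled = [0.20, 0.23, 0.24, 0.23, 0.25]
mae, rmse = gap_errors(truth, filled, gap_mask)
```

Scoring over the whole series instead would dilute the error with the untouched observations, which is why the mask matters.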
Specific Comments:
- Section 2.2: The rationale for the correlation length scale L = 50 km is based on the network's grid size, which is valid. However, a brief mention of whether this parameter was optimized or validated through sensitivity analysis would strengthen the methodological rigor.
- The stations in Figure 3 are too dense to see clearly. It is recommended to use distinct markers for the three station types (S, M, L) and present them in separate subplots for clarity.
- Line 194 “A method of taking averages was used to resample all data uniformly for 30 minutes.” What is the specific reason for resampling the data to a 30-minute interval, as opposed to simply using the native half-hourly observations?
- The data in Figure 6 is too densely plotted, which obscures the specific periods of missing data. It is recommended to split the figure into three subplots, each corresponding to the missing data periods T1, T2, and T3, to improve legibility.
- Line 238: "re-peats" should be "repeats".
- Line 295: The claim that highly correlated sites have "similar environmental factors" should be expanded slightly. What are these factors? (e.g., soil type, land cover, topography). This would add depth to the spatial analysis.
- Line 314-313: “the optimal window sizes of 250 for ARIMA, 50 for ST-GapFill, and 50 for SVR were finally obtained, respectively.” Could you clarify why the uniform sliding window size is set to 100 at Line 302?
- Line 364-369: When comparing ST-GapFill's performance to other studies (e.g., Chen et al., 2020; Moreno-Martinez et al., 2020), it would be beneficial to provide a direct quantitative or qualitative comparison, if possible, rather than just stating that those methods have "limitations."
- Line 370-372: Since 0.038 is greater than 0.03, please revise the way this is stated to ensure an accurate representation.
- The color contrast in the legend of Figure 11's subplots is not strong enough, making it difficult to see. Also, please label the missing data pattern in each subplot for easier understanding.
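On the resampling question raised for Line 194 above, a minimal sketch of mean-based resampling onto a 30-minute grid may help show what the step does. The 10-minute input spacing, the function name `resample_mean`, and the values are assumptions; the manuscript's native sampling interval is not stated here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def resample_mean(records, minutes=30):
    """Average (timestamp, value) records into fixed bins, labelled by bin start."""
    bins = defaultdict(list)
    for ts, value in records:
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        offset = int((ts - midnight).total_seconds() // (minutes * 60))
        bins[midnight + timedelta(minutes=offset * minutes)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(bins.items())}

# Hypothetical 10-minute readings for one hour.
start = datetime(2020, 1, 1, 0, 0)
values = [0.20, 0.22, 0.24, 0.30, 0.30, 0.30]
records = [(start + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]
resampled = resample_mean(records)
```

If the observations already sit exactly on a half-hourly grid, each bin holds a single value and the averaging changes nothing, which is the situation the comment asks the authors to clarify.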
Citation: https://doi.org/10.5194/egusphere-2025-1900-CC1
General comments
In their article “Filling Data Gaps in Soil Moisture Monitoring Networks via Integrating Spatio-temporal Contextual Information,” the authors explore approaches to filling gaps in soil moisture data. The article proposes a new method based on statistical calculations and machine learning techniques. The results are tested at a single geographic location. The error of the developed algorithm is compared with alternative solutions. The strong points include clear explanations of the proposed algorithm, a fairly good description of the experiments, and well-reasoned conclusions confirmed by experiments.
Nevertheless, it is recommended to improve the structure of this scientific paper, refine the literature review, and consider adding an additional simple method for filling in the gaps and comparing its error with the current results.
I would like to note my wish to add the source code to this article in the form of an open repository. I understand that in some cases, adding code is impossible due to various agreements and grant conditions. However, I kindly ask my colleagues, if possible, to take the time to create a repository with the model and experiment code for the article. Let's take steps to overcome the reproducibility issue in modern research.
Specific comments
Line 35: “statistical interpolation and methods based on artificial intelligence.”
- I think it makes sense to consider the classification in more detail, because this division is quite general, and I did not see any further justification for this particular separation in the text.
For example, further in the text, “The artificial intelligence-based approach performs spatio-temporal modeling by capturing the complex non-linear relationship” refers to the specificity of the category “artificial intelligence methods”—nonlinearity. However, k-nearest neighbor interpolation, which the authors included in the first category, allows modeling nonlinear relationships. Firstly, I believe that the division into “statistical interpolation and methods based on artificial intelligence” is unnecessary in the context of this article. Secondly, I suggest reworking this section and providing a more comprehensive and detailed analysis of the solutions. For the sake of systematization, it would be useful to create a table comparing the methods.
Line 110: Methodology section
- I suggest changing the order of the section: first, discuss Data pre-processing, then Correlation calculation, and then Long short-term memory. This will make the narrative more coherent: from data to the correlation analysis method, and then conclude with an explanation of the final model. At the same time, I suggest paying special attention here to explaining why this particular architecture of the final algorithm was proposed (clearly specify in the text what each individual block is responsible for), while the explanation of how the LSTM architecture of neural networks works is not an important part of the narrative. It will be enough to provide a link and not focus on this.
Line 235: To evaluate the model performance, the full dataset was randomly split into training (80%) and testing (20%) sets using train_test_split from the Scikit-learn library.
- This can be kept as it is, but I think not every researcher in the field of geosciences knows Python programming, so it might be useful to explain this step in plain text. And since we are talking about using Python here, I would be happy to take a look at the source code of the experiments. I suggest that the authors make their model available as an open-source solution and create a repository on GitHub (if it is legally possible).
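For readers unfamiliar with the library call, the step can be described as: shuffle the samples at random, then set aside one fifth of them for testing. The sketch below imitates that behavior with the Python standard library only; it is not the Scikit-learn implementation, and the function name `split_80_20`, the seed, and the sample data are assumptions.

```python
import random

def split_80_20(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples, then cut off the first test_fraction as the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(100))  # stand-ins for 100 input/target pairs
train_set, test_set = split_80_20(samples)
```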
Line 250: The smaller the RMSE and MAE are, the higher the accuracy
- It is better to avoid using the term “accuracy,” which has a very specific meaning in machine learning. It is better to say “the smaller the error, the better the model.”
Line 300: Figure 7: Correlation between sites, showing high correlations among nearby small-scale sites and varying correlations among larger sites.
- Since the article discusses “Spatio-temporal Contextual Information,” it would be useful to include a map (there is space on the right) to show the location of these stations. You can even take a specific station and show its neighboring stations in color depending on the correlation coefficient used: dark blue dots if the correlation coefficient is weaker, and yellow if it is stronger, just like on the matrix.
Line 340: Figure 10: Comparison of coefficient of determination (R²) under different missing data types. Higher R² values indicate better agreement between predicted and observed soil moisture.
- Please give an example of how R² is calculated; the formulas for MAE and RMSE, for example, are given above. It may be useful to calculate adjusted R² and visualize it, instead of the regular coefficient of determination.
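As an illustration of what such an example could look like, here is a sketch using the standard definitions R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1). Whether the manuscript uses this definition is not confirmed; the function names, the number of predictors p, and the values are assumptions.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalise R² for the number of predictors p given n samples (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(observed, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(observed), p=1)
```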
Line 420: 5 Conclusion, Figure 11
- Thank you for the clear visualization of the modeling results. Looking at the graphs, it seems to me that simple methods, such as the LOCF (last observation carried forward) method or, if the experimental setup allows, linear interpolation of the time series, could perform just as well as the baseline approaches considered here (for example ARIMA or SVR). I understand that predictive models based on previous values cannot “look” into the future. However, from the problem statement, I do not see any restrictions on why information before and after the gap cannot be used to fill it. In any case, could you please add LOCF here for comparison.
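The two simple baselines mentioned here can be sketched in a few lines. This is a generic illustration with hypothetical function names and values, using `None` to mark a gap; it makes no claim about how they would rank against the models in the manuscript.

```python
def locf(series):
    """Last observation carried forward: each gap takes the most recent observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None if the series starts with a gap
    return filled

def linear_interpolate(series):
    """Fill gaps that have an observed value on both sides with a straight line."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # leading and trailing gaps are left as None

series = [0.20, None, None, 0.26, None]
by_locf = locf(series)
by_linear = linear_interpolate(series)
```

LOCF uses only information before the gap, whereas linear interpolation also needs the first observation after it, which is the distinction drawn in the comment.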
Technical corrections
Line 190: Figure 3
- The markings (dots) are hard to see. You could use regular dots instead of “stars”. You could also show the L, M, and S stations in different colors to make them easier to distinguish.
Line 230: Figure 4
- Please add names to Types 1-4 in the image, so that it is easier to connect the visualization to the text. For example, “(a) Type 1 - completely random missing”, etc.
Line 245: The performance evaluation metrics include Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The calculation method is as follows: ...
- Please indicate what y means in the formulas and what y with a hat means. I know most readers will understand, but it is better to be clear.
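For reference, the standard definitions with the symbols spelled out are given below. The manuscript's own notation is not reproduced on this page, so the symbols are assumptions: y_i is the observed soil moisture at evaluation point i, \hat{y}_i is the corresponding filled (predicted) value, and n is the number of evaluated points.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
```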
Line 285: Figure 6
- It may make sense to move the footnotes with gap indicators to the top so that they do not overlap with the dates on the X-axis.
Line 340: Figure 10: Comparison of coefficient of determination (R²) under different missing data types. Higher R² values indicate better agreement between predicted and observed soil moisture.
- In this context, I suggest using “better consistency” instead of “better agreement”