the Creative Commons Attribution 4.0 License.
Filling Data Gaps in Soil Moisture Monitoring Networks via Integrating Spatio-temporal Contextual Information
Abstract. In situ soil moisture measurements are critical inputs for global climate studies, watershed hydrologic modeling, and satellite soil moisture product validation, but they are frequently compromised by sensor-derived data gaps that disrupt hydrological continuity. To overcome this challenge, we develop ST-GapFill, a novel spatiotemporal reconstruction framework that integrates multi-source contextual information through two key innovations: (1) spatial correlation-guided neighbor selection that identifies optimal auxiliary stations; and (2) a long short-term memory (LSTM) network that captures the complex temporal dependencies within the soil moisture time series. Validation on in situ networks demonstrates that ST-GapFill successfully reconstructs soil moisture dynamics with preserved diurnal-phase fluctuations, achieving a correlation coefficient of 0.91 with ground truth under low missing-rate conditions (<50 %). Comparative analysis reveals ST-GapFill's statistically superior performance (RMSE reduction: 27.0 % vs IDW, 67.8 % vs ARIMA). This method establishes a robust spatiotemporal imputation paradigm for environmental sensor networks, effectively bridging observation gaps to support precision agriculture and climate change impact assessments.
Status: open (until 05 Oct 2025)
RC1: 'Comment on egusphere-2025-1900. First review, M.', Mikhail Sarafanov, 31 Aug 2025
General comments
In their article “Filling Data Gaps in Soil Moisture Monitoring Networks via Integrating Spatio-temporal Contextual Information,” the authors explore approaches to filling gaps in soil moisture data. The article proposes a new method based on statistical calculations and machine learning techniques. The results are tested at a single geographic location. The error of the developed algorithm is compared with alternative solutions. The strong points include clear explanations of the proposed algorithm, a fairly good description of the experiments, and well-reasoned conclusions confirmed by experiments. Nevertheless, it is recommended to improve the structure of this scientific paper, refine the literature review, and consider adding an additional simple method for filling in the gaps and comparing its error with the current results.
I would like to note my wish to add the source code to this article in the form of an open repository. I understand that in some cases, adding code is impossible due to various agreements and grant conditions. However, I kindly ask my colleagues, if possible, to take the time to create a repository with the model and experiment code for the article. Let's take steps to overcome the reproducibility issue in modern research.
Specific comments
Line 35: “statistical interpolation and methods based on artificial intelligence.”
- I think it makes sense to consider the classification in more detail, because this division is quite general, and I did not see any further justification for this particular separation in the text.
For example, further in the text, “The artificial intelligence-based approach performs spatio-temporal modeling by capturing the complex non-linear relationship” refers to the specificity of the category “artificial intelligence methods”: nonlinearity. However, k-nearest neighbor interpolation, which the authors included in the first category, also allows modeling nonlinear relationships. Firstly, I believe that the division into “statistical interpolation and methods based on artificial intelligence” is unnecessary in the context of this article. Secondly, I suggest reworking this section and providing a more comprehensive and detailed analysis of the solutions. For the sake of systematization, it would be useful to create a table comparing the methods.
Line 110: Methodology section
- I suggest changing the order of the section: first, discuss Data pre-processing, then Correlation calculation, and then Long short-term memory. This will make the narrative more coherent: from data to the correlation analysis method, and then conclude with an explanation of the final model. At the same time, I suggest paying special attention here to explaining why this particular architecture of the final algorithm was proposed (clearly specify in the text what each individual block is responsible for), while the explanation of how the LSTM architecture of neural networks works is not an important part of the narrative. It will be enough to provide a link and not focus on this.
Line 235: To evaluate the model performance, the full dataset was randomly split into training (80%) and testing (20%) sets using train_test_split from the Scikit-learn library.
- This can be kept as it is, but I think not every researcher in the field of geosciences knows Python programming, so it might be useful to explain this step in plain text. And since we are talking about using Python here, I would be happy to take a look at the source code of the experiments. I suggest that the authors make their model available as an open-source solution and create a repository on GitHub (if it is legally possible).
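To illustrate what such a random 80/20 split does in plain terms, here is a minimal sketch using only the Python standard library. It is not the authors' code: the function name `random_split`, the seed, and the toy readings are assumptions made for illustration. Scikit-learn's `train_test_split` performs a comparable shuffle-and-partition step by default.

```python
import random


def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle record positions and split them into training and testing subsets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_test = round(len(records) * test_fraction)
    test_idx = set(indices[:n_test])
    # Keep the original ordering inside each subset
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test


# Hypothetical toy data: ten soil moisture readings (volumetric fraction)
readings = [0.21, 0.22, 0.25, 0.24, 0.23, 0.27, 0.30, 0.29, 0.28, 0.26]
train, test = random_split(readings)
print(len(train), len(test))  # 8 2
```

Every record ends up in exactly one of the two subsets, and which records go to the test set is decided purely by the shuffle, not by time or station.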
Line 250: The smaller the RMSE and MAE are, the higher the accuracy
- It is better to avoid using the term “accuracy,” which has a very specific meaning in machine learning. It is better to say “the smaller the error, the better the model.”
Line 300: Figure 7: Correlation between sites, showing high correlations among nearby small-scale sites and varying correlations among larger sites.
- Since the article discusses “Spatio-temporal Contextual Information,” it would be useful to include a map (there is space on the right) to show the location of these stations. You can even take a specific station and show its neighboring stations in color depending on the correlation coefficient used: dark blue dots if the correlation coefficient is weaker, and yellow if it is stronger, just like on the matrix.
Line 340: Figure 10: Comparison of coefficient of determination (R2) under different missing data types. Higher R2 values indicate better agreement between predicted and observed soil moisture.
- Please give an example of how R2 is calculated; the formulas for MAE and RMSE, for example, are given above. It may be useful to calculate adjusted R2 and visualize it, instead of the regular coefficient of determination.
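For reference, the textbook definitions are R2 = 1 - SS_res / SS_tot and adjusted R2 = 1 - (1 - R2)(n - 1) / (n - p - 1), where n is the number of samples and p the number of predictors. The sketch below implements these standard formulas in plain Python. It is an illustration, not the computation used in the manuscript; the function names and the sample values are assumptions.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total sum of squares
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R2, which penalizes the number of predictors."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical observed and predicted soil moisture values
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(observed, predicted)
print(r2, adjusted_r_squared(r2, n_samples=len(observed), n_predictors=1))
```

The adjusted value is never larger than the regular R2 when at least one predictor is used, which is why it can be a fairer basis for comparing models with different numbers of inputs.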
Line 420: 5 Conclusion, Figure 11
- Thank you for the clear visualization of the modeling results. Looking at the graphs, it seems to me that simple methods, such as the LOCF (last observation carried forward) method or, if the experimental setup allows, linear interpolation of the time series, could perform just as well as the baseline approaches considered here (for example ARIMA or SVR). I understand that predictive models based on previous values cannot “look” into the future. However, from the problem statement, I do not see any restriction that would prevent information from both before and after the gap from being used to fill it. In any case, could you please add LOCF here for comparison?
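The two simple baselines mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: gaps are encoded as `None`, the series is regularly sampled, and the function names and toy values are hypothetical rather than taken from the manuscript.

```python
def locf(series):
    """Last observation carried forward: fill each gap with the latest valid value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # leading gaps stay None (nothing to carry forward)
    return filled


def linear_interpolate(series):
    """Fill interior gaps on a straight line between the neighboring valid values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # leading and trailing gaps stay None


# Hypothetical soil moisture series with gaps
series = [0.20, None, None, 0.26, None, 0.30]
print(locf(series))
print(linear_interpolate(series))
```

LOCF uses only information before the gap, whereas linear interpolation also uses the first valid value after it, which matches the distinction drawn in the comment above.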
Technical corrections
Line 190: Figure 3
- The markings (dots) are hard to see. You could use regular dots instead of “stars”. And also you could show the L, M and S with color to make it easier to distinguish.
Line 230: Figure 4
- Please add names to types 1-4 in the image, so it will be easier to connect the visualization to the text. For example “(a) Type1 - completely random missing”, etc.
Line 245: The performance evaluation metrics include Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The calculation method is as follows: ...
- Please indicate what y means in the formulas and what y with a hat means. I know most readers will understand, but it is better to be clear.
Line 285: Figure 6
- It may make sense to move the footnotes with gap indicators to the top so that they do not overlap with the dates on the X-axis.
Line 340: Figure 10: Comparison of coefficient of determination (R2) under different missing data types. Higher R2 values indicate better agreement between predicted and observed soil moisture.
- In this context, I suggest using “better consistency” instead of “better agreement”.
Citation: https://doi.org/10.5194/egusphere-2025-1900-RC1
Viewed
- HTML: 291
- PDF: 43
- XML: 12
- Total: 346
- BibTeX: 5
- EndNote: 22