Filling Data Gaps in Soil Moisture Monitoring Networks via Integrating Spatio-temporal Contextual Information

Wang, Weixuan; Meng, Yizhuo; Wei, Zushuai; Miao, Linguang; Wang, Hui; Zhang, Wen

doi:10.5194/egusphere-2025-1900

Preprints

https://doi.org/10.5194/egusphere-2025-1900

10 Jun 2025

Abstract. As critical inputs for global climate studies, watershed hydrologic modeling, and satellite soil moisture product validation, in situ soil moisture measurements are frequently compromised by sensor-derived data gaps that disrupt hydrological continuity. To overcome this challenge, we develop ST-GapFill, a novel spatiotemporal reconstruction framework integrating multi-source contextual information through two key innovations: (1) spatial correlation-guided neighbor selection that identifies optimal auxiliary stations; and (2) a long short-term memory (LSTM) network that captures the complex temporal dependencies within the soil moisture time series. Validation on in situ networks demonstrates that ST-GapFill successfully reconstructs soil moisture dynamics with preserved diurnal-phase fluctuations, achieving a correlation coefficient of 0.91 with ground truth under low missing-rate conditions (<50 %). Comparative analysis reveals ST-GapFill's statistically superior performance (RMSE reduction: 27.0 % vs IDW, 67.8 % vs ARIMA). This method establishes a robust spatiotemporal imputation paradigm for environmental sensor networks, effectively bridging observation gaps to support precision agriculture and climate change impact assessments.

Received: 22 Apr 2025 – Discussion started: 10 Jun 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-1900. First review, M.', Mikhail Sarafanov, 31 Aug 2025

General comments

In their article “Filling Data Gaps in Soil Moisture Monitoring Networks via Integrating Spatio-temporal Contextual Information,” the authors explore approaches to filling gaps in soil moisture data. The article proposes a new method based on statistical calculations and machine learning techniques. The results are tested at a single geographic location. The error of the developed algorithm is compared with alternative solutions. The strong points include clear explanations of the proposed algorithm, a fairly good description of the experiments, and well-reasoned conclusions confirmed by experiments.
Nevertheless, it is recommended to improve the structure of this scientific paper, refine the literature review, and consider adding an additional simple method for filling in the gaps and comparing its error with the current results.

I would like to note my wish to add the source code to this article in the form of an open repository. I understand that in some cases, adding code is impossible due to various agreements and grant conditions. However, I kindly ask my colleagues, if possible, to take the time to create a repository with the model and experiment code for the article. Let's take steps to overcome the reproducibility issue in modern research.

Specific comments
Line 35: “statistical interpolation and methods based on artificial intelligence.”

- I think it makes sense to consider the classification in more detail, because this division is quite general, and I did not see any further justification for this particular separation in the text.

For example, further in the text, “The artificial intelligence-based approach performs spatio-temporal modeling by capturing the complex non-linear relationship” refers to the specificity of the category “artificial intelligence methods”—nonlinearity. However, k-nearest neighbor interpolation, which the authors included in the first category, allows modeling nonlinear relationships. Firstly, I believe that the division into “statistical interpolation and methods based on artificial intelligence” is unnecessary in the context of this article. Secondly, I suggest reworking this section and providing a more comprehensive and detailed analysis of the solutions. For the sake of systematization, it would be useful to create a table comparing the methods.

Line 110: Methodology section

- I suggest changing the order of the section: first, discuss Data pre-processing, then Correlation calculation, and then Long short-term memory. This will make the narrative more coherent: from data to the correlation analysis method, and then conclude with an explanation of the final model. At the same time, I suggest paying special attention here to explaining why this particular architecture of the final algorithm was proposed (clearly specify in the text what each individual block is responsible for), while the explanation of how the LSTM architecture of neural networks works is not an important part of the narrative. It will be enough to provide a link and not focus on this.

Line 235: To evaluate the model performance, the full dataset was randomly split into training (80%) and testing (20%) sets using train_test_split from the Scikit-learn library.

- This can be kept as it is, but I think not every researcher in the field of geosciences knows Python programming so it might be useful to explain this step in plain text. And since we are talking about using Python here, I would be happy to take a look at the source code of the experiments. I suggest that the authors make their model available as an open-source solution and create a repository on GitHub (if it is legally possible).
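
In plain terms, a random 80/20 split shuffles the samples and sets one fifth of them aside for testing. The sketch below illustrates that idea with the Python standard library only; it is neither the authors' code nor the Scikit-learn implementation, and the function name `random_split`, the seed, and the toy samples are assumptions made for illustration.

```python
import random

def random_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `samples` and return (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled samples are held out for testing.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical samples: (input window, target value) pairs.
samples = [([i, i + 1, i + 2], i + 3) for i in range(100)]
train, test = random_split(samples)
```

With 100 samples this leaves 80 for training and 20 for testing, and every sample lands in exactly one of the two sets.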

Line 250: The smaller the RMSE and MAE are, the higher the accuracy

- It is better to avoid using the term “accuracy,” which has a very specific meaning in machine learning. It is better to say “the smaller the error, the better the model.”

Line 300: Figure 7: Correlation between sites, showing high correlations among nearby small-scale sites and varying correlations among larger sites.

- Since the article discusses “Spatio-temporal Contextual Information,” it would be useful to include a map (there is space on the right) to show the location of these stations. You can even take a specific station and show its neighboring stations in color depending on the correlation coefficient used: dark blue dots if the correlation coefficient is weaker, and yellow if it is stronger, just like on the matrix.

Line 340: Figure 10: Comparison of coefficient of determination (R2 ) under different missing data types. Higher R2 values indicate better agreement between predicted and observed soil moisture.

- Please give an example of how R2 is calculated; the formulas for MAE and RMSE, for example, are given above. It may be useful to calculate adjusted R2 and visualize it, instead of the regular coefficient of determination.
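
As an illustration of the request above, the following minimal sketch computes the regular and the adjusted coefficient of determination using their standard textbook definitions. The function names, the sample values, and the number of predictors passed to `adjusted_r2` are assumptions; the manuscript may define or compute R2 differently.

```python
def r_squared(observed, predicted):
    """R2 = 1 - SS_res / SS_tot (standard definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_predictors):
    """Penalise R2 for the number of predictors used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical observed and gap-filled soil moisture values (m3/m3).
observed = [0.20, 0.22, 0.25, 0.24, 0.21]
predicted = [0.21, 0.22, 0.24, 0.25, 0.20]
r2 = r_squared(observed, predicted)
# n_predictors=2 is an arbitrary illustrative choice.
r2_adj = adjusted_r2(r2, n_samples=len(observed), n_predictors=2)
```

Adjusted R2 is never larger than R2 for a model with at least one predictor, so the two only differ noticeably when the sample is small relative to the number of inputs.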

Line 420: 5 Conclusion, Figure 11

- Thank you for the clear visualization of the modeling results. Looking at the graphs, it seems to me that simple methods, such as the LOCF (last observation carried forward) method or, if the experimental setup allows, linear interpolation of the time series, could perform just as well as the baseline approaches considered here (for example ARIMA or SVR). I understand that predictive models based on previous values cannot “look” into the future. However, from the problem statement, I do not see any restrictions on why information before and after the gap cannot be used to fill it. In any case, could you please add LOCF here for comparison?
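
For reference, the two simple baselines mentioned in this comment can be sketched as follows. This is a minimal illustration, assuming a regularly sampled series in which gaps are marked with `None`; the function names and the example values are hypothetical and not taken from the manuscript.

```python
def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def linear_interpolate(values):
    """Fill interior gaps on a straight line between observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of observed positions.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical soil moisture series with a two-step gap.
series = [0.30, None, None, 0.24]
filled_locf = locf(series)
filled_linear = linear_interpolate(series)
```

LOCF uses only the value before the gap, whereas linear interpolation also uses the value after it, which is the distinction the comment draws.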

Technical corrections

Line 190: Figure 3

- The markings (dots) are hard to see. You could use regular dots instead of “stars”. And also you could show the L, M and S with color to make it easier to distinguish.

Line 230: Figure 4

- Please add names to types 1-4 in the image, so it will be easier to connect the visualization to the text. For example “(a) Type1 - completely random missing”, etc.

Line 245: The performance evaluation metrics include Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The calculation method is as follows: ...

- Please indicate what y means in the formulas and what y with a hat means. I know most readers will understand, but it is better to be clear.
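
For clarity, the standard definitions of the two metrics are given below. The notation is an assumption and may differ from the manuscript: $y_i$ denotes the observed (ground-truth) soil moisture at sample $i$, $\hat{y}_i$ the corresponding filled or predicted value, and $n$ the number of evaluated samples.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```
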
Line 285: Figure 6

- It may make sense to move the footnotes with gap indicators to the top so that they do not overlap with the dates on the X-axis.

Line 340: Figure 10: Comparison of coefficient of determination (R2) under different missing data types. Higher R2 values indicate better agreement between predicted and observed soil moisture.

- In this context, I suggest using “better consistency” instead of “better agreement”

Citation: https://doi.org/10.5194/egusphere-2025-1900-RC1
- AC1: 'Reply on RC1', Weixuan Wang, 28 Sep 2025
  
  The comment was uploaded in the form of a supplement:
  
  Citation: https://doi.org/10.5194/egusphere-2025-1900-AC1
CC1:
'Comment on egusphere-2025-1900', Huizhen Cui, 21 Oct 2025
The manuscript presents a novel spatiotemporal reconstruction method called ST-GapFill, which combines spatially-optimized neighbor selection with an LSTM model to effectively capture both spatial and temporal dependencies for reconstructing missing data in soil moisture (SM) sensor networks. The results demonstrate ST-GapFill outperforms the traditional SVR and ARIMA models, particularly for block missing data (NMR). The results provide a reference for supplementing long-term soil moisture observation network data. However, the manuscript could be improved by enhancing the clarity of its structure, the quality of the figures, and the precision of the description. Several scientific or presentation issues need to be addressed.
General Comments:
The Introduction currently presents gap-filling methods directly. It would benefit from first providing a comprehensive overview of standard methodologies for handling missing data in soil moisture observation networks, thereby creating a smoother transition to the presented methodology. Furthermore, the literature review is comprehensive but could be more focused. The existing literature review in the Introduction needs to be better synthesized to refine the logical connections and significance of each reference.

It is suggested that the article structure be slightly reorganized by presenting the gap-filling results first, followed by the result analysis. This arrangement would improve the logical flow and enhance reader comprehension.

There is a critical contradiction in the description of ST-GapFill's performance. The text states: "For other types of missing patterns, ST-GapFill performs best" (Line ~341), but later concludes "ST-GapFill is not applicable to long interval time series missing patterns of Type 2" (Line ~353). This creates confusion. The results in Figure 9(b) clearly show that IDW outperforms ST-GapFill for Type 2 (MR at the same sensor). The authors should rephrase this to accurately reflect the findings: ST-GapFill is the best-performing model for Type 1, 3, and 4, but for Type 2 (consecutive gaps at a single sensor), the spatial interpolation of IDW is more robust. Please provide clear conclusions based on a comprehensive analysis of both the gap-filling results and algorithm performance.

In Section 4.2, how is the model's performance evaluated specifically on the filled gaps in Figure 8? The text states that sites with real missing values (such as L7 and M8) were used. It must be clarified that the MAE and RMSE are calculated only at the timestamps where data were originally missing and have now been filled, comparing each filled value to the held-out "ground truth."
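
The evaluation described in this comment can be sketched as below: errors are computed only at positions flagged as gaps, so values that were never missing do not influence the scores. This is a minimal illustration under stated assumptions, not the authors' procedure; the function name, the boolean mask, and the sample values are hypothetical.

```python
import math

def gap_only_errors(truth, filled, was_missing):
    """Return (MAE, RMSE) computed only where `was_missing` is True."""
    pairs = [(t, f) for t, f, m in zip(truth, filled, was_missing) if m]
    if not pairs:
        raise ValueError("no gap positions to evaluate")
    mae = sum(abs(t - f) for t, f in pairs) / len(pairs)
    rmse = math.sqrt(sum((t - f) ** 2 for t, f in pairs) / len(pairs))
    return mae, rmse

# Hypothetical held-out truth, gap-filled series, and gap mask.
truth = [0.20, 0.22, 0.25, 0.24]
filled = [0.20, 0.23, 0.25, 0.22]
was_missing = [False, True, False, True]
mae, rmse = gap_only_errors(truth, filled, was_missing)
```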

Specific Comments:
Section 2.2: The rationale for the correlation length scale L = 50 km is based on the network's grid size, which is valid. However, a brief mention of whether this parameter was optimized or validated through sensitivity analysis would strengthen the methodological rigor.

The stations in Figure 3 are too dense to see clearly. It is recommended to use distinct markers for the three station types (S, M, L) and present them in separate subplots for clarity.

Line 194 “A method of taking averages was used to resample all data uniformly for 30 minutes.” What is the specific reason for resampling the data to a 30-minute interval, as opposed to simply using the native half-hourly observations?

The data in Figure 6 is too densely plotted, which obscures the specific periods of missing data. It is recommended to split the figure into three subplots, each corresponding to the missing data periods T1, T2, and T3, to improve legibility.

Line 238: "re-peats" should be "repeats".

Line 295: The claim that highly correlated sites have "similar environmental factors" should be expanded slightly. What are these factors? (e.g., soil type, land cover, topography). This would add depth to the spatial analysis.

Line 314-313: “the optimal window sizes of 250 for ARIMA, 50 for ST-GapFill, and 50 for SVR were finally obtained, respectively.” Could you clarify why the uniform sliding window size is set to 100 at Line 302?

Line 364-369: When comparing ST-GapFill's performance to other studies (e.g., Chen et al., 2020; Moreno-Martinez et al., 2020), it would be beneficial to provide a direct quantitative or qualitative comparison, if possible, rather than just stating that those methods have "limitations."

Lines 370-372: Since 0.038 is greater than 0.03, please revise the way this is stated to ensure an accurate representation.

The color contrast in the legend of Figure 11's subplots is not strong enough, making it difficult to see. Also, please label the missing data pattern in each subplot for easier understanding.
Citation: https://doi.org/10.5194/egusphere-2025-1900-CC1
- AC2: 'Reply on CC1', Weixuan Wang, 31 Oct 2025
  
  The comment was uploaded in the form of a supplement:
  
  Citation: https://doi.org/10.5194/egusphere-2025-1900-AC2
RC2:
'Comment on egusphere-2025-1900', Anonymous Referee #2, 01 Mar 2026

The authors presented a hybrid statistical/machine learning method for filling the gaps between soil moisture observations (four types of missing data) of an existing in-situ soil moisture monitoring network. The authors demonstrated that their approach outperforms some widely used methods for interpolating soil moisture values, and their work can be helpful to operators of similar SM monitoring networks. Overall, the manuscript is well written and has enough results and discussion of the used approach, and a sufficient description of the challenges and limitations. Overall, I strongly suggest an additional round of proofreading of the manuscript to correct grammatical and writing issues (not a lot, but they do exist). I also encourage the authors to improve the quality of the figures by refining the figure legends and axis labels, as they are very difficult to read.
Here are some detailed comments:

L25 indicator of global climate change (needs to be backed by references).

L30 such missing not only (what is missing).

L50 what is MC?

L95 Completely Random Missing (MCR), the acronym seems out of order? Also, Non-Random Block Missing (NMR)?

L148 there has been no mention of the involvement of rainfall observations previously; is that also collected at each site? Please describe the type of stations and data collected, and how you determine which variables are relevant for your application. (e.g., why not use other soil moisture observations from other depths if not missing).

L154 should be mentioned earlier. Please describe if there are other variables collected.

L190 SM and PP at 3 cm, I guess it’s PP and SM at 3cm depth?

L202 need to describe what p, d, and q are.

L235 Please add reference and description of things like Scikit-learn library, Keras, etc.

L352 Paragraph needs English language revision (also throughout the manuscript), e.g., two back-to-back sentences start with however.

Citation: https://doi.org/10.5194/egusphere-2025-1900-RC2
- AC3: 'Reply on RC2', Weixuan Wang, 07 Mar 2026
  
  The comment was uploaded in the form of a supplement:
  
  Citation: https://doi.org/10.5194/egusphere-2025-1900-AC3

Viewed

Total article views: 1,164 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
922	206	36	1,164	21	45

Views and downloads (calculated since 10 Jun 2025)

Month	HTML	PDF	XML	Total
Jun 2025	92	20	7	119
Jul 2025	41	7	3	51
Aug 2025	106	10	0	116
Sep 2025	385	17	6	408
Oct 2025	48	11	4	63
Nov 2025	50	26	3	79
Dec 2025	42	24	4	70
Jan 2026	30	34	2	66
Feb 2026	29	23	1	53
Mar 2026	83	29	4	116
Apr 2026	16	5	2	23

Viewed (geographical distribution)

Total article views: 1,147 (including HTML, PDF, and XML) Thereof 1,147 with geography defined and 0 with unknown origin.

Latest update: 11 Apr 2026

Short summary

Soil moisture data is vital for climate studies and agriculture, but sensors often have gaps that disrupt data continuity. To address this, we developed ST-GapFill, a new framework that uses information from nearby stations and a special tool to fill in missing data. By selecting the best neighboring stations and capturing how soil moisture changes over time, ST-GapFill can accurately reconstruct soil moisture patterns.
