the Creative Commons Attribution 4.0 License.
the Creative Commons Attribution 4.0 License.
Hydrochemistry and modeling nitrate concentration in farmland groundwater under different hydrological seasons by integrating hybrid quantum-classical ML, virtual sample generation and AlphaEarth Foundation
Abstract. Precise seasonal prediction of groundwater nitrate concentrations in intensive agricultural areas faces challenges such as data sparsity, strong spatiotemporal heterogeneity, and complex hydro-biogeochemical processes. To address these issues, this study proposes an integrated prediction framework combining hybrid quantum-classical machine learning, advanced virtual sample generation (t-SNE-GMM-KNN), and remote sensing foundation model semantic embedding (AEF). Modeling was conducted across the 2022–2023 normal, dry, and wet seasons in Xiong'an New Area. Hydrochemical types were dominated by Ca-Mg-HCO3−, controlled by mineral dissolution and evaporation. Nitrate concentrations were highest in the dry season (mean 42.93 mg L−1), driven by evaporative concentration. Spatially, high-value zones shifted: southeast (normal), central (dry), and northwest (wet). MixSIAR modeling based on isotopes indicated domestic sewage and livestock manure (74.1 %) as dominant sources. The t-SNE-GMM-KNN strategy mitigated small-sample bias while preserving nonlinear structure. When virtual samples were augmented to 10-fold, the Random Forest R2 in the dry season increased from 0.284 to > 0.85. Furthermore, a hybrid quantum-classical Random Forest exhibited superior robustness for data sparsity, achieving peak performance in the normal season (R2 = 0.962, RMSE = 5.73 mg L−1). Additionally, using only AEF embeddings achieved screening-level accuracy (R2 up to 0.860), providing a feasible rapid survey scheme for extensive unmonitored regions. Correlation analysis identified TDS and EC as persistent top predictors (r > 0.8). This comprehensive framework offers a robust solution for seasonal nitrate prediction and sustainable water management.
- Preprint
(6039 KB) - Metadata XML
-
Supplement
(73 KB) - BibTeX
- EndNote
Status: final response (author comments only)
-
CC1: 'Comment on egusphere-2026-272', Nima Zafarmomen, 19 Feb 2026
- AC5: 'Reply on CC1', Junjie Xu, 31 Mar 2026
-
RC1: 'Comment on egusphere-2026-272', Anonymous Referee #1, 27 Feb 2026
Reviewer Comments
General Assessment
The manuscript addresses an important geoscientific problem and presents potentially valuable findings. However, several methodological, structural, and interpretative issues limit the clarity, reproducibility, and robustness of the conclusions. Substantial revision is required before the manuscript can be considered for publication. The topic is relevant and potentially publishable but the manuscript requires major revision to improve its quality.
Major Comments
- Page 2, Lines 10–15
The introduction provides general background but does not clearly articulate the precise research gap. While prior studies are cited, it remains unclear What specific limitation of previous work is being addressed? Whether this study offers methodological novelty or merely a regional application? How the proposed approach differs from existing frameworks?
A clearer paragraph explicitly stating identified gap, the proposed advancement, and the expected contribution is necessary.
- Page 3, Lines 5–30
The study area description is descriptive but lacks quantitative justification. For example, No map scale or resolution information is provided. Climatic or hydrological statistics are not sufficiently summarized. The selection rationale for this site is weak.
I suggest authors will include a detailed map with coordinate system and scale. A table summarizing key environmental variables. A clear explanation of why this site is scientifically significant.
- Page 4, Lines 12
The manuscript does not adequately describe Data temporal resolution. Data completeness. Missing data handling procedures. Quality control protocols.
If interpolation or smoothing was applied, it must be explicitly stated. Reproducibility requires transparent preprocessing documentation.
- Page 5–7
The methodological section lacks A clear workflow diagram. Parameter selection criteria. Hyperparameter tuning explanation (if applicable). Justification of threshold values used.
If statistical or machine learning methods are employed, the following must be specified Training/testing split strategy. Cross-validation method. Performance metrics definitions. Software and version.
Currently, the method description is too general for replication.
- Page 8–10
Model validation appears limited. The manuscript does not report Confidence intervals. Statistical significance testing. Sensitivity analysis. Uncertainty quantification.
Without uncertainty analysis, the robustness of conclusions is questionable.
I suggest author will include Bootstrap or cross-validation uncertainty. Sensitivity analysis of key parameters. Error propagation discussion.
- Several figures suffer from:
- Small axis labels.
- Low resolution.
- Lack of units.
- No uncertainty shading.
Figures must be self-explanatory. Captions should fully describe the content without requiring readers to consult the main text.
- Page 11–13
The discussion includes causal interpretations that are not fully supported by the presented data. Correlation-based findings are occasionally interpreted mechanistically without experimental evidence.
Please revise the language to avoid overstatement and clearly distinguish between Observed correlation, Hypothesized mechanism and Established causality
- Page 14
The limitations are acknowledged but superficially. Consider expanding discussion on:
- Spatial representativeness.
- Temporal coverage limitations.
- Potential bias sources.
- Model generalizability.
A more critical self-assessment would strengthen the manuscript.
- Several equations lack complete symbol explanations immediately below the formula. All variables should be defined clearly with units.
- Some terms are used interchangeably without clarification. Please standardize terminology throughout the manuscript.
- Ensure consistent SI unit formatting. Check spacing around mathematical symbols. Italicize variables in equations.
- Recent studies from the last 2–3 years appear underrepresented. Please ensure the manuscript reflects current state-of-the-art research.
Minor Comments
- Several grammatical errors require professional English editing.
- Some references appear inconsistently formatted.
- Figure numbering and cross-referencing should be double-checked.
Citation: https://doi.org/10.5194/egusphere-2026-272-RC1 - AC3: 'Reply on RC1', Junjie Xu, 31 Mar 2026
-
RC2: 'Comment on egusphere-2026-272', Anonymous Referee #2, 12 Mar 2026
The manuscript proposes a framework for predicting nitrate concentrations in the groundwater system. In this research several novel methods such as virtual sample generation and hybrid quantum-classical Random Forest, and novel data sources (AlphaEarth Foundation) are combined with classical water quality sampling including isotopes analysis. The main novelty presented in the paper is the use of virtual sample generation to increase the Random Forest efficiency for predicting nitrate concentrations in the groundwater.
Although the general idea of the manuscript is interesting the current manuscript lacks details to fully understand the virtual sampling and random forest methods that are applied:
- The precise methodology of the virtual sample generation (t-SNE-GMM-KNN) is unclear to me. Are virtual samples added in space or also in time? The virtual samples are used in the RF method, how are the inputs variables and corresponding nitrate concentrations determined?
- Which inputs are used for the RF models? Are the nitrate concentrations modelled based on the other water quality parameters and the Alpha Earth Foundation data. If this is the case, isn’t the prediction of nitrate concentration merely a correlation with the remaining water quality parameters? In practice it would be strange to measure all these water quality parameters (of which some are quite costly) to predict the nitrate concentration. I do understand that this analysis is interesting to gain insight into the functioning of the nitrate dynamics in your groundwater system.
More information on the methodology of the virtual sample generation and RF models is required in my opinion. Additionally, the results are mainly validated based on boxplots of observed and predicted nitrate concentrations. A more thorough validation of the predictions should be added.
The manuscript would benefit from a clearer goal, explain in the manuscript why what you did is relevant and how results can be used.
The results section has a paragraph named ‘3.3 Bayesian model analysis and correlation analysis’. These methods are, however, discussed too briefly in the methodology part.
I have a few conceptual questions about the work:
- The analysis was done on a small-scale groundwater system (basically very large field scale). Within the area the groundwater system likely has very similar physical characteristics. Wouldn’t it be better to apply the method on a larger scale with more physical differences?
- In the manuscript (on page 8) it is mentioned that the sampling was done on irrigation wells, could this have an impact on the winter-summer differences?
- I thought that the large unsaturated zone (>10m?) would limit the seasonal response of both recharge and nitrate leaching in the groundwater system. However, the presented water quality measurements show otherwise. Maybe other external factors: groundwater abstraction - for example for irrigation -, groundwater dynamics on a regional scale... could have an impact as well?
Minor comments:
- In the introduction you mention exceedance of the nitrate rate. Exceedance of what? The local nitrate limits? Line 48 and 55 seem to contain very similar information.
- The information on the maps of fig 7 is interesting, however the maps are not very readable. Please use the same range for the 3 maps. How can the spatial differences in the nitrate concentration maps be explained?
- On page 23 line 552-553 you mention the groundwater depth becomes shallower in the dry season, this is counter intuitive. How can you explain
Citation: https://doi.org/10.5194/egusphere-2026-272-RC2 - AC4: 'Reply on RC2', Junjie Xu, 31 Mar 2026
-
AC1: 'Reply on RC1', Junjie Xu, 31 Mar 2026
Responses to RC1
The manuscript addresses an important geoscientific problem and presents potentially valuable findings. However, several methodological, structural, and interpretative issues limit the clarity, reproducibility, and robustness of the conclusions. Substantial revision is required before the manuscript can be considered for publication. The topic is relevant and potentially publishable but the manuscript requires major revision to improve its quality.
- Page 2, Lines 10–15. The introduction provides general background but does not clearly articulate the precise research gap. While prior studies are cited, it remains unclear What specific limitation of previous work is being addressed? Whether this study offers methodological novelty or merely a regional application? How the proposed approach differs from existing frameworks?
A clearer paragraph explicitly stating identified gap, the proposed advancement, and the expected contribution is necessary.
We sincerely appreciate this insightful comment. The reviewer is correct that the original Introduction lacked explicit articulation of specific research gaps and methodological distinctions. We have thoroughly revised the Introduction to address these concerns:
- Explicit articulation of research gaps (What specific limitation is addressed):
We have added a new paragraph that explicitly identifies three critical gaps in previous work:
The data augmentation paradox: Existing virtual sample methods (GMM, VAEs, GANs) either fail to preserve non-linear geochemical manifold structures or require large training sets that are precisely lacking in environmental monitoring (Farnia et al., 2023; Zhou et al., 2025);
The unexplored potential of remote sensing embeddings: While AEF has been applied to surface parameters, its utility for subsurface groundwater quality prediction across hydrological seasons remains untested (Tollefson, 2025; Li et al., 2025);
The unvalidated quantum-classical hybridization: Although QML offers theoretical advantages for complex pattern recognition, its practical efficacy in small-sample environmental modeling has not been systematically evaluated (Hong et al., 2025; Oliveira et al., 2025).
- Clarification of methodological novelty vs. regional application:
We have substantially rewritten the study objectives paragraph to explicitly distinguish methodological innovation from regional case study. The revised text clearly states that this study "develops a methodologically novel integrated framework that transcends mere regional application" and details three specific methodological advancements: (i) the t-SNE-GMM-KNN strategy for structure-preserving data augmentation; (ii) the hybrid quantum-classical RF architecture utilizing analytical quantum feature encoding; and (iii) the first application of AEF embeddings for groundwater nitrate screening.
- Distinction from existing frameworks (How the approach differs):
We explicitly contrast our approach with existing frameworks in the revised objectives paragraph: "Unlike previous studies that treat data augmentation, model architecture, and input features independently, this framework provides a unified, scalable solution that integrates virtual sample generation, quantum-enhanced machine learning, and remote sensing semantic features to simultaneously address data sparsity, seasonal heterogeneity, and feature extraction challenges."
Thank you for this valuable suggestion, which has significantly improved the clarity and rigor of our work.
The specific modifications are as follows:
- Introduction
Nitrate (NO3-) contamination in groundwater poses a serious threat to drinking water safety and ecosystem health, particularly in intensively managed agricultural regions (Wang et al., 2021). In China, groundwater nitrate pollution is a growing concern, national monitoring data from 2013 to 2017 revealed a nitrate exceedance rate exceeding 10% relative to the Class III limit (88 mg L-1 as NO3-) of the Chinese Groundwater Quality Standard (GB/T 14848-2017), with Hebei Province reporting an alarming rate of 31.66% in 2017 (Li et al., 2019). Over recent decades, escalating nitrate concentrations in surface and groundwater have been driven by intensified fertilizer use in agriculture, along with discharges of industrial and domestic wastewater (Zhang et al., 2018). Severe nitrate exceedances are especially prevalent in northern and northwestern China (Gu et al., 2013), where key contributors include domestic and industrial effluents, nitrification of soil organic nitrogen, and synthetic fertilizer application (Han et al., 2016). For instance, in the North China Plain, shallow groundwater nitrate exceedance rates range from 9.5% to 34.1% relative to the WHO drinking water guideline of 50 mg L-1 NO3-, and a rising trend persists at the regional scale, particularly in agricultural areas (Wang et al., 2018). In monsoonal temperate regions, seasonal shifts in precipitation, evapotranspiration, and groundwater recharge profoundly influence the transport, dilution, and accumulation of nitrate, leading to pronounced intra-annual variability in its concentration and spatial distribution (Gao et al., 2023; Zhu et al., 2025). Consequently, understanding and forecasting nitrate dynamics across hydrological seasons is essential for informed groundwater management and pollution mitigation, but remains a formidable challenge due to the nonlinearity, high dimensionality, and data scarcity inherent in such systems (Deng et al., 2023).
Traditional monitoring and modeling approaches face three critical limitations. First, field sampling campaigns though providing high-fidelity hydrochemical data are inherently sparse in space and time, especially for large-scale or rapidly changing environments (Viswanathan et al., 2022), which are time-consuming, labor-intensive, and costly, limiting the spatial and temporal coverage of data (Cai et al., 2025). Second, while process-based models incorporate physical mechanisms, they require extensive parameterization and are computationally prohibitive for dynamic, multi-season forecasting at farm-to-regional scales (Feng et al., 2022). Hydrological seasonal variations (normal, dry, and wet seasons) significantly influence the migration and transformation of nitrogen in the soil-groundwater system (Chen et al., 2025). For instance, concentrated rainfall during the wet season (accounting for 60%-80% of annual precipitation) can promote the leaching of surface nitrogen into groundwater, leading to a 25-fold increase in stream nitrate concentrations during storm events compared to baseflow (Sebestyen et al., 2014), meanwhile, intense evaporation in the dry season leads to the accumulation of nitrate in shallow aquifers, where concentrations can exceed the US EPA drinking water standard of 10 mg L-1 by 2-3 times (Liu et al., 2025; Cox et al., 2016). These seasonal differences result in distinct hydrochemical characteristics and nitrate concentration distributions, increasing the complexity of prediction models (Wu et al., 2025). Third, even advanced machine learning (ML) techniques such as Random Forest (RF), despite their robustness to nonlinearity and multicollinearity, still rely heavily on sufficient representative samples to capture the multi-modal distribution and tail behavior of environmental variables, particularly for heavy-tailed pollutants like NO3- (Luo et al., 2022). Moreover, the small sample sizes obtained from discrete sampling often lead to data sparsity and skewed distributions, reducing the model's generalization ability by 30%-50% when applied to unmonitored areas and compromising the robustness and generalization ability of machine learning (ML) models trained on such data (Thunyawatcharakul et al., 2025; Wang et al., 2024).
Despite these recognized limitations of conventional approaches, existing solutions remain fragmented and insufficient for seasonal groundwater nitrate prediction. Specifically, current virtual sample generation methods, whether statistical (e.g., SMOTE, GMM) or deep learning-based (e.g., VAEs, GANs), suffer from a critical paradox: they either fail to preserve the complex non-linear manifold structure of high-dimensional geochemical data or inherently require large training datasets that are unavailable in typical environmental monitoring scenarios (Farnia et al., 2023; Tung et al., 2023; Zhou et al., 2025). Meanwhile, while remote sensing foundation models such as Google's AlphaEarth Foundation (AEF) have demonstrated success in surface parameter estimation, their applicability to subsurface groundwater quality prediction, particularly for dynamic seasonal modeling, remains entirely untested (Tollefson, 2025; Li et al., 2025). Furthermore, although quantum machine learning (QML) theoretically offers advantages in capturing complex non-linear relationships through exponentially high-dimensional Hilbert space mapping, its practical efficacy in small-sample environmental datasets, especially when hybridized with classical ensemble methods, has not been systematically evaluated (Hong et al., 2025; Oliveira et al., 2025). Consequently, no existing framework simultaneously addresses the triple challenge of data sparsity, seasonal heterogeneity, and high-dimensional feature extraction for groundwater nitrate forecasting. To bridge these gaps, this study integrates three emerging methodological frontiers: advanced virtual sample generation, remote sensing foundation models, and quantum-enhanced machine learning. Recent advances in these domains offer partial solutions. Gaussian Mixture Models (GMM) and deep generative frameworks (e.g., VAEs, GANs) have shown promise in enriching training data, with GMM achieving an average similarity of 83.0% between unmixed chemical spectra and ground truth in geochemical analysis (Farnia et al., 2023; Tung et al., 2023), yet these approaches exhibit critical constraints: they often fail to preserve the non-linear manifold structure of high-dimensional geochemical space or require large training sets, precisely what is lacking (Zhou et al., 2025). Non-linear dimensionality reduction methods, such as t-SNE, excel at revealing latent clusters corresponding to distinct hydrological processes, with a classification accuracy of 92% for annual daily hydrograph clustering in mountainous watersheds, yet lack explicit generative mechanisms (Wang et al., 2025; Tang et al., 2022). Meanwhile, the rise of foundation models in Earth observation exemplified by Google’s AlphaEarth Foundation (AEF), offers unprecedented opportunities: its 64-dimensional semantic embeddings, derived from multi-sensor satellite time series (including Sentinel-2, Landsat, and Sentinel-1), implicitly encode land use, vegetation phenology, soil moisture, and anthropogenic footprints at 10 m resolution (Tollefson, 2025). These features have been successfully applied in land use classification and crop monitoring, but their potential for predicting groundwater nitrate concentrations, especially across different hydrological seasons remains underexplored (Li et al., 2025). Quantum machine learning (QML) further opens a new frontier. Parameterized Quantum Circuits (PQCs) can map classical inputs into exponentially high-dimensional quantum Hilbert spaces, generating entangled feature representations that reveal complex, non-linear patterns inaccessible to classical kernels (Hong et al., 2025). For ozone concentration forecasting, a hybrid QML model achieved an R2 of 94.12% for 1-hour forecasts and 75.62% for 6-hour forecasts, outperforming classical persistence models by a forecast skill of 31.01-57.46% (Oliveira et al., 2025). Recently, Saberian et al. (2026) developed HydroQuantum, a quantum-driven Python package implementing Quantum LSTM (QLSTM), hybrid quantum-classical LSTM, and Variational Quantum Circuits (VQC) for daily streamflow and stream water temperature simulations across the continental US. Their work demonstrated that quantum-enhanced architectures can effectively capture temporal dependencies in hydrological time series, particularly in snowmelt-driven and regulated basins, providing valuable methodological context for quantum-driven hydrological modeling. Crucially, analytical quantum feature extraction via Pauli-Z expectation values avoids the noisy sampling overhead of near-term quantum hardware, reducing computational latency by ~80% compared to sampling-based methods and making it viable for small-sample environmental modeling (Gujju et al., 2024; Oliveira et al., 2025).
Furthermore, identifying the sources and controlling factors of nitrate pollution is crucial for improving prediction accuracy and guiding targeted pollution control measures. Isotopic analysis (δ15N-NO3- and δ18O-NO3-) combined with the MixSIAR model has proven effective in quantitatively apportioning nitrate sources (Tian et al., 2025). Meanwhile, Bayesian models and SHapley Additive exPlanations (SHAP) analysis can reveal the key environmental variables driving nitrate concentration changes, enhancing the interpretability of prediction models (Alam et al., 2025). Despite these advancements, several gaps persist in the current research: (1) Few studies have integrated hybrid quantum-classical ML with virtual sample augmentation to address small-sample challenges in seasonal nitrate prediction; (2) The potential of AEF remote sensing semantic features for groundwater nitrate prediction remains untested, particularly in comparison with in-situ measured parameters; (3) The combined effects of hydrological seasonal variations, nitrate source apportionment, and key environmental drivers on prediction model performance require systematic investigation.
The North China Plain, an important agricultural production region in China, is characterized by high nitrogen input intensity and significant seasonal hydrological variations, making it an area prone to groundwater nitrate pollution (Liu et al., 2025; Hou et al., 2025). Conducting field-scale research on nitrate pollution in this region is of great significance for the protection of regional water resources. The Xiong’an New Area was specifically selected as the core study site due to its unique combination of national strategic ecological importance, representative hydrogeological conditions, and severe pollution characteristics that exemplify regional challenges. Quantitatively, Hebei Province reported a groundwater nitrate exceedance rate of 31.66% in 2017, significantly higher than the national average, with Xiong'an situated in the heart of this high-risk zone (Xiong et al., 2025). Furthermore, the study site represents a typical high-intensity agricultural system with annual nitrogen inputs ranging from 540 to 660 kg N ha-1, coupled with a shallow groundwater table (5.0-20.0 m) highly susceptible to surface loading (Xu et al., 2022). Moreover, existing investigations reveal severe anthropogenic impacts including legacy nitrogen accumulation (~320 kg N ha-1 a-1 surplus) and sewage irrigation contributing up to 58.3% of groundwater chemical signatures. This setting provides an ideal natural laboratory for testing seasonal prediction frameworks, as the distinct monsoonal climate (60%-80% precipitation concentrated in summer) creates pronounced hydrochemical contrasts between dry, wet, and normal seasons, directly challenging model robustness under data sparsity. (Xu et al., 2022). A mechanistic understanding of how nitrate concentrations vary across these hydrological seasons (normal, dry, wet) and their controlling factors is crucial for regional water resource management. To address the critical gaps outlined above, this study establishes three clear objectives: (1) to develop a t-SNE-GMM-KNN virtual sample generation strategy that preserves geochemical manifold structure while expanding small datasets for robust seasonal modeling; (2) to construct a hybrid quantum-classical Random Forest architecture that enhances feature discriminability without quantum hardware limitations, specifically targeting the high variability and data scarcity characteristic of dry-season nitrate prediction; and (3) to evaluate the feasibility of AlphaEarth Foundation (AEF) semantic embeddings as standalone predictors for rapid groundwater nitrate screening in unmonitored regions. The relevance of this research lies in its provision of a unified methodological framework that transcends single-site application, offering a scalable solution for seasonal groundwater quality prediction in intensively managed agricultural landscapes worldwide. Practically, the results enable three distinct operational modes: (i) mechanistic analysis using field-measurable parameters (pH, EC, TDS, etc.) to elucidate hydrochemical driving processes; (ii) rapid on-site nitrate estimation using portable multi-parameter meters when laboratory analysis is unavailable; and (iii) large-scale regional risk assessment using remote sensing embeddings for areas lacking monitoring infrastructure. By distinguishing between these application scenarios, the framework provides actionable decision-support tools for water resource managers to implement targeted pollution control measures, optimize monitoring network design, and prioritize remediation efforts across hydrological seasons.
Compared with existing research on groundwater quality prediction based on multiple ML algorithms and large datasets, the novelty of this paper lies in three aspects: (1) the t-SNE-GMM-KNN strategy preserves the non-linear manifold structure of high-dimensional geochemical space during generation; (2) we systematically explore the application of hybrid quantum-classical machine learning in groundwater quality prediction, evaluating its robustness and feature discriminability in small-sample seasonal dynamics; and (3) we conduct a comparative analysis of prediction performance using traditional field observation data versus the first-time application of AlphaEarth Foundation (AEF) semantic embeddings for nitrate estimation.
- Page 3, Lines 5–30. The study area description is descriptive but lacks quantitative justification. For example, No map scale or resolution information is provided. Climatic or hydrological statistics are not sufficiently summarized. The selection rationale for this site is weak.
I suggest authors will include a detailed map with coordinate system and scale. A table summarizing key environmental variables. A clear explanation of why this site is scientifically significant.
We appreciate these valuable suggestions. We have revised Figure 1 to include detailed cartographic information.
We agree that a concise summary of key climatic and hydrological variables enhances the reproducibility and contextual clarity of our study. In response, we have added a new table (Table 1) in Section 2.1 "Study area" that systematically summarizes:
(1) Climatic statistics (temperature, precipitation, evaporation, sunshine, and wind);
(2) Soil and vadose zone characteristics relevant to nitrogen transport;
(3) Groundwater system parameters (aquifer type, water table depth, recharge/discharge mechanisms).
We agree that the original manuscript provided a primarily descriptive overview of the study area without sufficiently quantifying the rationale for site selection or its specific scientific significance.
To address this, we have revised the Introduction section to explicitly justify the selection of the Xiong’an New Area based on quantitative indicators and strategic relevance. Specifically, we have added the following details: We highlighted that Hebei Province reported a groundwater nitrate exceedance rate of 31.66% in 2017, significantly higher than the national average, positioning Xiong’an within a critical high-risk zone. We specified the annual nitrogen input rates (540-660 kg N ha-1 yr-1) and shallow groundwater table depth to demonstrate the site's susceptibility to pollution and its representativeness of intensive agricultural regions. We clarified that the distinct monsoonal climate (with 60%-80% precipitation concentrated in summer) creates pronounced seasonal hydrochemical contrasts, making it an ideal natural laboratory for testing the robustness of our proposed seasonal prediction framework under data sparsity.
These additions provide a stronger quantitative justification for the study site and clarify its scientific value for regional water management and methodological validation. Please see the revised Introduction section for these changes.
The specific modifications are as follows:
2.1 Study area
The North China Plain is one of China’s most important agricultural production bases. This study focuses on the Xiong’an New Area, situated in the central part of Hebei Province, as a representative research site within this plain. Located in the core region defined by Beijing, Tianjin, and Baoding, it boasts an advantageous geographical position, with straight-line distances of 105 km to both Beijing and Tianjin, and 30 km to Baoding. Its geographical coordinates range from 38°43′ to 39°10′ N latitude and from 115°38′ to 116°20′ E longitude, covering an area of approximately 1770 km2 (Xiong’an New Area Official Website, 2023). The specific study area is an unmanned farm located in Xieyeqiao Village, Nanzhang Town, Rongcheng County, within the Xiong'an New Area (Fig.1). The farm covers an area of 3000 hectares and primarily cultivates two main grain crops: wheat and corn. As the first mechanized unmanned farm in Xiong'an, it has achieved full mechanization and intellectualization, enabling unmanned, precise, and standardized operations throughout all stages of tillage, sowing, management, and harvesting.
Fig.1. Study area map showing the sampling location and sampling points.
Fig.2. Temperature and precipitation trends in study area (1952-2025).
Cultivated land constitutes a significant proportion of the total area in Xiong’an New Area and is predominantly dryland. Traditional fertilization practices in this region involve high application rates of nitrogen fertilizer and manure. As a representative farm in the region, the study site adheres to these traditional practices, making it susceptible to the impacts of high fertilization intensity. The annual nitrogen fertilizer application rate at the study site ranges from 540 to 660 kg (N) ha-1 yr-1, primarily supplied in the form of urea (46% N). Consequently, the extensive application of chemical fertilizers and manure has increased the risk of groundwater nitrogen pollution. Furthermore, the relatively dense rural population distribution leads to pollution from the discharge of domestic sewage in the vicinity. Irrigation is scheduled according to crop phenological stages. Wheat receives muddy water irrigation before sowing, and during the overwintering, reviving, and jointing stages, while maize receives a single muddy water irrigation after sowing.
The climate is characterized as a temperate continental monsoon climate. Springs are dry with little rain, summers are humid and rainy, autumns are cool and dry, and winters are cold with minimal snowfall. The mean annual temperature is 12.6°C, with relatively small inter-annual fluctuations; the lowest temperatures occur in January, averaging -5°C, while the highest occur in July, averaging 26.1°C. The mean annual precipitation is 480.8 mm, which is highly concentrated between June and September, accounting for nearly 80% of the annual total. The long-term temperature and precipitation changes in the study area are shown in Fig.2. The multi-year average evaporation is 1,661.1 mm (Liao et al., 2020), and the multi-year average water surface evaporation is 1,761.7 mm (Sun, 2024). The mean annual sunshine duration is 2,335.2 hours, with longer sunshine hours in spring and summer and shorter hours in autumn and winter. The average frost-free period lasts 204 days. The mean annual wind speed is 1.7 m s-1, with the highest average speeds occurring in April and the lowest in January, August, and December.
The soil texture is predominantly silt loam. The 2-8.5 m soil layer contains interlayers with high clay content, such as clay and silty clay, reflecting the sedimentary characteristics of the vadose zone in the Central Plains under geomorphic deposition. Nitrogen in the thick vadose zone exists predominantly as organic nitrogen, accounting for approximately 97% of the total nitrogen content. The shallow vadose zone at depths of 3-6 m stores the highest amount of nitrate, accounting for approximately half of the total nitrate reserves in the North China Plain (Li et al., 2025; Zhang et al., 2007). Groundwater in the study area is primarily stored in Quaternary unconsolidated porous aquifers, with sampling well depths ranging from 70 to 120 m (Bai et al., 2023). The lithology of the shallow aquifer group is dominated by sand layers with medium water abundance, such as silty fine sand, fine sand, and medium sand, with single-well yields ranging from approximately 1,000 to 3,000 m3/d. Based on groundwater survey data from June 2022 (Xiong’an Urban Geology Research Center, China Geological Survey, 2023), the depth to the shallow groundwater table in Xiong’an New Area ranges from 5.0 to 35.0 m. Water level fluctuations are primarily influenced by atmospheric precipitation infiltration and agricultural extraction (Ma et al., 2022). The primary source of groundwater recharge for farmland in the study area is atmospheric precipitation, while the main pathway of discharge is artificial extraction for agricultural irrigation. The detailed data on climate and hydrological indicators in the research area are shown in Table 1.
Table 1. Summary of key climatic and hydrological variables in the study area.
Category
Variable
Value / Range
Unit
Temperature
Mean annual temperature
12.6
°C
Mean January temperature
-5.0
°C
Mean July temperature
26.1
°C
Precipitation
Mean annual precipitation
480.8
mm
Rainy season (Jun-Sep) contribution
~80
% of annual total
Evaporation
Mean annual evaporation
1,661.10
mm
Mean annual water surface evaporation
1,761.70
mm
Solar & Wind
Mean annual sunshine duration
2,335.20
h
Mean annual wind speed
1.7
m s-1
Frost-free period
~204
days
Soil & Vadose Zone
Dominant soil texture
Silt loam
—
Organic N proportion in vadose zone
~97
% of total N
Peak nitrate storage depth
3-6
m
Groundwater
Aquifer type
Quaternary unconsolidated porous
—
Sampling well depth
70-120
m
Shallow water table depth
5.0-35.0
m
Single-well yield
1,000-3,000
m3 d-1
Dominant recharge source
Precipitation infiltration
—
Dominant discharge pathway
Agricultural extraction
—
- Page 4, Lines 12. The manuscript does not adequately describe Data temporal resolution. Data completeness. Missing data handling procedures. Quality control protocols.
If interpolation or smoothing was applied, it must be explicitly stated. Reproducibility requires transparent preprocessing documentation.
We sincerely thank the reviewer for pointing out the need for more detailed documentation regarding data temporal resolution, completeness, missing data handling, quality control (QC) protocols, and preprocessing transparency. We agree that these details are crucial for reproducibility.
In the revised manuscript, we have made the following modifications to address these concerns:
Data Temporal Resolution: We explicitly clarified that groundwater data represents seasonal snapshots, while AlphaEarth Foundation (AEF) data provides annual temporal resolution.
Data Completeness & Missing Data: We added a statement confirming that samples with incomplete records were excluded prior to modeling, ensuring 100% completeness in the final dataset, and clarified that no imputation was used for raw data.
Interpolation/Smoothing: We explicitly stated that no interpolation or smoothing was applied to the original observational data to preserve raw variability. The only exception is the spatial visualization of nitrate concentration distributions, where ordinary Kriging interpolation was applied solely for mapping purposes using Surfer software (Golden Software, LLC). This interpolation was not used for data augmentation, missing value imputation, or model input generation. Virtual sample generation is strictly for augmentation, not filling missing values.
Preprocessing Documentation: We enhanced the preprocessing workflow description to ensure transparency and reproducibility (Section 2.7.4).
The specific modifications are as follows:
2.2 Data collection and measurements
2.2.1 Field sampling data and laboratory analysis
Field investigations and the collection of hydrochemical and isotopic samples were conducted in the study area from 2022 to 2023. The temporal resolution for groundwater sampling was designed as seasonal snapshots, with campaigns occurring in October 2022 (normal season), April 2023 (dry season), and August 2023 (wet season), representing a temporal interval of approximately 4-6 months between surveys. A total of 66, 65, and 50 groundwater samples were collected in October 2022, April 2023, and August 2023, respectively. All groundwater samples were obtained from existing agricultural irrigation wells within the study area. Prior to sample collection, each well was purged by pumping. Sampling commenced only after the pumped volume exceeded three times the well's casing volume and on-site parameters had stabilized (i.e., showing minor fluctuations around a constant value rather than a continuous rising or falling trend), a procedure implemented to ensure the representativeness of the samples. At each sampling point, one 1000 mL and two 100 mL samples were collected. Before final collection, the sample bottles were rinsed three times with the water to be sampled. Immediately after collection, the samples were sealed and stored in a portable cooler for transport to the laboratory for subsequent analysis. Furthermore, the precise geographical location of each sampling point was recorded using a GPS device. Regarding data completeness and missing data handling, all collected samples underwent rigorous screening. Samples with incomplete physicochemical records or those failing initial quality checks were excluded from the final dataset. Consequently, the modeling dataset achieved 100% completeness with no missing values requiring imputation or interpolation.
In-situ physicochemical parameters were measured using a Hach HQ400 multi-parameter water quality meter (Li et al., 2022). The measured parameters included water temperature (T, °C), pH, total dissolved solids (TDS, mg L-1), dissolved oxygen (DO, mg L-1), electrical conductivity (EC, μS cm-1), and oxidation-reduction potential (ORP, mV). These parameters can be acquired rapidly in the field with minimal sample preparation, offering a practical advantage over laboratory-based nitrate determination which requires sample preservation, transportation, and longer analytical processing times. This operational efficiency motivates the use of field-measurable parameters as predictors for preliminary nitrate concentration estimation. The concentration of HCO3- was determined within 24 hours of sample collection using the dilute sulfuric acid-methyl orange titration method (Huang et al., 2012). Prior to the determination of cations and anions, water samples were filtered through 0.45 μm membrane filters. Major cations (K+, Ca2+, Na+, Mg2+) were analyzed using an inductively coupled plasma optical emission spectrometer (Avio 500). Major anions (NO3-, Cl-, SO42-) were analyzed using an ion chromatograph (ICS-2100). The analytical precision for cations and anions was controlled within ±0.2 mg L-1, and the charge balance error was maintained within 5% to ensure reliability. The concentrations of nitrite nitrogen and ammonia nitrogen were determined using a flow injection analyzer (Smartchem 200, AMS Alliance) and measured using dual wavelength spectrophotometry and the indophenol blue method (Kim et al., 2019; Sun et al., 2022). The limits of detection for nitrite nitrogen and ammonium nitrogen were both 0.01 mg L-1. For the analysis of stable hydrogen and oxygen isotopes, water samples were filtered through 0.22 μm membrane filters and measured using an LGR liquid water isotope analyzer (TIWA-45-EP). The analytical precisions for δ2H, δ17O, and δ18O were ±0.15‰, ±0.02‰, and ±0.02‰, respectively (Hamidi et al., 2023). The isotopic compositions of nitrate (δ18O-NO3- and δ15N-NO3-) were determined using a MAT-253 mass spectrometer coupled with an elemental analyzer (Li et al., 2022). To ensure analytical precision, standard references, reagent blanks, and duplicate samples were employed. Furthermore, international standards USGS 34 and USGS 35 were used for δ18O quality control, while USGS 32 and USGS 34 were used for δ15N quality control. All isotope results are reported in per mil (δ, ‰). No interpolation or smoothing techniques were applied to the raw hydrochemical data to preserve the inherent variability of the groundwater system. The sole exception was the spatial visualization of nitrate concentration distributions(Fig.7), where ordinary Kriging interpolation was applied exclusively for cartographic representation using Surfer software(Golden Software, LLC). This interpolation was used solely for generating continuous surface maps and was not employed for data augmentation, missing value imputation, or model input generation.
2.2.2 Google AlphaEarth Foundation
To facilitate comparisons with predictions based on in-situ field sampling data and to validate the accuracy of predicting groundwater nitrate concentration using remote sensing data, this study incorporates the Google AlphaEarth Foundation (AEF) dataset. AEF is a collection of high-dimensional surface semantic embedding features generated via pre-training on multi-source remote sensing data (Brown et al., 2025). By fusing imagery from Sentinel-2, Landsat, and other Earth observation satellites, this dataset constructs a 64-dimensional vector representation (denoted as A00-A63) at a global scale with an annual temporal resolution and a 10 m spatial resolution (Alvarez et al., 2025). This annual temporal resolution provides consistent surface context for each hydrological year, complementing the seasonal groundwater snapshots. These embeddings implicitly encode complex environmental semantics, such as land cover types, vegetation dynamics, soil moisture, and the intensity of human activity, and have been successfully applied in tasks including land use classification, crop monitoring, and environmental risk modeling (Tollefson et al., 2025).
The primary processing workflow involved spatially sampling the 64-dimensional AEF vectors at a 10 m resolution using the GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL product on the Google Earth Engine (GEE) platform, based on the geographic coordinates of the field sampling points. To ensure data quality, only samples exhibiting exact matches between the GEE extraction and the actual field data points were retained. Any pixels with significant cloud cover or data gaps in the underlying satellite imagery were automatically handled by the AEF pre-processing pipeline, ensuring complete feature vectors for all sampling locations. Given the redundancy within the initial 64-dimensional AEF features, Principal Component Analysis (PCA) based on Singular Value Decomposition (SVD) was employed for feature compression. Specifically, SVD was performed on the centered feature matrix to select the minimum number of principal components accounting for at least 95% of the cumulative explained variance (Ilyas et al., 2025). This threshold was selected to balance information retention with dimensionality reduction, ensuring robust model input without overfitting (Ilyas et al., 2025). The orthogonalized, low-dimensional principal component scores were subsequently used as model inputs. This approach preserves the vast majority of the semantic information from the original embeddings while significantly mitigating the risk of overfitting. Ultimately, the PCA-reduced AEF features served as the input variables for the model.
2.7.4 Standardized prediction workflow
To systematically evaluate the nitrate concentration prediction capabilities of different input variables and modeling strategies across various hydrological seasons, and to validate the effectiveness of virtual sample generation for small-sample modeling, this study established a standardized prediction pipeline (Fig.3). The specific steps are as follows: (1) Data Preprocessing and Grouping: Observed samples were partitioned by seasons. Z-score normalization was applied separately to two types of input features: field water quality parameters and AlphaEarth Foundation (AEF) features reduced via Principal Component Analysis (PCA). Transparent preprocessing documentation was maintained throughout this process. No interpolation or smoothing was applied to the original observational data to preserve raw variability. The virtual sample generation described in subsequent steps was used solely for dataset generation to improve model robustness, not for filling missing raw data values. (2) Virtual Sample Generation and Validation: A t-SNE-GMM-KNN strategy was employed to generate virtual samples. Generation factors from 1x to 10x were systematically evaluated. Their physical plausibility and distribution consistency were rigorously verified using statistical indicators, box plots, and histograms. (3) Model Training: Under unified hyperparameters, classical Random Forest (RF) and quantum-enhanced RF models were constructed. The latter generates <Z> quantum features via Parameterized Quantum Circuits (PQC) encoding, which are concatenated with original features to form 2*n input features. Models were trained using two distinct input datasets and combinations of "original samples + 1~10× virtual samples." (4) Model Evaluation: The Leave-One-Out Cross-Validation (LOOCV) strategy was adopted to calculate R2, RMSE, and MAE. Visual diagnostics were performed using observed-predicted scatter plots, residual plots, and box plots. (5) Interpretability Analysis: Multi-scale interpretation was conducted based on the SHAP framework, including summary plots (global importance ranking), dependence plots (nonlinear response and interaction effects), and waterfall plots (local attribution). The driving mechanisms were cross-verified with results from Bayesian models and Pearson correlation analysis. This workflow encompasses the full process from data generation, modeling, and evaluation to attribution, providing a reproducible and highly transparent solution for precise groundwater nitrate prediction under conditions of small samples, multiple seasons, and multi-source inputs.
To ensure reproducibility, all computational tasks were performed on an Apple M2 Pro (10-core CPU, 16GB RAM) running Python 3.9. Core machine learning and preprocessing tasks utilized scikit-learn (version 1.7.2), quantum feature mapping and statevector simulation were conducted using qiskit (version 1.4.5) with CPU-based analytical calculation to ensure precision without hardware noise, model interpretability analysis employed shap (version 0.49.1), data manipulation was performed with pandas (version 2.3.3) and numpy (version 2.2.6), and visualization was generated using matplotlib (version 3.9.0). Statistical tests and scientific computing relied on scipy (version 1.15.3), and parallel processing was managed by joblib (version 1.5.2). All random operations were seeded (random_state=42) to ensure deterministic results. Complete code and generated datasets are available upon request.
Fig.3. Process diagram for constructing prediction framework.
- Page 5–7. The methodological section lacks A clear workflow diagram. Parameter selection criteria. Hyperparameter tuning explanation (if applicable). Justification of threshold values used.
If statistical or machine learning methods are employed, the following must be specified Training/testing split strategy. Cross-validation method. Performance metrics definitions. Software and version. Currently, the method description is too general for replication.
We sincerely thank the reviewer for these constructive comments regarding the reproducibility and clarity of our methodological framework. We have thoroughly revised the Materials and Methods section to address each point:
- Workflow Diagram: We have refined Fig. 3to explicitly illustrate the end-to-end pipeline, including data preprocessing, virtual sample generation, quantum feature encoding, hyperparameter tuning, and model evaluation. The text description in Section 2.7.4 has been expanded to walk the reader through each step of the diagram.
- Parameter Selection & Hyperparameter Tuning: We have added detailed explanations in Section 2.4 and Section 2.5. Specifically, we clarified the quantum circuit parameters (e.g., reps=1, n_qubits equal to feature dimension), the t-SNE perplexity setting (10, optimized for small samples), the GMM component selection via BIC minimization (range 1-7), and the KNN inverse mapping parameters (n_neighbors=5). The grid search ranges for the Random Forest (e.g., n_estimators, max_depth) have also been specified.
- Threshold Values: Justifications for thresholds, such as the 95% cumulative variance for PCA in AEF feature processing (Section 2.2.2), the perplexity setting for t-SNE (Section 2.4), and the physical constraints for virtual samples (e.g., NO3-≥ 0 mg L-1) have been clarified.
- Training/Testing & CV Strategy: The LOOCV strategy is now explicitly defined in Section 2.6.2, explaining why it was chosen over standard k-fold split (due to limited samples).
- Performance Metrics: Definitions forR2, RMSE, and MAE are provided with formulas in Section 2.7.2.
- Software and Version: A new paragraphhas been added to list all key packages (Python, scikit-learn, Qiskit, etc.) to ensure replicability.
- Virtual Sample Generation Details: Section 2.4 has been substantially expanded to include the complete technical workflow: KNN-based missing value imputation, t-SNE dimensionality reduction parameters, GMM component optimization via BIC, KNN inverse mapping for reconstructing original feature space, and physical constraint enforcement. Code-level implementation details are now provided to ensure full reproducibility.
The specific modifications are as follows:
2.4 t-SNE-GMM-KNN: based on nonlinear structure modeling in feature space
To address the challenges of overfitting and poor generalization performance in small-sample modeling, which arise from data sparsity and skewed distributions, this study proposes a three-stage virtual sample generation strategy termed t-SNE-Gaussian Mixture Sampling with KNN Inverse mapping. This method aims to preserve the non-linear manifold structure and multi-modal distribution characteristics of the original high-dimensional feature space while generating physically plausible and statistically consistent virtual samples. Crucially, virtual samples are generated exclusively in the hydrochemical feature space (spatial dimension) for each hydrological season independently, with no temporal extrapolation or cross-season data fusion, to avoid altering the inherent seasonal hydrochemical heterogeneity identified in this study. The “spatial dimension” here refers to the high-dimensional hydrochemical feature space composed of in-situ measured water quality parameters and nitrate concentration. These virtual samples augment the hydrochemical attribute space within each specific hydrological season (normal, dry, wet) to increase population density for model training, rather than representing new physical spatial locations or distinct time points. The specific workflow is as follows:
- Data standardization
All input features are standardized using Z-score standardization to eliminate scale differences and enhance the stability of the subsequent dimensionality reduction (Jamshidi et al., 2022). The StandardScaler from scikit-learn was applied to transform each feature to zero mean and unit variance. Notably, the target variable (NO3-) is included alongside predictive variables (pH, T, EC, DO, ORP, Salt, TDS) in this standardization step to ensure the joint distribution between predictors and the target is preserved during generation.
- t-SNE non-linear dimensionality reduction
t-Distributed Stochastic Neighbor Embedding (t-SNE) is employed to map the high-dimensional feature space into a low-dimensional latent space (d=2) (Islam et al., 2023). t-SNE is selected for its superior ability to preserve the local non-linear manifold structure of high-dimensional hydrochemical data compared to linear dimensionality reduction methods such as PCA, which is critical for capturing clustered subgroups corresponding to distinct hydrological and contamination processes (Wang et al., 2025). To balance the preservation of local and global structures, the perplexity is set to 10 based on empirical testing for small sample sizes (n<100), which optimizes the balance between local neighborhood preservation and global structure maintenance (Kobak et al., 2019). PCA initialization is used to ensure reproducibility and accelerate convergence, with a maximum iteration number of 1000 and a gradient descent tolerance of 1e-5 set as the convergence criteria. t-SNE effectively reveals the clustered structure of samples on the low-dimensional manifold, reflecting the differentiation of underlying environmental processes within hydrological seasons (Liu et al., 2021). The implementation used the scikit-learn TSNE module with random_state=42 for reproducibility.
- GMM clustering and optimal component selection
In the t-SNE-reduced low-dimensional space, a Gaussian Mixture Model (GMM) is constructed to characterize the probability density distribution of the data (Jia et al., 2022). The GMM assumes that the data are generated from a linear combination of several Gaussian distributions. The GMM captures the multimodal nature of the hydrochemical data (e.g., distinct clusters corresponding to different contamination levels or redox conditions). The weights, means, and covariance matrices of each Gaussian component are estimated via the Expectation-Maximization (EM) algorithm, thereby accurately capturing the complex distribution patterns of the data (Yan et al., 2023). To avoid subjectively setting the number of clusters, the Bayesian Information Criterion (BIC) is used to automatically optimize the number of components, K, within the range (1 to 7, where the upper bound was set to approximately n/10 to prevent overfitting given the small sample size) (Ghodba et al., 2025):
(5)
where L is the model’s likelihood, k is the total number of free parameters for a K-component model, and n is the sample size. The value of K corresponding to the minimum BIC is selected as the optimal number of components, ensuring a balance between goodness-of-fit and model complexity. The GMM was implemented with full covariance type and random_state=42 for reproducibility.
- Virtual sample generation and inverse mapping
Based on the optimal GMM, a specified number of virtual points are randomly sampled from its joint probability distribution. In this study, generation factors ranging from 1x to 10x the original sample size were tested, with the 10x generation (n_virtual = n_original x 10) yielding optimal model performance. This generation process naturally inherits the multi-modality and covariance structure of the original data. Since NO3- was included in the initial feature space, the generated low-dimensional points inherently contain information corresponding to both input variables and nitrate concentrations. Based on the optimal GMM, virtual samples are generated by randomly sampling from the joint probability distribution in the 2D t-SNE latent space. To reconstruct these synthetic 2D points back into the original 8-dimensional geochemical feature space (including NO3- concentrations), a K-Nearest Neighbors (KNN) regressor is employed as an inverse mapping function (Niu et al., 2025). The KNN regressor was configured with n_neighbors=min(5, n_original-1) to ensure sufficient reference points while maintaining local fidelity. This neighbor number is selected to balance the smoothness of inverse mapping and the preservation of local data characteristics, avoiding over-smoothing of extreme values. The KNN model is fitted on the original t-SNE embedding results (as independent variables) and the original standardized high-dimensional hydrochemical data (as dependent variables), establishing a non-linear mapping relationship from the low-dimensional latent space to the original high-dimensional feature space. Finally, the virtual sample set in its original physical units is obtained by applying inverse standardization. This inverse mapping process synchronously generates the virtual values of all input hydrochemical variables and the corresponding target NO3- concentration for each virtual sample, ensuring the one-to-one correspondence between input predictors and the target variable required for subsequent random forest (RF) modeling. During subsequent Random Forest modeling, the generated NO3- values are separated from the predictors and serve as the target variable (y), while the remaining hydrochemical parameters serve as input features (X).
- Physical constraints and quality control
For the target variable, NO3-, a non-negativity constraint (NO3- ≥ 0 mg L-1) is imposed to prevent non-physical solutions that may arise from the regression approximation. This constraint was enforced using numpy.clip() function post-generation. Other variables, such as pH and ORP, are allowed to fluctuate within reasonable ranges without hard clipping to retain the model's flexibility. The consistency between the virtual and measured samples is validated by comparing their statistical characteristics, including mean, standard deviation, coefficient of variation, range of extreme values, and boxplot distributions. This comparison confirms that the generated data are highly consistent with the original data in terms of statistical properties, without introducing systematic bias or outliers. The generated virtual samples are complete synthetic observations containing both the predictor variables and nitrate concentrations. These are concatenated with the original measured data to form an augmented training dataset. During Random Forest (RF) training, the 7 water quality parameters (pH, T, EC, DO, ORP, Salt, TDS) serve as input features (X), while the NO3- column (from both original and virtual samples) serves as the target variable (y). This approach effectively increases the training sample size while maintaining the geochemical integrity of the feature-target relationships, thereby mitigating small-sample bias and improving the model's ability to capture heavy-tailed distributions.
The advantages of this method are as follows: ① t-SNE excels at capturing local neighborhood relationships, effectively separating implicit subgroups under different hydrological conditions. ② The GMM provides a probabilistic generative framework, supporting reasonable extrapolation for heavy-tailed distributions and extreme values. ③ The KNN-based inverse mapping circumvents the need for large training datasets, which is a limitation of traditional autoencoders, making it particularly suitable for small-sample scenarios (Tang et al., 2022; Razavi-Termeh et al., 2024). ④ By modeling the joint distribution of predictors and target, the method ensures that generated virtual nitrate concentrations are chemically consistent with the accompanying hydrochemical indicators, avoiding unrealistic combinations that could degrade model performance.
2.5 Machine learning methods
2.5.1 Random forest
In this study, Random Forest (RF) was adopted as the baseline model. As an ensemble learning method that leverages bootstrap sampling and random feature selection, RF builds numerous decision trees and integrates their predictions (Abderzak et al., 2025). This approach effectively suppresses overfitting and improves generalization performance, making it especially well-suited for environmental data modeling scenarios involving small samples, high dimensionality, non-linearity, and multicollinearity (Boddu et al., 2025). The core principle of the RF regression model is to generate multiple independent decision tree learners through bootstrap resampling of the training dataset, where each tree is trained on a random subset of samples and a random subset of input features at each node split. The final prediction is the average of the outputs from all individual decision trees, which effectively reduces the variance of the model and suppresses overfitting. For each node split, the optimal split feature and threshold are determined by minimizing the mean squared error (MSE) reduction, which is derived from the Gini impurity criterion and mathematically defined as:
(6)
where Pi is the proportion of samples belonging to the i-th interval of the target variable in the node. For regression tasks, the Gini impurity is extended to the MSE reduction criterion, which quantifies the decrease in the variance of the target variable after the node split.
Hyperparameters were configured based on a preliminary grid search and domain expertise: n_estimators=100, max_depth=5, min_samples_split=6, min_samples_leaf=3, and max_features=. For the interpretation of driving mechanisms, feature importance was quantified by the mean decrease in Gini impurity (Gini Importance) to identify the critical hydrogeochemical indicator factors (Kaur et al., 2025).
2.5.2 Hybrid quantum-classical random forest
Based on the random forest, a hybrid quantum-enhanced Random Forest model was constructed, integrating quantum feature enhancement with classical random forests. The core idea of the model is: utilizing a Parameterized Quantum Circuit (PQC) to perform quantum feature encoding on standardized input features, generating high-dimensional quantum features with non-linear entanglement properties (Naresh et al., 2025). These are then concatenated with the original features to construct an enhanced hybrid feature space, which is finally fed into a random forest regressor for modeling (Lamichhane et al., 2025). This hybrid framework retains the superior ensemble learning performance of the classical RF model, while introducing quantum-enhanced feature representations to improve the model's ability to capture complex non-linear relationships in small-sample datasets, without relying on physical quantum hardware.
(1) Quantum Feature Encoding
Quantum state transformation of classical data is achieved based on the Z-feature map in quantum computing (Vedavyasa et al., 2025). The ZFeatureMap maps the classical high-dimensional feature space into a quantum Hilbert space through single-qubit Z-gate operations and two-qubit CZ-gate entanglement operations (Khalil et al., 2025). In this study, the number of qubits was set equal to the number of input features (7 qubits for in-situ hydrochemical parameters, corresponding to pH, T, EC, DO, ORP, Salt, TDS), and the repetition layers (reps) were set to 1 to prevent circuit depth-induced noise while maintaining expressibility. This single-layer circuit design avoids the exponential growth of quantum state noise with increasing circuit depth, which is critical for ensuring the stability of feature generation in classical simulation environments. Its core advantage lies in obtaining quantum features via analytical calculation of quantum state vectors, thereby avoiding noise interference introduced by quantum sampling and ensuring feature stability. The ZFeatureMap provided by Qiskit is used as the feature encoding circuit, with its Hamiltonian form given by (Tehrani et al., 2024):
(7)
the term represents the unitary operator of the encoding circuit that maps classical data x into a quantum state, which is a dimensionless unitary matrix. The input vector x denotes the classical data with units corresponding to normalized feature values specific to the problem at hand. The parameter R indicates the number of repetition layers in the circuit architecture, serving as a dimensionless positive integer, while d represents the selected number of principal factors, also expressed as a dimensionless positive integer. The indices k and i iterate over the repetition layers and qubits respectively, with k ranging from 1 to R and i ranging from 1 to d . The operator Hi corresponds to the Hadamard gate applied to the i -th qubit. The symbol ⨂ denotes the tensor product operation across multiple qubits, while exp() represents the matrix exponential function defining the parameterized rotation gate. The term −i incorporates the imaginary unit i with a negative sign. The set S refers to the subset of qubit indices involved in entanglement operations, satisfying the condition S⊆{1,...,d} . The rotation angle ϕS(x) depends on the input data and is computed via linear embedding, with units expressed in radians or as a dimensionless quantity. The operator Zj represents the Pauli-Z operator acting on the j-th qubit.
For each sample x, construct the corresponding quantum state |ψ(x)⟩, and analytically calculate the Pauli-Z expectation value for each qubit (Liao et al., 2024):
(8)
here, the quantity ⟨Zi⟩ denotes the expectation value of the Pauli-Z operator for the i -th qubit, taking values in the range [−1,+1] as a dimensionless scalar. The state vector ∣ψ(x)⟩ represents the quantum state encoding the classical data x after applying the unitary UZMap. The bra vector ⟨ψ(x)∣ corresponds to the conjugate transpose of ∣ψ(x)⟩ . The probabilities P(qi=0) and P(qi=1) represent the respective likelihoods of measuring the i -th qubit in states ∥0⟩ and ∥1⟩ , both taking dimensionless values in the range [0,1], while qi indicates the binary measurement outcome of the i -th qubit with possible values {0,1}. This method requires no quantum hardware sampling, completely avoiding the interference of measurement noise and shot noise on small-sample modeling, thus ensuring the determinism and reproducibility of feature generation. This analytical expectation value calculation was implemented using the Statevector module in Qiskit, ensuring deterministic feature generation without shot noise.
(2) Feature Fusion and Modeling
Concatenate the original n-dimensional raw features with the n-dimensional quantum ⟨Z⟩ features to form a 2n-dimensional hybrid feature vector x_aug = [x_raw; ⟨Z⟩] (Cowlessur et al., 2025). That is, original features+quantum encoded features. This feature fusion strategy retains the original physical information of the hydrochemical parameters while introducing non-linear entangled quantum features, expanding the feature space to enhance the model’s discriminative ability for complex non-linear relationships. Using this augmented feature vector as input, we construct a random forest regression model with identical hyperparameter settings to the classical RF baseline model, to ensure a fair performance comparison between the two models::
(9)
The symbol represents the predicted output value of the regression model, with units corresponding to the target variable. The parameter M denotes the total number of decision trees in the ensemble. Treem represents the m -th decision tree regressor in the ensemble. The input xaug refers to the augmented hybrid feature vector formed by concatenating the original n -dimensional raw features xraw with the n -dimensional quantum ⟨Z⟩ features. The hyperparameter settings are the same as those for the classical random forest method described above.
2.7 Evaluation methods and prediction process
2.7.1 SHAP analysis
This study adopts the SHapley Additive exPlanations (SHAP) method for local and global explainability analysis (Merabet et al., 2025). Through three typical visualization methods, namely summary plot, dependence plot, and waterfall plot, the following are revealed respectively: (1) The overall ranking and distribution of feature importance across all samples (global perspective); (2) The nonlinear relationship or interaction effects between a single predictor variable and the predicted nitrate concentration (conditional dependence); (3) The contribution decomposition of each feature in the prediction result of a representative sample (local attribution) (Alam et al., 2025).
The SHAP value is mathematically defined as: the marginal contribution of feature j to the model output offset from the baseline mean (Li et al., 2024), and its form is:
(17)
where F is the set of all features, S is a subset not containing feature j, and f is the model output. By averaging the absolute SHAP values over all samples, a feature importance measure with a game-theoretic foundation, unbiased and robust, can be obtained (Hollmannet al., 2025).
2.7.2 Leave-One-Out Cross-Validation (LOOCV) and model evaluation indicators
Given the limited sample size in each hydrological season, this study adopts Leave-One-Out Cross-Validation (LOOCV) for model performance evaluation to maximize the use of training data and reduce evaluation bias (Austin et al., 2025). Prior to LOOCV, hyperparameter tuning was conducted using 5-fold Cross-Validation on the full dataset to identify optimal model structures (e.g., tree depth, number of estimators). The grid search ranges included: n_estimators [100, 150], max_depth [5, 6, 7], min_samples_split [4, 6], and max_features ['sqrt']. The best parameters identified were then fixed for the LOOCV evaluation. The LOOCV process is: each time, one sample is left out as the validation set, and the remaining n-1 samples are used for training. After repeating nn times, the average of the evaluation indicators is taken as the final result (Ren et al., 2021). Additionally, to assess prediction uncertainty, 90% Prediction Intervals (PI) were estimated using quantile regression forests during the LOOCV process. Coverage probability was calculated to validate the reliability of the uncertainty estimates. To quantify the uncertainty of model performance estimates derived from LOOCV, we implemented Bootstrap resampling (1000 iterations) to calculate 95% confidence intervals (CI) for all evaluation metrics. This approach addresses the limitation that LOOCV alone cannot provide direct confidence intervals for R2 estimates due to the single-sample validation nature. By repeatedly resampling the prediction residuals with replacement and recalculating metrics, we obtained robust uncertainty bounds that reflect the stability of model performance across different data realizations. The Bootstrap 95% CI is reported alongside LOOCV metrics in all performance tables (Table S1-S6).
The coefficient of determination R2, root mean square error (RMSE), and mean absolute error (MAE) are used to quantitatively describe the model accuracy and error characteristics (Gul et al., 2025):
(18)
(19)
(20)
where is the measured value of the i-th sample (mg L-1), is the model’s predicted value (mg L-1), is the mean of the measured values, and n is the number of samples in the test set.
2.7.3 Model robustness and uncertainty analysis
2.7.3.1 Sensitivity analysis
To quantitatively evaluate the response intensity of the model’s nitrate concentration prediction results to the perturbation of input hydrochemical parameters, and to verify the robustness of the model and the reliability of feature importance ranking, this study conducted a multi-dimensional sensitivity analysis. This includes global sensitivity indices based on SHAP values, one-at-a-time (OAT) feature perturbation tests, and elasticity coefficient calculations (Di et al., 2024).
First, based on the SHAP values obtained from the TreeExplainer, the first-order and total-order sensitivity indices were calculated to measure the independent and interactive contributions of each feature (Huang et al., 2021). The first-order sensitivity index for feature i is defined as the normalized mean absolute SHAP value:
(21)
where is the SHAP value of feature i for sample j, n is the number of samples, and p is the number of features. The total-order sensitivity index , which captures both main effects and interaction effects, was approximated using the diagonal elements of the SHAP interaction values:
(22)
where represents the self-interaction SHAP value for feature i in sample j.
Second, to assess the local stability of the model, an OAT perturbation test was conducted by fixing other input parameters and varying the target feature within ±30% of its measured range (Anderson et al., 2018). The perturbation sensitivity is calculated as the relative change rate of the model’s R2 score:
(23)
where represents the coefficient of determination obtained under different perturbation levels.
Finally, the elasticity coefficient was introduced to quantify the directional response degree of nitrate concentration prediction to per unit change of each parameter. It is defined as the ratio of the percentage change in output to the percentage change in input:
(24)
where is the baseline prediction, is the prediction after perturbing feature i by a small fraction, and is the original value of feature i for sample j. An absolute value greater than 1 indicates an elastic response, while less than 1 indicates an inelastic response.
2.7.3.2 Error propagation
Monte Carlo simulation was specifically used to analyze the propagation of input measurement errors to the final predictions (Dega et al., 2023). Assuming the input hydrochemical parameters follow a normal distribution with a coefficient of variation (CV) of 5% (representing typical instrumental uncertainty), the perturbed input matrix Xmc was generated as:
(25)
where = ×CV. This process was repeated for Nmc =500 iterations. The output uncertainty due to input errors was then characterized by the standard deviation (σout) and the 95% CI width of the ensemble predictions:
(26)
where is the prediction from the k-th Monte Carlo simulation, and P2.5 ,P97.5 are the 2.5th and 97.5th percentiles of the prediction distribution.
- Model validation appears limited. The manuscript does not report Confidence intervals. Statistical significance testing. Sensitivity analysis. Uncertainty quantification.
Without uncertainty analysis, the robustness of conclusions is questionable.
I suggest author will include Bootstrap or cross-validation uncertainty. Sensitivity analysis of key parameters. Error propagation discussion.
Several figures suffer from: Small axis labels. Low resolution. Lack of units. No uncertainty shading.
Figures must be self-explanatory. Captions should fully describe the content without requiring readers to consult the main text.
We sincerely thank the reviewer for this critical comment regarding uncertainty analysis. We fully agree that robustness assessment is essential for validating our modeling framework. We have comprehensively addressed this concern through the following revisions:
We have implemented Bootstrap resampling (1000 iterations) to quantify the uncertainty of all performance metrics reported in the study. Specifically:
- We calculated 95% confidence intervals (CI) for R² for all model configurations (classical RF vs. quantum-enhanced RF) across all hydrological seasons and input types.
- The Bootstrap CIs are now reported in all supplementary performance tables (Table S1-S6).
- We note that LOOCV alone cannot provide direct confidence intervals due to its single-sample validation nature; the Bootstrap approach effectively addresses this limitation by resampling prediction residuals.
The revised manuscript (Section 2.6.2) now explicitly describes this dual-evaluation methodology. In the Results section (Sections 3.4.2 and 3.4.3), we discuss these confidence intervals to demonstrate model stability. The narrowing of CI widths with increasing virtual sample sizes (e.g., Normal season RF R2 CI width reduced from 0.1972 at original to 0.0106 at 10-fold augmentation) demonstrates that our augmentation strategy effectively reduces estimation variance.
We agree that a comprehensive sensitivity analysis is crucial for understanding the driving mechanisms of the model and validating the physical interpretability of the input parameters. In response, we have added a new subsection Section 3.6 “Sensitivity analysis of key parameters” in the revised manuscript. In this section, we employed Sobol global sensitivity indices, feature perturbation sensitivity, and elasticity coefficients to quantify the influence of hydrochemical parameters (pH, T, EC, DO, ORP, Salt, TDS) on nitrate predictions across different hydrological seasons (normal, dry, and wet).
The key findings added to the manuscript include:
TDS and EC were identified as the most sensitive parameters across all seasons (Total-order Sobol indices > 0.25), confirming their role as robust proxies for nitrate transport processes, corroborating their central role identified in the Bayesian and correlation analyses.
We observed distinct seasonal variations in sensitivity. Specifically, pH sensitivity increased significantly during the dry season, reflecting evaporative concentration effects. ORP sensitivity was higher during the wet season, likely due to rainfall-induced redox fluctuations.
The analysis demonstrated that increasing virtual sample augmentation (from 1x to 10x) stabilized the sensitivity indices, reducing variance caused by small-sample bias (e.g., Dry season pH Sobol index variance decreased with augmentation).
We agree that a systematic discussion of error propagation is essential for evaluating model robustness and uncertainty quantification in environmental prediction tasks.
Accordingly, we have incorporated a comprehensive error propagation discussion in “Section 4.3 Error propagation under virtual sample” of the revised manuscript. Specifically, we performed Monte Carlo simulations (N=500 iterations) assuming a 5% coefficient of variation for input parameters to quantify how measurement errors propagate through both classical Random Forest and hybrid Quantum-Classical Random Forest models under varying degrees of virtual sample augmentation.
We have carefully revised all figures throughout the manuscript to address these concerns. Our specific actions are as follows: We have significantly enlarged the font size of all axis labels, tick marks, and legends to ensure readability. Additionally, all figures have been exported and embedded at a higher resolution (300 dpi) to meet publication standards. We have meticulously reviewed all figures and ensured that appropriate physical units are now clearly indicated on all axes and in the figure captions where applicable. To maintain clarity and avoid visual clutter in complex multi-seasonal comparison plots, we have reported these precise Bootstrap 95% CIs alongside the performance metrics in Tables S1-S6 (Supplementary Material). This allows readers to access exact quantitative bounds of uncertainty for each model and season. We have ensured that boxplots are used to compare observed and predicted concentrations. These boxplots inherently display the interquartile range (IQR), whiskers, and outliers, providing a clear visual indication of distributional uncertainty and model stability across different data augmentation levels.
We believe these improvements have significantly enhanced the clarity, rigor, and readability of the figures and the overall manuscript.
We agree that making figures self-explanatory is crucial for improving the readability and accessibility of the manuscript. Accordingly, we have thoroughly reviewed and revised the captions for all figures (Fig. 1 to Fig. 22) in the revised manuscript.
The specific modifications are as follows:
3.4 Model performance evaluation
3.4.1 Virtual data analysis
To address the modeling bias arising from limited measured samples, this study constructed virtual datasets at 1-10 times the original scale based on a strategy combining t-SNE dimensionality reduction, GMM clustering sampling, and KNN inverse mapping, to enhance the robustness of model training. Taking the 10x virtual dataset as an example, the statistical characteristics (Table 4) show that the virtual samples effectively reproduced the central tendency and dispersion of the original data. For the normal season, the mean NO3- concentration was 30.41 mg L-1 (vs. observed mean of 33.67), with a standard deviation of 28.78 (vs. 35.83) and a coefficient of variation (CV) of 0.95 (vs. 1.06). In the dry season, the maximum value of the virtual samples reached 178.09 mg L-1, while this did not fully replicate the extreme high values (observed maximum of 358.58 mg L-1), it effectively expanded the range of the heavy-tailed distribution. For the wet season, although the CV for all indicators was slightly lower than the measured values, their ranges (8.47-80.37 vs. 4.15-98.36 mg L-1) still showed a high degree of overlap, indicating that no systematic distortion was introduced.
Table 4. Statistical characteristics of different virtual samples.
Periods
pH
T
EC
DO
ORP
Salt
TDS
NO3-
Unit
℃
μs cm-1
mg L-1
mv
ppt
mg L-1
mg L-1
Normal season
Max
8.32
16.2
963.6
9.31
101.02
0.42
628.6
124.03
n=660
Min
7.95
13.88
378.2
3.52
-48.28
0.12
245
5.10
Mean
8.18
14.72
527.31
6.66
2.09
0.20
342.72
30.41
SD
0.07
0.54
148.31
1.66
25.97
0.08
96.92
28.78
CV
0.01
0.04
0.28
0.25
12.44
0.41
0.28
0.95
Dry season
Max
8.19
17.59
987.8
8.10
69.18
0.44
641.8
178.09
n=650
Min
7.04
14.44
417.6
2.68
-59.14
0.132
267.8
5.19
Mean
7.35
15.49
661.79
6.414
0.52
0.27
429.90
37.75
SD
0.44
0.69
170.65
1.24
29.61
0.09
111.16
33.13
CV
0.06
0.04
0.26
0.19
57.21
0.34
0.26
0.88
Wet season
Max
8.21
18.88
872.8
8.54
56.4
0.37
569
80.37
n=500
Min
6.31
16.12
395.8
5.34
-77.64
0.13
257.6
8.47
Mean
7.38
16.93
535.7
7.06
13.75
0.20
348.24
25.85
SD
0.33
0.62
131.02
0.78
28.85
0.08
86.05
19.88
CV
0.04
0.04
0.24
0.11
2.10
0.38
0.25
0.77
Standardized multivariate boxplots (Fig.12) visually confirm that the median, interquartile range (IQR), whisker length, and outlier distribution of the virtual data for each period were highly similar to the measured data, demonstrating that the central tendency and dispersion characteristics were well-preserved. Hydrological seasonal characteristics, such as high EC/TDS/Cl-/NO3- in the dry season and low, concentrated NO3- in the wet season, were also accurately preserved. Although the number of some newly added outliers slightly increased, they all fell within physically reasonable ranges, with no non-physical solutions, such as negative concentrations or out-of-bounds pH values, occurring. Fig.13 presents a comparison of nitrate concentration frequency distributions between the original and virtual datasets across normal, dry, and wet periods. The distributional comparison indicates that the proposed t-SNE + GMM + KNN inverse mapping virtual sample generation strategy maintains the core features of the NO3- distribution for each hydrological period, while simultaneously improving sample representation in sparse areas and intervals of high variability. Therefore, the t-SNE + GMM method effectively captured the non-linear structure and extreme value information of the original data, and can provide reliable data support for subsequent model training.
Fig.12. Box plots of the observed and virtual variable data at different periods.
Citation: https://doi.org/10.5194/egusphere-2026-272-AC1 -
AC2: 'Reply on RC1', Junjie Xu, 31 Mar 2026
Responses to RC1
The manuscript addresses an important geoscientific problem and presents potentially valuable findings. However, several methodological, structural, and interpretative issues limit the clarity, reproducibility, and robustness of the conclusions. Substantial revision is required before the manuscript can be considered for publication. The topic is relevant and potentially publishable but the manuscript requires major revision to improve its quality.
- Page 2, Lines 10–15. The introduction provides general background but does not clearly articulate the precise research gap. While prior studies are cited, it remains unclear What specific limitation of previous work is being addressed? Whether this study offers methodological novelty or merely a regional application? How the proposed approach differs from existing frameworks?
A clearer paragraph explicitly stating identified gap, the proposed advancement, and the expected contribution is necessary.
We sincerely appreciate this insightful comment. The reviewer is correct that the original Introduction lacked explicit articulation of specific research gaps and methodological distinctions. We have thoroughly revised the Introduction to address these concerns:
- Explicit articulation of research gaps (What specific limitation is addressed):
We have added a new paragraph that explicitly identifies three critical gaps in previous work:
The data augmentation paradox: Existing virtual sample methods (GMM, VAEs, GANs) either fail to preserve non-linear geochemical manifold structures or require large training sets that are precisely lacking in environmental monitoring (Farnia et al., 2023; Zhou et al., 2025);
The unexplored potential of remote sensing embeddings: While AEF has been applied to surface parameters, its utility for subsurface groundwater quality prediction across hydrological seasons remains untested (Tollefson, 2025; Li et al., 2025);
The unvalidated quantum-classical hybridization: Although QML offers theoretical advantages for complex pattern recognition, its practical efficacy in small-sample environmental modeling has not been systematically evaluated (Hong et al., 2025; Oliveira et al., 2025).
- Clarification of methodological novelty vs. regional application:
We have substantially rewritten the study objectives paragraph to explicitly distinguish methodological innovation from regional case study. The revised text clearly states that this study "develops a methodologically novel integrated framework that transcends mere regional application" and details three specific methodological advancements: (i) the t-SNE-GMM-KNN strategy for structure-preserving data augmentation; (ii) the hybrid quantum-classical RF architecture utilizing analytical quantum feature encoding; and (iii) the first application of AEF embeddings for groundwater nitrate screening.
- Distinction from existing frameworks (How the approach differs):
We explicitly contrast our approach with existing frameworks in the revised objectives paragraph: "Unlike previous studies that treat data augmentation, model architecture, and input features independently, this framework provides a unified, scalable solution that integrates virtual sample generation, quantum-enhanced machine learning, and remote sensing semantic features to simultaneously address data sparsity, seasonal heterogeneity, and feature extraction challenges."
Thank you for this valuable suggestion, which has significantly improved the clarity and rigor of our work.
The specific modifications are as follows:
- Introduction
Nitrate (NO3-) contamination in groundwater poses a serious threat to drinking water safety and ecosystem health, particularly in intensively managed agricultural regions (Wang et al., 2021). In China, groundwater nitrate pollution is a growing concern, national monitoring data from 2013 to 2017 revealed a nitrate exceedance rate exceeding 10% relative to the Class III limit (88 mg L-1 as NO3-) of the Chinese Groundwater Quality Standard (GB/T 14848-2017), with Hebei Province reporting an alarming rate of 31.66% in 2017 (Li et al., 2019). Over recent decades, escalating nitrate concentrations in surface and groundwater have been driven by intensified fertilizer use in agriculture, along with discharges of industrial and domestic wastewater (Zhang et al., 2018). Severe nitrate exceedances are especially prevalent in northern and northwestern China (Gu et al., 2013), where key contributors include domestic and industrial effluents, nitrification of soil organic nitrogen, and synthetic fertilizer application (Han et al., 2016). For instance, in the North China Plain, shallow groundwater nitrate exceedance rates range from 9.5% to 34.1% relative to the WHO drinking water guideline of 50 mg L-1 NO3-, and a rising trend persists at the regional scale, particularly in agricultural areas (Wang et al., 2018). In monsoonal temperate regions, seasonal shifts in precipitation, evapotranspiration, and groundwater recharge profoundly influence the transport, dilution, and accumulation of nitrate, leading to pronounced intra-annual variability in its concentration and spatial distribution (Gao et al., 2023; Zhu et al., 2025). Consequently, understanding and forecasting nitrate dynamics across hydrological seasons is essential for informed groundwater management and pollution mitigation, but remains a formidable challenge due to the nonlinearity, high dimensionality, and data scarcity inherent in such systems (Deng et al., 2023).
Traditional monitoring and modeling approaches face three critical limitations. First, field sampling campaigns though providing high-fidelity hydrochemical data are inherently sparse in space and time, especially for large-scale or rapidly changing environments (Viswanathan et al., 2022), which are time-consuming, labor-intensive, and costly, limiting the spatial and temporal coverage of data (Cai et al., 2025). Second, while process-based models incorporate physical mechanisms, they require extensive parameterization and are computationally prohibitive for dynamic, multi-season forecasting at farm-to-regional scales (Feng et al., 2022). Hydrological seasonal variations (normal, dry, and wet seasons) significantly influence the migration and transformation of nitrogen in the soil-groundwater system (Chen et al., 2025). For instance, concentrated rainfall during the wet season (accounting for 60%-80% of annual precipitation) can promote the leaching of surface nitrogen into groundwater, leading to a 25-fold increase in stream nitrate concentrations during storm events compared to baseflow (Sebestyen et al., 2014), meanwhile, intense evaporation in the dry season leads to the accumulation of nitrate in shallow aquifers, where concentrations can exceed the US EPA drinking water standard of 10 mg L-1 by 2-3 times (Liu et al., 2025; Cox et al., 2016). These seasonal differences result in distinct hydrochemical characteristics and nitrate concentration distributions, increasing the complexity of prediction models (Wu et al., 2025). Third, even advanced machine learning (ML) techniques such as Random Forest (RF), despite their robustness to nonlinearity and multicollinearity, still rely heavily on sufficient representative samples to capture the multi-modal distribution and tail behavior of environmental variables, particularly for heavy-tailed pollutants like NO3- (Luo et al., 2022). Moreover, the small sample sizes obtained from discrete sampling often lead to data sparsity and skewed distributions, reducing the model's generalization ability by 30%-50% when applied to unmonitored areas and compromising the robustness and generalization ability of machine learning (ML) models trained on such data (Thunyawatcharakul et al., 2025; Wang et al., 2024).
Despite these recognized limitations of conventional approaches, existing solutions remain fragmented and insufficient for seasonal groundwater nitrate prediction. Specifically, current virtual sample generation methods, whether statistical (e.g., SMOTE, GMM) or deep learning-based (e.g., VAEs, GANs), suffer from a critical paradox: they either fail to preserve the complex non-linear manifold structure of high-dimensional geochemical data or inherently require large training datasets that are unavailable in typical environmental monitoring scenarios (Farnia et al., 2023; Tung et al., 2023; Zhou et al., 2025). Meanwhile, while remote sensing foundation models such as Google's AlphaEarth Foundation (AEF) have demonstrated success in surface parameter estimation, their applicability to subsurface groundwater quality prediction, particularly for dynamic seasonal modeling, remains entirely untested (Tollefson, 2025; Li et al., 2025). Furthermore, although quantum machine learning (QML) theoretically offers advantages in capturing complex non-linear relationships through exponentially high-dimensional Hilbert space mapping, its practical efficacy in small-sample environmental datasets, especially when hybridized with classical ensemble methods, has not been systematically evaluated (Hong et al., 2025; Oliveira et al., 2025). Consequently, no existing framework simultaneously addresses the triple challenge of data sparsity, seasonal heterogeneity, and high-dimensional feature extraction for groundwater nitrate forecasting. To bridge these gaps, this study integrates three emerging methodological frontiers: advanced virtual sample generation, remote sensing foundation models, and quantum-enhanced machine learning. Recent advances in these domains offer partial solutions. Gaussian Mixture Models (GMM) and deep generative frameworks (e.g., VAEs, GANs) have shown promise in enriching training data, with GMM achieving an average similarity of 83.0% between unmixed chemical spectra and ground truth in geochemical analysis (Farnia et al., 2023; Tung et al., 2023), yet these approaches exhibit critical constraints: they often fail to preserve the non-linear manifold structure of high-dimensional geochemical space or require large training sets, precisely what is lacking (Zhou et al., 2025). Non-linear dimensionality reduction methods, such as t-SNE, excel at revealing latent clusters corresponding to distinct hydrological processes, with a classification accuracy of 92% for annual daily hydrograph clustering in mountainous watersheds, yet lack explicit generative mechanisms (Wang et al., 2025; Tang et al., 2022). Meanwhile, the rise of foundation models in Earth observation exemplified by Google’s AlphaEarth Foundation (AEF), offers unprecedented opportunities: its 64-dimensional semantic embeddings, derived from multi-sensor satellite time series (including Sentinel-2, Landsat, and Sentinel-1), implicitly encode land use, vegetation phenology, soil moisture, and anthropogenic footprints at 10 m resolution (Tollefson, 2025). These features have been successfully applied in land use classification and crop monitoring, but their potential for predicting groundwater nitrate concentrations, especially across different hydrological seasons remains underexplored (Li et al., 2025). Quantum machine learning (QML) further opens a new frontier. Parameterized Quantum Circuits (PQCs) can map classical inputs into exponentially high-dimensional quantum Hilbert spaces, generating entangled feature representations that reveal complex, non-linear patterns inaccessible to classical kernels (Hong et al., 2025). For ozone concentration forecasting, a hybrid QML model achieved an R2 of 94.12% for 1-hour forecasts and 75.62% for 6-hour forecasts, outperforming classical persistence models by a forecast skill of 31.01-57.46% (Oliveira et al., 2025). Recently, Saberian et al. (2026) developed HydroQuantum, a quantum-driven Python package implementing Quantum LSTM (QLSTM), hybrid quantum-classical LSTM, and Variational Quantum Circuits (VQC) for daily streamflow and stream water temperature simulations across the continental US. Their work demonstrated that quantum-enhanced architectures can effectively capture temporal dependencies in hydrological time series, particularly in snowmelt-driven and regulated basins, providing valuable methodological context for quantum-driven hydrological modeling. Crucially, analytical quantum feature extraction via Pauli-Z expectation values avoids the noisy sampling overhead of near-term quantum hardware, reducing computational latency by ~80% compared to sampling-based methods and making it viable for small-sample environmental modeling (Gujju et al., 2024; Oliveira et al., 2025).
Furthermore, identifying the sources and controlling factors of nitrate pollution is crucial for improving prediction accuracy and guiding targeted pollution control measures. Isotopic analysis (δ15N-NO3- and δ18O-NO3-) combined with the MixSIAR model has proven effective in quantitatively apportioning nitrate sources (Tian et al., 2025). Meanwhile, Bayesian models and SHapley Additive exPlanations (SHAP) analysis can reveal the key environmental variables driving nitrate concentration changes, enhancing the interpretability of prediction models (Alam et al., 2025). Despite these advancements, several gaps persist in the current research: (1) Few studies have integrated hybrid quantum-classical ML with virtual sample augmentation to address small-sample challenges in seasonal nitrate prediction; (2) The potential of AEF remote sensing semantic features for groundwater nitrate prediction remains untested, particularly in comparison with in-situ measured parameters; (3) The combined effects of hydrological seasonal variations, nitrate source apportionment, and key environmental drivers on prediction model performance require systematic investigation.
The North China Plain, an important agricultural production region in China, is characterized by high nitrogen input intensity and significant seasonal hydrological variations, making it an area prone to groundwater nitrate pollution (Liu et al., 2025; Hou et al., 2025). Conducting field-scale research on nitrate pollution in this region is of great significance for the protection of regional water resources. The Xiong’an New Area was specifically selected as the core study site due to its unique combination of national strategic ecological importance, representative hydrogeological conditions, and severe pollution characteristics that exemplify regional challenges. Quantitatively, Hebei Province reported a groundwater nitrate exceedance rate of 31.66% in 2017, significantly higher than the national average, with Xiong'an situated in the heart of this high-risk zone (Xiong et al., 2025). Furthermore, the study site represents a typical high-intensity agricultural system with annual nitrogen inputs ranging from 540 to 660 kg N ha-1, coupled with a shallow groundwater table (5.0-20.0 m) highly susceptible to surface loading (Xu et al., 2022). Moreover, existing investigations reveal severe anthropogenic impacts including legacy nitrogen accumulation (~320 kg N ha-1 a-1 surplus) and sewage irrigation contributing up to 58.3% of groundwater chemical signatures. This setting provides an ideal natural laboratory for testing seasonal prediction frameworks, as the distinct monsoonal climate (60%-80% precipitation concentrated in summer) creates pronounced hydrochemical contrasts between dry, wet, and normal seasons, directly challenging model robustness under data sparsity. (Xu et al., 2022). A mechanistic understanding of how nitrate concentrations vary across these hydrological seasons (normal, dry, wet) and their controlling factors is crucial for regional water resource management. To address the critical gaps outlined above, this study establishes three clear objectives: (1) to develop a t-SNE-GMM-KNN virtual sample generation strategy that preserves geochemical manifold structure while expanding small datasets for robust seasonal modeling; (2) to construct a hybrid quantum-classical Random Forest architecture that enhances feature discriminability without quantum hardware limitations, specifically targeting the high variability and data scarcity characteristic of dry-season nitrate prediction; and (3) to evaluate the feasibility of AlphaEarth Foundation (AEF) semantic embeddings as standalone predictors for rapid groundwater nitrate screening in unmonitored regions. The relevance of this research lies in its provision of a unified methodological framework that transcends single-site application, offering a scalable solution for seasonal groundwater quality prediction in intensively managed agricultural landscapes worldwide. Practically, the results enable three distinct operational modes: (i) mechanistic analysis using field-measurable parameters (pH, EC, TDS, etc.) to elucidate hydrochemical driving processes; (ii) rapid on-site nitrate estimation using portable multi-parameter meters when laboratory analysis is unavailable; and (iii) large-scale regional risk assessment using remote sensing embeddings for areas lacking monitoring infrastructure. By distinguishing between these application scenarios, the framework provides actionable decision-support tools for water resource managers to implement targeted pollution control measures, optimize monitoring network design, and prioritize remediation efforts across hydrological seasons.
Compared with existing research on groundwater quality prediction based on multiple ML algorithms and large datasets, the novelty of this paper lies in three aspects: (1) the t-SNE-GMM-KNN strategy preserves the non-linear manifold structure of high-dimensional geochemical space during generation; (2) we systematically explore the application of hybrid quantum-classical machine learning in groundwater quality prediction, evaluating its robustness and feature discriminability in small-sample seasonal dynamics; and (3) we conduct a comparative analysis of prediction performance using traditional field observation data versus the first-time application of AlphaEarth Foundation (AEF) semantic embeddings for nitrate estimation.
- Page 3, Lines 5–30. The study area description is descriptive but lacks quantitative justification. For example, No map scale or resolution information is provided. Climatic or hydrological statistics are not sufficiently summarized. The selection rationale for this site is weak.
I suggest authors will include a detailed map with coordinate system and scale. A table summarizing key environmental variables. A clear explanation of why this site is scientifically significant.
We appreciate these valuable suggestions. We have revised Figure 1 to include detailed cartographic information.
We agree that a concise summary of key climatic and hydrological variables enhances the reproducibility and contextual clarity of our study. In response, we have added a new table (Table 1) in Section 2.1 "Study area" that systematically summarizes:
(1) Climatic statistics (temperature, precipitation, evaporation, sunshine, and wind);
(2) Soil and vadose zone characteristics relevant to nitrogen transport;
(3) Groundwater system parameters (aquifer type, water table depth, recharge/discharge mechanisms).
We agree that the original manuscript provided a primarily descriptive overview of the study area without sufficiently quantifying the rationale for site selection or its specific scientific significance.
To address this, we have revised the Introduction section to explicitly justify the selection of the Xiong’an New Area based on quantitative indicators and strategic relevance. Specifically, we have added the following details: We highlighted that Hebei Province reported a groundwater nitrate exceedance rate of 31.66% in 2017, significantly higher than the national average, positioning Xiong’an within a critical high-risk zone. We specified the annual nitrogen input rates (540-660 kg N ha-1 yr-1) and shallow groundwater table depth to demonstrate the site's susceptibility to pollution and its representativeness of intensive agricultural regions. We clarified that the distinct monsoonal climate (with 60%-80% precipitation concentrated in summer) creates pronounced seasonal hydrochemical contrasts, making it an ideal natural laboratory for testing the robustness of our proposed seasonal prediction framework under data sparsity.
These additions provide a stronger quantitative justification for the study site and clarify its scientific value for regional water management and methodological validation. Please see the revised Introduction section for these changes.
The specific modifications are as follows:
2.1 Study area
The North China Plain is one of China’s most important agricultural production bases. This study focuses on the Xiong’an New Area, situated in the central part of Hebei Province, as a representative research site within this plain. Located in the core region defined by Beijing, Tianjin, and Baoding, it boasts an advantageous geographical position, with straight-line distances of 105 km to both Beijing and Tianjin, and 30 km to Baoding. Its geographical coordinates range from 38°43′ to 39°10′ N latitude and from 115°38′ to 116°20′ E longitude, covering an area of approximately 1770 km2 (Xiong’an New Area Official Website, 2023). The specific study area is an unmanned farm located in Xieyeqiao Village, Nanzhang Town, Rongcheng County, within the Xiong'an New Area (Fig.1). The farm covers an area of 3000 hectares and primarily cultivates two main grain crops: wheat and corn. As the first mechanized unmanned farm in Xiong'an, it has achieved full mechanization and intellectualization, enabling unmanned, precise, and standardized operations throughout all stages of tillage, sowing, management, and harvesting.
Fig.1. Study area map showing the sampling location and sampling points.
Fig.2. Temperature and precipitation trends in study area (1952-2025).
Cultivated land constitutes a significant proportion of the total area in Xiong’an New Area and is predominantly dryland. Traditional fertilization practices in this region involve high application rates of nitrogen fertilizer and manure. As a representative farm in the region, the study site adheres to these traditional practices, making it susceptible to the impacts of high fertilization intensity. The annual nitrogen fertilizer application rate at the study site ranges from 540 to 660 kg (N) ha-1 yr-1, primarily supplied in the form of urea (46% N). Consequently, the extensive application of chemical fertilizers and manure has increased the risk of groundwater nitrogen pollution. Furthermore, the relatively dense rural population distribution leads to pollution from the discharge of domestic sewage in the vicinity. Irrigation is scheduled according to crop phenological stages. Wheat receives muddy water irrigation before sowing, and during the overwintering, reviving, and jointing stages, while maize receives a single muddy water irrigation after sowing.
The climate is characterized as a temperate continental monsoon climate. Springs are dry with little rain, summers are humid and rainy, autumns are cool and dry, and winters are cold with minimal snowfall. The mean annual temperature is 12.6°C, with relatively small inter-annual fluctuations; the lowest temperatures occur in January, averaging -5°C, while the highest occur in July, averaging 26.1°C. The mean annual precipitation is 480.8 mm, which is highly concentrated between June and September, accounting for nearly 80% of the annual total. The long-term temperature and precipitation changes in the study area are shown in Fig.2. The multi-year average evaporation is 1,661.1 mm (Liao et al., 2020), and the multi-year average water surface evaporation is 1,761.7 mm (Sun, 2024). The mean annual sunshine duration is 2,335.2 hours, with longer sunshine hours in spring and summer and shorter hours in autumn and winter. The average frost-free period lasts 204 days. The mean annual wind speed is 1.7 m s-1, with the highest average speeds occurring in April and the lowest in January, August, and December.
The soil texture is predominantly silt loam. The 2-8.5 m soil layer contains interlayers with high clay content, such as clay and silty clay, reflecting the sedimentary characteristics of the vadose zone in the Central Plains under geomorphic deposition. Nitrogen in the thick vadose zone exists predominantly as organic nitrogen, accounting for approximately 97% of the total nitrogen content. The shallow vadose zone at depths of 3-6 m stores the highest amount of nitrate, accounting for approximately half of the total nitrate reserves in the North China Plain (Li et al., 2025; Zhang et al., 2007). Groundwater in the study area is primarily stored in Quaternary unconsolidated porous aquifers, with sampling well depths ranging from 70 to 120 m (Bai et al., 2023). The lithology of the shallow aquifer group is dominated by sand layers with medium water abundance, such as silty fine sand, fine sand, and medium sand, with single-well yields ranging from approximately 1,000 to 3,000 m3/d. Based on groundwater survey data from June 2022 (Xiong’an Urban Geology Research Center, China Geological Survey, 2023), the depth to the shallow groundwater table in Xiong’an New Area ranges from 5.0 to 35.0 m. Water level fluctuations are primarily influenced by atmospheric precipitation infiltration and agricultural extraction (Ma et al., 2022). The primary source of groundwater recharge for farmland in the study area is atmospheric precipitation, while the main pathway of discharge is artificial extraction for agricultural irrigation. The detailed data on climate and hydrological indicators in the research area are shown in Table 1.
Table 1. Summary of key climatic and hydrological variables in the study area.
Category
Variable
Value / Range
Unit
Temperature
Mean annual temperature
12.6
°C
Mean January temperature
-5.0
°C
Mean July temperature
26.1
°C
Precipitation
Mean annual precipitation
480.8
mm
Rainy season (Jun-Sep) contribution
~80
% of annual total
Evaporation
Mean annual evaporation
1,661.10
mm
Mean annual water surface evaporation
1,761.70
mm
Solar & Wind
Mean annual sunshine duration
2,335.20
h
Mean annual wind speed
1.7
m s-1
Frost-free period
~204
days
Soil & Vadose Zone
Dominant soil texture
Silt loam
—
Organic N proportion in vadose zone
~97
% of total N
Peak nitrate storage depth
3-6
m
Groundwater
Aquifer type
Quaternary unconsolidated porous
—
Sampling well depth
70-120
m
Shallow water table depth
5.0-35.0
m
Single-well yield
1,000-3,000
m3 d-1
Dominant recharge source
Precipitation infiltration
—
Dominant discharge pathway
Agricultural extraction
—
- Page 4, Lines 12. The manuscript does not adequately describe Data temporal resolution. Data completeness. Missing data handling procedures. Quality control protocols.
If interpolation or smoothing was applied, it must be explicitly stated. Reproducibility requires transparent preprocessing documentation.
We sincerely thank the reviewer for pointing out the need for more detailed documentation regarding data temporal resolution, completeness, missing data handling, quality control (QC) protocols, and preprocessing transparency. We agree that these details are crucial for reproducibility.
In the revised manuscript, we have made the following modifications to address these concerns:
Data Temporal Resolution: We explicitly clarified that groundwater data represents seasonal snapshots, while AlphaEarth Foundation (AEF) data provides annual temporal resolution.
Data Completeness & Missing Data: We added a statement confirming that samples with incomplete records were excluded prior to modeling, ensuring 100% completeness in the final dataset, and clarified that no imputation was used for raw data.
Interpolation/Smoothing: We explicitly stated that no interpolation or smoothing was applied to the original observational data to preserve raw variability. The only exception is the spatial visualization of nitrate concentration distributions, where ordinary Kriging interpolation was applied solely for mapping purposes using Surfer software (Golden Software, LLC). This interpolation was not used for data augmentation, missing value imputation, or model input generation. Virtual sample generation is strictly for augmentation, not filling missing values.
Preprocessing Documentation: We enhanced the preprocessing workflow description to ensure transparency and reproducibility (Section 2.7.4).
The specific modifications are as follows:
2.2 Data collection and measurements
2.2.1 Field sampling data and laboratory analysis
Field investigations and the collection of hydrochemical and isotopic samples were conducted in the study area from 2022 to 2023. The temporal resolution for groundwater sampling was designed as seasonal snapshots, with campaigns occurring in October 2022 (normal season), April 2023 (dry season), and August 2023 (wet season), representing a temporal interval of approximately 4-6 months between surveys. A total of 66, 65, and 50 groundwater samples were collected in October 2022, April 2023, and August 2023, respectively. All groundwater samples were obtained from existing agricultural irrigation wells within the study area. Prior to sample collection, each well was purged by pumping. Sampling commenced only after the pumped volume exceeded three times the well's casing volume and on-site parameters had stabilized (i.e., showing minor fluctuations around a constant value rather than a continuous rising or falling trend), a procedure implemented to ensure the representativeness of the samples. At each sampling point, one 1000 mL and two 100 mL samples were collected. Before final collection, the sample bottles were rinsed three times with the water to be sampled. Immediately after collection, the samples were sealed and stored in a portable cooler for transport to the laboratory for subsequent analysis. Furthermore, the precise geographical location of each sampling point was recorded using a GPS device. Regarding data completeness and missing data handling, all collected samples underwent rigorous screening. Samples with incomplete physicochemical records or those failing initial quality checks were excluded from the final dataset. Consequently, the modeling dataset achieved 100% completeness with no missing values requiring imputation or interpolation.
In-situ physicochemical parameters were measured using a Hach HQ400 multi-parameter water quality meter (Li et al., 2022). The measured parameters included water temperature (T, °C), pH, total dissolved solids (TDS, mg L-1), dissolved oxygen (DO, mg L-1), electrical conductivity (EC, μS cm-1), and oxidation-reduction potential (ORP, mV). These parameters can be acquired rapidly in the field with minimal sample preparation, offering a practical advantage over laboratory-based nitrate determination which requires sample preservation, transportation, and longer analytical processing times. This operational efficiency motivates the use of field-measurable parameters as predictors for preliminary nitrate concentration estimation. The concentration of HCO3- was determined within 24 hours of sample collection using the dilute sulfuric acid-methyl orange titration method (Huang et al., 2012). Prior to the determination of cations and anions, water samples were filtered through 0.45 μm membrane filters. Major cations (K+, Ca2+, Na+, Mg2+) were analyzed using an inductively coupled plasma optical emission spectrometer (Avio 500). Major anions (NO3-, Cl-, SO42-) were analyzed using an ion chromatograph (ICS-2100). The analytical precision for cations and anions was controlled within ±0.2 mg L-1, and the charge balance error was maintained within 5% to ensure reliability. The concentrations of nitrite nitrogen and ammonia nitrogen were determined using a flow injection analyzer (Smartchem 200, AMS Alliance) and measured using dual wavelength spectrophotometry and the indophenol blue method (Kim et al., 2019; Sun et al., 2022). The limits of detection for nitrite nitrogen and ammonium nitrogen were both 0.01 mg L-1. For the analysis of stable hydrogen and oxygen isotopes, water samples were filtered through 0.22 μm membrane filters and measured using an LGR liquid water isotope analyzer (TIWA-45-EP). The analytical precisions for δ2H, δ17O, and δ18O were ±0.15‰, ±0.02‰, and ±0.02‰, respectively (Hamidi et al., 2023). The isotopic compositions of nitrate (δ18O-NO3- and δ15N-NO3-) were determined using a MAT-253 mass spectrometer coupled with an elemental analyzer (Li et al., 2022). To ensure analytical precision, standard references, reagent blanks, and duplicate samples were employed. Furthermore, international standards USGS 34 and USGS 35 were used for δ18O quality control, while USGS 32 and USGS 34 were used for δ15N quality control. All isotope results are reported in per mil (δ, ‰). No interpolation or smoothing techniques were applied to the raw hydrochemical data to preserve the inherent variability of the groundwater system. The sole exception was the spatial visualization of nitrate concentration distributions(Fig.7), where ordinary Kriging interpolation was applied exclusively for cartographic representation using Surfer software(Golden Software, LLC). This interpolation was used solely for generating continuous surface maps and was not employed for data augmentation, missing value imputation, or model input generation.
2.2.2 Google AlphaEarth Foundation
To facilitate comparisons with predictions based on in-situ field sampling data and to validate the accuracy of predicting groundwater nitrate concentration using remote sensing data, this study incorporates the Google AlphaEarth Foundation (AEF) dataset. AEF is a collection of high-dimensional surface semantic embedding features generated via pre-training on multi-source remote sensing data (Brown et al., 2025). By fusing imagery from Sentinel-2, Landsat, and other Earth observation satellites, this dataset constructs a 64-dimensional vector representation (denoted as A00-A63) at a global scale with an annual temporal resolution and a 10 m spatial resolution (Alvarez et al., 2025). This annual temporal resolution provides consistent surface context for each hydrological year, complementing the seasonal groundwater snapshots. These embeddings implicitly encode complex environmental semantics, such as land cover types, vegetation dynamics, soil moisture, and the intensity of human activity, and have been successfully applied in tasks including land use classification, crop monitoring, and environmental risk modeling (Tollefson et al., 2025).
The primary processing workflow involved spatially sampling the 64-dimensional AEF vectors at a 10 m resolution using the GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL product on the Google Earth Engine (GEE) platform, based on the geographic coordinates of the field sampling points. To ensure data quality, only samples exhibiting exact matches between the GEE extraction and the actual field data points were retained. Any pixels with significant cloud cover or data gaps in the underlying satellite imagery were automatically handled by the AEF pre-processing pipeline, ensuring complete feature vectors for all sampling locations. Given the redundancy within the initial 64-dimensional AEF features, Principal Component Analysis (PCA) based on Singular Value Decomposition (SVD) was employed for feature compression. Specifically, SVD was performed on the centered feature matrix to select the minimum number of principal components accounting for at least 95% of the cumulative explained variance (Ilyas et al., 2025). This threshold was selected to balance information retention with dimensionality reduction, ensuring robust model input without overfitting (Ilyas et al., 2025). The orthogonalized, low-dimensional principal component scores were subsequently used as model inputs. This approach preserves the vast majority of the semantic information from the original embeddings while significantly mitigating the risk of overfitting. Ultimately, the PCA-reduced AEF features served as the input variables for the model.
2.7.4 Standardized prediction workflow
To systematically evaluate the nitrate concentration prediction capabilities of different input variables and modeling strategies across various hydrological seasons, and to validate the effectiveness of virtual sample generation for small-sample modeling, this study established a standardized prediction pipeline (Fig.3). The specific steps are as follows: (1) Data Preprocessing and Grouping: Observed samples were partitioned by seasons. Z-score normalization was applied separately to two types of input features: field water quality parameters and AlphaEarth Foundation (AEF) features reduced via Principal Component Analysis (PCA). Transparent preprocessing documentation was maintained throughout this process. No interpolation or smoothing was applied to the original observational data to preserve raw variability. The virtual sample generation described in subsequent steps was used solely for dataset generation to improve model robustness, not for filling missing raw data values. (2) Virtual Sample Generation and Validation: A t-SNE-GMM-KNN strategy was employed to generate virtual samples. Generation factors from 1x to 10x were systematically evaluated. Their physical plausibility and distribution consistency were rigorously verified using statistical indicators, box plots, and histograms. (3) Model Training: Under unified hyperparameters, classical Random Forest (RF) and quantum-enhanced RF models were constructed. The latter generates <Z> quantum features via Parameterized Quantum Circuits (PQC) encoding, which are concatenated with original features to form 2*n input features. Models were trained using two distinct input datasets and combinations of "original samples + 1~10× virtual samples." (4) Model Evaluation: The Leave-One-Out Cross-Validation (LOOCV) strategy was adopted to calculate R2, RMSE, and MAE. Visual diagnostics were performed using observed-predicted scatter plots, residual plots, and box plots. (5) Interpretability Analysis: Multi-scale interpretation was conducted based on the SHAP framework, including summary plots (global importance ranking), dependence plots (nonlinear response and interaction effects), and waterfall plots (local attribution). The driving mechanisms were cross-verified with results from Bayesian models and Pearson correlation analysis. This workflow encompasses the full process from data generation, modeling, and evaluation to attribution, providing a reproducible and highly transparent solution for precise groundwater nitrate prediction under conditions of small samples, multiple seasons, and multi-source inputs.
To ensure reproducibility, all computational tasks were performed on an Apple M2 Pro (10-core CPU, 16GB RAM) running Python 3.9. Core machine learning and preprocessing tasks utilized scikit-learn (version 1.7.2), quantum feature mapping and statevector simulation were conducted using qiskit (version 1.4.5) with CPU-based analytical calculation to ensure precision without hardware noise, model interpretability analysis employed shap (version 0.49.1), data manipulation was performed with pandas (version 2.3.3) and numpy (version 2.2.6), and visualization was generated using matplotlib (version 3.9.0). Statistical tests and scientific computing relied on scipy (version 1.15.3), and parallel processing was managed by joblib (version 1.5.2). All random operations were seeded (random_state=42) to ensure deterministic results. Complete code and generated datasets are available upon request.
Fig.3. Process diagram for constructing prediction framework.
- Page 5–7. The methodological section lacks A clear workflow diagram. Parameter selection criteria. Hyperparameter tuning explanation (if applicable). Justification of threshold values used.
If statistical or machine learning methods are employed, the following must be specified Training/testing split strategy. Cross-validation method. Performance metrics definitions. Software and version. Currently, the method description is too general for replication.
We sincerely thank the reviewer for these constructive comments regarding the reproducibility and clarity of our methodological framework. We have thoroughly revised the Materials and Methods section to address each point:
- Workflow Diagram: We have refined Fig. 3to explicitly illustrate the end-to-end pipeline, including data preprocessing, virtual sample generation, quantum feature encoding, hyperparameter tuning, and model evaluation. The text description in Section 2.7.4 has been expanded to walk the reader through each step of the diagram.
- Parameter Selection & Hyperparameter Tuning: We have added detailed explanations in Section 2.4 and Section 2.5. Specifically, we clarified the quantum circuit parameters (e.g., reps=1, n_qubits equal to feature dimension), the t-SNE perplexity setting (10, optimized for small samples), the GMM component selection via BIC minimization (range 1-7), and the KNN inverse mapping parameters (n_neighbors=5). The grid search ranges for the Random Forest (e.g., n_estimators, max_depth) have also been specified.
- Threshold Values: Justifications for thresholds, such as the 95% cumulative variance for PCA in AEF feature processing (Section 2.2.2), the perplexity setting for t-SNE (Section 2.4), and the physical constraints for virtual samples (e.g., NO3-≥ 0 mg L-1) have been clarified.
- Training/Testing & CV Strategy: The LOOCV strategy is now explicitly defined in Section 2.6.2, explaining why it was chosen over standard k-fold split (due to limited samples).
- Performance Metrics: Definitions forR2, RMSE, and MAE are provided with formulas in Section 2.7.2.
- Software and Version: A new paragraphhas been added to list all key packages (Python, scikit-learn, Qiskit, etc.) to ensure replicability.
- Virtual Sample Generation Details: Section 2.4 has been substantially expanded to include the complete technical workflow: KNN-based missing value imputation, t-SNE dimensionality reduction parameters, GMM component optimization via BIC, KNN inverse mapping for reconstructing original feature space, and physical constraint enforcement. Code-level implementation details are now provided to ensure full reproducibility.
The specific modifications are as follows:
2.4 t-SNE-GMM-KNN: based on nonlinear structure modeling in feature space
To address the challenges of overfitting and poor generalization performance in small-sample modeling, which arise from data sparsity and skewed distributions, this study proposes a three-stage virtual sample generation strategy termed t-SNE-Gaussian Mixture Sampling with KNN Inverse mapping. This method aims to preserve the non-linear manifold structure and multi-modal distribution characteristics of the original high-dimensional feature space while generating physically plausible and statistically consistent virtual samples. Crucially, virtual samples are generated exclusively in the hydrochemical feature space (spatial dimension) for each hydrological season independently, with no temporal extrapolation or cross-season data fusion, to avoid altering the inherent seasonal hydrochemical heterogeneity identified in this study. The “spatial dimension” here refers to the high-dimensional hydrochemical feature space composed of in-situ measured water quality parameters and nitrate concentration. These virtual samples augment the hydrochemical attribute space within each specific hydrological season (normal, dry, wet) to increase population density for model training, rather than representing new physical spatial locations or distinct time points. The specific workflow is as follows:
- Data standardization
All input features are standardized using Z-score standardization to eliminate scale differences and enhance the stability of the subsequent dimensionality reduction (Jamshidi et al., 2022). The StandardScaler from scikit-learn was applied to transform each feature to zero mean and unit variance. Notably, the target variable (NO3-) is included alongside predictive variables (pH, T, EC, DO, ORP, Salt, TDS) in this standardization step to ensure the joint distribution between predictors and the target is preserved during generation.
- t-SNE non-linear dimensionality reduction
t-Distributed Stochastic Neighbor Embedding (t-SNE) is employed to map the high-dimensional feature space into a low-dimensional latent space (d=2) (Islam et al., 2023). t-SNE is selected for its superior ability to preserve the local non-linear manifold structure of high-dimensional hydrochemical data compared to linear dimensionality reduction methods such as PCA, which is critical for capturing clustered subgroups corresponding to distinct hydrological and contamination processes (Wang et al., 2025). To balance the preservation of local and global structures, the perplexity is set to 10 based on empirical testing for small sample sizes (n<100), which optimizes the balance between local neighborhood preservation and global structure maintenance (Kobak et al., 2019). PCA initialization is used to ensure reproducibility and accelerate convergence, with a maximum iteration number of 1000 and a gradient descent tolerance of 1e-5 set as the convergence criteria. t-SNE effectively reveals the clustered structure of samples on the low-dimensional manifold, reflecting the differentiation of underlying environmental processes within hydrological seasons (Liu et al., 2021). The implementation used the scikit-learn TSNE module with random_state=42 for reproducibility.
- GMM clustering and optimal component selection
In the t-SNE-reduced low-dimensional space, a Gaussian Mixture Model (GMM) is constructed to characterize the probability density distribution of the data (Jia et al., 2022). The GMM assumes that the data are generated from a linear combination of several Gaussian distributions. The GMM captures the multimodal nature of the hydrochemical data (e.g., distinct clusters corresponding to different contamination levels or redox conditions). The weights, means, and covariance matrices of each Gaussian component are estimated via the Expectation-Maximization (EM) algorithm, thereby accurately capturing the complex distribution patterns of the data (Yan et al., 2023). To avoid subjectively setting the number of clusters, the Bayesian Information Criterion (BIC) is used to automatically optimize the number of components, K, within the range (1 to 7, where the upper bound was set to approximately n/10 to prevent overfitting given the small sample size) (Ghodba et al., 2025):
(5)
where L is the model’s likelihood, k is the total number of free parameters for a K-component model, and n is the sample size. The value of K corresponding to the minimum BIC is selected as the optimal number of components, ensuring a balance between goodness-of-fit and model complexity. The GMM was implemented with full covariance type and random_state=42 for reproducibility.
- Virtual sample generation and inverse mapping
Based on the optimal GMM, a specified number of virtual points are randomly sampled from its joint probability distribution. In this study, generation factors ranging from 1x to 10x the original sample size were tested, with the 10x generation (n_virtual = n_original x 10) yielding optimal model performance. This generation process naturally inherits the multi-modality and covariance structure of the original data. Since NO3- was included in the initial feature space, the generated low-dimensional points inherently contain information corresponding to both input variables and nitrate concentrations. Based on the optimal GMM, virtual samples are generated by randomly sampling from the joint probability distribution in the 2D t-SNE latent space. To reconstruct these synthetic 2D points back into the original 8-dimensional geochemical feature space (including NO3- concentrations), a K-Nearest Neighbors (KNN) regressor is employed as an inverse mapping function (Niu et al., 2025). The KNN regressor was configured with n_neighbors=min(5, n_original-1) to ensure sufficient reference points while maintaining local fidelity. This neighbor number is selected to balance the smoothness of inverse mapping and the preservation of local data characteristics, avoiding over-smoothing of extreme values. The KNN model is fitted on the original t-SNE embedding results (as independent variables) and the original standardized high-dimensional hydrochemical data (as dependent variables), establishing a non-linear mapping relationship from the low-dimensional latent space to the original high-dimensional feature space. Finally, the virtual sample set in its original physical units is obtained by applying inverse standardization. This inverse mapping process synchronously generates the virtual values of all input hydrochemical variables and the corresponding target NO3- concentration for each virtual sample, ensuring the one-to-one correspondence between input predictors and the target variable required for subsequent random forest (RF) modeling. During subsequent Random Forest modeling, the generated NO3- values are separated from the predictors and serve as the target variable (y), while the remaining hydrochemical parameters serve as input features (X).
- Physical constraints and quality control
For the target variable, NO3-, a non-negativity constraint (NO3- ≥ 0 mg L-1) is imposed to prevent non-physical solutions that may arise from the regression approximation. This constraint was enforced using numpy.clip() function post-generation. Other variables, such as pH and ORP, are allowed to fluctuate within reasonable ranges without hard clipping to retain the model's flexibility. The consistency between the virtual and measured samples is validated by comparing their statistical characteristics, including mean, standard deviation, coefficient of variation, range of extreme values, and boxplot distributions. This comparison confirms that the generated data are highly consistent with the original data in terms of statistical properties, without introducing systematic bias or outliers. The generated virtual samples are complete synthetic observations containing both the predictor variables and nitrate concentrations. These are concatenated with the original measured data to form an augmented training dataset. During Random Forest (RF) training, the 7 water quality parameters (pH, T, EC, DO, ORP, Salt, TDS) serve as input features (X), while the NO3- column (from both original and virtual samples) serves as the target variable (y). This approach effectively increases the training sample size while maintaining the geochemical integrity of the feature-target relationships, thereby mitigating small-sample bias and improving the model's ability to capture heavy-tailed distributions.
The advantages of this method are as follows: ① t-SNE excels at capturing local neighborhood relationships, effectively separating implicit subgroups under different hydrological conditions. ② The GMM provides a probabilistic generative framework, supporting reasonable extrapolation for heavy-tailed distributions and extreme values. ③ The KNN-based inverse mapping circumvents the need for large training datasets, which is a limitation of traditional autoencoders, making it particularly suitable for small-sample scenarios (Tang et al., 2022; Razavi-Termeh et al., 2024). ④ By modeling the joint distribution of predictors and target, the method ensures that generated virtual nitrate concentrations are chemically consistent with the accompanying hydrochemical indicators, avoiding unrealistic combinations that could degrade model performance.
2.5 Machine learning methods
2.5.1 Random forest
In this study, Random Forest (RF) was adopted as the baseline model. As an ensemble learning method that leverages bootstrap sampling and random feature selection, RF builds numerous decision trees and integrates their predictions (Abderzak et al., 2025). This approach effectively suppresses overfitting and improves generalization performance, making it especially well-suited for environmental data modeling scenarios involving small samples, high dimensionality, non-linearity, and multicollinearity (Boddu et al., 2025). The core principle of the RF regression model is to generate multiple independent decision tree learners through bootstrap resampling of the training dataset, where each tree is trained on a random subset of samples and a random subset of input features at each node split. The final prediction is the average of the outputs from all individual decision trees, which effectively reduces the variance of the model and suppresses overfitting. For each node split, the optimal split feature and threshold are determined by minimizing the mean squared error (MSE) reduction, which is derived from the Gini impurity criterion and mathematically defined as:
(6)
where Pi is the proportion of samples belonging to the i-th interval of the target variable in the node. For regression tasks, the Gini impurity is extended to the MSE reduction criterion, which quantifies the decrease in the variance of the target variable after the node split.
Hyperparameters were configured based on a preliminary grid search and domain expertise: n_estimators=100, max_depth=5, min_samples_split=6, min_samples_leaf=3, and max_features=. For the interpretation of driving mechanisms, feature importance was quantified by the mean decrease in Gini impurity (Gini Importance) to identify the critical hydrogeochemical indicator factors (Kaur et al., 2025).
2.5.2 Hybrid quantum-classical random forest
Based on the random forest, a hybrid quantum-enhanced Random Forest model was constructed, integrating quantum feature enhancement with classical random forests. The core idea of the model is: utilizing a Parameterized Quantum Circuit (PQC) to perform quantum feature encoding on standardized input features, generating high-dimensional quantum features with non-linear entanglement properties (Naresh et al., 2025). These are then concatenated with the original features to construct an enhanced hybrid feature space, which is finally fed into a random forest regressor for modeling (Lamichhane et al., 2025). This hybrid framework retains the superior ensemble learning performance of the classical RF model, while introducing quantum-enhanced feature representations to improve the model's ability to capture complex non-linear relationships in small-sample datasets, without relying on physical quantum hardware.
(1) Quantum Feature Encoding
Quantum state transformation of classical data is achieved based on the Z-feature map in quantum computing (Vedavyasa et al., 2025). The ZFeatureMap maps the classical high-dimensional feature space into a quantum Hilbert space through single-qubit Z-gate operations and two-qubit CZ-gate entanglement operations (Khalil et al., 2025). In this study, the number of qubits was set equal to the number of input features (7 qubits for in-situ hydrochemical parameters, corresponding to pH, T, EC, DO, ORP, Salt, TDS), and the repetition layers (reps) were set to 1 to prevent circuit depth-induced noise while maintaining expressibility. This single-layer circuit design avoids the exponential growth of quantum state noise with increasing circuit depth, which is critical for ensuring the stability of feature generation in classical simulation environments. Its core advantage lies in obtaining quantum features via analytical calculation of quantum state vectors, thereby avoiding noise interference introduced by quantum sampling and ensuring feature stability. The ZFeatureMap provided by Qiskit is used as the feature encoding circuit, with its Hamiltonian form given by (Tehrani et al., 2024):
(7)
the term represents the unitary operator of the encoding circuit that maps classical data x into a quantum state, which is a dimensionless unitary matrix. The input vector x denotes the classical data with units corresponding to normalized feature values specific to the problem at hand. The parameter R indicates the number of repetition layers in the circuit architecture, serving as a dimensionless positive integer, while d represents the selected number of principal factors, also expressed as a dimensionless positive integer. The indices k and i iterate over the repetition layers and qubits respectively, with k ranging from 1 to R and i ranging from 1 to d . The operator Hi corresponds to the Hadamard gate applied to the i -th qubit. The symbol ⨂ denotes the tensor product operation across multiple qubits, while exp() represents the matrix exponential function defining the parameterized rotation gate. The term −i incorporates the imaginary unit i with a negative sign. The set S refers to the subset of qubit indices involved in entanglement operations, satisfying the condition S⊆{1,...,d} . The rotation angle ϕS(x) depends on the input data and is computed via linear embedding, with units expressed in radians or as a dimensionless quantity. The operator Zj represents the Pauli-Z operator acting on the j-th qubit.
For each sample x, construct the corresponding quantum state |ψ(x)⟩, and analytically calculate the Pauli-Z expectation value for each qubit (Liao et al., 2024):
(8)
here, the quantity ⟨Zi⟩ denotes the expectation value of the Pauli-Z operator for the i -th qubit, taking values in the range [−1,+1] as a dimensionless scalar. The state vector ∣ψ(x)⟩ represents the quantum state encoding the classical data x after applying the unitary UZMap. The bra vector ⟨ψ(x)∣ corresponds to the conjugate transpose of ∣ψ(x)⟩ . The probabilities P(qi=0) and P(qi=1) represent the respective likelihoods of measuring the i -th qubit in states ∥0⟩ and ∥1⟩ , both taking dimensionless values in the range [0,1], while qi indicates the binary measurement outcome of the i -th qubit with possible values {0,1}. This method requires no quantum hardware sampling, completely avoiding the interference of measurement noise and shot noise on small-sample modeling, thus ensuring the determinism and reproducibility of feature generation. This analytical expectation value calculation was implemented using the Statevector module in Qiskit, ensuring deterministic feature generation without shot noise.
(2) Feature Fusion and Modeling
Concatenate the original n-dimensional raw features with the n-dimensional quantum ⟨Z⟩ features to form a 2n-dimensional hybrid feature vector x_aug = [x_raw; ⟨Z⟩] (Cowlessur et al., 2025). That is, original features+quantum encoded features. This feature fusion strategy retains the original physical information of the hydrochemical parameters while introducing non-linear entangled quantum features, expanding the feature space to enhance the model’s discriminative ability for complex non-linear relationships. Using this augmented feature vector as input, we construct a random forest regression model with identical hyperparameter settings to the classical RF baseline model, to ensure a fair performance comparison between the two models::
(9)
The symbol represents the predicted output value of the regression model, with units corresponding to the target variable. The parameter M denotes the total number of decision trees in the ensemble. Treem represents the m -th decision tree regressor in the ensemble. The input xaug refers to the augmented hybrid feature vector formed by concatenating the original n -dimensional raw features xraw with the n -dimensional quantum ⟨Z⟩ features. The hyperparameter settings are the same as those for the classical random forest method described above.
2.7 Evaluation methods and prediction process
2.7.1 SHAP analysis
This study adopts the SHapley Additive exPlanations (SHAP) method for local and global explainability analysis (Merabet et al., 2025). Through three typical visualization methods, namely summary plot, dependence plot, and waterfall plot, the following are revealed respectively: (1) The overall ranking and distribution of feature importance across all samples (global perspective); (2) The nonlinear relationship or interaction effects between a single predictor variable and the predicted nitrate concentration (conditional dependence); (3) The contribution decomposition of each feature in the prediction result of a representative sample (local attribution) (Alam et al., 2025).
The SHAP value is mathematically defined as: the marginal contribution of feature j to the model output offset from the baseline mean (Li et al., 2024), and its form is:
(17)
where F is the set of all features, S is a subset not containing feature j, and f is the model output. By averaging the absolute SHAP values over all samples, a feature importance measure with a game-theoretic foundation, unbiased and robust, can be obtained (Hollmannet al., 2025).
2.7.2 Leave-One-Out Cross-Validation (LOOCV) and model evaluation indicators
Given the limited sample size in each hydrological season, this study adopts Leave-One-Out Cross-Validation (LOOCV) for model performance evaluation to maximize the use of training data and reduce evaluation bias (Austin et al., 2025). Prior to LOOCV, hyperparameter tuning was conducted using 5-fold Cross-Validation on the full dataset to identify optimal model structures (e.g., tree depth, number of estimators). The grid search ranges included: n_estimators [100, 150], max_depth [5, 6, 7], min_samples_split [4, 6], and max_features ['sqrt']. The best parameters identified were then fixed for the LOOCV evaluation. The LOOCV process is: each time, one sample is left out as the validation set, and the remaining n-1 samples are used for training. After repeating nn times, the average of the evaluation indicators is taken as the final result (Ren et al., 2021). Additionally, to assess prediction uncertainty, 90% Prediction Intervals (PI) were estimated using quantile regression forests during the LOOCV process. Coverage probability was calculated to validate the reliability of the uncertainty estimates. To quantify the uncertainty of model performance estimates derived from LOOCV, we implemented Bootstrap resampling (1000 iterations) to calculate 95% confidence intervals (CI) for all evaluation metrics. This approach addresses the limitation that LOOCV alone cannot provide direct confidence intervals for R2 estimates due to the single-sample validation nature. By repeatedly resampling the prediction residuals with replacement and recalculating metrics, we obtained robust uncertainty bounds that reflect the stability of model performance across different data realizations. The Bootstrap 95% CI is reported alongside LOOCV metrics in all performance tables (Table S1-S6).
The coefficient of determination R2, root mean square error (RMSE), and mean absolute error (MAE) are used to quantitatively describe the model accuracy and error characteristics (Gul et al., 2025):
(18)
(19)
(20)
where is the measured value of the i-th sample (mg L-1), is the model’s predicted value (mg L-1), is the mean of the measured values, and n is the number of samples in the test set.
2.7.3 Model robustness and uncertainty analysis
2.7.3.1 Sensitivity analysis
To quantitatively evaluate the response intensity of the model’s nitrate concentration prediction results to the perturbation of input hydrochemical parameters, and to verify the robustness of the model and the reliability of feature importance ranking, this study conducted a multi-dimensional sensitivity analysis. This includes global sensitivity indices based on SHAP values, one-at-a-time (OAT) feature perturbation tests, and elasticity coefficient calculations (Di et al., 2024).
First, based on the SHAP values obtained from the TreeExplainer, the first-order and total-order sensitivity indices were calculated to measure the independent and interactive contributions of each feature (Huang et al., 2021). The first-order sensitivity index for feature i is defined as the normalized mean absolute SHAP value:
(21)
where is the SHAP value of feature i for sample j, n is the number of samples, and p is the number of features. The total-order sensitivity index , which captures both main effects and interaction effects, was approximated using the diagonal elements of the SHAP interaction values:
(22)
where represents the self-interaction SHAP value for feature i in sample j.
Second, to assess the local stability of the model, an OAT perturbation test was conducted by fixing other input parameters and varying the target feature within ±30% of its measured range (Anderson et al., 2018). The perturbation sensitivity is calculated as the relative change rate of the model’s R2 score:
(23)
where represents the coefficient of determination obtained under different perturbation levels.
Finally, the elasticity coefficient was introduced to quantify the directional response degree of nitrate concentration prediction to per unit change of each parameter. It is defined as the ratio of the percentage change in output to the percentage change in input:
(24)
where is the baseline prediction, is the prediction after perturbing feature i by a small fraction, and is the original value of feature i for sample j. An absolute value greater than 1 indicates an elastic response, while less than 1 indicates an inelastic response.
2.7.3.2 Error propagation
Monte Carlo simulation was specifically used to analyze the propagation of input measurement errors to the final predictions (Dega et al., 2023). Assuming the input hydrochemical parameters follow a normal distribution with a coefficient of variation (CV) of 5% (representing typical instrumental uncertainty), the perturbed input matrix Xmc was generated as:
(25)
where = ×CV. This process was repeated for Nmc =500 iterations. The output uncertainty due to input errors was then characterized by the standard deviation (σout) and the 95% CI width of the ensemble predictions:
(26)
where is the prediction from the k-th Monte Carlo simulation, and P2.5 ,P97.5 are the 2.5th and 97.5th percentiles of the prediction distribution.
- Model validation appears limited. The manuscript does not report Confidence intervals. Statistical significance testing. Sensitivity analysis. Uncertainty quantification.
Without uncertainty analysis, the robustness of conclusions is questionable.
I suggest author will include Bootstrap or cross-validation uncertainty. Sensitivity analysis of key parameters. Error propagation discussion.
Several figures suffer from: Small axis labels. Low resolution. Lack of units. No uncertainty shading.
Figures must be self-explanatory. Captions should fully describe the content without requiring readers to consult the main text.
We sincerely thank the reviewer for this critical comment regarding uncertainty analysis. We fully agree that robustness assessment is essential for validating our modeling framework. We have comprehensively addressed this concern through the following revisions:
We have implemented Bootstrap resampling (1000 iterations) to quantify the uncertainty of all performance metrics reported in the study. Specifically:
- We calculated 95% confidence intervals (CI) for R² for all model configurations (classical RF vs. quantum-enhanced RF) across all hydrological seasons and input types.
- The Bootstrap CIs are now reported in all supplementary performance tables (Table S1-S6).
- We note that LOOCV alone cannot provide direct confidence intervals due to its single-sample validation nature; the Bootstrap approach effectively addresses this limitation by resampling prediction residuals.
The revised manuscript (Section 2.6.2) now explicitly describes this dual-evaluation methodology. In the Results section (Sections 3.4.2 and 3.4.3), we discuss these confidence intervals to demonstrate model stability. The narrowing of CI widths with increasing virtual sample sizes (e.g., Normal season RF R2 CI width reduced from 0.1972 at original to 0.0106 at 10-fold augmentation) demonstrates that our augmentation strategy effectively reduces estimation variance.
We agree that a comprehensive sensitivity analysis is crucial for understanding the driving mechanisms of the model and validating the physical interpretability of the input parameters. In response, we have added a new subsection Section 3.6 “Sensitivity analysis of key parameters” in the revised manuscript. In this section, we employed Sobol global sensitivity indices, feature perturbation sensitivity, and elasticity coefficients to quantify the influence of hydrochemical parameters (pH, T, EC, DO, ORP, Salt, TDS) on nitrate predictions across different hydrological seasons (normal, dry, and wet).
The key findings added to the manuscript include:
TDS and EC were identified as the most sensitive parameters across all seasons (Total-order Sobol indices > 0.25), confirming their role as robust proxies for nitrate transport processes, corroborating their central role identified in the Bayesian and correlation analyses.
We observed distinct seasonal variations in sensitivity. Specifically, pH sensitivity increased significantly during the dry season, reflecting evaporative concentration effects. ORP sensitivity was higher during the wet season, likely due to rainfall-induced redox fluctuations.
The analysis demonstrated that increasing virtual sample augmentation (from 1x to 10x) stabilized the sensitivity indices, reducing variance caused by small-sample bias (e.g., Dry season pH Sobol index variance decreased with augmentation).
We agree that a systematic discussion of error propagation is essential for evaluating model robustness and uncertainty quantification in environmental prediction tasks.
Accordingly, we have incorporated a comprehensive error propagation discussion in “Section 4.3 Error propagation under virtual sample” of the revised manuscript. Specifically, we performed Monte Carlo simulations (N=500 iterations) assuming a 5% coefficient of variation for input parameters to quantify how measurement errors propagate through both classical Random Forest and hybrid Quantum-Classical Random Forest models under varying degrees of virtual sample augmentation.
We have carefully revised all figures throughout the manuscript to address these concerns. Our specific actions are as follows: We have significantly enlarged the font size of all axis labels, tick marks, and legends to ensure readability. Additionally, all figures have been exported and embedded at a higher resolution (300 dpi) to meet publication standards. We have meticulously reviewed all figures and ensured that appropriate physical units are now clearly indicated on all axes and in the figure captions where applicable. To maintain clarity and avoid visual clutter in complex multi-seasonal comparison plots, we have reported these precise Bootstrap 95% CIs alongside the performance metrics in Tables S1-S6 (Supplementary Material). This allows readers to access exact quantitative bounds of uncertainty for each model and season. We have ensured that boxplots are used to compare observed and predicted concentrations. These boxplots inherently display the interquartile range (IQR), whiskers, and outliers, providing a clear visual indication of distributional uncertainty and model stability across different data augmentation levels.
We believe these improvements have significantly enhanced the clarity, rigor, and readability of the figures and the overall manuscript.
We agree that making figures self-explanatory is crucial for improving the readability and accessibility of the manuscript. Accordingly, we have thoroughly reviewed and revised the captions for all figures (Fig. 1 to Fig. 22) in the revised manuscript.
The specific modifications are as follows:
3.4 Model performance evaluation
3.4.1 Virtual data analysis
To address the modeling bias arising from limited measured samples, this study constructed virtual datasets at 1-10 times the original scale based on a strategy combining t-SNE dimensionality reduction, GMM clustering sampling, and KNN inverse mapping, to enhance the robustness of model training. Taking the 10x virtual dataset as an example, the statistical characteristics (Table 4) show that the virtual samples effectively reproduced the central tendency and dispersion of the original data. For the normal season, the mean NO3- concentration was 30.41 mg L-1 (vs. observed mean of 33.67), with a standard deviation of 28.78 (vs. 35.83) and a coefficient of variation (CV) of 0.95 (vs. 1.06). In the dry season, the maximum value of the virtual samples reached 178.09 mg L-1, while this did not fully replicate the extreme high values (observed maximum of 358.58 mg L-1), it effectively expanded the range of the heavy-tailed distribution. For the wet season, although the CV for all indicators was slightly lower than the measured values, their ranges (8.47-80.37 vs. 4.15-98.36 mg L-1) still showed a high degree of overlap, indicating that no systematic distortion was introduced.
Table 4. Statistical characteristics of different virtual samples.
Periods
pH
T
EC
DO
ORP
Salt
TDS
NO3-
Unit
℃
μs cm-1
mg L-1
mv
ppt
mg L-1
mg L-1
Normal season
Max
8.32
16.2
963.6
9.31
101.02
0.42
628.6
124.03
n=660
Min
7.95
13.88
378.2
3.52
-48.28
0.12
245
5.10
Mean
8.18
14.72
527.31
6.66
2.09
0.20
342.72
30.41
SD
0.07
0.54
148.31
1.66
25.97
0.08
96.92
28.78
CV
0.01
0.04
0.28
0.25
12.44
0.41
0.28
0.95
Dry season
Max
8.19
17.59
987.8
8.10
69.18
0.44
641.8
178.09
n=650
Min
7.04
14.44
417.6
2.68
-59.14
0.132
267.8
5.19
Mean
7.35
15.49
661.79
6.414
0.52
0.27
429.90
37.75
SD
0.44
0.69
170.65
1.24
29.61
0.09
111.16
33.13
CV
0.06
0.04
0.26
0.19
57.21
0.34
0.26
0.88
Wet season
Max
8.21
18.88
872.8
8.54
56.4
0.37
569
80.37
n=500
Min
6.31
16.12
395.8
5.34
-77.64
0.13
257.6
8.47
Mean
7.38
16.93
535.7
7.06
13.75
0.20
348.24
25.85
SD
0.33
0.62
131.02
0.78
28.85
0.08
86.05
19.88
CV
0.04
0.04
0.24
0.11
2.10
0.38
0.25
0.77
Standardized multivariate boxplots (Fig.12) visually confirm that the median, interquartile range (IQR), whisker length, and outlier distribution of the virtual data for each period were highly similar to the measured data, demonstrating that the central tendency and dispersion characteristics were well-preserved. Hydrological seasonal characteristics, such as high EC/TDS/Cl-/NO3- in the dry season and low, concentrated NO3- in the wet season, were also accurately preserved. Although the number of some newly added outliers slightly increased, they all fell within physically reasonable ranges, with no non-physical solutions, such as negative concentrations or out-of-bounds pH values, occurring. Fig.13 presents a comparison of nitrate concentration frequency distributions between the original and virtual datasets across normal, dry, and wet periods. The distributional comparison indicates that the proposed t-SNE + GMM + KNN inverse mapping virtual sample generation strategy maintains the core features of the NO3- distribution for each hydrological period, while simultaneously improving sample representation in sparse areas and intervals of high variability. Therefore, the t-SNE + GMM method effectively captured the non-linear structure and extreme value information of the original data, and can provide reliable data support for subsequent model training.
Fig.12. Box plots of the observed and virtual variable data at different periods.
Fig.13. Comparison of nitrate concentration distribution in the original and virtual datasets under different periods.
The Kolmogorov-Smirnov (KS) test results indicate that the dry season exhibited the lowest mean KS statistic (0.1214±0.0387), with 80.0% of the features falling below the strict threshold of 0.15 (Fig. S1). This demonstrates that the virtual samples during this period highly reproduced the distribution characteristics of the observed data. The normal season followed, with a mean KS statistic of 0.1278±0.0356, where 72.5% of the features had KS values below 0.15. The distribution similarity during the wet season was relatively lower, characterized by a mean KS statistic of 0.1657 ± 0.0542; although 37.5% of the features had KS values below 0.15, 70.0% still satisfied the acceptable criterion of KS<0.20. The P-values from the KS test provided a quantitative assessment of the statistical significance regarding distributional differences. When P > 0.05, the null hypothesis could not be rejected, indicating no significant difference between the distributions of the virtual and observed samples (Fig. S2 & Fig.14). Jensen-Shannon (JS) divergence analysis further corroborated these findings (Fig. 15). The mean JS divergence across all three hydrological periods remained below the threshold of 0.05, demonstrating high consistency with the original distributions. These low JS divergence values indicate that the probability distributions of the synthetic samples closely approximated the observed data, with no significant distributional shift observed. Notably, the JS divergence values for Salt and NO3- were relatively higher across all hydrological periods. Evaluation results using the Maximum Mean Discrepancy (MMD) showed that MMD values approaching zero and remaining within the confidence interval indicate no significant distributional differences between the synthetic and observed data in the Reproducing Kernel Hilbert Space (RKHS), confirming that the virtual samples introduced no systematic bias (Fig. 16).
Citation: https://doi.org/10.5194/egusphere-2026-272-AC2
Viewed
| HTML | XML | Total | Supplement | BibTeX | EndNote | |
|---|---|---|---|---|---|---|
| 238 | 107 | 23 | 368 | 34 | 9 | 33 |
- HTML: 238
- PDF: 107
- XML: 23
- Total: 368
- Supplement: 34
- BibTeX: 9
- EndNote: 33
Viewed (geographical distribution)
| Country | # | Views | % |
|---|
| Total: | 0 |
| HTML: | 0 |
| PDF: | 0 |
| XML: | 0 |
- 1
This manuscript presents a highly innovative and timely contribution to hydrogeoscience research, offering a well-integrated framework that combines hydrochemical analysis, isotopic source apportionment, virtual sample generation, and hybrid quantum–classical machine learning to predict seasonal groundwater nitrate concentrations. The authors successfully link hydrogeochemical processes with predictive modeling and interpretability (via SHAP and Bayesian analyses), thereby enhancing both scientific insight and practical relevance. The focus on seasonal nitrate dynamics in an intensively cultivated region further strengthens the manuscript’s significance for water resources management. Overall, the work aligns exceptionally well with the scope and standards of HESS, particularly in its emphasis on process understanding, methodological novelty, and interdisciplinary integration.
1) The manuscript presents an interesting integration of virtual sample generation and hybrid quantum–classical ML. However, the authors should more explicitly differentiate their contribution from existing ML-based groundwater quality prediction studies. A short paragraph clearly stating what is fundamentally new (beyond combining known techniques) would strengthen the paper.
2) The t-SNE–GMM–KNN augmentation strategy is promising, but the validation of virtual samples is described rather briefly. Please provide additional quantitative diagnostics (e.g., distribution similarity metrics, KS test, or comparison of covariance structures) to demonstrate that synthetic data do not introduce bias.
3) The reported improvement in R² after 10× augmentation is impressive. The authors should discuss potential risks of overfitting to synthetic patterns and comment on how the framework might generalize to other regions with different hydrogeochemical settings.
4) The manuscript would benefit from acknowledging recent developments in quantum approaches for hydrology. I do strongly recommend citing the following relevant work, which provides useful methodological context: HydroQuantum: A new quantum-driven Python package for hydrological simulation.