Hydrochemistry and modeling nitrate concentration in farmland groundwater under different hydrological seasons by integrating hybrid quantum-classical ML, virtual sample generation and AlphaEarth Foundation

Xu, Junjie; Wei, Xin; Yu, Yilei; Yang, Lihu; Zhai, Yuanzheng; Lv, Cuicui; Song, Xianfang

doi:10.5194/egusphere-2026-272

Preprints

https://doi.org/10.5194/egusphere-2026-272


29 Jan 2026


Abstract. Precise seasonal prediction of groundwater nitrate concentrations in intensive agricultural areas faces challenges such as data sparsity, strong spatiotemporal heterogeneity, and complex hydro-biogeochemical processes. To address these issues, this study proposes an integrated prediction framework combining hybrid quantum-classical machine learning, advanced virtual sample generation (t-SNE-GMM-KNN), and remote sensing foundation model semantic embedding (AEF). Modeling was conducted across the 2022–2023 normal, dry, and wet seasons in Xiong'an New Area. Hydrochemical types were dominated by Ca-Mg-HCO₃⁻, controlled by mineral dissolution and evaporation. Nitrate concentrations were highest in the dry season (mean 42.93 mg L⁻¹), driven by evaporative concentration. Spatially, high-value zones shifted: southeast (normal), central (dry), and northwest (wet). MixSIAR modeling based on isotopes indicated domestic sewage and livestock manure (74.1 %) as dominant sources. The t-SNE-GMM-KNN strategy mitigated small-sample bias while preserving nonlinear structure. When virtual samples were augmented to 10-fold, the Random Forest R² in the dry season increased from 0.284 to > 0.85. Furthermore, a hybrid quantum-classical Random Forest exhibited superior robustness to data sparsity, achieving peak performance in the normal season (R² = 0.962, RMSE = 5.73 mg L⁻¹). Additionally, using only AEF embeddings achieved screening-level accuracy (R² up to 0.860), providing a feasible rapid survey scheme for extensive unmonitored regions. Correlation analysis identified TDS and EC as persistent top predictors (r > 0.8). This comprehensive framework offers a robust solution for seasonal nitrate prediction and sustainable water management.

Received: 18 Jan 2026 – Discussion started: 29 Jan 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 6039 KB)

Supplement (73 KB)


Status: final response (author comments only)

CC1: 'Comment on egusphere-2026-272', Nima Zafarmomen, 19 Feb 2026

This manuscript presents a highly innovative and timely contribution to hydrogeoscience research, offering a well-integrated framework that combines hydrochemical analysis, isotopic source apportionment, virtual sample generation, and hybrid quantum–classical machine learning to predict seasonal groundwater nitrate concentrations. The authors successfully link hydrogeochemical processes with predictive modeling and interpretability (via SHAP and Bayesian analyses), thereby enhancing both scientific insight and practical relevance. The focus on seasonal nitrate dynamics in an intensively cultivated region further strengthens the manuscript’s significance for water resources management. Overall, the work aligns exceptionally well with the scope and standards of HESS, particularly in its emphasis on process understanding, methodological novelty, and interdisciplinary integration.
1) The manuscript presents an interesting integration of virtual sample generation and hybrid quantum–classical ML. However, the authors should more explicitly differentiate their contribution from existing ML-based groundwater quality prediction studies. A short paragraph clearly stating what is fundamentally new (beyond combining known techniques) would strengthen the paper.
2) The t-SNE–GMM–KNN augmentation strategy is promising, but the validation of virtual samples is described rather briefly. Please provide additional quantitative diagnostics (e.g., distribution similarity metrics, KS test, or comparison of covariance structures) to demonstrate that synthetic data do not introduce bias.
3) The reported improvement in R² after 10× augmentation is impressive. The authors should discuss potential risks of overfitting to synthetic patterns and comment on how the framework might generalize to other regions with different hydrogeochemical settings.
4) The manuscript would benefit from acknowledging recent developments in quantum approaches for hydrology. I do strongly recommend citing the following relevant work, which provides useful methodological context: HydroQuantum: A new quantum-driven Python package for hydrological simulation.
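The diagnostics requested in point 2 could look roughly like the sketch below. It is not taken from the manuscript: the function names, the dictionary layout (variable name mapped to a list of values), and the toy numbers are all assumptions made for illustration. It computes a two-sample Kolmogorov-Smirnov statistic per variable and the largest difference between matching covariance entries of the measured and virtual samples.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def covariance(x, y):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)


def max_covariance_gap(real_cols, synth_cols):
    """Largest absolute difference between matching covariance-matrix entries."""
    names = sorted(real_cols)
    return max(
        abs(covariance(real_cols[p], real_cols[q])
            - covariance(synth_cols[p], synth_cols[q]))
        for p in names
        for q in names
    )


if __name__ == "__main__":
    # Toy values for illustration only; these are NOT data from the study.
    measured = {"EC": [1.0, 2.0, 3.0, 4.0], "NO3": [2.0, 4.0, 6.0, 8.0]}
    virtual = {"EC": [1.5, 2.5, 3.5, 4.5], "NO3": [3.0, 5.0, 7.0, 9.0]}
    for name in measured:
        print(name, "KS statistic:", ks_statistic(measured[name], virtual[name]))
    print("max covariance gap:", max_covariance_gap(measured, virtual))
```

A small KS statistic for every variable together with a small covariance gap would suggest that the virtual samples follow the measured marginal distributions and pairwise structure. A full analysis would also need p-values and probably a multivariate test, which this sketch leaves out.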

Citation: https://doi.org/10.5194/egusphere-2026-272-CC1
RC1: 'Comment on egusphere-2026-272', Anonymous Referee #1, 27 Feb 2026
Reviewer Comments
General Assessment
The manuscript addresses an important geoscientific problem and presents potentially valuable findings. However, several methodological, structural, and interpretative issues limit the clarity, reproducibility, and robustness of the conclusions. The topic is relevant and potentially publishable, but substantial revision is required before the manuscript can be considered for publication.

Major Comments
Page 2, Lines 10–15

The introduction provides general background but does not clearly articulate the precise research gap. While prior studies are cited, it remains unclear which specific limitation of previous work is being addressed, whether this study offers methodological novelty or merely a regional application, and how the proposed approach differs from existing frameworks.
A clearer paragraph explicitly stating the identified gap, the proposed advancement, and the expected contribution is necessary.
Page 3, Lines 5–30

The study area description lacks quantitative justification. For example:
No map scale or resolution information is provided.
Climatic or hydrological statistics are not sufficiently summarized.
The selection rationale for this site is weak.
I suggest that the authors include:
A detailed map with coordinate system and scale.
A table summarizing key environmental variables.
A clear explanation of why this site is scientifically significant.
Page 4, Lines 12

The manuscript does not adequately describe:
Data temporal resolution.
Data completeness.
Missing data handling procedures.
Quality control protocols.
If interpolation or smoothing was applied, it must be explicitly stated. Reproducibility requires transparent preprocessing documentation.
Page 5–7

The methodological section lacks:
A clear workflow diagram.
Parameter selection criteria.
Hyperparameter tuning explanation (if applicable).
Justification of threshold values used.
If statistical or machine learning methods are employed, the following must be specified:
Training/testing split strategy.
Cross-validation method.
Performance metrics definitions.
Software and version.
Currently, the method description is too general for replication.
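For reference, the two performance metrics quoted in the abstract (R² and RMSE) are conventionally defined as below. These are the standard textbook forms and are given here only as an assumption; the manuscript's own definitions are not reproduced on this page and may differ.

```latex
\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]
```

Here y_i is the observed nitrate concentration of sample i (mg L⁻¹), ŷ_i is the corresponding prediction (mg L⁻¹), ȳ is the mean of the observed values, and n is the number of evaluated samples. R² is dimensionless, while RMSE carries the unit of the target variable.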
Page 8–10

Model validation appears limited. The manuscript does not report:
Confidence intervals.
Statistical significance testing.
Sensitivity analysis.
Uncertainty quantification.
Without uncertainty analysis, the robustness of conclusions is questionable.
I suggest that the authors include:
Bootstrap or cross-validation uncertainty.
Sensitivity analysis of key parameters.
Error propagation discussion.
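As one possible reading of the bootstrap suggestion, the sketch below computes a percentile bootstrap interval for R² by resampling (observed, predicted) pairs. The function names, default settings, and toy numbers are assumptions made for illustration and do not come from the manuscript. For an honest estimate, the pairs should presumably be held-out measured samples, not virtual ones.

```python
import random


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def bootstrap_r2_interval(observed, predicted, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for R^2 from resampled (obs, pred) pairs."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(observed)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample pairs with replacement
        obs = [observed[i] for i in idx]
        pred = [predicted[i] for i in idx]
        try:
            stats.append(r_squared(obs, pred))
        except ValueError:
            continue  # skip degenerate resamples with constant observations
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


if __name__ == "__main__":
    # Toy values for illustration only; these are NOT results from the study.
    obs = [12.0, 25.0, 31.0, 40.0, 48.0, 55.0, 63.0, 70.0]
    pred = [14.0, 22.0, 33.0, 38.0, 50.0, 52.0, 66.0, 68.0]
    print("R^2:", r_squared(obs, pred))
    print("95 % bootstrap interval:", bootstrap_r2_interval(obs, pred))
```

With only a handful of test samples per season, such an interval can be wide, which is itself useful information when comparing models across seasons.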
Several figures suffer from:

Small axis labels.

Low resolution.

Lack of units.

No uncertainty shading.

Figures must be self-explanatory. Captions should fully describe the content without requiring readers to consult the main text.
Page 11–13

The discussion includes causal interpretations that are not fully supported by the presented data. Correlation-based findings are occasionally interpreted mechanistically without experimental evidence.
Please revise the language to avoid overstatement and clearly distinguish between observed correlation, hypothesized mechanism, and established causality.
Page 14

The limitations are acknowledged, but only superficially. Consider expanding the discussion on:
Spatial representativeness.

Temporal coverage limitations.

Potential bias sources.

Model generalizability.

A more critical self-assessment would strengthen the manuscript.
Several equations lack complete symbol explanations immediately below the formula. All variables should be defined clearly with units.

Some terms are used interchangeably without clarification. Please standardize terminology throughout the manuscript.

Ensure consistent SI unit formatting. Check spacing around mathematical symbols. Italicize variables in equations.

Recent studies from the last 2–3 years appear underrepresented. Please ensure the manuscript reflects current state-of-the-art research.

Minor Comments
Several grammatical errors require professional English editing.

Some references appear inconsistently formatted.

Figure numbering and cross-referencing should be double-checked.
Citation: https://doi.org/10.5194/egusphere-2026-272-RC1
RC2: 'Comment on egusphere-2026-272', Anonymous Referee #2, 12 Mar 2026
The manuscript proposes a framework for predicting nitrate concentrations in the groundwater system. Several novel methods, such as virtual sample generation and a hybrid quantum-classical Random Forest, and a novel data source (AlphaEarth Foundation) are combined with classical water quality sampling, including isotope analysis. The main novelty presented in the paper is the use of virtual sample generation to increase the Random Forest efficiency for predicting nitrate concentrations in the groundwater.
Although the general idea of the manuscript is interesting, the current manuscript lacks the details needed to fully understand the virtual sampling and Random Forest methods that are applied:
The precise methodology of the virtual sample generation (t-SNE-GMM-KNN) is unclear to me. Are virtual samples added in space or also in time? The virtual samples are used in the RF method; how are the input variables and corresponding nitrate concentrations determined?

Which inputs are used for the RF models? Are the nitrate concentrations modelled based on the other water quality parameters and the AlphaEarth Foundation data? If this is the case, isn’t the prediction of nitrate concentration merely a correlation with the remaining water quality parameters? In practice it would be strange to measure all these water quality parameters (some of which are quite costly) to predict the nitrate concentration. I do understand that this analysis is interesting to gain insight into the functioning of the nitrate dynamics in your groundwater system.

More information on the methodology of the virtual sample generation and RF models is required in my opinion. Additionally, the results are mainly validated based on boxplots of observed and predicted nitrate concentrations. A more thorough validation of the predictions should be added.
The manuscript would benefit from a clearer goal: explain why what you did is relevant and how the results can be used.
The results section has a paragraph named ‘3.3 Bayesian model analysis and correlation analysis’. These methods are, however, discussed too briefly in the methodology part.
I have a few conceptual questions about the work:
The analysis was done on a small-scale groundwater system (basically very large field scale). Within the area the groundwater system likely has very similar physical characteristics. Wouldn’t it be better to apply the method on a larger scale with more physical differences?

On page 8 of the manuscript, it is mentioned that the sampling was done on irrigation wells. Could this have an impact on the winter-summer differences?

I thought that the large unsaturated zone (> 10 m?) would limit the seasonal response of both recharge and nitrate leaching in the groundwater system. However, the presented water quality measurements show otherwise. Could other external factors, such as groundwater abstraction (for example for irrigation) or groundwater dynamics on a regional scale, have an impact as well?

Minor comments:
In the introduction you mention exceedance of the nitrate rate. Exceedance of what? The local nitrate limits? Lines 48 and 55 seem to contain very similar information.

The information on the maps of Fig. 7 is interesting; however, the maps are not very readable. Please use the same range for the three maps. How can the spatial differences in the nitrate concentration maps be explained?

On page 23, lines 552–553, you mention that the groundwater depth becomes shallower in the dry season. This is counterintuitive. How can you explain
Citation: https://doi.org/10.5194/egusphere-2026-272-RC2


Supplement

https://doi.org/10.5194/egusphere-2026-272-supplement


Viewed

Total article views: 279 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
192	72	15	279	26	5	22


Views and downloads (calculated since 29 Jan 2026)

Month	HTML	PDF	XML	Total
Jan 2026	57	17	2	76
Feb 2026	83	25	10	118
Mar 2026	52	30	3	85


Viewed (geographical distribution)

Total article views: 266 (including HTML, PDF, and XML) Thereof 266 with geography defined and 0 with unknown origin.


Latest update: 14 Mar 2026

Short summary

The proposed t-SNE-GMM-KNN virtual sample generation boosts dry season R² from 0.28 to > 0.85, preserving multimodal structure. Total Dissolved Solids (TDS), Electrical Conductivity (EC), and Salinity are consistently identified as the top predictive factors across different hydrological seasons.

