https://doi.org/10.5194/egusphere-2025-6160
20 Feb 2026
Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

Quantifying the impact of input data-induced dataset shift on machine learning model applications: A case study of regional reactive nitrogen wet deposition

Yan Zhang, Jiani Tan, Qing Mu, Joshua S. Fu, and Li Li

Abstract. Machine learning (ML) has been widely applied to study the spatial distribution of atmospheric composition, but quantitative assessments of the uncertainties arising from input data properties are still lacking. To address this gap, we conducted a case study on the wet deposition of reactive nitrogen (Nr), hereafter Dwet. The Extreme Gradient Boosting (XGBoost) model was employed to predict Dwet in East Asia and Southeast Asia (SEA) using a dataset compiled from multiple sources. We quantified the impacts of input data characteristics on model performance and Dwet estimations via three sensitivity experiments. Key findings were: (1) Sample sizes below 6,000–9,000 led to a maximum accuracy loss of 12 %, while exceeding this threshold yielded only marginal performance improvement. (2) Uneven spatial distribution of monitoring sites caused 9–51 % deviations from baseline performance, with >50 % variability in Dwet estimations in data-scarce regions (e.g., western China and SEA). (3) Imbalanced site types led to insufficient representation of remote sites, resulting in a 9–40 % overall accuracy loss and a high risk of severe overestimation (100 %) in remote areas. The bias was attributed to both data range shifts and altered feature-target relationships (e.g., NH₃ emission vs. Dwet-NH₄⁺). Additionally, inconsistencies among multi-source datasets and limitations of the ML structure introduced further uncertainties. This study quantified previously unaddressed input data-induced uncertainties in ML-based Nr deposition research, providing critical insights for the reliable application of ML-derived data in Nr management. The proposed uncertainty assessment framework is also applicable to other ML-based geospatial interpolation tasks facing data scarcity challenges.
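The first sensitivity experiment (sample size) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, a k-nearest-neighbour regressor stands in for XGBoost so that the example needs only the Python standard library, and R² is assumed as the accuracy metric. All function names (`make_synthetic`, `knn_predict`, `sample_size_experiment`) are hypothetical. The sketch trains on subsets of increasing size, scores each on a fixed test set, and reports the accuracy loss relative to the largest (baseline) subset.

```python
import heapq
import math
import random


def make_synthetic(n, rng):
    """Hypothetical stand-in data: two features and a smooth target plus noise."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = math.sin(3.0 * x1) + x2 ** 2 + rng.gauss(0.0, 0.05)
        rows.append(((x1, x2), y))
    return rows


def knn_predict(train, x, k=5):
    """Stand-in regressor (k-nearest neighbours); NOT XGBoost."""
    nearest = heapq.nsmallest(
        k, train, key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2
    )
    return sum(y for _, y in nearest) / len(nearest)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here as the accuracy metric."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def sample_size_experiment(sizes, n_test=200, seed=0):
    """Score the model on a fixed test set for each training-set size."""
    rng = random.Random(seed)
    pool = make_synthetic(max(sizes), rng)
    test = make_synthetic(n_test, rng)
    y_true = [y for _, y in test]
    scores = {}
    for n in sizes:
        subset = rng.sample(pool, n)  # random subsample of the training pool
        preds = [knn_predict(subset, x) for x, _ in test]
        scores[n] = r_squared(y_true, preds)
    baseline = scores[max(sizes)]  # the largest sample serves as the baseline
    loss_pct = {n: 100.0 * (baseline - s) / baseline for n, s in scores.items()}
    return scores, loss_pct


if __name__ == "__main__":
    scores, loss_pct = sample_size_experiment([50, 200, 1000, 2000])
    for n in sorted(scores):
        print(f"n={n:5d}  R2={scores[n]:.3f}  loss vs. baseline={loss_pct[n]:.1f} %")
```

In a real application the synthetic pool would be replaced by the compiled observation dataset and the stand-in regressor by the trained model; repeating each subsample several times with different seeds would give a spread rather than a single loss value per size.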

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 17 Apr 2026)

Latest update: 20 Feb 2026
Short summary
Despite the growing use of machine learning in environmental research, few studies have examined how input data affect the findings. We examined the impact of input data characteristics on nitrogen deposition estimates in East and Southeast Asia. Insufficient sample size cuts accuracy by up to 12 %, while data-scarce and remote areas show up to 50 % bias due to poor representation in the training data. We created a transferable framework for uncertainty quantification that is applicable to other data-scarce geospatial tasks.