Quantifying the impact of input data-induced dataset shift on machine learning model applications: A case study of regional reactive nitrogen wet deposition

Zhang, Yan; Tan, Jiani; Mu, Qing; Fu, Joshua S.; Li, Li

doi:10.5194/egusphere-2025-6160

Preprints

https://doi.org/10.5194/egusphere-2025-6160

20 Feb 2026

Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

Abstract. Machine learning (ML) has been extensively applied to studies of the spatial distribution of atmospheric composition, but quantitative assessments of the uncertainties arising from input data properties are still lacking. To address this gap, we conducted a case study on the wet deposition of reactive nitrogen (D_wet). The Extreme Gradient Boosting (XGBoost) model was employed to predict D_wet in East Asia and Southeast Asia (SEA) using a dataset compiled from multiple sources. We quantified the impacts of input data characteristics on model performance and D_wet estimations via three sensitivity experiments. Key findings were: (1) Sample sizes below 6,000–9,000 led to a maximum 12 % accuracy loss, while exceeding this threshold provided only marginal performance improvement. (2) Uneven spatial distribution of monitoring sites caused 9–51 % deviations from baseline performance, with >50 % variability in D_wet estimations in data-scarce regions (e.g., western China and SEA). (3) Imbalanced site types led to insufficient representation of remote sites, resulting in a 9–40 % overall accuracy loss and a high risk of severe overestimation (100 %) in remote areas. The bias was attributed to both data range shifts and altered feature-target relationships (e.g., NH₃ emission vs. D_wet-NH₄⁺). Additionally, inconsistencies among multi-source datasets and limitations of the ML structure introduced further uncertainties. This study quantified previously unaddressed input data-induced uncertainties in ML-based N_r deposition research, providing critical insights for the reliable application of ML-derived data in N_r management. The proposed uncertainty assessment framework is also applicable to other ML-based geospatial interpolation tasks facing data scarcity challenges.
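As a rough illustration of the first sensitivity experiment (sample size), the sketch below trains a model on random subsets of increasing size and reports the relative R² loss against the largest subset. It is not the authors' code. The data are synthetic, a small k-nearest-neighbour regressor stands in for XGBoost so that the example needs only the Python standard library, and every function name, feature, and sample size is a hypothetical choice made for illustration.

```python
import random
import statistics


def make_sites(n, rng):
    """Synthetic (emission, precipitation) -> deposition samples. Illustrative only."""
    data = []
    for _ in range(n):
        emis = rng.uniform(0.0, 10.0)
        precip = rng.uniform(0.0, 5.0)
        # Assumed toy relationship plus Gaussian noise; not taken from the paper.
        dep = 0.8 * emis + 0.5 * emis * precip ** 0.5 + rng.gauss(0.0, 0.5)
        data.append(((emis, precip), dep))
    return data


def knn_predict(train, x, k=5):
    """Mean target of the k nearest training samples (stand-in for XGBoost)."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - x[0]) ** 2 + (s[0][1] - x[1]) ** 2,
    )[:k]
    return statistics.fmean(y for _, y in nearest)


def r_squared(train, test, k=5):
    """Coefficient of determination of the stand-in model on a held-out set."""
    obs = [y for _, y in test]
    pred = [knn_predict(train, x, k) for x, _ in test]
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def sample_size_experiment(sizes, n_pool=1000, n_test=200, repeats=3, seed=0):
    """Mean held-out R^2 for each training-set size, averaged over random subsets."""
    rng = random.Random(seed)
    pool = make_sites(n_pool, rng)
    test = make_sites(n_test, rng)
    results = {}
    for n in sizes:
        scores = [r_squared(rng.sample(pool, n), test) for _ in range(repeats)]
        results[n] = statistics.fmean(scores)
    return results


def relative_loss(results):
    """Percentage R^2 loss of each size relative to the largest size (the baseline)."""
    baseline = results[max(results)]
    return {n: 100.0 * (baseline - r2) / baseline for n, r2 in results.items()}


if __name__ == "__main__":
    res = sample_size_experiment([25, 100, 400, 1000])
    for n, loss in sorted(relative_loss(res).items()):
        print(f"n={n:5d}  R2={res[n]:.3f}  loss vs. baseline={loss:.1f} %")
```

The same loop structure could in principle be reused for the other two experiments by replacing the random subsetting with subsets that drop whole regions or whole site types, although the sampling design actually used in the study is not described in the abstract.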

Received: 10 Dec 2025 – Discussion started: 20 Feb 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 3542 KB)

Supplement (2572 KB)

Status: open (until 29 Apr 2026)

Supplement

https://doi.org/10.5194/egusphere-2025-6160-supplement

Short summary

Despite the growing use of machine learning in environmental research, few studies have examined how input data affect the findings. We examined the impact of input data characteristics on nitrogen deposition estimates in East and Southeast Asia. Insufficient sample size cuts accuracy by up to 12 %, while data-scarce and remote areas show up to 50 % bias due to poor representation in the training data. We created a transferable framework for uncertainty quantification, applicable to other data-scarce geospatial tasks.

