Quantifying the impact of input data-induced dataset shift on machine learning model applications: A case study of regional reactive nitrogen wet deposition
Abstract. Machine learning (ML) has been extensively applied to studies of the spatial distribution of atmospheric composition, but quantitative assessments of the uncertainties arising from input data properties are still lacking. To address this gap, we conducted a case study on reactive nitrogen (Nr) wet deposition (Dwet). The Extreme Gradient Boosting (XGBoost) model was employed to predict Dwet in East Asia and Southeast Asia (SEA) using a dataset compiled from multiple sources. We quantified the impacts of input data characteristics on model performance and Dwet estimations via three sensitivity experiments. Key findings were: (1) Sample sizes below 6,000–9,000 led to accuracy losses of up to 12 %, while exceeding this threshold provided only marginal performance improvement. (2) Uneven spatial distribution of monitoring sites caused 9–51 % deviations from baseline performance, with >50 % variability in Dwet estimations in data-scarce regions (e.g., western China and SEA). (3) Imbalanced site types, which left remote sites under-represented, resulted in 9–40 % overall accuracy loss and a high risk of severe overestimation (100 %) in remote areas. The bias was attributed to both data range shifts and altered feature-target relationships (e.g., NH₃ emission vs. Dwet-NH₄⁺). Additionally, inconsistencies among multi-source datasets and limitations of the ML model structure introduced further uncertainties. This study quantified previously unaddressed input data-induced uncertainties in ML-based Nr deposition research, providing critical insights for the reliable application of ML-derived data in Nr management. The proposed uncertainty assessment framework is also applicable to other ML-based geospatial interpolation tasks facing data scarcity challenges.
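As an illustration of the first sensitivity experiment (sample size), the following is a minimal sketch rather than the procedure used in this study. It trains on random subsets of increasing size and scores each on a fixed held-out test set. All function names, the synthetic data, the subset sizes and the number of repeats are hypothetical, and a one-feature ordinary least squares fit stands in for XGBoost so that the example needs only the Python standard library.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Stand-in model (assumption): one-feature least squares, NOT XGBoost."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_line(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def sample_size_sensitivity(xs, ys, sizes, n_repeats=20, test_fraction=0.2, seed=0):
    """Return the mean test R2 for each training-set size in `sizes`."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, pool = idx[:n_test], idx[n_test:]  # test set is fixed across sizes
    x_test = [xs[i] for i in test_idx]
    y_test = [ys[i] for i in test_idx]
    results = {}
    for n in sizes:
        scores = []
        for _ in range(n_repeats):
            sub = rng.sample(pool, n)  # random training subset of size n
            model = fit_line([xs[i] for i in sub], [ys[i] for i in sub])
            scores.append(r2_score(y_test, predict_line(model, x_test)))
        results[n] = mean(scores)
    return results


# Synthetic data (assumption): a noisy linear feature-target relationship.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(2000)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 2.0) for x in xs]

curve = sample_size_sensitivity(xs, ys, sizes=[5, 50, 500])
for n, score in curve.items():
    print(n, round(score, 3))
```

Plotting mean test R² against training-set size gives a learning curve; the size beyond which the curve flattens plays the role of the threshold described in finding (1). Swapping `fit_line` and `predict_line` for a gradient-boosting regressor would bring the sketch closer to the setting of this study.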