https://doi.org/10.5194/egusphere-2025-1399
04 Apr 2025
Status: this preprint is open for discussion and under review for Geoscientific Model Development (GMD).

TOAR-classifier v2: A data-driven classification tool for global air quality stations

Ramiyou Karim Mache, Sabine Schröder, Michael Langguth, Ankit Patnala, and Martin G. Schultz

Abstract. Accurate characterization of station locations is crucial for reliable air quality assessments such as the Tropospheric Ozone Assessment Report (TOAR). This study introduces a machine learning approach to classify 23,974 stations in the unique global TOAR database as urban, suburban, or rural. We test several methods: unsupervised K-means clustering with three clusters, and an ensemble of supervised learning classifiers including random forest, CatBoost, and LightGBM. We further enhance the supervised learning performance by integrating these classifiers into a robust voting model that leverages their collective predictive power. To address the inherent ambiguity of suburban areas, we implement an adjusted threshold probability technique. Our models, trained on the TOAR station metadata, are evaluated on 1,000 unseen data points. K-means clustering achieves 70.03 % and 71.53 % accuracy for urban and rural areas, respectively, but only 26.36 % for suburban zones. Supervised classifiers surpass this performance, reaching over 84 % accuracy for the urban and rural categories and 62–65 % for suburban areas. The adjusted threshold technique significantly enhances overall model accuracy, particularly for suburban classification. The good class separation achieved by our model is confirmed through evaluation with NOx and PM2.5 concentration measurements, which were not included in the training data. Furthermore, manual inspection of 25 individual sites with Google Maps reveals that our method provides a better station type label than the labels that were reported by data providers and used in the model evaluation. The objective station classification proposed in this paper therefore provides a robust foundation for type-specific air quality assessments in TOAR and elsewhere.
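
The combination of a voting model with an adjusted probability threshold can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes soft voting (averaging class probabilities across classifiers) and one plausible reading of the threshold adjustment, namely that a station is labelled suburban as soon as its averaged suburban probability reaches a chosen cut-off, even if another class has a higher probability. The function names, the 0.35 cut-off, and the probability values are hypothetical.

```python
# Illustrative sketch only; names, values, and the threshold rule are assumptions.
CLASSES = ("urban", "suburban", "rural")


def soft_vote(per_model_probs):
    """Average the class probabilities predicted by several classifiers."""
    n_models = len(per_model_probs)
    return [
        sum(probs[i] for probs in per_model_probs) / n_models
        for i in range(len(CLASSES))
    ]


def classify(probs, suburban_threshold=None):
    """Return a class label for one station.

    Without a threshold, the most probable class wins (plain argmax).
    With a threshold, 'suburban' is chosen whenever its probability
    reaches the threshold, which favours the ambiguous middle class.
    """
    suburban_index = CLASSES.index("suburban")
    if suburban_threshold is not None and probs[suburban_index] >= suburban_threshold:
        return "suburban"
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best]


# Hypothetical probabilities (urban, suburban, rural) from three classifiers
# for a single station.
station_probs = [
    [0.45, 0.40, 0.15],
    [0.50, 0.35, 0.15],
    [0.40, 0.42, 0.18],
]

averaged = soft_vote(station_probs)
plain_label = classify(averaged)
adjusted_label = classify(averaged, suburban_threshold=0.35)
print(plain_label, adjusted_label)
```

In this made-up case the plain argmax yields "urban", while the lowered cut-off moves the station to "suburban", which is the kind of shift a threshold adjustment is meant to produce for borderline sites.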

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.