<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" specific-use="SMUR" dtd-version="3.0" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher">EGUsphere</journal-id>
<journal-title-group>
<journal-title>EGUsphere</journal-title>
<abbrev-journal-title abbrev-type="publisher">EGUsphere</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">EGUsphere</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub"></issn>
<publisher><publisher-name>Copernicus Publications</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.5194/egusphere-2025-1399</article-id>
<title-group>
<article-title>TOAR-classifier v2: A data-driven classification tool for global air quality stations</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mache</surname>
<given-names>Ramiyou Karim</given-names>
<ext-link>https://orcid.org/0000-0002-0190-3311</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Schröder</surname>
<given-names>Sabine</given-names>
<ext-link>https://orcid.org/0000-0002-0309-8010</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Langguth</surname>
<given-names>Michael</given-names>
<ext-link>https://orcid.org/0000-0003-3354-5333</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Patnala</surname>
<given-names>Ankit</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Schultz</surname>
<given-names>Martin G.</given-names>
<ext-link>https://orcid.org/0000-0003-3455-774X</ext-link>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group><aff id="aff1">
<label>1</label>
<addr-line>Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany</addr-line>
</aff>
<pub-date pub-type="epub">
<day>04</day>
<month>04</month>
<year>2025</year>
</pub-date>
<volume>2025</volume>
<fpage>1</fpage>
<lpage>17</lpage>
<permissions>
<copyright-statement>Copyright: &#x000a9; 2025 Ramiyou Karim Mache et al.</copyright-statement>
<copyright-year>2025</copyright-year>
<license license-type="open-access">
<license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p>
</license>
</permissions>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1399/">This article is available from https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1399/</self-uri>
<self-uri xlink:href="https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1399/egusphere-2025-1399.pdf">The full text article is available as a PDF file from https://egusphere.copernicus.org/preprints/2025/egusphere-2025-1399/egusphere-2025-1399.pdf</self-uri>
<abstract>
<p>Accurate characterization of station locations is crucial for reliable air quality assessments such as the Tropospheric Ozone Assessment Report (TOAR). This study introduces a machine learning approach to classify the 23,974 stations in the unique global TOAR database as urban, suburban, or rural. We test two kinds of methods: unsupervised K-means clustering with three clusters, and supervised learning classifiers, namely random forest, CatBoost, and LightGBM. We further improve the supervised learning performance by combining these classifiers in a voting model that draws on their collective predictive power. To address the inherent ambiguity of suburban areas, we apply an adjusted threshold probability technique. Our models, trained on the TOAR station metadata, are evaluated on 1,000 unseen data points. K-means clustering achieves 70.03 % and 71.53 % accuracy for urban and rural areas, respectively, but only 26.36 % for suburban zones. The supervised classifiers surpass this performance, reaching over 84 % accuracy for the urban and rural categories and 62&#x2013;65 % for suburban areas. The adjusted threshold technique significantly enhances overall model accuracy, particularly for the suburban class. The good class separation achieved by our model is confirmed through evaluation with NOx and PM2.5 concentration measurements, which were not included in the training data. Furthermore, manual inspection of 25 individual sites with Google Maps reveals that our method provides a better station type label than the labels that were reported by the data providers and used in the model evaluation. The objective station classification proposed in this paper therefore provides a robust foundation for type-specific air quality assessments in TOAR and elsewhere.</p>
</abstract>
<counts><page-count count="17"/></counts>
<funding-group>
<award-group id="gs1">
<funding-source>European Research Council</funding-source>
<award-id>787576</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body/>
<back>
</back>
</article>