TOAR-classifier v2: A data-driven classification tool for global air quality stations
Abstract. Accurate characterization of station locations is crucial for reliable air quality assessments such as the Tropospheric Ozone Assessment Report (TOAR). This study introduces a machine learning approach to classify 23,974 stations in the unique global TOAR database as urban, suburban, or rural. We tested several methods: unsupervised K-means clustering with three clusters, and an ensemble of supervised learning classifiers including random forest, CatBoost, and LightGBM. We further enhanced the supervised learning performance by integrating these classifiers into a robust voting model, leveraging their collective predictive power. To address the inherent ambiguity of suburban areas, we implemented an adjusted threshold probability technique. Our models, trained on the TOAR station metadata, are evaluated on 1,000 unseen data points. K-means clustering achieves 70.03 % and 71.53 % accuracy for urban and rural areas respectively, but only 26.36 % for suburban zones. Supervised classifiers surpass this performance, reaching over 84 % accuracy for urban and rural categories, and 62–65 % for suburban areas. The adjusted threshold technique significantly enhances overall model accuracy, particularly for suburban classification. The good class separation achieved by our model is confirmed through evaluation with NOx and PM2.5 concentration measurements, which were not included in the training data. Furthermore, manual inspection of 25 individual sites with Google Maps reveals that our method provides a better label for the station type than the labels that were reported by data providers and used in the model evaluation. The objective station classification proposed in this paper therefore provides a robust foundation for type-specific air quality assessments in TOAR and elsewhere.
Status: final response (author comments only)
CEC1: 'Comment on egusphere-2025-1399 - No compliance with the policy of the journal', Juan Antonio Añel, 09 May 2025
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived your code on a git in the fz-juelich.de servers. However, neither git nor fz-juelich.de servers are suitable repositories for scientific publication. Also, your code presents multiple dependencies on third-party software, such as libraries. At minimum, you should clearly identify the Python version and the versions of the different libraries that you have used to produce your work, to assure the replicability of your submitted work.
Therefore, the current situation with your manuscript is irregular, as we cannot accept manuscripts in Discussions that do not comply with our policy. Please publish your code and data in one of the appropriate repositories according to our policy and reply as soon as possible to this comment with a modified 'Code and Data Availability' section for your manuscript, which must include the relevant information (link and handle or DOI) of the new repositories, and which you should include in a potentially reviewed manuscript.
I must note that if you do not fix this problem, we will have to reject your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-1399-CEC1
AC1: 'Reply on CEC1', Ramiyou Karim Mache, 14 May 2025
Dear Dr. Juan A. Añel,
Thank you for your feedback and for pointing out the issue.
We have now uploaded all relevant data and code to Zenodo, in compliance with the journal’s Code and Data Policy. The repository is publicly accessible and has been assigned the following DOI:
https://doi.org/10.5281/zenodo.15411285
In addition, we have updated the requirements.txt file to explicitly list the version of Python and all third-party libraries used in our work to ensure full reproducibility.
The 'Code and Data Availability' section of our manuscript has also been updated to reflect this information.
We hope this resolves the issue and look forward to your feedback.
Kind regards,
Karim Mache (on behalf of all co-authors)
Citation: https://doi.org/10.5194/egusphere-2025-1399-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 14 May 2025
Dear authors,
Many thanks for addressing this issue. We can now consider the current version of your manuscript in compliance with the Code and Data Policy of the journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2025-1399-CEC2
RC1: 'Comment on egusphere-2025-1399', Anonymous Referee #1, 25 May 2025
General comments:
This study primarily attempts to address an existing challenge to correctly classify suburban ozone monitoring stations using a novel ensemble of machine learning (ML) methods. It holds promise for accurate classification of stations into urban, suburban and rural categories, which is crucial in understanding ozone measurements and their trends across the globe. However, the manuscript needs work to (1) make the methodology more robust to yield a reliable model with high accuracy, and (2) include sufficient detail to provide more clarity for future reproducibility. Please find more detailed feedback and suggestions below.
Specific comments:
- The data pre-processing and feature selection section lacks sufficient detail on the preprocessing methodology. This section needs to be revised to provide more clarity. Some examples are listed below.
- Lines 81-82: Explain the inherent limitations of the NOx dataset which requires applying this normalization. Elaborate on the Box-Cox method as well.
- Lines 79-80: Provide an excerpt (1-2 lines) about this method for a better understanding of the method.
- Lines 82-83: Explain the “robust scaler” method.
- Lines 88-89: There is a typo in line 88. It should be “we allocated 21,378 samples for model training…”.
- Lines 88-94: There is no reasoning given for why only 1,000 samples were used for testing the ML models (both supervised and unsupervised). As a rule of thumb, datasets are split 80%-20% into train-test sets before training ML models to avoid overfitting the training model. Another commonly used approach is a train-test-validate split of 70-15-15%. The authors should refer to, and base their dataset-splitting reasoning on, the existing literature. The current model in this study might be susceptible to higher misclassification rates between urban, suburban and rural labels for unseen datasets.
- Lines 88-94: The authors should also include k-fold cross-validation (5-fold CV is commonly used) when reporting performance metrics for all ML models. The metrics thus derived are more reliable, as they ensure the model generalizes well to unseen data (a minimal illustration of such a setup is sketched after this list).
- Section 2.3.1: This section is very generalized and needs more study-specific information, such as:
- Lines 113-114: How does Figure 5a suggest that 3 clusters are optimal? Is there a scientific technique to arrive at this conclusion, such as a threshold for the sum of squares, etc.? Include the explanation.
- List the other hyperparameter values as well for all the ML models in this study. For example: “max_iter” for k-means clustering; “criterion” and “max_features” for the random forest classifier, etc.
- Refer to recent studies that have shown a similar classification application to justify why these models were specifically chosen in this study. This is done for the CatBoost classifier in line 133 but is missing for the other models.
- Section 2.3.2: The adjusted threshold probability technique introduced in this study needs some context. Where is the probability derived from? Is it applicable to both supervised and unsupervised learning? Are these probability values an output of the models? If so, explain the steps further so that they can be retraced to reproduce the results (one possible reading of such a rule is sketched after this list).
- Lines 182-185: While it is good to know that the k-means is doing well in labelling urban and rural stations, the main focus of this study is to accurately classify the suburban stations. Therefore, the authors should discuss the classification accuracy of 26.36% for suburban stations (from Table 3) in the text.
- Line 197: Is the confusion matrix in Figure 4a before or after the probability threshold adjustment? Please mention it in the text as well as in the figure title.
- Line 192: What is the “voting” method? It is mentioned a few times but needs to be clearly defined in the methods section (the ensemble sketch after this list illustrates one common form of voting).
- Figure 5: How does this figure, or the inclusion of NOx and PM2.5 concentrations at all, align with the objective of this study? The trend presented in Figure 5 is generally true for both pollutants, but it is not pertinent here. The authors should provide strong reasoning if they still feel the need to include it.
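A minimal, hypothetical sketch of the stratified k-fold cross-validation and voting setup suggested in the comments above. The data arrays, estimator choices, and parameters are placeholders rather than the authors' actual pipeline, and CatBoost/LightGBM are replaced by scikit-learn estimators so the example stays self-contained:

```python
# Sketch only: stratified 5-fold cross-validation of a voting ensemble.
# X and y are synthetic stand-ins for the TOAR station features and
# urban/suburban/rural labels (encoded 0/1/2); they are NOT the TOAR data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))          # placeholder feature matrix
y = rng.integers(0, 3, size=600)       # placeholder class labels

# The manuscript combines random forest, CatBoost and LightGBM; two
# scikit-learn estimators illustrate the same voting idea here.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average class probabilities; "hard" = majority vote
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring="f1_macro")
print(f"macro-F1 per fold: {scores.round(3)}, mean = {scores.mean():.3f}")
```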
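One plausible reading of an adjusted probability threshold, sketched below purely for illustration; the 0.6 threshold, class ordering, and fallback-to-suburban rule are assumptions, not necessarily the rule described in the manuscript:

```python
# Sketch only: reassign low-confidence predictions to "suburban".
# `proba` mimics the per-class probabilities returned by predict_proba.
import numpy as np

CLASSES = np.array(["rural", "suburban", "urban"])  # assumed column order

def adjust_threshold(proba: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Take the most probable class, but fall back to 'suburban' whenever
    the classifier is not sufficiently confident in any single class."""
    labels = CLASSES[np.argmax(proba, axis=1)]
    labels[proba.max(axis=1) < threshold] = "suburban"
    return labels

proba = np.array([
    [0.80, 0.15, 0.05],   # confident -> rural
    [0.40, 0.35, 0.25],   # ambiguous -> reassigned to suburban
    [0.10, 0.20, 0.70],   # confident -> urban
])
print(adjust_threshold(proba))  # ['rural' 'suburban' 'urban']
```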
Citation: https://doi.org/10.5194/egusphere-2025-1399-RC1
RC2: 'Comment on egusphere-2025-1399', Frank Techel, 19 Aug 2025
The manuscript addresses a relevant and practical task for the TOAR-II database: the objective classification of air quality monitoring stations. The methodological approach is solid, transparent, and clearly an improvement over the earlier threshold-based scheme. The availability of open code and data adds value.
The main limitation lies in the reliability of the labels used for training and evaluation. Suburban in particular appears to be a transition category rather than a well-defined class, which sets limits on achievable performance. The manuscript acknowledges inconsistencies between provider-reported and manually checked station types, but this issue requires a clearer and more systematic treatment. The lack of clarity on class distributions, the role of the 10,000 unlabeled stations, and the precise level of agreement between TOAR and manual labels makes it difficult to assess the robustness of the results. In addition, the evaluation framework leans too heavily on accuracy. More suitable metrics (per-class precision/recall, F1, macro-F1, balanced accuracy) and a more systematic threshold optimization would strengthen the conclusions.
Detailed comments
- In the Introduction, define the classes upfront (plain language). Add a short description of urban / suburban / rural in this context, and link it to the features listed in Table 1 so readers understand how these relate to the categories.
- In the Introduction, when outlining the problem and motivating the study in the last paragraph, consider mentioning that almost 10,000 of the 24,000 stations have no label, which calls for an automated approach.
- Inconsistency claim (L36/37). You state that TOAR labels are inconsistent/error-prone. Please provide a reference or a concrete rationale/example for this statement. (You later discuss misalignments; connect those to this claim.)
- Section 2.2: Provide information on the distribution of TOAR labels in the dataset. How many cases are in the suburban class? Are the class distributions approximately balanced?
- L88: Check the numbers. If you remove 1,000 cases, you should end up with 21,378, I believe.
- L88–94: You have 12,408 stations with labels, but you cluster and predict the entire dataset of 22,378 stations. What is being done with the 10,000 stations that have no label? Is the predicted classification of these stations used for something in the manuscript? Please make this more obvious.
- L95–99: This is a very important paragraph, as it addresses the reliability of the TOAR labels. Please report explicitly the level of agreement between TOAR labels and the manual approach. What was the distribution of labels in each classification approach, and how often did the suburban category agree/disagree? Did one or several of the authors (independently?) perform the manual classification? This information is crucial to assess the reliability of the reference labels. What do you mean by “clear decision boundaries”? Did you pick particularly easy stations for the manual labeling exercise? Is the proportion of agreement between TOAR labels and manual labels higher than for the 25 worst predictions (Table 6) (if I counted correctly, 8 out of 25 agree)?
- Figure 1: It looks as if you primarily predict stations outside the U.S. and Europe. Are the orange dots the 1,000 stations used for testing? Or are these the 9,970 stations lacking classification?
- Section 2.3.2, L184, L194 – Adjusted probability threshold: The proposed “uncertainty-based” rule for suburban classification is essentially heuristic. A more standard approach would be threshold optimization via grid search on a validation set, maximizing e.g. macro-F1 or balanced accuracy (a brief sketch of such a search, together with the metrics suggested below, is given after this list). I am not convinced that the chosen approach truly acknowledges the models’ uncertainty between these classes. Instead, the approach simply assigns the middle class. Wouldn’t it have been more transparent to label these as uncertain cases (e.g., rural–suburban, suburban–urban)? This would have allowed the quality of predictions for “in-between” cases to be assessed separately from those where labels and predictions should be more reliable.
- Section 2.3.3 – Performance metrics: The analysis relies heavily on per-class accuracy. While accuracy is intuitive, it is not an appropriate sole metric under class imbalance and noisy labels (for instance when describing “global accuracy”). In this case, urban and rural likely dominate (numbers are missing), while suburban is both smaller and potentially much noisier. More suitable and widely accepted measures include per-class precision, recall, F1, macro-F1, and balanced accuracy.
- Figures 2 and 3: Please introduce PCA as a method to visualize results in the Methods section.
- Table 2 – interpretation of TOAR vs manual labels: How do you explain that models trained and tested with TOAR labels tend to achieve better performance statistics than with manual labels? Could this indicate that the manually derived labels are themselves noisy, or that they reflect a different conceptual perspective? A more explicit discussion of this is needed.
- L191, L207, Tables 4 and 5: The so-called “voting procedure” is not well explained. From the text it appears to be simple majority voting across three classifiers. If so, please state this explicitly in the Methods section, and later discuss whether the gain is meaningful relative to the best-performing individual model.
- L220–229: How did you decide on the 25 worst misclassifications? What is the gold standard for this analysis? Is it the manual labels? From Table 2, I wonder whether the manual labels are more reliable than the TOAR labels. A fair comparison, in my view, would involve both a comparison against TOAR and against the manual labels. Discuss that TOAR and manual labels also showed low agreement in this case.
- L230: Use of pollutant concentrations. The evaluation with NOx and PM₂.₅ is a useful plausibility check. However, the choice of the 75th percentile is not self-evident. Please justify why the 75th percentile was chosen (e.g., is this a standard in this field or intended to capture robustly elevated concentrations while avoiding sensitivity to extremes?).
- Figures: I suggest removing the grey background. Check with journal guidelines, but I believe (a) and (b) should be shown above the plot (beside the plot title, if the journal allows the latter).
- Discussion: Assuming that the classification approach is supposed to be implemented in the database, I suggest summarizing key facts for users of these labels. What works, what does not? Following your analysis, do you see any points that could be improved regarding the definition of the three classes?
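A brief, hypothetical sketch of the threshold optimization and balanced metrics suggested above. The variable names (y_val, proba_val, y_test, proba_test), the candidate grid, and the fallback-to-suburban rule are illustrative assumptions, not the authors' implementation:

```python
# Sketch only: choose the suburban fallback threshold on a validation set by
# maximizing macro-F1, then report balanced metrics on a held-out test set.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, classification_report, f1_score

CLASSES = np.array(["rural", "suburban", "urban"])  # assumed column order

def predict_with_threshold(proba, threshold):
    labels = CLASSES[np.argmax(proba, axis=1)]
    labels[proba.max(axis=1) < threshold] = "suburban"
    return labels

def optimize_threshold(y_val, proba_val, grid=np.linspace(0.40, 0.80, 41)):
    scores = [f1_score(y_val, predict_with_threshold(proba_val, t), average="macro")
              for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# Assumed usage once y_val/proba_val and y_test/proba_test are available:
# best_t, best_f1 = optimize_threshold(y_val, proba_val)
# y_pred = predict_with_threshold(proba_test, best_t)
# print(classification_report(y_test, y_pred))      # per-class precision/recall/F1
# print(balanced_accuracy_score(y_test, y_pred))    # balanced accuracy
```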
I trust these comments will be useful in revising and improving the manuscript.
Citation: https://doi.org/10.5194/egusphere-2025-1399-RC2
Data sets
TOAR-Classifier input and output data Ramiyou Karim Mache, Sabine Schröder, Michael Langguth, Ankit Patnala, and Martin G. Schultz https://gitlab.jsc.fz-juelich.de/esde/toar-public/ml_toar_station_classification
Model code and software
TOAR-Classifier model code Ramiyou Karim Mache, Sabine Schröder, Michael Langguth, Ankit Patnala, and Martin G. Schultz https://gitlab.jsc.fz-juelich.de/esde/toar-public/ml_toar_station_classification