the Creative Commons Attribution 4.0 License.
Data clustering to optimise the representativity of observational data in air quality data assimilation: a case study with EURAD-IM (version 5.9.1 DA)
Abstract. In the field of air quality analysis, data assimilation is commonly used to integrate information on the atmospheric state provided by observations into the model. However, the analysis is largely dependent on the data available to the assimilation system. In order to obtain an accurate analysis of the true state of the atmosphere, the representativity of the utilized data becomes a fundamental requirement. Here, a method is presented that derives a representative split of the ground-based monitoring network data that depends only on the characteristics of the observation data. The core of the method is a clustering algorithm that subdivides the data into subsets. Two clustering algorithms, k-means and k-means soft constraint, are tested and applied to air pollutant observations in Europe. The clusters are derived solely from intrinsic properties of the observations (such as geographic location and averaged concentrations). The resulting clusters reliably distinguish common features of the observational data, e.g. mean and variance of averaged air pollutant concentrations. Representativity of the observational data in the assimilation and validation subset is ensured by sampling each cluster individually. The method is tested using the assimilation system of the chemistry transport model EURAD-IM (EURopean Air pollution Dispersion – Inverse Model) and evaluated for data from four months in 2016. A significant improvement of the analysis' representativity, quantified by the difference between the analysis' root mean square error with respect to the assimilation and validation dataset, is found in the results. Compared to an operational configuration, the largest improvement in the representativity measure is found for CO with 53 %, for NO2 with 50 %, and for O3 with 18 %.
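The per-cluster sampling step described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `split_by_cluster`, the station identifiers, the cluster labels, and the default assimilation fraction of 0.7 are all assumptions made for the example. The cluster labels are taken as given (for instance from a k-means run on station features), and each cluster is shuffled and split on its own so that both subsets contain stations from every cluster.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_labels, assim_fraction=0.7, seed=0):
    """Split stations into assimilation and validation sets, cluster by cluster.

    cluster_labels: dict mapping a station id to its cluster label
    (hypothetical input, e.g. the result of a clustering step).
    assim_fraction: share of each cluster assigned to assimilation (assumed value).
    """
    rng = random.Random(seed)

    # Group station ids by their cluster label.
    members = defaultdict(list)
    for station, label in cluster_labels.items():
        members[label].append(station)

    assimilation, validation = [], []
    for label in sorted(members):
        stations = sorted(members[label])  # fixed order before shuffling
        rng.shuffle(stations)
        n_assim = round(assim_fraction * len(stations))
        assimilation.extend(stations[:n_assim])
        validation.extend(stations[n_assim:])
    return assimilation, validation


# Illustrative data: 30 made-up stations spread evenly over 3 clusters.
labels = {f"S{i:02d}": i % 3 for i in range(30)}
assim, valid = split_by_cluster(labels)
```

Because the split is drawn inside each cluster, a cluster with few stations still contributes to both subsets in roughly the same proportion as a large one, which is the property the abstract relies on.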
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2025-450', Anonymous Referee #1, 17 Jun 2025
Data clustering to optimise the representativity of observational data in air quality data assimilation: a case study with EURAD-IM (version 5.9.1 DA)
Alexander Hermanns et al.
GENERAL COMMENT
The manuscript describes a method to distribute observation time series over clusters, and to use this within the context of data assimilation. The method is applied with the regional air quality model EURAD-IM, which assimilates time series of surface observations to guide the model. Specifically, the clustering is used to sub-divide the observation time series into an "assimilation" subset (~70%, incorporated in the assimilation) and a "validation" subset (~30%, not incorporated). The posterior comparison between analyzed model state and observations should give the same statistics over the "assimilation" and "validation" set, but as the manuscript also shows, the assimilation usually performs better over the "assimilation" set. The proposed clustering method improves the equality between the statistics over the "assimilation" and "validation" set, and is therefore of interest for all data-assimilation applications.
The clustering method is well described, and easy to follow also for readers without a background in clustering. The application is illustrated for the European air quality network. Especially the maps in Figure 2 and bar plots in Figure 3 are useful here, as they illustrate the result of the clustering and how it was achieved. The 8 clusters obtained with the KSC method show, for example, the soft borders between geographical regions, which could not be obtained by simply clustering countries. As described by the authors, the map obtained with KSC shares characteristics with climate zones, which gives trust that the obtained clustering is also related to geographic properties.
The improvement in assimilation/validation (AV) statistics is illustrated based on comparison with CO and NO2 observations (Table 1). In general, RMSE over the assimilation set increases, while the RMSE over the validation set decreases, thus decreasing what is called here the AV-difference. This is a very important result, and shows the usefulness of the method. However, as table C1 shows, the AV-difference increases for most other considered species (SO2, PM2.5, and PM10), and depending on the clustering method, also for O3. These species are rather important for air quality, and one could argue that these are even more important than CO. Therefore, the method seems not immediately applicable yet in for example the CAMS assimilations in which EURAD-IM is included. Could the authors include a discussion on how the clustering could be improved such that the AV-difference is decreased for all chemical species? Are different features of the timeseries needed, for example based on rural/urban locations? Or should for example CO simply be excluded? For the current manuscript it is not necessary to add and evaluate new clustering configurations, but it would be useful to see some guidance for future work.
SPECIFIC COMMENT
Line 184: Could the method used by CAMS to distribute observations over the assimilation and validation sets be summarized here? At lines 326-327 an essential difference is discussed; it might be useful to mention that earlier too.
The data processing requires many steps, for example outlier removal (lines 144-155), but also removal of stations with extreme emission corrections in their vicinity (lines 216-221). It would be useful to summarize all selection criteria in, for example, a table, including the number (fraction) of removed stations.
The KSC clustering is applied using a location feature, which gives a result that collects stations in geographic regions (adjacent countries and/or regions in countries). The map in the right panel of Figure 2 shows that within such a cluster there are sometimes small regions with a different classification, for example the Pyrenees are part of cluster 7. Would it make sense to add features based on these "exceptions", for example the altitude of a station?
SPELLING AND GRAMMAR
Lines 99 and 106: "k" should be "K" as in Figure 1?
Line 126: should be "... some objects, $F_m$, such that ..."
Line 225: should this be a reference to Fig. A1?
Line 240: "the Alps"
Line 253: remove commas?
Line 304: ".. month .."
Citation: https://doi.org/10.5194/egusphere-2025-450-RC1
AC1: 'Reply on RC1', Alexander Hermanns, 08 Aug 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-450/egusphere-2025-450-AC1-supplement.pdf
-
RC2: 'Comment on egusphere-2025-450', Anonymous Referee #2, 18 Jun 2025
## Paper summary
This manuscript discusses a method for sub-sampling observational data in the context of air quality data assimilation, which requires splitting the observational data into two datasets, one for assimilation and one for validation. The authors propose to use clustering algorithms to improve the representativity of the observations during such a sub-sampling. Their methodology has two practical advantages: on the one hand, it is independent from the assimilation model, and on the other hand, it only requires observational data as inputs.
To evaluate the benefits of their clustering approach, the authors introduce an AV-difference (assimilation/validation) metric, which is the difference between the RMSE (Root-Mean-Square Error) of the model w.r.t. the assimilation dataset and the RMSE of the model w.r.t. the validation dataset. As such, an AV-difference of zero is synonymous with perfect representativity, while a high AV-difference suggests overfitting by the model.
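As a rough illustration of the metric summarized above, the sketch below computes an AV-difference from paired analysis and observation values. It rests on assumptions that the text here does not settle: the sign convention (validation RMSE minus assimilation RMSE, so that a positive value points to overfitting), the absence of any normalisation, and the helper names `rmse` and `av_difference` are all chosen for the example only. The numbers are made up.

```python
import math


def rmse(analysis, observations):
    """Root-mean-square error between paired analysis and observation values."""
    if not analysis or len(analysis) != len(observations):
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - o) ** 2 for a, o in zip(analysis, observations)]
    return math.sqrt(sum(squared) / len(squared))


def av_difference(analysis_assim, obs_assim, analysis_valid, obs_valid):
    """Validation RMSE minus assimilation RMSE (sign convention assumed here)."""
    return rmse(analysis_valid, obs_valid) - rmse(analysis_assim, obs_assim)


# Made-up values in arbitrary concentration units.
av = av_difference(
    analysis_assim=[10.0, 12.0], obs_assim=[11.0, 11.0],  # errors -1, +1
    analysis_valid=[10.0, 14.0], obs_valid=[13.0, 11.0],  # errors -3, +3
)
```

In this toy case the assimilation RMSE is 1 and the validation RMSE is 3, so the AV-difference is 2 in the same units as the concentrations; a value near zero would indicate that the analysis fits both subsets equally well.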
Using an operational CAMS assimilation/validation configuration for year 2016 as a reference, the authors apply their approach on observations over Europe for four months of year 2016 (January, March, June and September, picked for their seasonal representativity), and demonstrate a significant decrease in the AV-difference for several pollutant species, most notably carbon monoxide (53%), nitrogen dioxide (50%) and ozone (18%). The improvement is particularly interesting in the case of carbon monoxide, due to the scarcity of the observations compared to other pollutant species (such as ozone).
## General comments
The core content of the manuscript is interesting and provides some convincing results, and the authors did a nice job in presenting the K-means clustering algorithm and its soft constraint variant, and how they adapted their problem to both. This said, this manuscript could be improved in terms of presentation and could elaborate on a few points to ensure the final paper is compelling to all.
### Possible presentation improvements
I found that the AV-difference metric was very important to understand the paper and its contributions, yet it's defined quite late in the manuscript (L193). The Introduction does make a review of the state-of-the-art in this regard, but only states that the present study will improve representativity through clustering without hinting at how it will measure it. A few sentences (if not a single one) in the Introduction to give the big picture may be enough.
The manuscript could also benefit from a few more figures to support its content. A possible addition could be a flow-chart in the Introduction, summarizing the proposed methodology (e.g., observations going through the clustering to be split into the two datasets, fed to a data assimilation model like EURAD-IM). Such a flow-chart would not only summarize the overall methodology to the reader in a single figure, but could also be used to picture its advantages in terms of input/output.
I would also recommend moving Figure A1 from the Appendix back to the main body. Indeed, Figure A1 gives a very clear picture of how scarce CO observations are with respect to other pollutant species. Including it in the main body and making a few more references to it would strengthen the conclusion that the proposed clustering methodology significantly improves representativity of CO in the framework of air quality data assimilation.
Finally, on a side note, I would recommend using a gridded layout for most of the figures, especially line plots.
### Questions regarding the content
1) Are there particular reasons for only simulating four months of 2016? I get the seasonality argument regarding the choice of the months, but why not simulate the entire year?
2) More broadly, it would be interesting to develop the seasonality of the results. At L276, there is this mention:
>Furthermore, while the evaluation shows fluctuations for each season, the general result holds true for each season individually.
but this is not enough to convince the reader about the seasonal trends of the results, especially considering point 1) (i.e., no full seasons, only sample months) and considering there is no figure or table detailing seasonal results. If these trends are indeed not significant, maybe a single table or figure would be enough to demonstrate that.
3) What about the slightly worse results for KSC in Tables C1 and D1 (Appendices C and D)? Should we worry about them or are they small enough to be ignored? While there is, indeed, an order of magnitude of difference between these results and those for carbon monoxide, additional details could show decisively whether or not the slightly higher AV differences are problematic, and at the very least, why the current manuscript does not elaborate further on them.
* For instance, the slightly higher AV-difference for ozone in D1 is probably not much of an issue given the thresholds for air quality. E.g., below 80 µg per cubic meter of ozone is considered to be good per CAMS, so 1.1 µg per cubic meter of AV-difference remains negligible. However, the reader does not necessarily know about such orders of magnitude depending on the species.
* As far as I'm concerned, I would also be interested in learning if the given AV-differences are constant throughout each year (i.e. 2016 or 2017), or if they depend on the season, if not the day. A plot of the AV-difference throughout the year for each species may be enough to address this concern.
## Specific comments
>L91: An overview of the geographic distribution of the available observation stations for each species is shown in the appendix in Fig. B2.
The reference seems to be wrong; the geographic distribution is shown in Fig. A1. Note that this overlaps with a previous comment on moving such figure back to the main text.
>L128: "Is is termed to be violated [...]"
This looks like a typo. Shouldn't it be "It is termed to be violated..."? Anyway, the full sentence is a bit unclear. What does "violated" mean precisely in this context? Does it mean the sigma term only makes sense when the objects are assigned to distinct clusters? Please clarify.
Citation: https://doi.org/10.5194/egusphere-2025-450-RC2
AC2: 'Reply on RC2', Alexander Hermanns, 08 Aug 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-450/egusphere-2025-450-AC2-supplement.pdf
-
RC3: 'Comment on egusphere-2025-450', Anonymous Referee #3, 22 Jun 2025
Data clustering has the potential to improve the representativity of data assimilation results. This is shown by Hermanns and co-authors in their paper. This is an interesting idea and would be something to incorporate in, for instance, the European CAMS ensemble analysis and reanalyses. However, the paper also raised multiple questions and I am not yet convinced that the potential of the method has been fully exploited. In my opinion a major revision is needed in response to my major and detailed comments.
Major comments:
The results for PM2.5/10 (and ozone) should also be shown and should be discussed in more detail. Why is the result so different? Can this be understood? In the abstract, last sentence, improvements are reported for NO2, O3, CO: Please report the PM results as well.
The motivation for, and introduction to, the clustering approach can be improved. In particular I was wondering why the diurnal cycle is used as a property to distinguish stations for all species. The diurnal cycle of ozone is large during summertime photochemical smog events, when ozone builds up during the day. For other species (like NO2, PM) the diurnal cycle may have a very different interpretation, e.g. rush hour emission peaks or development of the PBL. The effectiveness of this choice may be quite different in summer and winter. Apart from the diurnal cycle, are there other properties that may be used for the clustering? Please add a discussion to answer these questions, and also add the seasonal results.
Is the comparison with CAMS a fair comparison? Are the same stations used in both cases? The CAMS REF experiment is not well described in the paper and I have the impression that far fewer stations are used by CAMS. Showing the distributions of assimilation and validation stations in all experiments could be a useful extra plot.
The European stations also come with a site classification (rural/urban, background/traffic/industry). This by itself can be seen as a clustering. In CAMS the Joly-Peuch site classification is used (https://doi.org/10.1016/j.atmosenv.2011.11.025). Again, the categories 1-10 of Joly-Peuch are also a form of clustering. Please add this reference and discuss the relation to the present study.
The EURAD assimilation, if I understand correctly, has been run at a 15 km resolution. Sites near busy roads or near industries will not be well represented at this resolution. A pre-filtering would be good, as is done in CAMS. On the contrary, from the paper I get the impression that all EEA sites are used by the authors. Please explain and motivate this choice.
The filtering that needs to be used to account for unrealistic results in EURAD (lines 216-221) gives an uneasy feeling. Please add evidence that the EURAD system is working well overall, with reasonable increments and good reductions of the RMSE differences with the stations in the analysis. How has EURAD been tested?
I did not find the 2017 results fully satisfactory. One would hope that the clusters are quite robust and do not change much from year to year.
Introduction:
- Introduction: Please add more references on air quality data assimilation activities. For instance the CAMS ensemble activity is relevant for this paper. Here also a split in observation and validation sites is applied (l 27).
- Introduction (l 31). Classification of surface sites is a topic related to the current paper. The paper of Joly-Peuch is a basis for the CAMS work and is relevant to discuss. Also the methodology of EEA should be mentioned (with reference).
- l 58: Please provide a motivation why the diurnal cycle is used.
- l 80: (IFS) please mention the European Centre of Medium-Range Weather Forecasts
- l 87: There is no pre-selection made for the stations? Would it be better to remove roadside traffic stations?
- l 92: "An overview of the geographic distribution of the available observation stations for each species is shown in the appendix in Fig. B2" Do you mean A1?
- Figure 1 is not a figure. A table could be an option, or a listing in the text would be possible as well.
- l 104: Normalised: how? Also units should be removed for this formula to make sense (e.g. concentration).
- l 152: What is the standard deviation of the mean and variance of the diurnal cycle? A formula would be good to be precise on what is computed.
- l 184: "REF experiment" Please provide more details on what this is and how it compares. Does CAMS include a similar number of stations? CAMS uses the Joly-Peuch classification to pre-select the stations that are compared to the models.
- l 193: The AV-difference: is this normalised? Does this have a unit (e.g. ug/m3)?
- l 194: "split into two different observation configurations" This was unclear to me. Why is this done, and how are these two constructed?
- l 203: Is the RMSE computed for individual model/measurement pairs (hourly observations)?
- Fig. B2: please add diurnal cycles for all species.
Citation: https://doi.org/10.5194/egusphere-2025-450-RC3
AC3: 'Reply on RC3', Alexander Hermanns, 08 Aug 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-450/egusphere-2025-450-AC3-supplement.pdf