Preprints
https://doi.org/10.5194/egusphere-2025-450
16 May 2025

Data clustering to optimise the representativity of observational data in air quality data assimilation: a case study with EURAD-IM (version 5.9.1 DA)

Alexander Hermanns, Anne Caroline Lange, Julia Kowalski, Hendrik Fuchs, and Philipp Franke

Abstract. In air quality analysis, data assimilation is commonly used to integrate observational information on the atmospheric state into a model. However, the analysis depends largely on the data available to the assimilation system. To obtain an accurate analysis of the true state of the atmosphere, the representativity of the data used is a fundamental requirement. Here, a method is presented that derives a representative split of ground-based monitoring network data and depends only on the characteristics of the observations. The core of the method is a clustering algorithm that subdivides the data into subsets. Two clustering algorithms, k-means and k-means soft constraint, are tested and applied to air pollutant observations in Europe. The clusters are derived solely from observation-intrinsic properties (such as geographic location and averaged concentrations). The resulting clusters reliably distinguish common features of the observational data, e.g. the mean and variance of averaged air pollutant concentrations. Representativity of the observational data in the assimilation and validation subsets is ensured by sampling each cluster individually. The method is tested using the assimilation system of the chemistry transport model EURAD-IM (EURopean Air pollution Dispersion – Inverse Model) and evaluated for data from four months in 2016. The results show a significant improvement in the representativity of the analysis, quantified by the difference between the analysis' root mean square error with respect to the assimilation dataset and that with respect to the validation dataset. Compared to an operational configuration, the largest improvements in the representativity measure are 53 % for CO, 50 % for NO2, and 18 % for O3.
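The cluster-then-sample idea described in the abstract can be illustrated with a minimal sketch. This is not the preprocessor used with EURAD-IM: it uses plain Lloyd's k-means only (the soft-constraint variant is not reproduced), and the feature set (longitude, latitude, mean concentration), the number of clusters, the validation fraction, and all function names and station values are assumptions made up for illustration.

```python
import random
from collections import defaultdict


def standardize(points):
    """Scale each feature to zero mean and unit variance (assumed preprocessing)."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(p, means, stds)) for p in points]


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_by_cluster(station_ids, labels, validation_fraction=0.2, seed=0):
    """Sample every cluster separately so both subsets cover all clusters."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sid, lab in zip(station_ids, labels):
        clusters[lab].append(sid)
    assimilation, validation = [], []
    for lab in sorted(clusters):
        members = clusters[lab][:]
        rng.shuffle(members)
        n_val = round(validation_fraction * len(members))
        validation.extend(members[:n_val])
        assimilation.extend(members[n_val:])
    return assimilation, validation


# Synthetic stations: (longitude, latitude, mean concentration), all made up.
data_rng = random.Random(1)
station_ids = [f"station_{i:03d}" for i in range(100)]
raw_features = [
    (data_rng.uniform(-10, 30), data_rng.uniform(35, 70), data_rng.uniform(5, 60))
    for _ in station_ids
]
labels = kmeans(standardize(raw_features), k=4)
assimilation_ids, validation_ids = split_by_cluster(station_ids, labels, 0.2)
print(len(assimilation_ids), "assimilation stations,",
      len(validation_ids), "validation stations")
```

Because the split is drawn per cluster rather than from the whole network at once, each cluster contributes roughly the same share of its stations to the validation subset, which is the property the abstract relies on for representativity.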

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2025-450', Anonymous Referee #1, 17 Jun 2025
  • RC2: 'Comment on egusphere-2025-450', Anonymous Referee #2, 18 Jun 2025
  • RC3: 'Comment on egusphere-2025-450', Anonymous Referee #3, 22 Jun 2025

Viewed

Total article views: 1,037 (including HTML, PDF, and XML)
  • HTML: 963
  • PDF: 55
  • XML: 19
  • Total: 1,037
  • BibTeX: 18
  • EndNote: 27
Views and downloads calculated since 16 May 2025.

Latest update: 14 Sep 2025
Short summary
For air quality analyses, data assimilation systems split the available observations into assimilation and validation data sets. The former is used to generate the analysis, the latter to verify the simulations. A preprocessor that classifies the observations by their data characteristics is developed based on clustering algorithms. The assimilation and validation data sets are compiled by allocating data from each cluster equally. The resulting improvement of the analysis is evaluated with EURAD-IM.
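The abstract quantifies representativity as the difference between the analysis' root mean square error with respect to the assimilation data and with respect to the validation data. A minimal sketch of that measure follows; taking the absolute value of the difference, the function names, and the input layout (analysis values already interpolated to the station locations) are assumptions for illustration, not details given on this page.

```python
import math


def rmse(analysis, observed):
    """Root mean square error between analysis values and observations."""
    if len(analysis) != len(observed) or not analysis:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((a - o) ** 2 for a, o in zip(analysis, observed)) / len(analysis)
    )


def representativity_gap(analysis_assim, obs_assim, analysis_val, obs_val):
    """Absolute RMSE difference between validation and assimilation subsets.

    A smaller gap means the analysis fits unseen (validation) stations about
    as well as the stations it assimilated. Using the absolute value is an
    assumption of this sketch.
    """
    return abs(rmse(analysis_val, obs_val) - rmse(analysis_assim, obs_assim))


# Made-up numbers: the analysis is off by 1 at assimilated stations
# and by 2 at validation stations.
gap = representativity_gap([1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [0.0, 0.0])
print(gap)
```

A gap near zero indicates that the assimilation and validation subsets sample the same population of stations; a large gap suggests the two subsets differ systematically.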