Preprints
https://doi.org/10.5194/egusphere-2025-450
16 May 2025

Data clustering to optimise the representativity of observational data in air quality data assimilation: a case study with EURAD-IM (version 5.9.1 DA)

Alexander Hermanns, Anne Caroline Lange, Julia Kowalski, Hendrik Fuchs, and Philipp Franke

Abstract. In the field of air quality analysis, data assimilation is commonly used to integrate information on the atmospheric state provided by observations into the model. However, the analysis depends strongly on the data available to the assimilation system. To obtain an accurate analysis of the true state of the atmosphere, the representativity of the utilized data is a fundamental requirement. Here, a method is presented that derives a representative split of the ground-based monitoring network data and depends only on the characteristics of the observation data. The core of the method is a clustering algorithm that subdivides the data into subsets. Two clustering algorithms, k-means and k-means soft constraint, are tested and applied to air pollutant observations in Europe. The clusters are derived solely from observation-intrinsic properties (such as geographic location and averaged concentrations). The resulting clusters reliably distinguish common features of the observational data, e.g. mean and variance of averaged air pollutant concentrations. Representativity of the observational data in the assimilation and validation subsets is ensured by sampling each cluster individually. The method is tested using the assimilation system of the chemistry transport model EURAD-IM (EURopean Air pollution Dispersion – Inverse Model) and evaluated for data from four months in 2016. The results show a significant improvement of the analysis' representativity, quantified by the difference between the analysis' root mean square error with respect to the assimilation dataset and with respect to the validation dataset. Compared to an operational configuration, the largest improvements in the representativity measure are 53 % for CO, 50 % for NO2, and 18 % for O3.
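The representativity measure named in the abstract, the difference between the analysis' root mean square error (RMSE) against the assimilation data and against the validation data, can be illustrated with a minimal sketch. The function names, the sign convention (validation minus assimilation), and the toy numbers are assumptions made for illustration and are not taken from EURAD-IM or the paper.

```python
import math


def rmse(analysis, observations):
    """Root mean square error between analysis values and observations."""
    if len(analysis) != len(observations) or not analysis:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - o) ** 2 for a, o in zip(analysis, observations)]
    return math.sqrt(sum(squared) / len(squared))


def representativity_gap(analysis_assim, obs_assim, analysis_val, obs_val):
    """RMSE against validation data minus RMSE against assimilation data.

    Assumed sign convention: a value near zero means the analysis fits
    both subsets about equally well.
    """
    return rmse(analysis_val, obs_val) - rmse(analysis_assim, obs_assim)


# Made-up concentrations at hypothetical stations, for illustration only.
gap = representativity_gap(
    analysis_assim=[2.0, 4.0], obs_assim=[1.0, 5.0],
    analysis_val=[2.0, 4.0], obs_val=[0.0, 6.0],
)
print(gap)
```

Under this convention, a smaller gap indicates that the two subsets sample the network more alike.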


Journal article(s) based on this preprint

03 Dec 2025
Data clustering to optimise the representativity of observational data in air quality data assimilation: a case study with EURAD-IM (version 5.9.1 DA)
Alexander Hermanns, Anne Caroline Lange, Julia Kowalski, Hendrik Fuchs, and Philipp Franke
Geosci. Model Dev., 18, 9417–9432, https://doi.org/10.5194/gmd-18-9417-2025, 2025

Interactive discussion

Status: closed

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2025-450', Anonymous Referee #1, 17 Jun 2025
  • RC2: 'Comment on egusphere-2025-450', Anonymous Referee #2, 18 Jun 2025
  • RC3: 'Comment on egusphere-2025-450', Anonymous Referee #3, 22 Jun 2025

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Alexander Hermanns on behalf of the Authors (08 Aug 2025)
ED: Referee Nomination & Report Request started (23 Aug 2025) by Yongze Song
RR by Anonymous Referee #3 (18 Sep 2025)
RR by Anonymous Referee #1 (26 Sep 2025)
ED: Publish subject to technical corrections (15 Oct 2025) by Yongze Song
AR by Alexander Hermanns on behalf of the Authors (27 Oct 2025)

Viewed

Total article views: 2,082 (including HTML, PDF, and XML)
  • HTML: 1,961
  • PDF: 88
  • XML: 33
  • Total: 2,082
  • BibTeX: 30
  • EndNote: 40

Viewed (geographical distribution)

Total article views: 1,983 (including HTML, PDF, and XML) Thereof 1,983 with geography defined and 0 with unknown origin.
 
Latest update: 03 Dec 2025

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary
For air quality analyses, data assimilation models split available data into assimilation and validation data sets. The former is used to generate the analysis, the latter to verify the simulations. A preprocessor classifying the observations by the data characteristics is developed based on clustering algorithms. The assimilation and validation data sets are compiled by equally allocating data of each cluster. The resulting improvement of the analysis is evaluated with EURAD-IM.
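The idea summarised above, clustering the observations and then drawing assimilation and validation data from every cluster, can be sketched as follows. This is a minimal illustration with plain k-means in pure Python, not the authors' preprocessor: the soft-constraint variant is not shown, and the feature choice, the validation fraction of 25 %, the function names, and the station values are all assumptions. In practice the features would likely need scaling before clustering, since coordinates and concentrations have different units.

```python
import random
from collections import defaultdict


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Sample every cluster separately into assimilation and validation indices."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)
    assimilation, validation = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        n_val = round(len(members) * validation_fraction)
        validation.extend(members[:n_val])
        assimilation.extend(members[n_val:])
    return sorted(assimilation), sorted(validation)


# Hypothetical stations: (longitude, latitude, mean concentration), made-up values.
stations = [
    (6.1, 50.8, 12.0), (6.3, 50.9, 14.0), (6.2, 50.7, 13.0), (6.4, 50.6, 11.0),
    (13.3, 52.5, 41.0), (13.4, 52.4, 39.0), (13.5, 52.6, 43.0), (13.2, 52.5, 40.0),
]
labels = kmeans(stations, k=2)
assimilation_idx, validation_idx = stratified_split(labels)
print(assimilation_idx, validation_idx)
```

Because each cluster contributes the same fraction of its members to the validation set, both subsets cover the station types found by the clustering in similar proportions.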