Machine learning data fusion for high spatio-temporal resolution PM_2.5

Porcheddu, Andrea; Kolehmainen, Ville; Lähivaara, Timo; Lipponen, Antti

doi:https://doi.org/10.5194/egusphere-2024-4056

Preprints

14 Feb 2025

Abstract. Understanding PM_2.5 variability at fine scale is crucial for assessing the impact of urban pollution on the population and for informing policy-making. In-situ PM_2.5 measurements at ground level cannot offer gapless spatial coverage, while current satellite retrievals generally cannot offer both high spatial and high temporal resolution, with night-time estimation posing further challenges. This study tackles these difficulties by introducing an innovative deep learning data fusion method to estimate hourly PM_2.5 maps at 100 m resolution over urban areas. We combine low-resolution geophysical model data, high-resolution geographical indicators, in-situ PM_2.5 measurements from ground stations, and PM_2.5 retrieved at satellite overpass. To treat spatial and temporal correlations in our data simultaneously, we deploy a neural network model based on a 3D U-Net. To evaluate the model, we select the city of Paris, France, in the year 2019 as our study region and period. Quantitative assessment of the model is carried out using the ground station data with a leave-one-out cross-validation approach. Our method outperforms MERRA-2 PM_2.5 estimates, predicting PM_2.5 at hourly (R² = 0.51, RMSE = 6.58 μg/m³), daily (R² = 0.65, RMSE = 4.92 μg/m³), and monthly (R² = 0.87, RMSE = 2.87 μg/m³) scales. The proposed approach and its possible future developments can be highly beneficial for PM_2.5 exposure and regulation studies at fine suburban scale.
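
The evaluation described in the abstract (leave-one-out cross-validation over ground stations, scored with R² and RMSE at several time scales) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the station count of 11 is taken from the referee comments, and the `mean_of_others` predictor is a hypothetical stand-in, not the 3D U-Net of the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def leave_one_station_out(obs, fit_predict):
    """Predict each station's series using only the other stations.

    obs: array of shape (n_stations, n_hours).
    fit_predict: callable taking the training stations' observations and
        returning a (n_hours,) prediction for the held-out station.
    """
    preds = np.empty_like(obs, dtype=float)
    for s in range(obs.shape[0]):
        train = np.delete(obs, s, axis=0)  # drop the held-out station
        preds[s] = fit_predict(train)
    return preds

def mean_of_others(train_obs):
    """Hypothetical placeholder model: hourly mean of the remaining stations."""
    return train_obs.mean(axis=0)

# Synthetic hourly PM2.5 (ug/m3): a shared diurnal cycle plus station noise.
rng = np.random.default_rng(0)
n_stations, n_hours = 11, 24 * 30
regional = 15 + 5 * np.sin(np.arange(n_hours) * 2 * np.pi / 24)
obs = regional + rng.normal(0, 2, size=(n_stations, n_hours))

preds = leave_one_station_out(obs, mean_of_others)

# Hourly scores are computed over all held-out station-hours.
hourly_r2 = r2(obs.ravel(), preds.ravel())
hourly_rmse = rmse(obs.ravel(), preds.ravel())

# Daily scores: average each block of 24 hours before scoring.
obs_daily = obs.reshape(n_stations, -1, 24).mean(axis=2)
preds_daily = preds.reshape(n_stations, -1, 24).mean(axis=2)
daily_rmse = rmse(obs_daily.ravel(), preds_daily.ravel())

print(f"hourly R2={hourly_r2:.2f}, RMSE={hourly_rmse:.2f}; daily RMSE={daily_rmse:.2f}")
```

In this toy setting the daily RMSE is lower than the hourly one because averaging suppresses independent hourly noise, which mirrors the direction of the hourly, daily, and monthly scores quoted in the abstract without reproducing them.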

Received: 20 Dec 2024 – Discussion started: 14 Feb 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 3121 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

25 Sep 2025

Machine learning data fusion for high spatio-temporal resolution PM_2.5

Andrea Porcheddu, Ville Kolehmainen, Timo Lähivaara, and Antti Lipponen

Atmos. Meas. Tech., 18, 4771–4789, https://doi.org/10.5194/amt-18-4771-2025, 2025

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2024-4056', Anonymous Referee #1, 31 Mar 2025

This study integrates multi-source data, including satellite and ground-based station data, to construct a deep learning model for estimating 24-hour high-resolution PM2.5 data. High spatiotemporal resolution PM2.5 mapping is of significant importance for pollution control and decision-making, and this study represents a useful attempt in this field. However, the following issues need to be addressed:

The study aims to estimate 24-hourly PM2.5 maps at 100 m resolution in urban areas. However, as shown in Table A1, most of the input data have resolutions coarser than 100 m, except for OpenStreetMap roads and DEM data, which are not directly related to PM2.5. How do the authors justify that the estimated PM2.5 resolution truly reaches 100 m?

The paper presents a deep learning-based estimation approach, but the description of the methodology remains unclear. First, Lines 148–149 mention that "The output is a 3-dimensional array containing 24 hourly PM2.5 maps," but Lines 159–160 state that "the output layer is a 3D 1x1x1 convolution," which appears contradictory and should be clarified. Second, the construction of the loss function is confusing—it should ideally be constrained by PM2.5 measurements from ground stations and NOODLESALAD PM2.5, but its current formulation appears overly complex and difficult to understand.
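
One possible reading under which the two quoted statements do not conflict is sketched below. It is an assumption for illustration, not something confirmed on this page: a 1x1x1 convolution acts only across feature channels, so applying it to a feature volume with 24 time steps leaves the time and space dimensions untouched and yields 24 hourly maps. The array sizes and names are hypothetical.

```python
import numpy as np

def conv3d_1x1x1(features, weights, bias):
    """Apply a 1x1x1 3D convolution as a per-voxel linear map over channels.

    features: (C_in, T, H, W) feature volume.
    weights:  (C_out, C_in) kernel (the three spatial kernel axes have size 1).
    bias:     (C_out,) bias vector.
    Returns an array of shape (C_out, T, H, W).
    """
    out = np.einsum("oc,cthw->othw", weights, features)
    return out + bias[:, None, None, None]

# Hypothetical sizes: 8 feature channels, 24 hours, a 16 x 16 pixel tile.
rng = np.random.default_rng(0)
C, T, H, W = 8, 24, 16, 16
features = rng.normal(size=(C, T, H, W))
weights = rng.normal(size=(1, C))  # a single output channel
bias = np.zeros(1)

# Dropping the single output channel leaves one map per hour.
maps = conv3d_1x1x1(features, weights, bias)[0]
print(maps.shape)  # (24, 16, 16)
```

The kernel size describes the receptive field of the last layer, whereas the 24 x H x W output shape is inherited from the input to that layer.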

The study aims to estimate 24-hour, 100 m resolution PM2.5 data, but most of the results presented are seasonal or monthly averages. We would like to see 24-hour PM2.5 mapping results. Additionally, the comparison with MERRA2 focuses mainly on accuracy. Could the authors also better illustrate PM2.5’s spatial distribution and gradient variations, or even capture specific pollution emissions?

The study applies explainable AI techniques to explore the importance of different features, showing that SHAP values identify 2-meter air temperature as the most important feature. However, this analysis could be further improved. First, the underlying reasons for why certain variables are important (or not) are not sufficiently explored. Second, a broader perspective could be considered—how much of the variability in PM2.5 can be explained by meteorological variables overall?

The description of NOODLESALAD PM2.5 and its role in this study is unclear. The authors should provide a more detailed explanation rather than merely citing previous studies.

The results and analysis section could be further improved. First, it is recommended to structure the results into separate subsections rather than mixing everything together. Second, the quality of Figures 3–6 should be improved—currently, the font size is too small, and the figure titles could be removed (since the descriptions are already included in the captions). Lastly, additional results, such as 24-hour high-resolution PM2.5 maps, could enhance the persuasiveness of the study.

The references in the paper are somewhat outdated, with few studies from the recent three years included. It is recommended to update and supplement them.

Some minor issues:
(1) Figure 1: Does the figure represent the road network? Please clarify.
(2) Line 134: "3D PM2.5 maps" could be misinterpreted as three-dimensional spatial maps (including altitude). Is this the correct terminology?
(3) Figure 2: The representation is somewhat abstract. It would be better if the inputs and outputs were explicitly illustrated.
(4) Line 279: "consistent with prior findings" should be supported with references.

Citation: https://doi.org/10.5194/egusphere-2024-4056-RC1
- AC1: 'Reply on RC1', Andrea Porcheddu, 23 Jun 2025
  
  Thank you for your comments. Our reply can be found in the pdf attached.
  
  Citation: https://doi.org/10.5194/egusphere-2024-4056-AC1
RC2: 'Comment on egusphere-2024-4056', Anonymous Referee #2, 26 May 2025

The authors aim to map hourly PM2.5 concentration at 100-m resolution in Paris using multimodal data from MERRA-2 reanalysis, Sentinel-3 observations, and ground-based measurements via a deep learning model. While the topic is worth investigating, the proposed method, to a large extent, contributes little to the community. The reasons are as follows. First, the proposed model relies mainly on downscaling MERRA-2 aerosol diagnostics to generate 100-m PM2.5 estimates. Similar studies have been extensively conducted, differing in spatial resolution, and satellite-based PM2.5 estimates play a very weak role.
The manuscript suffers from the following flaws, which should be addressed before further consideration.
1. The data accuracy of NOODLESALAD PM2.5 should be described in Section 2.1. Moreover, the essential role of this unique product in the proposed deep learning framework needs to be clarified.
2. Since the authors used only 11 stations for reference, is this adequate to depict PM2.5 variability across space in the study area?
3. MERRA-2 PM2.5 estimates: since no nitrates are provided in the MERRA-2 aerosol diagnostics, the corresponding PM2.5 estimates are prone to large uncertainty. The data accuracy of this PM2.5 product should be validated as well.
4. The authors used a set of geographic variables with varying spatial resolution. How were these collocated in the deep learning framework? No such description is given.
5. A flow chart depicting the deep learning architecture, particularly the data flow, is essential for understanding and reproducibility.
6. Equations should be numbered.
7. Methodology: the authors mention that both satellite- and ground-based PM2.5 data were used as the learning target. Since these datasets differ in accuracy, would this undermine the learning capacity of the deep learning model?
8. Lines 207–209: this would result in imbalanced training sets at different hours, which could also influence the learning accuracy, as the learned model is more likely to predict PM2.5 during the satellite overpasses.
9. An intercomparison of the spatial distribution of the PM2.5 estimates predicted from MERRA-2 with satellite-derived PM2.5 at 100 m from Sentinel observations should be provided to assess the reliability of the proposed model in resolving PM2.5 distributions in Paris.

Citation: https://doi.org/10.5194/egusphere-2024-4056-RC2
- AC2: 'Reply on RC2', Andrea Porcheddu, 23 Jun 2025
  
  Thank you for your comments. Our reply can be found in the pdf attached.
  
  Citation: https://doi.org/10.5194/egusphere-2024-4056-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Andrea Porcheddu on behalf of the Authors (04 Jul 2025) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (07 Jul 2025) by Sandip Dhomse

RR by Anonymous Referee #1 (18 Jul 2025)

ED: Publish subject to minor revisions (review by editor) (18 Jul 2025) by Sandip Dhomse

AR by Andrea Porcheddu on behalf of the Authors (28 Jul 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (30 Jul 2025) by Sandip Dhomse

AR by Andrea Porcheddu on behalf of the Authors (08 Aug 2025) Manuscript

Viewed

Total article views: 881 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
780	79	22	881	20	35

Views and downloads (calculated since 14 Feb 2025)

Month	HTML	PDF	XML	Total
Feb 2025	79	12	3	94
Mar 2025	41	13	2	56
Apr 2025	48	11	2	61
May 2025	46	9	4	59
Jun 2025	64	15	6	85
Jul 2025	38	4	2	44
Aug 2025	115	14	1	130
Sep 2025	349	1	2	352

Viewed (geographical distribution)

Total article views: 854 (including HTML, PDF, and XML) Thereof 854 with geography defined and 0 with unknown origin.

Latest update: 25 Sep 2025

Short summary

This study proposes a novel machine learning method to estimate pollution levels (PM_2.5) in urban areas at fine scale. Our model generates hourly PM_2.5 maps with high spatial resolution (100 m) by combining satellite data, ground measurements, geophysical model data, and different geographical indicators. The model accounts for the spatial and temporal variability of urban pollution levels, offering relevant insights for air quality monitoring and health protection.

