Particulate Matter Concentrations Derived from Airborne High Spectral Resolution Lidar Measurements Using Machine Learning Regression

Ferrare, Richard; Hair, Johnathan; Shingler, Taylor; Hostetler, Chris; Nehrir, Amin; Fenn, Marta; Scarino, Amy Jo; Burton, Sharon; Clayton, Marian; Collins, James; Judd, Laura; Crawford, James; Travis, Katherine; Toth, Travis; Saide, Pablo; Jimenez, Jose Luis; Campuzano-Jost, Pedro; Symonds, Guy; Moore, Richard; Ziemba, Luke; Shook, Michael; Diskin, Glenn; DiGangi, Joshua P.; Bennett, Ryan; Ho, Chia-hsiang; Chang, Lim-seok; Aiampisanuvong, Adisak; Pawarmart, Ittipol

doi:10.5194/egusphere-2025-4812

09 Oct 2025

Abstract. We use measurements of near-surface aerosol backscatter, extinction, and depolarization acquired by four NASA Langley Research Center airborne High Spectral Resolution Lidars (HSRLs) in machine learning (ML) regression algorithms to derive concentrations of particulate matter (PM) with aerodynamic diameters less than 2.5 µm (PM₂.₅) and 10 µm (PM₁₀), and the PM₂.₅/PM₁₀ ratio. The ML regression models are trained using airborne HSRL measurements acquired over major metropolitan regions in the United States and Asia that are coincident with hourly surface PM₂.₅ and PM₁₀ measurements from the EPA air quality system and similar networks in other countries. We examine several regression methods and find that exponential Gaussian Process regression (GPR) algorithms consistently give the best performance in terms of the lowest root-mean-square (RMS) errors and the highest correlations. When evaluated using surface measurements withheld from the training sets, ML models that use the HSRL near-surface measurements of aerosol backscatter and aerosol intensive properties such as depolarization, backscatter color ratio, and lidar ratio typically give the best performance, with RMS differences in PM₂.₅ retrievals around 5 µg m⁻³ and correlation coefficients above 0.8. The corresponding RMS differences and correlation coefficients are 11 µg m⁻³ and 0.7 for PM₁₀ retrievals, and 0.17 and 0.75 for PM₂.₅/PM₁₀. This retrieval performance is achieved using airborne HSRL measurements alone and so does not depend on external knowledge of or assumptions regarding aerosol type, aerosol mass extinction efficiency, aerosol hygroscopic growth, the ratio of PM₂.₅ to PM₁₀, particle density, or relative humidity. PM₂.₅ values in the training set range from about 5 to 80 µg m⁻³; PM₁₀ values range from about 10 to 100 µg m⁻³. Accurate retrievals of PM outside these ranges would require commensurate training data.
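As a rough illustration of the regression and withheld-data evaluation described above, the following is a minimal sketch of Gaussian Process regression with an exponential kernel in plain Python. It is not the authors' implementation (Referee #2 notes below that the study used the Matlab regression learner software). The two input features, the synthetic numbers, the hand-set hyperparameters `sigma_f`, `length`, and `noise`, and all function names are assumptions made purely for illustration. A real application would standardize the features and optimize the hyperparameters.

```python
import math

def exp_kernel(a, b, sigma_f, length):
    """Exponential kernel: sigma_f^2 * exp(-||a - b|| / length)."""
    return sigma_f ** 2 * math.exp(-math.dist(a, b) / length)

def cholesky(A):
    """Return lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gpr_fit(X, y, sigma_f=20.0, length=2.0, noise=1.0):
    """Precompute weights alpha = (K + noise^2 I)^-1 (y - mean(y))."""
    y_mean = sum(y) / len(y)
    K = [[exp_kernel(a, b, sigma_f, length) + (noise ** 2 if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    alpha = chol_solve(cholesky(K), [v - y_mean for v in y])
    return {"X": list(X), "alpha": alpha, "mean": y_mean,
            "sigma_f": sigma_f, "length": length}

def gpr_predict(model, x_new):
    """Posterior mean at x_new: mean(y) + k_*^T alpha."""
    k_star = [exp_kernel(x_new, a, model["sigma_f"], model["length"])
              for a in model["X"]]
    return model["mean"] + sum(k * w for k, w in zip(k_star, model["alpha"]))

def rms_difference(pred, obs):
    """Root-mean-square difference between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def pearson_r(pred, obs):
    """Pearson correlation coefficient between predictions and observations."""
    mp, mo = sum(pred) / len(pred), sum(obs) / len(obs)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)

# Synthetic, made-up data: (backscatter, depolarization) -> PM2.5 in ug m^-3.
X_train = [(0.5, 0.02), (1.0, 0.03), (2.0, 0.05), (3.5, 0.04), (5.0, 0.10)]
y_train = [6.0, 11.0, 22.0, 35.0, 48.0]
# Points withheld from training, used only for evaluation.
X_held = [(0.8, 0.03), (2.8, 0.05), (4.2, 0.08)]
y_held = [9.0, 29.0, 41.0]

model = gpr_fit(X_train, y_train)
pred = [gpr_predict(model, x) for x in X_held]
print("RMS difference:", round(rms_difference(pred, y_held), 2))
print("Correlation:", round(pearson_r(pred, y_held), 3))
```

The exponential kernel decays with distance rather than squared distance, so the fitted function is continuous but not smooth, and predictions far from any training point fall back to the training mean. This is one way to see why retrievals outside the training range would need additional training data.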
We present examples of PM retrievals in the United States and Asia from HSRL measurements acquired while the aircraft flew systematic "raster-scan" patterns for several hours over major urban areas. We show that these PM₂.₅ retrievals are in good agreement with PM₂.₅ derived from coincident airborne in situ measurements near the surface as well as aloft. We also describe how the distribution of PM₂.₅ varies with aerosol type and altitude over these regions. We use the HSRL measurements of aerosol extinction and retrievals of surface PM₂.₅ along with HSRL retrievals of aerosol type to derive estimates of the fine mode aerosol mass extinction efficiency (MEE_f) for major aerosol types identified by an updated HSRL aerosol classification method. MEE_f ranges from about 2.6 ± 0.5 m² g⁻¹ for maritime aerosol to 5.0 ± 0.7 m² g⁻¹ for smoke. These estimates of MEE_f are also in good agreement with values derived from airborne in situ measurements. We also discuss how this methodology may be applied to measurements from the Atmospheric Lidar (ATLID) on the EarthCARE satellite.
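In its simplest form, a mass extinction efficiency is the ratio of aerosol extinction to aerosol mass concentration. The sketch below shows only that unit arithmetic. The abstract does not spell out the authors' exact procedure, so the function name and the sample values here are hypothetical and should not be read as the method used in the paper.

```python
def mass_extinction_efficiency(extinction_per_Mm, pm_ug_m3):
    """Return the mass extinction efficiency in m^2 g^-1.

    Extinction in Mm^-1 (1e-6 m^-1) divided by mass concentration in
    ug m^-3 (1e-6 g m^-3) gives m^2 g^-1 directly, because the two
    1e-6 factors cancel.
    """
    if pm_ug_m3 <= 0:
        raise ValueError("mass concentration must be positive")
    return extinction_per_Mm / pm_ug_m3

# Hypothetical sample: 60 Mm^-1 of extinction with 15 ug m^-3 of PM2.5.
print(mass_extinction_efficiency(60.0, 15.0))  # 4.0
```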

How to cite. Ferrare, R., Hair, J., Shingler, T., Hostetler, C., Nehrir, A., Fenn, M., Scarino, A. J., Burton, S., Clayton, M., Collins, J., Judd, L., Crawford, J., Travis, K., Toth, T., Saide, P., Jimenez, J. L., Campuzano-Jost, P., Symonds, G., Moore, R., Ziemba, L., Shook, M., Diskin, G., DiGangi, J. P., Bennett, R., Ho, C., Chang, L., Aiampisanuvong, A., and Pawarmart, I.: Particulate Matter Concentrations Derived from Airborne High Spectral Resolution Lidar Measurements Using Machine Learning Regression, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2025-4812, 2025.

Received: 29 Sep 2025 – Discussion started: 09 Oct 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Journal article(s) based on this preprint

19 Dec 2025

Particulate matter concentrations derived from airborne high spectral resolution lidar measurements using machine learning regression

Richard Ferrare, Johnathan Hair, Taylor Shingler, Chris Hostetler, Amin Nehrir, Marta Fenn, Amy Jo Scarino, Sharon Burton, Marian Clayton, James Collins, Laura Judd, James Crawford, Katherine Travis, Travis Toth, Pablo Saide, Jose Luis Jimenez, Pedro Campuzano-Jost, Guy Symonds, Richard Moore, Luke Ziemba, Michael Shook, Glenn Diskin, Joshua P. DiGangi, Ryan Bennett, Chia-Hsiang Ho, Lim-Seok Chang, Adisak Aiampisanuvong, and Ittipol Pawarmart

Atmos. Meas. Tech., 18, 7735–7766, https://doi.org/10.5194/amt-18-7735-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-4812', Anonymous Referee #1, 20 Oct 2025

The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4812/egusphere-2025-4812-RC1-supplement.pdf

Citation: https://doi.org/10.5194/egusphere-2025-4812-RC1
- AC2: 'Reply on RC1', Richard Ferrare, 02 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4812/egusphere-2025-4812-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-4812-AC2
RC2:
'Comment on egusphere-2025-4812', Anonymous Referee #2, 07 Nov 2025

The study deals with the use of the Matlab regression learner software to explore various regression methods applied to several large observational airborne lidar (HSRL) data sets. These data were collected during large field campaigns over major urban areas in the USA, South Korea, the Philippines, Taiwan, and Thailand. The goal was to investigate which combination of lidar information (backscatter, extinction, lidar ratio, depolarization ratio at single or multiple wavelengths) allows a good estimation of PM2.5 and PM10 at heights close to the surface. In these machine learning (ML) studies, dense sets of network in situ observations of PM2.5 and PM10 were used. It was found that the Exponential Gaussian Process Algorithms consistently showed the best performance. Twelve different lidar configurations (models 1-12) were defined. However, in the results section only the optimum model (model 11) was applied.
This is an excellent and well elaborated study done by experienced researchers and lidar experts!
I have only minor remarks. As a reviewer, my role is to be critical and to criticize if I find something that should be mentioned. All the positive aspects remain largely uncommented.
The Abstract may be too long. According to the AMT/ACP rules the abstract should not exceed 250 words.
Lines 93-111: What about all the ground-based lidars and lidar networks? Why are they not mentioned? All the multiwavelength Raman polarization lidars, EARLINET? Ground-based lidars are ideal for monitoring the diurnal, weekly, and seasonal cycle of the aerosol pollution state in urban areas, and, in contrast to airborne and satellite lidars, they do so continuously! Airborne field campaigns are very useful, no doubt, but they are snapshots! Spaceborne lidar observations provide global coverage, but are also snapshot-like. In my opinion, such a general introduction should provide a broader overview of the available lidar techniques and networks: MPLNET, ADNET, EARLINET.
Lines 139-150: To continue with my general comment: I was surprised that the Raman lidar technique was not mentioned at all, although the first author Rich Ferrare grew up as an aerosol Raman lidar specialist. The use of the robust and very stable Raman lidar technique is, in my opinion, the optimum approach for long-term monitoring of aerosol pollution, even at low heights of 100-200 m above ground (by using near-range receiver units). Meanwhile, rotational Raman channels allow coverage of the lower part of the atmosphere even during daytime.
To avoid misunderstanding: the development of all the different airborne HSRL lidars at LARC, NASA is unique! The lidar team as a whole did a fantastic job during the last 10-15 years.
Back to the manuscript. Later on, I was also surprised that none of the defined models 1-12 covers the CALIOP lidar configuration. I think that should be improved. Or does it make no sense at all, when there is no lidar-ratio information? The CALIOP model would be model 7 without information on 532 nm lidar ratio and 532-1064 nm depolarization ratio. The comparison of model 7 (without the lidar ratio and 1064 nm depol ratio information) and model 12 would be the perfect opportunity to demonstrate the big step forward in spaceborne lidar development from CALIOP to ATLID (EarthCARE lidar)!
Line 216: Table 1! It is not easy to find out what the HSRL-2 (the main lidar in all the field campaigns discussed in this paper, model 11) can measure. A better, clearer overview of the different systems would be helpful.
Line 218: What do you mean by self-calibration? In the backscatter coefficient retrieval, you always need to assume a reference backscatter value at the reference height.
Line 290: So, the basic goal was to use 193 flights (conducted from 2010-2024) over major metropolitan regions to explore various machine learning regression models for deriving PM concentrations. The results section, however, mainly contains HSRL-2 observations and applications of model 11.
How many flights were conducted with the HSRL-2?
When using these 193 flights over urban areas, you investigated the link between lidar observations and in situ observations for only ONE aerosol type, even if PM2.5/PM10 ranged from 0.1 to 0.9? Please comment on that!
To cover the entire globe (in the case of global observations with CALIOP or ATLID) would that mean we need global sets of in situ PM observations in the machine learning studies?
Line 316: Is there a good reference available so that the reader can learn more about the Exponential Gaussian Process Algorithm?
Line 329, Table 2: Model 11 has the most crosses and is obviously the best model in this study. Model 12 is the EarthCARE model! Why is there no CALIOP model? … model 7 (without lidar ratio and 1064 nm depol ratio information)? Please comment on that!
In the case of models 7-11: Either BSC or EXT, but always LR is used! Does that mean that when BSC and LR are included, the information about EXT is automatically available and is not needed? Please explain why a model that uses BSC plus EXT plus LR makes no sense!
Line 351: Figure 4 shows models 1, 2, 3, and 11! I think one should show model 7 in this figure!
Line 366, Figure 5: I am surprised that the use of BSC gives better results than the use of EXT. The extinction coefficient (overall scattering effect) is closely linked to the cross section of the particles, and PM is also well correlated with the particle cross section and thus with EXT. Is the reason that BSC is the better parameter related to the fact that the study only concentrates on the urban-haze aerosol type (mostly fine-mode aerosol)?
Section 3: The result section shows interesting results and the full potential of airborne aerosol HSRL observations to quantify the pollution state close to the ground. I have no questions here!
Figure 8: The in-situ observations (EPA surface stations) are not easy to see. Maybe a bit larger symbols will help?
Figures 12-15 show convincing (excellent) results. But as a critical reviewer, my question would be: can we use the developed approach if we have totally independent data sets, e.g., lidar observations over Beijing, Shanghai, Wuhan, Pearl River Delta in China, or over polluted Cairo, Egypt, Dakar, Senegal, Nairobi, Kenya, or over Paris and London in Europe, or Tomsk in Siberia or Fairbanks in Alaska? Any comment on that would be fine! Do we always need complex data sets of lidar and in situ observations in ML efforts for each region of the world before we can make trustworthy use of lidar observations?

Citation: https://doi.org/10.5194/egusphere-2025-4812-RC2
- AC1: 'Reply on RC2', Richard Ferrare, 02 Dec 2025
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4812/egusphere-2025-4812-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-4812-AC1

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Richard Ferrare on behalf of the Authors (02 Dec 2025) Author's response Author's tracked changes Manuscript

ED: Publish as is (04 Dec 2025) by Daniel Perez-Ramirez

AR by Richard Ferrare on behalf of the Authors (09 Dec 2025) Author's response Manuscript

Short summary

We present a new method to retrieve atmospheric particulate matter concentrations using only airborne High Spectral Resolution Lidar measurements in machine learning algorithms. Retrieved concentrations agree well with surface measurements. These concentrations and our estimates of the particle mass extinction efficiency are also consistent with those retrieved from airborne in situ measurements. This methodology can also be applied to the Atmospheric Lidar (ATLID) on the EarthCARE satellite.