Unsupervised Classification of Absorbing Aerosols with the SP2 via a Variational Autoencoder (VAE)

Doshi, Aaryan; Lamb, Kara

doi:10.5194/egusphere-2025-3210

Preprints

https://doi.org/10.5194/egusphere-2025-3210

20 Aug 2025

Abstract. The Single Particle Soot Photometer (SP2) detects refractory aerosol particle mass on a single-particle basis via laser-induced incandescence (L-II). While the SP2 has traditionally been used to quantify black carbon aerosol mass in the atmosphere, the instrument is increasingly being used to detect and quantify other types of absorbing aerosols, such as mineral dust or anthropogenically sourced iron oxide aerosols. Quantifying the mass loadings and emission sources of absorbing aerosols in the atmosphere is important for understanding their role in the climate cycle. Supervised machine learning algorithms have shown potential to classify different types of aerosols from L-II signals, but these methods are sensitive to instrument configuration and require training datasets generated from laboratory samples, which do not generalize well to ambient atmospheric aerosols. Here we explore the effectiveness of an unsupervised deep learning method, a variational autoencoder (VAE), applied directly to L-II signals from the SP2 in order to classify different types of absorbing aerosols. The VAE compresses L-II signals into a bottleneck latent representation and reconstructs an output as similar as possible to the input signal, thereby reducing dimensionality. We apply this approach to a dataset composed of laboratory samples of materials that show detectable incandescence in the SP2, including fullerene soot (as a proxy for black carbon), coated fullerene soot, coal fly ash, mineral dust, volcanic ash, hematite, and magnetite. We explore optimal latent representations of L-II signals to maximize separability of different aerosol classes by varying the size of the latent representation, and find that a latent representation of size 3 allows us to capture the majority of the information in the L-II signals relevant for identifying different types of absorbing aerosols. We demonstrate that unsupervised machine learning is a promising method for identifying distinct populations of aerosols detected by the SP2.

Received: 05 Jul 2025 – Discussion started: 20 Aug 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2025-3210', Anonymous Referee #1, 09 Sep 2025
The paper by Doshi and Lamb introduces an unsupervised machine learning approach to better understand the structure of absorbing aerosols using L-II signals from the SP2. Using a variational autoencoder (VAE) the authors are able to extract a compressed latent feature vector of the L-II signals, and use this for outlier detection and enhanced identification of distinct aerosol populations (even outperforming previous tests using significant feature engineering). The paper is generally well written and I appreciate the conciseness of everything. Before fully recommending the paper for publication, I have a handful of questions/comments I’d like to see addressed surrounding latent feature physical interpretations, the dimensionality reduction methodology, outlier detection approach, and generalizability. Further, several of the figures should be updated to match the specifications set forth by EGU (i.e., enhanced text/label size throughout and improved color choices for visibility) to improve general readability.

General Comments:
Physical Interpretations: At the core of this research project is the latent feature vector [Z1, Z2] produced by the VAE; however, it wasn’t clear to me throughout the analysis what these values represent physically. I.e., what does this compressed representation mean wrt. aerosols? This is always a challenge in DR projects, but I think it is important to consider since this vector is used throughout the remainder of the results for visualization/interpreting different classes/outliers etc., and we would like to have some confidence that it is learning a real physical feature in the L-II signals, and not some spurious measurement artifact from the SP2 for instance. I understand Section 3 speaks to this point, but I’d like to see the authors examine this in more detail and, if possible, validate their interpretation against ancillary data (if available).

Dimensionality Reduction Methods: Building on my previous point, VAEs are great at capturing nonlinear relationships and for providing a smooth manifold, but the latent embeddings are typically less interpretable than the EOFs produced by a PCA, for instance. While I understand this problem likely benefits from nonlinear DR like VAEs, UMAP etc., did you attempt to fit a PCA to the same data? What EOFs were produced and how did they differ from those using the VAE? This should be easy to fit, and I feel it would be a useful/interesting comparison for this project to include, as it provides a linear baseline that motivates the need for a more computationally expensive technique like the VAE.
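
To illustrate the linear baseline suggested in this comment, the following is a minimal sketch of PCA in plain Python. It is not taken from the authors' repository: the synthetic Gaussian pulses standing in for L-II signals, all function names, and the choice of power iteration with deflation are assumptions made purely for illustration. The unit vectors it returns play the role of the EOFs, and the explained-variance ratios indicate how much signal variance a linear projection of a given size retains.

```python
import math
import random


def make_pulses(n=200, length=21, seed=0):
    """Synthetic stand-in for L-II signals: noisy Gaussian pulses (assumed shape)."""
    rng = random.Random(seed)
    signals = []
    for _ in range(n):
        amp = rng.uniform(0.5, 2.0)    # hypothetical peak amplitude
        width = rng.uniform(2.0, 4.0)  # hypothetical pulse width, in samples
        signals.append([
            amp * math.exp(-0.5 * ((t - length // 2) / width) ** 2)
            + rng.gauss(0.0, 0.01)
            for t in range(length)
        ])
    return signals


def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(m, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def leading_eigenpair(m, iters=500, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in m]
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v


def pca(rows, k):
    """Top-k principal components (EOFs) and their explained-variance ratios."""
    cov = covariance(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    components, ratios = [], []
    for _ in range(k):
        lam, v = leading_eigenpair(cov)
        components.append(v)
        ratios.append(lam / total)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return components, ratios


components, ratios = pca(make_pulses(), k=2)
print([round(r, 3) for r in ratios])
```

On real signals, a library PCA would be the more practical choice; the point of the sketch is only that the linear baseline is cheap to compute and yields directly inspectable components that could be set beside the VAE latent dimensions.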

Outlier Detection: I like this idea for outlier detection (Section 4) using the latent manifolds from the VAE. However, I am not totally convinced that the centroid + Euclidean distance approach is the most robust approach for this. As the authors illustrate in Figs. 5/6, the manifolds from the latent vector don’t produce circular/spherical distributions across the embedding plane. When dealing with continuous real-world observations, we instead typically see much more abstract and diverse shapes, where a simple Euclidean distance from the centroid might not capture a true outlier, but instead, another relevant mode in the data. I would recommend the authors consider including a discussion/comparison to other approaches for outlier detection (e.g., HDBSCAN), which look at the density of the cases and how they are distributed across the scene.
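
The concern about non-spherical manifolds can be made concrete with a small sketch. The example below does not implement HDBSCAN and is not the authors' method; it swaps in the Mahalanobis distance as a simpler, covariance-aware alternative, and the elongated synthetic cluster, the two probe points, and all function names are assumptions made for illustration. It compares how the centroid + Euclidean score and the Mahalanobis score rank a point that lies far out along the cluster's long axis against a point that sits near the centroid but off the manifold.

```python
import math
import random


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n


def euclidean_scores(points):
    """Distance of each point from the cluster centroid."""
    cx, cy = centroid(points)
    return [math.hypot(x - cx, y - cy) for x, y in points]


def mahalanobis_scores(points):
    """Centroid distance rescaled by the cluster's 2x2 sample covariance."""
    cx, cy = centroid(points)
    n = len(points)
    sxx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    scores = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Quadratic form d^T S^{-1} d, using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        scores.append(math.sqrt(max(d2, 0.0)))
    return scores


rng = random.Random(0)
# Hypothetical elongated cluster in a 2-D latent plane (wide in z1, narrow in z2).
cluster = [(rng.gauss(0.0, 5.0), rng.gauss(0.0, 0.3)) for _ in range(500)]
along_axis = (12.0, 0.0)  # far from the centroid, but on the cluster's long axis
off_axis = (0.0, 3.0)     # close to the centroid, but well off the manifold
points = cluster + [along_axis, off_axis]

eu = euclidean_scores(points)
ma = mahalanobis_scores(points)
print("Euclidean   along/off:", round(eu[-2], 2), round(eu[-1], 2))
print("Mahalanobis along/off:", round(ma[-2], 2), round(ma[-1], 2))
```

In this toy setting the Euclidean score flags the along-axis point as the more extreme of the two, while the Mahalanobis score flags the off-axis point. A density-based method such as HDBSCAN would go further by handling multi-modal or curved manifolds, which a single covariance matrix cannot describe.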

Generalizability: Since this work is based on laboratory data, do you expect the results of the VAE manifolds to change if applied to actual atmospheric observations? How robust is this to different regional climates/periods, and how stable would you expect the VAE manifold to remain? It would be nice to see some additional details or some text in the discussion addressing the applicability of this approach and these results to new, unseen data.

Specific Comments:
Title: I would remove “(VAE)” from the title

Figure 1: Can you add some additional details to the figure description explaining the SP2 (outside of just the citation).

Line 36: Should be “incandesce”, right?

Lines 48-49: This felt like a really abrupt change in the introduction, moving from the SP2 to unsupervised machine learning. I would suggest reworking these paragraphs to flow more clearly/logically.

Lines 49-63: I recognize this is a follow-on from Lamb (2019), but are there other references you could include throughout this section looking at supervised/unsupervised ML in aerosols to better motivate this work? The paper would likely benefit from broadening the current set of references to better situate itself in current literature.

Lines 57-59: You write “unsupervised machine learning” three times in three lines and it reads as quite repetitive, I would also recommend reworking this portion of the introduction to flow more cleanly.

Line 67: I know you define VAE in the abstract, but I typically recommend redefining all abbreviations within the text itself, so it can stand on its own.

Lines 153-155: I am less familiar with the experimental setup for aerosol-specific model training, but are there any concerns about overfitting from temporal autocorrelation when using a random 50/25/25 split rather than a segmented approach?

Figure 2: I like this figure, but the text is really small/hard to read in its current state

Line 232: You mention hyperparameter selection here, but don’t go into details about the process. What approach was used/what search space was evaluated? A summary table with final tuned values would also be a useful addition to consider.

Figures 5/6: Similar to my previous comment about label size. Also the yellow cases are really hard to see on the white background. It might also be useful to apply a set of colorblind-friendly colors here so the “All classes” figure is more easily digested.

Figure 7: Font size comment again

Line 361-362: I appreciate the inclusion of the GitHub link with code for reproducibility, but when examining it, there are no included datasets to test with? Unless I am missing this step, I would recommend taking a subset of your *.npy data depending on data size and making it available on Colab through gdown (or some similar alternative). Then users can simply run your notebook online to reproduce the primary results.
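
The question raised for Lines 153-155 (random versus segmented 50/25/25 splitting) can be illustrated with a short sketch. It assumes, hypothetically, that particle records are stored in acquisition order, which the manuscript discussion does not confirm; the function names and the adjacency metric are likewise illustrative only. The metric counts how many held-out records have an immediate neighbor (index ± 1) in the training set, which is a rough proxy for leakage through temporal autocorrelation.

```python
import random


def random_split(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle record indices, then cut into train/validation/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a = int(fractions[0] * n)
    b = a + int(fractions[1] * n)
    return idx[:a], idx[a:b], idx[b:]


def segmented_split(n, fractions=(0.5, 0.25, 0.25)):
    """Cut contiguous blocks of record indices into train/validation/test."""
    idx = list(range(n))
    a = int(fractions[0] * n)
    b = a + int(fractions[1] * n)
    return idx[:a], idx[a:b], idx[b:]


def frac_adjacent_to_train(train, held_out):
    """Fraction of held-out records with an immediate neighbor in the training set."""
    train_set = set(train)
    hits = sum(1 for i in held_out if (i - 1) in train_set or (i + 1) in train_set)
    return hits / len(held_out)


n_records = 1000  # hypothetical number of time-ordered particle records
r_train, r_val, r_test = random_split(n_records)
s_train, s_val, s_test = segmented_split(n_records)

print("random split   :", frac_adjacent_to_train(r_train, r_test))
print("segmented split:", frac_adjacent_to_train(s_train, s_test))
```

With a random split, most test records sit directly beside a training record in time; with contiguous blocks, none of the test records do. Whether this matters in practice depends on how strongly consecutive single-particle signals are correlated.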
Citation: https://doi.org/10.5194/egusphere-2025-3210-RC1
- AC2: 'Reply on RC1', Kara Lamb, 14 Nov 2025
  
  See attached response.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3210-AC2
RC2:
'Comment on egusphere-2025-3210', Anonymous Referee #2, 18 Sep 2025
This study presents a new method for classifying aerosol particles based on Laser-Induced Incandescence (L-II) signals using unsupervised machine learning. The authors applied a Variational Autoencoder (VAE) to analyze L-II signals and compress them into a lower-dimensional latent space. This approach is an improvement because it removes the need for manual feature engineering. The paper is well-structured and easy to follow. The introduction provides sufficient background on the SP2 and the limitations of previous classification methods. The methods section clearly describes the dataset, data preprocessing, and the VAE model. Despite these strengths, there are several issues that must be addressed before publication.
General comments:
Physical Interpretation of Latent Space: While the VAE approach is a powerful tool for classifying aerosol types, the discussion on the physical interpretation of the latent space (e.g., z1−z4) feels underdeveloped. The connection between the latent representation and blackbody temperature (Figure 3) is fascinating and a key finding. What specific features of the L-II signals (e.g., peak sharpness, symmetry, or decay rate) are being captured by these latent variables? A more thorough discussion linking the distributions in Figures 5 and 6 to the microphysical properties of the different aerosol types (BC, FeOx, dust) would significantly strengthen the paper's scientific contribution.

Outlier Detection and Ambient Data: The claim that outlier detection can be useful for characterizing aerosols from various sources (Lines 293-301) seems a strong assertion. While this is a promising application, the current study, which uses laboratory-generated data, does not provide sufficient evidence to support this claim for ambient atmospheric observations. To make this point more convincing, the authors would need to analyze real atmospheric data. I recommend toning down this claim to a more cautious statement about the potential for this method to be applied to ambient data in future work.

Figure Readability: Overall, the font size for text within the figures, including labels and legends, is too small and difficult to read. The marker sizes in the legends are also unclear, making it hard to distinguish between different aerosol types. I would recommend that the authors increase the font size and marker size to improve the readability of all figures.

Specific comments:
Lines 29-30: Please re-check the citation for Moteki and Kondo (2010). This paper does not focus on a field study of rBC. The citation should be removed or replaced with a more relevant reference.

Line 78: The chemical formula of Iron (IV) should be corrected to Iron (II, III).

Line 131: The text should be corrected from Ch. 0 to Ch. 1.

Figure 2: Please correct “Fe203” and “Fe3O4” to “Fe₂O₃” and “Fe₃O₄”. Additionally, please correct “Schwartz et. al 2006.”

Figure 2: In the center upper panels, “Class 1” and “Class 2” are not defined. Could you clarify what these classes represent?

Line 216: Please correct the spelling of “FeOx”. I would recommend a thorough check of the entire manuscript.

Figure 4: Why are “z1” and “z3” denoted in x labels and “z2” and “z4” denoted in y labels? It seems that they should be “time” and “signal amplitude.”

Figure 4: The "Noise" signals are not explicitly defined. It would be helpful to provide a clear definition of "Noise" signal in this context.

Lines 242–244 and Figure 5: The authors state that there is “significant overlap” between FS and FS+glyc. However, the distributions for Ch 1 (z4 vs z3) appear to be different.

Line 278: It seems that “outlier increases” should be corrected to “outlier decreases.”

Figure 7: The scatter plots of z4 and z3 in left two panels appear to be almost identical. It would be more efficient to include only one panel. Additionally, please clarify what the dashed lines indicate and confirm that they correspond to the selected outlier markers.

Figure 7: The signals of Channel 3 are difficult for an unfamiliar reader to interpret. It would be helpful for the authors to include an example of a normal Channel 3 signal, as is shown for Channels 0 and 1 in Figure 2.

Lines 281–285: Please provide a physical explanation for why these specific outliers occurred. For example, was the Fe₃O₄ outlier due to the multiple detection of particles?
Citation: https://doi.org/10.5194/egusphere-2025-3210-RC2
- AC1: 'Reply on RC2', Kara Lamb, 14 Nov 2025
  
  See attached response.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3210-AC1

Data sets

Laser-Induced Incandescent Signals for Laboratory Samples of Absorbing Aerosols Detected by the Single Particle Soot Photometer Kara Lamb https://doi.org/10.5281/zenodo.15800436

Interactive computing environment

SP2-Aerosol-Classification Aaryan Doshi and Kara Lamb https://github.com/adoshi25/SP2-Aerosol-Classification

Viewed

Total article views: 1,702 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,652	36	14	1,702	32	36

Views and downloads (calculated since 20 Aug 2025)

Month	HTML	PDF	XML	Total
Aug 2025	504	7	3	514
Sep 2025	1,086	8	6	1,100
Oct 2025	40	9	3	52
Nov 2025	22	12	2	36

Viewed (geographical distribution)

Total article views: 1,676 (including HTML, PDF, and XML), of which 1,676 have a defined geography and 0 are of unknown origin.

Latest update: 18 Nov 2025

Short summary

Aerosols that absorb sunlight play a key role in Earth's climate. To improve detection of absorbing aerosols measured by the SP2, we explore unsupervised machine learning. Unlike earlier methods that require labeled training data from laboratory measurements, our approach learns patterns directly from SP2 signals. This makes it more applicable to atmospheric observations. We show this method can reveal distinct aerosol populations and improve aerosol classification.
