the Creative Commons Attribution 4.0 License.
Chemical sparsity in Bayesian receptor models for aerosol source apportionment
Abstract. Aerosol source apportionment is a key tool for understanding the origins of atmospheric particulate matter and for guiding effective air quality management strategies. However, source apportionment techniques still struggle to properly separate highly correlated sources without relying on restrictive a priori information, possibly skewing the solution and adding subjective operator input, with varying degrees of benefit. This study introduces sparsity into the Bayesian Autocorrelated Matrix Factorisation (BAMF) model with the aim of removing non-essential species contribution in the unconstrained profiles, which is expected to improve the separation of factors. The regularised horseshoe prior (HS) has been added to BAMF (BAMF+HS) to promote composition matrix F sparsity, shrinking low-signal contributions to the solutions. BAMF+HS was evaluated using three synthetic datasets designed to reflect increasing levels of data complexity (Toy, Offline, and Online), and a real-world multi-site filter dataset. The results demonstrate that BAMF+HS effectively enforces sparsity in offline datasets and that this improves accuracy in reconstructing source profiles and time series compared to BAMF and Positive Matrix Factorisation (PMF). However, its application to higher-complexity ACSM datasets revealed sensitivity to sampling instability hindering sparsification. With that, even though sparsity was not achieved, the quality of the BAMF+HS solution metrics were not deprecated compared to BAMF. Overall, this work underscores the value of incorporating profile sparsity as a solution property in Bayesian source apportionment, and positions BAMF+HS as a promising model for source apportionment.
Status: open (until 21 Jan 2026)
- RC1: 'Comment on egusphere-2025-5253', Anonymous Referee #1, 08 Jan 2026
- RC2: 'Comment on egusphere-2025-5253', Anonymous Referee #2, 19 Jan 2026
The manuscript by Via et al. applies a new statistical method for source apportionment aiming at introducing sparsity in order to improve the solution.
Chemically correlated datasets can introduce extra uncertainties for receptor models, which can be resolved, as shown in the current manuscript, by introducing sparsity. Sparsity was introduced into the Bayesian Autocorrelated Matrix Factorisation (BAMF) model through horseshoe priors. The authors tested this approach on both synthetic and real-world datasets. They concluded that sparsity was achieved in less complicated datasets, while this wasn’t the case for the more complicated ones.
Given the relevance of the topic for the AMT journal and the overall presentation of the manuscript, I recommend publication after revision. My comments are listed below.
Major revisions
- Line 22: A question that arises is why not remove non-essential species directly from PMF? Since the authors later discuss that sparsity could also be introduced into PMF, it would be helpful to explain why this was not tested, or discuss how that would be expected to compare. This is more of a suggestion than a requirement.
- Further justification is needed for why PMF outperforms BAMF and BAMF+HS in the toy (simple) dataset.
- Line 103-105: Please emphasize that in the case of PMF it’s the opposite; normalization takes place after resolving the factors. Also in Line 216: X and σ are not normalized in PMF. Please clarify how these differences might affect the comparison and whether they could bias the results.
- Lines 212-213: Could this create bias when comparing PMF and BAMF?
- For the toy dataset it’s not clear to me if it’s intended to be ACSM-like. Please describe this further.
- Please include somewhere in the manuscript the temporal resolution of each synthetic/real dataset and also the computational burden of each model. Would you suggest BAMF as a method complementary to PMF, or as a replacement for traditional PMF? Maybe include a qualitative comparison that also covers the ease of applying both and how easy each model is to tune, etc.
- Line 348: How was the “truth performance” obtained? If derived through PMF, please discuss the appropriateness of comparing.
- Lastly, Fig. 1 can be quite confusing for the reader. I would expand the explanation in the text and also in the caption. Describe the posterior density distributions and why the truth is always at zero in the y-axis.
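The normalisation point raised for Lines 103-105 and 216 can be illustrated with a minimal sketch. It assumes the usual bilinear form X ≈ G·F and shows only that rescaling the profiles after factorisation (each row of F summing to 1, with the scale moved into G) leaves the reconstruction unchanged. The function names and the small matrices are hypothetical, and the sketch says nothing about whether normalising X and σ before fitting changes the BAMF solution, which is the question put to the authors.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalise_profiles(g, f):
    """Rescale each profile (row of f) to sum to 1 and move the scale into g.

    Column k of g is multiplied by the sum of row k of f, so g @ f is unchanged.
    """
    sums = [sum(row) for row in f]
    f_norm = [[v / s for v in row] for row, s in zip(f, sums)]
    g_scaled = [[v * s for v, s in zip(row, sums)] for row in g]
    return g_scaled, f_norm


# Hypothetical two-factor example: 2 time points, 2 species.
G = [[1.0, 2.0], [0.5, 3.0]]
F = [[2.0, 2.0], [1.0, 3.0]]

G_scaled, F_norm = normalise_profiles(G, F)
X_before = matmul(G, F)
X_after = matmul(G_scaled, F_norm)
```

Because the product is identical before and after, a post hoc normalisation as in PMF is only a change of units between G and F; any bias in the comparison would have to come from normalising the inputs before the fit.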
Minor revisions
Line 23 Improve the separation compared to what? Be more specific in such discussions (also in Line 514)
Line 43 You mean "PM", not "PMF"
Line 53 Add though; “even though it can lead…”
Line 55 Maybe end the sentence after the references.
Line 57-58 What do you mean “cover the whole range of prior knowledge required” ?
Line 61 Add "was introduced by Park…"
Line 79 BMF; you have already introduced the acronym (also in Line 96 and 108)
Line 96 By (a): do you mean Equation (1)?
Line 100 Similarly, replace (2) with Eq. (2) if you mean equation
Line 103 Delete “meaning”
Line 107 Delete "hereinafter" or "here"
Line 116 "Where" has different font. Also in Line 204
Line 151 Maybe replace "unevenness" with "inequality"
Line 162 Add () in γ
Line 208 Change the Equation number to 14 and also the next ones
Line 210 What do you mean "sorted PMF runs"? Probably regular runs and you sort afterwards. Please rephrase
Line 216 Rephrase “Previously to model running,…”
Line 222 This sentence is confusing, please rephrase
Line 227, 486, 510, 542, 543 Use bold for F and/or G
Line 232 Use "was" instead of "is"
Line 235 Use "was" instead of "will be"
Line 236 Use bold in Z, X
Line 251 Please fix this sentence: Despite the truth’s factorisation is unknown…
Line 255 Please change 2.3.1 to 2.4.1 and apply the same to the following subsections in 2.4
Line 255 Maybe add “synthetic” here too, so it’s easier for the reader
Line 303 Please add a comma after "models"
Line 304 Replace "is" with "was"
Line 321 Maybe transfer “slightly” before “modifying”
Lines 322-323 Why did you choose to perturb only one factor? It would be helpful to see the sensitivity analysis on all factors of one dataset (e.g. Zurich).
Line 328 Replace “with this framework” with “within this framework”
Line 329 Replace “tried” with “implemented”
Line 370 Replace “to prove this further.” with “to further highlight this.”
Line 372 Replace “is intended” with “was used”
Line 378 It is not very clear what the initialization failure means
Line 404 You have used however three times in a paragraph. Maybe remove this one
Line 414 Repeated “offline”
Line 426 Maybe replace “incapacitating“ with “preventing”
Line 417 “we made them more robust”: this needs rephrasing
Line 436 Replace with: “The next step was to test these models on more realistic synthetic datasets”
Line 444 This sentence is a bit confusing
Line 484-485 Repeating “solution”
Line 526-528 This is a large sentence, please try to modify
Line 529 Maybe better to say: “suffers from in contrast to BAMF”
Line 530 Remove “hence”
Line 539 Replace “The other tried out models” with “The results from the other models tested”
Line 547 What do you mean by essentialise?
Line 555 Maybe replace “essayed” with “tried” or “tested”
Citation: https://doi.org/10.5194/egusphere-2025-5253-RC2
RC3: 'Comment on egusphere-2025-5253', Anonymous Referee #3, 19 Jan 2026
The paper “Chemical sparsity in Bayesian receptor models for aerosol source apportionment” by Via et al. extends the Bayesian Autocorrelated Matrix Factorisation (BAMF) model by introducing profile sparsity via a regularised horseshoe (HS) prior on the chemical composition matrix, yielding the BAMF+HS approach. This represents a meaningful methodological contribution to the field of atmospheric receptor modelling by tackling the challenge of separating correlated sources without highly restrictive a priori constraints. BAMF+HS is evaluated against BAMF and Positive Matrix Factorisation (PMF) on synthetic “Toy”, “Offline”, and “Online” datasets of increasing complexity, as well as a real multi-site offline filter dataset. The paper demonstrates that BAMF+HS successfully induces profile sparsity in simpler offline simulations and often improves source profile and time-series recovery, although sparsity is less successful in complex ACSM-like data.
This paper addresses an important methodological gap in Bayesian source apportionment by incorporating profile sparsity. The synthetic and offline evaluations convincingly show benefits in appropriate contexts, and the application to real data suggests promising performance without data modification. With revisions that enhance clarity (adding a schematic workflow) and expand the discussion of practical applicability, this paper would represent a valuable contribution to atmospheric measurement techniques.
General comments
- Although the paper is technically rigorous, the presentation is often dense, with extensive mathematical detail that can obscure the broader modelling logic. Some sections would benefit from clearer explanatory text and more consistent definition of notation to help guide the reader through the methodology. Adding brief intuitive descriptions alongside the formal equations would improve accessibility and make the work more approachable. In particular, a schematic workflow diagram for BAMF+HS would help clarify where the sparsity prior operates relative to temporal autocorrelation and the overall factorisation process.
- The potential benefits and limitations of BAMF+HS in real atmospheric data contexts (beyond the synthetic and offline examples) require deeper discussion. Readers would benefit from explicit guidance on dataset characteristics that favour the use of sparsity priors, when chemical profiles are inherently sparse versus when they are mixed and correlated.
- The introduction of the regularised HS prior increases model complexity and sampling demands. A brief discussion on computational performance, convergence diagnostics, and sample efficiency across dataset types would inform practitioners considering BAMF+HS for large-scale studies.
Specific comments
- While the authors provide quantitative metrics (e.g., correlation coefficients, ratios) to compare methods, additional visual comparisons of source fingerprints and residuals would enhance interpretability and illustrate practical differences between BAMF+HS, BAMF, and PMF solutions.
- The limited achievement of sparsity in the ACSM-like datasets suggests that the HS prior may not always be appropriate. The paper should more explicitly address why the chemical structure of such data resists sparsification and whether alternative priors or hybrid techniques could overcome this challenge.
- It would be beneficial, and would significantly increase the method’s practical usefulness, to provide short recommendations on how users might diagnose when to apply BAMF+HS and to describe suggested diagnostic checks prior to deployment.
- Abstract: define “Toy”, “ACSM”
- Line 43-47: add other types of receptor model
- Line 53-58: please introduce constrained/unconstrained PMF
- Line 55: I suggest replacing “disentanglement” with deconvolution
- Line 83-87: please rephrase the definition of sparsity to be clearer to the reader and add its usage together with PMF
- Line 116, 204: check the fonts
- Line 119-120: add a proper reference to the statement
- Line 125: I suggest using “involves” instead of “entails”
- Line 222: “hence” is repeated
- Line 235: define the parameters of “computational performance”
- Section 2.3.4: please mention whether you tried to perturb more factors simultaneously and whether the model still captures the truth
- Line 330: replace “grasp” with a more appropriate synonym
- Line 347: mention the models
- Line 378: please rephrase
- Line 380: explain why this specific factor was chosen; is there something particular about it?
- Add in the Discussion section whether the method can be applied to other types of datasets, and what the minimum requirements for these datasets are
- Please avoid abbreviations in the conclusion section
- Figure 1 and 4 are difficult to follow, please improve the representation
- Figure 4: include more information in the caption (colour, elements, factors)
- Figure 7: use the colour code for the cities, explain the higher variability
Citation: https://doi.org/10.5194/egusphere-2025-5253-RC3
Data sets
Datasets for BAMF+HS test Marta Via et al. https://github.com/martavia0/BAMF-horseshoe/tree/main/datasets
Model code and software
Models for Bayesian Matrix Factorisation Marta Via et al. https://github.com/martavia0/BAMF-horseshoe/tree/main/models
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 154 | 67 | 13 | 234 | 38 | 25 | 21 |
Via et al. extend the Bayesian Autocorrelated Matrix Factorisation (BAMF) model for aerosol source apportionment by introducing profile sparsity via a regularised horseshoe (HS) prior on the composition matrix F. This yields BAMF+HS, a Bayesian receptor model in which:
- the classical receptor formulation is retained;
- temporal autocorrelation in source contributions (from BAMF) is kept;
- a regularised HS prior shrinks low-signal entries of F toward zero, encouraging chemically parsimonious profiles.
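To make the shrinkage mechanism concrete, here is a minimal sketch of the regularised horseshoe in its standard form: a global scale tau, a half-Cauchy local scale lambda per entry, and a slab width c that caps the effective scale. The function names, the parameter values, and the use of a half-normal (absolute value) draw to keep entries non-negative are assumptions made for illustration; the manuscript's actual parameterisation of F may differ.

```python
import math
import random


def regularised_scale(lam: float, tau: float, c: float) -> float:
    """Prior standard deviation of one entry under the regularised horseshoe.

    lam_tilde^2 = c^2 * lam^2 / (c^2 + tau^2 * lam^2); the scale is tau * lam_tilde.
    For tau * lam << c this is about tau * lam (strong shrinkage when tau is small);
    for tau * lam >> c it saturates at c (the slab caps large entries).
    """
    lam_tilde_sq = (c**2 * lam**2) / (c**2 + tau**2 * lam**2)
    return tau * math.sqrt(lam_tilde_sq)


def draw_profile(n_species: int, tau: float, c: float, rng: random.Random) -> list[float]:
    """Draw one non-negative profile from the prior (half-normal is an assumption)."""
    profile = []
    for _ in range(n_species):
        # Half-Cauchy(0, 1) local scale via the inverse-CDF transform.
        lam = abs(math.tan(math.pi * (rng.random() - 0.5)))
        profile.append(abs(rng.gauss(0.0, regularised_scale(lam, tau, c))))
    return profile


# Hypothetical settings: 50 species, small global scale, slab width 2.
profile = draw_profile(n_species=50, tau=0.1, c=2.0, rng=random.Random(0))
near_zero_fraction = sum(v < 0.05 for v in profile) / len(profile)
```

With a small tau most entries are drawn close to zero while the heavy-tailed local scales let a few escape, which is the behaviour that encourages chemically parsimonious profiles.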
They evaluate BAMF+HS on:
They compare against BAMF (without sparsity) and PMF. The main findings are:
Overall, this is a timely and well-motivated methodological contribution to the atmospheric source-apportionment literature. The explicit use of a regularised horseshoe prior to enforce chemical sparsity in a Bayesian receptor model addresses a longstanding challenge in separating correlated sources without heavy, subjective constraints. The synthetic evaluations and the application to a real multi-site offline filter dataset are convincing. I recommend publication in AMT after major revision that addresses the following comments:
General comments:
Specific comments:
Line 25: what is Toy?
Line 28: define ACSM
Line 39: Be more careful in referring to OP as toxicity. I would be more conservative on this and refer to it as one of the health metrics.
Line 43: PMF needs to be spelled out the first time it is introduced. Also, PMF is one of the RMs used to conduct source apportionment analysis, not the only approach to source apportionment. Please rephrase. Also, SA is not just the identification but also the quantification of sources. You will need to make that clear.
Line 46: decomposes -> deconvolutes.
Line 50: I would avoid using mathematical terms like ℝ^{n·m} to accommodate wider audiences; I suggest spelling it out.
Line 53: some introduction to unconstrained and constrained PMF is necessary.
Line 55: change “disentanglement” to “identification”
Line 57: CMB is not 100% equal to fully constrained PMF. Also, I don’t know why you introduce CMB here. Perhaps you will need a few sentences in this paragraph to introduce the limitations of PMF or CMB in general.
Line 70: approach -> conduct
Line 82: overlapping emissions -> mixed emission sources
Line 83: slight F differences -> slight differences of F
Line 84-83: use even simpler language to briefly explain sparsity and why it makes sense to enforce it in PMF analyses. Also, change “elements” to “variables”, since not all the variables of F are elements.
Line 86-87: Change it to: “The accomplishment of sparse source fingerprints could represent “cleaner” emission sources without mixing among resolved factor profiles.”
Line 102: What is N? Please introduce it in the text.
Line 222: avoid using “hence” twice in one sentence
Line 256: OK, now I understand what the toy dataset is. Would it be more appropriate to use “dummy” instead of “toy”? It didn’t make much sense to me when I first saw it at the beginning of the manuscript without context.
Table 3: For the “toy” dataset, the BAMF and BAMF+HS results are in general worse than the PMF results; could this be a major flaw of BAMF? How can this be addressed?
Figure 1: the y-axis is a bit confusing to me. They are not real m/z, right? Also, what is the unit of the y-axis? Have you done some repeats of CMB or CMB+HS, and does the y-axis show the frequency of iterations that end up with these concentrations? It’s not clear from your text and figure captions. Please clarify.
Section C.1 of SI: there are inconsistencies between BMF and BAMF, and between BAMF-GS and BAMF+GS.
Figure 3: You will need a legend for which color is which…
Figure 8: I’m still confused about what you are showing here. Is it the autocorrelation of each model of each source? Is it the model vs truth for each source for each model? Or is it the correlation of the autocorrelation between model vs truth? If it’s the third one, what does it mean?