the Creative Commons Attribution 4.0 License.
Toolbox for accurate estimation and validation of PMF solutions in PM source apportionment
Abstract. Positive matrix factorization (PMF) is the most commonly used approach for particulate matter source apportionment; however, the implementation steps of the model require considerable user experience. Most studies apply PMF according to the recommendations of the Environmental Protection Agency and the European Commission, while relatively few studies focus on further developing the PMF methodology. This study aims to develop a systematic method that reduces some subjective aspects when performing a PMF study, providing recommendations and tools for its application and validation. A total of 13 targeted tests were conducted to address key sources of subjectivity in PMF, categorized into three critical aspects: preparation of the input matrix, selecting the number of sources, and validation of the PMF solution. The results of the first step highlighted that using a single source tracer reduces the tracer's dispersion into other sources, leading to more accurate results. The second stage tests suggested that the selection of a source tracer should be based on low uncertainty and specific temporal evolution, in order to facilitate the determination of a new source without compromising the PMF solution. Finally, the validation step was set up as an advanced comparison of the PMF-derived source profiles with those in the literature, including SPECIEUROPE database, using the ratio of chemicals and distance metrics. All outcomes of this study are compiled into a Python package providing essential tools to support the work from PMF implementation to solution validation, leading to less subjective solutions and more rigorous and reliable source apportionment.
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2025-1968', Anonymous Referee #3, 05 Jul 2025
-
AC1: 'Reply on RC1', Vy Dinh Ngoc Thuy, 04 Sep 2025
Authors response
We thank the reviewers for their time and valuable comments, which helped improve the quality of our manuscript. Below, we have answered the reviewers' comments in bold; the changes included in the manuscript are in italics.
Referee 1
This study develops an approach to reduce subjectivity in PM source apportionment using PMF. Three key challenges, including input matrix preparation, selection of the number of sources, and validation of PMF solutions were addressed. The results find that using single-source tracers (e.g., levoglucosan alone) minimizes tracer dispersion and improves accuracy. For source separation, tracers should have low uncertainty (S/N ≥ 3) and distinct temporal patterns. Validation is enhanced by comparing PMF-derived profiles with literature data using chemical ratios and distance metrics. Finally, the methodology is compiled into a Python package to standardize PMF implementation and improve reliability. Overall, the results of the manuscript are interesting and valuable to the literature. However, there are several aspects requiring further clarification.
Line 96: Please define OC*
Reply: Thanks for your suggestion. OC* is defined by subtracting the carbon concentrations of the organic markers (which are included in the PMF input) from the total OC concentration, in order to avoid double counting of C mass. We have added the OC* equation to the SI and updated the main text as follows (lines 97-98):
Many PMF studies make use of OC* instead of OC - as an input variable to avoid double-counting part of the total OC mass (EQ S1) (Borlaza, 2021; Dominutti et al., 2024; Srivastava et al., 2018).
[OC*] = [OC] - ([Levoglucosan]*0.44+ [Mannosan]*0.44+ [MSA]*0.12+ [Polyols]*0.4+ [Oxalate]*0.27+ [2-MT]*0.44+ [3-MBTCA]*0.47) (EQ S1)
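As an illustration of EQ S1, the following minimal Python sketch computes OC* for one sample. The carbon fractions are those given in the equation; the function name `oc_star`, the dictionary layout, and the sample values are hypothetical, and markers missing from a sample are assumed to contribute zero.

```python
# Carbon mass fractions of the organic markers, as listed in EQ S1.
CARBON_FRACTION = {
    "Levoglucosan": 0.44,
    "Mannosan": 0.44,
    "MSA": 0.12,
    "Polyols": 0.40,
    "Oxalate": 0.27,
    "2-MT": 0.44,
    "3-MBTCA": 0.47,
}


def oc_star(oc, markers):
    """Return OC* = OC minus the carbon carried by the organic markers.

    oc      -- total OC concentration of one sample
    markers -- dict of marker name -> concentration (same unit as oc);
               a marker absent from the dict is treated as 0 (assumption)
    """
    carbon_in_markers = sum(
        fraction * markers.get(name, 0.0)
        for name, fraction in CARBON_FRACTION.items()
    )
    return oc - carbon_in_markers


# Made-up concentrations, for illustration only.
sample = {"Levoglucosan": 0.5, "Mannosan": 0.05, "Oxalate": 0.1}
print(oc_star(4.0, sample))
```

With these made-up values the markers carry well under 10% of the OC mass, which is the typical situation described in the replies.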
Line 221: Should the S/N value be greater than 3? Please clarify.
Reply: Thanks for your comment. The S/N value mentioned in line 221 refers to the classification of chemical species in PMF. Following the EU recommendation, species are classified by their S/N value: S/N < 0.2 is classified as "Bad", 0.2 < S/N < 2 as "Weak", and S/N > 2 as "Strong".
The S/N should be above 0.2 for a species to be incorporated in the PMF input matrix. At this stage, we only present the methodology, which is based on the PMF user guide and the EU recommendation. We have not yet reached a conclusion regarding the S/N values.
The S/N > 3 criterion, on the other hand, was introduced after evaluating the tests in which the uncertainty of the input was varied (Test 5 to Test 11).
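The classification rule can be sketched in a few lines of Python. The thresholds come from the reply above; how the boundary values (exactly 0.2 or exactly 2) are handled is not stated there, so assigning them to "Weak" is an assumption of this sketch, as is the function name.

```python
def classify_species(sn):
    """Classify a species from its signal-to-noise ratio.

    Thresholds follow the reply: S/N < 0.2 is "Bad", S/N > 2 is "Strong",
    anything in between is "Weak". Boundary values (0.2 and 2) fall into
    "Weak" here, which is an assumption.
    """
    if sn < 0.2:
        return "Bad"
    if sn > 2:
        return "Strong"
    return "Weak"


# S/N values of the 2-MT test variants listed in Table 3.
for name, sn in [("2-MT1", 2), ("2-MT3", 3.8), ("2-MT4", 8), ("2-MT5", 1.4)]:
    print(name, classify_species(sn))
```

Note that this classification is separate from the stricter tracer-selection criterion (S/N of at least 3) derived from the tests.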
Line 234: What is the additional coefficient "a"? Please define.
Reply: Thanks for your comment. The uncertainty calculation follows the formula of Giannini et al. (2012), where “a” is the additional coefficient of variation. This coefficient accounts for variability beyond the analytical processes and also prevents a sample from having zero uncertainty, since null or zero uncertainties are not accepted in PMF. The “a” values in this study are adapted from Weber et al. (2019).
We noticed an error in Table 3 that could cause confusion: the "Alpha" value is in fact the “a” value. We have modified the table (line 239):
 | S/N value | QL (µg/m3) | CV | a |
---|---|---|---|---|
2-MT1 | S/N = 2 | 0.000352 | 15% | 10% |
2-MT2 | S/N = 2 | 0.00352 | 15% | 10% |
2-MT3 | S/N = 3.8 | 0.000176 | 10% | 10% |
2-MT4 | S/N = 8 | 0.0000704 | 5% | 5% |
2-MT5 | S/N = 1.4 | 0.000352 | 20% | 10% |
Fake 1: S/N = 4, 20%
Fake 2: S/N = 3, 30%
Lines 256-260: Why did the authors use 11 factors? Please clarify. How would the results change if the number of factors were reduced?
Reply: Thanks for your suggestion. As presented in lines 154 and 155, the optimal base results of the PMF at Arrest and Ailly have already been presented by Zhang et al. (2024), where the most appropriate solution was selected based on the validation criteria and the comparison between different PMF runs. The solution with 10 factors showed a mixture between the sulfate-rich and nitrate-rich factors. In addition, a high contribution of metals, especially Ni and V, was found in this mixed chemical profile, indicating mixing between SIA and shipping sources. Increasing the solution by one factor enhances the separation of SIA and HFO, consequently improving the prediction of sulfate, Ni, and V in the PMF.
However, the purpose of our study is not to re-evaluate the optimal base case, but to test the sensitivity of the result to the input variable, therefore, the investigation of the base case is not included again in the paper.
Line 268: The reasoning is clear. Since levoglucosan and mannosan are isomers and likely share the same source, their high correlation should not distort PMF results, as they would be grouped into a single factor. The critical point here should stress that excessive use of highly correlated inputs risks biasing the model outcomes.
Reply: Thanks for your suggestion. We agree with the referee's comment: levoglucosan and mannosan share the same source (with a correlation of R2 = 0.99 between the two species at both sites), and including both species or levoglucosan alone in the PMF leads to a minimal change in the contribution of the BB source to PM. However, as shown in Figure 2, incorporating the highly collinear tracers at Ailly can slightly increase the rotational ambiguity of the PMF solution: a small proportion of levoglucosan is distributed to the traffic profile, which is not the case at Arrest. Although this contribution is minor, it still affects the contributions of the BB and traffic sources to PM at the site. We updated the text as follows (lines 283-286):
This strongly suggests that using a single, robust tracer is preferable. While PMF can group correlated variables, including multiple highly collinear tracers in the input data can introduce rotational ambiguity that risks biasing the model's outcome. This can lead to the misattribution of these tracers to other factors, thereby altering their chemical profiles and contributions.
Lines 279-287: Should the first step be a correlation analysis to filter out highly correlated input parameters?
Reply: Thanks for your suggestion. Yes, the correlation analysis is always the first analysis before running the PMF analysis, which is included in the Python package (PMF_toolkits) developed in this study. However, the objective of this test is to elucidate the sensitivity of the PMF result for different changes in the input variables, which is not necessarily related to the correlation analysis.
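A correlation screening of this kind can be sketched as follows. This is a generic illustration in plain Python, not the API of the PMF_toolkits package; the function names, the 0.9 flagging threshold, and the time series are all hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant series: treated as uncorrelated (assumption)
    return sxy * sxy / (sxx * syy)


def collinear_pairs(series, threshold=0.9):
    """List species pairs whose R2 reaches the (hypothetical) threshold."""
    flagged = []
    for a, b in combinations(sorted(series), 2):
        r2 = r_squared(series[a], series[b])
        if r2 >= threshold:
            flagged.append((a, b, round(r2, 3)))
    return flagged


# Made-up time series, for illustration only.
data = {
    "levoglucosan": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mannosan": [0.10, 0.21, 0.29, 0.40, 0.50],
    "sulfate": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(collinear_pairs(data))
```

Here only the levoglucosan/mannosan pair is flagged, which mirrors the situation discussed above where one of the two tracers would be retained.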
Table 4: What are the differences between lines 5 and 6 in the table?
Reply: Thanks for your comment. Line 5 indicates the point in the process of selecting the number of factors at which the newly identified factor is the source of 2-MT. Line 6 describes the optimal solution and indicates the total number of factors in whose chemical profile 2-MT is present. This number should ideally be 1, indicating that there is no mixing between the 2-MT factor (BSOA) and other factors. We have changed the row headers in Table 4 to be clearer:
Row 5: Number of factors required to identify sources of 2-MT
Row 6: Number of final factors that 2-MT contributes to
Line 375: Should it be 99%?
Reply: Thanks for your careful reading. We double-checked: the PMF result of Test 6 indicates that 98% of 2-MT is attributed to the source of 2-MT, with the remaining 2% ascribed to the source of Aged sea salt. Therefore, the value in the manuscript is correct.
Lines 384-387: Given that higher S/N ratios improve PMF separation, what specific recommendations follow? For instance, should priority be given to filter the tracer species with both high concentrations and elevated S/N ratios?
Reply: Thanks for your interesting question. Our recommendation is intended for users who seek to identify new or minor sources, especially organic sources. The recommendation on tracer selection is based on two key aspects: (1) S/N ≥ 3 and (2) low collinearity (ideally R2 < 0.3) with other chemical species. Both conditions must be met at once, and neither takes priority. If one condition is not met, the species cannot enhance the identification of new sources. For example, the S/N of HULIS (at Ailly) is 5 and its concentrations are substantial, but this species is well correlated with some other species (the maximum R2 is 0.6, with levoglucosan). Because this condition is not met, HULIS could not be isolated in a single factor (i.e., 100% HULIS contribution in one factor). This is, however, expected for this species, considering the knowledge about its geochemistry and sources (Baduel et al., 2010).
Regarding the concentration, the results of tests 6 and 7 demonstrated that the S/N value is more crucial than the concentration for source separation: a 10 times higher concentration of 2-MT with a lower S/N barely allowed a specific source to be determined, whereas this was more easily achieved with a low concentration and a high S/N. Therefore, our recommendation is not simply based on high concentrations of the chemical species used as tracers, but rather on filtering for species with high S/N and low collinearity. The recommendation is updated in the main text as follows (lines 397-399):
Consequently, the result recommends key criteria for tracer selection with (1) the S/N ≥ 3 and (2) remarkable temporal evolution, where the R2 between the tracer and other chemical species in the input matrix is below 0.3.
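The two criteria can be combined into a simple check, sketched below in Python. The thresholds (S/N of at least 3, R2 below 0.3) are those stated in the reply; the function name and the maximum R2 assigned to the second example are hypothetical.

```python
def failed_tracer_criteria(sn, max_r2, sn_min=3.0, r2_limit=0.3):
    """Return the list of tracer criteria that a species fails.

    sn     -- signal-to-noise ratio of the candidate tracer
    max_r2 -- highest R2 between the candidate and any other input species
    An empty list means both criteria are met at once.
    """
    failed = []
    if sn < sn_min:
        failed.append("S/N below %.1f" % sn_min)
    if max_r2 >= r2_limit:
        failed.append("R2 not below %.1f" % r2_limit)
    return failed


# HULIS at Ailly, using the values quoted in the reply (S/N = 5, max R2 = 0.6).
print(failed_tracer_criteria(5, 0.6))
# A candidate with S/N = 8 and a made-up max R2 of 0.1 passes both criteria.
print(failed_tracer_criteria(8, 0.1))
```

Returning the failed criteria rather than a single boolean makes it visible that HULIS is rejected on collinearity alone, despite its high S/N.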
Lines 467-468: Building a source library is a great idea. However, sources may vary due to factors such as material, location, and environmental conditions. For example, for biomass burning, as shown in Figure 6, burning different wood types yields different source profiles. Different burning conditions may also contribute to varying results. Are there suggestions to address this diversity, such as using ratios of certain tracer species to determine these variables? The PD/SID metrics are useful, but the manuscript should address how missing data (e.g., non-overlapping species between profiles) affects the comparison.
Reply: Thanks for your comments. The different types of biomass combustion, as well as the fuel combustion, are determined using the specific ratio, which is found from the literature and presented in Table S3. These ratios are calculated using the concentration (in µg m-3) of the tracers in the chemical profile. The Python package developed in this study included ratio validation, which automatically checks the ratio of specific species within a source profile compared to those in the literature. For instance, the ratio of Levoglucosan/Mannosan could indicate the subtype of biomass burning (Crop residues, hardwood, softwood, grasses, etc.). The comparison with the SpeciEurope profiles could give a range of related sources to the PMF-derived source; however, we recommend users to use specific ratios, as well as consider the characteristics of the site studied, to be able to identify which source is the most appropriate.
Regarding the non-overlapping species between profiles (indicated as non-comparison in the study), a high proportion of this part leads to inappropriate comparison, as shown in Figure 6. We indicated that besides the distance metrics and the number of species used for the comparison, the mass of the compared part is an important aspect to observe when comparing the chemical profiles. We did not have a recommendation for the mass since this case barely occurs, and the condition on the number of species could normally cover the mass. We did not encounter any intermediate cases (between 10% and 95% contribution of non-comparison mass), therefore, our study cannot conclude for these conditions. Nevertheless, the specific case of Tyre wear (Figure 6) is interesting. It demonstrates that it is mandatory to compare the specific compositions of PM between the PMF-derived source and the chemical profiles that are most similar to those suggested. The tool for comparison is included in the Python package.
We have added in the main text to clarify (lines 486-490):
These results emphasize that it is crucial to consider the SID and PD values together with the number of chemical species compared, but also to observe in more detail the chemical profiles in order to evaluate the PMF solutions. The output of the tool is a list of similar chemical profiles, which gives an idea of the nature of PMF-derived sources. Other investigations, such as the characteristics of the sites, should be considered when identifying the source.
Line 492: Some parts of the manuscript use S/N>2, while others use S/N>3, which is confusing. Please clarify.
Reply: Thanks for your suggestion, and we apologize for the confusion. Indeed, S/N > 2 is the condition for a species to be classified as strong in the PMF model, which is the proper classification of chemical species in the PMF model, indicated in the PMF guideline. On the other hand, the S/N > 3 is the condition for tracer selection, which is retrieved from the results of different tests in our study.
Citation: https://doi.org/10.5194/egusphere-2025-1968-AC1
RC2: 'Comment on egusphere-2025-1968', Anonymous Referee #1, 24 Jul 2025
This manuscript presents a toolbox for source apportionment of particulate matter using positive matrix factorization (PMF). This study aims to develop a method to reduce some subjective aspects when using PMF, including the preparation of input matrix, number of sources, and validation of solution. There are several issues that need to be addressed before it can be considered for publication.
General comments:
- The input matrix for source apportionment of particulate matter using PMF is critical. The accuracy of source apportionment and the selection of appropriate tracers depend on the overall concentrations of PM and the tracers. In this study, the modifications of input variables for tests 1-4 all concern organic materials. However, it is important to consider whether these three sampling sites are suitable for these tests. For example, if the biogenic SOA concentrations are too low, tests involving 2-MT on top of 3-MBTCA may not yield meaningful results. Another example is test 1, using OC instead of OC*. At these two sampling sites, the differences between these two concentrations are very low. Negligible variation in all factors is expected. It is recommended to use a different dataset with more noticeable differences to run the same test.
- HULIS can originate from various sources, including primary sources, such as soil. Some statements in this manuscript imply that HULIS are secondary, which is inaccurate. For example, Line 105-109, Line 309, Line 311-312. Also, HULIS is not commonly determined in PM. The choice of HULIS for the tests needs further justification.
Minor issues:
- Title, use the full descriptions of PMF and PM.
- Line 65-67, this statement requires additional supporting evidence or references.
- Table 1, what does OC* represent?
- Line 252-253, this is not accurately written.
- Line 278 and 286, is there an explanation for these results?
- Line 302, is “secondary biogenic oxidation” an identified source?
- Line 354-360, since the Fake 1 and 2 are significantly different from all other chemical species, it would be expected to have two more sources in the PMF solution. It is not clear why the tests 10 and 11 are necessary.
- Line 418-420, although the ratios of chemicals or tracers can be useful for comparison with profiles of specific sources (or in the databases), keep in mind that in most cases, they are mixed from different sources.
- Table 5, the 3-MBTCA/OC ratio is also lower than the acceptable ranges. Please provide further elaboration.
Citation: https://doi.org/10.5194/egusphere-2025-1968-RC2
AC2: 'Reply on RC2', Vy Dinh Ngoc Thuy, 04 Sep 2025
We thank the reviewers for their time and valuable comments, which helped improve the quality of our manuscript. Below, we have answered the reviewers' comments in bold; the changes included in the manuscript are in italics.
This manuscript presents a toolbox for source apportionment of particulate matter using positive matrix factorization (PMF). This study aims to develop a method to reduce some subjective aspects when using PMF, including the preparation of input matrix, number of sources, and validation of solution. There are several issues that need to be addressed before it can be considered for publication.
General comments:
The input matrix for source apportionment of particulate matter using PMF is critical. The accuracy of source apportionment and the selection of appropriate tracers depend on the overall concentrations of PM and the tracers. In this study, the modifications of input variables for tests 1-4 all concern organic materials. However, it is important to consider whether these three sampling sites are suitable for these tests. For example, if the biogenic SOA concentrations are too low, tests involving 2-MT on top of 3-MBTCA may not yield meaningful results. Another example is test 1, using OC instead of OC*. At these two sampling sites, the differences between these two concentrations are very low. Negligible variation in all factors is expected. It is recommended to use a different dataset with more noticeable differences to run the same test.
Reply: We thank the referee for this thoughtful comment. We believe that the chosen sites are appropriate for our scientific questions as well as for the designed tests. The goal of the study is to test the sensitivity and the limitations of PMF, rather than to perform a PM source apportionment study at the sites.
The test with 2-MT: We used the dataset from Grenoble because biogenic emissions are important at this Alpine valley site, as demonstrated by the secondary biogenic aerosol source determined in the base case of Borlaza et al. (2021). Further, it should be noted that the purpose of the test is not only to quantify the contribution of the source associated with 2-MT, but also to elucidate at which point the PMF can separate such a source. In Test 5, we failed to identify the source of 2-MT because the S/N is low. Once the S/N value is increased in tests 7 and 8 (with constant concentrations), this source is easily determined. This result is interesting, perhaps not in terms of the contribution, as the referee suggested, but for the methodological development, where we can emphasize which conditions should be met to identify minor sources, especially organic sources. As pointed out above, the concentrations are not the main point in this respect.
The test using OC instead of OC*: We agree with the referee that the difference between OC and OC* is relatively low; however, this difference reflects the general case for such a dataset, where the C mass of the species is negligible compared to the OC mass. Our group is one of the few that can incorporate an extensive set of organic markers into PMF, and in our experience the maximum difference between OC and OC* is about 10% (Glojek et al., 2024). To the best of our knowledge, the difference between OC and OC* reported in the literature is generally less than 10%. Our test addresses whether the calculation of OC* is necessary when the difference is in this range, and the result is that the changes in the concentrations of the factors are quite low. Specific cases with differences well above 10% remain to be tested.
HULIS can originate from various sources, including primary sources, such as soil. Some statements in this manuscript imply that HULIS are secondary, which is inaccurate. For example, Line 105-109, Line 309, Line 311-312. Also, HULIS is not commonly determined in PM. The choice of HULIS for the tests needs further justification.
Reply: Thanks for your comment. We agree with the referee that HULIS could originate from various sources, such as primary emission. However, HULIS has also been reported as the secondary product of atmospheric processes (Baduel et al., 2009, 2010; Hoffer et al., 2006; Li et al., 2019; Srivastava et al., 2018; Zheng et al., 2013), therefore, our consideration is based on the literature and is fundamentally correct.
The referee is correct about the rare identification of HULIS and its use in PMF studies. We are one of the few labs that can analyze HULIS in the world, and this is a good occasion to perform PMF tests on a major contributor of OC (very often the most prominent identified component), as well as a multiple-source originated species. Indeed, the HULIS have been incorporated in PMF analysis in other studies, which helps to identify the sources of coal burning or incinerator municipal sources (Dominutti et al., 2024; Li et al., 2019). Our result showed that the PMF solution remained stable when adding HULIS, and the attribution to PM source of HULIS is consistent with those reported in the literature. This is good evidence of PMF's capacity to identify sources for a complex and multi-origin chemical species.
Minor issues:
Title, use the full descriptions of PMF and PM.
Reply: Thanks for your suggestion, we have updated the title.
Toolbox for accurate estimation and validation of Positive Matrix Factorization solutions in particulate matter source apportionment
Line 65-67, this statement requires additional supporting evidence or references.
Reply: Thanks for your suggestion. We have added the references.
Finally, constraints that can be applied to initial results to obtain the final solution rely on user experience, and their application criteria as well as the error of the solution (Bootstrap, DISP) are rarely documented in the literature (Hopke, 2016). In the same way, comparison of the chemical profiles obtained in the studies is rarely benchmarked towards previous results.
Table 1, what does OC* represent?
Reply: Thanks for your comment, we updated in the main text:
1 OC* is calculated by EQ S1.
Line 252-253, this is not accurately written.
Reply: Thanks for your comment. We have updated the main text.
Where:
m is the number of chemical species common to both profiles
with x and y the relative mass of these m to the PM in the two respective chemical profiles.
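To make these definitions concrete, the sketch below compares two profiles over their common species. The formulas are not reproduced in this reply, so the code assumes the definitions commonly used in the source apportionment literature, PD = 1 - r^2 (with r the Pearson correlation between x and y) and SID = (sqrt(2)/m) * sum(|x_j - y_j| / (x_j + y_j)); the function name, the returned fields, and the profiles are hypothetical.

```python
from math import sqrt


def compare_profiles(profile_a, profile_b):
    """Compare two chemical profiles (species -> relative mass to PM).

    Only the m species common to both profiles are used. PD and SID follow
    the definitions assumed in the text above.
    """
    common = sorted(set(profile_a) & set(profile_b))
    m = len(common)
    if m < 2:
        raise ValueError("at least two common species are needed")
    x = [profile_a[s] for s in common]
    y = [profile_b[s] for s in common]

    mean_x, mean_y = sum(x) / m, sum(y) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a profile is constant over the common species")
    pearson_distance = 1.0 - sxy * sxy / (sxx * syy)

    # Species that are zero in both profiles add no distance and are skipped.
    sid = sqrt(2) / m * sum(
        abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0
    )
    return {
        "m": m,
        "PD": pearson_distance,
        "SID": sid,
        # Relative mass covered by the common species in each profile.
        "mass_a": sum(x),
        "mass_b": sum(y),
    }


# Made-up profiles, for illustration only.
pmf_factor = {"OC": 0.4, "EC": 0.1, "K": 0.02, "Levoglucosan": 0.05}
reference = {"OC": 0.2, "EC": 0.1, "K": 0.02, "Zn": 0.001}
print(compare_profiles(pmf_factor, reference))
```

Reporting m and the mass covered by the common species alongside PD and SID helps spot comparisons that rest on only a small, non-representative part of the profiles.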
Line 278 and 286, is there an explanation for these results?
Reply: As mentioned in the response for referee 1, using a single tracer provides a cleaner and less mixed chemical profile, reducing the ambiguity of the rotation of PMF solution. Using one good tracer of the source prevents the PMF from incorrectly splitting the tracer into other source(s), leading to less dispersion of tracer into unexpected factors (which is geochemical nonsense). For example, using a single tracer (Levoglucosan) enhances the separation between traffic and biomass burning, consequently leading to a more stable and consistent apportionment between Arrest and Ailly (nearby sites).
Line 302, is "secondary biogenic oxidation" an identified source?
Reply: Yes, this name is assigned for the source of alpha-pinene oxidation product (3-MBTCA), which is identified in the base run PMF (Borlaza et al., 2021). This is a common source category in the PMF literature, which could also be named as Biogenic Secondary Organic Aerosol (BSOA) when considering only the organic fraction.
Line 354-360, since the Fake 1 and 2 are significantly different from all other chemical species, it would be expected to have two more sources in the PMF solution. It is not clear why the tests 10 and 11 are necessary.
Reply: Thanks for your comment. The markedly distinct temporal evolution of Fake 1 and Fake 2 was created on purpose, in order to find the limit of PMF in the number of sources identified. In the literature, the number of PMF-resolved sources varies from 8 to 11, which raises the question of whether this limit is due to the lack of tracers or to an inherent mathematical limit of the PMF. Using Fake 1 and Fake 2, which meet the ideal criteria for a good tracer, we successfully identified 13 and 14 robust factors. This finding is innovative, demonstrating that the limit on the number of PMF-resolved sources is not due to the PMF algorithm but rather to the lack of high-quality tracers included in the model.
Line 418-420, although the ratios of chemicals or tracers can be useful for comparison with profiles of specific sources (or in the databases), keep in mind that in most cases, they are mixed from different sources.
Reply: We agree with the referee, and thank them for emphasizing this important point. However, the diagnostic ratios introduced in this study are intended for the output of the PMF, not the raw ambient observations. For example, the OC/EC ratio of the biomass burning source is calculated using the concentrations of OC and EC in the chemical profile of the biomass burning source, which is derived from the PMF analysis.
Table 5, the 3-MBTCA/OC ratio is also lower than the acceptable ranges. Please provide further elaboration.
Reply: Thanks for your thoughtful note. The 3-MBTCA/OC ratio of the secondary oxidation factor is slightly below the acceptable range, deviating by about 10% of the lower bound (0.009 vs 0.01). This minor deviation is considered negligible and not sufficient to invalidate the SA result, as discussed in lines 426-428. Indeed, the secondary organic aerosol source typically varies according to meteorological and atmospheric conditions, as well as PM ageing, leading to differences between study sites and those reported in the literature.
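A ratio check with a tolerance on the bounds can be sketched as follows. Only the ratio (0.009) and the lower bound (0.01) come from the reply; the upper bound, the 10% tolerance, and the function name are assumptions made for the illustration and do not describe the actual package.

```python
from math import isclose


def check_ratio(value, lower, upper, tolerance=0.10):
    """Classify a diagnostic ratio against a literature range.

    A value outside [lower, upper] is reported as a "minor deviation" when it
    misses the nearest bound by at most `tolerance` (relative to that bound).
    """
    if lower <= value <= upper:
        return "within range"
    bound = lower if value < lower else upper
    deviation = abs(value - bound) / bound
    # isclose guards against floating-point noise right at the tolerance.
    if deviation <= tolerance or isclose(deviation, tolerance):
        return "minor deviation"
    return "out of range"


# 3-MBTCA/OC from the reply; the upper bound 0.05 is a placeholder.
print(check_ratio(0.009, lower=0.01, upper=0.05))
```

The explicit `isclose` guard matters here because 0.009 against 0.01 sits exactly at the 10% limit, where plain floating-point comparison would otherwise reject it.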
Citation: https://doi.org/10.5194/egusphere-2025-1968-AC2