the Creative Commons Attribution 4.0 License.
Global variability in the detectability of power plant NO2 plumes from space
Abstract. We present the first global, data-driven analysis of power plant NO2 plume visibility from space. Using TROPOMI observations over 6,000 of the world’s highest-emitting power plants and hourly CEMS data for 500 U.S. plants, we develop an automated algorithm that labels plumes and attributes them to their sources with 98 % accuracy. We then train a machine learning model to predict plume detectability from environmental, meteorological, and observational variables (F1 score > 0.65, AUC > 0.8). Out of 25 variables, we find that NOx emission rate, surface albedo, wind speed, and sensor zenith angle jointly explain much of the detection variability. An hourly NOx emission rate of ≈ 400 kg/h corresponds to a 50 % detection probability on average, but detection rates vary from < 20 % to > 60 % under different combinations of these conditions. These results provide the first empirical quantification of the physical and environmental factors that govern NO2 plume visibility in satellite data, establishing a foundation for models to use similar predictors as auxiliary variables when quantifying emission rates from plume appearance.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-6008', Anonymous Referee #1, 08 Mar 2026
- RC2: 'Comment on egusphere-2025-6008', Andrew Barr, 22 Apr 2026
To the editor,
The authors present a stimulating study which addresses an important topic in satellite trace gas research, namely how to quantify the ability to detect plumes of anthropogenic pollutants and greenhouse gases. This can be compared to finding a plume detection limit but goes a step further. Such research has important direct implications for missions like TROPOMI, on which the authors focus the manuscript, but also future missions such as TANGO, which will measure emissions of CO2 and NO2 on facility-level scales. The manuscript is generally structured as follows: a new plume detection algorithm is presented, which is used to define training data for a deep learning model trained to predict plume detectability. This machine learning model and the prediction metric (plume detectability) form the basis of the results and discussion sections. The results are interesting and offer an impressive demonstration of the potential and depth of satellite data of trace gases.
In its current state the paper does not meet the high requirements of this journal. Significant major revisions are required for it to be acceptable. The paper is well written in parts, but overall appears hastily put together, with lack of consistency and depth throughout. The authors must take care to produce a rigorous manuscript. Below I list the major criticisms followed by a collection of more minor comments, which all need to be addressed.
Major:
- Several fundamentals of scientific writing are not sufficiently executed
- Abbreviations often go undefined (e.g. DDEQ, TROPOMI, etc.) or are used before they are defined (e.g. SHAP), including in the abstract.
- Figures and Tables need to be properly introduced and not in brackets for the first reference.
- There is a general theme of inconsistency which makes for very difficult reading. The authors use the word observation, snapshot, overpass and samples, and refer to them interchangeably across figure captions and main body text (Fig 2 and sec 2.2 etc), without ever explicitly stating what these mean. These all need to be better defined and consistently used. Similarly in Section 3 the plume detection is interchangeably referred to as Automated Plume Detectability and Automated Plume Detection. It is also sometimes referred to as Automated Plume Labelling Algorithm. This confuses two main aspects of the method. The authors must be more precise in their definitions and consistent with their use throughout the manuscript.
- How is detectability quantitatively defined? This is the key focus of the paper. What quantity is the machine learning model predicting? This fundamental definition is lacking. Suddenly in L365 detection probabilities are reported without a definition of this metric in Section 3. L368 hints that this has something to do with the number of satellite overpasses. In Section 4.4 and Figure 10 this variable appears to be labelled as P(detect) or elsewhere P(detection). A clearer distinction needs to be made between detection probability and detectability. If these two are the same the confusion can be resolved by being more consistent.
- The discussion in Section 5 lacks depth. The ratio of results to discussion material is very high, and Section 5.1 is rather repetitive of the text in Section 4.4. A better separation of results and discussion would improve readability. There are several topics that should be addressed in Section 5 that would add weight to the paper:
- An issue that is not discussed enough is the over/under reporting of emissions by facilities compared with what is observed in satellite data. In this study the reported emissions are used, however many studies show that there are large disagreements between these and satellite observations. Furthermore, there are also disagreements between emission inventory databases themselves, such as EDGAR and E-PRTR. The authors very briefly touch upon this at the end of Section 4.3. Given that plume detectability is derived from satellite data, this discrepancy between reported and observed (satellite) emissions must be addressed. Furthermore, NOx emission is the most important feature contributing to detectability.
- There is no mention of the dependency on spatial resolution. I can imagine that these results would differ significantly for better spatial resolutions e.g. 1-3 km for focus mode of GOSAT-GW NO2 observations. Can the authors comment on this? If it can be shown that the key features of variability are similar across different satellite spatial resolutions, this would increase the impact of these results significantly.
- For the global dataset, the use of a single emission value for almost an entire year of datapoints seems questionable. On the other hand the hourly data available from the US dataset would be more robust. Is there sufficient stability in the hourly emissions timeseries of a single power plant to justify using one value for the global data? Are there overlapping power plants in both datasets for which the predicted detectability values can be compared? Could this be a reason for the systematically higher detectability in the global dataset compared to the US only one?
- In light of the above comments on the discussion, the scope of the conclusions and abstract should be somewhat reduced. It should be specified that these results are for a spatial resolution of around 7 km (S5P pixels) and that the trained model is sensor specific, so its output can only be used when looking at TROPOMI data. Furthermore, when presenting a 98 % plume detection accuracy, it should be immediately added that this is for isolated plumes that are 20 km away from other power plants and up to 90 km away from cities. Finally, the caveat that only 21.1 % and 49.1 % of total emissions are accounted for in this analysis, for the global and U.S. datasets respectively, must be stated in both the abstract and conclusion.
Minor:
- The presentation of the datasets in Section 2 is very confusing. Is the first half of Section 2.1 describing the same dataset as Section 2.3? Further, the second paragraph of 2.1 is a repeat of Section 2.4. The level of detail on filtering out power plants is not needed. Please remove Section 2.1 altogether.
- The ordering of the subsections in Section 2 is a bit unnatural. It begins with very detailed information about datasets, while the overarching information about the datasets and what they are used for is only given at the end in 2.5. I suggest making the contents of 2.5 the main part of Section 2, and then go on to present each individual dataset in more detail.
- The contents of Section 4.1 are less a result and more a by-product of the method. Since it is not the focus of the paper, which addresses plume detectability, I suggest changing this to a more detailed, quantitative part of Section 3, i.e. a subsection dedicated to the training sample, unless the authors can give a good connection between this point and its impact on plume detectability.
- Sporadic use of bold font throughout, particularly in Section 3 and Section F. Bold font should be removed.
- There is a disconnect between the conclusions and the rest of the main body text. For example, aerosol optical depth is mentioned in the conclusions but never anywhere else in the text. The authors must make an effort to harmonise the conclusion with the rest of the paper.
- Whilst they are visually helpful and insightful into the plume detection, the number of figures in Section F is too large (more than in the whole main text). This can be reduced by consolidating panels into fewer figures or by reducing the number of examples to, say, 6. Also, all symbols need to be made bigger. I cannot see the wind direction icon anywhere.
- Formally this study deals more with enhancements rather than plumes, since there is no constraint on plume structure, such as a certain number of enhanced pixels adjacent to each other etc. I think that, given the fact that there are probably mostly real plumes in their dataset, this is ok, however it should be more clearly and explicitly stated that environmental, meteorological, and observational variables are extracted from a single pixel. This should be included in both the abstract and conclusion.
- The authors make repeated claims that they provide the ‘first’ study to address either quantifying plume visibility (L470) or demonstrate the factors that contribute to this (L488, L505) without sufficient reference to the literature. I think a clearer presentation of the science question in the introduction (Section 1), along with a more thorough discussion of the current literature would go a long way to support this claim. From L30 the introduction gets a bit lost. Until the end (L55) there are no more literature references and the text becomes about method, results and datasets.
Further minor:
L13: Typo in reference.
Figure 1: an additional piece of relevant information would be the date and time of the observation. Also there is an underlying dashed grey grid plotted which has a different projection than the TROPOMI pixels. This should be removed.
L75: I do not understand the supposed link between data quality and the applied filtering. Does the presence of duplicates really imply that the estimation of emission values is worse?
L79: Reference Veefkind et al. (2012)
L90: What does 'observation' refer to here? Pixels, orbits, or overpasses?
L92: What is a 'snapshot'?
L110: Pairing of TROPOMI data with emissions is a step too far for a datasets section. This should be moved to Section 3. Similarly for L130.
L119: specify that the total global power plant emissions are the reported emissions.
L131: The definition of incomplete data needs to be clarified. It should be made clearer which data were used, which were removed, and the reasons for doing so.
Table 1, footnote c: The reader is introduced to many concepts that have not yet been presented, or are not presented at all (ROCINN occurs nowhere else in the manuscript). This is confusing; these should be mentioned in the text or at least referenced to the section where they are discussed.
L169: state that the albedo is at different wavelengths.
L188: There are two main parts to Section 3, the plume detection and the training of a model to predict predictability. As a reader it is easy to confuse these two things - change plume-detectability–with-attribution to plume-detection–with-attribution. Likewise the text in the box in Figure 3 should be changed accordingly (Automated Plume Detectability Labeling to Automated Plume Detection).
L195: Give the definition of the quantiles (every 20 %?) here, instead of in Section 3.1.7.
L198: Which specific model hyper-parameters were tuned?
Section 3.1: Can you mention here that the parameters used as input to the model training and prediction are extracted from a single TROPOMI pixel, and not across the entire plume?
Figure 4: Please enlarge the blue arrow.
Section 3.1.3: Please give the distances and areas also in terms of number of TROPOMI pixels.
L238: Please elaborate on this special consideration within 5 km - why and how this is done.
L255: Can a rough value of how many scenes are filtered out for each criterion be quoted?
Figure 5: Confusion matrices are typically visualised in a grid, such as Figure 7, which I think would be better. What does true and false correspond to, plume or no plume? Is no plume detected in 5b because the centroid lies within the no-plume zone or because either of the statistical significance or absolute minimum conditions were not fulfilled, or both?
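The grid layout suggested above could be sketched as follows (toy labels, not taken from the manuscript; 1 = plume, 0 = no plume):

```python
import numpy as np

# Hypothetical true/predicted plume labels for eight overpasses.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])

# 2x2 grid: rows = true class, columns = predicted class,
# i.e. the layout used in the manuscript's Figure 7.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
```

Such a grid makes it immediately clear which cell counts true negatives ("no plume" correctly labelled) versus true positives, which would resolve the ambiguity of "true" and "false" in Figure 5.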
L276: My interpretation of plume detectability is that the integer number of detected plumes is predicted and then divided by 100. In doing so the numbers in L365 are achieved (6.02 corresponds to 602 detected plumes). Is this correct? If the definition of plume detectability is "plume or no plume", how is this any different from the plume detection in Section 3.1? A proper, clear definition of this key parameter is fundamentally lacking.
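The reading questioned above can be made explicit with hypothetical numbers (these are an assumed interpretation, not a definition taken from the manuscript):

```python
# If the model predicts an integer count of detected plumes which is then
# divided by 100, the reported value 6.02 would correspond to 602 plumes.
n_detected_plumes = 602          # hypothetical integer count of detections
detectability = n_detected_plumes / 100
print(detectability)  # 6.02
```

Stating the normalisation (per 100 overpasses, per all overpasses, or something else) in Section 3 would remove this ambiguity.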
L287: remove the word ‘from’.
Section 3.2: I believe that this text should come in a first dedicated subsection of the results (Section 4). Such a structure would make it easier for the reader to navigate Section 4. First present the performance of the model – in terms of the metrics introduced here – then go on to elaborate on these in connection with the key drivers of plume detectability.
Sec 3.2.2: In the last paragraph, a figure is being discussed with no reference to it, leaving the reader to wonder what is actually being talked about. Is this Figure 7?
L362: Round up/down numbers – detection is an integer value.
Section 4.3: There is an inadequate distinction made between the model performance (metrics depicted in Table 3) and plume detectability, which can be very confusing. This in part comes from the absence of any definition of plume detectability, but also the structure of the results section and the titles of the subsections. A subsection dedicated to the performance of the model would help, without tying this simultaneously to the key drivers of plume detectability. This would help disentangle results from discussion.
Section 4.3: Given that the AUC score is a key metric in the evaluation and interpretation, it would be appropriate to see at least one ROC curve per analysis (one for the US and one for the global dataset), since this is the definition of AUC (area under the ROC curve), with some text describing the result.
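For reference, the requested curve is obtained by sweeping the decision threshold over the model's predicted probabilities; a minimal sketch with toy labels and scores (not from the manuscript) is:

```python
import numpy as np

# Toy true labels and predicted plume probabilities; a real ROC curve would
# use the model's scores on the held-out US and global test sets.
y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Sweep thresholds from high to low, accumulating true/false positive rates.
order = np.argsort(-y_score)
tps = np.cumsum(y_true[order])
fps = np.cumsum(1 - y_true[order])
tpr = np.concatenate(([0.0], tps / y_true.sum()))
fpr = np.concatenate(([0.0], fps / (1 - y_true).sum()))

# AUC is, by definition, the area under this (fpr, tpr) curve.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(round(auc, 3))
```

Plotting `tpr` against `fpr` for each dataset, with the AUC annotated, would directly support the values in Table 3.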
L389: Is this conclusion derived from comparing the different dataset scopes (All vs. Top-X emitters)?
L405: These regions are notoriously difficult for satellite retrievals due to limiting factors such as high aerosol load, water vapour and cloud cover. This is most likely the reason for sparser training data in these regions.
Section 4.4: I would bring this section forward to be the first of the results because it has bearing on all the other results sections (Sections 4.2 and 4.3 at least). To illustrate this, take the result in Section 4.2 that there is a clear geographical distribution in detectability (L370), in which plumes are more often detected over dry, arid areas. This likely has to do with the fact that the surface albedo is high, leading to a higher signal-to-noise ratio in the satellite measurements. Looking at Figure 9, surface albedo is indeed the second most important feature. Therefore, this geographical distribution can be explained by the feature importance, so a more logical flow of presentation would be to place Section 4.4 before the others. Furthermore, the features that affect plume detectability are the most fundamental result.
L443: I think this surprising result needs more interpretation. It is somewhat questionable that plume detectability is higher for wind speeds less than 2 m s-1 in light of what the literature says on this, but the authors give no satisfying explanation of why this would be. Wind speed should also show a hump shape, much like solar zenith angle. Can it be that the real relationship is masked by correlation with other features?
L444: Reference needed when referring to literature.
L470: I think a more thorough presentation of the literature in Section 1 would help to substantiate this claim. This is given in the second paragraph of the introduction, however this could be expanded in more detail.
L515: Why was aerosol optical depth (AOT) not available? Aerosol is the most complex variable to accurately model in trace gas retrievals and is therefore a very important factor in satellite data. I would expect to see AOT near the top of the feature importance ranking in Figure 9, so the absence of this parameter should be properly addressed in the text. This could also be one of the factors limiting the AUC values (L387).
L507: If this is a main conclusion, and it appears so since half of the discussion centres around it, the theoretical expectations deserve an introduction in section 1, with appropriate references to literature, to help substantiate this claim.
Appendix F: Title of this section is misleading. A detection to me implies the presence of a plume, whereas Section F1.2 deals with true negatives which means that there is no plume present. This leads to the question: how do the authors define a detection? From the rest of the subsection titles, this appears to mean correct classification, however this is nowhere explicitly stated.
Citation: https://doi.org/10.5194/egusphere-2025-6008-RC2
Summary
The paper "Global variability in the detectability of power plant NO2 plumes from space" by Huang and Wang presents a plume detection algorithm and trains a NN to check for plume "detectability", i.e. whether a plume is visible from satellite (TROPOMI) measurements or not.
With this NN, the most important input features driving "detectability" can be identified.
This is an interesting approach and helps to understand which conditions need to be fulfilled for successful plume detection and emission estimation from satellite measurements.
Overall, the study is well written, except that proper references and acknowledgements are lacking.
The method is comprehensible, but one major drawback is the use of 10 m winds, which are inappropriate even directly at the power plant due to the stack height. Finally, while "detectability" is interesting, the overall goal is "quantifiability", and the study does not provide information on this.
I recommend publication in AMT after the comments below, which require major revisions, have been appropriately addressed.
General remarks
I see that modifying the input wind fields implies a complete re-analysis of this study. But it would be the way to go to get the best results of the presented methodology.
Additional comments
This section starts with an explanation of what has been done with the power plant data, without first stating where the data come from (is it the EPA data introduced later in 2.3?). This section needs to be restructured such that the input data used are introduced, briefly described, and appropriately referenced and acknowledged first.