the Creative Commons Attribution 4.0 License.
Machine learning-based emission rate estimates of global methane super-emissions
Abstract. Methane, the second-most important greenhouse gas, has a global warming potential more than 80 times that of carbon dioxide over a 20-year period. Given its decadal atmospheric lifetime, reducing anthropogenic methane emissions is critical for limiting near-term warming. The TROPOspheric Monitoring Instrument (TROPOMI) provides daily global methane satellite observations, enabling rapid detection of super-emitters. Here, we develop ML-SPERE, a machine-learning framework based on a convolutional neural network trained on simulated TROPOMI methane observations and meteorological data to estimate emission rates for super-emitters. ML-SPERE outperforms the Integrated Mass Enhancement (IME) method on simulated plumes that incorporate real TROPOMI backgrounds and missing spatial data, reducing the median absolute percentage error from 42.4% to 24.3% for well-observed methane plumes. ML-SPERE estimates also do not exhibit the low-wind-speed biases present in IME estimates. Applied to TROPOMI observations of a 200-day well blowout in Kazakhstan, ML-SPERE shows better agreement with inverse modeling results and estimates from high-resolution point-source imagers than TROPOMI IME estimates do. Global spatial patterns of methane emissions inferred from ML-SPERE and the IME method for all super-emitters found by TROPOMI in 2021 are broadly consistent, with notable regional differences in northern Russia (where transient pipeline emissions may not be well characterized by either method), the Congo Basin (where IME estimates are potentially inflated due to the large spatial extent of plumes), and southeastern Australia (where IME estimates are potentially negatively biased owing to predominantly low wind speeds). Mean estimated emission rates for this dataset, aggregated by estimated source sector, are similar for both methods.
Overall, improved performance on simulated plumes and consistency with independent estimates for real-world observations demonstrate the utility of ML-SPERE for quantifying TROPOMI methane super-emitters.
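The headline metric above is the median absolute percentage error (MdAPE). The snippet below is a minimal sketch, assuming MdAPE is the median over plumes of |Q_est - Q_true| / Q_true × 100; the preprint's exact definition (for example, how plumes are filtered to "well-observed" ones) is not restated on this page, and the emission rates in the example are made up.

```python
import numpy as np


def median_absolute_percentage_error(estimated, true):
    """Median over plumes of |estimated - true| / |true|, in percent."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    if np.any(true == 0):
        raise ValueError("true emission rates must be non-zero")
    ape = np.abs(estimated - true) / np.abs(true) * 100.0  # per-plume error (%)
    return float(np.median(ape))


# Hypothetical emission rates in t/h, not values from the study.
true_q = [10.0, 20.0, 40.0]
est_q = [12.0, 15.0, 40.0]
print(median_absolute_percentage_error(est_q, true_q))
```

For these hypothetical rates the per-plume errors are 20 %, 25 % and 0 %, so the function returns 20. The median is less sensitive than the mean to a few plumes with very large errors.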
Competing interests: At least one of the (co-)authors is a member of the editorial board of Atmospheric Measurement Techniques.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-1871', Anonymous Referee #1, 08 Jun 2026
- RC2: 'Comment on egusphere-2026-1871', Anonymous Referee #2, 16 Jun 2026
General comments
This manuscript by Roberts et al. presents a description and performance evaluation of ML-SPERE, a machine learning-based method to infer point source emissions of methane from data from the Tropomi instrument aboard Sentinel-5p and meteorological reanalysis data. The analysis shows that ML-SPERE outperforms the IME method, a commonly used method for inferring point source emissions from Tropomi. For a case study, the manuscript also includes a comparison to estimates from other data sources and methods. While I can't judge all the details of the ML method (especially sections 2.3 and A2) because I have no hands-on experience with ML methods, I think that the results clearly show an advantage of ML-SPERE for fast and accurate emission retrieval from vast amounts of satellite data compared to the existing state of the art. Thus, the study makes a valuable contribution to the field of independent greenhouse gas emission monitoring and verification. I would like to commend the efforts that went into quantifying the uncertainty of the emission estimates. At a high level, my only real criticism is that some uncertainties and potentials for future improvement should be stated more clearly in abstract and conclusions. In my opinion, the sources of uncertainty should include modeled plume morphology, diversity of the training dataset and regression dilution (all in principle improvable with future research). In addition, in the Karaturun case study, the text about the agreement of ML-SPERE and IME with inversion-based emission estimates sounds a bit more optimistic than it looks to me in Fig. 5. Finally, I would like to highlight one specific comment on area sources, because I'm not sure that the interpretation of IME results for them is correct (see my comment on line 409).
In addition to these points, several clarifications and minor edits are necessary prior to publication. Nonetheless, I think it's a very good manuscript and I recommend publication after minor revisions that address the following comments.
Specific comments
Lines 8f: I stumbled on "the bias" because it's the first time the bias is mentioned. I suggest rearranging the sentence, e.g. "IME exhibits a bias at low wind speed; ML-SPERE does not".
Line 25: To round out the argumentation, you could insert a sentence on the relevance of observations for mitigating super-emitters.
Lines 33 and 191: Since you bring up computational cost as one of the benefits of machine learning-based methods in the introduction: Can you add a few details about the computational cost of ML-SPERE vs IME? Line 191 has a bit on the cost of ML-SPERE training, but there is nothing concrete about the cost of IME in the text.
Lines 51f: "converting satellite-retrieved methane columns into plume enhancements and accurately distinguishing plume pixels from background concentration": This sounds like two different sources of uncertainty, and I think you mean "by accurately distinguishing", which would be easier to read.
Lines 64f: "If the plume can be accurately modeled, inversions in general provide the most accurate emission rate estimates". That's a big "if" and therefore, I'm not sure the statement is both relevant and correct. Can you provide a source where the performance of mass-balance methods vs. inversions was systematically assessed? In general, inversions, just like mass-balance methods, rely on the accuracy of modeled winds, plume rise, etc. Specifically, the common discrepancy between modeled and real wind fields can lead to displacements of modeled plumes, leading to a double-penalty for modeled emissions and thus underestimation and/or smearing out of point-sources by inversions. Mass-balance methods don't have this particular issue. Unless you can corroborate the statement, I would remove it.
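The double-penalty argument above can be made concrete with a toy one-parameter least-squares inversion. This is an illustrative assumption, not the inversion used in the manuscript (no prior or error covariances): fitting a single scale factor to match a displaced modeled Gaussian plume to an observed one gives exp(-d²/(4σ²)) < 1 for a displacement d and plume width σ, i.e. an underestimate even when the modeled emission is exactly right. Function names and plume dimensions are hypothetical.

```python
import numpy as np


def gaussian_plume_1d(x, center, sigma):
    """Unit-amplitude 1-D Gaussian cross-section of a plume."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)


def least_squares_scale(observed, modeled):
    """Scale factor s minimizing ||observed - s * modeled||^2."""
    return float(np.dot(observed, modeled) / np.dot(modeled, modeled))


x = np.linspace(-50.0, 50.0, 10001)  # cross-plume coordinate (km), fine grid
sigma = 5.0                          # plume half-width (km), illustrative
true_scale = 1.0                     # true emission relative to the prior

observed = true_scale * gaussian_plume_1d(x, 0.0, sigma)
for shift in (0.0, 2.5, 5.0, 10.0):
    # Modeled plume with the correct emission but displaced by a wind error.
    modeled = gaussian_plume_1d(x, shift, sigma)
    s = least_squares_scale(observed, modeled)
    analytic = np.exp(-shift**2 / (4 * sigma**2))
    print(f"shift={shift:5.1f} km  fitted scale={s:.3f}  analytic={analytic:.3f}")
```

With a 5 km half-width, a 5 km displacement already lowers the fitted scale to about 0.78 (exp(-0.25)). A mass-balance estimate of the same observed plume would not depend on where the modeled plume lands.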
Lines 81f: "This is due to the fact that": not entirely clear what "this" refers to. Perhaps "The method relies on the fact that" or something like that.
Line 98: Suggestion: "previously analyzed using inverse modeling" -> "previously analyzed using inverse modeling and compared to IME estimates based on GHGSat data"
Lines 115f: While you claim that your training and validation dataset covers "varied surface and meteorological conditions", it strongly focuses on northern hemispheric summer (6/7 simulated domains), and specifically, on areas close to the Caspian Sea (5/7 domains). I'm concerned that this choice might lead to overfitting (performing better on scenes that are close to the conditions at the training scenes and worse compared to others). Appendices A5 and A6 partly address this concern and you conclude that training set diversity is much less important than plume morphology for the performance. However, the argument is based on cross-validation within the limited training dataset, so I'm not fully convinced that geographic and seasonal limitations in the training data play only a negligible role in the performance. One could further examine the performance of the model on varied scenes. However, the performance advantage of ML-SPERE vs. IME is clearly shown in Sect. 3.1 and results are similar on average as for IME for a large variety of scenes (Fig. 6). Therefore, I don't think it's necessary to dissect the performance of the model further in this paper. But I suggest that you consider including an outlook with potential improvements in the conclusions, and that one of them may be a more varied training/validation dataset (i.e., other domains and seasons). For example, a future training set could cover situations where the current ML-SPERE and IME differ (the ones in Sect. A8). More on that outlook in my comment on lines 429ff.
Lines 122f: The split of the simulated plumes into training and validation dataset was random (ratio 90/10). Upon first reading, I was concerned that including the same domains in both training and validation could lead to overfitting. Please add a reference to Section A5 here, where this concern is addressed and quantified.
Line 131: Would it not be better to use a uniform distribution, to alleviate the regression dilution you observed? Fig. 7g of Schuit et al. suggests that the high emitters are severely underrepresented if the training dataset follows that distribution. If you agree, consider including that as an outlook.
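One way to act on this suggestion is to sample training emission rates uniformly rather than from a field-like, heavy-tailed distribution. The sketch below compares the share of high emitters under both choices. The lognormal parameters are hypothetical stand-ins, not the distribution in Fig. 7g of Schuit et al., and the function name and `margin` argument are illustrative; `margin` extends the sampled range above the largest rate of interest, in the spirit of the Plewa et al. (2025) approach described in the comment on Line 298.

```python
import numpy as np


def sample_training_rates(n, q_min, q_max, distribution="uniform", margin=0.0, seed=0):
    """Draw n emission rates (t/h) for synthetic training plumes.

    "uniform": uniform on [q_min, q_max * (1 + margin)).
    "lognormal": heavy-tailed stand-in with hypothetical parameters,
    clipped to the same range.
    """
    rng = np.random.default_rng(seed)
    upper = q_max * (1.0 + margin)
    if distribution == "uniform":
        return rng.uniform(q_min, upper, size=n)
    if distribution == "lognormal":
        # Hypothetical median of 10 t/h and log-sd of 0.8; not fitted to any dataset.
        samples = rng.lognormal(mean=np.log(10.0), sigma=0.8, size=n)
        return np.clip(samples, q_min, upper)
    raise ValueError(f"unknown distribution: {distribution}")


n = 100_000
for dist in ("uniform", "lognormal"):
    q = sample_training_rates(n, 1.0, 100.0, dist)
    share_high = np.mean(q > 50.0)
    print(f"{dist:9s}: share of plumes above 50 t/h = {share_high:.3f}")
```

Under the uniform choice roughly half of the plumes exceed 50 t/h, whereas the heavy-tailed stand-in places only a few percent there, which is the kind of underrepresentation the comment is concerned about.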
Lines 134f: "The pixel containing the plume source remains within the central 28×28 pixel region of the cropped scene" - Does that mean that in some cases, the plume is only a few pixels long because it is close to the downwind edge of the scene?
Lines 142f: I'm amazed that the attempt without wind data led to anything but noise at all. Please specify the statement "30% lower relative performance than models incorporating wind channels" - I assume "lower relative performance w.r.t. the test dataset"? Out of curiosity: Do you know how the model learned to distinguish wind speeds in the absence of wind channels? I guess it must have something to do with plume morphology after all, though I didn't expect that there is that much information on that at the scale of Tropomi pixels.
Lines 166ff: Please specify how many unique backgrounds were used. E.g. one per scene (224 unique backgrounds) or one per plume (438 unique backgrounds)?
Line 214: A comment (no need to change anything in the manuscript): I think the background uncertainty could be overestimated by sampling from the standard deviation and not the standard error of the background pixels (sigma/sqrt(n)). Though the number of pixels n would be debatable.
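To make the distinction concrete, the sketch below computes both quantities for a set of background pixels. The pixel values are hypothetical, NaNs stand in for missing pixels, and the standard error assumes independent pixels; that assumption is exactly where the choice of n becomes debatable, since spatially correlated background errors would make sigma/sqrt(n) too optimistic.

```python
import numpy as np


def background_uncertainties(background_ppb):
    """Return (standard deviation, standard error of the mean) of background pixels.

    NaNs (missing pixels) are ignored. The standard error assumes independent
    pixels, which is optimistic if background errors are spatially correlated.
    """
    values = np.asarray(background_ppb, dtype=float)
    values = values[~np.isnan(values)]
    n = values.size
    sigma = values.std(ddof=1)  # sample standard deviation
    return float(sigma), float(sigma / np.sqrt(n))


# Hypothetical background XCH4 values (ppb); np.nan marks a missing pixel.
bg = [1880.0, 1890.0, 1885.0, 1875.0, np.nan, 1895.0]
sd, se = background_uncertainties(bg)
print(f"std = {sd:.2f} ppb, standard error = {se:.2f} ppb")
```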
Lines 216-222: A comment (no need to change anything in the manuscript): I'm concerned that this method could lead to unphysical wind fields and unphysical combinations of wind field and plume morphology. However, since this method is "only" applied to obtain an uncertainty estimate, I would leave it as is.
Line 265: By "constant absolute error", I think you mean a "constant bias"? Please reference the plot that shows it (Fig. 3d) in the text. Also, the linear regression line shown in Fig. 3d doesn't capture the "constant" bias described in the text, of course. To clarify, could you add density maps to panels c and d, similar to panels a and b? It's a detail, but it could help to match the description in the text to what is shown in the figure.
Lines 270-282: Here and in Table 1, please also describe the performance of ML-SPERE vs IME for plumes that are not well observed.
Line 293 and Line 584: However, the tests in A5 and A6 do not examine the limited seasonal coverage of the training/validation dataset. See my comment on lines 115f.
Fig. A3 a: Do plumes that extend to the edge of the domain incur a bias in either the IME or the ML-SPERE method?
Line 298: Please note that Plewa et al. 2025 (https://doi.org/10.1016/j.rse.2025.115002), whom you cite above, discussed and found a solution for a low bias at high emission rates in their ML-based emission estimation method. The solution was to extend the range of emissions in the training dataset beyond emission rates that the trained model is later applied to. I think that solution could be applied to improve ML-SPERE as well - perhaps include that as an outlook for ML-SPERE.
Line 299: "greatly alleviated" - please be more specific (e.g. "reduced by 1/3", judging from the linear fit equations in Figures 3d vs A4d).
Fig. 4a: Please double-check whether the OLS equation is correct. The intercept in the plot is at a slightly different position than in the equation (around -38 t/h instead of -44.21 t/h), and "44.21" looks like it could be a copy-paste error from the intercept shown in Fig. 4b (-4.21 t/h).
Line 303: "IME ... slightly overestimating them at high wind speeds". The figure that supposedly shows this (Fig. 4a) stops at 2 m/s - still a low wind speed. Also, the IME results in Fig. 4a don't show a bias at the "high" wind speed as claimed in the text. Please extend the figure to show the results for higher wind speeds. The same applies to Fig. 6b.
Lines 308-310: Is this not a shortcoming of this particular effective wind speed parameterization, rather than of the IME method? It seems like it could be fixed.
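For reference, the sketch below writes out the IME estimate with a linear effective wind speed, assuming the common form Q = U_eff · IME / L with L = sqrt(A) and U_eff = a·U10 + b. The coefficients a and b are hypothetical placeholders, and the preprint's actual parameterization may differ. Written this way, any low-wind bias enters through U_eff (in particular through the intercept b), which is the part the comment suggests could be recalibrated without changing the IME method itself.

```python
import math


def ime_emission_rate(ime_kg, plume_area_m2, u10_m_s, a, b):
    """IME mass-balance estimate Q = U_eff * IME / L, returned in t/h.

    L = sqrt(plume area) is the plume length scale and U_eff = a * U10 + b is a
    linear effective-wind-speed parameterization; a and b are placeholders that
    would be calibrated against simulated plumes.
    """
    if ime_kg < 0 or plume_area_m2 <= 0:
        raise ValueError("IME must be non-negative and plume area positive")
    length_m = math.sqrt(plume_area_m2)
    u_eff = a * u10_m_s + b
    q_kg_s = u_eff * ime_kg / length_m
    return q_kg_s * 3.6  # kg/s -> t/h (3600 s/h, 1000 kg/t)


# Same plume mass and area at different 10 m wind speeds, hypothetical a and b.
# With b > 0, U_eff stays positive in calm conditions, so low-wind estimates
# hinge on how well b is calibrated in that regime.
for u10 in (0.5, 2.0, 6.0):
    print(u10, round(ime_emission_rate(5.0e4, 1.0e8, u10, a=0.5, b=0.5), 2))
```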
Lines 10, 339 and 455: In my opinion, the statement that "ML-SPERE agrees well with the inversion" sounds too vague and downplays discrepancies between the methods: In Fig. 5 and subsequent paragraphs, it is shown that Tropomi IME and Tropomi ML-SPERE are much closer to each other than either is to the Tropomi inversion; both show a large bias and low correlation relative to the inversion results. To me, the phrasing in abstract and conclusions sounds like there is a much bigger improvement in the agreement when using ML-SPERE instead of IME than there actually is. Please phrase these statements more carefully.
Line 345: "For days in the time series, ..." - This phrase sounds incomplete; did you mean to include a number of days here?
Lines 358f: I don't fully understand this sentence. In line 358, it sounds like the metrics (R and bias) refer to ML-SPERE vs inversion results. In line 359, it says these numbers describe the agreement between ML-SPERE and a known truth. Which is it? Also, it says here that these results are "improved" compared to applying ML-SPERE to the actual data. But if I understand correctly, they refer to a subset of the cases from Fig. 5. So the reference point for the comparison (ML-SPERE applied to real data and inversion for this subset) is missing.
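A small helper like the one below, a sketch with hypothetical inputs, would compute R and the mean bias in the same way for any pair of series, so that ML-SPERE vs. inversion (real data) and ML-SPERE vs. known truth (synthetic data) could be reported side by side for the same subset of days. The bias is defined here as estimate minus reference; the manuscript may use a different convention.

```python
import numpy as np


def agreement_metrics(estimates, reference):
    """Pearson correlation and mean bias (estimate - reference) for paired values."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(reference, dtype=float)
    mask = ~(np.isnan(est) | np.isnan(ref))  # keep only days with both values
    est, ref = est[mask], ref[mask]
    r = float(np.corrcoef(est, ref)[0, 1])
    bias = float(np.mean(est - ref))
    return r, bias


# Hypothetical daily rates (t/h) for one subset of days; not data from the study.
inversion = [30.0, 42.0, 25.0, np.nan, 38.0]
ml_spere = [22.0, 35.0, 20.0, 27.0, 31.0]
r, bias = agreement_metrics(ml_spere, inversion)
print(f"R = {r:.2f}, bias = {bias:.1f} t/h")
```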
Line 362: What do you mean by steady state emission and _transport_ assumptions? Constant mean wind speed? Be specific. All other occurrences of "steady-state" in the text only refer to emissions.
Lines 362-364: I might have missed it, but why is it expected that IME is affected more strongly than ML-SPERE by violations of steady-state assumptions?
Lines 375f: "inverse modeling is generally considered the most accurate approach for estimating methane emissions from TROPOMI data when good plume matches can be obtained". See my comment to line 64f. I think the inversion might underestimate the true emissions for the reasons explained there.
Lines 384-386: "ML-SPERE estimates tend to exceed those of the IME method at low IME-estimated emission values, but are generally lower at higher IME-estimated emission values". Add a figure in the appendix that demonstrates this statement? Fig. 6a kind of has this I guess, but I find it hard to see. A plot of the difference between ML-SPERE and IME vs either estimate could show this better.
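As an alternative to reading the pattern off Fig. 6a, the difference could be summarized per bin of the IME estimate. The sketch below (hypothetical function name and inputs) returns the median of ML-SPERE minus IME in each bin; positive values at low IME and negative values at high IME would support the statement in the text. Bins without plumes return NaN, and values outside the bin edges (including a value exactly at the last edge) are ignored.

```python
import numpy as np


def binned_difference(ime, ml, bin_edges):
    """Median of (ML-SPERE - IME) within bins [edge_k, edge_k+1) of the IME estimate."""
    ime = np.asarray(ime, dtype=float)
    diff = np.asarray(ml, dtype=float) - ime
    idx = np.digitize(ime, bin_edges) - 1  # bin index for each plume
    medians = []
    for k in range(len(bin_edges) - 1):
        in_bin = idx == k
        medians.append(float(np.median(diff[in_bin])) if in_bin.any() else float("nan"))
    return np.array(medians)


# Hypothetical paired estimates in t/h.
ime_q = [5.0, 8.0, 15.0, 30.0, 60.0]
ml_q = [7.0, 10.0, 14.0, 25.0, 50.0]
print(binned_difference(ime_q, ml_q, [0.0, 10.0, 50.0, 100.0]))
```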
Lines 401-407: The first reason for potential overestimation given in A8 - regression dilution - is not mentioned here. Why not include it?
Line 406: "especially if the plume has detached" - Why does detachment of the plume cause IME errors? Is it simply a strong deviation from the steady-state assumption or is something else going on?
Line 407: "the total plume mass may provide a more important characterization" -> more important than what and for what?
Lines 13f, 409 and 634-637: Could you please confirm / elaborate on the conclusion that IME estimates could be too high for diffuse sources? I tried following this argument with pen and paper and arrived at the opposite conclusion.
Consider a simple "diffuse" source consisting of two identical point sources, Q_single, in adjacent pixels of the swath, and that their two plumes are right next to each other but don't overlap - forming a single, broader plume from a "diffuse" source. The true emission rate is 2*Q_single. The combined IME is twice the single IMEs, and the combined length is sqrt(2) times the length of the single plumes (because the combined area is twice the single areas and L=sqrt(A)). If I compute Q from the combined plume, I get u_eff*(2*IME_single)/(sqrt(2)*L_single) = sqrt(2)*Q_single - an underestimation of the true emission rate of 2*Q_single. Intuitively, the reason for the underestimation is that the broadened plume has a "longer" length L (because it's computed from the area) than a point source with the same emission rate would have. So IME misinterprets the more diffuse source as a longer residence time.
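The thought experiment above can be checked numerically under the same simplifying assumptions (both plumes share one U_eff, L = sqrt(A), and the plumes do not overlap). The sketch below generalizes it to N identical adjacent sources; the function names are illustrative.

```python
import math


def ime_rate(ime, area, u_eff):
    """IME estimate Q = U_eff * IME / L with L = sqrt(area), in consistent units."""
    return u_eff * ime / math.sqrt(area)


def diffuse_to_true_ratio(n_sources, u_eff=1.0, ime_single=1.0, area_single=1.0):
    """Ratio of the IME estimate for N merged adjacent plumes to their true total rate."""
    q_single = ime_rate(ime_single, area_single, u_eff)
    # Mass and area both scale with N, so L grows only by sqrt(N).
    q_est = ime_rate(n_sources * ime_single, n_sources * area_single, u_eff)
    return q_est / (n_sources * q_single)


for n in (1, 2, 4, 9):
    print(n, round(diffuse_to_true_ratio(n), 3))
```

For N = 2 the estimate is 1/sqrt(2), about 71 %, of the true combined rate, and in general 1/sqrt(N), matching the underestimation argued above.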
From these considerations, I conclude that IME risks underestimating emissions from diffuse sources, the opposite of what the text says and of what we seem to see in the wetland results in Fig. 6c, where the IME estimate is higher than ML-SPERE. Perhaps I made an error in my simplified case. But perhaps ML-SPERE also underestimates area sources, but more severely than IME?
Lines 429ff: In the abstract and conclusions, add a statement of the major sources of uncertainty of ML-SPERE identified in the study. In addition to "well-observed" vs "not well observed" plumes, dataset diversity and regression dilution, I think the fidelity of modeled plume morphology should also be highlighted (as identified in Section A6). This part falls short in the abstract and conclusions. The results in A6 raise the question of how accurate the WRF-Chem morphology is, and thus whether ML-SPERE results could be improved.
Lines 584-588: These sentences are not completely accurate and miss the key point of the differences between WRF-Chem and HYSPLIT. The current text hints at the difference in transport representation between WRF-Chem and HYSPLIT: "WRF-Chem can only simulate “point sources” at a resolution equivalent to the native grid of the simulation, whereas HYSPLIT simulates particle emission at an actual point". This statement hints at the fact that WRF-Chem is a Eulerian model and HYSPLIT is Lagrangian. But that doesn't explain the observed differences, namely that the HYSPLIT plumes were more diffuse. Instead, the key point is that both models rely on parameterizations for subgrid-scale processes (most importantly, turbulence), and the schemes that implement them differ. Please refer to Brunner et al., 2023 (https://doi.org/10.5194/acp-23-2699-2023) for an intercomparison of plume morphologies among GHG transport models. As seen in Brunner et al., Eulerian vs Lagrangian transport isn't necessarily the key difference among models, as large discrepancies also exist among the different Eulerian models investigated in that study.
Lines 616-618: "It is therefore plausible that ML-SPERE may overestimate emission rates for short-duration, highly concentrated emissions where mass-balance structure differs from that of steady-state plumes". Is overestimation more plausible than underestimation? It's not clear to me from the text if the deviation is expected to be positive or negative. If you mean the more general statement that errors (over- or underestimation) are plausible for these plumes, please rephrase accordingly.
Technical corrections
Line 1: "second-most"
Line 34: "machine learning (ML)-based"
Table 1 caption: "In the first two rows we" -> "In the first two rows, we" (add comma)
Line 88: "extract meaningful patterns between" -> "extract meaningful patterns from"
Line 312: "In contrast" -> "By contrast"
Line 364: "as large than" -> "as large as"
Citation: https://doi.org/10.5194/egusphere-2026-1871-RC2
General Comments
The manuscript presents a machine learning approach for the inference of methane emission rates from large point sources detected using the TROPOMI satellite. It compares this new methodology to the widely used Integrated Mass Enhancement (IME) estimation method and shows superior performance to the IME method, particularly at low wind speeds. The manuscript is timely and well-structured with clear communication. Some clarification is required to represent cited work more accurately, and to fully explore the behavior of the IME source rate quantification and its sources of uncertainty in this study.
Specific Comments