CH4Net: a deep learning model for monitoring methane super-emitters with Sentinel-2 imagery
Anna Vaughan, Gonzalo Mateo-García, Luis Gómez-Chova, Vít Růžička, Luis Guanter, and Itziar Irakulis-Loitxate
Abstract. We present a deep learning model, CH4Net, for automated monitoring of methane super-emitters from Sentinel-2 data. When trained on images of 21 methane super-emitters from 2017–2020 and evaluated on images from 2021, this model achieves a scene-level accuracy of 0.83 and a pixel-level balanced accuracy of 0.77. For individual emitters, accuracy is greater than 0.8 for 17 of the 21 sites. We further demonstrate that CH4Net can successfully be applied to monitor two super-emitter locations with similar background characteristics not included in the training set, with accuracies of 0.92 and 0.96. In addition to the CH4Net model, we compile and open-source a hand-annotated training dataset consisting of 925 methane plume masks.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-563', Anonymous Referee #1, 22 May 2023
General comments:
The paper “CH4Net: a deep learning model for monitoring methane super-emitters with Sentinel-2 imagery” proposes a new method for monitoring and detection of methane emissions from Sentinel-2 data with a convolutional neural network segmentation model (UNet). The authors collect a large dataset of Sentinel-2 images for locations of known methane super-emitters (10k+ images) and manually annotate methane plumes in 925 of those images. In contrast to existing methods, this paper proposes a fully automatic system for methane classification/segmentation that operates on single Sentinel-2 images (no manual intervention, time-series, or reference images).
The proposed method is evaluated with good results on two different tasks: monitoring of methane emissions at known locations and detection of methane emissions at unseen locations (of known methane emitters). The chosen use-case is well motivated, as remote detection and monitoring of methane emissions are powerful tools to mitigate the release of greenhouse gas emissions. The paper is well written and describes the proposed methodology in sufficient detail. However, there are some central issues around the machine learning methodology and the presentation of results.
Specific comments:
The methane plume masks are manually annotated using the multi-band multi-pass approach of Varon et al. (2021). Judging from the examples in Figure 2, annotating the plumes depends to a significant degree on knowledge about the exact location of the emitter in the image, as well as on the annotator. It would be valuable to collect an additional set of annotations for the same locations from at least one other annotator. This would help to quantify the uncertainty on the labels (e.g., by computing the intersection over union of different annotators' masks) and provide an upper bound for possible model performance.
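For concreteness, inter-annotator agreement on a pair of binary masks could be computed as in the following minimal sketch (numpy-based; the function name and the zero-union convention are illustrative choices, not from the paper):

```python
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two binary plume masks of the same shape."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # neither annotator marked a plume: perfect agreement
    return np.logical_and(a, b).sum() / union
```

Averaging this score over the doubly annotated images would give a label-noise estimate and an upper bound on the segmentation IoU any model could achieve.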
As a reader, I would also appreciate a negative example (no plume) in Figure 2.

The presented detection and monitoring use-cases both assume known methane emitter locations. Therefore, the benefit of a model that operates on single images only is not clear to me. Approaches like MBMP or time-series analyses are very helpful for detecting methane in Sentinel-2 imagery (illustrated by the use of MBMP for labelling in this work). Why not let the model take advantage of this additional information? Given the spectral bandwidths of the Sentinel-2 MSI, it is very difficult to detect methane in single images. To the best of my understanding, using multiple images would be compatible with the proposed use-cases, as they are restricted to known emitter locations.
My central issue with this work is the split of the dataset for training, validation, and testing. Currently, the authors use three splits: the train set, which contains data from 2017–2020 and all but two locations; a “validation” set of the same locations as the train set with data from 2021; and the “held out dataset” with data from the two remaining locations in 2021. The validation dataset is then used to evaluate the methane monitoring use-case, while the “held out dataset” is used for the methane detection use-case. It is unclear whether the validation dataset is also used for model selection and hyperparameter tuning (as is common in the machine learning literature) or whether a subset of the training data is used for this purpose. In any case, I strongly discourage the use of the same locations for training, validation, and testing, as it allows the model to overfit on seen locations.
Instead, I suggest a random train/validation/test split by location for the detection use-case, and by location and time for the monitoring use-case. The models should be tuned based on the train and validation data, while only reporting final metrics on the test set.

I appreciate the many different metrics and figures describing the evaluation results. However, most of them do not properly take the highly imbalanced nature of the data (most pixels do not contain methane plumes) into account. To provide a more nuanced analysis, I would welcome the addition of a balanced accuracy metric for the classification task and intersection over union scores for the plume segmentation task.
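Minimal sketches of both suggestions follow: a location-disjoint split and the two imbalance-aware metrics. The helper names and the `site_ids` array are hypothetical, and the guards against empty classes are an implementation assumption, not taken from the paper:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_location(scenes, site_ids, test_size=0.2, seed=0):
    """Location-disjoint split: each emitter site lands in exactly one fold."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(scenes, groups=site_ids))
    return train_idx, test_idx

def balanced_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """(TPR + TNR) / 2 over flattened binary pixel masks."""
    t, p = y_true.astype(bool).ravel(), y_pred.astype(bool).ravel()
    tpr = np.logical_and(t, p).sum() / max(t.sum(), 1)
    tnr = np.logical_and(~t, ~p).sum() / max((~t).sum(), 1)
    return (tpr + tnr) / 2

def segmentation_iou(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Intersection over union of predicted and annotated plume pixels."""
    t, p = y_true.astype(bool), y_pred.astype(bool)
    union = np.logical_or(t, p).sum()
    return np.logical_and(t, p).sum() / union if union else 1.0
```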
The detection use-case focuses on locations of known emitters; in my view, a more appropriate test of detection capabilities would be to “scan” a larger area (perhaps multiple adjacent Sentinel-2 images, containing some known emitter locations) to look for plumes. This test would also highlight the proposed model's large-scale data processing capabilities without the need for human intervention or time-series data.
Furthermore, I am curious about the role of temporal patterns in the data that might be correlated with plume presence. For example, are plumes more frequently observed in summer vs. winter?
Technical corrections:
- The link to the code points to a missing webpage.
- Table 2: the “% positive” column is inconsistent with the performance metrics (there is a percent sign there).
- It is my understanding that the plume masks are binary, but in some figures more than two values seem to be present, perhaps an interpolation issue at the plume/background border (e.g., Figs. 3 and 6).
- It would be interesting to compare the “constructed classification” model with a dedicated classifier. Given your dataset, it could be straightforward to train a binary classification model that directly predicts the presence of a sizable plume in each image. Would this model perform better at detection than the UNet? (A minimal sketch of such a classifier follows below.)
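For illustration, a hypothetical dedicated classifier could be as small as the following PyTorch sketch; the band count, layer widths, and pooling choices are assumptions made for this example, not the paper's configuration:

```python
import torch
import torch.nn as nn

class PlumeClassifier(nn.Module):
    """Minimal CNN emitting a single logit: plume present vs. absent."""
    def __init__(self, in_channels: int = 12):  # assumed Sentinel-2 band count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> input-size independent
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))  # sigmoid + threshold at inference
```

Trained with `nn.BCEWithLogitsLoss` on scene-level labels derived from the existing plume masks, this would give a direct point of comparison against the UNet's constructed classification.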
Varon, D. J., Jervis, D., McKeever, J., Spence, I., Gains, D., and Jacob, D. J.: High-frequency monitoring of anomalous methane point sources with multispectral Sentinel-2 satellite observations, Atmospheric Measurement Techniques, 14, 2771–2785, 2021.
Citation: https://doi.org/10.5194/egusphere-2023-563-RC1
AC4: 'Reply on RC1', Anna Vaughan, 21 Aug 2023
We would like to thank all the reviewers for their detailed comments. We have made a number of changes which we believe have significantly improved the manuscript. Below we summarize some major changes we have made, then respond individually to each reviewer's comments. We thank the reviewers again for taking the time to provide such insightful feedback.
RC2: 'Comment on egusphere-2023-563', Anonymous Referee #2, 01 Jun 2023
The paper presents CH4Net, a methane plume detection and segmentation neural network model trained on Sentinel-2 imagery in Turkmenistan. Turkmenistan is known for having optimal observing conditions for remote sensing technology that relies on solar backscatter (a bright, homogeneous, arid region), so the CH4Net results in this paper can be seen as a bounding result for plume detections via Sentinel-2. The authors went to great lengths to create a training set and should be applauded for that effort. I have a few comments on the manuscript regarding how they summarize their results, which I outline below:

1. Line 22. You say that PRISMA and EnMAP provide the most accurate concentration retrievals. What does this mean? In terms of single-sounding precision? That's precision, not accuracy. Also, please be clear what tasking means: they each are limited to X number of X by X km2 tasks per day that are split across a variety of hyperspectral applications.
2. Line 45 and point (2) in your introduction. You previously state that the benefit of your approach is that you only need a single overpass, as opposed to a time-series, like Ehret. However, if you split your data into train/test sets that train on one period of time and test on another period of time at the same location, then intrinsically you have added temporal information to your model. Your model is learning surface features along with plume info, correct?
3. Line 45 and point (2) in your introduction. What is the motivating use-case for not wanting multiple overpasses to reduce noise? Latency for plume detection? Leak detection? It is not made clear in the manuscript how this is a significant benefit. For example, one could envision a spin-up period in which you characterize the surface reflectance features of a region well. Once that is initialized, every subsequent overpass of Sentinel-2 would result in a low-latency plume detection. So the benefit of emphasizing this use-case is not clear to me. Please explain further.
4. Table 1 and scene-level statistics. Can one easily back out the number of detected plumes vs. the number of total plumes using this summary info? If not, can you please include it? In a similar vein as Reviewer #1, I am curious about your model performance if you trained a classification model, e.g., a CNN, on this dataset, and whether it would achieve similar performance.

5. Line 115 and Table 1. Can you please define balanced accuracy in this context? If balanced accuracy = (true positive rate + true negative rate) / 2, for example, then you are still going to get overly optimistic results. For example, assume that 1% of pixels in a scene are plume pixels; then, working backwards, a 77% balanced accuracy score would mean that your true positive rate was only 55%: (55 + 99)/2 = 77% (the arithmetic is spelled out after this list). So why not show these rates in Table 1 as well? Similar to Reviewer #1's comment: did you try metrics like intersection over union? Did they provide similar results?
6. Can we see predictions for your high-quality examples, like Figure 6? In particular, for T21 I would be interested in seeing the plume mask for the correct prediction vs. the false positive.
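Spelling out the arithmetic from point 5 (using the reviewer's hypothetical near-perfect true negative rate of 99% on a scene with ~1% plume pixels):

\[ \mathrm{BA} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} \;\Rightarrow\; \mathrm{TPR} = 2\,\mathrm{BA} - \mathrm{TNR} = 2(0.77) - 0.99 = 0.55 \]

That is, a balanced accuracy of 0.77 is consistent with the model finding only slightly more than half of the plume pixels.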
Citation: https://doi.org/10.5194/egusphere-2023-563-RC2
AC2: 'Reply on RC2', Anna Vaughan, 21 Aug 2023
We would like to thank all the reviewers for their detailed comments. We have made a number of changes which we believe have significantly improved the manuscript. Below we summarize some major changes we have made, then respond individually to each reviewer's comments. We thank the reviewers again for taking the time to provide such insightful feedback.
RC3: 'Comment on egusphere-2023-563', Anonymous Referee #3, 09 Jun 2023
The authors describe a CNN-driven plume detection system trained + validated on multispectral Sentinel-2 (repeat) observations of 26 superplume sites in Turkmenistan. The data collection, labeling, data preprocessing and model preparation + training processes are sound, but there are significant issues with the sampling + validation methodology that require additional work and further clarification. Additionally, the authors provide no comparisons to baseline or state-of-the-art approaches. These issues must be addressed in order to provide the reviewer sufficient context to assess the capabilities of their model and the significance of this application.
My primary concern with this paper is with respect to the impact of spatial bias on the provided results. Specifically, by applying the current training/validation methodology to nearly 1k scenes representing only 26 sites with superplumes, it is highly probable that the CH4Net system is learning to distinguish labeled regions where plumes have previously occurred within the selected sites from (regions in) non-superemitter sites, rather than consistently distinguishing pixels representing CH4 plumes from pixels with no observed CH4 present. The somewhat mixed results on the two held-out sites are inadequate to demonstrate robust plume detection. To demonstrate robust plume detection performance, the authors need to provide additional results where the validation set is spatially disjoint from the training set. A 60/40 train/val split (i.e., all data from 16 sites in the training set vs. all observations from the remaining 10 sites in the val set) should provide roughly similar sampling proportions as their current methodology, and would more effectively capture how well the approach generalizes.
Another concern is that separating superplumes from background enhancements is often achievable with simple image processing methods (e.g., applying a threshold to a band ratio product). The pixelwise concentrations of typical superplume enhancements often (dramatically) exceed the (numerical magnitudes of pixels representing) nominal background enhancements observed in many remote sensing GHG products. The authors provide no comparisons with alternative baseline approaches (e.g., thresholding the MBMP images or the ratio between bands 11/12, with a threshold determined by plume vs. background pixel magnitudes), so the reviewer cannot assess whether a CNN is truly necessary for this detection problem. At minimum, results on the scenewise plume detection task using a basic "straw man" approach should be provided to demonstrate that the detection problem is nontrivial.
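As a concrete straw man, a single-image band-ratio baseline of the kind suggested above might look like the sketch below; the crude median normalization and the threshold value are illustrative assumptions to be tuned on plume vs. background pixel magnitudes, not a validated detector:

```python
import numpy as np

def band_ratio_baseline(b11: np.ndarray, b12: np.ndarray,
                        threshold: float = 0.9) -> bool:
    """Scene-level plume flag from the Sentinel-2 B12/B11 ratio.

    Methane absorbs more strongly within band 12 (~2190 nm) than band 11
    (~1610 nm), so plume pixels depress the B12/B11 ratio relative to the
    scene background.
    """
    ratio = b12 / np.clip(b11, 1e-6, None)
    ratio /= np.median(ratio)        # crude scene-level normalization
    plume_mask = ratio < threshold   # candidate plume pixels
    return bool(plume_mask.any())    # scene-level detection
```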
I would suggest one additional minor change with respect to Table 1: the authors should replace the aggregate pixel-level accuracy / balanced accuracy scores with the pixelwise FPR/FNR (or TPR/TNR) averaged across the validation scenes. Because plumes are relatively rare, the vast majority of pixels are background (negative class) pixels, so if a classifier predicts that all pixels in all scenes are not plumes, the average accuracy will approach 100%. While the balanced accuracy is slightly more informative, it does not specify whether prediction errors are false positives or false negatives.
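A sketch of the suggested replacement metrics, to be averaged across validation scenes (numpy-based; variable names and the empty-class guards are illustrative):

```python
import numpy as np

def scene_fpr_fnr(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Pixelwise false positive and false negative rates for one scene."""
    t, p = y_true.astype(bool).ravel(), y_pred.astype(bool).ravel()
    fpr = np.logical_and(~t, p).sum() / max((~t).sum(), 1)
    fnr = np.logical_and(t, ~p).sum() / max(t.sum(), 1)
    return fpr, fnr

# Averaged across scenes, as suggested:
# rates = [scene_fpr_fnr(t, p) for t, p in validation_scenes]
# mean_fpr, mean_fnr = np.mean(rates, axis=0)
```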
Citation: https://doi.org/10.5194/egusphere-2023-563-RC3
AC3: 'Reply on RC3', Anna Vaughan, 21 Aug 2023
We would like to thank all the reviewers for their detailed comments. We have made a number of changes which we believe have significantly improved the manuscript. Below we summarize some major changes we have made, then respond individually to each reviewer's comments. We thank the reviewers again for taking the time to provide such insightful feedback.
AC1: 'Comment on egusphere-2023-563', Anna Vaughan, 21 Aug 2023
We would like to thank all the reviewers for their detailed comments. We have made a number of changes which we believe have significantly improved the manuscript. Below we summarize some major changes we have made, then respond individually to each reviewer's comments. We thank the reviewers again for taking the time to provide such insightful feedback.