This work is distributed under the Creative Commons Attribution 4.0 License.
FootNet v1.0: Development of a machine learning emulator of atmospheric transport
Abstract. There has been a proliferation of dense observing systems to monitor greenhouse gas (GHG) concentrations over the past decade. Estimating emissions from these observations typically requires an atmospheric transport model to characterize the source-receptor relationship, commonly termed the measurement "footprint". Computing and storing footprints using full-physics models is becoming expensive due to the requirement of simulating atmospheric transport at high resolution. We present the development of FootNet, a deep learning emulator of footprints at kilometer scale. We train and evaluate the emulator using footprints simulated with a Lagrangian particle dispersion model (LPDM). FootNet predicts the magnitudes and extents of footprints in near-real-time with high fidelity. We identify the relative importance of the input variables of FootNet to improve the interpretability of the model. Surface winds and a precomputed Gaussian plume from the receptor are identified as the most important variables for footprint emulation. The FootNet emulator developed here may help address the computational bottleneck of flux inversions using dense observations.
Status: final response (author comments only)
CEC1: 'No compliance with the policy of the journal', Juan Antonio Añel, 16 Jul 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
Multiple issues preclude the publication of your manuscript as it is now. I want to make you aware that, given these issues, your manuscript should not have been accepted for Discussions, as no manuscript can be accepted for Discussions before securing full compliance with the code and data policy of the journal. Therefore, the current situation is irregular, and we would appreciate it if you could address it as soon as possible. Otherwise, we will have to reject your manuscript for publication because of a lack of compliance with our policy.
First, as the Topical Editor pointed out in the previous stages after submission, Git repositories are unacceptable for scientific publication. However, some assets in your manuscript continue to be stored in such a repository: namely, FootNet v1.0 is stored in GitHub. You must move this repository to one we accept (check our policy) and reply to this comment with its link and DOI. GitHub itself instructs authors to use other alternatives for long-term archival and publishing. Another issue with this repository is that it does not contain a license. If you do not include a license, despite what you state, the code is not FLOSS; it remains your property. Therefore, when uploading the model's code to Zenodo, you could choose a free software/open-source (FLOSS) license. We recommend the GPLv3: you simply need to include the file https://www.gnu.org/licenses/gpl-3.0.txt as LICENSE.txt with your code. You can also choose other options that Zenodo provides: GPLv2, Apache License, MIT License, etc. Otherwise, nobody can use this code, despite it being public.

The link for the STILT model is in GitHub, too, and the situation in this case is worse: the webpage linked only provides instructions on how to install pre-compiled packages. We cannot accept this. Again, you should publish the STILT code in the same way as for FootNet v1.0. I understand that you have not run the STILT model yourself. If this is the case, please make it explicit in a reply to this comment, and we can check whether, in this case, we can make an exception to our request for the STILT code.
You provide a tarball with footprints. First, again, the repository you use is not acceptable; please address this and give the link and DOI for the new repository. Second, the data are provided in the NumPy compressed array format; for the sake of reproducibility, it would be good to give the Python and NumPy version numbers you used to produce them. Third, you should not provide only "examples" of the training data but the entire training data. Again, if it is too large, we could make an exception, but we need to know the size of your training dataset to evaluate it.
Please reply to this comment as soon as possible, addressing all these issues. As I have said, the situation of the manuscript is irregular, and we need to be sure that your manuscript can comply with our policy to avoid investing time in it when we may have to reject it.

Thanks,
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-1526-CEC1
AC1: 'Reply on CEC1', Tai-Long He, 16 Jul 2024
Dear GMD editor,
Thank you for your help with bringing our manuscript into compliance with your policy, and I apologize for the inconvenience caused by the lack of compliance in the previous version.
We have now archived the GitHub repository on Zenodo with a GPLv3 license included. The link to the Zenodo archive is https://zenodo.org/records/12752655, and the assigned DOI is 10.5281/zenodo.12752655.

For the STILT code, we ran the model without any modification or development. As such, this is not our model code, and it seems inappropriate for us to publish their code elsewhere. We have provided links to their code, publications, and websites.
We have also uploaded the sample footprints to the same Zenodo archive. We only provide samples because the complete training data set consists of 20,000 files with a total size of 852 GB, which makes it hard to archive on Zenodo. We are happy to share the complete training data set upon request. We have also provided the version numbers of Python and NumPy used to decompress the footprint files.
Please let us know if there is anything still not compliant with your policy.
Thanks,
Tai-Long He

Citation: https://doi.org/10.5194/egusphere-2024-1526-AC1
CEC2: 'Reply on AC1', Juan Antonio Añel, 19 Jul 2024
Dear authors,
Thanks for your reply and for making the FootNet v1.0 code available in Zenodo.
Regarding the STILT model, I would ask you to make an effort and contact its developers, asking them, at minimum, to deposit the model in a Zenodo private repository. By doing so, we can be sure that the model (and therefore the assets necessary to reproduce your work) is available and "guarded", i.e., will not be lost. In a Zenodo private repository, those submitting an asset keep control over who can access it, while the asset itself is preserved.
Regarding the footprints, your current provision of 6 GB of data represents less than 1% of the dataset. To replicate the process effectively, at least 20-30% of the dataset would be necessary, as is often the case when splitting datasets between training and validation. Given that Zenodo allows repositories of up to 50 GB, if you could deposit 150 GB of footprints, we could better ensure the reproducibility of the work. I think this is a fair request. Please provide the three links and DOIs in your reply and in the revised versions of your work.

Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-1526-CEC2
AC2: 'Reply on CEC2', Tai-Long He, 24 Jul 2024
Dear GMD editor,
Thank you for your comments.
We have requested permission from the developers to share the STILT model. The STILT model is now archived in a Zenodo private repository (https://zenodo.org/records/12803589; DOI:10.5281/zenodo.12803589).

We have also deposited 150 GB of footprints in three Zenodo repositories, which is about 20% of the training set. Links and DOIs of the three repositories are as follows:
Repository 1 at https://zenodo.org/records/12803617 with DOI:10.5281/zenodo.12803617.
Repository 2 at https://zenodo.org/records/12803736 with DOI:10.5281/zenodo.12803736.
Repository 3 at https://zenodo.org/records/12803855 with DOI:10.5281/zenodo.12803855.

We will update the data availability section in the revised manuscript in response to reviewers. Please let us know if there is anything still not compliant with your policy.
Thanks,
Tai-Long He
Citation: https://doi.org/10.5194/egusphere-2024-1526-AC2
CEC3: 'Reply on AC2', Juan Antonio Añel, 26 Jul 2024
Dear authors,
Many thanks for your reply. We can consider the current version of your manuscript in compliance with our code and data policy.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-1526-CEC3
RC1: 'Comment on egusphere-2024-1526', Anonymous Referee #1, 23 Aug 2024
The article introduces a deep learning-based "footprint" simulator, which estimates emissions using greenhouse gas observation data. This simulator offers high-fidelity, near-real-time predictions of the footprint's size and extent and enhances the model's interpretability by determining the relative importance of input variables through the addition or reduction of these variables. The article is well-structured, flows smoothly, and provides detailed information. However, before considering the article for acceptance, the following issues are suggested for further consideration:
- The article mentions that the four-dimensional variational method (4D-Var) and the Kalman filter method (both of which are based on full physical field models for flux inversion) become expensive with increased resolution. However, the article does not clearly demonstrate the advantages of the deep learning-based "footprint" simulator compared to these full physical field model-based methods. The author provided some time consumption data, but they are not directly comparable since 640 core-hours and 32 cores for 1 second cannot be directly compared without knowing whether both can achieve parallelism and the efficiency of that parallelism. Additionally, the author did not theoretically analyze why the deep learning-based "footprint" simulator can reduce costs, such as whether the speedup is due to the structure of the machine learning algorithm or hardware acceleration by GPUs. If both contribute, what is the respective contribution of each?
- The author's description of the input and output data is unclear. For instance, why is the most important quantity, the observed concentration data, not included as an input? Shouldn't the output be a spatial distribution of emissions? Why is there only a logH, and what does H represent? What is the Gaussian plume, and what are its physical significance and calculation process? The impact of its inclusion or exclusion on the results is not shown in Figure 2. Why is data only needed 6 hours in advance and not earlier? Has the author conducted similar tests with data from earlier times?
- The consistency of the results in this study is not well-established, and some statistical indicators are not very high. The author should provide a detailed analysis of the reasons for the lack of accuracy and suggest directions for future improvement.
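On the Gaussian plume question: in transport-emulation setups, a precomputed plume input is typically a steady-state analytical concentration field computed from the receptor location and local wind. As a rough illustration only (the grid, wind speed, and power-law dispersion coefficients below are assumptions, not the manuscript's parameterization), a ground-level Gaussian plume field can be computed as:

```python
import numpy as np

def gaussian_plume(nx=64, ny=64, dx=1000.0, u=3.0, Q=1.0):
    """Ground-level concentration from a steady point source at the origin,
    for a ground-level release and receptor (simplified Pasquill form).
    sigma_y and sigma_z use illustrative Briggs-style open-country fits."""
    x = np.arange(1, nx + 1) * dx             # downwind distance (m)
    y = (np.arange(ny) - ny // 2) * dx        # crosswind distance (m)
    X, Y = np.meshgrid(x, y, indexing="ij")
    sig_y = 0.08 * X / np.sqrt(1 + 1e-4 * X)  # crosswind spread (m)
    sig_z = 0.06 * X / np.sqrt(1 + 1.5e-3 * X)
    # C(x, y) = Q / (pi * sig_y * sig_z * u) * exp(-y^2 / (2 sig_y^2))
    return Q / (np.pi * sig_y * sig_z * u) * np.exp(-Y**2 / (2 * sig_y**2))

plume = gaussian_plume()  # (64, 64) field, maximum along the centerline
```

A field like this encodes the wind direction and approximate dispersion geometry, which is presumably why it helps the emulator localize the footprint.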
Citation: https://doi.org/10.5194/egusphere-2024-1526-RC1
RC2: 'Comment on egusphere-2024-1526', Anonymous Referee #2, 27 Aug 2024
The authors construct a machine-learning emulator to generate atmospheric transport simulations for greenhouse gas and air quality modeling. Atmospheric transport simulations are a major computational bottleneck in greenhouse gas and air quality modeling, and I think this paper addresses an important topic. I enjoyed reading this paper, and I think it will make a nice contribution to the literature. I have several suggestions for editing the paper:
High-level suggestions:
I think the article would benefit from a much more extensive evaluation of the emulated footprints. The figures in the manuscript examine the correlation between the log of the true footprints and the log of the emulated footprints. At the end of the day, many modelers ultimately care about the accuracy of simulated greenhouse gas or air pollution mixing ratios. Hence, I personally think that the correlation between the footprint values is necessary but not sufficient to convince many modelers (including myself) to use a tool like FootNet. For example, suppose one used these footprints to model CO2 and CH4 mixing ratios. Would these footprints capture peaks and troughs in CO2 and CH4 (i.e., if one were to plot CO2 and CH4 at individual sites as timeseries)? Would the emulated footprints capture spatial variability in atmospheric CO2 or CH4 levels? Suppose there were CH4 super-emitters scattered in the Barnett Shale. Would the emulated footprints accurately capture the impact of those super-emitters on downwind atmospheric observations? If one were to model CO2 using these emulated footprints, would they capture diurnal variability in CO2 mixing ratios (i.e., due to variability in both fluxes and boundary layer dynamics)?

In addition, I imagine it might not always be practical to use 85% of the data for training. For example, if one wanted to run footprints for a large satellite dataset, it might (hypothetically) only be feasible to use 5% or 10% of the data for training. In that case, one would need to train FootNet on a limited number of data points and then run the trained algorithm on a much larger number of data points. Do you have a sense of how FootNet would perform in this circumstance?

Both of the case studies described in this paper are for small geographic regions (e.g., San Francisco and the Barnett Shale). Let's say one wanted to use FootNet across the entire US or across the entire globe.
For these larger spatial scales, I imagine there are more variable and diverse transport patterns in different regions, and one would want FootNet to capture all of them. By contrast, San Francisco and the Barnett Shale, by virtue of their limited geographic size, might have a more limited set of transport patterns to capture. These different circumstances might necessitate very different approaches to the training data, and the resulting emulated footprints might not have the same fidelity or accuracy.
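To make the mixing-ratio concern concrete: a footprint H maps a surface flux field E to a concentration enhancement at the receptor, Δc = Σᵢⱼ Hᵢⱼ Eᵢⱼ, so footprint errors propagate directly into modeled enhancements. A minimal NumPy sketch (the grid, noise level, and flux magnitudes are hypothetical, not taken from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical "true" (LPDM) footprint and a noisy emulated version;
# units are illustrative (e.g., ppm per unit flux per grid cell).
H_true = rng.random((100, 100)) * 1e-3
H_emul = H_true * (1 + 0.1 * rng.standard_normal((100, 100)))

# A single strong point source, standing in for a CH4 super-emitter.
E = np.zeros((100, 100))
E[40, 60] = 50.0

# Receptor enhancement is the inner product of footprint and flux.
dc_true = np.sum(H_true * E)
dc_emul = np.sum(H_emul * E)
rel_err = abs(dc_emul - dc_true) / dc_true
```

With a concentrated source, the enhancement error is set by the footprint error at that single cell, which is why point-source cases are a stringent test of an emulator even when bulk correlation statistics look good.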
Specific suggestions:

- It would be helpful to include line numbers in future versions of the manuscript. Doing so would make it easier to discuss specific lines of the manuscript.
- Abstract: What does "near-real-time" mean in this context?
- Intro: "The sensitivity of each receptor to its upwind sources, termed as the receptor’s “footprint”, can then be used to estimate fluxes inversely (e.g., Turner et al., 2020)." There are a bunch of other good references that could be used as examples here going back to the early 2000s.
- Intro: "The footprints are integrated 72 hours backwards from the measurement time." A lot of continental-scale studies using STILT use footprints that are integrated 10 days backward from the measurement site. Do you think the approach developed here is applicable to those longer time scales?
- Sect. 2: "Each convolutional block includes two convolutional layers with 3 × 3 convolutional kernels and one 2 × 2 max-pooling layer." What is a convolutional block, convolutional kernel, and max-pooling layer? I suspect that most readers won't be familiar with these terms.
- Sect. 3 "The overall correlation between FootNet predictions and STILT simulations..." Does this line refer to r or r^2?
- Pg. 8, line 1: This paragraph feels like it could use a better topic sentence. You've just finished describing the training process for a footprint calculation in the Barnett Shale. What topic or concept are you going to describe next? I think the answer to this question would better guide the reader and give the reader a better idea of what to anticipate.
- Fig. 3F: I was a little confused about Fig. 3F. Are the individual footprints summed before being plotted in this figure? I.e., does this figure compare the log sum of each predicted footprint against the log sum of each true footprint? Alternately, is each individual model grid box from each footprint a different point on this plot? I imagine that the former plot would show less noise and a higher correlation coefficient whereas the latter plot would show more noise and a lower correlation coefficient. I would recommend clarifying how this figure is constructed.
Citation: https://doi.org/10.5194/egusphere-2024-1526-RC2
AC3: 'Authors' response to referee comments on egusphere-2024-1526', Tai-Long He, 25 Sep 2024
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2024/egusphere-2024-1526/egusphere-2024-1526-AC3-supplement.pdf