This work is distributed under the Creative Commons Attribution 4.0 License.
Simulating out-of-sample atmospheric transport to enable flux inversions
Abstract. Accurately estimating greenhouse gas (GHG) emissions from atmospheric observations requires resolving the upwind influence of measurements via atmospheric transport models. However, the computational demands of full-physics models limit the scalability of flux inversions, particularly for dense in situ and satellite-based observations. Here, we present FootNet v3, a deep-learning emulator of atmospheric transport based on a U-Net++ architecture, which improves generalization and inversion fidelity over prior U-Net-based models. FootNet v3 is trained on 500,000 pseudo-observations across the contiguous United States. It predicts surface and column-averaged source-receptor relationships at kilometer-scale resolution and operates 650x faster than traditional Lagrangian models. Critically, FootNet learns the underlying physical relationship between meteorology and atmospheric transport. We show that it accurately predicts source-receptor relationships when driven by GFS meteorology, despite being trained on HRRR data. FootNet generalizes to unseen regions and meteorological regimes, enabling accurate flux inversions in domains withheld during training. Case studies using GHG measurements in the San Francisco Bay Area and Barnett Shale show that FootNet matches or exceeds the performance of full-physics models when evaluated against independent GHG observations. This is achieved despite FootNet having never seen meteorological inputs from Northern California or North Texas. Feature importance testing identifies physically meaningful drivers that are consistent across both surface and column models. These findings show that machine learning models can learn the physics governing atmospheric transport, allowing them to extrapolate to out-of-sample scenarios and support real-time, high-resolution GHG flux estimation in novel domains without the need for retraining or precomputed footprint libraries.
Status: open (until 10 Sep 2025)
RC1: 'Comment on egusphere-2025-3441', Anonymous Referee #1, 15 Aug 2025
General Comments
The paper addresses the emulation of footprints generated by atmospheric Lagrangian Particle Dispersion Models (LPDMs), which is an important problem in the field of trace gas inverse modelling, where the computational demands are increasing due to growing dataset sizes. In this paper, the authors demonstrate the performance of a new architecture for their Footnet algorithm (based on U-Net++). The model is evaluated in applications of inverse modelling of CO2 and CH4 based on in situ and column data. These evaluations are conducted in regions where the model has been trained as well as in “out-of-sample” regions. Therefore, the main novelty of the work lies in the model’s ability to generalize to different meteorological conditions and geographic locations.
Overall, I think the paper tackles an important subject that is within scope for ACP. It is generally well written and structured. However, before publication, I think some elements of the work need to be explored more thoroughly, as there is a danger that the claimed generalizability is over-stated, particularly since the main claims are around the performance of the model in “out-of-sample” regions. In particular:
- I think it is misleading to claim that this data-driven model out-performs a physics-based LPDM. The authors base this claim on an apparent improvement in the fit to the mole fraction data when comparing FootNet or STILT to observations, attributing the improvement mainly to a tendency for the machine learning (ML) model to “smooth” the footprints (L150-L156). The way that the model has been trained (i.e., penalising any deviations from STILT footprints; a minimal sketch of such a loss is given after this list) means that perfect performance of the algorithm would be the exact retrieval of STILT footprints. If the ML model does not fit the STILT output perfectly (which, of course, it can never do), but better fits the independent observations, this improvement must be coincidental. Put another way, any difference in performance relative to STILT must be considered a degradation in emulator performance (even if it yields a better fit to some other observational dataset). What the authors have found here is potentially interesting, and it would imply that we should smooth LPDM outputs to improve the fit to the data. But, to me, this points to a separate systematic model error or representation issue, rather than some benefit that somehow comes from training an emulator.
- To me, it seems that the claim of “out-of-sample” generalization has only been partially demonstrated, since the tests in “unseen regions” were performed in 2020, which is the time period from which the training data were drawn. Therefore, the model has “seen” similar footprints around the same time, within a few hundred km of the left-out region. Meteorological variables have substantial spatial and temporal correlations. To be more out-of-sample and strengthen the claims of the work, it would be beneficial to demonstrate the model performance in these regions for a different year (e.g., 2022). It could also be tested in another country, but I accept that this would be more challenging to set up.
- The authors claim that the model has also been demonstrated against out-of-sample meteorology, but I think this is probably over-stating what has been achieved. The two products (GFS and HRRR) are based on assimilated meteorological observations, and, whilst I don’t have direct experience of these products, I’m assuming that they must be extremely similar to each other, especially in the lower troposphere. Therefore, can the GFS dataset really be considered “out-of-sample”? It will surely be almost fully correlated with HRRR.
- The ML model performance has been demonstrated over relatively small scales (~400km x 400km). At these scales, I’m assuming the inversions are strongly influenced by potential errors in boundary conditions. I don’t think this is a problem with the approach per se, but it does seem like a limitation that should be mentioned, since it may limit the extension of the model to some regions.
- The claim that the model has “learned the physics” governing atmospheric transport also seems like it could be misconstrued (L14, L64 and elsewhere). I suspect that some readers will assume this is an application of physics-informed machine learning, where the model has been informed by some underlying physical equations. In this case, it’s a purely data-driven approach, so perhaps it’s better described as having learned relationships between meteorology and LPDM footprints.
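To make the training-objective point above concrete, a minimal sketch of the kind of pixel-wise loss implied by “penalising any deviations from STILT footprints” is given below (PyTorch-style; the exact FootNet objective and any additional terms are an assumption here, not taken from the manuscript). Its optimum is exact reproduction of the STILT footprint, so any systematic departure, including smoothing, is penalised during training.

```python
import torch

def emulation_loss(predicted_footprint: torch.Tensor,
                   stilt_footprint: torch.Tensor) -> torch.Tensor:
    """Pixel-wise penalty against the STILT target footprint.

    Illustrative only: the published FootNet loss may differ (e.g.,
    include extra terms such as a mass-conservation penalty); the
    point is that the minimum is exact reproduction of STILT output.
    """
    return torch.mean((predicted_footprint - stilt_footprint) ** 2)
```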
Other general comments:
- No details are provided on the testing/validation set or metrics, which is critical for the evaluation of this kind of paper. This certainly needs addressing, with justification provided for the test/train split and choice of metrics. L140 and Figure 5 seem to imply that the testing set is from the same year and location(s) as the training set. Given the above-mentioned strong spatial and temporal correlations in the atmosphere, this seems to be a critical limitation. It seems imperative that the testing set is at least temporally distinct (i.e., separated in time by more than synoptic timescales) from the training set (and, if the authors want to demonstrate spatially out-of-sample performance, spatially distinct too).
Specific comments:
- Throughout, the authors use “observations” to describe the training footprints (L5 and elsewhere). I found this confusing, as, to me, this would imply mole fraction data. Why not say that, e.g., “500,000 training footprints”?
- Are the results for one random seed only or is it a mean over several seeds? If it is a mean over several seeds, could some measure of error or standard deviation be shown on Figure 3?
- In Figure 1, there are different circles for the arrows and the different convolutions; what do the colors represent?
- Line 23: There is another preprint under consideration for GMD that covers very similar themes using a different modelling approach, Fillola et al. (2025): https://egusphere.copernicus.org/preprints/2025/egusphere-2025-2392/. It seems that these two papers should cite each other in their revised forms.
- Line 133, is there any hypothesis on why there was no clear benefit to including a mass penalty?
- Line 148 – 149: As mentioned above, I can’t see how the logic here holds up. Unless you’re bringing in some additional information, no matter what the biases in the model you’re emulating, the best you can do is emulate that model, biases and all.
- Line 150: What do you propose is the reason for the smoothness?
- Line 164, to make these claims, you’ll need to explain what is substantively different about these two meteorological products and demonstrate that the meteorological variables are substantially different (see general comment above)
- For Figure 6, I understand that the results are similar for different types of meteorology, but I do not understand the axes, since they seem to be displaying aggregated 2D quantities (footprints). Are these all of the gridded footprint values from all locations aggregated together and compared? How is the r value computed in this case? Can you clarify this?
- Figure 7 and 8, can you clarify what the percentage is relative to and what quantile the contour corresponds to?
- Line 271: Presumably a standard analytical Gaussian inversion? Provide a few extra details or a reference. How were model and measurement uncertainties (and emulator uncertainties?) represented?
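For reference, the “standard analytical Gaussian inversion” alluded to in the previous comment would typically take the following form (generic notation; whether the authors use exactly this formulation, and how emulator error enters the covariances, is what the comment asks them to specify):

```latex
% Analytical posterior for a linear Gaussian Bayesian inversion (generic notation)
\hat{x} = x_a + \left( H^{\mathsf{T}} S_o^{-1} H + S_a^{-1} \right)^{-1}
                 H^{\mathsf{T}} S_o^{-1} \left( y - H x_a \right),
\qquad
\hat{S} = \left( H^{\mathsf{T}} S_o^{-1} H + S_a^{-1} \right)^{-1}
```

Here x_a and S_a are the prior fluxes and prior covariance, y the observations, S_o the model-measurement covariance (where emulator error would presumably be folded in), and H the footprint/Jacobian.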
Citation: https://doi.org/10.5194/egusphere-2025-3441-RC1
RC2: 'Comment on egusphere-2025-3441', Anonymous Referee #2, 01 Sep 2025
Review: Simulating out-of-sample atmospheric transport to enable flux inversions by Dadheech and Turner
This is a well-written manuscript of a nice study, and very well-organized. I especially appreciate that the authors extended the comparison analysis through performing flux inversions in order to understand the impact of using the FootNet footprints when deriving fluxes. I have only skimmed the previous two papers by this same group on this topic, but it seems that in this one they have extended the model to work on column (satellite) GHG data, and that they have evaluated/validated the results for out-of-sample data, which is probably the main contribution here. The editors can determine if perhaps the manuscript is better suited for GMD, as it is very much a model development & validation study. I recommend publication after addressing the comments below.
Overall comments:
In regional or city-scale inversions, particle trajectories from ATD models like STILT are also often used to sample and/or optimize a background in some way. Can this be done with Footnet?
In the out-of-sample FootNet simulations (i.e., when the footprint was generated using a model that did not use the Barnett region for training), were the simulations also from a different year or month than what was used in training? I.e., I am wondering how well FootNet performs for data (receptors) from a completely different year than the training. I would think that transport is correlated across very large spatial scales, so training on data from the same time period could allow the model to perform better even if the receptor is hundreds of km away from the training-data receptors. Now reading L195, perhaps this was already done in previous work; either way, that could be mentioned.
Optional question for the authors to perhaps comment on in the Conclusion or future work: How can FootNet be used for larger continental-scale inversions? Will v4 simulate longer time periods (at coarser scales) to perform inversions for CONUS?
Lastly, it seems the code is available. Would the authors recommend that others use the already-trained FootNet code to generate footprints for their own use? What would be the caveats about the use of this model (where would it perform well or not)?
Abstract:
I would add “dispersion” to “atmospheric transport and dispersion models” here in the abstract at least, as dispersion is a large part of the modeling that Hysplit/Stilt is doing, along with transporting the tracer, and that is a common term in the literature (ATD models).
L47-48: adjust grammar in this sentence appropriately – what does “which” refer to (L48)? Perhaps omit “the” prior to “flux inversion”?
L75+ Perhaps the authors could clarify what they mean by “pseudo-observation” - is this a simulated GHG concentration, or is it a footprint (i.e., gridded and varying in space and time)? Reading on at L86, it seems the observations were simulated with footprints, but I would think the observations are footprints themselves, right? The output, after all, is a footprint.
L88 - why was HRRR re-gridded to 1 km? Using STILT does not require that, even if the STILT grid is at 1 km. Does FootNet perform better when this is done, in which case I would guess it depends on how the interpolation is done?
L94- How far back were the particles traced in the training footprints? From this time step list, it seems it would only be 24 hours. How does this affect the simulated column footprints, especially the influence of the upper altitudes where the particles may not have any footprint influence in the first 24 hours at times?
L169 - Is there a reference for how much uncertainty there is in the STILT footprints? Thinking about the comparison between FootNet and STILT in the context of the overall uncertainty in STILT may be a useful framing and could make this point more quantitatively. The papers looking at differences between different models probably only compare in certain places and times, making extrapolation or generalization difficult, but one could cite some here as a comparison - is the uncertainty 10%, 20%, 50%? (e.g., Karion et al., 2019, https://doi.org/10.5194/acp-19-2561-2019 is in the Barnett, so could be useful for making this point?)
Fig 5 - regarding the smoother footprints generated by FootNet: Were the STILT footprints run with the optional far-field smoothing (Gaussian Kernel method) that is provided with the University of Utah STILT footprinting codebase? If so, how was it set?
SI Fig S2 and Fig S3, perhaps note in the legend that GP refers to Gaussian Plume or define in the table in Fig 1 next to “Gaussian Plume (GP)”, for example, for consistency.
Appendix A: It would be useful to include a similarly short description of the details of the SF inversion so the reader does not need to refer to the previous papers, if possible.
Fig 6, units should be included - presumably these are summed over space and time for each receptor (so each data point in the color scale is a full footprint?).
Fig 6A does indicate a high bias in the FootNet footprints - can the authors comment on this?
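One plausible reading of Figure 6, which the Fig 6 comments above are asking the authors to confirm, is that each footprint is collapsed to a single total-sensitivity value per receptor (summed over space and time) before quantities are compared across receptors. A sketch of that interpretation (an assumption, not the authors’ code):

```python
import numpy as np
from scipy.stats import pearsonr

def summed_sensitivity(footprints: np.ndarray) -> np.ndarray:
    """Collapse footprints of shape (n_receptors, n_times, ny, nx)
    into one total-sensitivity value per receptor."""
    return footprints.sum(axis=(1, 2, 3))

# Hypothetical comparison of FootNet vs. STILT footprints per receptor:
# r_value, _ = pearsonr(summed_sensitivity(footnet_fp),
#                       summed_sensitivity(stilt_fp))
```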
Figs 7 & 8: Another figure or panel should indicate the difference between the posterior fluxes for each case (relative to the STILT or XSTILT case, plus comparing the in-sample vs. out-of-sample FootNet, in either absolute units or percent). As is, the second column really looks identical. Especially in Fig 7, the footprints look quite different between panel A vs. D and G, so it would be useful to see the magnitude of the flux difference.
L215: given the importance of the Gaussian plume proxy, can the authors include the basics of how this was calculated (equation?) – perhaps in the SI or appendix. For example, how was the stability class determined for each case? It is indeed interesting that this is the most important input, showing that giving the model some basic understanding of the relationship between the inputs helps it perform better, rather than giving the model only the inputs to the GP proxy, for example. Perhaps this points to the model not actually “learning” relationships, since the GP gives it the basic expected relationship between wind, PBL, etc.
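For context, the textbook Gaussian plume relationship that such a proxy is presumably built on is given below (generic notation; the exact formulation and the stability-class scheme used to generate the FootNet input are precisely what the comment above asks the authors to document):

```latex
% Textbook Gaussian plume concentration downwind of a point source of strength Q
C(x, y, z) = \frac{Q}{2 \pi u \, \sigma_y(x) \, \sigma_z(x)}
\exp\!\left( -\frac{y^2}{2 \sigma_y^2(x)} \right)
\left[ \exp\!\left( -\frac{(z - h)^2}{2 \sigma_z^2(x)} \right)
     + \exp\!\left( -\frac{(z + h)^2}{2 \sigma_z^2(x)} \right) \right]
```

where u is the mean wind speed, h the effective release height, and \sigma_y, \sigma_z the crosswind and vertical dispersion parameters, which depend on downwind distance and stability class.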
Citation: https://doi.org/10.5194/egusphere-2025-3441-RC2
RC3: 'Comment on egusphere-2025-3441', Anonymous Referee #3, 04 Sep 2025
General Comments
This paper introduces FootNet v3. FootNet is a deep learning emulator of atmospheric transport that computes the sensitivity of passive atmospheric trace gas concentrations to upwind emissions (the “footprint”). The footprint is a key component of inverse analysis for emissions quantification, but is a computational bottleneck that limits the feasibility of the analysis. FootNet promises to provide footprints with 650x less wall clock time per footprint vs Lagrangian Particle Dispersion models or Eulerian transport models, while maintaining or even improving model fidelity. FootNet v3 specifically promises to provide footprints for locations and times outside the training data, allowing the model to be deployed to new column and point concentration observations without re-training. This would represent a major step forward in the field of inverse analysis for emissions quantification by greatly decreasing cost and expanding access to more researchers.
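For readers less familiar with the terminology: under the standard linear forward-model assumption, the footprint is the Jacobian relating gridded upwind fluxes to the concentration enhancement at a receptor (generic notation, not taken from the manuscript):

```latex
% Linear forward model; H is the footprint (source-receptor sensitivity)
\Delta y(r, t) = \sum_{i, j, \tau} H(r, t; i, j, \tau) \, F(i, j, \tau) + b(r, t),
\qquad
H(r, t; i, j, \tau) = \frac{\partial y(r, t)}{\partial F(i, j, \tau)}
```

with \Delta y the modeled enhancement at receptor r and time t, F the surface flux field, and b the background contribution.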
FootNet v3 improves upon previous versions of FootNet by:
- Using the improved U-Net++ architecture in place of the U-Net architecture of previous FootNet versions.
- Increasing the quantity of training footprints to 500,000 and increasing the breadth to cover the entire Continental United States across seasons and meteorological conditions. These training footprints were computed using XSTILT driven by HRRR meteorology.
The paper makes the following key claim about FootNet v3 that must be justified:
FootNet v3 can produce footprints at locations and times outside the training data, towards generalization to new observations.
To justify this claim, the authors train and run FootNet v3 in an “out-of-sample” configuration, where training data from large regions in California and Texas are withheld and the resulting model is used to compute flux inversions using real observations of 1) surface CO2 measurements from the BEACO2N network in San Francisco, and 2) TROPOMI XCH4 observations in the Barnett Shale of Texas. The authors find that FootNet v3 performs comparably to, and even slightly better than, STILT alone in the out-of-sample inversions, demonstrating that FootNet v3 could indeed be used for regions outside the training sample. These experiments appear to be well done.
Additional claims include that the quantity of samples and sample strategy were sufficient to constrain the model, which was demonstrated by a validation loss experiment, and that the model appropriately conserved mass, which was tuned with a parameter for surface footprints.
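A minimal sketch of what a mass-conservation penalty added to the training loss might look like is shown below (an illustration of the concept only; the weighting and the exact parameterisation the authors tuned are not specified here, and the names are hypothetical):

```python
import torch

def mass_penalty(pred_footprint: torch.Tensor,
                 stilt_footprint: torch.Tensor) -> torch.Tensor:
    """Penalize differences in total integrated footprint weight
    between the emulated and target (STILT) footprints."""
    return torch.abs(pred_footprint.sum() - stilt_footprint.sum())

def total_loss(pred: torch.Tensor, target: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # Pixel-wise emulation term plus a weighted mass-conservation term;
    # lam is a hypothetical tuning weight.
    return torch.mean((pred - target) ** 2) + lam * mass_penalty(pred, target)
```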
This paper is well written and technical comments about the writing are minor. I support publication of this paper with minor revisions.
One question I do have: The authors demonstrate here and in their previous papers that the added diffusivity of FootNet enhances some properties of the predicted concentrations, but how does it affect the distributions of retrieved emitters? Does it induce more diffuse posterior emission patterns? What are the implications for modeling emissions from the “fat tail” distribution of methane emissions from point sources? This is one of the key questions that could be answered by high-resolution satellite data, which would benefit most from having such a computationally efficient transport model.
Specific Comments
How much influence in the out-of-sample footprints lies in the in-sample domain? If this is significant it could taint the results. A stronger experiment would be to fully separate the out of sample domain in time and space, though I suspect the spatiotemporal domain was chosen to align with previous work to conserve limited resources, which is understandable.
Line 116 “but also suggest that moderately sized, region-specific training efforts may be sufficient to fine-tune for local applications” It is not obvious to me that this point follows from the data provided.
Line 147: “While STILT’s spatial structure is more physically realistic (as it directly solves the governing equations), it is not necessarily more accurate due to potential biases in the driving meteorological fields. Highly localized but biased footprints could introduce artifacts in GHG flux inversions. In this context, the smoother prediction from FootNet may actually be preferable.” STILT has a feature that can artificially increase the dispersion and induces similar smoothness. See Lin and Gerbig (2005), Accounting for the effect of transport errors on tracer inversions, Geophys. Res. Lett., 32, L01802, doi:10.1029/2004GL021127. Also, correlated errors along the duration of a STILT footprint are implemented in Jones et al. (2021), Assessing urban methane emissions using column-observing portable Fourier transform infrared (FTIR) spectrometers and a novel Bayesian inversion framework, Atmos. Chem. Phys., 21, 13131–13147, https://doi.org/10.5194/acp-21-13131-2021.
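For reference, the kind of far-field smoothing mentioned here (and asked about in RC2’s Figure 5 comment) can be illustrated schematically with a Gaussian-kernel filter applied to a footprint grid (this is only a sketch of the idea, not the STILT implementation, which scales the smoothing with particle travel time):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_footprint(footprint: np.ndarray, sigma_cells: float) -> np.ndarray:
    """Apply a Gaussian kernel (width in grid cells) to a 2D footprint.

    Schematic only: STILT's optional far-field smoothing varies the
    kernel width with distance/time from the receptor rather than
    using a single fixed sigma.
    """
    return gaussian_filter(footprint, sigma=sigma_cells)
```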
Line 167 “This means that FootNet could likely be used in domains outside of where it was trained.” I thought that this was tested directly by excluding footprints in this domain from the training set. I think that the conclusions that can be drawn from the HRRR vs GFS comparison are analogous to those that would be drawn from alternate met products used in traditional observation system simulation experiments.
Technical Corrections
In He et al., 2025, the FootNet version is named FootNet v1.0. In Dadheech et al 2025 the version number is omitted. Should this not be FootNet v2.0? What is the minor version number signifying in He et al., 2025?
In this paper, Eulerian transport models and Lagrangian Particle Dispersion models are referred to as “full-physics”. These models rely heavily on parameterizations to simulate dispersion, which is a fundamental property of the output, and therefore full-physics is not an appropriate term.
The description of the domains in the caption of Figure 2 is confusing and, I think, contains an error: it says that the gray dots indicate individual pseudo-observations and that Domains A and B are withheld from the out-of-sample training configuration, but there is a wide margin of missing gray dots around Domains A and B. Should the full set of pseudo-observations cover all of CONUS (as is given in the supplement), with the out-of-sample set being the dots drawn? This would better align with the description in the text. Also, this caption refers generically to FootNet when it should refer to FootNet v3.
Line 96 “Table in Figure 1” -> “The ‘Inputs’ table in Figure 1”?
Figure 4 caption: “the surface model” should be replaced with “Surface FootNet” so that the figure stands alone.
Line 169: “For instance, STILT and FLEXPART can produce larger disagreement than observed here when run under similar conditions. This flexibility allows FootNet to support flux inversions using multiple sources of meteorology” citation required.
Figure 6: What is the property being evaluated? Sum of footprint weight?
Citation: https://doi.org/10.5194/egusphere-2025-3441-RC3
Data sets
Simulating out-of-sample atmospheric transport to enable flux inversions [Dataset] Nikhil Dadheech and Alexander J. Turner https://doi.org/10.5281/zenodo.16011454
Model code and software
Simulating out-of-sample atmospheric transport to enable flux inversions - Code Nikhil Dadheech and Alexander J. Turner https://doi.org/10.5281/zenodo.16010441