Methane fluxes from arctic & boreal North America: Comparisons between process-based estimates and atmospheric observations
Abstract. Methane (CH4) flux estimates from high-latitude North American wetlands remain highly uncertain in magnitude, seasonality, and spatial distribution. In this study, we evaluate a decade (2007–2017) of CH4 flux estimates by comparing 16 process-based models with atmospheric CH4 observations collected at in situ observation towers across Canada and the US. We compare the Global Carbon Project (GCP) process-based models with a model intercomparison from a decade earlier, the Wetland and Wetland CH4 Intercomparison of Models Project (WETCHIMP). Based on the tower atmospheric CH4 observations, our analysis reveals that the current process-based models have a much smaller inter-model uncertainty and an average flux magnitude across Canada and Alaska that is a factor of 1.5 smaller. Furthermore, the differences in flux magnitudes among GCP models are more likely driven by uncertainties in the amount of soil carbon or the spatial extent of inundation than by temperature relationships, such as Q10 factors. In addition, the GCP models do not agree on the timing and amplitude of the seasonal cycle, and we find that models with a seasonal peak in July and August show the best agreement with atmospheric observations. Models that best fit the atmospheric observations also share a similar spatial distribution; these models concentrate fluxes near Canada's Hudson Bay Lowlands (HBL). Overall, current, state-of-the-art process-based models are much more consistent with atmospheric observations than models from a decade ago, but our analysis shows that there are still numerous opportunities for improvement.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2150', Anonymous Referee #1, 15 Jul 2025
This study presents an update of the WetChimp wetland model intercomparison that was published several years ago. The new intercomparison makes use of model submissions to the Global Carbon Project. The results show a significant reduction in inter-model spread compared with the previous intercomparison, and closer agreement with atmospheric measurements evaluated using STILT over North America. This is regarded as a sign of good progress in developing these models. In my view, explained below, the study should consider another possible explanation. The comparison with WetChimp is only indirect, since its results were not included in the evaluation using atmospheric measurements. Furthermore, the evaluation using atmospheric data concentrates on R2, for a reason that remains unclear. Once these concerns are addressed, I see no reason to withhold publication of a study that could provide a useful new reference.
GENERAL COMMENTS
The risk of model intercomparisons is that they might steer model development in the direction of the “mean model”. It is tempting to interpret a convergence in model results as progress towards uncertainty reduction. This is only true, however, if the models converge to the true state. The evaluation that is presented does not provide evidence that this is the case.
Atmospheric measurements are used to test the quality of wetland emission estimates. But, for a reason that is not clear, they are not used to confirm that the WetChimp submissions are less realistic. The argument that they are rests only on the convergence of results and the size of the emissions. The analysis of the new submissions suggests that models with lower emissions are more accurate, based on the amplitude of concentration increments, but the argument is again rather indirect, as this comparison also did not include the WetChimp emission estimates. I propose to either redo the analysis using the WetChimp fluxes, or – if that is not possible – acknowledge this shortcoming of the method that is used.
The model evaluation method uses R2 as a metric of agreement with the observations. R2 is limited, however, in that it does not penalize a wrong enhancement amplitude. The observed concentration variability is explained mostly by the weather; differences in emissions show up, rather, in the concentration increments, which are not captured by R2. A more logical choice would have been to use RMSE as the evaluation metric. This should either be tried, or an explanation should be given of why it was not done. Note that RMSE is not the same as the metric shown in Figure 4, although that does provide an evaluation based on the size of the mean concentration increment.
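To make this point concrete, here is a minimal synthetic sketch (not the paper's data; R2 is taken as the squared Pearson correlation, as is common in such evaluations) in which a model with the right timing but half the enhancement amplitude still scores a near-perfect R2, while the RMSE flags the amplitude error:

```python
# Minimal synthetic illustration: a modelled CH4 series with correct timing but
# half the "observed" enhancement amplitude.  R2 (squared correlation) stays
# near 1, whereas the RMSE reflects the amplitude error.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(365)
background = 1900.0                                        # ppb, assumed background
obs_enh = 20.0 * np.exp(-((t - 200) / 40.0) ** 2)          # synthetic summer enhancement
obs = background + obs_enh + rng.normal(0.0, 1.0, t.size)  # "observations" with noise

mod = background + 0.5 * obs_enh                           # right timing, half the amplitude

r2 = np.corrcoef(obs, mod)[0, 1] ** 2
rmse = np.sqrt(np.mean((obs - mod) ** 2))
print(f"R2 = {r2:.2f}, RMSE = {rmse:.1f} ppb")             # high R2 despite the factor-of-two bias
```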
Based on the results in Figure 6, it is suggested that simpler diagnostic models perform better than more sophisticated prognostic models. This raises the question, however, of how independent the model results are of the data that are used to evaluate them. Simpler models are more easily tuned to existing measurements than sophisticated mechanistic models. Could that explain why they score better? I was surprised to see that the evaluation is based only on ambient air measurements, without mention of the flux measurements made at several sites in the study domain. They might admittedly provide a less independent means of evaluation, but comparing the performance of the different model categories against them would nevertheless provide useful additional information.
SPECIFIC COMMENTS
Line 75, how about regional models for the study domain? I understand that this model inter-comparison evaluates global models, but results from regional models might nevertheless provide useful information for evaluating them.
Line 94, the purpose of this sentence in relation to the previous is not clear. Is it meant to provide further justification for afternoon measurements? Or is it meant to indicate a limitation that will anyway play a role? Please rephrase to clarify.
Line 99, Don’t the campaigns in Alaska offer a useful opportunity for further validation? If so, why was it not used?
Line 168: From a simple back-of-the-envelope calculation, it seems the 1–1.5 ppb already represents high latitudes, because the global decay due to OH should be faster.
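For reference, the rough calculation with assumed round numbers (a background of about 1850 ppb and a CH4 lifetime against tropospheric OH of roughly 10 years; both values are assumptions for illustration) gives

$$\Delta\mathrm{CH_4} \approx C_0\left(1 - e^{-t/\tau}\right) \approx 1850\ \mathrm{ppb}\times\left(1 - e^{-10\,\mathrm{d}/(10\,\mathrm{yr}\times 365\,\mathrm{d\,yr^{-1}})}\right) \approx 5\ \mathrm{ppb},$$

which is indeed larger than 1–1.5 ppb and so would point to a slower, high-latitude (low-OH) loss rate.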
Figure 1: Does ‘daily’ mean that the footprints shown in this figure represent only the influence of a one-day back trajectory? The text mentions that 10-day back trajectories are used, which raises the question of why mean 1-day footprints are shown here. Is the ‘mean’ evaluated over 2007–2017? If so, this should be mentioned explicitly.
Line 191: “The remaining sites …” You might want to add a reference to Figure 1 where these sites are indicated as red circles.
Line 205: Did you test how reliably the apparent Q10 approximates the true Q10 for the models that use a Q10 formulation (and for which its value is therefore known)?
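For reference, one way such a check could be done, shown as a minimal sketch with synthetic inputs (the regression-based "apparent Q10" used here is an assumed definition, not necessarily the paper's):

```python
# Minimal sketch: back out an "apparent" Q10 by regressing ln(flux) on soil
# temperature, then compare it with the Q10 prescribed when generating the
# fluxes.  All inputs are synthetic placeholders for real model output.
import numpy as np

prescribed_q10 = 3.0
temp_c = np.linspace(0.0, 20.0, 120)                    # hypothetical monthly soil temperatures (deg C)
flux = 5.0 * prescribed_q10 ** (temp_c / 10.0)          # flux obeying an exact Q10 law
flux *= np.exp(np.random.default_rng(1).normal(0.0, 0.1, temp_c.size))  # noise from other drivers

slope, _ = np.polyfit(temp_c, np.log(flux), 1)
apparent_q10 = np.exp(10.0 * slope)
print(f"prescribed Q10 = {prescribed_q10}, apparent Q10 = {apparent_q10:.2f}")
```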
Line 215: How about the seasonality of anthropogenic emissions?
Line 220: But anthropogenic emissions inventories provide estimates for each year, so reasonably accurate IAV estimates exist for the anthropogenic part.
Line 227: Could it be that WetChimp led to a consensus about the mean flux that might explain some degree of convergence?
Line 229: Is this also true for the models that are common to both experiments?
Line 275: How are emissions from fresh water accounted for in the current study?
Figure 2: An explanation about the error bar should be added in the figure caption.
Line 315-317: It is not clear why Q10 would correlate with the average methane emission (which indeed seems not to be the case). Wouldn’t it have been more logical to assess Q10 against R2 or against the seasonal amplitude?
Line 357: It would be useful to add standard deviations to the points in figure 6 corresponding to the averages over climate forcing data and anthropogenic emission inventories.
Line 376: Figure 5 is referred to for a relation between Q10 and flux variations, but this figure relates Q10 to the mean flux rather than its variation.
Line 395-398: This rightly mentions that the explained variance of the PC1 has no relation with the true variance. However, more useful would have been to explain what the comparison of these numbers does mean. Right now, it is unclear why these numbers are even mentioned.
Line 405: ‘so this analysis of spatial distribution’ It is not clear what ‘this analysis’ refers to. The PCR analysis is not weighted to areas with stronger observational coverage, is it?
Figure 8: There is no reference in the text to panel d – f, and an explanation is missing of what the mean standardized flux is and how it was derived.
Line 420: ‘this change in magnitude improves …’ This cannot be concluded because the WetChimp flux estimates were not included in the comparison to observations.
Line 422: ‘most consistent with atmospheric observations’ only concerns the R2, whereas it is not clear that R2 is the best metric to evaluate consistency with atmospheric observations.
Line 432: ‘Overall, we argue …’ It should be made clear that this conclusion only holds for the current analysis of emissions from arctic and boreal North America. Since the models are global, there is still the possibility that other regions turn the overall outcome in the opposite direction.
TECHNICAL CORRECTIONS
Line 180: ‘initially’ instead of ‘preliminary’?
Line 324: "contribute to<o>" (presumably "contribute too"?), and "the primary >the< cause" has one "the" too many.
Citation: https://doi.org/10.5194/egusphere-2025-2150-RC1
RC2: 'Comment on egusphere-2025-2150', Anonymous Referee #2, 27 Jul 2025
General comments:
This paper is an interesting spinoff from the community-wide GCP Global Methane Budget effort, this time focused on the skill of the flux models for arctic and boreal North America. It first compares a recent batch of simulations with an older one, and then compares the former with atmospheric observations using a transport model. While the study is commendable, it often lacks subtlety, as detailed below, and, like many studies of this type, it does not really offer novel insights. It should certainly be published, but after a major revision.
Detailed comments:
l. 5, 419, 433: what is the “inter-model uncertainty”? I suspect there is a loose concept behind it.
l. 12: HBL for “Hudson Bay Lowlands” is defined three times in the text, but actually I would encourage the authors not to abbreviate this region.
l. 41-43: how does the use of “similar modeling protocols etc.” allow us to identify and diagnose uncertainties in models? By construction the uniformization of some of the input data and configuration focuses the analysis on a subset of the uncertainty sources.
l. 49: weird phrasing. Please reformulate.
l. 52-61: “Although … Notwithstanding” Back-and-forth reasoning. Please reformulate more linearly.
l. 61: what does “narrower range of uncertainties” mean?
l. 68-70: Now you need to say more: what did we learn from these studies?
l. 74-77: trivial statement. Please remove.
l. 101: odd argument for leaving the aircraft data out here. What did we learn in these studies which is interesting for the present one? If nothing, then the job would still need to be done.
Section 2: we are missing a subsection on WETCHIMP.
l. 108: could you say more about how inundation is estimated in prognostic models? Is it really prognostic or observation-driven like in the diagnostic models?
l. 109: what is the point of saying that the GCP modeling groups submitted flux estimates to the GCP?
Section 2.3: we are missing some information about the temporal resolution and time range of these data.
l. 145: how do you simulate fluxes with WRF-STILT?
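For reference, a minimal sketch of the usual receptor-oriented calculation in which a STILT-type footprint is convolved with a gridded flux field to obtain the modelled mixing-ratio enhancement at the tower (shapes, units, and variable names are assumptions for illustration, not the paper's setup):

```python
# Minimal sketch: multiply a footprint (sensitivity of the receptor mixing
# ratio to surface fluxes, ppb per umol m-2 s-1) by a matching gridded CH4
# flux field (umol m-2 s-1) and sum, giving the modelled enhancement.
import numpy as np

n_back_hours, n_lat, n_lon = 240, 50, 80          # e.g. 10-day back trajectories on a regional grid
rng = np.random.default_rng(2)

footprint = rng.random((n_back_hours, n_lat, n_lon)) * 1e-3   # placeholder footprint values
flux = rng.random((n_back_hours, n_lat, n_lon)) * 0.05        # placeholder wetland CH4 fluxes

enhancement_ppb = np.sum(footprint * flux)        # modelled enhancement above background
modelled_ch4 = 1900.0 + enhancement_ppb           # assumed background mixing ratio (ppb)
print(f"modelled enhancement: {enhancement_ppb:.1f} ppb")
```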
l. 155: I understood that the study time frame stopped in December 2017.
l. 184-185: fine, but then don’t write that you used a 1.5 threshold a few lines before.
l. 218-222: I understand that the discussion cannot be extensive as you say, but where is it? Please substantiate the statement based on your data, or simply say that you did not study IAV without speculating.
l. 226, 228: the use of “consensus” is ambiguous here because it only refers to converging numbers for whatever cause, not to “scientific consensus” among the modelers. Please use a non-ambiguous term.
Section 3.1. The statistical analysis is too short. You are comparing standard deviations estimated on ensembles made up of only a few members and of varying sizes. Further, some members are most likely correlated: in Table S2 there are three flavors of LPJ, and I guess that most models share some parameterizations. Given the importance of this section for the paper's conclusions, you need to make it much more robust.
l. 231: the definition of the error bars should also appear in the legend of Figure 2.
l. 263: all models could be wrong in the same way, and there would then still be opportunity for improvement. Please rephrase.
l. 278-279: this reasoning (“This improved inter-model agreement implies… more accurate”) is shocking.
Figure 2: whiskers should be defined.
l. 297: this sentence actually stems from the statement of l. 156 about the methane lifetime. Nothing new.
l. 324: one “the” too much
l. 332: “Interestingly”
l. 341: I am not following the logic here.
l. 413: didn’t we know that beforehand?
l. 429: can you explain how the room for improvement can be compared between diagnostic and prognostic models?
Citation: https://doi.org/10.5194/egusphere-2025-2150-RC2