High-resolution remote sensing and machine-learning-based upscaling of methane fluxes: a case study in the Western Canadian tundra
Abstract. Arctic methane (CH4) budgets are uncertain because field measurements often capture only fragments of the wet-to-dry gradient that controls tundra CH4 fluxes. Wet hotspots are over-represented, while dry, net-sink sites are under-sampled. We paired over 13,000 chamber flux measurements during peak growing season in July (2019–2024) from Trail Valley Creek in the western Canadian Arctic with co-registered remotely sensed predictor variables to test how spatial resolution (1 m vs. 10 m) and choice of machine-learning algorithm shape upscaled CH4 flux maps over our 3.1 km2 study domain. Four algorithms for CH4 flux scaling (Random Forest (RF), Gradient Boosting Machine (GBM), Generalised Additive Model (GAM), and Support Vector Regression (SVR)) were tuned using the same stack of multispectral indices, terrain derivatives and a six-class landscape classification. Tree-based models such as RF and GBM offered the best balance of 10-fold cross-validated R² (≤0.75) and errors, so RF and GBM were used in a subsequent step for upscaling to the study area. With 1 m resolution, GBM captured the full range of microtopographic extremes and predicted a mean July flux of 99 mg CH4 m-2 month-1. In contrast, RF, which smoothed local extremes, yielded an average flux of 519 mg CH4 m-2 month-1. The disagreement between flux estimates using GBM and RF correlated mainly with the Normalized Difference Water Index (NDWI), a moisture proxy, and was most pronounced in waterlogged, low-lying areas. Aggregating predictors to 10 m averaged out the sharp metre-scale flux highs in hollows and lows on ridges, narrowing the GBM-RF difference to ~75 mg CH4 m-2 month-1 while broadening the overall flux distribution with more intermediate values. At 1 m, microtopography was the main driver. At 10 m, moisture proxies explained about half of the variance. Our results demonstrate that: (i) sub-metre predictors are indispensable for capturing the wet-dry microtopography and its CH4 signals, (ii) upscaling algorithm selection strongly controls prediction spread and uncertainty once that microrelief is resolved, and (iii) coarser grids smooth local microtopographic details, resulting in flattened CH4 flux peaks and a wider distribution. All factors combined lead to potentially large differences in scaled CH4 flux budgets, calling for a careful selection of scaling approaches, spatial predictor layers (e.g., vegetation, moisture, topography), and grid resolution. Future work should couple ultra-high-resolution imagery with temporally dynamic indices to reduce upscaling bias along Arctic wetness gradients.
Status: open (until 13 Oct 2025)
RC2: 'Comment on egusphere-2025-3968', Anonymous Referee #2, 07 Oct 2025
Overview
This study examines the use of high-resolution (10 m) and ultra-high-resolution (1 m) data to estimate tundra methane fluxes in an area of interest in the western Canadian Arctic. Two key points emerge from this work:
1. The impact of the spatial resolution of the input data and model on the upscaling of local methane emissions.
2. The uncertainty in upscaling associated with the choice of machine learning model.
This topic is highly relevant, especially as local and regional greenhouse gas budgets will certainly increasingly be developed for carbon accounting. It raises important questions about how such upscaling should be conducted and, most importantly, how the associated uncertainties could be quantified. The field measurement effort is also commendable.
While the study is robust and well presented, several key aspects need to be clarified and reorganized to ensure methodological transparency, interpretability, and a thorough discussion of limitations. I hope that the comments provided below will help open up new perspectives for discussion and improvement of the manuscript.
Representativeness of sites:
It is unclear whether the sample is representative of the entire study area. The data appear to be concentrated in a few locations, which is understandable given the logistical challenges of conducting fieldwork across large wetland areas. Nevertheless, this aspect should be described in more detail in the 'Methods' and 'Results' sections; a single sentence in the 'Limitations' section is insufficient.
A more detailed discussion of site representativeness would be beneficial, including the number of sites per land cover type and the number of measurements per site. In Figure 4, n appears to refer to the total number of measurements, but it would also be useful to indicate the number of sites per land cover category there, as well as in Tables B1 and B2 or elsewhere. This would facilitate discussion of this limitation; e.g., the text mentions that the 'wetland, permanent' class includes only one site.
Additionally, are sites weighted differently in the model? For example, automatic chambers likely produce more measurements than manual ones; do these sites then have a greater influence on model training? How do you account for potential site-level overfitting? Did you consider using a leave-one-site-out cross-validation approach to assess the robustness of the model in predicting new areas for which no data were used in training?
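A grouped cross-validation along these lines is easy to set up with off-the-shelf tools; below is a minimal sketch, assuming a table of chamber observations with hypothetical `site_id` and `year` columns and placeholder predictor names (not the manuscript's actual variables). The same construct covers the leave-one-year-out variant suggested under the specific comments further below.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hypothetical table: one row per chamber flux observation.
df = pd.read_csv("chamber_fluxes.csv")           # placeholder file name
X = df[["ndvi", "ndwi", "twi", "tpi"]]           # placeholder predictor columns
y = df["ch4_flux"]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
logo = LeaveOneGroupOut()

# Leave-one-site-out: each fold withholds every observation from one site,
# so the score reflects transfer to locations never seen during training.
site_mae = -cross_val_score(rf, X, y, groups=df["site_id"], cv=logo,
                            scoring="neg_mean_absolute_error")

# Leave-one-year-out: the same idea along the temporal axis
# (e.g. train on 2019-2023, predict 2024).
year_mae = -cross_val_score(rf, X, y, groups=df["year"], cv=logo,
                            scoring="neg_mean_absolute_error")

print(site_mae.mean(), year_mae.mean())
```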
Finally, could the differences between the two models (particularly at 10 m in Figure 6) over specific areas/LC types be explained by a lack of training data in these areas/LC types?
Mismatch between data input and resolution effect:
The comparison between the 1 m and 10 m datasets is particularly interesting, as it reveals the differences between the two approaches with commonly used input data at these resolutions. However, this comparison potentially combines two effects: one related to the resolution itself (averaging over a larger area), and another related to potential differences in the data sources themselves (e.g., different acquisition dates/times, different sensors). Have you attempted to separate these two influences? One way to do this would be to aggregate the 1 m product to 10 m and apply the same workflow (e.g., for land cover, use the dominant vegetation type within each 10 m grid cell and take the mean for the other variables). This could help to isolate the effect of the resolution from that of the different data sources.
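As one hedged sketch of what such an aggregation could look like, assuming the 1 m layers are NumPy arrays whose dimensions are exact multiples of 10 (variable names are illustrative, not taken from the manuscript):

```python
import numpy as np
from scipy import stats

def block_mean(arr_1m, factor=10):
    """Aggregate a continuous 1 m raster to 10 m by block averaging."""
    h, w = arr_1m.shape
    return arr_1m.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def block_mode(classes_1m, factor=10):
    """Aggregate a categorical 1 m raster to 10 m via the dominant class."""
    h, w = classes_1m.shape
    blocks = classes_1m.reshape(h // factor, factor, w // factor, factor)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(h // factor, w // factor, -1)
    return stats.mode(blocks, axis=-1).mode

# e.g. ndwi_10m = block_mean(ndwi_1m)
#      landcover_10m = block_mode(landcover_1m)
```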
Otherwise, it would be useful to discuss this somewhere, and include a comparison of the datasets used as is done for the comparison with CALU (Figure 5), but for the two datasets at different resolutions (as is partly done for land cover in Figure 2, where important differences can be seen).
The mix of spatial and interannual analysis is somewhat confusing.
It is unclear how the spatial and interannual components are distinguished in the study. It is not always obvious whether the analysis is spatial, temporal, or a combination of both. Although the study appears to be mainly spatial, with a single-month focus on July, it also uses some temporally varying predictors (AT, PAR and TDD over six years). Clarifying this in the text and figures (methods and results) would improve readability. The time-varying inputs are difficult to understand from the main text: which variables are dynamic, and at what resolution? (See the comment about the data section below.)
Spatial and temporal accuracy should be discussed separately in the 'Results' section, or more explanations should be provided. For example, spatial correlations (mean flux per site) and temporal correlations (time series at individual sites) could be reported separately in Tables 2 and 3 to disentangle these effects and avoid sites with potentially larger amounts of data dominating the analysis compared to sites with smaller amounts. A panel like 7B could be used to directly compare model predictions with measurements at the sites, providing a clearer assessment of spatial and temporal performance.
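Reporting this decomposition requires only a simple grouping step; a minimal sketch (the `site_id`, `obs` and `pred` column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("predictions_vs_observations.csv")   # placeholder file name

# Spatial skill: correlate the per-site means of observed and predicted fluxes
# (one pair of values per site, so data-rich sites do not dominate).
site_means = df.groupby("site_id")[["obs", "pred"]].mean()
r_spatial = site_means["obs"].corr(site_means["pred"])

# Temporal skill: correlate the time series within each site and report the
# distribution of the resulting per-site correlations.
r_temporal = df.groupby("site_id").apply(lambda g: g["obs"].corr(g["pred"]))

print(f"spatial r = {r_spatial:.2f}, median temporal r = {r_temporal.median():.2f}")
```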
The data section of the Methods section needs to be restructured and expanded.
- Section 2.2.3 (and the Materials and Methods section more generally) should be reorganised, as it is currently difficult for the reader to determine which datasets are used at 1 m, which at 10 m, and which at both resolutions. For instance, the text initially focuses on 1 m data, but then abruptly shifts to Sentinel-2 (presumably 10 m) before describing the 10 m products. References to 30 m window data are confusing and require explanation. The temporal dimension of each variable is unclear too. While some of this information appears in Table A1, Figure 3 and lines 240–247, the description remains fragmented. Providing a summary table that explicitly lists the ten variables used for each resolution, their data source, spatial resolution (1 m, 10 m or constant) and whether they are static or dynamic would certainly help the reader.
- Data processing procedures should also be described in more detail in Section 2.2.3 or in a dedicated section. For instance, how were Sentinel-2 data cleaned or filtered? Were cloud-free conditions explicitly selected for the time-varying Sentinel-2 indices? This is implied by lines 278–281, but stating this explicitly in the 'Remotely Sensed Data' section would improve transparency. Overall, providing a clearer and more detailed description of the data pre-processing and management would strengthen the reproducibility of the study.
- For the chamber data, data management should also be specified. How are chamber data handled spatially? How are fluxes aggregated at 1 m or 10 m resolution; do you take the mean of all chambers within each 1×1 m or 10×10 m pixel? You mention PAR and other variables measured at chamber sites. Are these used here? Providing this information is essential for understanding how point-scale observations are scaled to the model resolutions. Additionally, since chamber flux measurements are known to be highly variable, it would be useful to specify in the methods section whether each flux observation corresponds to a single or repeated measurement.
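For instance, if fluxes are assigned to pixels by averaging all chambers whose coordinates fall within the same grid cell, the bookkeeping could be as simple as the following sketch (column names and the grid origin are hypothetical):

```python
import numpy as np
import pandas as pd

def aggregate_to_pixels(chambers, res=1.0, x0=0.0, y0=0.0):
    """Average chamber fluxes per grid cell of size `res` (1 m or 10 m)."""
    df = chambers.copy()
    df["col"] = np.floor((df["x"] - x0) / res).astype(int)
    df["row"] = np.floor((df["y"] - y0) / res).astype(int)
    return (df.groupby(["row", "col"])
              .agg(ch4_flux=("ch4_flux", "mean"),
                   n_chambers=("chamber_id", "nunique"))
              .reset_index())

# fluxes_1m  = aggregate_to_pixels(chambers, res=1.0)
# fluxes_10m = aggregate_to_pixels(chambers, res=10.0)
```

Stating explicitly whether such an averaging (or a different rule) was applied, and how many chambers contribute to each pixel, would make the point-to-pixel step transparent.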
Specific comments:
- I do not understand Figure 7A. According to the caption, it should show monthly estimates averaged over the entire area, but the large number of points is confusing.
- Have you considered using a “leave-one-site-out” or “leave-one-year-out” cross-validation (e.g., training on the first three years and predicting the last year)? This could enable assessing how well these models can predict pixels/sites or times for which no data were used in the training process, as well as the uncertainties related to each model training, which are not really discussed here.
- The emphasis on the main findings differs between the abstract, the main text and the conclusion. Please clarify. For example, the substantial difference in emission estimates between the RF and GBM models (519 vs. 99 mg CH₄ m⁻²) is emphasised in the abstract, but it is not discussed in the same way in the main text or conclusion. I think this should also be highlighted in the conclusion. Conversely, the potential advantages of the 10 m estimates, which are mentioned in the conclusion (in particular lines 519–520), are not emphasised in the abstract. This important point should probably be included in the abstract to provide a more coherent overall message.
- Lines 234-236, «Root-mean-square error (RMSE) between measured and predicted CH4 fluxes was the primary comparison metric because it penalizes large deviations more strongly than a mean-absolute error (Chai & Draxler 2014)», seem contradictory with lines 336-337, «Due to the relatively low mean CH4 flux across all sites (0.102 mg CH4 m-2 h-1), the emphasis of our model evaluation was placed on absolute errors (MAE) rather than the fraction of explained variance (Table 2)». Please clarify.
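For reference, the two metrics at issue are, with $y_i$ the measured and $\hat{y}_i$ the predicted fluxes over $n$ observations:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```

Because the squared residuals weight large deviations more heavily, RMSE ≥ MAE by construction; stating which of the two actually guided model selection (and why) would resolve the apparent contradiction.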
Suggestions:
There is generally little discussion about the application, relevance, and limitations of this research within a broader context. A discussion of its potential and limitations for broader application would strengthen the manuscript. Below are some suggestions for further discussion.
- Seasonal/monthly variations are not presented in this manuscript, but they should be discussed in the 'Discussion' or 'Limitations' sections.
- The study uses chamber measurements focusing on soil and short-vegetation emissions. How could this approach be generalized to other ecosystems, particularly forested or tree-dominated systems? Would the method differ if measurements were taken at the ecosystem scale, for instance with eddy covariance, which averages emissions but also accounts for vegetation emissions (which can be a very substantial part of the total)? Could comparisons with flux estimates from larger-scale approaches, for example eddy covariance towers or aircraft-based eddy covariance campaigns over the same areas of interest (e.g., Shaw et al., 2022, https://doi.org/10.1029/2021GB007261; Sayres et al., 2017, https://doi.org/10.5194/acp-17-8619-2017), provide a valuable evaluation of the upscaling methods and help assess their robustness across spatial scales?
- Linking these fine-scale results to broader CH₄ budgets, which are usually estimated at coarser resolutions, raises questions about scalability. Why were only 1 m and 10 m resolutions considered? Would other, coarser scales (50 m, 100 m, 1 km) be relevant? The comparisons mentioned in lines 436–440 refer to models run at 0.25–0.5°. Are these results directly comparable? How could your findings be used in larger-scale budgets?
Minor comments:
- Line 175: Could you please clarify how water pixels are defined? Could a wetland pixel be masked?
- Lines 173-175: «At 1 m resolution, the Tall shrubs + trees class was merged with Dwarf shrubs due to the absence of chamber measurements within that class, resulting in five effective classes at 1 m and six at 10 m.» It is unclear why there are six classes at 10 m. Were the point(s) classified as Tall shrubs + trees at 1 m not located in pixels classified in this category at 10 m? Please clarify.
- Lines 78-79, «Moreover, ultra-high resolution can even introduce noise and not necessarily lead to a better representation of environmental conditions (Riihimäki et al., 2021)», seem contradictory with lines 70-75. Please clarify this sentence.
Citation: https://doi.org/10.5194/egusphere-2025-3968-RC2
Data sets
CH4 Flux Dataset and Upscaling Maps for TVC, Canada, 2019–2024 Kseniia Ivanova et al. https://doi.org/10.5281/zenodo.15753253
Model code and software
Modeling and Comparing Methane Flux Upscaling at 1m and 10m Resolution in Trail Valley Creek Kseniia Ivanova et al. https://doi.org/10.5281/zenodo.15399083
Processing of the carbon gas chamber flux, with automatic window detection and manual improvement. Kseniia Ivanova and Mathias Göckede https://doi.org/10.5281/zenodo.16732354
RC1: 'Comment on egusphere-2025-3968', Anonymous Referee #1, 10 Sep 2025
The manuscript upscales methane fluxes in the Canadian tundra with remote sensing predictors at two different spatial resolutions. Overall, the conducted research is sound and solid, and the topic is definitely worthy of investigation. However, there are some issues, in particular methodological ones, that need to be revised. My more detailed comments are as follows:
- The selection and parameterization of the machine learning and regression models used need to be better described. First, it could have been worthwhile to also test other machine learning methods, such as extreme gradient boosting, which has performed well in many recent model comparisons. Second, the parameterization of the different models needs to be elaborated. Gradient boosting and support vector regression are both very sensitive to parameter settings, but there is no description at all of whether different parameter combinations were tested. Additionally, for support vector regression, it should be detailed what kernel was used. For generalized additive models, it should be described what kind of smoother functions were used and whether the unimportant variables were penalized in the model building. Furthermore, there should be no multicollinear predictor variables in generalized additive models. Was the cross-correlation between predictors checked? (A minimal tuning and collinearity sketch is given after this list.) Random forest is less sensitive to parameterization, but the model performance can be boosted with variable selection. If variable selection is conducted, the variable importance results of the model are also more robust.
- The measured CH4 flux data should be described better. In remote sensing-based upscaling, there should be spatially representative data for the whole study area. It is currently unclear whether this is the case. Looking at Figure 1, the sampling seems very concentrated in a few locations. It is rightfully written in the limitations section that the sampling could have been better. However, the sampling should be described more in the methods section. How many measurement points were there in total? Do the points represent the total spatial heterogeneity in the study area? How many measurements were taken at each point? How are the points divided among the different landscape classes? How were the point locations chosen; was the sampling purposeful? Were there boardwalks, or how were the measurements conducted in the plots? If there were boardwalks, do they impede the remote sensing signals over the plot locations? Were the RS-based observations of the plots taken from a single pixel or a larger neighborhood? Are the different measurements and plots independent, and does the potential spatial and temporal autocorrelation affect the model results?
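Picking up the first point above, a minimal sketch of an explicit tuning and collinearity check, assuming scikit-learn implementations and purely illustrative parameter ranges (not the authors' actual settings):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.read_csv("training_table.csv")        # placeholder file name
X = df[["ndvi", "ndwi", "twi", "tpi"]]        # placeholder predictor columns
y = df["ch4_flux"]

# Gradient boosting: the knobs a reader would want to see reported.
gbm_grid = {"n_estimators": [200, 500, 1000],
            "learning_rate": [0.01, 0.05, 0.1],
            "max_depth": [2, 3, 5]}
gbm = GridSearchCV(GradientBoostingRegressor(random_state=0), gbm_grid, cv=10)

# Support vector regression: kernel choice and its parameters matter most.
svr_grid = {"svr__kernel": ["rbf"],
            "svr__C": [1, 10, 100],
            "svr__epsilon": [0.01, 0.1],
            "svr__gamma": ["scale", 0.1]}
svr = GridSearchCV(make_pipeline(StandardScaler(), SVR()), svr_grid, cv=10)

# gbm.fit(X, y); svr.fit(X, y)   # then report gbm.best_params_, svr.best_params_

# Cross-correlation check before building the GAM: flag predictor pairs
# whose absolute Pearson correlation exceeds an arbitrary 0.7 threshold.
corr = X.corr().abs()
collinear = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.7]
print(collinear)
```

Reporting the tested grids, the selected values, the SVR kernel and the GAM smoother types in a table would address this comment directly.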
Landscape classification: How were the classes derived for the landscape classification; visual interpretation and fieldwork experience of the site? Please describe in the main text what the collection platform for the 1 m stack is (drones?). How many training and validation data points were there for the classification? How can the training data be the same for both resolutions? Do you mean that the location and LC class were the same, but the training data were calculated from the respective RS datasets? Why were there no tall shrub measurements at the 1 m spatial resolution, while there were such measurements at the coarser spatial resolution? How were the water pixels masked before the classification?
Sentinel-2 preprocessing: Did you mask clouds, shadows and snow? Did you also use cloudy data for calculating the average mosaic? An earlier study has shown that average/median image calculation can be prone to including clouds/haze and that the 40th percentile could work better (https://doi.org/10.1016/j.jag.2024.103659). How were the time-specific NDVI and NDWI calculated? Based on one image only? How close in time was the image to the CH4 measurements? What was done about clouds?
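A hedged sketch of the suggested percentile compositing, assuming the Sentinel-2 scenes have already been cloud-, shadow- and snow-masked to NaN and stacked into a 3-D array (time, row, col); variable names are illustrative:

```python
import numpy as np

def percentile_composite(stack, q=40):
    """Per-pixel composite over time that ignores masked (NaN) observations."""
    # stack: shape (n_scenes, rows, cols), masked pixels set to NaN beforehand.
    return np.nanpercentile(stack, q, axis=0)

# ndvi_composite = percentile_composite(ndvi_stack, q=40)   # vs. q=50 for a median
```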
Minor/more detailed comments:
l27: microtopography is -> microtopography was
l28: sub-metre: you do not have any sub-metre predictors, as the finest tested spatial resolution was 1 m. You also write about sub-metre resolution in other parts. Please be consistent and write 1 m spatial resolution (not finer than that).
l70: ultra-high spatial resolution can also be finer than 1 m
l73: do these references use 1 m spatial resolution or higher (or lower) than that?
l74: meter or metre: You seem to use mostly British spelling but not consistently.
l78: noise related to what?
l85 and beyond: is GAM a machine learning method?
Research questions: It would be easiest if the results (and conclusions) section were organized in the same order as the research questions.
Study site description: is there peat soil in the study area?
There could be a general overview sentence/paragraph of the methods before listing the different datasets. Now, when reading the dataset section, it is a bit unclear what is done with which data. This applies particularly to the landscape classification.
Climatic data: You seem to use also PAR data at 1 km spatial resolution (Table A1). Please add a short description of these data in the main text.
l161: you start describing Sentinel-2 data before the sentence about 10 m stack. Please reorganize the paragraph.
l162: Why a 30 m neighborhood for TPI? In general, you should test multiple neighborhood distances.
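Testing several neighbourhood sizes is inexpensive; a minimal TPI sketch based on moving-window means (the window sizes are illustrative, in pixels of the 1 m DTM):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def tpi(dtm, window):
    """Topographic position index: elevation minus the mean elevation
    within a square window of `window` pixels."""
    return dtm - uniform_filter(dtm.astype(float), size=window)

# Compare several neighbourhoods instead of a single 30 m window:
# tpi_multi = {w: tpi(dtm_1m, w) for w in (10, 30, 100)}
```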
l190: Why NDVI and NDWI? Why not other indices, such as NDMI? NDVI and NDWI typically have a very high negative correlation.
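For clarity, the three indices referred to here, written with Sentinel-2 reflectances and assuming the McFeeters formulation of NDWI (the Gao NDWI coincides with NDMI); the near mirror-image structure of NDVI and this NDWI is what produces their strong negative correlation:

```latex
\mathrm{NDVI} = \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{red}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{red}}},
\qquad
\mathrm{NDWI} = \frac{\rho_{\mathrm{green}} - \rho_{\mathrm{NIR}}}{\rho_{\mathrm{green}} + \rho_{\mathrm{NIR}}},
\qquad
\mathrm{NDMI} = \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{SWIR}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{SWIR}}}
```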
l195: What do you mean by "sensitivity tests"?
l198: What do you mean by isolating "the effects of scale"?
l215: "The overall workflow is summarized in Fig. 3. " The sentence can be deleted.
l216: It would be good to describe already here that the climatic variables were spatially uniform over the study area.
l217: Is there a kind of double counting if some of the variables are first used for the landscape classification and then again for the regression models together with the landscape classification? Is the classification needed as a predictor in the regression analyses?
l239: Did this analysis include temporally variable NDVI and NDWI? How about CALU and subsidence? Please state clearly that the analysis was done for both spatial resolutions.
l251: How was the differencing done? Were the 10 m maps resampled to 1 m spatial resolution first? What does "differences between model families" mean; differences between random forest and gradient boosting? What data were subtracted from what data?
l255: What does "tuned" mean here?
l264-267: There is little information value in this paragraph. Consider deleting.
l273: You should test SAGA wetness index which spreads high wetness values to larger neighborhoods. It could produce spatially more coherent result than traditional TWI.
Table 2: What was N in the correlation analyses? How many temporal observations for NDVI and NDWI?
l296: Should this paragraph be above the previous paragraph?
l298: "This also applies to the correlations reported in Table 2." This sentence could be deleted.
l301: This sentence does not continue anywhere.
l311: Similar flux stratification compared to what?
l315: Delete "the" from the end of the line.
l320: Overlay analysis of what?
Figure 5: This is a good figure. It would be good to conduct a similar analysis between CALU and 1 m landscape classification.
l337: In the methods, you write that you emphasize RMSE in model comparison and here you state that you emphasize MAE. In reality, you seem to emphasize R2 a lot (or at least you report it). Please be consistent. Please state also how you calculated R2 in the methods section. Is it just squared correlation?
l342: You seem to report inconsistently both normalized and unnormalized MAE and RMSE. Please be consistent. If you normalized these values, how did you conduct the normalization?
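Both this and the previous comment (the R² definition and the error normalisation) can be pinned down explicitly; a short sketch of the common options (the function and its naming are illustrative only):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_metrics(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mae = mean_absolute_error(obs, pred)
    rmse = np.sqrt(mean_squared_error(obs, pred))
    r, _ = pearsonr(obs, pred)
    return {
        # Coefficient of determination (can be negative for poor models) ...
        "R2_determination": r2_score(obs, pred),
        # ... versus squared Pearson correlation (always in [0, 1]); state which is reported.
        "R2_squared_correlation": r ** 2,
        "MAE": mae,
        "RMSE": rmse,
        # Two common normalisations; say which denominator is used.
        "nMAE_by_mean": mae / obs.mean(),
        "nRMSE_by_range": rmse / (obs.max() - obs.min()),
    }
```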
l355-356: Please put these two paragraphs together.
Table 3: There are no bold values in the table despite caption claims it. What is the unit for MAE and RMSE? Can you include also normalized values?
l366: Is there a need to refer to MAUP? Can you explain this in plainer language? Do you have any evidence that pixel aggregation is the reason for the behavior of the model explanatory capacity?
l396-397: There seems to be little logic between the first and second part of this sentence.
l423: Isn't this in contrast with the earlier claim that GBM can predict extreme values better?
Figure 7: Can you include the chamber-measured fluxes in this figure? It would also be a good idea to compare the upscaled fluxes with the chamber-measured fluxes in the text.
l459: Please explain in the methods section how you normalized the importance values.
l462: Are the terrain measures correlating? Please give evidence, do not just speculate.
l476: You could cite more remote sensing-based WT studies here.
Figure 8: The division into the categories seems a bit arbitrary. Why is TWI not under topography? Should you instead have topography, spectral, meteorological and landscape categories here?
Feature importance analysis: it would be good to have the analysis also for subsidence, CALU and temporal NDVI/NDWI.
Limitations: Can you integrate Section 3.5 into the earlier sections? It feels a bit odd to have a section with only a couple of sentences.
l495: Can you quantify and describe the unbalanced sampling more?
l497: "were not included"; as predictors?
Conclusions: Please rewrite the conclusion section. Give a brief overview of the study aims and then answer each of the four research questions one by one.
l508: How could the targeted sensitivity analyses be done?
l514: Acronyms such as RF-1 m are a bit difficult for the conclusions section.
l519: Should it be "coarser resolution models can outperform ultra-high spatial resolution models"?
l520 and generalizability: Are you writing only about spatial resolution here?
Table A1: Should the resolution column be entitled "spatial resolution"? Text about slope and aspect could be shorter. The reference to Wilson & Gallant feels a bit odd as your DTM is not from their work.
Table A2: Please explain the abbreviations in the caption. Spatial resolution for individual bands is 10 m; were they not calculated for 1 m spatial resolution? Did you use SWIR and RE bands for Sentinel-2? You could include "spatial" in the resolution column header for this table also.
l563: Please report also class-specific accuracies. You could also add a confusion matrix.
Figure B1: The caption should be below the figure. How was the difference calculated? What minus what? Can you provide also the difference maps? Are middle and right-hand columns needed? They are a bit difficult to digest and they are not referred to in the main text.