High-resolution remote sensing and machine-learning-based upscaling of methane fluxes: a case study in the Western Canadian tundra
Abstract. Arctic methane (CH4) budgets are uncertain because field measurements often capture only fragments of the wet-to-dry gradient that controls tundra CH4 fluxes. Wet hotspots are over-represented, while dry, net-sink sites are under-sampled. We paired over 13,000 chamber flux measurements during peak growing season in July (2019–2024) from Trail Valley Creek in the western Canadian Arctic with co-registered remotely sensed predictor variables to test how spatial resolution (1 m vs. 10 m) and choice of machine-learning algorithm shape upscaled CH4 flux maps over our 3.1 km2 study domain. Four algorithms for CH4 flux scaling (Random Forest (RF), Gradient Boosting Machine (GBM), Generalised Additive Model (GAM), and Support Vector Regression (SVR)) were tuned using the same stack of multispectral indices, terrain derivatives and a six-class landscape classification. The tree-based models, RF and GBM, offered the best balance of 10-fold cross-validated R² (≤0.75) and error metrics, so they were used in a subsequent step for upscaling to the study area. At 1 m resolution, GBM captured the full range of microtopographic extremes and predicted a mean July flux of 99 mg CH4 m-2 month-1. In contrast, RF, which smoothed local extremes, yielded an average flux of 519 mg CH4 m-2 month-1. The disagreement between the GBM and RF flux estimates correlated mainly with the Normalized Difference Water Index (NDWI), a moisture proxy, and was most pronounced in waterlogged, low-lying areas. Aggregating predictors to 10 m averaged out the sharp metre-scale flux highs in hollows and lows on ridges, narrowing the GBM-RF difference to ~75 mg CH4 m-2 month-1 while broadening the overall flux distribution with more intermediate values. At 1 m, microtopography is the main driver. At 10 m, moisture proxies explained about half of the variance. Our results demonstrate that: (i) sub-metre predictors are indispensable for capturing the wet-dry microtopography and its CH4 signals, (ii) upscaling algorithm selection strongly controls prediction spread and uncertainty once that microrelief is resolved, and (iii) coarser grids smooth local microtopographic details, resulting in flattened CH4 flux peaks and a wider distribution. All factors combined lead to potentially large differences in scaled CH4 flux budgets, calling for a careful selection of scaling approaches, spatial predictor layers (e.g., vegetation, moisture, topography), and grid resolution. Future work should couple ultra-high-resolution imagery with temporally dynamic indices to reduce upscaling bias along Arctic wetness gradients.
The manuscript upscales methane fluxes in the Canadian tundra using remote sensing predictors at two different spatial resolutions. Overall, the conducted research is sound and solid, and the topic is definitely worthy of investigation. However, there are some issues, in particular methodological ones, that need to be revised. My more detailed comments are as follows:
- The selection and parameterization of the machine learning and regression models used need to be described better. First, it would have been worthwhile to also test other machine-learning methods, such as extreme gradient boosting, which has performed well in many recent model comparisons. Second, the parameterization of the different models needs to be elaborated. Gradient boosting and support vector regression are both very sensitive to parameter settings, but there is no description at all of whether different parameter combinations were tested; a minimal tuning sketch is given after these two major comments. Additionally, for support vector regression, it should be stated which kernel was used. For generalized additive models, it should be described what kind of smoothing functions were used and whether unimportant variables were penalized during model building. Furthermore, there should be no multicollinear predictor variables in generalized additive models; was the cross-correlation between predictors checked? Random forest is less sensitive to parameterization, but its performance can be boosted with variable selection. If variable selection is conducted, the variable importance results of the model are also more robust.
- The measured CH4 flux data should be described better. In remote sensing-based upscaling, there should be spatially representative data for the whole study area, and it is currently unclear whether this is the case. Looking at Figure 1, the sampling seems very concentrated in a few locations. It is rightfully acknowledged in the limitations section that the sampling could have been better; however, the sampling should be described more fully in the methods section. How many measurement points were there in total? Do the points represent the total spatial heterogeneity in the study area? How many measurements were taken at each point? How are the points divided among the different landscape classes? How were the point locations chosen; was the sampling purposive? Were there boardwalks, or how were the measurements conducted at the plots? If there were boardwalks, do they interfere with the remote-sensing signals over the plot locations? Were the RS-based observations of the plots taken from a single pixel or from a larger neighborhood? Are the different measurements and plots independent, and does potential spatial and temporal autocorrelation affect the model results?
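To illustrate the kind of tuning description I would expect for the first major comment, here is a minimal sketch in scikit-learn. The grids, search settings and variable names (X, y stand in for the predictor stack and chamber fluxes) are hypothetical, not a claim about what the authors did:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV, KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))            # stand-in for the predictor stack
    y = X[:, 0] * 2 + rng.normal(size=200)   # stand-in for chamber CH4 fluxes

    cv = KFold(n_splits=10, shuffle=True, random_state=42)

    # Illustrative GBM grid; these hyperparameters strongly affect results.
    gbm_grid = {"n_estimators": [100, 500, 1000],
                "learning_rate": [0.01, 0.05, 0.1],
                "max_depth": [2, 3, 5]}
    gbm = GridSearchCV(GradientBoostingRegressor(random_state=42), gbm_grid,
                       cv=cv, scoring="neg_root_mean_squared_error").fit(X, y)

    # Illustrative SVR grid; the kernel choice itself should be reported.
    svr_grid = {"kernel": ["rbf"],
                "C": [0.1, 1, 10, 100],
                "epsilon": [0.01, 0.1, 1]}
    svr = GridSearchCV(SVR(), svr_grid,
                       cv=cv, scoring="neg_root_mean_squared_error").fit(X, y)

    print(gbm.best_params_, svr.best_params_)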
Landscape classification: How were the classes derived for the landscape classification; through visual interpretation and fieldwork experience at the site? Please describe in the main text what the collection platform for the 1 m stack is; drones? How many training and validation data points were there for the classification? How can the training data be the same for both resolutions? Do you mean that the locations and LC classes were the same but the training data were calculated from the respective RS datasets? Why were there no tall-shrub measurements at the 1 m spatial resolution when there were such measurements at the coarser spatial resolution? How were the water pixels masked before the classification?
Sentinel-2 preprocessing: Did you mask clouds, shadows and snow? Did you also use cloudy data when calculating the average mosaic? An earlier study has shown that average/median image calculation can be prone to including clouds/haze and that the 40th percentile could work better (https://doi.org/10.1016/j.jag.2024.103659); a small sketch of a percentile composite follows below. How were the time-specific NDVI and NDWI calculated? Based on one image only? How close in time was the image to the CH4 measurements? What was done about clouds?
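For illustration, a per-pixel percentile composite is a one-line operation once a cloud-masked image stack exists. The sketch below uses a synthetic NumPy stack (time x rows x cols, masked pixels set to NaN) and only shows the operation, not your pipeline:

    import numpy as np

    rng = np.random.default_rng(0)
    stack = rng.normal(size=(12, 100, 100))        # stand-in for a masked band/index time series
    stack[rng.random(stack.shape) < 0.2] = np.nan  # simulate cloud-masked pixels

    composite_p40 = np.nanpercentile(stack, 40, axis=0)  # 40th-percentile composite
    composite_med = np.nanpercentile(stack, 50, axis=0)  # median, for comparison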
Minor/more detailed comments:
l27: microtopography is -> microtopography was
l28: sub-metre: you do not have any sub-metre predictors, as the finest tested spatial resolution was 1 m. You also write about sub-metre resolution in other parts. Please be consistent and write about 1 m spatial resolution (not finer than that).
l70: ultra-high spatial resolution can also be finer than 1 m
l73: do these references use a 1 m spatial resolution, or finer (or coarser) than that?
l74: meter or metre: You seem to use mostly British spelling but not consistently.
l78: noise related to what?
l85 and beyond: is GAM a machine learning method?
Research questions: It would be easiest if the results (and conclusions) sections were organized in the same order as the research questions.
Study site description: is there peat soil in the study area?
There could be a general overview sentence/paragraph of the methods before listing the different datasets. Now, when reading the dataset section, it is a bit unclear what is done with which data. This applies particularly to the landscape classification.
Climatic data: You also seem to use PAR data at 1 km spatial resolution (Table A1). Please add a short description of these data in the main text.
l161: you start describing the Sentinel-2 data before the sentence about the 10 m stack. Please reorganize the paragraph.
l162: Why a neighborhood of 30 for TPI? In general, you should test multiple neighborhood distances; see the sketch below.
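As an example of such a test, TPI can be computed at several window sizes and each layer compared against the flux data. This sketch uses SciPy with a synthetic DTM and purely illustrative radii:

    import numpy as np
    from scipy.ndimage import uniform_filter

    rng = np.random.default_rng(0)
    dtm = rng.normal(size=(200, 200)).cumsum(axis=0)  # stand-in for the 1 m DTM

    tpi_layers = {}
    for radius in (5, 15, 30, 60):                    # candidate radii in pixels
        size = 2 * radius + 1
        # TPI = elevation minus the mean elevation within the moving window
        tpi_layers[radius] = dtm - uniform_filter(dtm, size=size)
    # each TPI layer could then be correlated with the measured fluxes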
l190: Why NDVI and NDWI? Why not other indices such as NDMI? NDVI and NDWI typically have a very high negative correlation (see the band definitions below).
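For reference, and assuming the McFeeters NDWI formulation with Sentinel-2 band numbers (B3 green, B4 red, B8 NIR, B11 SWIR):

    NDVI = (B8 - B4) / (B8 + B4)
    NDWI = (B3 - B8) / (B3 + B8)
    NDMI = (B8 - B11) / (B8 + B11)

Because B8 enters NDVI and NDWI with opposite signs, the two indices are typically strongly negatively correlated, whereas NDMI brings in SWIR information that is more directly related to surface moisture.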
l195: What do you mean by "sensitivity tests"?
l198: What do you mean by isolating "the effects of scale"?
l215: "The overall workflow is summarized in Fig. 3. " The sentence can be deleted.
l216: It would be good to describe already here that the climatic variables were spatially uniform over the study area.
l217: Is there a kind of double counting if some of the variables are first used for the landscape classification and then again in the regression models together with the landscape classification? Is the classification needed as a predictor in the regression analyses?
l239: Did this analysis include temporally variable NDVI and NDWI? How about CALU and subsidence? Please state clearly that the analysis was done for both spatial resolutions.
l251: How was the differencing done? Were the 10 m maps resampled to 1 m spatial resolution first? What does "differences between model families" mean; differences between random forest and gradient boosting? What data were subtracted from what data? One unambiguous way to report this is sketched below.
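For clarity, the differencing could be reported along these lines (synthetic arrays standing in for the predicted flux maps; nearest-neighbour block replication of the 10 m map onto the 1 m grid, then a subtraction with the order stated explicitly):

    import numpy as np

    rng = np.random.default_rng(0)
    pred_1m  = rng.normal(100, 20, size=(300, 300))   # stand-in 1 m prediction
    pred_10m = rng.normal(500, 50, size=(30, 30))     # stand-in 10 m prediction

    pred_10m_on_1m = np.kron(pred_10m, np.ones((10, 10)))  # replicate to the 1 m grid
    diff = pred_1m - pred_10m_on_1m                        # explicitly: 1 m minus 10 m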
l255: What does "tuned" mean here?
l264-267: There is little information value in this paragraph. Consider deleting.
l273: You should test the SAGA wetness index, which spreads high wetness values over larger neighborhoods. It could produce a spatially more coherent result than the traditional TWI (see the note below).
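For context, and assuming the standard formulations: the traditional TWI is computed cell by cell as TWI = ln(a / tan(beta)), where a is the specific catchment area and beta the local slope, so high values stay confined to narrow flow-accumulation lines. The SAGA wetness index replaces a with a modified catchment area that lets high wetness values spread into flat neighbouring cells, which is why it often yields spatially more coherent wetness patterns.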
Table 2: What was N in the correlation analyses? How many temporal observations for NDVI and NDWI?
l296: Should this paragraph be above the previous paragraph?
l298: "This also applies to the correlations reported in Table 2." This sentence could be deleted.
l301: This sentence does not continue anywhere.
l311: Similar flux stratification compared to what?
l315: Delete "the" from the end of the line.
l320: Overlay analysis of what?
Figure 5: This is a good figure. It would be good to conduct a similar analysis between CALU and 1 m landscape classification.
l337: In the methods, you write that you emphasize RMSE in model comparison, yet here you state that you emphasize MAE. In reality, you seem to emphasize R2 a lot (or at least you report it). Please be consistent. Please also state in the methods section how you calculated R2. Is it just the squared correlation?
l342: You seem to report both normalized and unnormalized MAE and RMSE inconsistently. Please be consistent. If you normalized these values, how was the normalization conducted? Both this point and the R2 question above are illustrated in the sketch below.
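To make both metric questions concrete, the quantities that should be defined in the methods could look like this (hypothetical variable names, synthetic data):

    import numpy as np

    rng = np.random.default_rng(0)
    obs  = rng.normal(300, 120, size=200)        # stand-in for observed fluxes
    pred = obs + rng.normal(0, 80, size=200)     # stand-in for CV predictions

    r2_corr = np.corrcoef(obs, pred)[0, 1] ** 2  # squared correlation
    r2_det  = 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    # the two R2 definitions above can differ; say which one is reported

    rmse        = np.sqrt(np.mean((obs - pred) ** 2))
    nrmse_mean  = rmse / obs.mean()                   # normalized by the mean ...
    nrmse_range = rmse / (obs.max() - obs.min())      # ... or by the range; state which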
l355-356: Please put these two paragraphs together.
Table 3: There are no bold values in the table even though the caption claims there are. What is the unit for MAE and RMSE? Can you also include normalized values?
l366: Is there a need to refer to the MAUP? Can you explain this in plainer language? Do you have any evidence that pixel aggregation is the reason for the behavior of the models' explanatory capacity?
l396-397: There seems to be little logic between the first and second part of this sentence.
l423: Isn't this in contrast with the earlier claim that GBM can predict extreme values better?
Figure 7: Can you show the chamber-measured fluxes in this figure? It would also be a good idea to compare the upscaled fluxes with the chamber-measured fluxes in the text.
l459: Please explain in the methods section how you normalized the importance values.
l462: Are the terrain measures correlated? Please give evidence, do not just speculate.
l476: You could cite more of the remote sensing-based WT studies here.
Figure 8: The division into categories seems a bit arbitrary. Why is TWI not under topography? Should you instead have topography, spectral, meteorological and landscape categories here?
Feature importance analysis: it would be good to have the analysis also for subsidence, CALU and temporal NDVI/NDWI.
Limitations: Can you integrate Section 3.5 into the earlier sections? It feels a bit odd to have a section with only a couple of sentences.
l495: Can you quantify and describe the unbalanced sampling more?
l497: "were not included"; as predictors?
Conclusions: Please rewrite the conclusions section. Give a brief overview of the study aims and then answer each of the four research questions one by one.
l508: How could the targeted sensitivity analyses be done?
l514: Acronyms such as RF-1 m are a bit difficult for the conclusions section.
l519: Should it be "coarser resolution models can outperform ultra-high spatial resolution models"?
l520 and generalizability: Are you writing only about spatial resolution here?
Table A1: Should the resolution column be titled "spatial resolution"? The text about slope and aspect could be shorter. The reference to Wilson & Gallant feels a bit odd, as your DTM is not from their work.
Table A2: Please explain the abbreviations in the caption. The spatial resolution for individual bands is 10 m; were they not calculated at the 1 m spatial resolution? Did you use the SWIR and RE bands of Sentinel-2? You could include "spatial" in the resolution column header of this table as well.
l563: Please also report class-specific accuracies. You could also add a confusion matrix; a minimal sketch follows below.
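For example, with scikit-learn the confusion matrix and the class-specific (producer's/user's) accuracies are straightforward to report; the class labels below are purely illustrative:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = ["wet", "dry", "shrub", "wet", "dry", "wet"]    # validation labels (illustrative)
    y_pred = ["wet", "dry", "wet", "wet", "shrub", "wet"]    # classifier output (illustrative)
    labels = ["wet", "dry", "shrub"]

    cm = confusion_matrix(y_true, y_pred, labels=labels)
    producers = np.diag(cm) / cm.sum(axis=1)  # producer's accuracy (recall per class)
    users     = np.diag(cm) / cm.sum(axis=0)  # user's accuracy (precision per class)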
Figure B1: The caption should be below the figure. How was the difference calculated; what minus what? Can you also provide the difference maps? Are the middle and right-hand columns needed? They are a bit difficult to digest, and they are not referred to in the main text.