A non-linear data-driven approach to bias correction of XCO2 for OCO-2 NASA ACOS version 10
Abstract. Measurements of the column-averaged, dry-air mole fraction of CO2 (termed XCO2) from the Orbiting Carbon Observatory-2 (OCO-2) contain systematic errors and regional-scale biases, often induced by forward model error or nonlinearity in the retrieval. Operationally, these biases are corrected for by a multiple linear regression model fit to co-retrieved variables that are highly correlated with XCO2 error. The operational bias correction is fit in tandem with a hand-tuned quality filter which limits error variance and reduces the regime of interaction between state variables and error to one that is largely linear. While the operational correction and filter are successful in reducing biases in retrievals, they do not allow for throughput or correction of data in which biases become nonlinear in predictors or features. In this paper, we demonstrate a clear improvement in the reduction of error variance over the operational method using a robust, data-driven, non-linear method. We further illustrate how the operational quality filter can be relaxed when used in conjunction with a non-linear bias correction, which allows for an increase of sounding throughput by 16 % while maintaining the residual error of the operational correction. The method can readily be applied to future ACOS algorithm updates, OCO-2's companion instrument OCO-3, and to other retrieved atmospheric state variables of interest.
William R. Keely et al.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2023-362', Anonymous Referee #2, 26 Apr 2023
- RC2: 'Comment on egusphere-2023-362', Anonymous Referee #1, 04 May 2023
This paper describes a machine learning bias correction for the XCO2 retrievals from OCO-2. The model is trained using XCO2 data from TCCON, the small-area approximation, and chemical transport models informed by surface observations as the “truth.” The primary benefit of this approach (relative to previous work using linear bias corrections) and thus contribution of this paper is the relaxation of quality filters on the data.
The topic is within the scope of AMT. However, the experimental design was not clear in a few ways that make it difficult to fully evaluate the paper, as described below. I also have minor concerns related to the validation of the product, accessibility for non-OCO-2 expert readers, and the claimed reproducibility of the work. Finally, there are a number of technical errors in the paper that made it difficult to read, such as incorrect Figure/Table numbers in the text, errors in the Tables and Figures, and typos throughout. These comments need to be addressed before publication in AMT.
Could the authors clarify if this understanding is correct? There are two XGBoost models (one for ocean and one for land) trained on data from all three proxy datasets that are the main contribution of this paper and these two models are used for Table 4, Table 5, Figure 3, Figure 4, Table 6, Figure 5, Table 7, Figure 7, Figure 8, and Figure 10. An additional six XGBoost models (one for each of the two surface types and three proxy datasets) are trained for Figures 2 and 9 to understand feature importance, but these models are not applied elsewhere. If this is correct (or incorrect), I believe Section 3.4 on Experiment Design could make this more clear.
The authors seem to go through a lot of trouble to produce the XGBoost models for different proxy types for Figures 2 and 9, but there is little discussion other than Lines 209-213, Lines 399-401, and Lines 405-406 which take the form of “they are different for different proxy types.” I suggest the authors either simplify the feature importance discussion to the XGBoost models used for the rest of the paper (trained on all proxy data and just divided by land/water) or improve the discussion related to information from different proxy datasets.
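For concreteness, my understanding of the main setup (one bias model per operation mode, each trained on the pooled truth-proxy data) can be sketched as follows. The names and the stand-in regressor are illustrative, not the authors' actual code; MeanModel merely takes the place of an XGBoost regressor.

```python
# Sketch of my reading of the experiment design: one correction model
# per operation mode (Land NG, Ocean G), each trained on pooled proxy
# data, dispatched by mode at prediction time.

class MeanModel:
    """Predicts the mean training bias; placeholder for xgboost.XGBRegressor."""
    def __init__(self, biases):
        self.bias = sum(biases) / len(biases) if biases else 0.0

    def predict(self, sounding):
        return self.bias

def fit_models(soundings):
    return {
        mode: MeanModel([s["delta_xco2"] for s in soundings if s["mode"] == mode])
        for mode in ("LandNG", "OceanG")
    }

def correct(sounding, models):
    # Corrected XCO2 = retrieved XCO2 minus the predicted bias for that mode.
    return sounding["xco2"] - models[sounding["mode"]].predict(sounding)

soundings = [
    {"mode": "LandNG", "delta_xco2": 1.0, "xco2": 400.0},
    {"mode": "OceanG", "delta_xco2": -0.5, "xco2": 401.0},
]
models = fit_models(soundings)
assert correct(soundings[0], models) == 399.0
assert correct(soundings[1], models) == 401.5
```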
It would be very helpful to have a table (could be in the main text or an Appendix) defining all of the variables discussed in this paper. Ideally this table would contain all of the variables considered for all of your models (I believe the “subset of 27 co-retrieved state vector variables” stated in Line 202). This table would be useful for a few reasons. (1) While most variables are already defined in Table 3, some of the variables used for QFNew in Figures 7 and 8 are never defined (e.g., max_declocking_3). (2) It would be useful to have a little more information about the variables, such as how co2_grad_del is calculated (Equation 5 from O’Dell2018).
For the machine learning model evaluation, it is my impression only 2018 should be used (the testing dataset). It is not clear in Figures 3/4/7/8/10 what date range is being used, but it should in theory be only 2018 since the model has been trained with the data for other years and the goal is to see how generalizable the model is to data it has never seen. This is concerning for Figure 5, which tries to evaluate the model using (in part) data from 2016 and 2017 that the model was already trained on. This could suggest corrections that are overly optimistic.
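To make the concern concrete, a held-out-year evaluation would partition the data as in the following sketch (field names are illustrative): the model is fit only on pre-2018 soundings, and every reported error statistic is computed on 2018 alone, so that it reflects data the model has never seen.

```python
# Minimal sketch of a held-out-year split: train on 2014-2017, evaluate
# only on 2018. Evaluating on 2016-2017 (as Figure 5 appears to) mixes
# training data into the evaluation and can look overly optimistic.

def split_by_year(soundings, test_year=2018):
    """Partition soundings into train (all other years) and test (test_year)."""
    train = [s for s in soundings if s["year"] != test_year]
    test = [s for s in soundings if s["year"] == test_year]
    return train, test

soundings = [
    {"year": 2016, "delta_xco2": 0.8},
    {"year": 2017, "delta_xco2": -0.4},
    {"year": 2018, "delta_xco2": 0.3},
]
train, test = split_by_year(soundings)
assert all(s["year"] != 2018 for s in train)
assert all(s["year"] == 2018 for s in test)
```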
To go further with validation, have the authors considered leaving one of these proxies (like TCCON) completely out of their bias correction (or bringing in an independent dataset)? It seems to me that at the end of this, you have no independent datasets to evaluate your bias-corrected retrievals with, as they have all been used for training the model.
For Table 1 and the rest of the paper, what version of TCCON data is used here? Presumably GGG2014 given the reference list. Is there a reason to not use GGG2020 at this point? It is my understanding that OCO-2 B10 uses the same prior as GGG2020, so this would be more appropriate. If not, it might be necessary to account for the difference in priors when comparing GGG2014 and OCO-2 data in calculating deltaXCO2 (if this effect is large).
I am skeptical of the authors’ claim (Lines 65, 438) that this method is “reproducible.” There is still some hand-tuning in these methods, including picking which variables to include for the regression task (How do you reconcile the different proxies saying different variables are important in Figure 2? How do you pick which redundant variables to drop based on correlation before doing the analysis in Figure 2?) and how to adjust the filters for QFNew. This is fine, but with no code published alongside the paper, this could be difficult to reproduce.
Is there a plan to incorporate this into future versions of the OCO-2 data? Regardless, will the authors be making available the bias-corrected data produced in this paper?
Line 59: I am not sure if “relative to the operational linear correction” is accurate with respect to the TROPOMI methane retrieval in Schneising et al. (2019).
Line 61: a slightly longer discussion of how this work differentiates from Mauceri2023 could be appropriate.
Line 70: are the averaging kernels taken into account when comparing OCO-2 to TCCON or the model atmospheres?
Figure 1: The “N = 1022636” title for Figure 1a seems to be a typo. I would expect the number of soundings here to be close to the total number of OCO-2 soundings since model grids are continuous in space and time (depending on how often the 1.5 ppm threshold is passed) and thus N for the model proxy should be > N for TCCON, not equal. On a related note, what is the total number of OCO-2 soundings considered to give a sense of the percentage used for each proxy dataset?
Table 1 contains errors for the Tsukuba altitude, Sodankylä altitude, Izaña altitude, Wollongong latitude, Réunion latitude, Lamont latitude, and Karlsruhe latitude (and maybe others).
Table 2: could you specify the resolutions of these models?
Line 128-129: it is my understanding (and you state as such in line 130) that XGBoost is not the average across an ensemble, but rather the sum across the ensemble.
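To illustrate the distinction, here is a toy boosting loop in which one-leaf stumps stand in for XGBoost's regression trees: each round fits the previous round's residuals, and the final prediction is the sum, not the average, of the stage outputs.

```python
# Toy gradient boosting: each weak learner is fit to the residuals of
# the running prediction, and the ensemble's prediction is the SUM of
# the learners' outputs.

def fit_stump(residuals):
    """One-leaf 'tree': predicts the mean residual (simplest weak learner)."""
    return sum(residuals) / len(residuals)

def boost(ys, n_rounds=3, learning_rate=1.0):
    preds = [0.0] * len(ys)
    leaves = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        leaf = fit_stump(residuals)
        leaves.append(leaf)
        preds = [p + learning_rate * leaf for p in preds]
    return leaves  # final prediction = sum(leaves), not their mean

leaves = boost([1.0, 3.0])
prediction = sum(leaves)              # summing across the ensemble
assert abs(prediction - 2.0) < 1e-9   # converges to the target mean
```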
Line 135: how do you search for these hyperparameters? Are these the same for all of the XGBoost models discussed in this paper?
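One common way such hyperparameters are chosen is a random search scored on a held-out validation set; a hedged sketch follows. The parameter names mirror XGBoost's (max_depth, eta, n_estimators), but train_and_score is a placeholder for the paper's actual fit-and-validate pipeline, which the authors should describe.

```python
# Illustrative random hyperparameter search. train_and_score is a
# stand-in: in practice it would fit XGBoost with these parameters on
# the training years and return validation RMSE of the corrected XCO2.
import random

random.seed(0)
space = {
    "max_depth": [4, 6, 8],
    "eta": [0.05, 0.1, 0.3],
    "n_estimators": [200, 500, 1000],
}

def train_and_score(params):
    # Placeholder score with an arbitrary optimum; lower is better.
    return abs(params["eta"] - 0.1) + abs(params["max_depth"] - 6) / 10

candidates = [{k: random.choice(v) for k, v in space.items()} for _ in range(20)]
best = min(candidates, key=train_and_score)
```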
Line 179: Is data from 2014-2018 used for all three proxy datasets?
Line 184-185: In Section 3.3 and elsewhere, it is stated that the models are trained for Ocean G and Land NG (2 models). In these lines, it is suggested that there are three models.
Line 206: Is there a threshold you used for the correlation coefficient?
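For reference, a redundancy screen of the kind the paper seems to describe might look like the sketch below: drop one of each pair of candidate features whose pairwise Pearson correlation exceeds a threshold. The 0.9 threshold and the keep-the-first-seen rule are assumptions, since the paper states neither.

```python
# Sketch of correlation-based feature dropping. Threshold (0.9) and
# tie-breaking (insertion order wins) are assumed, not from the paper.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def drop_redundant(features, threshold=0.9):
    """features: dict of name -> list of values. Returns names kept."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, v)) <= threshold for v in kept.values()):
            kept[name] = values
    return list(kept)

feats = {
    "dp_frac": [1.0, 2.0, 3.0, 4.0],
    "dp_sco2": [1.1, 2.0, 3.1, 4.2],    # nearly collinear with dp_frac
    "h2o_ratio": [0.5, -1.0, 2.0, 0.0],
}
assert drop_redundant(feats) == ["dp_frac", "h2o_ratio"]
```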
Line 219: Is there a reason to specify that the prior pressure is from the strong band? Is the prior pressure different for the weak band?
Line 233: Why is the variable dp_sco2 considered? Lines 220-221 discuss the disadvantages of this kind of pressure difference term in the context of dp_frac.
Figure 2: Why are there a different number of considered features for land and ocean? Were the same 27 variables (Line 202) started with and a different subset dropped for LandNG versus OceanG because of different correlations (Line 206) for the different operation modes?
Line 276: what percentage of retrievals are filtered because of this?
Figures 3 and 4: Is this just 2018 data? Could you add arrows to indicate the direction of the filters (or write the ranges like in Figures 7 and 8)?
Figure 5: It seems to me you cannot properly evaluate the model with the training data (2016-2017) and this plot should only show 2018 data. Additionally, I am interested to know if this is data from all proxy datasets? This would help answer the question of if the remaining differences are due to shortcomings in the bias correction method or in the proxy datasets.
Line 338: Is this trained on QF = 0 + 1 data and all three of the truth proxies? And two different models (one for land and one for ocean)?
Figures 7/8: It is not clear to me what each of the colors represent. The black line is deltaXCO2 for the raw XCO2 retrievals. But between Line 348 and the figure captions, I can’t figure out what the difference between the light and dark green/blue lines is (maybe dark is XGBoost bias-corrected and light is operationally bias-corrected?).
Figures 5/10: titles or different labels on the colormaps might make it more clear which plots are for operationally bias-corrected data and which are for XGBoost bias-corrected data (but this is clear in the caption).
Figure 10: Is this just for 2018? Is it for all proxy datasets?
There are in-text references for Kuze2009, Palmer2019, Crowell2019, Peiro2021, Mendonca2021, Jacobs2020, Osterman2020, Worden2017, Taylor2012, Morino2018a, Morino2018b, and Hase2015 (and maybe others) in the text and tables that are missing in the Reference section.
There are many typos throughout this paper. It needs to be significantly cleaned up before publication. I tried to catch as many as I could here.
Line 12: Obersvatory => Observatory
Line 15: correlate => correlated
Line 35: m). => m.
Line 36: Rogers => Rodgers
Line 42: Missing subscript on XCO2
Line 47: emperically => empirically
Line 53: reduce => reduces
Line 55: filter => filters
Line 55: correction. to which => correction to which
Line 57: missing word before “or too limiting”
Line 59: 2021 => 2022
Line 65: reproduceable => reproducible (or be consistent with Line 438)
Line 66: remove “upcoming missions such as GeoCarb”
Line 68: mode => mole
Line 78: Lever => Level
Line 86: remove comma
Line 91: offering => offers
Table 1 caption: rephrase/missing words in “proxy the TCCON”
Line 104: missing subscript on XCO2
Line 106: XCO2, is => XCO2 is
Line 106: overs => offers
Line 110: Table 1 => Table 2
Line 121: employee => employ
Line 132: into => in
Line 134: we hold out small subset => we hold out a small subset
Line 145: variables not defined (though common notation is used)
Line 172: remove “a”
Line 176: mode; => mode,
Line 185: or features are => or features, are
Line 186: This, allows => This allows
Line 192: data then, => data and then
Line 205: delete “we”
Line 213: Table 1 => Table 3
Line 223: retrivals => retrievals
Line 227 (and Table 3): aod_stratear => aod_strataer
Line 232: In addition to the albedo_slope_sco2, four => In addition to albedo_slope_sco2 and co2_grad_del, three
Table 3 caption: features for in => features for use in
Table 3: dpfrac versus dp_frac is inconsistent between the text (Line 216) and here/other figures.
Section 4.1 is repeated (4.1 Feature Selection, 4.1 Model evaluation for QF = 0).
Line 261: Table 2 => Table 4
Line 275: Sentence beginning with “However,” could be rephrased. The portion inside parentheses is confusing.
Line 280: Table 3 => Table 5
Figures 3/4: inconsistent x-axis labels with the text/Table 3 (e.g., aod_st versus aod_strataer). Arrows designating the direction of the filters (e.g., on the aod_st plot) could be helpful but are not necessary.
Line 304: Table 4 => Table 6
Line 307: double-check percentages. For example, for ocean QF=0, ((0.67^2)-(0.61^2))/(0.67^2) = 17%, not 4% (if I correctly understand your methodology).
Table 6: stddev => standard deviation; caption mentions raw XCO2 data, but this is not present in the table.
Line 342: Table 5 => Table 7
Line 348: Figure 8 and 9 => Figures 7 and 8
Line 354: benefit of quality => benefit of a quality
Table 7: Region/Truth Proxy => “Surface/Mode” (?); through put => throughput
Line 383: Qf => QF
Line 384: Figure 9 => Figure 10; Feaures => Features
Line 394: filter bound h2o_ratio => filter bounds, h2o_ratio
Line 400: noteably => notably
Lines 415-416: Figure 11 => Figure 10
Line 442: remove ; and rephrase
Line 449: ISS not defined in text
Title: data driven => data-driven. Hyphen usage should be reviewed throughout (Lines 12, 13, 19, 32, 52, 58, 87, 89, 103, 157, 164, 273, 336, 355, 421, 438, 447, etc.).