This work is distributed under the Creative Commons Attribution 4.0 License.
A non-linear data-driven approach to bias correction of XCO2 for OCO-2 NASA ACOS version 10
Abstract. Measurements of the column-averaged, dry-air mole fraction of CO2 (termed XCO2) from the Orbiting Carbon Observatory-2 (OCO-2) contain systematic errors and regional-scale biases, often induced by forward model error or nonlinearity in the retrieval. Operationally, these biases are corrected by a multiple linear regression model fit to co-retrieved variables that are highly correlated with XCO2 error. The operational bias correction is fit in tandem with a hand-tuned quality filter that limits error variance and restricts the regime of interaction between state variables and error to one that is largely linear. While the operational correction and filter are successful in reducing biases in retrievals, they do not allow for throughput or correction of data in which the biases become nonlinear in the predictors or features. In this paper, we demonstrate a clear improvement in the reduction of error variance over the operational method using a robust, data-driven, non-linear method. We further illustrate how the operational quality filter can be relaxed when used in conjunction with a non-linear bias correction, which increases sounding throughput by 16 % while maintaining the residual error of the operational correction. The method can readily be applied to future ACOS algorithm updates, OCO-2's companion instrument OCO-3, and to other retrieved atmospheric state variables of interest.
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2023-362', Anonymous Referee #2, 26 Apr 2023
This paper describes a machine learning bias correction for the XCO2 retrievals from OCO-2. The model is trained using XCO2 data from TCCON, the small-area approximation, and chemical transport models informed by surface observations as the “truth.” The primary benefit of this approach (relative to previous work using linear bias corrections) and thus contribution of this paper is the relaxation of quality filters on the data.
The topic is within the scope of AMT. However, the experimental design was not clear in a few ways that make it difficult to fully evaluate the paper, as described below. I also have minor concerns related to the validation of the product, accessibility for non-OCO-2 expert readers, and the claimed reproducibility of the work. Finally, there are a number of technical errors in the paper that made it difficult to read, such as incorrect Figure/Table numbers in the text, errors in the Tables and Figures, and typos throughout. These comments need to be addressed before publication in AMT.
General Comments
Could the authors clarify if this understanding is correct? There are two XGBoost models (one for ocean and one for land) trained on data from all three proxy datasets that are the main contribution of this paper and these two models are used for Table 4, Table 5, Figure 3, Figure 4, Table 6, Figure 5, Table 7, Figure 7, Figure 8, and Figure 10. An additional six XGBoost models (one for each of the two surface types and three proxy datasets) are trained for Figures 2 and 9 to understand feature importance, but these models are not applied elsewhere. If this is correct (or incorrect), I believe Section 3.4 on Experiment Design could make this more clear.
The authors seem to go through a lot of trouble to produce the XGBoost models for different proxy types for Figures 2 and 9, but there is little discussion other than Lines 209-213, Lines 399-401, and Lines 405-406 which take the form of “they are different for different proxy types.” I suggest the authors either simplify the feature importance discussion to the XGBoost models used for the rest of the paper (trained on all proxy data and just divided by land/water) or improve the discussion related to information from different proxy datasets.
It would be very helpful to have a table (could be in the main text or an Appendix) defining all of the variables discussed in this paper. Ideally this table would contain all of the variables considered for all of your models (I believe the “subset of 27 co-retrieved state vector variables” stated in Line 202). This table would be useful for a few reasons. (1) While most variables are already defined in Table 3, some of the variables used for QFNew in Figures 7 and 8 are never defined (e.g., max_declocking_3). (2) It would be useful to have a little more information about the variables, such as how co2_grad_del is calculated (Equation 5 from O’Dell2018).
For the machine learning model evaluation, it is my impression only 2018 should be used (the testing dataset). It is not clear in Figures 3/4/7/8/10 what date range is being used, but it should in theory be only 2018 since the model has been trained with the data for other years and the goal is to see how generalizable the model is to data it has never seen. This is concerning for Figure 5, which tries to evaluate the model using (in part) data from 2016 and 2017 that the model was already trained on. This could suggest corrections that are overly optimistic.
To go further with validation, have the authors considered leaving one of these proxies (like TCCON) completely out of their bias correction (or bringing in an independent dataset)? It seems to me that at the end of this, you have no independent datasets to evaluate your bias-corrected retrievals with, as they have all been used for training the model.
For Table 1 and the rest of the paper, what version of TCCON data is used here? Presumably GGG2014 given the reference list. Is there a reason to not use GGG2020 at this point? It is my understanding that OCO-2 B10 uses the same prior as GGG2020, so this would be more appropriate. If not, it might be necessary to account for the difference in priors when comparing GGG2014 and OCO-2 data in calculating deltaXCO2 (if this effect is large).
I am skeptical of the authors’ claim (Lines 65, 438) that this method is “reproducible.” There is still some hand-tuning in these methods, including picking which variables to include for the regression task (How do you reconcile the different proxies saying different variables are important in Figure 2? How do you pick which redundant variables to drop based on correlation before doing the analysis in Figure 2?) and how to adjust the filters for QFNew. This is fine, but with no code published alongside the paper, this could be difficult to reproduce.
Is there a plan to incorporate this into future versions of the OCO-2 data? Regardless, will the authors be making available the bias-corrected data produced in this paper?
Specific Comments
Line 59: I am not sure if “relative to the operational linear correction” is accurate with respect to the TROPOMI methane retrieval in Schneising et al. (2019).
Line 61: a slightly longer discussion of how this work differentiates from Mauceri2023 could be appropriate.
Line 70: are the averaging kernels taken into account when comparing OCO-2 to TCCON or the model atmospheres?
Figure 1: The “N = 1022636” title for Figure 1a seems to be a typo. I would expect the number of soundings here to be close to the total number of OCO-2 soundings since model grids are continuous in space and time (depending on how often the 1.5 ppm threshold is passed) and thus N for the model proxy should be > N for TCCON, not equal. On a related note, what is the total number of OCO-2 soundings considered to give a sense of the percentage used for each proxy dataset?
Table 1 contains errors for the Tsukuba altitude, Sodankylä altitude, Izaña altitude, Wollongong latitude, Réunion latitude, Lamont latitude, and Karlsruhe latitude (and maybe others).
Table 2: could you specify the resolutions of these models?
Line 128-129: it is my understanding (and you state as such in line 130) that XGBoost is not the average across an ensemble, but rather the sum across the ensemble.
Line 135: how do you search for these hyperparameters? Are these the same for all of the XGBoost models discussed in this paper?
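For reference, a minimal sketch of one common way such hyperparameters are handled, holding out a small validation subset and using early stopping; the file name, feature names, and parameter values below are placeholders and not taken from the paper:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical table of soundings with co-retrieved features and delta-XCO2 "truth" error.
df = pd.read_csv("soundings_land_ng.csv")  # placeholder file
features = ["dp_frac", "co2_grad_del", "aod_strataer", "h2o_ratio"]  # illustrative subset

dev = df[df["year"].between(2014, 2017)]   # training years
test = df[df["year"] == 2018]              # held-out evaluation year

# Small validation split from the training years, used only for early stopping.
X_tr, X_val, y_tr, y_val = train_test_split(
    dev[features], dev["delta_xco2"], test_size=0.1, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=2000,         # upper bound; early stopping picks the effective number of trees
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,  # stop when validation RMSE stops improving
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

# Bias-corrected XCO2 on the unseen test year.
corrected = test["xco2"] - model.predict(test[features])
```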
Line 179: Is data from 2014-2018 used for all three proxy datasets?
Line 184-185: In Section 3.3 and elsewhere, it is stated that the models are trained for Ocean G and Land NG (2 models). In these lines, it is suggested that there are three models.
Line 206: Is there a threshold you used for the correlation coefficient?
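A minimal sketch of what such a correlation-based pruning step could look like; the 0.9 threshold is an assumed placeholder, not a value from the paper:

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation exceeds `threshold`."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# e.g. prune the ~27 candidate co-retrieved state vector variables before training:
# X_pruned = drop_correlated(X_candidates, threshold=0.9)
```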
Line 219: Is there a reason to specify that the prior pressure is from the strong band? Is the prior pressure different for the weak band?
Line 233: Why is the variable dp_sco2 considered? Lines 220-221 (on dp_frac) discussed the disadvantages of this kind of pressure difference term.
Figure 2: Why are there a different number of considered features for land and ocean? Were the same 27 variables (Line 202) started with and a different subset dropped for LandNG versus OceanG because of different correlations (Line 206) for the different operation modes?
Line 276: what percentage of retrievals are filtered because of this?
Figures 3 and 4: Is this just 2018 data? Could you add arrows to indicate the direction of the filters (or write the ranges like in Figures 7 and 8)?
Figure 5: It seems to me you cannot properly evaluate the model with the training data (2016-2017) and this plot should only show 2018 data. Additionally, I am interested to know if this is data from all proxy datasets? This would help answer the question of if the remaining differences are due to shortcomings in the bias correction method or in the proxy datasets.
Line 338: Is this trained on QF = 0 + 1 data and all three of the truth proxies? And two different models (one for land and one for ocean)?
Figures 7/8: It is not clear to me what each of the colors represents. The black line is deltaXCO2 for the raw XCO2 retrievals. But between Line 348 and the figure captions, I can’t figure out what the difference between the light and dark green/blue lines is (maybe dark is XGBoost bias-corrected and light is operationally bias-corrected?).
Figures 5/10: titles or different labels on the colormaps might make it more clear which plots are for operationally bias-corrected data and which are for XGBoost bias-corrected data (but this is clear in the caption).
Figure 10: Is this just for 2018? Is it for all proxy datasets?
There are in-text references for Kuze2009, Palmer2019, Crowell2019, Peiro2021, Mendonca2021, Jacobs2020, Osterman2020, Worden2017, Taylor2012, Morino2018a, Morino2018b, and Hase2015 (and maybe others) in the text and tables that are missing in the Reference section.
Technical Corrections
There are many typos throughout this paper. It needs to be significantly cleaned up before publication. I tried to catch as many as I could here.
Line 12: Obersvatory => Observatory
Line 15: correlate => correlated
Line 35: m). => m.
Line 36: Rogers => Rodgers
Line 42: Missing subscript on XCO2
Line 47: emperically => empirically
Line 53: reduce => reduces
Line 55: filter => filters
Line 55: correction. to which => correction to which
Line 57: missing word before “or too limiting”
Line 59: 2021 => 2022
Line 65: reproduceable => reproducible (or be consistent with Line 438)
Line 66: remove “upcoming missions such as GeoCarb”
Line 68: mode => mole
Line 78: Lever => Level
Line 86: remove comma
Line 91: offering => offers
Table 1 caption: rephrase/missing words in “proxy the TCCON”
Line 104: missing subscript on XCO2
Line 106: XCO2, is => XCO2 is
Line 106: overs => offers
Line 110: Table 1 => Table 2
Line 121: employee => employ
Line 132: into => in
Line 134: we hold out small subset => we hold out a small subset
Line 145: variables not defined (though common notation is used)
Line 172: remove “a”
Line 176: mode; => mode,
Line 185: or features are => or features, are
Line 186: This, allows => This allows
Line 192: data then, => data and then
Line 205: delete “we”
Line 213: Table 1 => Table 3
Line 223: retrivals => retrievals
Line 227 (and Table 3): aod_stratear => aod_strataer
Line 232: In addition to the albedo_slope_sco2, four => In addition to albedo_slope_sco2 and co2_grad_del, three
Table 3 caption: features for in => features for use in
Table 3: dpfrac versus dp_frac is inconsistent between the text (Line 216) and here/other figures.
Section 4.1 is repeated (4.1 Feature Selection, 4.1 Model evaluation for QF = 0).
Line 261: Table 2 => Table 4
Line 275: Sentence beginning with “However,” could be rephrased. The portion inside parentheses is confusing.
Line 280: Table 3 => Table 5
Figures 3/4: inconsistent x-axis labels with the text/Table 3 (e.g., aod_st versus aod_strataer). Arrows designating the direction of the filters (e.g., on the aod_st plot) could be helpful but are not necessary.
Line 304: Table 4 => Table 6
Line 307: double-check percentages. For example, for ocean QF=0, ((0.67^2)-(0.61^2))/(0.67^2) = 17%, not 4% (if I correctly understand your methodology).
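For clarity, the check above written out, assuming the reported percentage is the fractional reduction in error variance of the XGBoost correction relative to the operational one:

```latex
\frac{\sigma_{\mathrm{op}}^{2} - \sigma_{\mathrm{XGB}}^{2}}{\sigma_{\mathrm{op}}^{2}}
  = \frac{0.67^{2} - 0.61^{2}}{0.67^{2}} \approx 0.17 = 17\,\%
```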
Table 6: stddev => standard deviation; caption mentions raw XCO2 data, but this is not present in the table.
Line 342: Table 5 => Table 7
Line 348: Figure 8 and 9 => Figures 7 and 8
Line 354: benefit of quality => benefit of a quality
Table 7: Region/Truth Proxy => “Surface/Mode” (?); through put => throughput
Line 383: Qf => QF
Line 384: Figure 9 => Figure 10; Feaures => Features
Line 394: filter bound h2o_ratio => filter bounds, h2o_ratio
Line 400: noteably => notably
Lines 415-416: Figure 11 => Figure 10
Line 442: remove ; and rephrase
Line 449: ISS not defined in text
Title: data driven => data-driven. Hyphen usage should be reviewed throughout (Lines 12, 13, 19, 32, 52, 58, 87, 89, 103, 157, 164, 273, 336, 355, 421, 438, 447, etc.).
Citation: https://doi.org/10.5194/egusphere-2023-362-RC1
- AC1: 'Reply on RC1', William Keely, 30 Jul 2023
- AC3: 'Reply on RC1', William Keely, 30 Jul 2023
- RC2: 'Comment on egusphere-2023-362', Anonymous Referee #1, 04 May 2023
This manuscript introduces a non-linear machine learning method for the bias correction of OCO-2 XCO2 data, which outperforms the linear correction used in the operational product. The paper falls into the scope of AMT, but there are several clarifications and revisions necessary. Before publication in AMT the following comments have to be addressed.
General Comments
In my opinion, the underlying machine learning method XGBoost is praised beyond measure (highly interpretable compared to other machine learning algorithms, improved predictive performance compared to Random Forest, highly robust to overfitting, ...). It is not trivial to prove such universal statements, and in the context of this paper it is not even necessary. On the other hand, some of these aspects are not sufficiently dealt with regarding the specific example of bias correction presented here. In this sense, please avoid questionable general statements and elaborate on the available results (the proposed bias correction) instead: How can the actually obtained model be interpreted? What is the most appropriate way to calculate feature importances aiming at maximising interpretability of the specific model in question? What is the ranking of the most important features (for land and ocean data)? Does the model overfit for the chosen parameters?
In order to assess (and exclude) potential overfitting tendencies, it might also be useful to define more challenging training and validation data sets or to use completely independent data for validation. The models are trained on data from 2014-2017 and are then evaluated on data from 2018. However, the validation data set is not entirely independent as the biases are likely similar in the statistical sense from year to year. It would be more instructive to leave out whole regions and/or proxy data sets in the training and only use them for validation. It would be interesting to see whether the figures and tables demonstrating the performance of the correction would change significantly as a result (e.g. Figures 3/4 or Table 6). For example, Figure 5 suggests that the correction does not generalise so well to regions that were rarely considered during training (e.g. tropics, parts of South Asia and Canada, compare to Figure 1).
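One way to implement the stricter validation suggested here is a leave-one-group-out split over the truth proxies (or over regions); a minimal sketch, where X, y, and the proxy label array are placeholders:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import LeaveOneGroupOut

# X, y: features and delta-XCO2 for all soundings (pandas objects, placeholders);
# proxy: numpy array with one label per sounding, e.g. "tccon", "small_area", "model".
logo = LeaveOneGroupOut()
for train_idx, val_idx in logo.split(X, y, groups=proxy):
    m = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
    m.fit(X.iloc[train_idx], y.iloc[train_idx])
    resid = m.predict(X.iloc[val_idx]) - y.iloc[val_idx].to_numpy()
    print(f"held-out proxy: {proxy[val_idx][0]}, RMSE: {np.sqrt(np.mean(resid ** 2)):.3f} ppm")
```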
There are quite a few different XGBoost models, e.g. for testing purposes, in the paper; I am not entirely sure what the proposed bias corrected product is in the end. I suspect it is the data set with the correction (split according to land and ocean) learned on all three proxy datasets for QF=0+1 simultaneously and subsequently restricted to QFNew. Is this correct? Or is it some kind of average of the three models for the different truth proxy data sets? Please make this more clear in the text.
The bias correction (for both land and ocean) includes features measuring the deviation of the retrieved quantities from the prior (surface pressure and vertical CO2 profile). I assume that these priors are very consistent with the XCO2 truth used in the supervised learning. Doesn't that run the risk of essentially pushing the XCO2 results back to the truth/prior in specific cases without noticing in the averaging kernels that hardly any information is gained from the actual measurement? Can you exclude that point sources that are not present or not sufficiently resolved in the truth are artificially attenuated or corrected away in unseen data? Due to these potential pitfalls, the results would be more robust if such parameters (co2_grad_del, dpfrac, dp_sco2) were not used in the bias correction. How important are these features in your machine learning model? Please discuss these issues in the manuscript.
Specific Comments
L25-29: There were other satellites measuring CO2 before GOSAT, e.g. AIRS or SCIAMACHY. GOSAT is considered the first satellite designed specifically for the purpose of measuring atmospheric CO2 from space. Please be more specific here.
L58-60: Do you mean (Noel et al., 2021) or (Noel et al., 2022)? There are two papers, but only (Noel et al., 2022) is in the References. (Noel et al., 202?) and (Schneising et al., 2019) both use non-linear bias correction techniques but there are no explicit comparisons to linear corrections and the term "operational" does not fit here either. Please revise this sentence.
L63: Please be more specific what you mean by "interpretable" since XGBoost is a complex black-box model, which is not intrinsically interpretable. Do you mean post-hoc model interpretation methods? Do you refer to global explainability of the model or to local explainability of individual predictions?
L65: "reproducible" is somewhat misleading because it gives the impression that a universal recipe, that can be transferred without any adaptations, is being presented. However, when using other data sets the parameters of the model have to be re-tuned and the used features have to be adapted. Is there a systematic approach, e.g. to feature selection or setting of model parameters, justifying the designation "reproducible"?
L66: The GeoCarb mission was cancelled, please remove.
L85-92: Please specify which TCCON version you are using (GGG2014 or GGG2020?). It seems to be GGG2014, why didn't you use the most recent version? Does it make any difference in performance when training or validating with one or the other version?
L122-L124: I disagree with the statement that XGBoost is highly interpretable compared to other machine learning algorithms; I would even argue that XGBoost is typically the least interpretable well-established machine learning algorithm of all after deep neural networks and also requires post-hoc approaches to understand and explain the model. For example, Support Vector Machines or Random Forest are typically more interpretable than XGBoost. Please revise (or remove) the sentence accordingly.
L128-129: XGBoost does not average across the ensemble (in contrast to Random Forest). Instead, XGBoost trains each subsequent model in the ensemble to improve upon the errors of the previous models and uses a weighted sum of the individual model predictions to make its final prediction. This is also one of the reasons why XGBoost is usually less interpretable than Random Forest.
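For reference, the additive form being described can be written as below, where each tree f_m is fit to improve on the current ensemble's errors and eta is the learning rate (generic boosting notation, assumed here rather than taken from the paper):

```latex
\hat{y}(x) = \sum_{m=1}^{M} \eta\, f_m(x), \qquad
f_m = \arg\min_{f}\; \sum_{i} L\!\left(y_i,\; \hat{y}^{(m-1)}(x_i) + f(x_i)\right) + \Omega(f)
```

Here Omega(f) is XGBoost's regularization term penalising tree complexity.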
L130: It depends on the specific task, the available resources, and data set whether XGBoost or Random Forest performs better, as they have different strengths and weaknesses. Moreover, it is hard to do a fair comparison anyway, because you have to tune the corresponding parameters independently, e.g. it makes little sense to fix certain tree structures in a comparison. Therefore, please avoid a general statement comparing the predictive performance of different machine learning algorithms, especially when it comes to XGBoost and Random Forest, which often perform quite similarly after respective optimal parameter tuning (see also general comments). Have you tried other machine learning algorithms for the specific task presented here?
L131-133: The arbitrary application of these strategies does not make XGBoost per se highly robust against overfitting to the training data. Please make clearer that these strategies can avoid overfitting if the parameters and the validation data set are chosen appropriately. Please demonstrate explicitly that there is no overfitting for the non-linear bias correction presented here.
L134-135: How exactly were these parameters determined? By monitoring the performance of the model on a validation dataset during the training process and stopping the training when the performance on the validation dataset does not improve for a given number of consecutive iterations? Are the parameters actually always the same (for all testing models and the final bias correction)?
L136-145: Have you tried other ways to get feature importances from XGBoost, e.g. permutation-based importance or importance computed with SHAP values? Is it possible to improve interpretability by a choice tailored to this specific problem?
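As one concrete possibility along these lines, a minimal sketch of permutation importance (and, optionally, SHAP attributions) for an already-fitted XGBoost model; `model`, `X_val`, and `y_val` are placeholders for a fitted regressor and held-out data:

```python
from sklearn.inspection import permutation_importance

# model: a fitted xgboost.XGBRegressor; X_val, y_val: held-out features (DataFrame) and delta-XCO2.
result = permutation_importance(
    model, X_val, y_val,
    scoring="neg_root_mean_squared_error",
    n_repeats=10, random_state=0)

# Rank features by the mean degradation in skill when each one is shuffled.
for name, drop in sorted(zip(X_val.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:25s} {drop:8.4f}")

# Optional SHAP-based, per-sounding attributions (requires the shap package):
# import shap
# explainer = shap.TreeExplainer(model)
# shap_values = explainer.shap_values(X_val)
```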
L207-210 and Figure 2: Please list the used features and the respective importances (split by land and ocean) for the models based on the different proxy data sets and for the final bias correction separately in a table.
L215: Please better explain co2_grad_del.
L219: Are there different prior surface pressures for the weak and the strong band? If so, why?
Table 3: Please remove "large unphysical" in the description of co2_grad_del. Or do you use an additional threshold to rate a deviation from the prior as unphysical?
L260: Overfitting of XGBoost cannot be generally excluded. Please prove that there is indeed no overfitting for this specific task and parameter setup. To this end, a more challenging selection of training and validation datasets could also be considered (see also general comments).
Table 4: Please also list the results for the final proposed bias correction combining all truth proxies.
Figure 3/4, Table 4/5/6: Only validation data (2018 with truth proxy sampling) is displayed here, right? If this is the case, please be more specific in the description.
L303-304: Wouldn't it be better to apply the footprint correction before training the bias correction? Is it possible to include the footprint correction in the bias correction by introducing a suitable parameter as a feature (e.g. row number)?
Figure 5: Please highlight more (in the text and in the caption) what exactly is shown here: it is all data from 2016-2018 including three types of data (correct me if I'm wrong), namely 1) training data (truth proxy data for 2016-2017), 2) validation data (proxy data for 2018), 3) data beyond (data from 2016-2018 used neither for training nor validation). It would be very enlightening to show (and compare) this kind of figure separately for each of the three data types, because this could provide indications of how well the correction generalises to actually independent data (type 3).
Table 6: It would be interesting to add the respective performances for type-3-data (or to introduce more challenging training and validation data sets from the start, see also general comments). Due to the lack of an entirely independent validation dataset, the performance suggested here may be too optimistic.
Figure 7/8: Please explain all occurring variables.
Section 5.1, Figure 9: Please also report the individual information gains and not only the differences in feature importances.
Technical Corrections
There are incorrect Figure and Table numbers used in the main text. Please check.
Citations in the text do not always match the ones in the References section. Please check.
L12: "Obersvatory" → "Observatory"
L15: "correlate" → "correlated"
L35: Please remove the closing bracket.
L36: It is (Rodgers, 2000).
L47: "epmerically". Do you mean "empirically"?
L49: "inturn" → "in turn"
L53: "reduce" → "reduces"
L55: "applying quality filter" → "applying the quality filter"
L55: Please remove the full stop between "correction" and "to" or rephrase both sentences.
Figure 1: The number of soundings N is wrong in panel (a). The colour scale is oversaturated. Please extend the value range (0-2000 does not seem to be optimal).
L106: "overs" → "offers"
L178: "trainined" → "trained"
L186: "This, allows" → "This allows"
L223: "retrivals" → "retrievals"
L252: The section numbering is inconsistent. This number already exists.
This list of technical corrections is (likely) not exhaustive. Please check the text for further typing errors.
Citation: https://doi.org/10.5194/egusphere-2023-362-RC2
- AC2: 'Reply on RC2', William Keely, 30 Jul 2023