This work is distributed under the Creative Commons Attribution 4.0 License.
A hybrid optimal estimation and machine learning approach to predict atmospheric composition
Abstract. We present a HYbrid REtrieval Framework (HYREF) that predicts subcolumn carbon monoxide (CO) concentrations from Cross-track Infrared Sounder (CrIS) observations, trained to replicate the TRopospheric Ozone and its Precursors from Earth System Sounding (TROPESS) retrievals based on optimal estimation (OE). Unlike the OE algorithm, which produces retrievals for only a small fraction of available CrIS observations due to expensive but physically accurate radiative transfer, the addition of machine learning (ML) techniques enables full coverage by providing high-resolution predictions for every valid CrIS sample. Importantly, in addition to CO concentrations, TROPESS-HYREF also predicts key retrieval diagnostics, namely column averaging kernels, degrees of freedom, and retrieval errors, that are essential for meaningful comparison with other observations, models, and ingestion into data assimilation.
The new framework achieves excellent performance with correlation coefficients r>0.99 and a bias <0.1% when benchmarked against an independent test set, and reproduces fine-scale spatial patterns in CO fields observed during a major wildfire over North America. A scale analysis reveals substantial variability in CO concentrations below the nominal 0.80° resolution of the TROPESS OE retrieval, which TROPESS-HYREF successfully resolves. Inference is computationally efficient, with daily global predictions completed in minutes on a single compute node. Continuous training with the operational TROPESS OE algorithm ensures that TROPESS-HYREF adapts to changes in the trends and variability of atmospheric composition. This threading of OE-derived physical information and ML-driven efficiency provides a practical pathway to high-resolution atmospheric CO monitoring with robust diagnostics.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4864', Daniel Miller, 22 Jan 2026
- RC2: 'Comment on egusphere-2025-4864', Anonymous Referee #2, 05 Feb 2026
Review of A hybrid optimal estimation and machine learning approach to predict atmospheric composition
Werner et al. present the development of a machine-learning-based retrieval framework (TROPESS-HYREF) that predicts subcolumn CO concentrations from CrIS observations, trained using TROPESS optimal estimation (OE) retrievals. Importantly, in addition to CO concentrations, the system also predicts retrieval diagnostics such as averaging kernels, degrees of freedom (DoF), and retrieval errors, which are critical for model-observation comparison, validation, and data assimilation. The paper is generally well written, and the proposed framework represents a potentially important development for improving the spatial coverage and computational efficiency of OE-based retrieval systems, which are often limited by computational bottlenecks. However, I do have several concerns regarding the model's generalizability and the interpretation of some of the reported results, which I believe should be addressed to strengthen the manuscript.
Major comments:
1. Generalizability and data-splitting strategy
The use of a 98%/1%/1% split for training, validation, and testing raises concerns regarding the generalizability of the ML model. While the absolute number of samples in the validation and test sets is large, the strong spatial and temporal correlations inherent in satellite observations mean that a random split does not guarantee independence. For example, the 10 June 2023 wildfire case is drawn from the 04/2023-01/2025 period used for training. Given that 98% of the data are included in the training process, the test set likely contains many samples that are spatially and temporally adjacent to training samples. Under this split strategy, the reported test-set performance may largely reflect re-prediction of patterns already seen during training rather than true out-of-sample generalization.
A more robust evaluation would involve temporally or regionally independent splits (e.g., holding out entire months, seasons, or geographic regions), or comparison with fully independent third-party observations such as in situ or ground-based measurements. As currently implemented, the 98%/1%/1% split limits the interpretability of the reported test results.
2. Evaluation of predicted diagnostics
A key claimed advantage of the TROPESS-HYREF framework is its ability to predict retrieval diagnostics such as averaging kernels, DoF, and retrieval errors. However, the evaluation presented in the paper focuses largely on CO column concentrations. Additional assessment of the predicted diagnostics would strengthen the paper. For example, how accurate and stable are the ML-predicted averaging kernels relative to OE? Are the predicted errors statistically consistent with OE-derived uncertainties? How suitable are these diagnostics for downstream applications such as data assimilation?
3. Claims regarding performance relative to OE
Some statements suggesting that the ML system may "outperform" the OE retrieval are concerning. Given that the ML model is trained to reproduce OE results, it is unclear how it could outperform the OE retrieval in a physical sense. Clarifying that the ML system improves coverage and computational efficiency, rather than retrieval accuracy relative to OE, would help avoid overinterpretation.
Minor comments:
L96-97: The description of the forward and backward processes may be confusing for general readers, as it does not explicitly mention the backpropagation algorithm. A brief clarification would improve readability.
L101: The time range (04/2023–01/2025) is critical information and should be mentioned in the Data section.
Section 3: Feature preprocessing is not described: were different input features (radiances, latitude/longitude, UTC time, a priori values) normalized or scaled prior to training?
L106: Is it necessary to use the full spectrum, or would a reduced set of CO-sensitive channels suffice?
L188-189: The mesoscale processes associated with the identified spectral break should be discussed more explicitly.
Figure 4: A direct comparison of power spectral densities between ML-predicted CO and interpolated OE CO would be informative and could further clarify the added value of the ML approach.
Citation: https://doi.org/10.5194/egusphere-2025-4864-RC2
Review AMT - egusphere-2025-4864
Title: A hybrid optimal estimation and machine learning approach to predict atmospheric composition
First author: Frank Werner
Summary
This paper describes the development of the HYbrid REtrieval Framework (HYREF) and its data product, which predicts sub-column carbon monoxide (CO) concentrations from Cross-track Infrared Sounder (CrIS) observations. The model was trained on the lower-spatial-resolution optimal estimation (OE) retrievals from TRopospheric Ozone and its Precursors from Earth System Sounding (TROPESS). The resulting machine learning (ML) data product for CrIS combines high spatial resolution with a characterization of degrees of freedom and retrieval errors – which are critical for comparing the dataset to other observations and models, and for use in data assimilation.
Overall Feedback
I think that this paper is great and worthy of publication with only minor revisions. Most of the feedback below consists of recommendations about different analytical techniques that may help improve the paper.
One specific point of feedback is a rather general critique of the value of simple linear regression analysis for statistically comparing similar datasets. In the limit that correlations approach 1, linear regression plots provide limited visual evaluation ability – or, to put it more bluntly, everyone has seen a good-looking regression, and such plots often look less than usefully similar. There is a more robust approach for constructing comparisons like this, known as the "Bland-Altman plot", which also incorporates additional statistical information about the data and better poses the fundamental question: "could variable B statistically replace variable A?" Figure 2 in your paper currently does a reasonable job of displaying the other dimensions that matter for such a regression, given that its panels display retrievals, errors, and degrees of freedom.
I would recommend at least looking at Bland-Altman plots in response to this review and potentially including such analysis in the paper itself. In particular, Bland-Altman offers a better visual framework for data comparison when the data are not linearly distributed or when either of the compared datasets has variable uncertainty. This is particularly true for variables with non-Gaussian variability (e.g., logarithmically distributed variables such as optical thickness) or heteroscedastic uncertainty/variability. It looks to me as though these considerations might matter for the datasets in panels a, b, and c of Figure 2, whereas panel d appears clearly normally distributed at all scales. One further relevant concern is that neural network architectures such as yours are largely tuned toward Gaussian process prediction and can struggle (without adequate consideration) to handle heteroscedastic variability in datasets because of the common isotropic noise assumption (Stirn et al., 2022).
An example demonstrating a situation where Bland-Altman analysis can significantly improve your analytical toolkit can be found in Knobelspiesse et al. (2019). That paper explores an instrument intercomparison for radiometric polarimeters, which exhibit non-Gaussian distributions in observed radiances as well as heteroscedastic variability in the degree of linear polarization (DoLP) uncertainty. The example therein is discussed in Section 3.C and summarized visually in Figures 8 and 9. The links below summarize the methodology and include a Python notebook demonstrating examples.
https://github.com/knobelsp/BlandAltman?tab=readme-ov-file
https://colab.research.google.com/github/knobelsp/BlandAltman/blob/main/BlandAltman.ipynb
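The core of a Bland-Altman comparison is simple to compute: plot the per-pair difference against the per-pair mean, together with the mean difference (bias) and the 95% limits of agreement (bias ± 1.96 standard deviations of the differences). A minimal sketch of those statistics on hypothetical toy data (not from the paper; `bland_altman_stats` is an illustrative helper name, and the noise model is invented purely to mimic heteroscedastic disagreement):

```python
import random
import statistics

def bland_altman_stats(a, b):
    """Bland-Altman agreement statistics for two paired measurement sets.

    Returns the per-pair means (plot x-axis), per-pair differences
    (plot y-axis), the mean difference (bias), and the 95% limits of
    agreement (bias +/- 1.96 * sample SD of the differences).
    """
    means = [(x + y) / 2.0 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, loa

# Toy example: two "retrievals" of the same field, one of which has
# noise that grows with the signal (heteroscedastic disagreement)
rng = random.Random(0)
truth = [rng.uniform(1.0, 3.0) for _ in range(1000)]
retrieval_a = [t + rng.gauss(0.0, 0.01) for t in truth]
retrieval_b = [t + rng.gauss(0.0, 0.01 * t) for t in truth]
means, diffs, bias, loa = bland_altman_stats(retrieval_b, retrieval_a)
```

Scatter-plotting `diffs` against `means` with horizontal lines at `bias` and the two `loa` values then makes any signal-dependent spread in the disagreement immediately visible, which a regression scatter at r near 1 tends to hide.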
Furthermore, as an ML retrieval example of how heteroscedasticity can cause issues in the application of machine learning methods, the cloud microphysics retrievals in Miller et al. (2020) struggle to handle retrievals across the full range of variability of the retrieval datasets. This is because the statistical distributions of radiances and DoLP have rather heteroscedastic dependencies on the geophysical variables being retrieved.
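For reference, the standard remedy discussed in the heteroscedastic-regression literature (e.g., Stirn et al., 2022) is to replace a fixed-variance (isotropic) Gaussian likelihood with one whose variance is predicted per sample. A minimal sketch of that loss term (illustrative only; `gaussian_nll` is a hypothetical helper, not code from the paper under review):

```python
import math

def gaussian_nll(y, mu, log_var):
    """Per-sample Gaussian negative log-likelihood with predicted variance.

    Under an isotropic-noise assumption, log_var is a shared constant and
    minimizing this reduces to mean-squared error; letting the network
    predict log_var per sample allows it to down-weight high-variance
    (heteroscedastic) regimes instead of fitting one global noise level.
    """
    var = math.exp(log_var)
    return 0.5 * (log_var + (y - mu) ** 2 / var + math.log(2.0 * math.pi))

# The same 1.0-unit error costs less when the predicted variance is larger
tight = gaussian_nll(2.0, 1.0, math.log(0.1))  # predicted variance 0.1
loose = gaussian_nll(2.0, 1.0, math.log(1.0))  # predicted variance 1.0
```

The `log_var` term penalizes the trivial escape of predicting huge variance everywhere, so the network must trade off error against claimed uncertainty sample by sample.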
Specific Feedback
Citations
Stirn, A., Wessels, H.-H., Schertzer, M., Pereira, L., Sanjana, N. E., and Knowles, D. A.: Faithful heteroscedastic regression with neural networks, arXiv preprint arXiv:2212.09184, https://doi.org/10.48550/arXiv.2212.09184, 2022.
Knobelspiesse, K., Tan, Q., Bruegge, C., Cairns, B., Chowdhary, J., van Diedenhoven, B., Diner, D., Ferrare, R., van Harten, G., Jovanovic, V., Ottaviani, M., Redemann, J., Seidel, F., and Sinclair, K.: Intercomparison of airborne multi-angle polarimeter observations from the Polarimeter Definition Experiment, Appl. Opt., 58, 650–669, https://doi.org/10.1364/AO.58.000650, 2019.
Miller, D. J., Segal-Rozenhaimer, M., Knobelspiesse, K., Redemann, J., Cairns, B., Alexandrov, M., van Diedenhoven, B., and Wasilewski, A.: Low-level liquid cloud properties during ORACLES retrieved using airborne polarimetric measurements and a neural network algorithm, Atmos. Meas. Tech., 13, 3447–3470, https://doi.org/10.5194/amt-13-3447-2020, 2020.