Transfer learning-based hybrid machine learning in single-column model of AFES v4
Abstract. The validity of transfer learning-based hybrid machine learning (ML) in a single-column model (SCM) of the Atmospheric general circulation model For the Earth Simulator (AFES) version 4 is examined. The results of the SCM with and without hybrid ML using transfer learning (i.e., the original and hybrid models) are compared against observational datasets and evaluated for tropical and midlatitude intensive observation periods. The hybrid model produces better results than the original model in all experiments, even when the period of the training data is shifted from the target period. However, seasonality matters more for the midlatitude cases than for the tropical cases, i.e., training data from the same month is necessary even if the year of the training data differs. The ML component of the hybrid model successfully corrects the model’s bias, but the correction for temperature is greater than that for humidity, more so in amplitude than in phase. If the temporal and spatial variability is significant, the ML component fails to correct the biases. Analysis of the bias components reveals that the hybrid model can reduce the mean-state bias, but it cannot reduce the high-frequency components of the biases. The hybrid model slightly improves precipitation in some cases but does not improve the surface heat fluxes that cause biases at low levels, implying that further synchronisation is needed for surface heat fluxes. In conclusion, transfer learning-based hybrid ML can better simulate atmospheric variability by reducing mean-state bias when appropriate training data are used. Because of this advantage, the approach has the potential to improve the prediction skill of numerical models over longer periods with limited training data.
Summary
This manuscript describes the application of a hybrid machine learning (ML) approach to a physics-based single-column model. The hybrid ML approach used is reservoir computing, following a recipe similar to that described in Arcomano et al. (2022) and Arcomano et al. (2023). Reservoir computing is used here to synchronize the temperature and specific humidity at 10 of the 48 sigma levels of the model. The weights of the ML component were trained on reanalysis data, with the time period of the training data chosen relative to the target test period, sometimes shifted within the year and sometimes not. The baseline (non-hybrid) and hybrid models were tested against observations from field campaigns or sites in the tropics or mid-latitudes. These observations cover fairly limited periods of time, as short as 18 days or as long as 90 days. When provided appropriate training data, the hybrid system generally reduces time-mean temperature and humidity errors throughout the column.
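For context, below is a minimal sketch of what I understand the hybrid reservoir step to look like, following the Arcomano et al. (2022) recipe; the reservoir size, the number of synchronized inputs, and the readout details are my own assumptions rather than the authors' implementation.

    import numpy as np

    # Sketch of an Arcomano-style hybrid step (my assumptions, not the
    # authors' code): a fixed random reservoir is driven by the scaled model
    # prognostics (T and q at the synchronized levels), and a trained linear
    # readout supplies the column state handed back to the physics-based SCM.
    rng = np.random.default_rng(0)
    n_res, n_in = 500, 20                            # reservoir nodes; 10 levels x (T, q)
    W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))     # fixed random input weights
    A = rng.normal(size=(n_res, n_res))
    A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # keep spectral radius below 1
    W_out = np.zeros((n_in, n_res))                  # learned offline, e.g. by ridge regression

    def hybrid_step(r, x_model):
        """Advance the reservoir with the scaled model column x_model and
        return (new reservoir state, ML-corrected column for the SCM)."""
        r = np.tanh(A @ r + W_in @ x_model)
        return r, W_out @ r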
The paper is reasonably well written in terms of describing the work that was done, but I am afraid I do not see a meaningful scientific advance. The methodology is quite similar to that of Arcomano et al. (2022), but is applied to a substantially simpler problem: a single-column model rather than a fully spatially resolved GCM. In addition, I am confused about the motivation behind the machine-learning train-test split and the associated sensitivity studies. If training on reanalysis data, what is the value of using so little of it, or of shifting the time of year the training data is drawn from relative to the test period? Since the data are available, why not train across the full annual cycle? It seems artificially limiting to select training data only from months aligned with, or offset from, the month of the test period. Finally, while direct observations certainly have unique value, I think it would also be fair, and perhaps more informative, to compare against reanalysis, which is ultimately what the hybrid model was trained to emulate. If set up appropriately, this would enable testing on a more complete held-out dataset than the limited columns and time periods for which direct observations exist.
Overall this is a useful proof of concept indicating that it could be interesting to see reservoir computing implemented in the Arcomano et al. (2022) style for the spatially resolved AFES model, which is more comprehensive than the SPEEDY model that Arcomano et al. (2022) built their hybrid around. However, while I am open to being convinced otherwise, at this stage, which feels more like a prototype, I think a substantial amount of work may still be needed, both in implementation and in testing, to make a meaningful contribution to the literature.
Specific comments
Lines 34-36: a number of purely data-driven models can now stably emulate present-day climate (e.g., Chapman et al., 2025; Cresswell-Clay et al., 2025; Guan et al., 2025; Watt-Meyer et al., 2025), and have shown promise for seasonal forecasting (Kent et al., 2025).
Line 52: "Of these issues, validity of transfer learning for a hybrid model is important for practical application." I am not sure I follow this statement. Indeed, ML models must be tested on data outside of what they were trained on to demonstrate that they are not overfit to their training data and therefore have value beyond it. But that does not mean transfer learning (in the sense described in this paper) is necessary. It would be perfectly valid to train across the full annual cycle for a number of years and then test on a held-out set of years.
Lines 123-124: is there any concern that synchronizing a sparse set of levels could lead to unphysical patterns passed to the physics-based model? Maybe this is not important in a single column model where the external forcing drives so much of the solution, but it could matter when coupling to a full dynamical model.
Lines 128-130: is the minimum-maximum scaling done independently for each prognostic variable at each vertical level, or is it done across levels? I can imagine, for instance, that specific humidity varies over multiple orders of magnitude across different levels of the atmosphere.
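To make the distinction concrete, the two options would look something like the sketch below; the (time, level) array shape and the synthetic humidity profile are my own assumptions for illustration.

    import numpy as np

    # Synthetic (time, level) specific-humidity field whose magnitude grows
    # toward the surface by several orders of magnitude, as in a real column.
    rng = np.random.default_rng(1)
    q = rng.random((1000, 10)) * np.logspace(-6, -2, 10)   # kg/kg

    # (a) independently per vertical level: every level is mapped to [0, 1]
    q_per_level = (q - q.min(axis=0)) / (q.max(axis=0) - q.min(axis=0))

    # (b) one min/max across all levels: the upper-level values, orders of
    # magnitude smaller than near-surface ones, are compressed toward zero
    # and contribute little signal to the reservoir input
    q_across = (q - q.min()) / (q.max() - q.min())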
Lines 151-153: indeed I imagine the forcing data of the single-column model fairly tightly constrains its solution, so it is not surprising that a random initial perturbation does not lead to a substantial spread in solutions.
Lines 196-205: is this a surprise? It arguably makes sense that mid-latitude locations, which naturally have a more pronounced seasonal cycle than tropical ones, would require training data synced to the time of year for the best results.
Figure 10: these corrective tendencies are quite large. For example, 3 × 10^-3 K/s is over 250 K/day. Is that expected? Returning to my comment on Lines 123-124, the unusual vertical pattern in the corrective tendencies, owing to the sparseness of the synchronized predictions, is a bit disconcerting.
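For reference, the conversion behind the number quoted above (the tendency value itself is read off Figure 10):

    # 3 x 10^-3 K/s expressed as a daily tendency
    print(3e-3 * 86400)   # 259.2 K/day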
References
Arcomano, T., Szunyogh, I., Wikner, A., Pathak, J., Hunt, B. R., & Ott, E. (2022). A Hybrid Approach to Atmospheric Modeling That Combines Machine Learning With a Physics-Based Numerical Model. Journal of Advances in Modeling Earth Systems, 14(3), e2021MS002712. https://doi.org/10.1029/2021MS002712
Arcomano, T., Szunyogh, I., Wikner, A., Hunt, B. R., & Ott, E. (2023). A Hybrid Atmospheric Model Incorporating Machine Learning Can Capture Dynamical Processes Not Captured by Its Physics-Based Component. Geophysical Research Letters, 50(8), e2022GL102649. https://doi.org/10.1029/2022GL102649
Chapman, W. E., Schreck, J. S., Sha, Y., Gagne, D. J., II, Kimpara, D., Zanna, L., Mayer, K. J., & Berner, J. (2025). CAMulator: Fast Emulation of the Community Atmosphere Model (arXiv:2504.06007). arXiv. https://doi.org/10.48550/arXiv.2504.06007
Cresswell-Clay, N., Liu, B., Durran, D. R., Liu, Z., Espinosa, Z. I., Moreno, R. A., & Karlbauer, M. (2025). A Deep Learning Earth System Model for Efficient Simulation of the Observed Climate. AGU Advances, 6(4), e2025AV001706. https://doi.org/10.1029/2025AV001706
Guan, H., Arcomano, T., Chattopadhyay, A., & Maulik, R. (2025). LUCIE: A Lightweight Uncoupled Climate Emulator With Long-Term Stability and Physical Consistency. Journal of Advances in Modeling Earth Systems, 17(11), e2025MS005152. https://doi.org/10.1029/2025MS005152
Kent, C., Scaife, A. A., Dunstone, N. J., Smith, D., Hardiman, S. C., Dunstan, T., & Watt-Meyer, O. (2025). Skilful global seasonal predictions from a machine learning weather model trained on reanalysis data. npj Climate and Atmospheric Science, 8(1), 314. https://doi.org/10.1038/s41612-025-01198-3
Watt-Meyer, O., Henn, B., McGibbon, J., Clark, S. K., Kwa, A., Perkins, W. A., Wu, E., Harris, L., & Bretherton, C. S. (2025). ACE2: Accurately learning subseasonal to decadal atmospheric variability and forced responses. npj Climate and Atmospheric Science, 8(1), 205. https://doi.org/10.1038/s41612-025-01090-0