the Creative Commons Attribution 4.0 License.
Transfer learning-based hybrid machine learning in single-column model of AFES v4
Abstract. The validity of transfer learning-based hybrid machine learning (ML) in the single-column model (SCM) of the Atmospheric general circulation model For the Earth Simulator (AFES) version 4 is examined. The results of the SCM with and without hybrid ML using transfer learning (i.e., the hybrid and original models) are compared against observational datasets and evaluated for tropical and midlatitude intensive observation periods. The hybrid model produces better results than the original model in all experiments, even when the period of the training data is shifted from the target period. However, seasonality matters more in the midlatitude cases than in the tropical cases: training data from the same month are necessary, even if they come from a different year. The ML component of the hybrid model successfully corrects the model’s bias, but the correction for temperature is greater than that for humidity, especially in amplitude rather than in phase. Where the temporal and spatial variability is significant, the ML component fails to correct the biases. Analysis of the bias components reveals that the hybrid model can reduce the mean-state bias, but it cannot reduce the high-frequency components of the biases. The hybrid model slightly improves precipitation in some cases but does not improve the surface heat fluxes that cause low-level biases, implying that further synchronisation is needed for the surface heat fluxes. In conclusion, transfer learning-based hybrid ML can better simulate atmospheric variability by reducing mean-state bias when appropriate training data are used. Owing to this advantage, the model has the potential to improve the prediction skill of numerical models over longer periods with limited training data.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4612', Anonymous Referee #1, 08 Feb 2026
AC1: 'Reply on RC1', Yuya Baba, 12 Feb 2026
Thank you for reviewing my manuscript. This paper presents the validity of transfer learning-based hybrid ML in a single-column model, considering its application toward practical operational forecasting. Transfer learning is introduced into the hybrid model because observation or reanalysis data are normally unavailable as the initial condition and training data in an operational forecast, owing to their production delay (reanalysis data are usually produced several months later). In such a case, training on time-shifted and small amounts of data is practically useful for conducting a timely operational forecast. The reviewer commented that annual-cycle training may be useful, but as far as I tried in preliminary experiments, such longer-range training is not superior to the present training method. This may derive from an advantage of hybrid ML: the present model already knows the physical laws of the atmosphere and thus does not need such long-range training to satisfy those laws. Indeed, with this advantage, the present transfer learning worked well to reduce the underlying model bias, leading to better prediction. The reviewer commented that this study has no clear scientific advance in the method, but this is a clearly novel (and advantageous) point of the method. The reviewer finally suggested that I should compare the results with the reanalysis instead of the observations. This may be another way of evaluation; however, as noted above, the present method is proposed in order to enhance operational forecasting. Therefore, the fidelity of the prediction compared with the observations, rather than the reanalysis, is more important for this study.
As the reviewer commented, this study is a prototype of a hybrid ML model. In the next study, I am planning to conduct seasonal prediction using transfer learning and the full AFES model, including forecast-skill evaluation. Such development and evaluation may require modifications to the training method and the structure of the hybrid model. The findings of this study provide valuable information on the correct development direction for such future work, since the present model involves core parts of AFES and the results show its fundamental performance.
The following are my replies to each of your comments.
Specific comments
Comment 1:
Lines 34-36: a number of purely data-driven models can do a good job stably emulating present-day climate nowadays (e.g., Chapman et al., 2025; Cresswell-Clay et al., 2025; Guan et al., 2025; Watt-Meyer et al., 2025), and have shown promise for seasonal forecasting (Kent et al., 2025).
Reply to comment 1:
The sentence the reviewer commented on does not deny the usefulness of data-driven methods; it states that data-driven models require training with huge datasets to realize long-range and extreme-event prediction. Indeed, the studies the reviewer cited required a large amount of historical training data before conducting prediction (Watt-Meyer et al., 2025; Kent et al., 2025). In addition, although some AI climate models are able to reproduce the observed climate, they have not always shown superior skill for seasonal prediction compared with advanced dynamical prediction systems (Bouallègue et al., 2023; Charlton-Perez et al., 2024; Olivetti and Messori, 2024). These explanations will be added to the revised manuscript as follows.
“For example, they require huge and comprehensive training datasets for long-range prediction and capturing extreme events (De Burgh-Day and Leeuwenburg, 2023; Watt-Meyer et al., 2025; Kent et al., 2025). In addition, their superiority in seasonal forecast to the dynamical models remains limited depending on the cases (Bouallègue et al., 2023; Charlton-Perez et al., 2024; Olivetti and Messori, 2024), even though some recent models successfully reproduced observed climate (Cresswell-Clay et al., 2025; Guan et al., 2025).”
Comment 2:
Line 52: "Of these issues, validity of transfer learning for a hybrid model is important for practical application." I am not sure I follow this statement. Indeed ML models must be tested on data outside of what they were trained on to demonstrate that they were not overfit to their training data, and therefore have value outside of that. But that does not mean transfer learning (in the sense described in this paper) is necessary. It would be perfectly valid to train across the full annual cycle for a number of years and then test on a held out set of years.
Reply to comment 2:
In the case of operational prediction, training data close to the forecast period are unavailable because of the production delay of the reanalysis data. In such a case, transfer learning is therefore useful for practical application, and if the learning requires only a small amount of data, the training cost is reduced, which is clearly beneficial for practical use. The reviewer suggested that annual-cycle training is necessary and useful for the present hybrid ML, but the present study revealed that transfer learning with short training periods is sufficient to provide better prediction than the original model.
Comment 3:
Lines 123-124: is there any concern that synchronizing a sparse set of levels could lead to unphysical patterns passed to the physics-based model? Maybe this is not important in a single column model where the external forcing drives so much of the solution, but it could matter when coupling to a full dynamical model.
Reply to comment 3:
Preliminary experiments showed that the present sparse synchronization is sufficient for prediction in a single-column model, i.e., dense synchronization showed no clear superiority over the present setting. As the reviewer commented, it may be a problem in a full dynamical model; this cannot be evaluated in the present setting and is beyond the scope of this study, so a note to this effect will be added to the corresponding sentence in the revised manuscript.
Comment 4:
Lines 128-130: is the minimum-maximum scaling done independently for each prognostic variable at each vertical level, or is it done across levels? I can imagine, for instance, that specific humidity varies over multiple orders of magnitude across different levels of the atmosphere.
Reply to comment 4:
Common scaling parameters are used only for the same variable at the same vertical level, i.e., the scaling parameters differ between variables and between levels. This is because, as the reviewer commented, the magnitudes of the prognostic variables differ from each other and also between levels. This description will be added in the revised manuscript.
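The per-variable, per-level scaling described above can be sketched as follows (a minimal illustration of standard min-max scaling, not code from the model; the array shapes and values are my assumptions):

```python
import numpy as np

def minmax_scale(x, lo, hi):
    """Scale x into [0, 1] using per-level extrema for one variable.

    x      : array of shape (time, level) for one prognostic variable
    lo, hi : arrays of shape (level,) holding the training-set minimum
             and maximum at each vertical level
    """
    return (x - lo) / (hi - lo)

# Example: specific humidity spans orders of magnitude across levels,
# so each level gets its own scaling parameters.
q = np.array([[1.5e-2, 1.0e-5],
              [1.0e-2, 2.0e-5]])          # (time=2, level=2)
q_lo = q.min(axis=0)                      # per-level minimum
q_hi = q.max(axis=0)                      # per-level maximum
q_scaled = minmax_scale(q, q_lo, q_hi)    # every level now lies in [0, 1]
```

Scaling each level independently keeps lower-tropospheric humidity from dominating the normalized input, which is the concern the reviewer raises.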
Comment 5:
Lines 151-153: indeed I imagine the forcing data of the single column model fairly tightly constrains its solution, so it is not surprising a random initial perturbation does not lead to a substantial spread in solutions.
Reply to comment 5:
Even with identical forcing, past studies have shown that a single-column model can produce substantial spread in the solution (Hack and Pedretti, 2000; Zhang et al., 2013; Neggers et al., 2017). This sentence will be added in the revised manuscript.
Comment 6:
Lines 196-205: is this a surprise? It maybe makes sense that mid-latitude areas, which naturally have a more pronounced seasonal cycle than tropical areas, could require time-of-year-synced training data to be present for best results.
Reply to comment 6:
It is natural that the ability of the hybrid ML is more sensitive to seasonality in the midlatitudes, but how significant and influential that sensitivity is was unclear. This result suggests that a one-month shift of the training data has a significant impact on the prediction results. This point will be added in the revised manuscript.
Comment 7:
Figure 10: these corrective tendencies are quite large. For example, 3 × 10⁻³ K/s is over 250 K/day. Is that expected? Back to my comment on Lines 123-124, this unusual vertical pattern in the corrective tendencies owing to the sparseness of the synchronized predictions is a bit disconcerting.
Reply to comment 7:
The units of Figure 10 were wrong, as the reviewer pointed out. The unit conversion was therefore recalculated, and the tendencies are shown in K/6h (ranging from -1.6 to 1.6 K/6h) in the revised figure. The sparseness of the corrective tendencies derives from the sparseness of the synchronized points. However, as described above, this sparseness did not cause unphysical solutions and worked well for correcting the model bias, as seen in Figure 8.
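The recalculated range can be sanity-checked with a one-line unit conversion (an illustrative calculation, not code from the paper):

```python
# Convert a corrective temperature tendency from K/s to K per 6 hours.
SECONDS_PER_6H = 6 * 3600  # 21600 s

def per_second_to_per_6h(tendency_k_per_s):
    """Multiply a tendency in K/s by the number of seconds in 6 hours."""
    return tendency_k_per_s * SECONDS_PER_6H

# The reviewer's reading of 3e-3 K/s would imply 64.8 K per 6 hours,
# whereas the recalculated range of +/-1.6 K/6h corresponds to only
# about 7.4e-5 K/s, consistent with a unit labelling error.
```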
Citation: https://doi.org/10.5194/egusphere-2025-4612-AC1
RC2: 'Comment on egusphere-2025-4612', Anonymous Referee #2, 14 Feb 2026
AC2: 'Reply on RC2', Yuya Baba, 20 Feb 2026
Thank you for reviewing my manuscript. The following are my replies to your comments.
Reply to comment 1
Following the reviewer’s comment, the figures for normalized L2 norms (Figures 2 and 4) have been changed to show the root mean square error (RMSE), which is commonly used in the meteorological field, so each vertical profile now has physical units. The resulting profiles are quite similar to those of the L2 norms, since the original equation is similar to that of the RMSE. Please refer to the attached file for the revised Figures 2 and 4. In addition, the reviewer suggested that Figure 2 could show vertical profiles of temperature and humidity along with their biases, but including the averaged values for the three cases in each panel would add too much information (cf. Fig. 4), so they are not added in the revised figure.
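The level-by-level RMSE used in the revised figures follows the standard definition (a minimal sketch, not the paper's code; the array shapes are my assumptions):

```python
import numpy as np

def rmse_profile(model, obs):
    """Level-by-level RMSE between model output and observations.

    model, obs : arrays of shape (time, level)
    returns    : array of shape (level,) in the variable's physical
                 units (e.g. K for temperature, kg/kg for humidity)
    """
    return np.sqrt(np.mean((model - obs) ** 2, axis=0))

# A constant 0.5 K offset at every time yields an RMSE of 0.5 K per level.
obs = np.zeros((4, 3))
model = np.full((4, 3), 0.5)
profile = rmse_profile(model, obs)  # -> [0.5, 0.5, 0.5]
```

Because the normalized L2 norm differs from this only by a scaling factor, the vertical structure of the two error measures is essentially the same, as noted in the reply.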
Reply to comment 2
- As described in the manuscript, the details of the parameterizations are summarized in Baba (2020) (Section 3), and some of the key parameterizations are now noted as: “Large-scale cloud, radiation, boundary layer, and land processes are calculated based on the Kuwano-Yoshida et al. (2010), MstrnX (Sekiguchi and Nakajima, 2008), Mellor-Yamada Nakanishi-Niino level 2 (Nakanishi and Niino, 2009), and MATSIRO (Takata et al., 2003) schemes, respectively.”
- The model output is 6-hourly.
- The training is conducted using data from the one month before the IOP period, and the evaluation period is the whole period of each IOP experiment.
- The lower boundary is sea for the tropical cases and land for the midlatitude cases. No specific boundary condition is applied at the model top.
These missing descriptions will be added in the revised manuscript.
To answer the reviewer’s question, I added a comparison of time-varying cloud fraction and its RMSE in an additional figure (new Fig. 16; please refer to the attached file). Since cloud fraction observations are available in only limited cases, I compared them for the TWP-ICE case. The result shows that the macrophysical properties of the clouds are similar to each other even when the hybrid model is used. This is consistent with the comparison of precipitation, for which the original and hybrid models also showed similar results. These facts imply that the hybrid model cannot further improve cloud properties but does improve the mean state of the model. The additional figure and descriptions will be added in the revised manuscript. As for the radiative properties, this result suggests that they are also similar, as shown for OLR in the previous Fig. 16.
Reply to comment 3
- The prognostic variables are temperature and humidity, so the input state vector (v) in Figure 1 consists of these variables and is given to the ML part. Since the input vector contains spatial and time dimensions, it cannot simply be relabelled as temperature T and humidity q; therefore, additional equations representing T and q as separate inputs have been added to Figure 1 (please refer to the attached file).
- The ML part does not predict any prognostic variables; it only updates the reservoir vector r. This is also indicated in the revised Figure 1.
- The prognostic variables predicted by the SCM and the reservoir state updated by the ML part are combined in the output layer, which forms the variables of the hybrid model. This is indicated in the revised figure.
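The three points above can be sketched as a standard reservoir-computing step (a minimal illustration in the spirit of the Arcomano-style recipe; the array sizes, spectral-radius value, and the additive output combination are my assumptions, not taken from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res = 20, 300  # e.g. T and q at 10 levels; reservoir size is illustrative
W_in = rng.uniform(-0.1, 0.1, size=(n_res, n_in))   # fixed random input weights
A = rng.uniform(-1.0, 1.0, size=(n_res, n_res))     # fixed random reservoir matrix
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))     # keep spectral radius below 1

def update_reservoir(r, v):
    """ML part: advance the reservoir vector r from the input state
    vector v (temperature and humidity). No prognostic variable is
    predicted here; only the reservoir's nonlinear memory is updated."""
    return np.tanh(A @ r + W_in @ v)

# Output layer: a trained readout W_out combines the reservoir state with
# the SCM's own prediction to form the hybrid model's variables.  The
# additive-correction form below is one possible choice; in practice
# W_out would be obtained by (e.g.) ridge regression during training.
W_out = np.zeros((n_in, n_res))  # placeholder for the trained readout
def hybrid_output(x_scm, r):
    return x_scm + W_out @ r
```

Only `W_out` is learned; `W_in` and `A` stay fixed, which is why short, transfer-learned training periods can suffice.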
Data sets
scm_afes_v4 Yuya Baba https://doi.org/10.6084/m9.figshare.30060628
Model code and software
scm_afes_v4_code Yuya Baba https://doi.org/10.5281/zenodo.17060903
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 201 | 84 | 23 | 308 | 17 | 17 |
Summary
This manuscript describes the application of a hybrid approach machine learning approach to a physics-based single-column model. The hybrid ML approach used is reservoir computing, and follows a similar recipe to that described in Arcomano et al. (2022) and Arcomano et al. (2023). Reservoir computing in this project is used to synchronize the temperature and specific humidity at 10 of the 48 sigma levels of the model. The weights of the model were trained using reanalysis data, with the time period of the training data depending on the time period of target test period, sometimes shifted within the year and sometimes not. The baseline (non-hybrid) and hybrid models were tested against observations from field campaigns or sites in the tropics or mid-latitudes. These observations come from fairly limited periods of time, as short as 18 days or as long as 90 days. If provided appropriate training data, the hybrid system generally reduces time-mean temperature and humidity errors throughout the column.
The paper is reasonably well written in terms of describing the work that was done, but I am afraid that I do not see a meaningful scientific advance. The methodology is quite similar to that of Arcomano et al. (2022), but is applied to a substantially simpler problem: a single column model rather than a fully spatially resolved GCM. In addition, I am confused about the motivation behind the machine-learning train-test split and associated sensitivity studies. If training on reanalysis data, what is the value of using so little, or shifting the time of year the training data was derived from relative to the test period? Since it is available, why not train on data across the annual cycle? It seems artificially limiting to select training data only aligned or offset from the month of year of the test period. Finally, while direct observations certainly have unique value, I think it would also be fair, and maybe more informative, to compare to reanalysis, which is ultimately what the hybrid model was trained to emulate. If set up appropriately, this would enable testing on a more complete held out dataset than the limited columns and time periods for which we have direct observations.
Overall this is a useful proof of concept indicating that it could be interesting to see reservoir computing implemented in the Arcomano et al. (2022) style for the spatially resolved AFES model, which is a more comprehensive model than the SPEEDY model that Arcomano et al. (2022) built around. However, while I am open to being convinced otherwise, at this particular stage, which feels more like a prototype, I think there may still be a substantial amount of work needed, both in implementation and testing, to make a meaningful contribution to the literature.
Specific comments
Lines 34-36: a number of purely data-driven models can do a good job stably emulating present-day climate nowadays (e.g., Chapman et al., 2025; Cresswell-Clay et al., 2025; Guan et al., 2025; Watt-Meyer et al., 2025), and have shown promise for seasonal forecasting (Kent et al., 2025).
Line 52: "Of these issues, validity of transfer learning for a hybrid model is important for practical application." I am not sure I follow this statement. Indeed ML models must be tested on data outside of what they were trained on to demonstrate that they were not overfit to their training data, and therefore have value outside of that. But that does not mean transfer learning (in the sense described in this paper) is necessary. It would be perfectly valid to train across the full annual cycle for a number of years and then test on a held out set of years.
Lines 123-124: is there any concern that synchronizing a sparse set of levels could lead to unphysical patterns passed to the physics-based model? Maybe this is not important in a single column model where the external forcing drives so much of the solution, but it could matter when coupling to a full dynamical model.
Lines 128-130: is the minimum-maximum scaling done independently for each prognostic variable at each vertical level, or is it done across levels? I can imagine, for instance, that specific humidity varies over multiple orders of magnitude across different levels of the atmosphere.
Lines 151-153: indeed I imagine the forcing data of the single column model fairly tightly constrains its solution, so it is not surprising a random initial perturbation does not lead to a substantial spread in solutions.
Lines 196-205: is this a surprise? It maybe makes sense that mid-latitude areas, which naturally have a more pronounced seasonal cycle than tropical areas, could require time-of-year-synced training data to be present for best results.
Figure 10: these corrective tendencies are quite large. For example, 3 × 10⁻³ K/s is over 250 K/day. Is that expected? Back to my comment on Lines 123-124, this unusual vertical pattern in the corrective tendencies owing to the sparseness of the synchronized predictions is a bit disconcerting.
References
Arcomano, T., Szunyogh, I., Wikner, A., Pathak, J., Hunt, B. R., & Ott, E. (2022). A Hybrid Approach to Atmospheric Modeling That Combines Machine Learning With a Physics-Based Numerical Model. Journal of Advances in Modeling Earth Systems, 14(3), e2021MS002712. https://doi.org/10.1029/2021MS002712
Arcomano, T., Szunyogh, I., Wikner, A., Hunt, B. R., & Ott, E. (2023). A Hybrid Atmospheric Model Incorporating Machine Learning Can Capture Dynamical Processes Not Captured by Its Physics-Based Component. Geophysical Research Letters, 50(8), e2022GL102649. https://doi.org/10.1029/2022GL102649
Chapman, W. E., Schreck, J. S., Sha, Y., Gagne II, D. J., Kimpara, D., Zanna, L., Mayer, K. J., & Berner, J. (2025). CAMulator: Fast Emulation of the Community Atmosphere Model (arXiv:2504.06007). arXiv. https://doi.org/10.48550/arXiv.2504.06007
Cresswell-Clay, N., Liu, B., Durran, D. R., Liu, Z., Espinosa, Z. I., Moreno, R. A., & Karlbauer, M. (2025). A Deep Learning Earth System Model for Efficient Simulation of the Observed Climate. AGU Advances, 6(4), e2025AV001706. https://doi.org/10.1029/2025AV001706
Guan, H., Arcomano, T., Chattopadhyay, A., & Maulik, R. (2025). LUCIE: A Lightweight Uncoupled Climate Emulator With Long-Term Stability and Physical Consistency. Journal of Advances in Modeling Earth Systems, 17(11), e2025MS005152. https://doi.org/10.1029/2025MS005152
Kent, C., Scaife, A. A., Dunstone, N. J., Smith, D., Hardiman, S. C., Dunstan, T., & Watt-Meyer, O. (2025). Skilful global seasonal predictions from a machine learning weather model trained on reanalysis data. npj Climate and Atmospheric Science, 8(1), 314. https://doi.org/10.1038/s41612-025-01198-3
Watt-Meyer, O., Henn, B., McGibbon, J., Clark, S. K., Kwa, A., Perkins, W. A., Wu, E., Harris, L., & Bretherton, C. S. (2025). ACE2: Accurately learning subseasonal to decadal atmospheric variability and forced responses. npj Climate and Atmospheric Science, 8(1), 205. https://doi.org/10.1038/s41612-025-01090-0