the Creative Commons Attribution 4.0 License.
Predicting the distance of the AMOC to its tipping point using CNNs
Abstract. The Atlantic Meridional Overturning Circulation (AMOC) is an important tipping element of the climate system, with the potential to undergo an abrupt transition from its present strong state to a weak state. Such a collapse would have severe global consequences, including regional cooling, sea-level rise, altered precipitation patterns, and cascading impacts on other climate tipping elements. Both statistical and physics-based early warning signals (EWS) of an approaching AMOC tipping event have been proposed. Here, we introduce a convolutional neural network (CNN)–based framework designed to predict the distance of an AMOC state to its tipping point under imposed freshwater flux forcing. We first evaluate the CNN model using simulations from the Earth System Model of Intermediate Complexity CLIMBER-X. We then test its generalization capabilities by applying the CNN model, trained on CLIMBER-X data, to the AMOC tipping trajectory obtained recently in the Community Earth System Model (CESM). Explainable AI methods are used to identify the spatiotemporal features most relevant to the predictions. Our results demonstrate the potential of deep learning to provide reliable estimates of the distance to the AMOC tipping point and generalize across models of varying complexity.
Status: open (until 08 Jul 2026)
- RC1: 'Comment on egusphere-2026-1872', Anonymous Referee #1, 08 Jun 2026
- RC2: 'Comment on egusphere-2026-1872', Anonymous Referee #2, 15 Jun 2026
This manuscript presents a CNN-based approach for predicting the distance of the AMOC from its tipping point. The model uses spatial fields as input, including sea surface temperature (SST), sea surface salinity (SSS), their combination, and the full-depth salinity profile, to predict a normalized distance-to-tipping metric ranging from 0 to 1. The authors further employ SHAP analysis to identify the features that contribute most strongly to the model predictions. In addition, the effort to transfer information learned from CLIMBER-X simulations to CESM simulations is potentially valuable. The topic is important and interesting. However, several important issues need to be addressed before the manuscript can be considered for publication.
- The CNN output is defined as a normalized index, which is designed to avoid explicitly providing information about the freshwater forcing rate during training and testing. However, in CLIMBER-X, different types of tipping may occur, including bifurcation-induced, noise-induced, and rate-induced tipping. The current definition of the distance to the tipping point appears to be mainly applicable to bifurcation-induced tipping under deterministic conditions. Therefore, the authors should provide a clearer explanation of the applicability and limitations of this distance to tipping definition. In particular, it would be helpful to clarify whether this definition is intended to characterize only the distance to a bifurcation threshold, or whether it can also meaningfully describe proximity to noise-induced or rate-induced tipping events. This distinction is important because the timing of noise-induced and rate-induced tipping events can be highly stochastic and may not have a simple linear relationship with the freshwater forcing value.
- In the CLIMBER-X experiments, the LR model performs comparably to, and in some cases even better than, the CNN. This suggests that the relationship between the input variable fields and the target index may be relatively simple, rather than requiring complex nonlinear spatial features learned by the deep neural networks. To better justify the use of CNNs, the authors should include additional baseline models, such as shallow CNNs or other lightweight machine-learning methods. In addition, the training dataset appears to be relatively small for a deep learning approach. The authors should provide a clearer description of the dataset, including the size and construction of training samples. This is important because temporally adjacent fields are often highly correlated, meaning that the nominal sample size may overestimate the amount of independent information available for training and evaluation. The manuscript also reports a relatively large number of hyperparameter settings across different experiments, which may partly reflect the limited sample size and raises questions about the consistency and robustness of the model configuration.
- The manuscript emphasizes that the CNN is trained on CLIMBER-X and then generalized to CESM. However, since part of the CESM simulation is used for validation and hyperparameter selection, the model selection procedure has already incorporated information from the target model. Therefore, this experiment should not be presented as a fully independent cross-model extrapolation. The authors should moderate the relevant claims in the abstract and conclusions. In particular, statements such as “reliable estimates” and “generalize across models” should be softened unless additional validation using fully independent target model simulations is provided.
- The explainability analysis is potentially valuable; however, the SHAP maps appear to be derived from the best-performing CNN realizations selected using test-set performance. This may introduce selection bias, particularly for CESM, where the model shows substantial variability across realizations.
- Some typos should be corrected throughout the manuscript. For example, in line 64, “Each of these interacting modules are discretized” should be revised to “Each of these interacting modules is discretized”. In line 413, “a modest SST increase is occurs” should be revised to “a modest SST increase occurs”. There also appears to be an incorrectly formatted reference citation around line 600.
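On the point above about temporally adjacent fields being highly correlated, a rough way to gauge how much independent information a series carries is the lag-1 effective sample size, n_eff = n (1 - r1) / (1 + r1), which applies to an AR(1)-like process. The sketch below is illustrative only: the series is synthetic, and the function names and parameter values are hypothetical rather than taken from the manuscript or its code.

```python
import numpy as np

def lag1_autocorr(x):
    # Sample lag-1 autocorrelation of a 1-D series
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

def effective_sample_size(x):
    # AR(1) approximation: n_eff = n * (1 - r1) / (1 + r1)
    r1 = lag1_autocorr(x)
    return len(x) * (1.0 - r1) / (1.0 + r1)

# Synthetic AR(1) series standing in for a strongly autocorrelated index
rng = np.random.default_rng(3)
n, phi = 5000, 0.9  # hypothetical length and persistence
x = np.zeros(n)
for k in range(1, n):
    x[k] = phi * x[k - 1] + rng.standard_normal()

n_eff = effective_sample_size(x)
print("nominal n:", n, "effective n:", round(n_eff))
```

Reporting such an estimate alongside the nominal number of training samples would make it easier to judge whether the dataset is large enough for a deep network.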
Model code and software
Code for reproducing the results of the paper "Predicting the Distance of the AMOC to Its Tipping Point Using CNNs" Francesco Guardamagna https://doi.org/10.5281/zenodo.19369578
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 389 | 186 | 20 | 595 | 40 | 40 |
The Atlantic meridional overturning circulation (AMOC) is known to exhibit multiple equilibria and tipping points, which can be observed in climate model simulations where a freshwater flux is applied in the North Atlantic. In this draft, the authors propose inferring the accumulated freshwater flux applied in the North Atlantic from the patterns of sea surface temperature, sea surface salinity and salinity cross-section at 35°S in the Atlantic Ocean. This inference uses data from CLIMBER-X (an earth system model of intermediate complexity) and CESM (a low-resolution climate model). The accumulated freshwater flux is normalized in each model by an estimate of the total accumulated freshwater flux required to trigger a tipping point. Here, the method employs convolutional neural networks (CNNs). The results obtained are discussed using a sensitivity analysis.
The draft presents original research; however, significant work is needed to make it publishable. Specifically, the scientific question and the research hypotheses are not clearly presented in the introduction. The results presented are overly technical and lack synthesis. The sensitivity analysis fails to reveal consistent patterns. The conclusion and discussion are underdeveloped.
The data from only two models are used. How representative are these two models? Are their outputs consistent with those of other models concerning the AMOC tipping and its SST and salinity signatures?
Is the accumulated freshwater flux leading to a tipping AMOC well estimated in the simulations? The manuscript lacks clear explanations of how the AMOC collapse or tipping is defined or estimated. Would it be valuable to incorporate some uncertainties?
Data from hosing experiments may differ significantly from real-world conditions. For instance, hosing experiments often involve a shallowing of the mixed layer in the hosing region, whereas real-world freshwater flux from melting ice sheets has a different pattern. This limitation needs to be discussed, as the machine learning method developed may fail when applied to observational data.
A linear regression model is given as a baseline. However, given the high dimensionality of the input data, the linear regression model may overfit. I suggest using a ridge regression or a random forest model instead.
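To illustrate the concern, the sketch below compares unregularized least squares with ridge regression when there are far more input features than samples. Everything here is an assumption made for illustration: the data are synthetic, the sizes and the penalty `alpha` are hypothetical, and the helper names are not from the manuscript's code.

```python
import numpy as np

def fit_ols(X, y):
    # Minimum-norm least-squares solution; with more features than
    # samples it fits the training data exactly (interpolation).
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fit_ridge(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 200, 500  # hypothetical: many grid points, few samples
w_true = np.zeros(n_features)
w_true[:5] = 1.0  # only a few features carry signal

X_train = rng.standard_normal((n_train, n_features))
X_test = rng.standard_normal((n_test, n_features))
y_train = X_train @ w_true + 0.5 * rng.standard_normal(n_train)
y_test = X_test @ w_true + 0.5 * rng.standard_normal(n_test)

w_ols = fit_ols(X_train, y_train)
w_ridge = fit_ridge(X_train, y_train, alpha=10.0)

print("OLS   train/test MSE:", mse(X_train, y_train, w_ols), mse(X_test, y_test, w_ols))
print("Ridge train/test MSE:", mse(X_train, y_train, w_ridge), mse(X_test, y_test, w_ridge))
print("weight norms (OLS, ridge):", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The unregularized fit reproduces the training targets essentially exactly, noise included, while ridge gives up some training fit in exchange for smaller weights. Whether this improves held-out skill for the actual SST and SSS fields would need to be checked with a penalty chosen on the validation data.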
Another point concerns the normalization. L243 : ‘’CESM inputs are normalized using the mean and standard deviation of the CLIMBER-X training data’’ In climate science, the model data often show large differences, as they all have different biases. I understand that the mean state of CLIMBER-X was removed from CESM data to define anomalies. If this is correct, this approach would emphasize the difference between CESM and CLIMBER-X rather than the distance to tipping. A better approach might be to use anomalies defined with a reference period relative to each model. Additionally, dividing by the standard deviation of CLIMBER-X could lead to a high variability in CESM data if the standard deviation of CLIMBER-X is much smaller than the one of CESM. Can the authors compare the standard deviations of the two models?
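The comparison requested above could look like the following sketch, which contrasts normalizing a target model with the source model's statistics against normalizing it with its own reference period. The arrays are synthetic stand-ins, not CLIMBER-X or CESM output, and the means, standard deviations and the length of the reference period are hypothetical.

```python
import numpy as np

def zscore(field, mean, std):
    # Standardize a (time, gridpoint) array with the given statistics
    return (field - mean) / std

rng = np.random.default_rng(1)
# Synthetic SST-like series with shape (time, gridpoint)
source = 15.0 + 0.5 * rng.standard_normal((300, 50))  # stand-in for the training model
target = 17.0 + 1.5 * rng.standard_normal((300, 50))  # stand-in for a warmer, more variable model

src_mean, src_std = source.mean(axis=0), source.std(axis=0)

ref = target[:100]  # hypothetical reference period of the target model
ref_mean, ref_std = ref.mean(axis=0), ref.std(axis=0)

cross = zscore(target, src_mean, src_std)  # target normalized with source statistics
own = zscore(target, ref_mean, ref_std)    # target normalized with its own reference period

std_ratio = float(np.mean(target.std(axis=0) / src_std))
print("std ratio target/source:", std_ratio)
print("cross-normalized mean, std:", float(cross.mean()), float(cross.std()))
print("own-normalized mean, std:", float(own.mean()), float(own.std()))
```

In this toy setting the cross-normalized inputs are offset by the mean-state difference and inflated by the ratio of standard deviations, so the network would see values well outside its training range. Reporting the same three diagnostics for the real model pair would show how large the effect is in practice.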
The authors should explicitly refer to each figure panel, line and symbol to help readers verify statements and hypotheses. Currently, it is difficult to follow what is being discussed. Are the claims based on figures, hypotheses or suggestions? The results are presented as an exhaustive list of figures, which is overwhelming. I suggest reducing the number of figures. For instance, Figs. 6, 7, 8, 9 and 10 can be reduced to show the sensitivity analysis for a single freshwater forcing and CLIMBER-X, with the other results summarized in the text. The same remark applies to appendices B and C, which could be removed or significantly condensed. For appendix C, graphs might be more effective than tables for visual clarity.
The unclear presentation and difficult readability made the review challenging, so I primarily focused on the first 3 sections for minor comments.
The training, validation and testing strategy concerning the CESM results is unclear and needs to be clarified.
The introduction needs to better explain the experiments used in the paper, and their limitations. The conclusion needs to provide comparisons with related recent findings, suggest future perspectives and discuss limitations.
Minor comments
Introduction:
L37-38: ‘’First, it requires prior knowledge about the freshwater forcing rate, which is a significant constraint for real-world applications.’’ The term freshwater forcing is ambiguous. Clarify whether it refers to a freshwater flux imposed in hosing experiments or to the climatological freshwater forcing from the surface water budget (e.g., precipitation minus evaporation and runoff).
L38-39 : ‘’ the model must be trained on data from the simulation itself, extending up to 100 years before the onset of collapse. ‘’ Please specify the total number of years used and the physical field or time series used in the training data.
L39-41: ‘’ as the model relies solely on one-dimensional indices, such as the AMOC strength at 26◦N as input, it precludes the application of explainable AI techniques ‘’ Why do the authors argue that explainable AI techniques cannot be applied to indices? Provide justification or revise the statement.
L68-69: ‘’ matches many aspects of state-of-the-art CMIP6 models across diverse forcings and boundary conditions (Willeit et al. (2022a)). ‘’ Can the authors be more specific and describe which processes have been validated and for what types of experiments?
L74 : ‘’ the AMOC collapses once the freshwater forcing reaches F_H^C = 0.22 Sv’’. How was the threshold determined? The manuscript does not provide sufficient explanation. Also define AMOC collapse. This term is loosely used in the literature. Specify the criteria for collapse (threshold, or detection of bifurcation). I suggest that the authors show and illustrate the results supporting this value.
L78: ‘’The Community Earth System Model (CESM) is a fully coupled GCM. ‘’ Clarify the differences between CESM and CLIMBER-X. What processes are included in each model?
L86-87: ‘’ Under this forcing, van Westen et al. (2024a) estimated that the AMOC reaches its tipping point at model year 1758, when the freshwater input into the North Atlantic reaches F_H^E = 0.53 Sv.’’ How was the tipping point estimated? Is it equivalent to say that the AMOC collapsed and that the AMOC reached a tipping point? What does the model year 1758 refer to? Does the simulation start at year 0? Provide context for the timeline.
L112-113 and Fig. 1: ‘’ The CNN is trained using Sea Surface Temperature (SST) and Sea Surface Salinity (SSS) fields across the Atlantic Ocean (spanning from 90°N to 35°S)’’ Define the Atlantic Ocean boundaries. Most definitions do not extend beyond 80°N (e.g., Fram Strait). The Arctic is included (see Fig. 1); justify this choice. Why do the authors choose to use SST and SSS as input? Why not include subsurface ocean data, which may provide additional predictive skill?
L114: ‘’the full-depth salinity profile at 35◦S ‘’ Do the authors mean that they used the cross-section at 35°S in the Atlantic Ocean in the depth-longitude space? Justify the choice of this latitude and its relevance to AMOC dynamics and tipping point.
L121 : ‘’where F_H(t) denotes the freshwater flux value ‘’ Define freshwater flux. Does it refer to an additional freshwater flux added (e.g. hosing)? Or is it the actual total freshwater flux (precipitation minus evaporation + runoff)? Is it the same as the forcing rate provided L129 and L130? Clarify also the unit for both terms.
L129-130 ‘’ For all forcing rates, d_F(t) is defined with respect to a freshwater flux at tipping F_H (t_p) = F_H^C = 0.22 Sv, which corresponds to the tipping point identified for the slowest forcing experiment ‘’ How was the tipping point identified? Why use the same F_H (t_p) for all forcing rates? Would the results differ if F_H (t_p) varied? Just a question: would the results be different if d_F(t) were defined as (t-t_0)/(t_p-t), where t_0 is the time of the initial conditions?
L130: Why is the unit of a freshwater forcing given in Sv yr-1? Clarify the units. A freshwater flux is typically expressed in Sv. Is the forcing a rate of change?
L135: ‘’ then evaluated on the trajectory excluded from both training and validation ‘’ Define trajectory. Does it refer to a time series from a single simulation? Please specify the data used for evaluation.
L161-162: ‘’ The reported predictions are the median across 20 independent training trials; variability across trials is negligible and therefore not shown. ‘’ Justify the need for multiple trials, and reduce the number of trials if the results are effectively deterministic.
L157-159 : ‘’ The LR model is trained using the same input variables and target (dF ), again following the procedure outlined in Section 2.4. ‘’ The LR model may overfit when using such high dimensional input. Did the authors try to apply regularization or to reduce the dimensionality of the input? If not, address this limitation and switch to a more robust baseline.
L172-173: ‘’ corresponding to a prediction uncertainty of 961 years (9.61 × 10−3 Sv if we express the error in terms of freshwater forcing) ‘’ Clarify the calculation. How was the 961 years derived? How does this translate to 9.61 × 10−3 Sv?
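One reading that would reconcile the two quoted numbers, assuming the freshwater flux is ramped linearly at a constant rate r_F (an assumption made here, not something the quoted passage confirms), is a direct conversion from time to flux:

```latex
\Delta F_H = r_F \,\Delta t
\quad\Longrightarrow\quad
r_F = \frac{9.61 \times 10^{-3}\ \mathrm{Sv}}{961\ \mathrm{yr}} = 10^{-5}\ \mathrm{Sv\,yr^{-1}}
```

If this is the intended conversion, stating the forcing rate explicitly next to the error would remove the ambiguity.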
L179-180 : ‘’ with very low percentage errors relative to the total span of the test simulations.’’ Can the authors explain better? What is the length of the test simulation then?
L188-189: ‘’ the LR provides skillful predictions ‘’ Define skillful. Is this based on a statistical metric?
L213-215 : ‘’ The primary advantage of the CNN lies in its generalization capability. When trained on CLIMBER-X data and evaluated on the more complex CESM model, the LR model fails to provide reliable predictions, whereas the CNN demonstrates robust generalization performance (see Section 3.2). ‘’ The claim about CESM results is premature, as these results are not yet presented in the manuscript.
L219-220: ‘’ For rF = 10−4 Sv yr−1, the collapse of the AMOC is initiated approximately 200 years before the system reaches its actual tipping point, marking the onset of a regime shift’’ Which figure illustrates this result? Define AMOC collapse. Explain ‘’initiated’’. Does this refer to the start of a decline?
Appendix B : reduce the text and figures to focus on key results. Improve explanations to highlight the most important findings.
Fig. 3, legend : no need to explain what a boxplot is, as this is common knowledge. However, I suggest keeping the definition of extremes and error bars.
Fig. 3: I suggest using consistent colors for models that use the same inputs. For instance, the CNN using SST only as a yellow boxplot and the LR using SST only as a yellow triangle…
Fig. 3: What is shown here? Evaluation of the model when using the test simulation with the six other simulations used for training and validation?
L246-251: The authors explained at L226-232 that CESM data were only used for evaluation. Why, then, are the hyperparameters modified using CESM data?
L248 : ‘’the first 430 years are discarded to remove transient effects’’ Define ‘’transient effects’’.
L263-268 : ‘’ Despite these measures, some variability remains in the validation result’’ and ‘’ In what follows, we present results obtained with the best-performing configuration on the CESM test set (last 880 years).’’ The large variability in the results suggests that there is a large uncertainty in the inference. I suggest quantifying the related uncertainty.
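One simple way to report this uncertainty is a percentile band across training realizations rather than a single best run. The sketch below assumes predictions are stored as a (trial, time) array. The data are synthetic, and the number of trials, the noise levels and the 5th to 95th percentile choice are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials, n_times = 50, 100  # hypothetical ensemble of training realizations

# Synthetic stand-in for the normalized target and per-trial predictions:
# each trial has its own offset plus small time-varying noise.
truth = np.linspace(1.0, 0.0, n_times)
preds = (truth
         + 0.1 * rng.standard_normal((n_trials, 1))
         + 0.02 * rng.standard_normal((n_trials, n_times)))

median = np.median(preds, axis=0)               # central estimate per time step
lo, hi = np.percentile(preds, [5, 95], axis=0)  # 5th to 95th percentile band
spread = hi - lo

print("mean width of the 5-95% band:", float(spread.mean()))
print("median absolute error:", float(np.mean(np.abs(median - truth))))
```

Showing the median with such a band for all realizations, instead of only the best-performing configuration, would make the spread of the CESM predictions visible.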
Fig 4 : Where is the training data in panel (a)? How is the tipping detected in panel (a)?
L277 : Here 500 trials are mentioned, while 50 trials are mentioned at L255. Why 500? Clarify the purpose of such a large number of trials.
L344-347 : Here, and more generally in this subsection, can the authors refer to the figure or panel for each statement?
L374 : ‘’SV relevance maps’’ Explain the acronym SV.
L390 : ‘’ the magnitude is quantified using Sen’s slope estimator.’’ Justify the choice of Sen’s slope estimator over a linear trend.
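For context, Sen's slope is the median of the slopes between all pairs of points, which makes it less sensitive to outliers than an ordinary least-squares trend. The sketch below illustrates the difference on a synthetic series with a single outlier. It is not the manuscript's implementation, and the function name and values are hypothetical.

```python
import numpy as np

def sens_slope(t, y):
    # Median of the slopes over all pairs of points (i < j)
    i, j = np.triu_indices(len(t), k=1)
    return float(np.median((y[j] - y[i]) / (t[j] - t[i])))

# Synthetic series: exact slope of 2 with one large outlier at the end
t = np.arange(20, dtype=float)
y = 2.0 * t
y[-1] += 50.0

ols_slope = float(np.polyfit(t, y, 1)[0])  # least-squares linear trend
sen = sens_slope(t, y)

print("least-squares slope:", ols_slope)
print("Sen's slope:", sen)
```

If robustness to outliers is the motivation, stating it in the text and showing that the two estimators agree where the relevance series is well behaved would address the question.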
L409-410: ‘’ First, we note that patterns are consistent with those reported by Stouffer et al. (2006), who conducted an inter-comparison of several EMICs, including an earlier version of the CLIMBER-X model used here. ‘’ Update the discussion to include more recent papers for context.
Sections 4.2 and 4.3 : refer to figure panels when describing results to improve readability. Maybe focus on only one freshwater forcing, and show the sensitivity of the SST, SSS and S35S for CLIMBER-X only.
The relevance scores in Figs. 6, 7, 8, 9 and 10 are quite noisy and suggest that the CNN may not be learning physically consistent features. Discuss the implications for the skill of the CNN. Can this be linked to the large variability obtained?
L600 : one reference is missing.