An Earth system deep learning classifier for tipping point detection
Abstract. Tipping points are thresholds at which a system, often abruptly and irreversibly, transitions from a stable state to a contrasting one. Crossing such critical boundaries poses a risk to Earth system stability and may have catastrophic consequences. This is especially relevant, as current climate change is destabilizing Earth subsystems, potentially bringing them closer to tipping points. Thus, it is important to be able to detect approaching tipping points in the Earth’s system, which can be achieved through calibration on palaeo-records. Recently, new deep learning (DL) methods have been established that are able to confidently and quantitatively identify different types of critical transitions characterised by their abruptness and (ir)reversibility. Based on this, we develop a new (simplified) DL classifier focusing on the quantitative detection of catastrophic tipping points (fold bifurcations) in the Earth system. Our approach reduces computational demand and improves performance, especially for short timeseries. We first test the new classifier's performance on synthetic data and subsequently on different existing Cenozoic proxy records. Our DL results are compared to the results from previous studies applying generic early warning signals (EWS), which can detect approaching transitions qualitatively but cannot distinguish bifurcation types (abruptness and (ir)reversibility of the transition). Our DL classifier enables us to identify how abrupt and (ir)reversible an approaching transition is, which is important for tipping point risk assessment and mitigation. Results are generally consistent between generic EWS from previous studies and our DL approach and fit with what is known from the geological context. We note that some results are dependent on the length of the classifier used and the time interval investigated before the bifurcation. 
We implement an out-of-distribution (OOD) detection method to reduce the misclassification of non-catastrophic bifurcations as catastrophic tipping points. Combined with the binary DL classifier, this approach enables reliable, quantitative detection of catastrophic tipping points in Earth system records.
I found the manuscript to be an interesting read and a useful follow-on from the Bury et al. (2021) paper, with a focus that shifts to considerations important for the Earth system, such as whether the abrupt shifts whose approach the DL model predicts are irreversible. I think the work is a great step forward in the literature, but I have some concerns about the robustness of the results. I also think it would benefit from stepping away from the Bury et al. paper by explaining certain aspects in more detail, rather than leaving the reader to consult that paper. I have detailed my comments below.
One of my main comments is that I do not think the training set is explained well enough. I think the time series are essentially the ones from the ‘fold’ and ‘null’ categories in the other paper, but an equation explaining their derivation would help the reader understand what is being used in this training specifically. I get the impression that the underlying equation can be used to simulate a number of different bifurcations; stating this explicitly would make clear which of them enter the training set.
To a certain degree I can understand why the transcritical time series were used as an OOD test, but currently it is difficult to look past what feels like a strong focus on fold and transcritical bifurcations. An immediate question is why transcritical time series are not included in the training set as null cases. I assume this is because it would be difficult to decide where to draw the line on what to include. I would be interested to know how a DL model trained on a binary output of those two classes differs from the current model.
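To make the fold versus transcritical comparison concrete, the following is a minimal sketch, assuming the textbook normal forms with additive white noise and a linearly ramped control parameter. It is not the manuscript's training procedure; the function name `simulate_normal_form`, the parameter values, and the Euler-Maruyama integration are illustrative assumptions only.

```python
import numpy as np

def simulate_normal_form(kind, mu_start=1.0, mu_end=0.2, n_steps=5000,
                         dt=0.01, sigma=0.01, seed=0):
    """Euler-Maruyama simulation of a noisy normal form approaching mu = 0.

    kind="fold":          dx = (mu - x^2) dt + sigma dW   (stable branch x* = sqrt(mu))
    kind="transcritical": dx = (mu*x - x^2) dt + sigma dW (stable branch x* = mu)
    The ramp stops at mu_end > 0, i.e. before the bifurcation is crossed.
    """
    if kind == "fold":
        drift = lambda x, m: m - x**2
        x = np.sqrt(mu_start)          # start on the stable branch
    elif kind == "transcritical":
        drift = lambda x, m: m * x - x**2
        x = mu_start                   # start on the stable branch
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    rng = np.random.default_rng(seed)
    mu = np.linspace(mu_start, mu_end, n_steps)  # slowly ramped control parameter
    series = np.empty(n_steps)
    for i in range(n_steps):
        series[i] = x
        noise = sigma * np.sqrt(dt) * rng.standard_normal()
        x = x + drift(x, mu[i]) * dt + noise
    return series

fold_series = simulate_normal_form("fold")
transcritical_series = simulate_normal_form("transcritical")
```

Series of this kind, labelled by bifurcation type, are the sort of input a fold versus transcritical binary classifier would be trained on; how the authors' actual training data are generated is what the requested equation should clarify.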
Regarding the OOD tests, I would be interested in seeing how other time series are classified, e.g. different types of bifurcations (non-catastrophic Hopf), white-to-red noise processes in which the memory of the system increases, or stable time series in which the noise level increases. Furthermore, including catastrophic Hopf bifurcations in an OOD test could show how reliant the model is on fold bifurcations being the only shifts considered catastrophic.
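The two noise-based test cases suggested above could be generated along the following lines. This is a sketch under my own assumptions: an AR(1) process is used as the stand-in for a stable system, and the helper names (`ar1_series`, `lag1_autocorr`) and ramp ranges are hypothetical choices, not taken from the manuscript.

```python
import numpy as np

def ar1_series(phi, sigma, seed=0):
    """AR(1) process x[t] = phi[t-1] * x[t-1] + sigma[t-1] * eps[t],
    with time-varying memory (phi) and noise amplitude (sigma)."""
    phi = np.asarray(phi, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    rng = np.random.default_rng(seed)
    x = np.zeros(len(phi))
    for t in range(1, len(phi)):
        x[t] = phi[t - 1] * x[t - 1] + sigma[t - 1] * rng.standard_normal()
    return x

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D series."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

n = 1500

# Case 1: white-to-red noise. Memory rises from 0 to 0.9 while the noise
# amplitude is scaled by sqrt(1 - phi^2) so the variance stays roughly constant.
phi_ramp = np.linspace(0.0, 0.9, n)
reddening = ar1_series(phi_ramp, np.sqrt(1.0 - phi_ramp**2), seed=1)

# Case 2: stable series with fixed memory but a tenfold rise in noise level.
noisier = ar1_series(np.full(n, 0.3), np.linspace(0.1, 1.0, n), seed=2)
```

Neither series approaches a bifurcation, so a robust classifier combined with the OOD step should not flag either of them as a fold, even though autocorrelation (case 1) or variance (case 2) rises as it would ahead of a tipping point.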
I have a few minor comments regarding readability and presentation which I have detailed below.
Line 24 – The classifier only identifies whether a transition is abrupt and irreversible, not ‘how’ abrupt or irreversible it is.
Line 36 – Consider ‘The Earth system’ rather than ‘System Earth’.
Line 65 – I’d argue that generic EWS are not really qualitative in nature: they use basic statistical measures, and their trends can be quantified. Perhaps rephrase to properly describe the added benefit of DL.
Line 85 – ‘Two bifurcation types of interest’ suggests that the ‘null’ category is a bifurcation in this instance.
Line 213 – 100 time series seems quite low. Why was this number chosen?
Mentions of Fig. 5 – I would reconsider how Figure 5 is referenced, as it is used a lot throughout the manuscript. Maybe do not reference it until the results section, so that it comes after Fig. 4. It is also confusing because it shows results for the palaeo data that are not discussed in the text until much later.
Figure 5 caption – Should briefly explain how the threshold (yellow line) was calculated or point the reader to the main text to find out.
Table 5 – To me (both in the table and in the main text) it reads that time series with higher noise are likely to be above the energy threshold whether they are ID or OOD, meaning that the method classifies noisier time series as fold bifurcations regardless. According to Fig. 5, higher energy scores should be associated with fold bifurcations, yet the text on line 294 says they are associated with greater confidence in identification. Line 514 says something similar.
Section 3 – I think some specific referencing would be beneficial throughout when explaining the context of the results. For example, the statement on line 314 that your results corroborate ETM3 being the weakest event needs a reference.
Line 475 – ‘other’ should come after ‘additional mechanisms’ not before.
For the heatmaps (Fig. 6-8), I would suggest using a single colour bar that goes from blue to red with grey in the middle. I am struggling because values of ~0.6 look quite similar on both bars. Also be careful with the black fold text around these values (Figure 7 for example), although solving the colour bar problem might fix this.