Quantitative Comparison of Causal Inference Methods for Climate Tipping Points
Abstract. Causal inference methods present a statistical approach to the analysis and reconstruction of dynamic systems as observed in nature or in experiments. Climate tipping points are likely present in several core components of the Earth system, such as the Greenland ice sheet or the Atlantic Meridional Overturning Circulation (AMOC), and are characterized by an abrupt and irreversible degradation under sustained global temperatures above their corresponding thresholds. Causal inference methods may provide a promising way to study the interactions of climate tipping elements, which are currently highly uncertain due to limitations in model-based approaches. However, the data-driven analysis of climate tipping elements presents several challenges, e.g., with regard to nonlinearity, delayed effects and confoundedness. In this study, we quantify the accuracy of three commonly used multivariate causal inference methods with regard to these challenges and find unique advantages of each method: The Liang–Kleeman Information Flow (LKIF) is preferable in simple settings with limited data availability, the Peter–Clark Momentary Conditional Independence (PCMCI) provides the most control, e.g., to integrate expert knowledge, and the Granger Causality for State Space Models (GCSS) is advantageous for large datasets and delayed interactions. In general, data sampling intervals should be aligned with the interaction delays, and the inclusion of a confounder (like global temperatures) is crucial to deal with the nonlinear response to (climate) forcing. Based on these findings and given their data masking capabilities, we apply the LKIF and PCMCI methods to reanalysis data to detect tipping point interactions between the AMOC and Arctic summer sea ice, which imply a bidirectional stabilizing interaction, in agreement with physical mechanisms. Our results therefore contribute robust evidence to the study of interactions of the AMOC and the cryosphere.
Review: Quantitative Comparison of Causal Inference Methods for Climate Tipping Points
General Comments:
In this work, the authors conduct a quantitative investigation into the reliability and robustness of three multivariate causal inference methods: the Liang–Kleeman Information Flow (LKIF), the Peter–Clark Momentary Conditional Independence algorithm (PCMCI), and Granger Causality for State Space Models (GCSS).
This is done in the context of studying the interactions of climate tipping elements in various facets of the Earth system, which pose specific operational challenges. Through the quantitative metric of choice (Matthews Correlation Coefficient; MCC), the authors showcase unique advantages for each method, while also identifying three general principles for addressing nonlinear responses, delayed effects, and confounders during the application of these causal methods to climate tipping points. The use of MCC is natural and justified, as it accounts for all four entries of the confusion matrix in binary classification in a balanced way.
Following a preliminary study on synthetic data generated by a network of differential equations, they apply LKIF and PCMCI, following their own recommendations, to reanalysis data to detect tipping point interactions between the Atlantic Meridional Overturning Circulation (AMOC) and Arctic summer sea ice (ASSI), confirming established physical mechanisms (bidirectional stabilizing interactions) beyond confounding influences (Arctic temperatures).
This study is a welcome addition to both the climate tipping literature and to the causal inference community. The structure of the paper is sound, transitioning from a synthetic-data investigation to a realistic application to climate tipping point interactions between AMOC and ASSI. Physically consistent results are derived in the latter study, both in terms of state-space causality and temporal causal influence regions, by applying the recommendations derived from the former experiment.
My general assessment is that this is a well-written paper overall, with the authors presenting their methodology and results succinctly and clearly, which should be of interest to the relevant researchers. The results are put in context, well interpreted, and presented without drawing overly strong conclusions. This work fits into the scientific scope of NPG. My recommendation is that it can be published in NPG following some major revisions and clarifications, as well as some minor corrections and adjustments.
Specific Comments:
(Format: p.##, l.## - Page number, line number | Section/Appendix/Figure/Table ##)
p.2, l.32—44 – In terms of references, the authors appropriately cite most relevant works in the associated fields throughout the manuscript. But, while the authors succinctly explain climate tipping points and provide constructive and relevant examples here, I would recommend that they note, either implicitly or explicitly, how they essentially describe bifurcation-induced tipping here (with a hint towards rate-induced tipping when referring to effects across time scales), with other regime-switching driving mechanisms (internal variability and rate-limited tipping) also being possible [1].
[1] https://arxiv.org/abs/1103.0169
p.7, l.178—179 – I would recommend elaborating a bit more on the details of how the time lag analysis is calculated in LKIF (“Time lag analysis is implemented by shifting any single input time series by a given number of time steps.”) and whether the adopted approach is operationally consistent with the approaches in PCMCI and GCSS. These details can be added in the appendix if preferred.
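For concreteness, the mechanism described in the quoted sentence can be illustrated as follows. This is my own minimal sketch of lag-shifting (not the authors' implementation): the candidate cause is shifted back by `lag` steps and both series are truncated so that `cause[t - lag]` is paired with `effect[t]`, after which the causality measure is recomputed per lag. Here a plain correlation stands in for the information flow, and the 3-step delay and helper name `lagged_pair` are my own illustrative choices:

```python
import numpy as np

def lagged_pair(cause, effect, lag):
    """Align the cause shifted back by `lag` steps with the unshifted
    effect, so that cause[t - lag] is paired with effect[t]."""
    if lag == 0:
        return cause, effect
    return cause[:-lag], effect[lag:]

# Toy example: the effect is a noisy, 3-step-delayed copy of the cause.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.roll(x, 3) + 0.1 * rng.standard_normal(500)

# Scan candidate lags; the strongest |correlation| marks the true delay.
best = max(range(6), key=lambda k: abs(np.corrcoef(*lagged_pair(x, y, k))[0, 1]))
print(best)  # the lag maximizing |correlation| should be 3
```

Whether LKIF, PCMCI, and GCSS all realize their lag scans in this truncate-and-align fashion is exactly the operational-consistency question raised above.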
Section 2.3 – I would recommend adding a very brief paragraph here providing an interpretation of MCC and its values for people not familiar with the metric (e.g., maximum values and zero values, relation to the chi-squared statistic or other scores for intuition, etc.), which would also help with the self-containedness of the work.
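The kind of interpretive note I have in mind could be as short as the following sketch (my own illustration with made-up confusion-matrix counts, not taken from the manuscript): MCC is the correlation between predicted and true binary labels, reaching +1 for perfect detection of causal links, 0 for chance-level performance, and −1 for a perfectly inverted classification.

```python
import numpy as np

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Returns 0 by convention when any marginal sum is zero."""
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0

print(mcc(tp=10, fp=0, fn=0, tn=10))   # perfect link detection -> 1.0
print(mcc(tp=5,  fp=5, fn=5, tn=5))    # chance level           -> 0.0
print(mcc(tp=0, fp=10, fn=10, tn=0))   # fully inverted         -> -1.0
```

Relating MCC to the chi-squared statistic (MCC² = χ²/n for a 2×2 table) would likewise give readers a familiar anchor.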
p. 9, l. 223—227 (and p.21, l.481—483 by extension) – As a quick clarification, how are the results affected by a different heuristic choice of the time step Δt to account for causal delays? Specifically, how are the results in Panel (b), Figure 2 affected by implicitly changing the signal-to-noise ratio of each relation? A quick note here would also help with empirically elaborating on Recommendation 1 in Section 4.
p.10, l. 239—240 – Indeed, since GCSS assesses causal relationships by projecting the latent state process (cause) onto the space spanned by the infinite past of the observation variables with and without the effect, it provides higher explanatory power with the autoregressive part resolving the presence of time lags. Elaborating a bit more on this argument will make the justification slightly more rigorous.
p. 14, l. 325 – A quick note on why GCSS cannot be implemented here in a straightforward manner due to implementation details would be welcome (the small number of samples available for this experiment is also a valid limitation, based on the results of Fig. 2a). I do note the additional clarification in p. 17, l. 412—413, which is more than enough, but it does come towards the end of the case study.
p.14, l. 341—343 – Is the choice of including cells above the 66th percentile based on a specific heuristic in the associated literature? How sensitive are the results to this choice?
Figure 5 and p.15—16, l. 369—389 – I would recommend a clarification of the results in Figure 5 and the associated text: The colour of the causal arrow from ASSI to AMOC indicates a stabilizing/negative effect at a one-month delay, but at a five-month delay there is a destabilizing/positive link instead. While that is the weaker link, as the text notes, adding two independently coloured arrows instead of a single one coloured according to the stronger link would remove any ambiguity or confusion. I would recommend the same for the causal effect from ASSI to the Arctic temperatures (adding three arrows). That way, the coefficient of each link can also be superimposed next to the corresponding arrow for clarity. If the authors choose to implement these changes, they can also apply them to Figure F1 for consistency.
Table C1 – For the synthetic data experiments, has a larger (or non-uniform) noise level been tested (but not so large as to break the required assumption of stationarity through the linear couplings)? Also, I would recommend adding the variable next to the parameter name in the first column, which would also clarify the use of uniform coupling strengths (not considering the choice of sign). E.g., "Noise scale (σ)".
Appendices D, E, and F – Just wanted to note that these are a very nice addition to the text, further illuminating the implementation intricacies behind the causal methods utilized in terms of different variants, operational complexity, and robustness to observational noise and different reanalysis approaches, respectively.
Technical Corrections:
(Format: p.#, l.# - Page number, line number)