This preprint is distributed under the Creative Commons Attribution 4.0 License.
Assessing seasonal climate predictability using a deep learning application: NN4CAST
Abstract. Seasonal climate predictions are essential for climate services, with changes in tropical sea surface temperature (SST) being the most influential drivers. SST anomalies can affect the climate in remote regions through various atmospheric teleconnection mechanisms, and the persistence and evolution of those SST anomalies can provide seasonal predictability for atmospheric signals. Dynamical models often struggle with biases and low signal-to-noise ratios, making statistical methods a valuable alternative. Deep learning models currently provide accurate predictions, mainly in short-range weather forecasting. Nevertheless, the black-box nature of this methodology makes explainability analyses necessary. In this context, we present NN4CAST (Neural Network foreCAST), a versatile Python deep learning tool designed to assess seasonal predictability. Starting from the raw input datasets, NN4CAST performs the full methodological workflow, from preprocessing to model training, validation, and interpretation, enabling researchers to rapidly explore the predictability of a target variable and identify its main drivers. This flexible framework allows for the quick testing of predictive skill from different sources of predictability, making it a valuable asset for climate services. As SST is the primary source of seasonal predictability, we illustrate the application of NN4CAST to tropical and extratropical teleconnections forced by Pacific SSTs. We show that NN4CAST can provide skillful seasonal forecasts in regions where the atmospheric response to SST anomalies is predominantly linear, such as the tropics, as well as in remote regions where the signal is highly non-linear, such as Europe. Two key examples are the prediction of SST anomalies in the tropical Atlantic region during boreal spring and of precipitation anomalies over the European continent during boreal autumn. The former exemplifies a predominantly linear tropical ENSO-Tropical North Atlantic teleconnection, whereas the latter involves a highly non-linear and non-stationary ENSO-Euro-Atlantic teleconnection. Our results demonstrate NN4CAST’s potential to determine and quantify the influence of specific drivers on a target variable, offering a useful tool for improving climate predictability assessments. NN4CAST enables the attribution of predictions to specific input features, helping to identify the relative importance of different sources of predictability over time and space. In summary, NN4CAST offers a powerful framework to better characterize and understand complex, non-linear, and non-stationary remote climate interactions.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-3162', Anonymous Referee #1, 26 Aug 2025
- RC2: 'Comment on egusphere-2025-3162', Anonymous Referee #2, 09 Sep 2025
The manuscript presents NN4CAST, a Python framework intended to streamline seasonal predictability studies with deep learning. The pipeline covers data preprocessing (region/season selection, anomaly computation, trend removal), model construction with regularization, cross-validation and tuning, and interpretation via an XAI module and EOF analysis. Two case studies are used to illustrate skill: Pacific SST forcing of tropical North Atlantic (TNA) SST in boreal spring, and Pacific SST forcing of European autumn precipitation. The overall aim to facilitate testing sources of predictability and attributing predictions to input regions is very relevant to climate services, but the manuscript in its current form requires revision before it is suitable for publication.
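For orientation, a minimal Python sketch of the kind of preprocessing summarized above (region/season selection, anomaly computation, trend removal). This is purely illustrative and not the NN4CAST API; the coordinate names, region bounds, and months are placeholder assumptions.

```python
import xarray as xr
from scipy.signal import detrend

# Purely illustrative, not the NN4CAST implementation or API.
# Assumes `da` is an xarray DataArray of monthly means with dims (time, lat, lon).
def preprocess(da, months=(12, 1, 2), lat=slice(-30, 30), lon=slice(120, 290)):
    # 1. Region and season selection (bounds and months are placeholder values)
    da = da.sel(lat=lat, lon=lon)
    da = da.where(da["time"].dt.month.isin(list(months)), drop=True)
    # 2. Anomalies with respect to the monthly climatology
    clim = da.groupby("time.month").mean("time")
    anom = da.groupby("time.month") - clim
    # 3. Linear trend removal at every grid point (NaNs crudely set to zero first,
    #    since scipy's detrend does not handle missing values)
    return xr.apply_ufunc(
        detrend, anom.fillna(0.0),
        input_core_dims=[["time"]], output_core_dims=[["time"]],
    )
```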
General comments
- The first case study (DJF tropical “Pacific” predictors → MAM TNA SST) formally respects the lag, yet the predictor domain extends into the western tropical Atlantic. Given the well-known persistence of tropical Atlantic SST, even a narrow DJF Atlantic band can carry substantial memory into MAM and thus contribute to the high ACC shown in Fig. 2. In that sense, part of the reported skill may reflect local persistence rather than a Pacific-forced bridge. It would be helpful to clarify whether masking local Atlantic SST alters the ACC/RMSE/importance patterns (a minimal masking sketch is given after this list).
- In Fig. 3, the comparison between the regression composite for El Niño years (predicted TNA) and the importance composite should be improved. The fact that the attribution map contains cooling features does not, by itself, demonstrate added value, as mentioned in L280. It would help to explain how their sign and placement align with the atmospheric bridge and Wind–Evaporation–SST mechanism (e.g., stronger trades, surface heat-flux anomalies, wind-stress curl), and whether the lead–lag structure supports that interpretation. As presented, it is difficult to separate a genuine teleconnection signal from collinearity in the SST field or residual Atlantic persistence. Showing that attribution hotspots co-locate with observed flux/SLP/wind anomalies, and repeating the analysis with the Atlantic belt removed from the predictors, would clarify whether the cooling patterns reflect a physical mechanism or a model artifact.
- The manuscript often uses the language of “drivers,” yet the analysis is primarily associational. This matters for teleconnections, where shared low-frequency covariates can produce strong correlations without isolating a pathway. The discussion around XAI (e.g., Fig. 3) therefore could be improved. For both applications, it would be helpful to control for NAO variability and check how strongly it modulates the two teleconnections, and how this impacts the attribution maps. The recent literature arguing for causal-inference tools in teleconnection analysis points in this direction and could be useful (e.g. https://journals.ametsoc.org/view/journals/bams/102/12/BAMS-D-20-0117.1.xml).
- The manuscript describes the toolkit as “versatile,” yet for identifying dominant spatial modes it offers only EOF analysis of model outputs versus observations. For teleconnection work, this is a narrow diagnostic. At minimum, a versatile layer would include a menu of spatial-mode tools beyond EOF (e.g. maximal covariance or canonical correlation analysis for coupled patterns). Please consider either expanding the diagnostics accordingly or reframing the package as a DL-first pipeline with basic (EOF-based) spatial diagnostics.
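On the first point in the list above (masking local Atlantic SST), a minimal sketch of how a tropical Atlantic band could be removed from the predictor field before training. The coordinate names, the 0-360 longitude convention, and the box bounds are assumptions, not the values used in the manuscript.

```python
import xarray as xr

# Illustrative only: remove a tropical Atlantic band from the predictor SST field
# before training, so that any remaining skill must come from the other (Pacific) points.
def mask_tropical_atlantic(sst, lon_min=290.0, lon_max=360.0, lat_min=-30.0, lat_max=30.0):
    in_box = (
        (sst["lon"] >= lon_min) & (sst["lon"] <= lon_max)
        & (sst["lat"] >= lat_min) & (sst["lat"] <= lat_max)
    )
    return sst.where(~in_box)  # masked points become NaN and can be dropped or zero-filled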
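On the last point (spatial-mode diagnostics beyond EOF), maximal covariance analysis of two anomaly fields can be obtained from an SVD of their cross-covariance matrix. A minimal NumPy sketch, assuming `X` and `Y` are (time, space) anomaly matrices with missing points already removed:

```python
import numpy as np

def mca(X, Y, n_modes=3):
    """Maximal covariance analysis via SVD of the cross-covariance matrix.
    X: (nt, nx) predictor anomalies; Y: (nt, ny) predictand anomalies."""
    nt = X.shape[0]
    C = X.T @ Y / (nt - 1)                   # cross-covariance matrix (nx, ny)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U, V = U[:, :n_modes], Vt[:n_modes].T    # coupled spatial patterns for X and Y
    scf = s[:n_modes] ** 2 / np.sum(s ** 2)  # squared covariance fraction per mode
    ex, ey = X @ U, Y @ V                    # expansion coefficients (time series)
    return U, V, ex, ey, scf
```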
Specific comments
L1 (...) with the changes in tropical sea surface temperatures (SST) being (...)
L8 Please be more specific than writing “(...) performs all the methodological steps”.
L27 You already defined SST in the abstract. If you decide to define it again, please use lower case as in the abstract.
L75-77 Here you introduce the tool for the first time, after a long introduction on seasonal forecasting and ML. I suggest bringing up the goal/what’s new about your paper much earlier in the introduction, to help the reader situate themselves.
L86 I don’t think bringing up the possibility to combine NN4CAST with ESMValTool in the introduction is relevant. I think this could be mentioned in the conclusions/future work.
L89 Please give an example of such tools written in C/C++.
L92-103 I recommend not giving so many details of the applications to be analysed in this final introduction paragraph. Similarly, mentioning GitHub and code availability here seems misplaced.
L337 I think the sentence should be rewritten, as significant skill is not found over most of the European continent, but rather only in parts of it.
L375 and L380 are repeated
L380, L395 The authors mention that the tool has a primary application to identify windows of opportunity (WoO). However, in the two applications given, there was no framing related to WoO. I recommend improving the discussion towards the context of WoO.
L388-L389 I suggest focusing on more specific advantages offered by NN4CAST in your conclusions. “These complementary approaches offer valuable contributions to the scientific community and support the improvement of current seasonal forecasting systems” seems both vague and exaggerated. In particular, for the first application the authors did not go into depth to highlight any new insights concerning the teleconnection, but rather used it as an example to illustrate what the tool does.
Citation: https://doi.org/10.5194/egusphere-2025-3162-RC2
- RC3: 'Comment on egusphere-2025-3162', Anonymous Referee #3, 22 Sep 2025
The paper introduces NN4CAST, a Python-based framework designed to identify and investigate drivers of seasonal climate predictability. It shows that NN4CAST provides explainability by attributing predictions to specific regions of the chosen predictor field, thereby quantifying the relative importance of different sources of predictability.
The paper addresses an interesting problem and proposes a framework for understanding sources of predictability. However, the manuscript currently lacks details on the method and justification of key choices, as well as on the interpretation of XAI results, to make the framework truly useful for climate services and science. From the perspective of this reviewer, the framework as well as the examples chosen to illustrate its usefulness would benefit from some reconsideration prior to possible resubmission.
General comments
The method chosen to make the predictions is not discussed or justified in the paper. Why is an autoencoder architecture chosen in the first example? It should definitely be discussed whether this makes a difference to the regions identified by the XAI method. Given the short observational record and the non-stationarity of the teleconnections, can a deep learning approach always be justified compared to a regularized regression?
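For reference, a hedged sketch of the kind of regularized-regression baseline this comment refers to. It is illustrative only; the variable names, alpha grid, and skill metrics are assumptions, not part of the manuscript.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Hypothetical baseline: L2-regularized linear regression from flattened predictor
# SST anomalies (n_years, n_input_points) to flattened target anomalies
# (n_years, n_target_points), with the ridge penalty chosen by internal cross-validation.
def ridge_baseline(X_train, y_train, X_test, y_test):
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_hat) ** 2))
    # Rough skill estimate: correlation between predicted and observed anomalies
    acc = np.corrcoef(y_test.ravel(), y_hat.ravel())[0, 1]
    return y_hat, rmse, acc
```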
This reviewer agrees with the two other reviewers that the tropical Atlantic should not be included in the predictor region in the first example.
Parts of the paper read a lot like a Python package documentation rather than a method or framework description (for example lines 156-164, Table 1, Listing 1-3). Since the paper is presenting a framework and not a package, this reviewer thinks that they might be better suited in the Appendix or Supplementary Material. In particular, the paper contains no details or discussion on the choice of deep learning method, which should be included in the main text - perhaps at the expense of the code description.
In further agreement with the other reviewers, the results presented in Figure 3c and d do not seem particularly convincing to this reviewer, and do not seem to highlight the value of model-based attributions. In the eyes of this reviewer, the composite importances identified by the XAI method have very low amplitudes and do not show physically interpretable structure or coherence. How would the authors explain this? Furthermore, why are the data first filtered for El Niño events, and how is the threshold chosen?
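On the last question, a common (but not the only) convention is to composite over years whose DJF Niño3.4 SST anomaly exceeds +0.5 K; whether this is the criterion used in the manuscript is not stated. A minimal illustrative sketch:

```python
import numpy as np

# Illustrative convention only, not necessarily the threshold used in the paper.
def el_nino_composite(attribution_maps, nino34_djf, threshold=0.5):
    """attribution_maps: (n_years, n_lat, n_lon); nino34_djf: (n_years,) anomalies in K."""
    warm = np.asarray(nino34_djf) > threshold
    return np.asarray(attribution_maps)[warm].mean(axis=0), int(warm.sum())
```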
More specific comments
Line 8: What do the authors mean by the 'original files'? Especially since this is in the abstract, a more specific term should be chosen.
Line 59: It should be noted that this paragraph talks about AI models at weather timescales.
Line 67: "The use of DL models to assess seasonal forecast is not so common" - Aside from the spelling error, this statement is very vague. Given the vast emerging literature on deep learning for seasonal forecasting, examples should be cited here, or the sentence should more specifically say what DL models have not been used for.
Line 132: It would be valuable to state why this method is chosen over others.
Line 133: "This method addresses the issue of non-linear problems, where the derivative of the output with respect to the inputs is not constant." This sentence is a bit too vague and slightly misleading - other XAI methods address non-linear problems as well, and Integrated Gradients can be applied to linear problems as well.
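To make this point concrete, a minimal NumPy sketch of Integrated Gradients with a zero baseline. It is illustrative only: the gradient is approximated by central finite differences so it works for any callable model, and for a purely linear model the attribution reduces exactly to weight × input, showing that the method also applies to linear problems.

```python
import numpy as np

def integrated_gradients(model, x, baseline=None, steps=50, eps=1e-4):
    """Integrated Gradients along the straight path from a baseline to the input x.
    `model` is any callable mapping a 1-D array to a scalar; gradients are taken by
    central finite differences, so no autodiff framework is required."""
    x = np.asarray(x, dtype=float)
    baseline = np.zeros_like(x) if baseline is None else np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps      # midpoint rule along the path
    avg_grad = np.zeros_like(x)
    for a in alphas:
        point = baseline + a * (x - baseline)
        grad = np.zeros_like(x)
        for i in range(x.size):                    # finite-difference gradient
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = (model(point + e) - model(point - e)) / (2.0 * eps)
        avg_grad += grad / steps
    return (x - baseline) * avg_grad

# For a purely linear model, the attribution equals weight * (input - baseline):
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, -0.5])
print(integrated_gradients(lambda v: float(w @ v), x))  # equals w * x for a zero baseline
```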
Line 167: It is unclear to this reviewer what bullet point one intends to state. Furthermore, points 1-4 would be addressed by a regularized linear regression model as well - it would be valuable to include in this list why a deep learning approach is chosen here.
This reviewer is a non-English native speaker and appreciates the difficulties in writing in a second language. However, the paper would benefit from grammatical corrections, including but not limited to the following:
- Line 1: 'being the changes in tropical sea surface temperature the most influential drivers'
- Line 190 "By this way it avoids to introduce"
Citation: https://doi.org/10.5194/egusphere-2025-3162-RC3
Data sets
NN4CAST_manual Víctor Galván Fraile et al. https://doi.org/10.5281/zenodo.15682872
Model code and software
NN4CAST Víctor Galván Fraile https://doi.org/10.5281/zenodo.14011998
General Comments:
The authors provide a tool that may be utilized to research seasonal predictability using basic deep learning methods. The code library provides a pipeline to preprocess data, train the model, evaluate, and calculate some metrics/attributions, based on a user-defined namelist and input files. Although the model will not achieve state-of-the-art skill, it does have potential for mechanistic studies through explainable AI. However, I do not believe the manuscript in its current state effectively communicates this message.
1. The analysis of the teleconnection between DJF Pacific tropical SST and MAM tropical Atlantic SST, and the related evaluation of the model, is not valid, due to the region of the input predictor field, which includes parts of the western tropical Atlantic. Looking through the individual Integrated Gradients attribution samples on Zenodo, it is clear that the largest attributions are most often in this area, rather than in the tropical Pacific. This is also confirmed by calculating correlations between the area-averaged SST in the target WTNA or SMSCU region and the input SST field (a minimal sketch of this check is given after this list). This leads to unrealistically inflated skill in Figure 2, which is a result of the inclusion of the west Atlantic in the input fields, rather than of the Pacific-Atlantic teleconnection, as stated in the text (lines 266-267).
2. The discussion surrounding XAI in Figure 3 is unconvincing. Although the model attribution plot (Fig. 3c) shows more spatial variability than the simple regression (Fig. 3e), this does not necessarily mean there is added value. The work would benefit from further exploring the physical mechanisms associated with the Integrated Gradients attribution. There is not a clear connection between the spatial variance in Fig. 3c and the citation of Wade et al. (2023) in the text. How much does the attribution pattern change with different initial seeds? What is the sample size? Only a ~100-year record is being used, with even fewer El Niños, so I am skeptical of the robustness of the model attribution. Have you tried calculating attribution plots compositing on a warm WTNA or SMSCU, rather than on ENSO?
3. The analysis of European precipitation is useful for showing how the predictability varies between different periods. However, the regression analysis in Figure 6 is a little confusing, as you could perform the exact same regression with only observational data, yielding more faithful results and the same conclusion regarding ENSO and European precipitation. Figure 5 shows the model can reproduce some of the same trends as observations, but does not reveal any new insights not available from observations alone.
Similarly to the previous analysis, it does not seem that the model is directly capturing a connection between ENSO and European precipitation, based on the individual attribution plots on Zenodo, which mostly show that the model attributes importance to SST anomalies in the extratropical Pacific and Atlantic Oceans. It could be useful to look at the attribution plots for precipitation in skillful regions during 1942-1969: perhaps there is a change in the background state (e.g., the extratropical jet) that changes the propagation of the extratropical Rossby wave trains affecting European precipitation and thus its predictability.
4. In the introduction it is stated that “The idea behind NN4CAST is to mitigate the risk of treating deep learning methods as “black boxes”, thereby enabling users to identify sources of predictability and assess the sensitivity of predictions to variations in the training period and/or to the predictor region.” (line 80). However, the current manuscript does not really analyze the sensitivity to the training period or predictor region.
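On point 1 above, a minimal sketch of the persistence check described there: correlating the area-averaged SST in the target box with the predictor SST field at every grid point. The box bounds, coordinate names, and variable names (`sst_pred`, `sst_target`) are assumptions for illustration only.

```python
import xarray as xr

def point_field_correlation(field, index):
    """Pearson correlation of a 1-D index (time,) with an anomaly field (time, lat, lon)."""
    fa = field - field.mean("time")
    ia = index - index.mean("time")
    return (fa * ia).mean("time") / (fa.std("time") * ia.std("time"))

# Hypothetical usage: a WTNA box average correlated with the predictor SST field.
# wtna = sst_target.sel(lat=slice(0, 25), lon=slice(290, 345)).mean(["lat", "lon"])
# corr_map = point_field_correlation(sst_pred, wtna)
```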
Specific comments: