This work is distributed under the Creative Commons Attribution 4.0 License.
Improving seasonal predictions of German Bight storm activity
Abstract. Extratropical storms are among the major coastal hazards along the coastline of the German Bight, the southeastern part of the North Sea, and a major driver of coastal protection efforts. However, the predictability of these regional extreme events on a seasonal scale is still limited. We therefore improve the seasonal prediction skill of the Max Planck Institute Earth System Model (MPI-ESM) large-ensemble decadal hindcast system for German Bight storm activity (GBSA) in winter. We define GBSA as the winter 95th percentile of three-hourly geostrophic wind speeds, which we derive from mean sea-level pressure (MSLP) data. The hindcast system consists of an ensemble of 64 members, which are initialized annually in November and cover the winters of 1960/61–2017/18. We consider both deterministic and probabilistic predictions of GBSA, for both of which the full ensemble produces poor predictions in the first winter. To improve the skill, we observe the state of two physical predictors of GBSA, namely 70 hPa temperature anomalies in September and 500 hPa geopotential height anomalies in November, in areas where these predictors are correlated with winter GBSA. We translate the state of these predictors into a first guess of GBSA and remove ensemble members whose GBSA prediction lies too far away from this first guess. The resulting subselected ensemble exhibits significantly improved skill in both deterministic and probabilistic predictions of winter GBSA. We also show how this skill increase is associated with better predictability of large-scale atmospheric patterns.
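As an illustration of the storm-activity measure described in the abstract, the sketch below computes a geostrophic wind speed field from gridded MSLP and reduces it to a winter 95th percentile. This is a minimal sketch under assumed variable names and a regular latitude–longitude grid; the paper itself derives geostrophic winds from MSLP, and its exact procedure (e.g., pressure differences over a set of locations around the German Bight) may differ:

```python
import numpy as np

RHO_AIR = 1.25      # near-surface air density (kg m^-3), a common approximation
OMEGA = 7.292e-5    # Earth's rotation rate (s^-1)
R_EARTH = 6.371e6   # Earth radius (m)

def geostrophic_speed(mslp, lat, lon):
    """Geostrophic wind speed (m s^-1) from one MSLP field (Pa) on a
    regular lat-lon grid (degrees). Valid away from the equator, where
    the Coriolis parameter f vanishes."""
    f = 2.0 * OMEGA * np.sin(np.deg2rad(lat))             # Coriolis parameter
    dy = np.deg2rad(np.gradient(lat)) * R_EARTH           # meridional spacing (m)
    dx = (np.deg2rad(np.gradient(lon)) * R_EARTH
          * np.cos(np.deg2rad(lat))[:, None])             # zonal spacing (m)
    dp_dy = np.gradient(mslp, axis=0) / dy[:, None]       # pressure gradient (Pa m^-1)
    dp_dx = np.gradient(mslp, axis=1) / dx
    u_g = -dp_dy / (RHO_AIR * f[:, None])                 # geostrophic components
    v_g = dp_dx / (RHO_AIR * f[:, None])
    return np.hypot(u_g, v_g)

def winter_gbsa(mslp_3hourly, lat, lon):
    """Storm activity for one winter: the 95th percentile of all
    three-hourly geostrophic wind speeds (here one value per grid
    point; the paper reduces this to a single regional value)."""
    speeds = np.stack([geostrophic_speed(p, lat, lon) for p in mslp_3hourly])
    return np.percentile(speeds, 95, axis=0)
```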
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2023-2676', Anonymous Referee #1, 10 Jan 2024
The paper investigates and improves the seasonal prediction skill for German Bight storm activity (GBSA) in a large-ensemble decadal hindcast system. This system is based on the model MPI-ESM-LR and consists of 64 yearly initialized members for the period 1960-2018. The authors use two physical predictors of GBSA to make a first guess of the GBSA state and select ensemble members close to this first guess. The sub-selected ensemble shows significantly improved prediction skill for winter GBSA.
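As a concrete illustration of the subselection summarized above, here is a minimal sketch under the assumption of a simple absolute-distance criterion (the paper's exact definition of "closest" is queried in the comments below; all names are illustrative):

```python
import numpy as np

def subselect_members(member_gbsa, first_guess, k=25):
    """Keep the k ensemble members whose predicted winter GBSA lies
    closest to the predictor-based first guess. Absolute distance is
    an assumption, not necessarily the authors' exact criterion."""
    member_gbsa = np.asarray(member_gbsa)
    order = np.argsort(np.abs(member_gbsa - first_guess))
    keep = order[:k]
    return keep, member_gbsa[keep].mean()

# Illustration with 64 stand-in members (random numbers, not real data)
rng = np.random.default_rng(42)
members = rng.normal(size=64)
idx, subselected_mean = subselect_members(members, first_guess=0.8)
```

Keeping the k closest members rather than a single best analogue retains an ensemble, so probabilistic skill measures remain applicable to the subselection.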
The paper covers an interesting and relevant topic. It is well written and clearly structured. Apart from some comments below, I can recommend the paper for publication and I feel it will provide a useful contribution to the field.
Comments:
- The paper lacks some basic details and references on, among others, storm activity and its variability, as well as decadal prediction systems (see for example your own introduction in Krieger et al., 2022). I understand that you are trying to tell a different story than in Krieger et al. (2022), but you should not expect the reader to know your previous publication.
- There are several relevant publications from the Met Office colleagues on seasonal predictions, for example Athanasiadis et al. (2017), Scaife et al. (2014), or Scaife et al. (2016). Please include and discuss some of them in your paper.
- Can you elaborate a bit why you specifically use T70 and Z500 as predictors for GBSA? Is there any reference for this choice?
- How do you define which ensemble members have a GBSA that is closest to the first guess?
- Based on line 170, it looks like the ensemble has 80 members, but you only use 64 of them. Please explain this.
- L268-269: Can you please elaborate on why the perfect test includes future information and why this cannot be used operationally?
- Outlook: How could your approach be used in future studies?
References:
Athanasiadis et al. (2017): A Multisystem View of Wintertime NAO Seasonal Predictions. Journal of Climate, https://doi.org/10.1175/JCLI-D-16-0153.1
Scaife et al. (2014): Skillful long-range prediction of European and North American winters. Geophys. Res. Lett., https://doi.org/10.1002/2014GL059637
Scaife et al. (2016): Seasonal winter forecasts and the stratosphere. Atmos. Sci. Lett., https://doi.org/10.1002/asl.598
Citation: https://doi.org/10.5194/egusphere-2023-2676-RC1
AC1: 'Reply on RC1', Daniel Krieger, 26 Feb 2024
We, the authors, sincerely thank Reviewer #1 for their insightful and valuable comments and suggestions on our manuscript. The comments greatly helped us to improve the manuscript and to resolve unclear passages.
Please find enclosed our response to the reviewer's comments.
RC2: 'Comment on egusphere-2023-2676', Lisa Degenhardt, 23 Jan 2024
Review of egusphere-2023-2676 "Improving seasonal predictions of German Bight storm activity"
by Lisa Degenhardt
Summary:
The paper investigates the seasonal predictability of storm activity in the German Bight. The authors use decadal prediction data because their method requires a large ensemble size. The method follows a first-guess approach: two selected predictors (T70 and Z500) are used to select the ensemble members whose GBSA is closest to the first guess. A combination of both predictors shows the strongest increase in GBSA skill. Different skill measures are tested, but all roughly indicate that 25 members is the best subselection size.
I think this paper covers a very interesting topic and presents a method that seems very useful. In my opinion it is very well placed in NHESS, but it may need a few minor corrections or adjustments in the description of the method, as I still have a few questions after reading the paper.
a) I have a few questions about the methods. The method itself seems alright; it is more the description in the method section where I would like to see some more details for a better understanding:
L95: My comment does not quite fit here, but maybe in the introduction or discussion; I only realised it at this line. Why are you choosing exactly these predictors? It would be nice to have a bit more detail on why you chose them, with references, and on why T70 is taken from September.
L101-Eq. 1: I am a bit confused by this paragraph. Also, is np = nx?
L117-120: I think this part is written a bit confusingly, or at least it confused me. I believe it explains what Fig. 1 shows, which I did understand, but it could use some more clarity.
b) Chapter 2.5: How can you train with ERA5 data, but use the decadal data as predictor? And what are you training for?
c) I think the Introduction, but especially the Discussion, is missing some external references. I added a personal note at the bottom, which is just a suggestion, as I have worked in a similar field. But in general, I would like to see a bit more referencing of other studies.
Major corrections:
L198: I thought initialisation is in November (see abstract), which I assumed to be around the 1st, and that you then predict core winter (DJF). Why "end of November" now?
Fig. 6 & 7: I don't understand the difference between Fig. 6b, e & h and Fig. 7b, e & h. I thought the sub-selected ensemble consists of the 25 closest members? So for me both would be the same, but they aren't.
L308: I understand what a “training period” is, but when I reached this part of the paper, I realised that I don’t know why you needed a training period at all.
L315: I could have missed it in the paper, but where is this result coming from?
Minor corrections/suggestions:
L123: you only have 2 predictors, so you can say “both” instead of “multiple”, right?
L134: I found this sentence written in an unnecessarily confusing way. I do understand it and it is correct, but maybe it is easier to say: "ACC values of 1 indicate a perfect correlation, 0 no correlation, and -1 a perfect anticorrelation."
L152: Shouldn’t Fi be Oi?
L170: What are the 17-80? Are these IDs of members? If yes, why not using the first 16?
Fig 2&3: Is there a specific reason why the dots have a white centre? The contrast between white and black makes the dots quite blurry on my screen; maybe try making them fully black. Also, in Fig. 2 I believe there is a significant area in the positive blob over eastern Russia; maybe you could increase the density of the dots so that at least one or two dots become visible there?
L213/214: I would add here that the first sentence refers to the correlation measure, even though 25 members turns out to be right for all measures in the end: "The optimal sample size is found at 25 members per predictor for correlations (r = 0.64)."
L219: What is Z500,sep now? I thought T70 is used from September.
L221: Maybe add an "individually" to make clear that you are now talking about the sensitivity of each predictor alone.
L265: Is “perfect test” and “perfect ensemble” here the same? I think I would stick with one, maybe perfect ensemble?!
L281: I believe “stark” is supposed to be “strong”
Personal Point:
I did a similar study on the seasonal prediction skill of European (wind-)storms and on what could improve their skill. There are some publications available about that, about other aspects of seasonal predictability, and about influencing factors such as atmospheric drivers. It would be nice to see a bit more of these studies in either the Introduction or the Discussion. There is no need to use mine, but as we both look at storm activity over Europe, I wanted to mention it at least: Degenhardt, L., Leckebusch, G.C. & Scaife, A.A. Large-scale circulation patterns and their influence on European winter windstorm predictions. Clim Dyn 60, 3597–3611 (2023). https://doi.org/10.1007/s00382-022-06455-2
Citation: https://doi.org/10.5194/egusphere-2023-2676-RC2
AC2: 'Reply on RC2', Daniel Krieger, 26 Feb 2024
We, the authors, sincerely thank Lisa Degenhardt for her insightful and valuable comments and suggestions on our manuscript. The comments greatly helped us to improve the manuscript and to resolve unclear passages.
Please find enclosed our response to the reviewer's comments.
RC3: 'Comment on egusphere-2023-2676', Anonymous Referee #3, 24 Jan 2024
Review of “Improving seasonal predictions of German Bight storm activity” by Krieger et al.
Summary:
The study makes use of a large ensemble of decadal predictions that are initialized in late autumn and can therefore be used as seasonal predictions of storminess for the subsequent winter months. While the predictive skill of the full ensemble is rather low, the authors demonstrate that subsampling around 25 members allows for a clear improvement of the skill. These members are constrained through the best fit to the observed precursor variables T70 and Z500 prior to the prediction. The authors provide both statistical and physical reasoning to justify the subsampling, which is optimized by evaluating how the skill depends on the various choices made.
General assessment:
The study presents an impressive improvement in the prediction skill and is generally well written and based on sound choices that are explained and tested in a convincing way. Based on testing the dependency of their skill on the sample size as well as the effect of single versus combined use of two predictor variables, an optimum skill is reached with subsampling around 25 members. The “perfect subselection test” is particularly useful as it clarifies that even the optimal selection of the best match to the observed state, which would otherwise be unknown for a real prediction, falls short of being highly realistic owing to general differences between observations and a physical-dynamical prediction system. A potential way out here might be machine learning based predictions in future studies.
Overall, the study is highly relevant and may lead to further comparable attempts to improve ensemble predictions. I recommend publication after addressing a couple of points below.
Major comments:
The study should provide a bit more historical context for its subsampling idea in the introduction. Also, the Discussion section does not refer to any other studies.
Historical context: Conceptually, the approach returns to an old idea of "analogue forecasting" of future states, which can be constrained based on their similarity to meaningful predictors of the preceding observed state (cf. Lorenz 1969; Barnett et al. 1978). As demonstrated earlier for weather prediction (van den Dool 1994) or weather field reconstructions (Schenk & Zorita 2012), the skill will depend on various choices that Krieger et al. test here for a large-ensemble prediction system. In particular, the skill of the subsampling approach will also depend on the number of spatial degrees of freedom of the predictor and target variables.
Statistical context: In summary, the authors make largely appropriate efforts to provide robust statistical significance testing, both for the identification of predictors and for the resulting prediction skill. While the authors account for temporal autocorrelation using block-bootstrapping, I have some concerns that their locally significant results could be randomly significant owing to the potentially very low number of spatial degrees of freedom, and hence the very large spatial autocorrelation, of fields like T70 and Z500. A quick field significance test is suggested below (cf. Livezey & Chen, 1983; Wilks, 2006), which would, however, not change the final prediction skill results of this study. In this historical context, the study is even more remarkable, as quite good skill is achieved using spatially rather homogeneous predictors.
Minor comments:
Line 28: You directly jump here to the concept of using specific physical predictors to help reduce the spread of model predictions and increase prediction skill. You later use this idea to improve the subsampling of those ensemble members that are most similar to an initial predictor state. It might be worthwhile to briefly mention some of the historical context noted above: this idea is very similar to analogue forecasting attempts already made in the 1970s to predict future weather (Lorenz 1969) or short-term climate fluctuations (Barnett et al. 1978) based on analogues that proceed from the present state to estimate future states. Also, the idea of predicting unknown full field states from incomplete low-order predictors via analogues was successfully applied in reconstructions (e.g., Schenk & Zorita 2012). Interestingly, in that study the skill improvement was tested in a very similar way as done here, regarding the dependency on the number of predictors, the use of multivariate predictors, and benchmarking against an idealized model-dependent prediction skill.
Lines 97-99: It is a bit unclear how statistical significance is derived here. Fields of T70 and Z500 tend to have a very large spatial autocorrelation (a low spatial degree of freedom). It is quite likely that far more than 5%, easily up to more than 20%, of the locally significant grid cells could be randomly significant globally. To be sure, you could test the global significance by estimating 1000 bootstrapped correlations of these fields with the ERA5 fields to evaluate how many grid cells are randomly significant (Wilks 2006). If your T70 and Z500 predictor fields versus the ERA5 fields yield more locally significant grid cells than the randomly significant correlations, you can claim to use globally or regionally significant predictor fields. It should be noted that even if this test fails, predictor fields may still provide predictive skill if the locally significant areas are physically meaningful (e.g., linked to Rossby waves, NAO, etc.). It looks like this is the case in your study. You could add a plot of the global field with random correlations and locally significant areas, and provide the test quantity "overall % = n significant grids x 100 / N total grids", which would give more context to Figs. 2 and 3 in the main text. Based on the constraints (line 113), "global" in this study could also mean regionally 30-90°N. Based on Fig. 2, this area may not be significant overall but appears to provide coherent regions of locally significant correlations. Fig. 3 might show a higher fraction of significant correlations due to wave propagation (N=4 areas) plus the Arctic, i.e. N=5 areas of predictive skill in total, but perhaps only N=2 independent predictors?
Lines 102-106: Here, the issue of random significance becomes obvious from the use of gridpoint-wise testing at a 95% local confidence level. The selection process is correct locally but may not yield regionally or globally significant field results for full fields with low spatial degrees of freedom. I would not change this procedure here, but just add the small test described above and include a sentence on whether you use regionally or globally significant predictor fields or only locally significant results. Despite using fields, your predictor would then be local.
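As a concrete reading of the test suggested in the two comments above, here is a minimal Monte Carlo sketch in the spirit of Livezey and Chen (1983): shuffle the target series to break the predictor-target link while preserving the field's spatial autocorrelation, then compare the observed fraction of locally significant grid cells against the shuffled null distribution. All names and array shapes are assumptions, and plain shuffling ignores the serial correlation that block-shuffling would respect:

```python
import numpy as np

def field_significance(pred_field, target, n_shuffle=1000, seed=0):
    """Monte Carlo field significance after Livezey & Chen (1983):
    does the fraction of locally significant correlations exceed what
    spatially autocorrelated noise would produce by chance?
    pred_field: (ntime, ncell), target: (ntime,); names are assumed."""
    ntime = len(target)
    x = (pred_field - pred_field.mean(0)) / pred_field.std(0)
    r_crit = 1.96 / np.sqrt(ntime)   # large-sample 95% two-sided approximation
    rng = np.random.default_rng(seed)

    def frac_sig(y):
        yz = (y - y.mean()) / y.std()
        r = x.T @ yz / ntime                  # correlation with every grid cell
        return np.mean(np.abs(r) > r_crit)    # fraction locally significant

    observed = frac_sig(target)
    null = np.array([frac_sig(rng.permutation(target))
                     for _ in range(n_shuffle)])
    return observed, np.mean(null >= observed)   # field-level p value
```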
Lines 186-191: Perhaps mention here that the four significant areas in the northern extra-tropics represent the Rossby wave propagation in addition to teleconnections (Arctic Oscillation-like?) with the Arctic versus (sub-)tropical areas of Sahel and Indian Ocean. It is quite nice to see that these physically meaningful areas show up also statistically.
Lines 223-224: "purely coincidental". Not a coincidence at all. There is a direct relationship between the correlation coefficient and the RMSE when standardized observations and predictions are used as RMSE inputs (hence bias = 0). The RMSE is then a measure of the unexplained variation, which is inversely related to the explained variation, i.e. the square of the correlation coefficient (as can be seen in Fig. 4). Therefore, it is not purely coincidental that for both predictors the optimal sample sizes for RMSE and correlation are equal, but a consequence of the mathematical relationship between these two statistics. Please replace the sentence with the opposite statement.
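The relationship invoked here follows from a two-line calculation; a minimal LaTeX sketch for standardized observations $o_t$ and predictions $f_t$ (zero mean, unit variance):

```latex
% RMSE as a monotone function of the correlation r for standardized series
\mathrm{RMSE}^2
  = \frac{1}{n}\sum_{t=1}^{n} (f_t - o_t)^2
  = \frac{1}{n}\sum_t f_t^2 + \frac{1}{n}\sum_t o_t^2 - \frac{2}{n}\sum_t f_t o_t
  = 1 + 1 - 2r = 2(1 - r),
\qquad \mathrm{RMSE} = \sqrt{2(1 - r)} .
```

Since $\sqrt{2(1-r)}$ decreases monotonically in $r$, whichever sample size maximizes the correlation of the standardized series necessarily minimizes its RMSE, which supports the reviewer's point.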
Figure 5: Very good illustration and impressive result.
Lines 265-267: I generally like that test regarding the question of what the best selection of the 25 members would be when knowing the observed state. Here, you could have gone even further by evaluating the single best member per year out of the 64 members relative to ERA5 for 1960-2017/18. That would be a prediction-system-specific optimum of the ensemble initialised in late autumn for DJF, which could be compared to your "almost perfect test".
Line 282: Agree. I guess here you could mention the potential of machine learning methods.
Figure 8: How much do these composites differ from a first-year composite? Could the strong Antarctic difference be caused by a long-term trend in the model runs over time, rather than highlighting differences between the composites?
Line 299: The whole discussion section does not make any attempts to put results into context with other seasonal prediction studies (e.g., Kruschke et al. 2014; 2016 and many others). I see that some relevant studies were briefly discussed in Krieger et al. (2022) but not here. I suggest adding a paragraph or several sentences throughout chapter 4 where similarities and differences to other studies are discussed.
Lines 300-303: Although you’re using a decadal prediction system, does that really differ from using a seasonal prediction system in your specific case? The initialisation in November to predict DJF is pretty much what a seasonal prediction system would do.
Line 305: Most likely because the annual GBSA is dominated by the variation in winter (high correlation of high annual percentiles with high winter percentiles, i.e. same tail values)?
Line 308: “by two decades”
Line 321: Regarding NAO, perhaps AO would be more appropriate as mentioned above?
References:
Barnett, T. and Preisendorfer, R.: Multifield analog prediction of short-term climate fluctuations using a climate state vector, J. Atmos. Sci., 35, 1771–1787, doi:10.1175/1520-0469(1978)0352.0.CO;2, 1978.
Kruschke, T., Rust, H. W., Kadow, C., Leckebusch, G. C., and Ulbrich, U.: Evaluating decadal predictions of northern hemispheric cyclone frequencies, Tellus A, 66, 22830, https://doi.org/10.3402/tellusa.v66.22830, 2014.
Kruschke, T., Rust, H. W., Kadow, C., Müller, W. A., Pohlmann, H., Leckebusch, G. C., and Ulbrich, U.: Probabilistic evaluation of decadal prediction skill regarding Northern Hemisphere winter storms, Meteorol. Z., 25, 721–738, https://doi.org/10.1127/metz/2015/0641, 2016.
Livezey, R. E., & Chen, W. Y.: Statistical field significance and its determination by Monte Carlo techniques. Monthly Weather Review, 111, 46–59. https://doi.org/10.1175/1520-0493(1983)111<0046:SFSAID>2.0.CO;2, 1983.
Lorenz, E. N.: Atmospheric predictability as revealed by naturally occurring analogs, J. Atmos. Sci., 26, 639–646, doi:10.1175/1520-0469(1969)262.0.CO;2, 1969.
Schenk, F. and Zorita, E.: Reconstruction of high resolution atmospheric fields for Northern Europe using analog-upscaling, Clim. Past, 8, 1681–1703, https://doi.org/10.5194/cp-8-1681-2012, 2012.
van den Dool, H.: Searching for analogs, how long must we wait?, Tellus, 46A, 314–324, doi:10.1034/j.1600-0870.1994.t01-2-00006.x, 1994.
Wilks, D. S.: On “field significance” and the false discovery rate. Journal of Applied Meteorology and Climatology, 45, 1181–1189. https://doi.org/10.1175/JAM2404.1, 2006.
Citation: https://doi.org/10.5194/egusphere-2023-2676-RC3
AC3: 'Reply on RC3', Daniel Krieger, 26 Feb 2024
We, the authors, sincerely thank Reviewer #3 for their insightful and valuable comments and suggestions on our manuscript. The comments greatly helped us to improve the manuscript and to resolve unclear passages.
Please find enclosed our response to the reviewer's comments.
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 220 | 94 | 28 | 342 | 18 | 13 |
Sebastian Brune
Johanna Baehr
Ralf Weisse