Decomposition of skill scores for conditional verification – Impact of AMO phases on the predictability of decadal temperature forecasts

Richling, Andy; Grieger, Jens; Rust, Henning W.

doi:https://doi.org/10.5194/egusphere-2023-2582

Preprints

22 Jan 2024

Abstract. We present a decomposition of skill scores for the conditional verification of weather and climate forecast systems. The aim is to evaluate the performance of such a system individually for predefined subsets with respect to the overall performance. The overall skill score is decomposed into: (1) the subset skill score, assessing the performance of a forecast system compared to a reference system for a particular subset; (2) the frequency weighting, accounting for varying subset size; (3) the reference weighting, relating the performance of the reference system in the individual subsets to the performance of the full data set. The decomposition and its interpretation are exemplified using a synthetic data set. Subsequently, we use it for a practical example from the field of decadal climate prediction: an evaluation of the Atlantic-European near-surface temperature forecast from the German initiative Mittelfristige Klimaprognosen (MiKlip) decadal prediction system conditional on different Atlantic Meridional Oscillation (AMO) phases during initialization. With respect to the chosen Western European North Atlantic sector, the decadal prediction system preop-dcpp-HR performs better than the un-initialized simulations, mostly due to performance gain during a positive AMO phase. Compared to the predecessor system (preop-LR), no overall performance benefits are made in this region, but positive contributions are achieved for initialization in neutral AMO phases. Additionally, the decomposition reveals a strong imbalance among the subsets (defined by AMO phases) in terms of reference weighting, allowing for sophisticated interpretation and conclusions. This skill score decomposition framework for conditional verification is a valuable tool to analyze the effect of physical processes on forecast performance and consequently supports model development and improvement of operational forecasts.
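
The three components named in the abstract can be illustrated with a minimal sketch. It is not taken from the manuscript: it assumes a negatively oriented score with a perfect value of 0 that is averaged over cases (for example the mean squared error), so that the skill score is 1 minus the ratio of forecast score to reference score. Under that assumption the overall skill score equals the sum over subsets of subset skill score times frequency weighting times reference weighting. The function name `decompose_skill_score` and the toy numbers are hypothetical.

```python
import numpy as np

def decompose_skill_score(fc_scores, ref_scores, subset_labels):
    """Split an overall skill score into per-subset terms.

    Assumes per-case scores with perfect value 0 (e.g. squared errors)
    and a reference score that is non-zero in every subset.
    """
    fc = np.asarray(fc_scores, dtype=float)
    ref = np.asarray(ref_scores, dtype=float)
    labels = np.asarray(subset_labels)
    ref_mean_all = ref.mean()
    terms = {}
    for k in np.unique(labels):
        mask = labels == k
        ref_mean_k = ref[mask].mean()
        terms[str(k)] = {
            # (1) skill of the forecast vs. the reference within the subset
            "subset_skill": 1.0 - fc[mask].mean() / ref_mean_k,
            # (2) share of all cases that fall into the subset
            "frequency_weight": mask.mean(),
            # (3) reference score in the subset relative to the full data set
            "reference_weight": ref_mean_k / ref_mean_all,
        }
    return terms

# Hypothetical per-case scores for two subsets of unequal size.
labels = ["pos", "pos", "neg", "neg", "neg", "neg"]
ref = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0]
fc = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0]

terms = decompose_skill_score(fc, ref, labels)
contributions = {
    k: t["subset_skill"] * t["frequency_weight"] * t["reference_weight"]
    for k, t in terms.items()
}
overall = 1.0 - np.mean(fc) / np.mean(ref)
print(contributions, overall)
```

In this toy case the smaller subset carries the whole overall skill, because the reference performs poorly there and its reference weighting is correspondingly large.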

Received: 02 Nov 2023 – Discussion started: 22 Jan 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: final response (author comments only)

RC1: 'Comment on egusphere-2023-2582', Jonas Bhend, 19 Feb 2024

Richling, Grieger and Rust present a framework for the decomposition of skill scores stratified in subsets. The authors introduce the terms reference and frequency weighting to characterize the contribution of the subset skill scores to the grand total. The manuscript is well written and the application to decadal forecasting is illustrative. The decomposition is a useful addition to the existing forecast verification literature; however, the authors should expand the discussion of the method. In particular, it is not clear whether the variability of the contributions is merely a consequence of the geometry of the problem (i.e. there being more room to improve when the reference forecast performs badly) or whether we can learn something in excess of this (see also comments below). If the authors are willing to address this minor issue, I fully support publication of the manuscript.

General comments:
While I appreciate the synthetic test cases as a motivating example to elaborate the specifics of the decomposition, I think the setup could be improved for better interpretability. The authors already provide some motivation by mentioning ‘what happens if we improve the forecast in subset 1’. Designing the synthetic cases more prominently along those lines would ease the interpretation. The synthetic setup could be altered to A/B0: base case, A/B1: improve SS1, A/B2: improve SS2. Also, I strongly suggest considering improvements of the score of equal (relative) size for better comparison of the effect. In particular, I would be interested to see whether the increased reference weighting is basically a consequence of there being more room to improve when the reference performs poorly. As such, a synthetic set of experiments with differing reference weighting, but the same relative improvement in the respective subsets, could be illustrative.
The summary in the summary and discussion part is redundant. I suggest removing or at least considerably shortening it, as it doesn’t add to the paper.
I encourage the authors to think of potential applications outside the domain of decadal forecasting to increase the appeal for readers outside of this community.
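
The equal-relative-improvement experiment suggested above can be sketched as follows. This is an illustration only, not part of the referee comment or the manuscript: it assumes a mean score with a perfect value of 0, two equally sized subsets, and a base case in which the forecast scores exactly like the reference. The helper `overall_skill` and all numbers are hypothetical.

```python
def overall_skill(fc_means, ref_means, sizes):
    """Overall skill score 1 - S_fc / S_ref from per-subset mean scores."""
    n = sum(sizes)
    fc_all = sum(s * f for s, f in zip(sizes, fc_means)) / n
    ref_all = sum(s * r for s, r in zip(sizes, ref_means)) / n
    return 1.0 - fc_all / ref_all

sizes = [50, 50]          # equal frequency weighting
ref_means = [4.0, 1.0]    # the reference performs worse in subset 1

# Base case: forecast equals reference, so the overall skill is 0.
base = overall_skill(ref_means, ref_means, sizes)

# Reduce the forecast score by the same relative amount (20 %) in one subset.
gain_subset_1 = overall_skill([0.8 * 4.0, 1.0], ref_means, sizes)  # about 0.16
gain_subset_2 = overall_skill([4.0, 0.8 * 1.0], ref_means, sizes)  # about 0.04

print(base, gain_subset_1, gain_subset_2)
```

With equal subset sizes, the ratio of the two gains equals the ratio of the reference scores in the two subsets, which is the purely arithmetic "room to improve" effect raised in the comment.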

Minor comments:
L3: Providing some more context with an illustrative example at the start of the abstract would improve readability.
L180ff: This implies that we benefit more from improvements in subsets for which the reference performs badly. With a mildly skillful reference, the reference score basically measures inherent predictability. Consequently, the above means that we profit more from improvements in situations with limited predictability. If this can be supported, it would imply that we should focus more on subsets that are hard to predict if we want to improve skill in general. This seems contrary to what is usually done, i.e. exploiting situations with relatively high predictability (and plausible hypotheses on drivers) and trying to improve predictions there. The authors mention in the conclusion that the focus should not only be on AMO+ situations. Maybe the predictability angle could provide some more grounds for the discussion of the implications of the decomposition.
Figure 3: This is mildly confusing because different labels are used compared with the tables. To improve readability, the corresponding points could be labelled with A0, A1, A2, … and the labels could be replaced with Table A,B instead of Cases A,B (vertical lines) and Case A0/B0, A1/B1, A2/B2 instead of Case 1,2,3 (facets) for clarity.
Figure 4: This figure feels somewhat redundant, maybe the contribution could be integrated with Tables A/B or Figure 1 for clarity.
L257: Are scores indeed computed on 4 yearly average temperatures (period 2-5 years), or are scores computed on monthly mean temperatures as specified in L242 and aggregated for the lead times 2-5 years?

Editorial comments:
L4: The aim is to …
L18: is the comparison against another competing prediction system or a standard reference forecast such as the persistence or climatological forecast.
L24: … and the continuous ranked probability skill score (CRPSS) for probabilistic forecasts are widely used decadal forecast verification (e.g., Kadow et al., 2016; Kruschke et al., 2016; Pasternack et al., 2018, 2021).
L112: adjusts
L139: the mean scores of the forecast systems differ
L140: mean scores of the reference system
L179: in the same way (or in a similar way)
L189: Generally
L241: against monthly mean temperatures from the HadCRUT4 observation data set (Morice et al., 2012). [ Also I suggest to refer to the obs data as HadCRUT4 consistently throughout (e.g. L255). ]
L251: annual averages
L289: with significant values patches with positive but non-significant skill are visible …
L307: uninitialized reference is not influenced …
L393: quite small

Citation: https://doi.org/10.5194/egusphere-2023-2582-RC1
RC2: 'Comment on egusphere-2023-2582', Anonymous Referee #2, 26 Feb 2024

Please see report attached

Citation: https://doi.org/10.5194/egusphere-2023-2582-RC2

Viewed

Total article views: 312 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
235	55	22	312	17	16

Views and downloads (calculated since 22 Jan 2024)

Month	HTML	PDF	XML	Total
Jan 2024	70	11	2	83
Feb 2024	51	14	5	70
Mar 2024	23	6	0	29
Apr 2024	27	7	6	40
May 2024	14	7	2	23
Jun 2024	45	6	5	56
Jul 2024	5	4	2	11

Latest update: 26 Jul 2024

Short summary

The performance of weather and climate prediction systems is variable in time and space. It is of interest how this performance varies in different situations. We provide a decomposition of a skill score (a measure of forecast performance) as a tool for detailed assessment of performance variability to support model development or forecast improvement. The framework is exemplified with decadal forecasts to assess the impact of different ocean states in the North Atlantic on temperature forecasts.


1 citation as recorded by Crossref.