This work is distributed under the Creative Commons Attribution 4.0 License.
Decomposition of skill scores for conditional verification – Impact of AMO phases on the predictability of decadal temperature forecasts
Abstract. We present a decomposition of skill scores for the conditional verification of weather and climate forecast systems. The aim is to evaluate the performance of such a system individually for predefined subsets with respect to the overall performance. The overall skill score is decomposed into: (1) the subset skill score, assessing the performance of a forecast system compared to a reference system for a particular subset; (2) the frequency weighting, accounting for varying subset size; and (3) the reference weighting, relating the performance of the reference system in the individual subsets to its performance over the full data set. The decomposition and its interpretation are exemplified using a synthetic data set. Subsequently, we apply it to a practical example from the field of decadal climate prediction: an evaluation of the Atlantic-European near-surface temperature forecast from the decadal prediction system of the German initiative Mittelfristige Klimaprognosen (MiKlip), conditional on the Atlantic Multidecadal Oscillation (AMO) phase during initialization. For the chosen Western European North Atlantic sector, the decadal prediction system preop-dcpp-HR performs better than the un-initialized simulations, mostly owing to performance gains during positive AMO phases. Compared to the predecessor system (preop-LR), no overall performance benefit is achieved in this region, but positive contributions are obtained for initializations in neutral AMO phases. Additionally, the decomposition reveals a strong imbalance among the subsets (defined by AMO phases) in terms of reference weighting, allowing for a more nuanced interpretation and conclusions. This skill score decomposition framework for conditional verification is a valuable tool for analyzing the effect of physical processes on forecast performance and consequently supports model development and the improvement of operational forecasts.
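To make the three components concrete, here is a minimal sketch of such a decomposition, assuming a negatively oriented score (e.g., squared error) and a skill score of the form SS = 1 - S_fc/S_ref whose mean scores average linearly over subsets; the function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np

def decompose_skill_score(score_fc, score_ref, subset_ids):
    """Decompose SS = 1 - mean(score_fc) / mean(score_ref) into
    per-subset contributions (a sketch; names are illustrative).

    score_fc, score_ref : per-case scores of the forecast and the
                          reference system (negatively oriented)
    subset_ids          : subset label for each case (e.g., AMO phase
                          at initialization)
    """
    s_ref = np.mean(score_ref)              # mean reference score, full set
    parts = {}
    for k in np.unique(subset_ids):
        m = subset_ids == k
        s_fc_i = np.mean(score_fc[m])       # mean forecast score, subset k
        s_ref_i = np.mean(score_ref[m])     # mean reference score, subset k
        ss_i = 1.0 - s_fc_i / s_ref_i       # (1) subset skill score
        f_i = np.mean(m)                    # (2) frequency weighting N_i/N
        w_i = s_ref_i / s_ref               # (3) reference weighting
        parts[k] = {"SS_i": ss_i, "f_i": f_i, "w_i": w_i,
                    "contribution": f_i * w_i * ss_i}
    return parts
```

Under these assumptions the contributions f_i * w_i * SS_i sum exactly to the overall skill score 1 - S_fc/S_ref, since the frequency-weighted subset means recover the full-set means.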
Status: closed
- RC1: 'Comment on egusphere-2023-2582', Jonas Bhend, 19 Feb 2024
Richling, Grieger and Rust present a framework for the decomposition of skill scores stratified into subsets. The authors introduce the terms reference and frequency weighting to characterize the contribution of the subset skill scores to the grand total. The manuscript is well written, and the application to decadal forecasting is illustrative. The decomposition is a useful addition to the existing forecast verification literature; however, the authors should expand the discussion of the method. In particular, it is not clear whether the variability of the contributions is merely a consequence of the geometry of the problem (i.e., there being more room to improve when the reference forecast performs badly) or whether we can learn something beyond this (see also the comments below). If the authors are willing to address this minor issue, I fully support publication of the manuscript.
General comments:
While I appreciate the synthetic test cases as a motivating example for elaborating the specifics of the decomposition, I think the setup could be improved for better interpretability. The authors already provide some motivation by mentioning 'what happens if we improve the forecast in subset 1'; designing the synthetic cases more prominently along those lines would ease the interpretation. The synthetic setup could be altered to A/B0: base case, A/B1: improve SS1, A/B2: improve SS2. Also, I strongly suggest considering improvements of the score of equal (relative) size for better comparison of the effect. In particular, I would be interested to see whether the increased reference weighting is basically a consequence of there being more room to improve when the reference performs poorly. As such, a synthetic set of experiments with differing reference weightings but the same relative improvement in the respective subsets could be illustrative (see the sketch below).
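A numerical sketch of the experiment proposed above, with purely illustrative numbers: two equally frequent subsets whose reference scores differ, each in turn receiving the same improvement of its subset skill score:

```python
import numpy as np

# Illustrative setup: two equally frequent subsets; the reference
# (negatively oriented score) performs poorly in subset 0, well in subset 1.
f = np.array([0.5, 0.5])           # frequency weighting N_i/N
s_ref = np.array([2.0, 0.5])       # subset means of the reference score
w = s_ref / np.sum(f * s_ref)      # reference weighting: [1.6, 0.4]

for i in (0, 1):
    d_ss = np.zeros(2)
    d_ss[i] = 0.1                  # identical improvement of SS_i
    gain = np.sum(f * w * d_ss)    # resulting change of the overall SS
    print(f"improving subset {i}: overall skill score gain = {gain:.3f}")
# prints 0.080 for subset 0 and 0.020 for subset 1: the same subset-level
# improvement pays off more where the reference performs poorly.
```

With the frequencies held fixed, the leverage of a subset is exactly its reference weighting, which isolates the geometric effect in question.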
The summary in the summary and discussion part is redundant. I suggest removing it, or at least considerably shortening it, as it does not add to the paper.
I encourage the authors to think of potential applications outside the domain of decadal forecasting to increase the appeal for readers outside of this community.
Minor comments:
L3: Providing some more context with an illustrative example at the start of the abstract would improve readability.
L180ff: This implies that we benefit more from improvements in subsets for which the reference performs badly. With a mildly skillful reference, the reference score basically measures inherent predictability. Consequently, the above translates to: we profit more from improvements in situations with limited predictability (see the algebraic note below). If this can be supported, it would imply that we should focus more on subsets that are hard to predict if we want to improve skill in general. This seems contrary to what is usually done, i.e., exploiting situations with relatively high predictability (and plausible hypotheses on drivers) and trying to improve predictions there. The authors mention in the conclusion that the focus should not only be on AMO+ situations. Maybe the predictability angle could provide some more grounds for the discussion of the implications of the decomposition.
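The geometric part of this argument can be written down directly. Assuming a decomposition of the form SS = sum_i f_i w_i SS_i with a negatively oriented score, improving only subset i changes the overall skill score by

```latex
\Delta \mathrm{SS} \;=\; f_i \, w_i \, \Delta \mathrm{SS}_i ,
\qquad
w_i \;=\; \frac{\bar{S}_i^{\mathrm{ref}}}{\bar{S}^{\mathrm{ref}}} ,
```

so for a fixed subset-level improvement the overall gain scales with the reference weighting, which is largest where the mean reference score is largest, i.e., where the reference performs worst. Whether the decomposition tells us anything beyond this geometry is then precisely the question raised above.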
Figure 3: This is mildly confusing because the labels differ from those used in the tables. To improve readability, the corresponding points could be labelled A0, A1, A2, …, the vertical lines could be labelled Table A/B instead of Cases A/B, and the facets Case A0/B0, A1/B1, A2/B2 instead of Case 1, 2, 3.
Figure 4: This figure feels somewhat redundant, maybe the contribution could be integrated with Tables A/B or Figure 1 for clarity.
L257: Are scores indeed computed on 4-yearly average temperatures (period of years 2-5), or are they computed on monthly mean temperatures, as specified in L242, and aggregated over lead times of 2-5 years?
Editorial comments:
L4: The aim is to …
L18: is the comparison against another competing prediction system or a standard reference forecast such as the persistence or climatological forecast.
L24: … and the continuous ranked probability skill score (CRPSS) for probabilistic forecasts are widely used decadal forecast verification (e.g., Kadow et al., 2016; Kruschke et al., 2016; Pasternack et al., 2018, 2021).
L112: adjusts
L139: the mean scores of the forecast systems differ
L140: mean scores of the reference system
L179: in the same way (or in a similar way)
L189: Generally
L241: against monthly mean temperatures from the HadCRUT4 observation data set (Morice et al., 2012). [ Also I suggest to refer to the obs data as HadCRUT4 consistently throughout (e.g. L255). ]
L251: annual averages
L289: with significant values patches with positive but non-significant skill are visible …
L307: uninitialized reference is not influenced …
L393: quite small
Citation: https://doi.org/10.5194/egusphere-2023-2582-RC1
- RC2: 'Comment on egusphere-2023-2582', Anonymous Referee #2, 26 Feb 2024
- AC1: 'Comment on egusphere-2023-2582', Andy Richling, 01 Aug 2024
We thank the two reviewers for carefully reading our article and providing constructive feedback, and we apologize for the late reply. We have revised the manuscript to account for their suggestions and insightful thoughts; the useful feedback helped to improve the manuscript's quality. As a major change, we re-designed and condensed the section describing the synthetic cases and adapted the structure of the skill score decomposition. The detailed responses are provided in the attached PDF file.
Viewed
- HTML: 325
- PDF: 124
- XML: 28
- Total: 477
- BibTeX: 20
- EndNote: 19