Reduction of uncertainty in near-term climate forecast by combining observations and decadal predictions
Abstract. The implementation of adaptation policies requires seamless relevant information about near-term climate evolution, which remains highly uncertain due to the strong influence of internal variability. The recent development of approaches to improve near-term climate information by selecting members from large ensembles – based on their agreement with either observed or predicted sea surface temperature patterns – has shown promising results across timescales from weeks to decades. Here, we propose a new method to provide climate forecasts over Europe by combining information from both observations and decadal predictions through a two-stage member selection from ensembles of climate simulations. Several predictors are tested as observational metrics based on their influence on European climate variability at annual to decadal timescales. A retrospective evaluation over Europe demonstrates the added value of this method in reducing the spread of uncertainty stemming from both internal climate variability and model uncertainty. This method can outperform both historical simulations and decadal predictions in 5-, 10- and 15-year temperature forecasts of winter MED, as well as summer NEU and WCE. Significant skill improvements are visible for 10- and 15-year forecasts of winter Mediterranean surface temperature over land when using the North Atlantic Oscillation or the Atlantic Multidecadal Variability as predictors in the first selection. The optimal predictor varies by region and should be evaluated on a case-by-case basis. This improved regional climate information supports more targeted adaptation strategies for the coming decades.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-4463', Anonymous Referee #1, 24 Oct 2025
- RC2: 'Comment on egusphere-2025-4463', Anonymous Referee #2, 28 Nov 2025
Review of “Reduction of uncertainty in near-term climate forecast by combining observations and decadal predictions” by Rémy Bonnet et al.
This study introduces a method that constrains climate projections based on a combination of previous observations and initialised decadal predictions, to reduce the uncertainty in near-term climate projections (the introduction defines near-term as the next 10-20 years, although later the results only go up to 15 years). The authors demonstrate that this method reduces the spread of the projections, which they claim adds value. The analysis of a selection of skill scores for three European regions reveals a somewhat mixed picture.
The study is an interesting addition to the growing literature that combines observations, initialised predictions and climate projections to provide seamless climate prediction information. I think, however, that additional work is needed to clarify several points in the manuscript, and in particular the interpretation of the results.
Major comments
- Given the primary effect of the blending is a reduction in spread, while the error of the predictions is not significantly affected, it is not obvious to me that reducing the spread/uncertainty of the predictions necessarily ‘adds value’. This is only true if it enhances the reliability of the predictions, but there is a risk that reducing the spread makes the predictions overconfident. This should be tested by including a metric in the analysis that evaluates the reliability of the blended hindcasts.
- Statistical testing of differences or skill improvements. In Figure 3, the box plots of the absolute error largely overlap, with some minor variations in the different quantiles of the distribution. The authors interpret a reduced error from the constraint in some cases; however, given the large spread of the distributions, I doubt that this is a robust result. I therefore recommend applying a suitable test of statistical significance if these figures are used to interpret improvements. Similarly for the skill scores presented in Figure 4 – here too the authors should test which of the skill improvements are in fact significant.
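To make the first point above concrete, one simple reliability diagnostic is the spread-error (spread-skill) ratio of the blended hindcasts. The sketch below is purely illustrative – the array layout and names are assumptions, not the authors' setup:

```python
import numpy as np

def spread_error_ratio(hindcast, obs):
    """Spread-error ratio as a basic reliability diagnostic.

    hindcast : array of shape (n_starts, n_members), one ensemble per start date
    obs      : array of shape (n_starts,), verifying observations

    A ratio close to 1 suggests the ensemble spread is consistent with
    the error of the ensemble mean; values well below 1 point to an
    overconfident (over-constrained) ensemble.
    """
    ens_mean = hindcast.mean(axis=1)
    # average intra-ensemble spread across start dates
    spread = np.sqrt(hindcast.var(axis=1, ddof=1).mean())
    # RMSE of the ensemble mean against the observations
    rmse = np.sqrt(((ens_mean - obs) ** 2).mean())
    return spread / rmse
```

Rank histograms or reliability diagrams of the probabilistic forecasts would serve the same purpose.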
Specific comments
Line 18: Here and elsewhere in the manuscript it is claimed that the method can outperform decadal climate predictions. However it is not clear where this ‘outperforming’ is shown – presuming that the skill scores in e.g. Fig. 4 are calculated using the historical simulations as reference forecast (this is not explicit, but obviously it can only be either historical runs or decadal predictions as reference forecast here – so I wonder where the results using the other reference forecast are shown?)
Line 36: check wording, seems to imply that the internal variability as such is “resolved” as if it was a process that could be parametrized at low resolution. However, both coarse-scale and high-resolution models do simulate their model-specific internal variability.
Lines 36-37: providing some references to explain initialised predictions could be useful, e.g. Meehl et al.2021 (https://doi.org/10.1038/s43017-021-00155-x) or others
Line 53: Is it really “crucial” to have seamless climate information – cannot decisions also be made by using the best information available for different specific time scales? Can you provide a reference for the statement that it would be “crucial that climate information be seamless”?
Line 69-74: In this context, it may also be worthwhile to mention the recent study by Acosta et al. 2025 (https://doi.org/10.5194/esd-16-1723-2025), which developed seamless seasonal to multi-annual predictions using similar analogue-based methods.
Line 92: That is 70% of land in the CNRM-CM6-1 model, or do you use the (regridded) land fraction from each model (as the masks and land fractions may differ between models)?
Line 92: “enough grid points” not clear – enough for what?
Line 93: lead-time dependent?
Line 100 / Table 1: some of the models in the table provide several hindcast experiments, using different initialisation approaches. For reproducibility, please specify which experiments (e.g. which run identifier in ESGF) were used
Line 113-115: at which time scale (temporal averaging) did you calculate AMV, did you apply temporal smoothing? (often AMV calculations apply a filter over e.g. 11 years)
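(For context, a frequently used convention is the North Atlantic minus global-mean SST anomaly smoothed with an ~11-year running mean; the sketch below is only an illustrative definition and not necessarily what the manuscript uses.)

```python
import numpy as np

def amv_index(natl_sst_anom, global_sst_anom, window=11):
    """Illustrative AMV index: North Atlantic SST anomaly minus the
    global-mean SST anomaly, smoothed with a running mean.

    natl_sst_anom, global_sst_anom : annual-mean, area-averaged SST
        anomaly time series (1-D arrays of equal length)
    window : smoothing length in years (commonly around 11)
    """
    raw = natl_sst_anom - global_sst_anom  # remove the global-warming signal
    kernel = np.ones(window) / window
    # 'valid' keeps only the years with a full smoothing window
    return np.convolve(raw, kernel, mode="valid")
```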
Line 124: I did not think that Befort 2024 constrained based on spatial correlations?
Line 143: “[1-5]-yr” not clear – does this indicate a 5-year average, or 5 individual years?
Line 155/159: and elsewhere across the text, it seems your dataset acronyms are not consistent, e.g. BLEND_OBS (capital OBS) and BLEND_obs (lower-case obs) – or do these different writings refer to different things? Similar further down for HIST_hindcasts and HIST_Hindcasts. Please check consistency of terminology throughout the text and figures.
Line 153-160: it seems a bit confusing that in some cases 20 simulations are selected, and in other cases 30 simulations/members. What is the rationale for these different numbers? Overall, this description of the exact (admittedly complex) method seems hard to follow.
Line 164: does 90th percentile range mean the range between the 5th and the 95th percentile?
Line 164/165: I am not convinced that ±0.6 is robustly smaller (and indicates reduced uncertainty as claimed here) than ±0.62 in the line above, also considering that DEC has a smaller ensemble size.
Line 190 / Figure 1 caption: please check the sentence “The data used for the historical simulations and the hindcast simulations mixed all the models together (see section 2.1) are described in Tables 1 and 2.” – it seems grammatically off.
Line 199-201: unclear: 1967 is 27 years after 1940, when the ERA5 data starts. If using 20-year windows, why not start the hindcasts in 1960? Also, the DCPP hindcasts start in 1960, so it is unclear why you would decrease your already small sample size further?
Line 205: can you please specify: these years 1967-2000 are initialisation years – So 1-15 year forecast initialised in 2000 would go until 2014?
Line 206: it would be good to already mention here which predictions you are considering as reference forecast in the skill score calculations.
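For clarity, the choice of reference forecast enters the skill scores directly; as an illustrative reminder (not the authors' code), the MSSS can be written as:

```python
import numpy as np

def msss(fcst, ref, obs):
    """Mean squared error skill score against a chosen reference forecast.

    MSSS = 1 - MSE(forecast) / MSE(reference): it is positive only when
    the forecast beats the reference, so whether HIST or DEC is used as
    the reference determines how the results should be read.
    """
    mse_fcst = np.mean((fcst - obs) ** 2)
    mse_ref = np.mean((ref - obs) ** 2)
    return 1.0 - mse_fcst / mse_ref
```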
Line 229: which block size are you using for the bootstrapping? Given the auto-correlation in the data, it is important to use block bootstrapping (see e.g. Goddard et al. 2013, https://doi.org/10.1007/s00382-012-1481-2)
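As an illustration of such a test, a moving-block bootstrap of a skill difference could look like the sketch below (in the spirit of Goddard et al. 2013; block length, score function and array layout are assumptions, not a prescription):

```python
import numpy as np

def block_bootstrap_ci(score_fn, fcst, ref, obs, block=5, n_boot=1000,
                       ci=(2.5, 97.5), seed=0):
    """Moving-block bootstrap confidence interval for a skill difference.

    score_fn : callable(forecast, obs) -> scalar score (e.g. ACC or MSE)
    fcst, ref : forecast and reference-forecast series, shape (n_starts,)
    obs : verifying observations, shape (n_starts,)
    block : block length in start dates, chosen to exceed the
        autocorrelation time scale of the verification series
    """
    rng = np.random.default_rng(seed)
    n = len(obs)
    n_blocks = int(np.ceil(n / block))
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # draw block starting points with replacement, keeping blocks intact
        starts = rng.integers(0, n - block + 1, size=n_blocks)
        idx = np.concatenate([np.arange(s, s + block) for s in starts])[:n]
        diffs[b] = score_fn(fcst[idx], obs[idx]) - score_fn(ref[idx], obs[idx])
    # the improvement is significant (at the chosen level) if this
    # interval excludes zero
    return np.percentile(diffs, ci)
```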
Line 229: Please also specify how you are testing the significance of the other skill metrics (ACC, RPSS, CRPSS, MSSS). Several of the claims about improvements/added values are based on these skill scores, and also the distributions of spread and error, so these statements should be underpinned by suitable tests of the statistical significance of the added values.
Line 239: Please make sure that this methodological detail that analogues are selected solely based on 5-year hindcasts is specified in the Methods section 2.3. I may have missed this information there?
Line 241: Please briefly explain what you mean by “added value”, and how it is measured
Line 244-245: Is this earlier finding of the initialisation shock triggering El Niño events model-specific, or does it systematically affect the multiple decadal prediction systems that you are using? In other words, how relevant is this for your analysis?
Line 245: “any add values”?
Line 246: Not clear what exactly you refer to by “DEC poor scores”?
Line 242-248: I find this discussion rather confusing, seems to mix up several issues?
Line 265 / Figure 2 caption: should the panel reference for WCE be (c,d)?
Figures 2/3: it is very hard to see which of the box plots belongs to which dataset. Besides enlarging the legend for better readability, maybe also the choice of colours could be optimised to better distinguish/identify the specific datasets.
Line 274: remove comma after CRPSS; and “in” should be added after winter (in NEU)
Line 275: “better” in comparison to what?
Line 285: winter “in” WCE?
Line 285: for statements like “perform better than HIST”, etc., please apply a suitable significance test to avoid interpreting noise
Line 297: all the datasets (plural)
Line 307: all leadtimes (plural)
Line 321: I am sorry, I do not see a clear error reduction in Figure 3e. The box plots pretty much overlap, any slight shifts may be noise.
Line 335: for “tailoring the choice of observational predictors” – so which predictor do you recommend for which region?
Fig. 4: Please specify in the caption which reference forecast was used in the skill score calculations? If this figure uses historical simulations as reference(?), did I possibly miss a figure that uses decadal predictions as reference forecast – as several places in the text seem to discuss improvements over the decadal predictions?
Fig. 4 and related discussion in the text: Can you say something systematic of whether (or how often) the blending improves skill over just the HIST constraint? Also can you highlight for which regions in which seasons the constraint is particularly beneficial, and when not? The current description of these results reads fairly repetitive, and it is hard to grasp the key findings.
Line 343 / heading 3.2: the maps discussed here seem to show all of Europe, not just the MED region?
Line 350: “datasets” here refers to the HIST/DEC and the blended simulations?
Line 354: “significant skill improvement” – how is the significance indicated? Similar question applies to line 356.
Line 359 – 370: is any of the results discussed here statistically significant?
Line 373-374: not clear, why does the similarity to the regional average time series degrade the skill here? I may be missing something to understand the discussed link?
Line 377/378: “compared to both the historical and hindcast ensembles” – not clear, which results show skill over the historical, and which show skill over the hindcast ensemble?
Also section 3.2 is very difficult to follow: discussing for each method little improvements (whether significant or not) over sub-regions makes the text appear repetitive, and it is hard to identify the really relevant features and messages. Can this be synthesised at higher level, for potential users to have clearer messages on specific benefits (or lack thereof) of the methods?
Figures 5 and 6 should highlight where the scores are significant. And ideally focus the discussion in the text on such significant features. Also please indicate in the figure caption which is the reference forecast for these skill calculations.
Line 403: but this information that you incorporate may be affected by the drift (despite applying basic correction for leadtime-dependent climatologies)?
Line 408: I am not sure I have seen the added value over hindcasts, presuming the skill scores in Fig. 4, 5, 6 are calculated against the historical simulations as reference forecast?
Line 408/9: “good added value” compared to what?
Line 412: again, I am not sure where to see the skill calculated against the hindcasts?
Line 415: the initialisation can also improve the representation of trends; not all improvements are necessarily related to internal variability.
Line 432: wording does not seem to make sense, please check: “the variable exhibits…associated drivers”? Do you mean drivers are identified, or similar?
Citation: https://doi.org/10.5194/egusphere-2025-4463-RC2
General comment
This paper attempts to derive improved prediction information on various time horizons by combining or “blending” subsetted information from historical CMIP simulations and initialized decadal prediction hindcasts with observational constraints. The authors illustrate how difficult it is to improve predictions in general, and how each region and quantity of interest needs its own combination of methods to improve skill. Since this is a methodology paper, the results of course depend on the quality of the method the authors formulate. As such, the paper stands as a testament to the difficulties and challenges involved with initialized Earth system predictability on regional scales and long leads. The authors do demonstrate improvements in skill with their methodology over some regions and seasons, which is encouraging, though the complexities of applying their method are somewhat daunting.
Detailed comments