the Creative Commons Attribution 4.0 License.
A barycenter-based approach for the multi-model ensembling of subseasonal forecasts
Abstract. Ensemble forecasts and their combination are examined from the perspective of probability spaces. Manipulating ensemble forecasts as discrete probability distributions, multi-model ensemble (MME) forecasts are reformulated as barycenters of these distributions. We consider two barycenters, each defined with respect to a different distance metric: the L2 barycenter, which corresponds to the traditional pooling method, and the Wasserstein barycenter, which better preserves certain geometric properties of the input ensemble distributions.
As a proof of concept, we apply the L2 and Wasserstein barycenters to the combination of four models from the Subseasonal to Seasonal (S2S) prediction project database. Their performance is evaluated for the prediction of weekly 2 m temperature, 10 m wind speed, and 500 hPa geopotential height over European winters. By construction, both barycenter-based MMEs have the same ensemble mean, but differ in their representation of the forecast uncertainty. Notably, the L2 barycenter has a larger ensemble spread, making it more prone to under-confidence. While both methods perform similarly on average in terms of the Continuous Ranked Probability Score (CRPS), the Wasserstein barycenter performs better more frequently.
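The contrast between the two barycenters can be sketched numerically. Below is a minimal toy example with two synthetic "single-model ensembles" (illustrative data only, not from the paper), using the fact that in one dimension the W2 barycenter of two equal-size empirical samples can be obtained by averaging the sorted members (quantile averaging), while the L2 barycenter is simply the pooled ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy "single-model ensembles" with different means and spreads
ens_a = rng.normal(loc=-1.0, scale=1.0, size=50)
ens_b = rng.normal(loc=+1.0, scale=2.0, size=50)

# L2 barycenter = pooling: concatenate the members
pooled = np.concatenate([ens_a, ens_b])

# 1-D W2 barycenter of equal-size samples: average the sorted members
# (i.e., average the empirical quantile functions)
w2_bary = 0.5 * (np.sort(ens_a) + np.sort(ens_b))

# Both barycenters share the same ensemble mean ...
assert np.isclose(pooled.mean(), w2_bary.mean())
# ... but the pooled (L2) ensemble has the larger spread, because it
# retains the between-model variance on top of the within-model variance
assert pooled.std() > w2_bary.std()
```

This mirrors the abstract's statement: identical ensemble means by construction, but a larger spread for the L2 (pooled) barycenter.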
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-1330', Anonymous Referee #1, 13 Oct 2025
- RC2: 'Review of egusphere-2025-1330', Anonymous Referee #2, 26 Oct 2025
Review of EGUSPHERE-2025-1330
A barycenter-based approach for the multi-model ensembling of subseasonal forecasts
By Camille Le Coz, Alexis Tantet, Rémi Flamary, and Riwal Plougonven

The authors propose the Wasserstein barycenter for use in multi-model ensemble forecasting and test its relative performance against the more traditional L2 barycenter. Overall, the study is well-written and presented in a way that is digestible by a general audience, despite some of the more complicated mathematical constructs employed. While the results indicate that the Wasserstein barycenter does provide some advantages, it is not universally better than the traditional approaches. Nonetheless, this work is important for exploring how other options for ensemble forecasting could provide additional insights. I recommend the paper for publication subject to the minor clarifications and revisions below.
Clarifications:
1. Line 164: “Note that they additionally assumed the signal to be stationary and periodic which we did not do.” Can you clarify the implications of this choice?
2. Line 201: Since the members of the Wasserstein barycenter aren't drawn from the set of possible simulated states, is it possible that it is not actually physically realizable? Namely, that there's no physically consistent way to connect the final and initial states?
3. Line 247: “Due to model errors, forecasts tend to drift away from the observed climate toward the model climatology as lead time increases.” Isn’t the drift also caused by initial observational uncertainty?
4. Perhaps the fact that the W2 barycenter could give results which are not part of the ensemble is an advantage: ultimately no model can represent physical co-variances correctly, and each has its own biases (all modeled physical processes are approximations). By choosing a trajectory that is not present in the ensemble you are no longer bound to those biases.

Minor comments:
1. Section numbers referred to in the manuscript seem to be rendered incorrectly by the latex template.
2. Line 143: Should the variables inside the curly brackets be a_1 and a_2 instead of a and b?
3. Line 163: Parenthetical citation should be used for Flamary et al. (2020)
4. Line 182: hypothesis hypothesis
5. Line 428: multi-model

Citation: https://doi.org/10.5194/egusphere-2025-1330-RC2
Viewed

| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 1,749 | 42 | 16 | 1,807 | 17 | 32 |
Review of "A barycenter-based approach for the multi-model ensembling of subseasonal forecasts" by Camille Le Coz et al. (submitted to GMD)
General comments:
The article explores a new method to aggregate members from a multimodel ensemble, based on an optimization of the Wasserstein distance. It is applied to univariate point-forecast time series, for a few weather parameters. It is compared with the more usual (and simpler) member pooling technique, also called the L2 barycenter, using a long-range forecast database. The scores show clear benefits of combining several ensembles, which was already known for the pooling method, and they suggest that the Wasserstein-based barycenter may be slightly better than the L2 one according to some performance measures. The evaluation shows that the Wasserstein method considerably decreases the ensemble dispersion, which is a direct consequence of its design.
Is it a good thing to reduce intermodel ensemble spread in a multimodel ensemble? One could argue that it defeats the purpose of using multiple models. Many weather prediction centers actually use multiple models to increase ensemble dispersion, because it improves key aspects of the forecasts.
Mathematically, the need to reduce spread depends on the presence of over- or under-dispersion in the raw (i.e., pooled) ensemble. In the article's setup, the usefulness of implementing the Wasserstein barycenter method is not very clear, as the advertised score benefits could probably have been obtained by (technically and conceptually) simpler methods to reduce the ensemble dispersion, like EMOS or a member-by-member bias correction.
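For concreteness, the "simpler methods" the reviewer alludes to can be sketched. Below is a minimal, hypothetical member-by-member adjustment (the function name and coefficients are illustrative, not from the paper or the review), which shifts the ensemble by a bias estimate and rescales each member's anomaly about the ensemble mean:

```python
import numpy as np

def member_by_member(members, bias, spread_factor):
    """Shift the ensemble by a bias estimate and rescale each member's
    anomaly about the ensemble mean (coefficients are illustrative)."""
    m = np.asarray(members, dtype=float)
    mean = m.mean()
    return (mean - bias) + spread_factor * (m - mean)

raw = np.array([10.0, 12.0, 14.0, 16.0])
adj = member_by_member(raw, bias=1.0, spread_factor=0.8)

# The mean is shifted by the bias estimate, and the spread shrinks by
# the chosen factor, while the rank order of the members is preserved.
assert np.isclose(adj.mean(), raw.mean() - 1.0)
assert np.isclose(adj.std(), 0.8 * raw.std())
```

In practice the bias and spread factor would be fitted against observations (e.g., by regression over a training period), which is where the comparison with EMOS-style calibration arises.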
One understands that the Wasserstein barycenter method can get quite complex and expensive if the workspace dimension is increased (here it is only 6, owing to the time averaging applied to the forecasts), or if the distributions are significantly non-Gaussian, which will restrict its practical applicability.
Despite these rather underwhelming results, the article can serve as a nicely written course on the Wasserstein distance, which is well known in the field of AI (e.g. as a probabilistic metric), but not yet in meteorology, although similar spatial verification concepts (such as SAL) are standard.
The idea of using the Wasserstein metric as an alternative to CRPS for comparing ensemble forecast distributions is original; it might be useful for applications other than the aggregation of multimodel ensembles (perhaps for clustering?).
In a nutshell, the article does not present a clearly useful application, but it is a thought-provoking introduction to a so far little used tool that might lead to practical uses in the future.
Specific remarks (tagged by manuscript line number)
Some parts of the text are unclear or ambiguous. There are typos.
In some languages, the term "barycenter" is often used to mean "centroid" (i.e., weighted averages), which can be confusing because centroid extraction is commonly used in ensemble post-processing to summarize ensembles as pseudo-deterministic forecast scenarios. It would help to clarify early in the introduction that the aim is to transform an ensemble distribution into another distribution, not into a single value.
l.19, 18: replace "single-model" by "SME" or equivalent.
l.29: insert "which _often_ implies" (a poor MME setup can actually degrade the forecast PDFs)
l. 62-63: do you mean "in the BMA method, _no_ assumption is made" (otherwise it contradicts the previous sentence on EMOS)
l. 69: at this point you may clarify that the barycenter is a transformed ensemble prediction.
l. 136: pooling was not actually introduced because it minimizes L2, but because it is the correct thing to do if one assumes all members to be independent, identically distributed (iid) samples of the same distribution. Thus, using another method implies that one assumes the distributions sampled by each ensemble are not the same (presumably because they contain model-specific errors, which is a reasonable hypothesis according to previous literature). This should be more explicitly stated. Based on this article I would conclude that the Wasserstein barycenter is a statistical device to represent model-specific error.
l.156: what do you mean by "space dimension" in appendix A? Is it dimension d? Or the geographical spatial distribution that you discuss in Fig 5? How do you reach a value of 6 (in l. 495; this is only explained later)? Please replace "space" by a more precise term. According to l.104, it should be something like "the number of lead times".
l.162: please correct "the statistical the difficulty" and fix the Flamary reference.
l.165: in this context, "multivariate" could also mean "joint distribution between several meteorological variables". Please fix.
l.182: duplicate "hypothesis"
l.183: most weather variables of interest, like wind speed or precipitation, have strongly non-Gaussian distributions, so that using mappings to displace their Gaussian approximations will change their dispersion. These variables have a nonlinear relationship between their average and dispersion, which is why they are usually modelled as e.g. Weibull or gamma distributions.
l.189: again, replace "dimension of the sample space" which is confusing, by "the number of lead times".
l.192: define "OT"
Fig.2 and l.204: please use consistent terminology, e.g. "pooling" instead of "L2-barycenter" or "concatenation", since they are the same thing.
l.210-214: this discussion should give some physical reasons why preserving distribution shape can be regarded as desirable or not. Why use a multimodel ensemble in the first place if differences in distributions are regarded as a problem to fix ? There seems to be an implicit assumption that these differences stem from climatological differences between the single-model ensembles used, so that the motivation for using the GaussW2 barycenter is to implement a bias correction scheme.
l.284: replace "simulation's number" by "ensemble member index"
l.288: compare -> compares
l.294: a drawback of the CRPSf definition is that it penalizes forecast variability: an ensemble with systematically close-to-average spread will be less likely to produce events with large CRPS, so it will have a better (smaller) CRPSf. By design, it will also produce less extreme probabilities (this is quite clear in Fig. 2), so it will have a reduced capability for detecting extreme events. The performance metrics used in this paper tend to reward average skill. It would be useful to see metrics that are more sensitive to the detection of anomalous events, such as CSI or F1, F2, or the areas under the ROC and precision-recall curves for high physical thresholds.
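The distinction the remark turns on can be made concrete. Below is a minimal sketch of the standard ensemble CRPS estimator next to the "fair" variant, assuming CRPSf refers to the fair-CRPS correction of Ferro (2014); both the toy members and the observation are illustrative:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Standard ensemble CRPS estimator: mean error minus half the
    mean absolute pairwise member difference."""
    m = np.asarray(members, dtype=float)
    return np.abs(m - obs).mean() - 0.5 * np.abs(m[:, None] - m[None, :]).mean()

def crps_fair(members, obs):
    """Fair CRPS (Ferro, 2014): the spread term is rescaled by
    M/(M-1) so the score is unbiased for finite ensemble size M."""
    m = np.asarray(members, dtype=float)
    M = m.size
    spread = np.abs(m[:, None] - m[None, :]).sum() / (2 * M * (M - 1))
    return np.abs(m - obs).mean() - spread

members = np.array([0.0, 1.0, 2.0, 3.0])
obs = 1.5
# The fair version subtracts a larger spread term, so for the same
# ensemble it is smaller (better) than the standard estimator.
assert crps_fair(members, obs) < crps_ensemble(members, obs)
```

Because the fair spread term is larger, spread is rewarded more strongly under CRPSf, which is consistent with the reviewer's point that ensembles with steady, close-to-average spread fare well under this score.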
Table 2: the formulae are unnecessarily heavy. For instance, the double sum over i and j, factor wi etc could be avoided by simply stating that the scores are spatially averaged. This is standard practice in the weather community.
Fig 4, 5, 6: Some graphical indication of significance of the differences is needed, particularly the differences that are presented as results in the text. In maps, an effective method is to leave blank (white) the areas where SSR does not significantly differ from one.
l.367: explain -> explained
l.373: nonexistent section number 22
l.374: please provide references to studies where a similar ensemble is proven to be over-dispersed indeed.
l.376,383: nonexistent section number
l.383: "better" than what ? L2 or SMEs ?
l.386-387: cumbersome sentence. Please rephrase.
l.402-405: I do not understand this statement: where do you see the influence of the forecast initial condition in eq.7 ?
l.425: "use _of_"
l.455: add "model _systematic_" error. Some model errors are not systematic and thus cannot be claimed to be handled by the GaussW2 barycenter method. The claim can only be made for those model errors that are identically activated in all ensemble members of each SME, and that are persistent in time at any given point.
l.461: same remark as on l.402: these methods do not separate initial condition errors from model errors in a general sense, such a separation would only occur under some simplifying assumptions that need to be more explicit in the text.
Appendix B. this appendix does not seem relevant to the rest of the article. It should be deleted if it is not explicitly used in the text body.
Tables D1 to D4: these are not very readable. A graphical presentation of the useful information would improve readability. "models are significantly different" from what ? From each other, or is L2-bary different from GaussW2-bary ?