the Creative Commons Attribution 4.0 License.
A barycenter-based approach for the multi-model ensembling of subseasonal forecasts
Abstract. Ensemble forecasts and their combination are examined from the perspective of probability spaces. Manipulating ensemble forecasts as discrete probability distributions, multi-model ensemble (MME) forecasts are reformulated as barycenters of these distributions. We consider two barycenters, each defined with respect to a different distance metric: the L2 barycenter, which corresponds to the traditional pooling method, and the Wasserstein barycenter, which better preserves certain geometric properties of the input ensemble distributions.
As a proof of concept, we apply the L2 and Wasserstein barycenters to the combination of four models from the Subseasonal to Seasonal (S2S) prediction project database. Their performance is evaluated for the prediction of weekly 2 m temperature, 10 m wind speed, and 500 hPa geopotential height over European winters. By construction, both barycenter-based MMEs have the same ensemble mean, but differ in their representation of the forecast uncertainty. Notably, the L2 barycenter has a larger ensemble spread, making it more prone to under-confidence. While both methods perform similarly on average in terms of the Continuous Ranked Probability Score (CRPS), the Wasserstein barycenter performs better more frequently.
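The contrast between the two barycenters can be sketched numerically. Below is a minimal toy example with two synthetic "single-model ensembles" (illustrative data only, not from the paper), using the fact that in one dimension the W2 barycenter of two equal-size empirical samples can be obtained by averaging the sorted members (quantile averaging), while the L2 barycenter is simply the pooled ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy "single-model ensembles" with different means and spreads
ens_a = rng.normal(loc=-1.0, scale=1.0, size=50)
ens_b = rng.normal(loc=+1.0, scale=2.0, size=50)

# L2 barycenter = pooling: concatenate the members
pooled = np.concatenate([ens_a, ens_b])

# 1-D W2 barycenter of equal-size samples: average the sorted members
# (i.e., average the empirical quantile functions)
w2_bary = 0.5 * (np.sort(ens_a) + np.sort(ens_b))

# Both barycenters share the same ensemble mean ...
assert np.isclose(pooled.mean(), w2_bary.mean())
# ... but the pooled (L2) ensemble has the larger spread, because it
# retains the between-model variance on top of the within-model variance
assert pooled.std() > w2_bary.std()
```

This mirrors the abstract's statement: identical ensemble means by construction, but a larger spread for the L2 (pooled) barycenter.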
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-1330', Anonymous Referee #1, 13 Oct 2025
- RC2: 'Review of egusphere-2025-1330', Anonymous Referee #2, 26 Oct 2025
Review of EGUSPHERE-2025-1330
A barycenter-based approach for the multi-model ensembling of subseasonal forecasts
By Camille Le Coz, Alexis Tantet, Rémi Flamary, and Riwal Plougonven

The authors propose the Wasserstein barycenter for use in multi-model ensemble forecasting and test its relative performance against the more traditional L2 barycenter. Overall, the study is well-written and presented in a way that is digestible by a general audience, despite some of the more complicated mathematical constructs employed. While the results indicate that the Wasserstein barycenter does provide some advantages, it is not universally better than the traditional approaches. Nonetheless, this work is important for exploring how other options for ensemble forecasting could provide additional insights. I recommend the paper for publication subject to the minor clarifications and revisions below.
Clarifications:
1. Line 164: “Note that they additionally assumed the signal to be stationary and periodic which we did not do.” Can you clarify the implications of this choice?
2. Line 201: Since the members of the Wasserstein barycenter aren't drawn from the set of possible simulated states, is it possible that it is not actually physically realizable? Namely, that there's no physically consistent way to connect the final and initial states?
3. Line 247: “Due to model errors, forecasts tend to drift away from the observed climate toward the model climatology as lead time increases.” Isn’t the drift also caused by initial observational uncertainty?
4. Perhaps the fact that the W2 barycenter could give results which are not part of the ensemble is an advantage: ultimately no model can represent physical co-variances correctly, and each has its own biases (all modeled physical processes are approximations). By choosing a trajectory that is not present in the ensemble you are no longer bound to those biases.

Minor comments:
1. Section numbers referred to in the manuscript seem to be rendered incorrectly by the latex template.
2. Line 143: Should the variables inside the curly brackets be a_1 and a_2 instead of a and b?
3. Line 163: Parenthetical citation should be used for Flamary et al. (2020)
4. Line 182: hypothesis hypothesis
5. Line 428: multi-model

Citation: https://doi.org/10.5194/egusphere-2025-1330-RC2
Viewed

| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 1,749 | 42 | 16 | 1,807 | 17 | 32 |
Review of "A barycenter-based approach for the multi-model ensembling of subseasonal forecasts" by Camille Le Coz et al. (submitted to GMD)
General comments:
The article explores a new method to aggregate members from a multimodel ensemble, based on an optimization of the Wasserstein distance. It is applied to univariate point-forecast time series, for a few weather parameters. It is compared with the more usual (and simpler) member pooling technique, also called the L2 barycenter, using a long-range forecast database. The scores show clear benefits of combining several ensembles, which was already known for the pooling method, and they suggest that the Wasserstein-based barycenter may be slightly better than the L2 one according to some performance measures. The evaluation shows that the Wasserstein method considerably decreases the ensemble dispersion, which is a direct consequence of its design.
Is it a good thing to reduce intermodel ensemble spread in a multimodel ensemble? One could argue that it defeats the purpose of using multiple models. Many weather prediction centers actually use multiple models to increase ensemble dispersion, because it improves key aspects of the forecasts.
Mathematically, the need to reduce spread depends on the presence of over- or under-dispersion in the raw (i.e., pooled) ensemble. In the article's setup, the usefulness of implementing the Wasserstein barycenter method is not very clear, as the advertised score benefits could probably have been obtained by (technically and conceptually) simpler methods to reduce the ensemble dispersion, like EMOS or a member-by-member bias correction.
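For concreteness, the "simpler methods" the reviewer alludes to can be sketched. Below is a minimal, hypothetical member-by-member adjustment (the function name and coefficients are illustrative, not from the paper or the review), which shifts the ensemble by a bias estimate and rescales each member's anomaly about the ensemble mean:

```python
import numpy as np

def member_by_member(members, bias, spread_factor):
    """Shift the ensemble by a bias estimate and rescale each member's
    anomaly about the ensemble mean (coefficients are illustrative)."""
    m = np.asarray(members, dtype=float)
    mean = m.mean()
    return (mean - bias) + spread_factor * (m - mean)

raw = np.array([10.0, 12.0, 14.0, 16.0])
adj = member_by_member(raw, bias=1.0, spread_factor=0.8)

# The mean is shifted by the bias estimate, and the spread shrinks by
# the chosen factor, while the rank order of the members is preserved.
assert np.isclose(adj.mean(), raw.mean() - 1.0)
assert np.isclose(adj.std(), 0.8 * raw.std())
```

In practice the bias and spread factor would be fitted against observations (e.g., by regression over a training period), which is where the comparison with EMOS-style calibration arises.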
One understands that the Wasserstein barycenter method can get quite complex and expensive if the workspace dimension is increased (here it is only 6, owing to the time averaging applied to the forecasts), or if the distributions are significantly non-Gaussian, which will restrict its practical applicability.
Despite these rather underwhelming results, the article can serve as a nicely written course on the Wasserstein distance, which is well known in the field of AI (e.g. as a probabilistic metric), but not yet in meteorology, although similar spatial verification concepts (such as SAL) are standard.
The idea of using the Wasserstein metric as an alternative to CRPS for comparing ensemble forecast distributions is original; it might be useful for applications other than the aggregation of multimodel ensembles (perhaps for clustering?).
In a nutshell, the article does not present a clearly useful application, but it is a thought-provoking introduction to a so far little used tool that might lead to practical uses in the future.
Specific remarks (tagged by manuscript line number)
Some parts of the text are unclear or ambiguous. There are typos.
In some languages, the term "barycenter" is often used to mean "centroid" (i.e., weighted averages), which can be confusing because centroid extraction is commonly used in ensemble post-processing to summarize ensembles as pseudo-deterministic forecast scenarios. It would help to clarify early in the introduction that the aim is to transform an ensemble distribution into another distribution, not into a single value.
l.19, 18: replace "single-model" by "SME" or equivalent.
l.29: insert "which _often_ implies" (a poor MME setup can actually degrade the forecast PDFs)
l. 62-63: do you mean "in the BMA method, _no_ assumption is made" (otherwise it contradicts the previous sentence on EMOS)
l. 69: at this point you may clarify that the barycenter is a transformed ensemble prediction.
l. 136: pooling was not actually introduced because it minimizes L2, but because it is the correct thing to do if one assumes all members to be independent, identically distributed (iid) samples of the same distribution. Thus, using another method implies that one assumes the distributions sampled by each ensemble are not the same (presumably because they contain model-specific errors, which is a reasonable hypothesis according to previous literature). This should be more explicitly stated. Based on this article I would conclude that the Wasserstein barycenter is a statistical device to represent model-specific error.
l.156: what do you mean by "space dimension" in appendix A? Is it dimension d? Or the geographical spatial distribution that you discuss in Fig 5? How do you reach a value of 6 (in l. 495; this is only explained later)? Please replace "space" by a more precise term. According to l.104, it should be something like "the number of lead times".
l.162: please correct "the statistical the difficulty" and fix the Flamary reference.
l.165: in this context, "multivariate" could also mean "joint distribution between several meteorological variables". Please fix.
l.182: duplicate "hypothesis"
l.183: most weather variables of interest, like wind speed or precipitation, have strongly non-Gaussian distributions, so that using mappings to displace their Gaussian approximations will change their dispersion. These variables have a nonlinear relationship between their average and dispersion, which is why they are usually modelled as e.g. Weibull or gamma distributions.
l.189: again, replace "dimension of the sample space" which is confusing, by "the number of lead times".
l.192: define "OT"
Fig.2 and l.204: please use consistent terminology, e.g. "pooling" instead of "L2-barycenter" or "concatenation", since they are the same thing.
l.210-214: this discussion should give some physical reasons why preserving distribution shape can be regarded as desirable or not. Why use a multimodel ensemble in the first place if differences in distributions are regarded as a problem to fix ? There seems to be an implicit assumption that these differences stem from climatological differences between the single-model ensembles used, so that the motivation for using the GaussW2 barycenter is to implement a bias correction scheme.
l.284: replace "simulation's number" by "ensemble member index"
l.288: compare -> compares
l.294: a drawback of the CRPSf definition is that it penalizes forecast variability: an ensemble with systematically close-to-average spread will be less likely to produce events with large CRPS, so it will have a better (smaller) CRPSf. By design, it will also produce less extreme probabilities (this is quite clear in Fig. 2), so it will have a reduced capability for detecting extreme events. The performance metrics used in this paper tend to reward average skill. It would be useful to see metrics that are more sensitive to the detection of anomalous events, such as CSI or F1, F2, or the areas under the ROC and precision-recall curves for high physical thresholds.
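The distinction the remark turns on can be made concrete. Below is a minimal sketch of the standard ensemble CRPS estimator next to the "fair" variant, assuming CRPSf refers to the fair-CRPS correction of Ferro (2014); both the toy members and the observation are illustrative:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Standard ensemble CRPS estimator: mean error minus half the
    mean absolute pairwise member difference."""
    m = np.asarray(members, dtype=float)
    return np.abs(m - obs).mean() - 0.5 * np.abs(m[:, None] - m[None, :]).mean()

def crps_fair(members, obs):
    """Fair CRPS (Ferro, 2014): the spread term is rescaled by
    M/(M-1) so the score is unbiased for finite ensemble size M."""
    m = np.asarray(members, dtype=float)
    M = m.size
    spread = np.abs(m[:, None] - m[None, :]).sum() / (2 * M * (M - 1))
    return np.abs(m - obs).mean() - spread

members = np.array([0.0, 1.0, 2.0, 3.0])
obs = 1.5
# The fair version subtracts a larger spread term, so for the same
# ensemble it is smaller (better) than the standard estimator.
assert crps_fair(members, obs) < crps_ensemble(members, obs)
```

Because the fair spread term is larger, spread is rewarded more strongly under CRPSf, which is consistent with the reviewer's point that ensembles with steady, close-to-average spread fare well under this score.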
Table 2: the formulae are unnecessarily heavy. For instance, the double sum over i and j, factor wi etc could be avoided by simply stating that the scores are spatially averaged. This is standard practice in the weather community.
Fig 4, 5, 6: Some graphical indication of significance of the differences is needed, particularly the differences that are presented as results in the text. In maps, an effective method is to leave blank (white) the areas where SSR does not significantly differ from one.
l.367: explain -> explained
l.373: nonexistent section number 22
l.374: please provide references to studies where a similar ensemble is proven to be over-dispersed indeed.
l.376,383: nonexistent section number
l.383: "better" than what ? L2 or SMEs ?
l.386-387: cumbersome sentence. Please rephrase.
l.402-405: I do not understand this statement: where do you see the influence of the forecast initial condition in eq.7 ?
l.425: "use _of_"
l.455: add "model _systematic_" error. Some model errors are not systematic and thus cannot be claimed to be handled by the GaussW2 barycenter method. The claim can only be made for those model errors that are identically activated in all ensemble members of each SME, and that are persistent in time at any given point.
l.461: same remark as on l.402: these methods do not separate initial condition errors from model errors in a general sense, such a separation would only occur under some simplifying assumptions that need to be more explicit in the text.
Appendix B. this appendix does not seem relevant to the rest of the article. It should be deleted if it is not explicitly used in the text body.
Tables D1 to D4: these are not very readable. A graphical presentation of the useful information would improve readability. "models are significantly different" from what ? From each other, or is L2-bary different from GaussW2-bary ?