This work is distributed under the Creative Commons Attribution 4.0 License.
A new efficiency metric for the spatial evaluation and inter-comparison of climate and geoscientific model output
Abstract. Developing and evaluating spatial efficiency metrics is essential for assessing how well climate and other Earth system models reproduce the observed patterns of variables such as precipitation, temperature, atmospheric pollutants, and other environmental data presented in a gridded format. In this study, we propose a new metric, the Modified Spatial Efficiency (MSPAEF), designed to overcome limitations identified in existing metrics such as the Spatial Efficiency (SPAEF), the Wasserstein Spatial Efficiency (WSPAEF), and the Spatial Pattern Efficiency metric (Esp). The performance of MSPAEF is systematically compared to these metrics across a range of synthetic data scenarios characterized by varying spatial correlation coefficients, biases, and standard deviation ratios. Results demonstrate that MSPAEF consistently provides robust and intuitive scores, accurately capturing spatial patterns under diverse conditions. Additionally, two realistic but synthetic case studies are presented to further evaluate the practical applicability of the metrics. In both examples, MSPAEF delivers results that align with intuitive expectations, while the other metrics exhibit limitations in identifying specific features in at least one case. Finally, as a real-world application, we rank global Coupled Model Intercomparison Project phase 6 (CMIP6) model output according to its skill in representing precipitation and temperature using the four metrics. This application shows that the MSPAEF rankings are most similar to those of Esp, with a normalized absolute ranking difference of 2.8 for precipitation and 3.8 for temperature. These findings highlight the added value of MSPAEF in evaluating spatial distributions and its potential for use in climate and other environmental model evaluation and inter-comparison exercises.
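For context, SPAEF-style metrics combine a correlation term, a variability term, and a histogram-overlap term into a single score. Below is a minimal NumPy sketch of the published SPAEF formulation (Demirel et al., 2018; Koch et al., 2018) as a reference point; it is not the new MSPAEF, whose definition is given in the preprint, and the array names and bin count are illustrative.

```python
import numpy as np

def spaef(obs, sim, bins=100):
    """Sketch of SPAEF (Demirel et al., 2018; Koch et al., 2018).
    obs, sim: flattened, NaN-free 1-D arrays of the gridded fields."""
    # alpha: Pearson correlation between observed and simulated fields
    alpha = np.corrcoef(obs, sim)[0, 1]
    # beta: ratio of coefficients of variation (simulated over observed)
    beta = (np.std(sim) / np.mean(sim)) / (np.std(obs) / np.mean(obs))
    # gamma: histogram intersection of the z-scored fields over shared bins
    z_obs = (obs - np.mean(obs)) / np.std(obs)
    z_sim = (sim - np.mean(sim)) / np.std(sim)
    edges = np.histogram_bin_edges(np.concatenate([z_obs, z_sim]), bins=bins)
    k, _ = np.histogram(z_obs, bins=edges)
    l, _ = np.histogram(z_sim, bins=edges)
    gamma = np.minimum(k, l).sum() / k.sum()
    # Euclidean distance of (alpha, beta, gamma) from the ideal point (1, 1, 1)
    return 1.0 - np.sqrt((alpha - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
```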
Status: open (until 23 Jul 2025)
CC1: 'Comment on egusphere-2025-1471', Mehmet Cüneyd Demirel, 30 May 2025
I enjoyed reading the manuscript.
I have only two points:
1) The sentence above the SPAEF equation in the Koch et al. (2018) GMD paper states that “Following the multiple-component idea of KGE we present a novel spatial performance metric denoted SPAtial EFficiency (SPAEF), which was originally proposed by Demirel et al. (2018a, b).” This statement can be helpful for finding the origin of SPAEF.
2) Benchmarking SPAEF against other variations of SPAEF has also been done in a 2024 SERRA paper; its Table 2 lists all improved versions of SPAEF.
https://link.springer.com/content/pdf/10.1007/s00477-024-02790-4.pdf
Citation: https://doi.org/10.5194/egusphere-2025-1471-CC1
AC1: 'Reply on CC1', Andreas Karpasitis, 02 Jun 2025
Thank you for your feedback! We truly appreciate it.
In response to your first point, we will ensure that proper references are included for the introduction of the SPAEF metric. Additionally, we acknowledge your second point about benchmarking SPAEF against existing variations, and we will address this in the introduction.
Citation: https://doi.org/10.5194/egusphere-2025-1471-AC1
RC1: 'Comment on egusphere-2025-1471', Anonymous Referee #1, 15 Jun 2025
General Comment:
This is a well-written and well-structured paper in which the authors propose a new indicator for comparing climate and geoscientific model outputs. The inclusion of the code is appreciated and contributes to open science by facilitating reproducibility. However, I believe that some revisions or additional explanations are necessary, particularly regarding the illustration of performance using synthetic data and the real-world ranking exercise. My comments are detailed below:
Major Comment 1
Please review the use of the term "metric" throughout the manuscript. From a mathematical perspective, not every indicator discussed may satisfy the formal properties required to be considered a metric. To prevent misunderstandings, please use other terms or consider including a brief footnote or definition clarifying your use of the term “metric”.
Major Comment 2
Line 42 The statement is too general. It would be helpful to specify in which contexts or under which conditions these “metrics” perform well, and in which they do not. Phrases such as “inhomogeneous spatiotemporal distribution” and “different statistical properties” are vague without further elaboration or examples.
Major Comment 3
Line 90 The explanation of the γ component is insufficient. In the original reference, K and L are defined as histograms with n bins. Here, K and L are described as probability distributions, but n is never defined. This could lead readers to misinterpret the formulation as involving n distributions K and L, which are then summed. Please define n and ensure consistency with the source.
Major Comment 4
Line 112 Please clarify the definition of Φ. What does the subscript i represent, and what are the bounds of the summation? Not all readers will be familiar with the Wasserstein distance, so it’s important to contextualize and adapt the mathematical notation accordingly.
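Since many readers will encounter this distance for the first time here, an illustrative snippet alongside the formula would help. For example, a minimal sketch of the first-order Wasserstein distance between two flattened fields using SciPy (the synthetic arrays are placeholders, not the manuscript's data):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
obs = rng.normal(0.0, 1.0, 10_000)  # placeholder for a flattened observed field
sim = rng.normal(0.5, 1.2, 10_000)  # placeholder for a flattened model field

# First-order Wasserstein distance: the minimal "work" needed to
# transform one empirical distribution into the other.
w1 = wasserstein_distance(obs, sim)
```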
Major Comment 5
Line 141 This paragraph is somewhat redundant, as the subsequent paragraph explains the measure more clearly. If you decide to keep it, consider refining the exposition. For instance, it is stated that γ accounts for relative bias, but you then state that this characteristic is also influenced by β, which may confuse readers. Please clarify the distinct roles and interactions of these components.
Major Comment 6
Line 200 Reintroducing all previously defined terms is redundant. You might instead state that the definitions from Equation (14) still apply, and define only the new terms. Also, would it be possible to visualize the behavior of normally distributed vs. skewed data? Including a reason for considering the skewed data scenario would also strengthen the section.
Major Comment 7
Line 207 I assume the iteration process is introduced because of instability caused by the exponential transformation (which is a concern). If this is correct, why is iteration also applied to non-transformed data? Does that case also suffer from instability? Please clarify.
Major Comment 8
Lines 208, 235 It appears that the “modifications” refer to small adjustments (e.g., subtracting 1), but I believe these should be explicitly stated or shown in an appendix or supplementary material.
Major Comment 9
Line 212 The explanation is clear, but the term “λ-δ plot” is introduced as if it were standard, which may not be the case. Also, since the plots involve correlation (ρ), the name might more appropriately reflect all components (e.g. λ-δ-ρ). Additionally, Figures 2–5 are referenced before Figure 1, which disrupts the reading flow. Consider expanding Figure 1 to include all components (λ, δ, ρ) and use it as a comprehensive illustrative example.
Major Comment 10
Line 239 It would be useful to specify the range of each statistic used in the comparison.
Major Comment 11
Section 3.2 This section introduces an interesting exercise, but lacks key contextual information. Please describe the motivation and objectives of the parameter variations (line 291). What are you trying to illustrate by modifying these parameters? How are the parameters changed and which are the hyperparameters? How did you select these hyperparameters? For instance, if a uniform distribution U(a = −1, b = 1) is used to model bias, why was this range chosen?
Additionally, consider whether the comparison is fair across models (A and B) and variables (precipitation and temperature). For example, is it common for precipitation model outputs to exhibit negative spatial correlation with observations? Why are correlation values so similar for temperature models (A and B)? Are the data transformed for precipitation, considering it typically does not follow a normal distribution? Given the applied nature of your work, an exploratory data analysis would help support the assumptions and setup. For instance, if temperature models replicate the mean very accurately, the insensitivity of SPAEF to bias may not be a serious issue. Presenting an extreme bias scenario (e.g., 7.5 Kelvin) may be less meaningful unless the goal is to show theoretical failures of other methods, rather than plausible real-world behavior. A justification for the selection of the specific scenarios you are presenting must be included.
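To make this setup concrete, the manuscript could document its generator along the following lines. This is only a sketch of one common way to impose a target correlation, bias, and standard-deviation ratio; the parameter values are illustrative, and the authors' actual procedure may differ.

```python
import numpy as np

def synthetic_pair(n=10_000, rho=0.7, std_ratio=1.2, bias=0.5, seed=0):
    """Generate an 'observed' field and a 'model' field with prescribed
    expected correlation (rho), standard-deviation ratio, and mean bias.
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    obs = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    # Standard-normal field whose expected correlation with obs equals rho
    z = rho * obs + np.sqrt(1.0 - rho**2) * noise
    sim = bias + std_ratio * z  # impose the bias and the variability ratio
    return obs, sim
```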
Major Comment 12
Line 316 When you state, “we averaged over the 1981–2010 period,” do you mean that each grid cell represents the average over all years for that location? If so, is this a standard approach for model evaluation? This process may result in considerable loss of information, so a reference would be helpful.
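If the averaging is the usual per-grid-cell climatology, stating it as such (or with a short snippet) would remove the ambiguity; for example, a hypothetical xarray sketch, with file and variable names invented for illustration:

```python
import xarray as xr

ds = xr.open_dataset("model_pr.nc")  # hypothetical file name
# One 1981-2010 mean value per grid cell (time dimension collapsed)
clim = ds["pr"].sel(time=slice("1981", "2010")).mean("time")
```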
Major Comment 13
Section 3.3 Following up on comments regarding Section 3.2, I’m concerned about the similarity among bias-insensitive metrics and MSPAEF. How are you defining “similarity” between model output and observations? The issue resembles your first synthetic example. I believe that whether a metric should favor spatial pattern accuracy over mean accuracy or vice versa may depend on the application. If your primary aim is to detect spatial similarity, that should be explicitly stated, but for now this is just inferred.
Consider also that some variants of the Kling-Gupta Efficiency (KGE) allow weighting of each component, which may also be useful for tuning “similarity” in your selected “metrics”. However, this requires the user to define those weights. As an alternative, your exploratory analysis (from Section 3.2) could guide which metric components need stronger discrimination. With these variations, some methods may perform comparably to MSPAEF, without diminishing the merit of the interesting properties of this new indicator. This is something you should try, or at least mention; see the sketch below.
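For reference, the weighted variant I have in mind follows the component scaling factors of Gupta et al. (2009); a minimal sketch, with the weights left to the user:

```python
import numpy as np

def weighted_kge(obs, sim, s_r=1.0, s_alpha=1.0, s_beta=1.0):
    """Kling-Gupta Efficiency with component scaling factors
    (Gupta et al., 2009). A factor > 1 strengthens that component's
    influence on the final score."""
    r = np.corrcoef(obs, sim)[0, 1]     # linear correlation
    alpha = np.std(sim) / np.std(obs)   # variability ratio
    beta = np.mean(sim) / np.mean(obs)  # bias ratio
    return 1.0 - np.sqrt((s_r * (r - 1)) ** 2
                         + (s_alpha * (alpha - 1)) ** 2
                         + (s_beta * (beta - 1)) ** 2)
```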
Major Comment 14
Tables 1, 2 Sometimes it is necessary to rank models based on two or more variables simultaneously. You are currently ranking the models using only one variable at a time. Could you consider adding references that illustrate methods for multi-criteria model ranking?
Citation: https://doi.org/10.5194/egusphere-2025-1471-RC1
Data sets
CMIP6 data - https://climexp.knmi.nl/selectfield_cmip6_knmi23.cgi?
ERA5 data - https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels-monthly-means?tab=overview
Model code and software
Python software (Andreas Karpasitis) - https://doi.org/10.5281/zenodo.15094921
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
136 | 27 | 7 | 170 | 3 | 4 |