A new efficiency metric for the spatial evaluation and inter-comparison of climate and geoscientific model output
Abstract. Developing and evaluating spatial efficiency metrics is essential for assessing how well climate or other models of the Earth’s system reproduce the observed patterns of variables like precipitation, temperature, atmospheric pollutants, and other environmental data presented in a gridded format. In this study, we propose a new metric, the Modified Spatial Efficiency (MSPAEF), designed to overcome limitations identified in existing metrics, such as the Spatial Efficiency (SPAEF), the Wasserstein Spatial Efficiency (WSPAEF), or the Spatial Pattern Efficiency metric (Esp). The performance of MSPAEF is systematically compared to these metrics across a range of synthetic data scenarios characterized by varying spatial correlation coefficients, biases, and standard deviation ratios. Results demonstrate that MSPAEF consistently provides robust and intuitive performance, accurately capturing spatial patterns under diverse conditions. Additionally, two realistic but synthetic case studies are presented to further evaluate the practical applicability of the metrics. In both examples, MSPAEF delivers results that align with intuitive expectations, while the other metrics exhibit limitations in identifying specific features in at least one case. Finally, as a real-world application, we rank global Coupled Model Intercomparison Project phase 6 (CMIP6) model data according to their skill in representing precipitation and temperature using the four different metrics. This application shows that the MSPAEF rankings are most similar to those of Esp, with a normalized absolute ranking difference of 2.8 for precipitation and 3.8 for temperature. These findings highlight the added value of the MSPAEF metric in evaluating spatial distributions and its potential to be used in climate or other environmental model evaluation or inter-comparison exercises.
Status: open (until 13 Dec 2025)
-
CC1: 'Comment on egusphere-2025-1471', Mehmet Cüneyd Demirel, 30 May 2025
-
AC1: 'Reply on CC1', Andreas Karpasitis, 02 Jun 2025
Thank you for your feedback! We truly appreciate it.
In response to your first point, we will ensure that proper references are included for the introduction of the SPAEF metric. Additionally, we acknowledge your second point about benchmarking SPAEF against existing variations, and we will address this in the introduction.
Citation: https://doi.org/10.5194/egusphere-2025-1471-AC1
-
RC1: 'Comment on egusphere-2025-1471', Anonymous Referee #1, 15 Jun 2025
General Comment:
This is a well-written and well-structured paper in which the authors propose a new indicator for comparing climate and geoscientific model outputs. The inclusion of the code is appreciated and contributes to open science by facilitating reproducibility. However, I believe that some revisions or additional explanations are necessary, particularly regarding the illustration of performance using synthetic data and the real-world ranking exercise. My comments are detailed below:
Major Comment 1
Please review the use of the term "metric" throughout the manuscript. From a mathematical perspective, not every indicator discussed may satisfy the formal properties required to be considered a metric. To prevent misunderstandings, please use other terms or consider including a brief footnote or definition clarifying your use of the term “metric”.
Major Comment 2
Line 42 The statement is too general. It would be helpful to specify in which contexts or under which conditions these “metrics” perform well, and in which they do not. Phrases such as “inhomogeneous spatiotemporal distribution” and “different statistical properties” are vague without further elaboration or examples.
Major Comment 3
Line 90 The explanation of the γ component is insufficient. In the original reference, K and L are defined as histograms with n bins. Here, K and L are described as probability distributions, but n is never defined. This could lead readers to misinterpret the formulation as involving n distributions K and L, which are then summed. Please define n and ensure consistency with the source.
Major Comment 4
Line 112 Please clarify the definition of Φ. What does the subscript i represent, and what are the bounds of the summation? Not all readers will be familiar with the Wasserstein distance, so it’s important to contextualize and adapt the mathematical notation accordingly.
Major Comment 5
Line 141 This paragraph is somewhat redundant, as the subsequent paragraph explains the measure more clearly. If you decide to keep it, consider refining the exposition. For instance, it is stated that γ accounts for relative bias, but then it is stated that this characteristic is also influenced by β, which may confuse readers. Please clarify the distinct roles and interactions of these components.
Major Comment 6
Line 200 Reintroducing all previously defined terms is redundant. You might instead state that the definitions from Equation (14) still apply, and define only the new terms. Also, would it be possible to visualize the behavior of normally distributed vs. skewed data? Including a reason for considering the skewed data scenario would also strengthen the section.
Major Comment 7
Line 207 I assume the iteration process is introduced due to instability introduced by the exponential transformation (which is a concern). If this is correct, why is iteration also applied to non-transformed data? Does that case also suffer from instability? Please clarify.
Major Comment 8
Line 208, 235 It appears that the “modifications” refer to small adjustments (e.g., subtracting 1), but I believe these should be explicitly stated or shown in an appendix or supplementary material.
Major Comment 9
Line 212 The explanation is clear, but the term “λ-δ plot” is introduced as if it were standard, which may not be the case. Also, since the plots involve correlation (ρ), the name might more appropriately reflect all components (e.g. λ-δ-ρ). Additionally, Figures 2–5 are referenced before Figure 1, which disrupts the reading flow. Consider expanding Figure 1 to include all components (λ, δ, ρ) and use it as a comprehensive illustrative example.
Major Comment 10
Line 239 It would be useful to specify the range of each statistic used in the comparison.
Major Comment 11
Section 3.2 This section introduces an interesting exercise, but lacks key contextual information. Please describe the motivation and objectives of the parameter variations (line 291). What are you trying to illustrate by modifying these parameters? How are the parameters changed and which are the hyperparameters? How did you select these hyperparameters? For instance, if a uniform distribution U(a = −1, b = 1) is used to model bias, why was this range chosen?
Additionally, consider whether the comparison is fair across models (A and B) and variables (precipitation and temperature). For example, is it common for precipitation model outputs to exhibit negative spatial correlation with observations? Why are correlation values so similar for temperature models (A and B)? Are the data transformed for precipitation, considering it typically does not follow a normal distribution? Given the applied nature of your work, an exploratory data analysis would help support the assumptions and setup. For instance, if temperature models replicate the mean very accurately, the insensitivity of SPAEF to bias may not be a serious issue. Presenting an extreme bias scenario (e.g., 7.5 Kelvin) may be less meaningful unless the goal is to show theoretical failures of other methods, rather than plausible real-world behavior. A justification for the selection of the specific scenarios you are presenting must be included.
Major Comment 12
Line 316 When you state, “we averaged over the 1981–2010 period,” do you mean that each grid cell represents the average over all years for that location? If so, is this a standard approach for model evaluation? This process may result in considerable loss of information, so a reference would be helpful.
Major Comment 13
Section 3.3 Following up on comments regarding Section 3.2, I’m concerned about the similarity among bias-insensitive metrics and MSPAEF. How are you defining “similarity” between model output and observations? The issue resembles your first synthetic example. I believe that whether a metric should favor spatial pattern accuracy over mean accuracy or vice versa may depend on the application. If your primary aim is to detect spatial similarity, that should be explicitly stated, but for now this is just inferred.
Consider also that some variants of the Kling-Gupta Efficiency (KGE) allow weighting of each component, which also may be useful for tuning “similarity” in your selected “metrics”. However, this requires the user to define those weights. As an alternative, your exploratory analysis (from Section 3.2) could guide which metric components need stronger discrimination. With these variations, some methods may perform comparably to MSPAEF, without diminishing the merit of the interesting properties of this new indicator. This is something that you should try or at least mention.
Major Comment 14
Table 1, Table 2 Sometimes, it is necessary to rank models based on two or more variables simultaneously. You are currently ranking the models using only one variable at a time. Could you consider adding references that illustrate methods for multi-criteria model ranking?
Citation: https://doi.org/10.5194/egusphere-2025-1471-RC1 -
AC2: 'Reply on RC1', Andreas Karpasitis, 23 Jun 2025
We would like to express our sincere gratitude to you for the time and effort invested in providing constructive feedback. We have incorporated all recommendations, which we believe have significantly enhanced our manuscript. A detailed, point-by-point response to the referees' comments follows.
Response Major Comment 1:
Thank you for your important observation. We have added the following clarification in the introduction.
“In this paper, the term metric is used in a broad sense to refer to all indicators, statistics or distance measures that take as input two datasets and output a value that quantifies their relative performance or similarity. This includes, but is not limited to, quantities that satisfy the formal definition of a metric.”
Response to Major Comment 2:
We agree with the referee. We have revised the text and now include more specific examples to better demonstrate the contexts in which some commonly used simple metrics perform well or poorly. The revised statement reads as follows:
“While these metrics can be effective for certain applications, their performance can vary significantly based on the statistical and spatial characteristics of the data. For instance, the Pearson correlation coefficient is useful for capturing the linear relationship between two datasets; however, it does not account for systematic biases in the mean or differences in variance (i.e., scale differences). Similarly, other metrics like RMSE and MAE are sensitive to both magnitude and distribution of errors, but they may be disproportionately affected by a large bias in the mean, which can lead to a misrepresentation of spatial patterns. In contrast, the Kolmogorov-Smirnov test compares the underlying distributions of two datasets but lacks spatial context, which is often crucial in geoscientific modeling. These distinct limitations underscore the challenges of using traditional metrics on variables with uneven spatiotemporal distributions, like precipitation, which is often sparsely distributed in space. They also complicate comparisons between models that possess differing statistical properties, such as varying means or levels of variation.”
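As an illustration of these distinct limitations, a minimal Python sketch is given below; the field sizes, bias, and noise levels are made up purely for demonstration and are not taken from the manuscript.

```python
# Hypothetical illustration: the simple metrics discussed above applied to two
# flattened 2-D fields (field sizes, bias, and noise are invented for this sketch).
import numpy as np
from scipy.stats import pearsonr, ks_2samp

rng = np.random.default_rng(0)
obs = rng.normal(loc=2.0, scale=1.0, size=(90, 180)).ravel()   # "observations"
mod = obs + rng.normal(loc=0.5, scale=0.8, size=obs.shape)      # biased, noisier "model"

r, _ = pearsonr(mod, obs)                  # linear association only; blind to bias and scale
rmse = np.sqrt(np.mean((mod - obs) ** 2))  # penalizes all errors; dominated by a large mean bias
mae = np.mean(np.abs(mod - obs))           # as above, but less sensitive to outliers
ks_stat, _ = ks_2samp(mod, obs)            # compares distributions, ignores spatial arrangement

print(f"r={r:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  KS={ks_stat:.3f}")
```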
Response to Major Comment 3:
Thank you for this comment. We are now consistent with the source:
“…, K and L are histograms with n common bins, of the standardized values (z-score) of the model and observations respectively, …”
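For clarity, a minimal sketch of such a histogram-overlap term on z-scored fields could look as follows; the bin count and the normalization shown here are illustrative assumptions, and the exact formulation follows Koch et al. (2018).

```python
# Minimal sketch of a histogram-overlap (γ-like) term: z-score both fields,
# bin them into n common bins, and measure the intersection of the histograms.
import numpy as np

def gamma_overlap(model, obs, n_bins=100):
    zm = (model - model.mean()) / model.std()
    zo = (obs - obs.mean()) / obs.std()
    lo, hi = min(zm.min(), zo.min()), max(zm.max(), zo.max())
    edges = np.linspace(lo, hi, n_bins + 1)   # n common bins for both fields
    K, _ = np.histogram(zm, bins=edges)       # model histogram
    L, _ = np.histogram(zo, bins=edges)       # observation histogram
    return np.minimum(K, L).sum() / L.sum()   # fraction of overlapping counts
```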
Response to Major Comment 4:
Thank you for pointing out the insufficient explanation of Φ and the incomplete equation (6). We have changed K_i and L_i to X_(i) and Y_(i) in Eq. (6) and modified Line 112 to align better with the standard definition of WD:
“…, WD is the Wasserstein distance of order p=2, with X_(i) and Y_(i) being the i-th order statistics of the samples of the observations and model, respectively, and n is the total number of samples in each dataset. This means that the values of the two datasets have been arranged in ascending order, and X_(i) is the i-th value in the sorted list. WD was calculated using the original values of the two datasets to explicitly account for the bias.”
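For readers less familiar with the Wasserstein distance, a minimal sketch of the order-2 distance computed from the order statistics of two equally sized samples is shown below; this is an assumption of how the computation could be coded, not necessarily the implementation used for WSPAEF.

```python
# Order-2 Wasserstein distance between two equally sized empirical samples,
# computed from sorted values (order statistics), in the original units so
# that the mean bias contributes.
import numpy as np

def wasserstein_p2(obs, model):
    x = np.sort(np.ravel(obs))    # X_(i): i-th order statistic of the observations
    y = np.sort(np.ravel(model))  # Y_(i): i-th order statistic of the model
    return np.sqrt(np.mean((x - y) ** 2))
```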
Response to Major Comment 5:
We agree that this is redundant. We have now removed this paragraph and adjusted the text for better readability.
Response to Major Comment 6:
We agree with the reviewer and we have modified the text to avoid redundancies and reintroducing some terms:
“where x_{rm} is the original matrix after subtracting the mean, and the other terms are the same as defined in Eq. (14).”
Regarding the visualization of datasets that follow normal and skewed distributions, we have created a plot using synthetic data (see the first figure of the attached material), which shows on the left the spatial distribution of variables that follow each distribution, and on the right the corresponding histogram. The top row is for the normally distributed data, and the bottom row is for the skewed distributed data.
The synthetic data technically follow a highly positively skewed distribution, due to the way they are created, but when plotted (e.g., using around 20 or so histogram bins), the shape closely resembles an exponential distribution. For this reason, we used the more general term ‘skewed’ throughout the manuscript. We intend to keep this term, but add a clarifying sentence before Line 206, to explain that the distribution in practice appears similar to an exponential:
“Although the data follow a highly positively skewed distribution, their shape closely resembles an exponential distribution when visualized with an insufficient number of histogram bins, due to the generation process.”
Regarding introducing a reason for also using skewed distributions, we have added the following sentence in Line 192, after the first sentence of the paragraph:
“Nevertheless, numerous climate and other geoscientific model output variables do not follow a normal distribution, but instead exhibit skewed or exponential distributions, as is the case with daily precipitation (Ensor L. and Robeson S., 2008).”
https://journals.ametsoc.org/view/journals/apme/47/9/2008jamc1757.1.xml
Response to Major Comment 7:
The iteration process was not introduced due to instability caused by the exponential transformation. Rather, it was applied in all cases because the metrics were computed using synthetic data generated through random sampling to match specific target statistics. Due to the stochastic nature of this process, individual realizations can exhibit variability. Repeating this process 200 times and using the median value ensures that the results are stable and robust. This repetition ensures convergence of the metric values and minimizes sensitivity to random fluctuations.
We now clarify this part of the text:
“For each combination of the aforementioned parameters, and both the normally and skewed distributed cases, the procedure was repeated 200 times, and the median of the values of each metric was used, to ensure convergence of the metric values.”
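To make the repetition procedure concrete, a hedged sketch is given below. The synthetic-field generator shown here (a correlated bivariate normal draw rescaled to the target standard deviation ratio and bias) is an assumption for illustration only, not the generator used in the manuscript.

```python
# Sketch of the repetition procedure: draw synthetic observation/model pairs
# with target correlation, standard deviation ratio, and bias, evaluate a
# metric on each realization, and keep the median over the repeats.
import numpy as np

def synthetic_pair(rho, sd_ratio, bias, shape=(100, 100), rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.standard_normal((2, int(np.prod(shape))))
    obs = z[0]
    mod = rho * z[0] + np.sqrt(1.0 - rho**2) * z[1]   # target correlation rho
    mod = mod * sd_ratio + bias                        # target std ratio and mean bias
    return obs.reshape(shape), mod.reshape(shape)

def median_metric(metric, rho, sd_ratio, bias, n_repeats=200, seed=0):
    rng = np.random.default_rng(seed)
    values = [metric(*synthetic_pair(rho, sd_ratio, bias, rng=rng))
              for _ in range(n_repeats)]
    return np.median(values)   # stabilizes the estimate against sampling noise
```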
Response to Major Comment 8:
We agree with the reviewer. We will now include an Appendix showing the exact form of each of the modified metrics.
Response to Major Comment 9:
Thank you for the feedback. We have created an illustrative example plot that looks like Figs 2-5 (see the second figure of the attached material). We also replaced the paragraphs starting in Lines 211 and 217 with the following:
“In these adjusted conditions, where zero indicates a perfect match with the observations, a well-behaved metric is expected to show decreasing values as the correlation coefficient increases and the bias decreases. In the λ-δ-ρ plots, this is reflected in the curves shifting towards lower values as one moves towards the right-hand and upper subplots, as illustrated by the purple curves in Fig. 1. Additionally, the lowest metric value for any combination of correlation and bias is anticipated at a standard deviation ratio of 1. This is reflected in the curve minimum being located at a standard deviation ratio of 1 in each subplot, with metric values increasing as the standard deviation ratio deviates from 1.
In the examples of the top left subplot of Fig. 1, the black and purple curves indicate well-behaving metrics, since for all of them, the minimum values are found at a λ of 1, and the values increase monotonically as the standard deviation ratio deviates from 1, even though some of the curves are not perfectly symmetric. The blue curves indicate poorly behaving metrics because, for curve e, the minimum value is not found at a standard deviation ratio of 1, while for curve d, the curve does not increase monotonically to the left side of the minimum.”
Response to Major Comment 10:
Thank you for the comment. We have added the following sentence to specify the range of each statistic used:
“The correlation coefficient, bias, and standard deviation ratio were each sampled at discrete intervals within the following ranges: correlation (−0.9 to 0.9), bias (0 to 3), and standard deviation ratio (0.3 to 1.8).”
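For illustration only, the sampling grid implied by these ranges could be constructed as follows; the step sizes are assumptions, not the values used in the study.

```python
# Illustrative parameter grid spanning the quoted ranges (assumed step sizes).
import itertools
import numpy as np

correlations = np.arange(-0.9, 0.91, 0.3)   # -0.9 ... 0.9
biases       = np.arange(0.0, 3.01, 0.5)    #  0.0 ... 3.0
sd_ratios    = np.arange(0.3, 1.81, 0.3)    #  0.3 ... 1.8

parameter_grid = list(itertools.product(correlations, biases, sd_ratios))
```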
Response to Major Comment 11:
In this part of the analysis, the input parameters (bias, standard deviation ratio, and correlation) were not varied across the 20 runs shown in Fig. 6b. Instead, the same parameter values were used to generate 20 independent realizations of synthetic data, due to the stochastic nature of the random sampling process. The use of boxplots allows us to show the spread and median of the metric values across these realizations, helping reduce the effect of the random variation in the data. We revised the text to clarify this and avoid the misleading use of the term “variations.”:
“In Fig. 6b, the boxplots show the distribution of the values of the four metrics from 20 different realizations of the synthetic data generated with the aforementioned parameters.”
Line 306 will also be similarly revised.
The goal of this section was to illustrate the behavior of the metrics under controlled synthetic conditions and highlight their sensitivity to somewhat atypical but relatively realistic cases of bias and spatial correlation.
It is common for simulated precipitation to exhibit negative correlation compared to observations in the meteorological sense (comparing short time frames). Similarly, in climatological timescales (e.g., annual averages) it would also not be unusual for it to have negative correlation regionally, since climate models are known to have large precipitation biases, especially near the equator, due to incorrect representation of the location of main atmospheric circulation features such as the Inter-Tropical Convergence Zone (ITCZ). Nevertheless, it would indeed be pretty unusual when taking into account the whole planet.
The correlation values are quite similar between the temperature models, to better illustrate the overwhelming effect of the mean bias, against small differences in the spatial pattern matching, in the different metrics.
The data were not transformed to a skewed distribution for the precipitation example. Instead, a normal distribution was used as in the temperature case. While precipitation typically follows a skewed distribution, this is not always the case. For example, as seen in the third figure of the attached material, annual precipitation can exhibit a bimodal pattern, reflecting a combination of both skewed and slightly skewed distributions. If the polar regions (all areas beyond 75 degrees latitude in either hemisphere) are excluded, in order to focus on the tropical and extra-tropical regions, the resulting distribution of annual precipitation is only slightly positively skewed and can be reasonably approximated by a normal distribution. On the other hand, daily precipitation seems to generally follow an exponential distribution (or a highly positively skewed distribution).
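As a hedged illustration of this exploratory check, the polar mask and skewness estimate could be computed as follows; the dataset, variable, and coordinate names are hypothetical.

```python
# Mask latitudes poleward of 75 degrees and inspect the skewness of the
# annual precipitation field (file/variable/coordinate names are placeholders).
import xarray as xr
from scipy.stats import skew

pr = xr.open_dataset("obs_pr_annual.nc")["pr"]
pr_extra_polar = pr.where(abs(pr["lat"]) <= 75, drop=True)
print("skewness (75S-75N):", skew(pr_extra_polar.values.ravel(), nan_policy="omit"))
```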
Additionally, although these extreme bias examples for the temperature variable might look unrealistic in a global sense, in a local environment (e.g. country level) such large biases might occur momentarily.
Response to Major Comment 12:
We would like to clarify that each grid cell represents the average annual value of the variable over the specified period for that specific location. We agree that this averaging may lead to a significant loss of temporal variability information. Nevertheless, in climatology, it is a standard practice to evaluate model performance and spatial patterns based on temporally averaged fields over several decades.
We could add the following sentence at Line 316:
“These multi-decadal averages were used to reduce the short-term variability and highlight the long-term climatological signal (Nooni I. et al., 2022; Du Y. et al., 2023).”
https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/joc.7616
https://www.mdpi.com/2073-4433/14/3/607
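For concreteness, a minimal sketch of this multi-decadal averaging with xarray is shown below; the file and variable names are hypothetical.

```python
# Climatological mean per grid cell over 1981-2010, assuming a monthly dataset
# with a "time" coordinate and a variable named "pr" (both names are placeholders).
import xarray as xr

ds = xr.open_dataset("model_pr_monthly.nc")
climatology = ds["pr"].sel(time=slice("1981-01-01", "2010-12-31")).mean("time")
# "climatology" now holds one value per grid cell: the temporally averaged
# field used for the spatial comparison.
```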
Response to Major Comment 13:
The similarity of MSPAEF to the bias-insensitive metrics might look a bit strange at first glance (at least compared to the other bias-sensitive metrics), since this is a bias-sensitive metric. Nonetheless, this is not as odd when we take into account the large sensitivity of WSPAEF (the other bias-sensitive metric) to the absolute value of the mean bias. This sensitivity is what causes its performance to diverge significantly from that of MSPAEF (and the other two bias-insensitive metrics).
We also agree that whether a metric should prioritize spatial pattern accuracy or mean bias accuracy depends on the application. We defined “similarity” between model output and observations as a combination of both spatial pattern agreement and agreement in the mean values. The MSPAEF metric was designed as a comprehensive similarity measure that responds to both spatial pattern agreement and bias in the mean. It is defined to emphasize spatial pattern similarity when the relative bias is small, but to increasingly emphasize the bias as it becomes more significant. This design allows MSPAEF to act as a balanced indicator, adapting its emphasis depending on the characteristics of the data.
We appreciate your recommendation regarding the potential use of weighted variants of existing similarity indices. While MSPAEF was developed as a fixed form, non-tunable metric to maintain consistency across different contexts, we agree that incorporating adjustable weighted schemes could allow for shifting the emphasis on pattern or bias, depending on the application at hand.
In Sect. 3.3, we now mention which components will need to have a greater contribution in order for the existing measures to perform similarly to MSPAEF:
“Using weights for the different components could serve as a way to improve the performance of the existing metrics, to closely match the performance of MSPAEF. For example, in the case of WSPAEF, achieving a behavior more consistent with MSPAEF in the presence of significant absolute mean bias would require reducing the relative contribution of the WD component. By contrast, SPAEF and E_{sp} already perform similarly to MSPAEF when bias values are small, but as bias increases, their lack of bias-sensitive components will limit their ability to achieve similar performance to MSPAEF.”
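To illustrate the weighting idea in general terms, a hedged sketch of a component-weighted efficiency of the form 1 − sqrt(Σ w_k (c_k − 1)²) is given below; the component names and weights are placeholders and do not reproduce the exact definitions of SPAEF, WSPAEF, or E_sp.

```python
# Generic weighted combination of efficiency components whose ideal value is 1
# (e.g. correlation, variability ratio, histogram overlap). Down-weighting one
# term reduces its relative contribution to the overall score.
import numpy as np

def weighted_efficiency(components, weights):
    penalty = sum(weights[k] * (components[k] - 1.0) ** 2 for k in components)
    return 1.0 - np.sqrt(penalty)

# e.g. reducing the weight of a distance-like term relative to the pattern terms:
score = weighted_efficiency({"alpha": 0.85, "beta": 1.10, "gamma": 0.75},
                            {"alpha": 1.0, "beta": 1.0, "gamma": 0.5})
```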
Response to Major Comment 14:
In this study, the CMIP6 models were ranked separately for each variable and each metric, resulting in independent rankings that highlight how the models’ performance varies with different variables. We agree that in many practical applications, model performance must be judged using multiple variables simultaneously. We have added the following paragraph:
“While in this demonstration we evaluate and rank models separately for each variable and each metric, in many real-world applications, the overall model performance can be assessed using multiple variables. There are many multi-criterion model ranking techniques that can do this, such as Compromise Programming (CP) (Refaey M. A. et al., 2019; Baghel T. et al., 2022), Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS) (Srinivasa Raju K. and Nagesh Kumar D., 2014), and Cooperative Game Theory (Gershon M. and Duckstein L., 1983).”
https://www.ajbasweb.com/old/ajbas/2019/May/85-96(9).pdf
https://www.sciencedirect.com/science/article/abs/pii/S0048969722016448
https://iwaponline.com/jwcc/article-abstract/6/2/288/1601/Ranking-general-circulation-models-for-India-using?redirectedFrom=fulltext
https://ascelibrary.org/doi/10.1061/%28ASCE%290733-9496%281983%29109%3A1%2813%29
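As a hedged illustration of one such multi-criteria technique, a compact TOPSIS sketch is shown below; the model scores, weights, and the assumption that all criteria are benefit-type ("larger is better") are illustrative only.

```python
# Minimal TOPSIS ranking: normalize and weight the decision matrix, compute
# distances to the ideal and anti-ideal solutions, and rank by relative closeness.
import numpy as np

def topsis_rank(scores, weights):
    """scores: (n_models, n_criteria) array of benefit-type criteria."""
    norm = scores / np.linalg.norm(scores, axis=0)     # vector-normalize each criterion
    v = norm * weights                                 # weighted normalized matrix
    ideal, anti = v.max(axis=0), v.min(axis=0)         # ideal / anti-ideal solutions
    d_plus = np.linalg.norm(v - ideal, axis=1)
    d_minus = np.linalg.norm(v - anti, axis=1)
    closeness = d_minus / (d_plus + d_minus)           # higher = closer to ideal
    return np.argsort(-closeness)                      # indices of models, best first

# e.g. three models scored for precipitation and temperature with equal weights:
ranking = topsis_rank(np.array([[0.82, 0.93], [0.76, 0.95], [0.88, 0.90]]),
                      weights=np.array([0.5, 0.5]))
```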
-
RC2: 'Reply on AC2', Anonymous Referee #1, 02 Jul 2025
Most of my comments were adequately addressed or corrected. However, I still have a few remaining observations. I will continue using the original numbering from my first round of comments.
Major Comment 7
Following your explanation, I now understand that (even after adjusting for bias, standard deviation, and correlation between observed and modeled data) there can still be non-trivial spatial variations that affect the behavior of certain "metrics". If this interpretation is correct, I recommend stating it explicitly in the manuscript. This clarification would also apply to the evaluation of synthetic data in special-case scenarios.
Major Comment 11
Regarding the concerns I previously raised (e.g., negative spatial correlation, use of normally distributed data for precipitation), your explanation is satisfactory. However, I still believe it is important to include these justifications in the manuscript. Doing so would help support your choice of parameter values and make the rationale clearer to readers. Please include references.
Major Comment 13
Again, your explanation is clear and satisfactory (particularly regarding why MSPAEF appears similar to other bias-insensitive "metrics"). I suggest incorporating this discussion into the manuscript as well. Additionally, it would strengthen the paper to reflect some of these insights in the Conclusions section. For instance, you could highlight the idea that certain "metrics" may be improved by incorporating weights, while in other contexts, their unweighted forms may be sufficient. I also appreciate that you now mention some limitations of the MSPAEF. However, I recommend improving the explanation of its advantages. Although I am aware of the existence of this explanation in the manuscript, I found the following paragraph stronger and suggest including it with the necessary changes:
“We defined ‘similarity’ between model output and observations as a combination of both spatial pattern agreement and agreement in the mean values. The MSPAEF metric was designed as a comprehensive similarity measure that responds to both spatial pattern agreement and bias in the mean. It is defined to emphasize spatial pattern similarity when the relative bias is small, but to increasingly emphasize the bias as it becomes more significant. This design allows MSPAEF to act as a balanced indicator, adapting its emphasis depending on the characteristics of the data.”
Citation: https://doi.org/10.5194/egusphere-2025-1471-RC2 -
AC3: 'Reply on RC2', Andreas Karpasitis, 04 Jul 2025
We're very grateful for the thoughtful time and effort you put into giving us constructive feedback.
Response to Major Comment 7:
We would like to further clarify that even when synthetic datasets are generated with specific target values for correlation, standard deviation ratio, and bias, due to the stochastic nature of the process, the exact targets are not always precisely achieved. By repeating the generation process multiple times and using the median value of the metrics that are applied in these synthetic data, we reduce the impact of outliers and approximate the behavior of the metrics under the intended statistical conditions.
We agree with the referee that, beyond minor deviations from the target statistics, there can also be non-trivial spatial variations in the synthetic data fields. These variations can influence the performance of different metrics in distinct ways. We will revise the manuscript to state this explicitly, as this is particularly relevant for interpreting the metrics responses in synthetic-data experiments, including special-case scenarios.
We will revise this part further:
“For each combination of the aforementioned parameters, and for both the normally and skewed distributed cases, the procedure was repeated 200 times, and the median value of each metric was used to ensure convergence of the metrics. While the synthetic data are generated to match specific target values of correlation, bias, and standard deviation ratio between them, the stochastic nature of the process means these targets are only approximately achieved. Additionally, non-trivial spatial variations may still occur across realizations, which can affect different metrics in distinct ways. The repetition and the use of the median metric values help reduce the influence of these variations and provide a more robust estimate of each metric’s behavior under the intended conditions.”
Response to Major Comment 11:
We will modify Line 278 as follows to clarify that these examples use normally distributed synthetic data:
“Two examples are presented to illustrate the differences in the values and the interpretation of the four metrics, using normally distributed synthetic data as described in the Methods section.”
To address the use of negative correlation in Example 1, we will add the following at Line 285:
“While the use of negative spatial correlations might look unusual, especially at global scales, they can occur regionally, especially in precipitation fields due to known large biases near the equator, such as the double ITCZ bias (Ma X. et al., 2023).”
For the choice of normal distribution for the precipitation example, we will add the following at the end of Line 283:
“Although daily precipitation often follows a highly skewed or exponential distribution, annual averages can be closely approximated by a normal distribution, especially outside of polar regions (see figure in Appendix).”
Regarding the use of extreme bias in Example 2, we will add the following at the end of Line 300:
“While this large bias example might seem unrealistic at global scales, it can occur at local scales and/or momentarily (McSweeney C. F. et al., 2015).”
https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/joc.7980
https://link.springer.com/article/10.1007/s00382-014-2418-8
Response to Major Comment 13:
We are grateful for your constructive suggestions. For the similarity of MSPAEF to the bias-insensitive metrics, we will add the following paragraph after Line 357:
“The similarity of MSPAEF to the bias-insensitive metrics might look unusual at first glance (compared to the other bias-sensitive metrics), given that MSPAEF is a bias-sensitive metric. However, this is less surprising when considering the large sensitivity of WSPAEF (the other bias-sensitive metric) to the absolute value of the mean bias. This sensitivity is what causes WSPAEF’s performance to diverge significantly from that of MSPAEF (and the other two bias-insensitive metrics), particularly for the precipitation variable, where the bias in the mean can be large in absolute terms.”
We will add the following paragraph after Line 387, to reflect some insights in the Conclusions section:
“Through the use of appropriate weights, the existing metrics can be adjusted to better align with the performance of MSPAEF. Nevertheless, on many occasions, such as when there is insignificant bias, their original unweighted forms often perform sufficiently well, reducing the need for such modifications.”
We will revise the last paragraph of the conclusions section, to emphasize the advantages of MSPAEF:
“The MSPAEF metric was designed as a comprehensive similarity measure that incorporates both spatial pattern agreement and bias in the mean. It is defined in a way that emphasizes spatial pattern similarity when the relative bias is small, but it increasingly emphasizes the bias as it becomes more significant. This design allows MSPAEF to act as a balanced indicator, adapting its emphasis depending on the characteristics of the data. While this metric was developed for evaluating gridded climate model output, its underlying rationale and its inherent flexibility make it suitable for assessing the performance of other types of geoscientific or environmental models where the spatial distribution of simulated variables is expected to follow certain patterns in space.”
Citation: https://doi.org/10.5194/egusphere-2025-1471-AC3
-
RC3: 'Comment on egusphere-2025-1471', Anonymous Referee #2, 06 Dec 2025
Summary:
This paper provides a new efficiency metric to capture spatial characteristics of climate model output when compared to observations. The Modified Spatial Efficiency (MSPAEF) metric introduces sensitivity to the relative values of bias as well as capturing spatial pattern differences as a way of overcoming some limitations of the SPAEF metric.
General Comments
This paper is structured well and describes a new metric for spatial correspondence in the evaluation of climate models against observations. However, the novelty factor is not very clear in this work as it largely draws on previously established metrics. The MSPAEF metric takes bias into account in its calculation, primarily through the NRMSE, so it is to be expected that this will be reflected in the results when compared to other metrics that explicitly don’t. Two key questions arise:
1. It is generally accepted in the climate science community that multiple metrics should be taken together to assess model performance in evaluation. Some of this may also be region/variable dependent. What is the justification for not simply using the SPAEF or Esp metrics and looking at bias as well through the NRMSE score and perhaps other measures? (For example, see Gomez et al. in https://doi.org/10.1002/joc.8619). In fact, the WSPAEF does well for their choice of region, it seems.
2. Spatial correlations have been used to identify model groupings/families (https://doi.org/10.1198/016214507000001265). It seems like that might not be so obvious from Figure 8 in the paper (the MIROC model stands out as an example; CNRM is another, where the higher-resolution model seems to do better. This is perhaps expected, but NorESM2-LM is very far apart from #5, #6, and #32, which seems odd.). Can the authors offer some hypotheses for this?
Specific Comments
1. A question about the synthetic model data used in Section 3.2 is whether this is realistic. The grid cells are uniform and area averaging is not needed, which is not the case when comparing model and observations traditionally. It is not clear if the bias values in the models would be realistic under those circumstances.
2. There is no discussion of how differences in model and observation resolution will be accounted for. Regridding will impact many terms in the calculation of MSPAEF. Would the inferences made regarding the relative performance still hold for the MSPAEF and why?
3. A key component of model evaluation is looking at the spatial distribution of values. In lines 315-325, the authors mention how some models perform better or worse in spatial distribution comparisons, but it is not clear how that is reflected in Figure 8. For instance, MIROC6 shows a score of ~0.84 for precipitation and ~0.91 for temperature, but is said to perform badly for the spatial distribution of temperature. However, it is not quite clear from Figure 8 what the difference between MIROC6 and #15, EC-Earth3 (slightly better), is when compared to, say, #29, MRI-ESM2-0. In other words, what do the numbers really signify on a map, and how does a number for temperature compare to a number for precipitation, for instance? It would be very useful to unpick this information for the metric to be truly useful.
Technical Comments
1. It would be good to expand the MSPAEF in the subsection where the metric is defined.
2. Figure 1 is not mentioned anywhere in the text.
3. Line 43: inhomogenous ->nonhomogenous
4. The y-axes in Figures 2-5 need labels.
Citation: https://doi.org/10.5194/egusphere-2025-1471-RC3 -
AC4: 'Reply on RC3', Andreas Karpasitis, 09 Dec 2025
Response to General Comment 1:
We appreciate this important point. We agree that for a comprehensive climate model evaluation, one should rely on multiple complementary metrics, and that MSPAEF is not intended to replace existing, simpler measures like NRMSE, or more specialized metrics such as SPAEF or Esp. Rather, the motivation behind MSPAEF is to provide a single, balanced indicator that combines two key aspects of model performance—spatial pattern similarity and mean relative bias—in a unified framework.
An important distinction of the MSPAEF, compared to other similar metrics, is its adaptivity. Unlike purely spatial metrics, MSPAEF automatically shifts weight between bias and spatial pattern similarity depending on the magnitude of the relative bias. When the bias is small, the metric behaves very similarly to SPAEF or Esp; when the bias is large, it appropriately penalizes the model. This adaptive behavior provides an advantage in contexts where the relative importance of bias is also relevant.
We fully agree with the reviewer that multi-metric approaches remain essential. MSPAEF is proposed here as an additional option, one that can be particularly helpful when a single summary statistic is desired, while still respecting both spatial and mean characteristics. Using a single bias-sensitive metric can also be advantageous because it avoids the ambiguity that may arise when separate spatial and bias metrics are in disagreement. By combining these two aspects within a unified framework, MSPAEF provides a clearer and more consistent basis for ranking or comparing models. In this sense, MSPAEF fills a gap between purely spatial metrics and purely bias-focused ones.
Furthermore, we acknowledge that WSPAEF can perform well for certain regions, as noted by the reviewer and also consistent with Gomez et al. (2024). Part of our motivation for MSPAEF is precisely that WSPAEF’s performance can vary substantially with the absolute magnitude of the bias, and therefore with the choice of variable and units, which may lead to inconsistencies if not applied with caution. MSPAEF was developed to provide a more stable and balanced behavior across a wider range of conditions.
Taking into account these aspects, in the revised version we will add the following paragraph at the end of the introduction:
“It is important to stress that the MSPAEF metric is not intended to replace traditional multi-metric evaluation approaches. Instead, it is designed to complement existing measures by providing a balanced indicator that captures both spatial pattern similarity and relative mean bias within a single metric. In this way, MSPAEF serves as a useful additional tool, particularly when a unified summary statistic is desirable.”
We will also modify the last paragraph of the conclusions:
“The MSPAEF metric is not proposed as a substitute for multi-metric evaluation, which remains essential in climate model assessment. It was designed as a comprehensive similarity measure that incorporates both spatial pattern agreement and bias. It is defined in a way that emphasizes spatial pattern similarity when the relative bias is small, but it increasingly emphasizes the bias as it becomes more significant. This design allows MSPAEF to act as a balanced indicator, adapting its emphasis depending on the characteristics of the data. While this metric was developed for evaluating gridded climate model output, its underlying rationale and its inherent flexibility make it suitable for assessing the performance of other types of geoscientific or environmental models where the spatial distribution of simulated variables is expected to follow certain patterns in space. Its use alongside established measures can help provide a clearer and more consistent picture of model performance.”
Response to General Comment 2:
We thank the reviewer for raising this interesting point. We agree that previous studies, such as the one cited, have shown that climate models developed by the same institution or sharing a common parent model often exhibit similar spatial error structures and therefore tend to cluster when evaluated using spatial correlation–based diagnostics. In our results (Fig. 8), we observe that some model families indeed cluster closely, but others show a much wider spread.
One of the striking differences is that of NorESM2-LM, which does not cluster with the rest of the group. For that one, we see that the differences are found just for the precipitation variable. The NorESM2-LM is run at a 2-degree resolution for the atmospheric component, while the NorESM2-MM and the CESM2s are run at 1 degree resolution. This large difference in the resolution can have a large impact on the representation of the local precipitation patterns, which are more likely to lead to larger local biases and therefore significantly poorer overall model performance.
Another example is the MIROC-ES2L and MIROC6 models. In Fig. 8, it is obvious that MIROC-ES2L performs much worse than MIROC6, for both variables. MIROC-ES2L native atmospheric resolution is about 5 degrees, while for MIROC6, it is about 2.5 degrees. This large difference in the resolution can, for the same reason as above, explain the differences in the precipitation performance. We also note that MIROC-ES2L is by far the worst-performing CMIP6 model for the temperature. Its coarse grid blends land and ocean surfaces, especially near coastlines, which can significantly distort regional temperature fields. It is therefore expected that MIROC-ES2L would perform worse than a higher-resolution version of the same model family, which can resolve coastlines and regional gradients more accurately.
Generally, we observe that the outliers in the performance from models of the same model family or institution are the ones that have significantly coarser atmospheric resolution. Lower resolution naturally alters the representation of key processes and leads to markedly different precipitation patterns and therefore affects the overall model performance. Other factors, such as different parametrizations and internal variability of the models can also have some effect on the differing model performances, but the systematic link between resolution and performance provides the most plausible explanation for this discrepancy.
To discuss this issue, we will add the following paragraph in section 3.3, before line 325:
“In line with previous studies (e.g., Jun et al., 2008), we generally expect models from the same developers or model family to cluster together in performance, and we do observe this behavior for most groups. However, a few clear outliers emerge. These outliers typically correspond to models with substantially coarser atmospheric resolution, which can markedly alter precipitation characteristics and therefore degrade overall performance relative to their higher-resolution counterparts. Although differences in parameterizations and internal variability may also play a role, the systematic link between resolution and performance offers the most plausible explanation for the departures from the expected clustering.”
To facilitate the interpretation of results, we will also add a table in the appendix with details about each of the CMIP6 models used (e.g., resolution, developing institution(s), references, etc.).
Response to Specific Comment 1:
The primary objective of Section 3.2 is to isolate and illustrate the theoretical behavior of the MSPAEF metric. For this reason, we intentionally constructed a simplified synthetic experiment on a uniform grid without applying area-weighting. This controlled setup allows us to vary mean bias and spatial pattern errors independently, which would not be possible in a realistic, irregular grid. The purpose here is therefore a conceptual demonstration of the different responses of the four metrics. In practical applications, especially over large or global domains, MSPAEF can be combined with area-weighting if desired, though the metric itself should give very reasonable results even without it. Over limited regions, like continents, where the grid-cell areas are more homogeneous, the bias values and behaviors shown in our synthetic experiment are expected to be very representative.
Response to Specific Comment 2:
Thank you for this comment. We agree that differences in the model and observational resolution must be handled carefully when applying any spatial metric, including MSPAEF. In practice, MSPAEF is computed after the two datasets have been interpolated into a common grid, following the standard practice in model evaluation. This regridding might indeed have an effect on all spatial metrics that rely on grid-point values. However, the relative performance rankings among models are expected to remain largely consistent because all models have undergone the same regridding procedure to the same target grid before MSPAEF is computed. In this way, any smoothing or distortion that occurs due to the regridding, is expected to have a similar effect across the datasets. We will clarify in the manuscript that comparisons should always be done on a common grid, and that MSPAEF behaves consistently under such standard preprocessing.
In addition to our original approach, we tested regridding only the observational data to the resolution of the CMIP6 model output, and we see very similar results to when regridding everything to 1-degree resolution (see the attached figure Tas_vs_pr_CMIP6_new.pdf).
To avoid any confusion, we will add the following paragraph at the end of section 2.5 (Real climate data):
“To compute MSPAEF or any other spatial metrics, the model and observational datasets need to be placed into a common grid. Regridding data can generally affect the values of metrics that rely on grid-point values. However, the relative performance rankings among models are expected to remain largely unchanged because all models undergo the same regridding procedure to the same target grid. Therefore, any smoothing or distortion that occurs due to the interpolation is expected to have a similar and small effect across the datasets, provided that the regridding does not involve extreme changes to resolution.”
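For concreteness, a minimal sketch of this common-grid preprocessing with bilinear interpolation in xarray is shown below; the dataset, variable, and coordinate names are assumptions, and conservative regridding tools could equally be used.

```python
# Interpolate model and observational fields onto a common regular 1-degree grid
# before computing any spatial metric (all names here are placeholders).
import numpy as np
import xarray as xr

target_lat = np.arange(-89.5, 90, 1.0)
target_lon = np.arange(0.5, 360, 1.0)

model_1deg = xr.open_dataset("model_tas.nc")["tas"].interp(lat=target_lat, lon=target_lon)
obs_1deg = xr.open_dataset("era5_tas.nc")["t2m"].interp(lat=target_lat, lon=target_lon)
```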
Response to Specific Comment 3:
Thank you for this observation. Understanding what a given value of the MSPAEF means on a map is essential for perceiving the model performance. Since MSPAEF is a compound metric that combines spatial and bias-focused measures, these two characteristics can both affect the final value of the metric. Specifically, it is affected by how well the patterns of the variable match between the two datasets (on a map, this is seen in how well the location and intensity of the gradients of the variable match), the relative mean bias, and the differences at each grid-point (which are affected both by changes in the location and intensity of the gradients in a map and by the mean bias).
In the case of MIROC6, for instance, the relatively high value of MSPAEF for temperature reflects a good overall performance, but it masks the fact that this model exhibits comparatively weaker performance in recreating the spatial patterns and values of the temperature, compared to most other CMIP6 models. On the other hand, the significantly higher value of MSPAEF for temperature for the MRI-ESM2-0, indicates that this model better represents not only the spatial patterns, but also the actual values of temperature at each grid-point.
Comparing MSPAEF values for different variables is a bit trickier. In the example shown in the manuscript, we generally see that the models have higher values of the MSPAEF metric for the temperature compared to precipitation. This generally indicates that the spatial pattern of temperature, as well as the grid-point values, are more easily reproduced in the models, due to their smoother nature, at least compared to precipitation, where steeper gradients are more common due to the underlying terrain, and differences in the large-scale circulation. In general, the higher values of MSPAEF for temperature indicate that the models can better reproduce the underlying patterns of this variable, and at the same time, the values at the grid-point level differ less when taking into account the variability of the whole domain.
In the revision, we will modify and split the paragraph at lines 325-332 to better reflect the meaning of these values:
“Regarding the actual values of the metric, all CMIP6 models except MIROC-ES2L, have values greater than 0.9 for the temperature variable. MRI-ESM2-0 is the model with the best performance for the temperature variable, which indicates that this model better represents not only the spatial patterns, but also the actual values of temperature at each grid-point. Contrariwise, all models except the two CESM2s and the NorESM2-MM models have values less than 0.85 for the precipitation variable. This indicates that the CMIP6 models generally can capture well the spatial distribution and magnitude of the temperature, while they struggle a lot more in the representation of the precipitation features.
Although comparing the MSPAEF values for two different variables is not straightforward, some conclusions on the nature of the variables can be deduced. The spatial distribution of temperature is influenced by factors such as orography, latitude, and proximity to the oceans, which are easier to be represented in climate models, due to their generally smoother and slow-varying nature in space. On the other hand, precipitation tends to have steeper gradients. Modeling precipitation is significantly more complex (Legates, 2014; Räisänen, 2007), as it is largely influenced by sub-grid-scale processes like convection and cloud microphysics that are not explicitly resolved in the models, and they need to be parametrized (Pieri et al., 2015), which makes it more difficult for climate models to correctly capture the underlying gradients and the actual grid-point values.”
Moreover, we will add an extra paragraph in the subsection where MSPAEF is defined, to signify what exactly the output numbers of MSPAEF signify on a map (last paragraph on the Response to Technical Comment 1).
Response to Technical Comment 1:
Following the reviewer’s feedback, we will expand the MSPAEF definition by adding the following paragraphs at the end of the subsection:
“A further advantage of normalizing all components by IQR and expressing each term in dimensionless form is that MSPAEF becomes relatively insensitive to preprocessing decisions such as unit conversions, domain rescaling, or masking of small regions. Unlike WSPAEF, whose values are affected by the unit of choice, MSPAEF can produce comparable values under a wide range of analysis settings, which increases reproducibility across studies and domains.
An important motivation for this definition of MSPAEF is to penalize inconsistent model behavior, where a model with poor spatial correlation but a very small bias (or vice versa) can appear artificially good in composite metrics. Since MSPAEF treats each discrepancy in either the spatial patterns or in the mean relative bias as an orthogonal dimension, it prevents a strong performance in one characteristic from masking deficiencies in the other. This provides a more complete evaluation of how each model differs from observations.
Based on its definition, MSPAEF is sensitive to both spatial pattern agreement and mean bias, and this sensitivity can be directly interpreted in the corresponding spatial maps. Each component of the metric can be associated with an observable feature in the maps. The spatial correlation term represents how well the locations and intensities of spatial gradients align between the model and observations. The NRMSE identifies differences in grid-point magnitudes across the domain that are linked to both the location and intensity of gradients, but also to the mean bias. The relative mean bias term quantifies systematic offsets in the average values of the field. Finally, the variability term reflects how well the model reproduces the amplitude of spatial fluctuations. Together, these components allow changes in the MSPAEF value to be directly linked with spatial mismatches.”
We will also add the following references at line 149:
“The normalized RMSE (β term) is computed using the interquartile range (IQR) of the observations rather than the standard deviation, as IQR provides a more robust measure of variability (Rousseeuw P., Hubert M., 2011; Huber P., Ronchetti E., 2009)”
https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.2
https://onlinelibrary.wiley.com/doi/book/10.1002/9780470434697
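To make the link between the metric value and the map-level features more tangible, a hedged sketch of the individual diagnostic components named above (spatial correlation, IQR-normalized RMSE, relative mean bias, variability ratio) is given below; how MSPAEF combines them is defined in the manuscript and is deliberately not reproduced here.

```python
# Individual diagnostic components for two gridded fields; the combination of
# these terms into MSPAEF follows the definition in the manuscript.
import numpy as np

def diagnostic_components(model, obs):
    m, o = np.ravel(model), np.ravel(obs)
    rho = np.corrcoef(m, o)[0, 1]                      # spatial pattern agreement
    iqr = np.subtract(*np.percentile(o, [75, 25]))     # robust spread of the observations
    nrmse = np.sqrt(np.mean((m - o) ** 2)) / iqr       # grid-point errors, scaled by IQR
    rel_bias = (m.mean() - o.mean()) / o.mean()        # systematic offset of the mean
    sd_ratio = m.std() / o.std()                       # amplitude of spatial variability
    return rho, nrmse, rel_bias, sd_ratio
```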
Response to Technical Comment 2:
Figure 1 will be modified for greater clarity (following also the comments of Referee #1). More explanations will be added in the corresponding section, as seen in the Response to Major Comment 9 in AC2 (https://doi.org/10.5194/egusphere-2025-1471-AC2).
Response to Technical Comment 3:
Thank you. We will follow the reviewer’s recommendation.
Response to Technical Comment 4:
Thank you for noticing this. We will add the labels on the y-axis of Figures 2-5.
Data sets
CMIP6 data - https://climexp.knmi.nl/selectfield_cmip6_knmi23.cgi?
ERA5 data - https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels-monthly-means?tab=overview
Model code and software
Python software Andreas Karpasitis https://doi.org/10.5281/zenodo.15094921
I enjoyed reading the manuscript.
I have only two points:
1) The sentence above SPAEF equation in Koch et al, (2018) GMD paper states that “Following the multiple-component idea of KGE we present a novel spatial performance metric denoted SPAtial EFficiency (SPAEF), which was originally proposed by Demirel et al. (2018a, b).” This statement can be helpful to find the origin of SPAEF.
2) Benchmarking SPAEF with other variations of SPAEF has also been done in a 2024 SERRA paper. Table 2 lists all improved versions of SPAEF.
https://link.springer.com/content/pdf/10.1007/s00477-024-02790-4.pdf