SPatial Efficiency And Kmoments (SPEAK): Evaluating Spatial Consistency in (Semi)Distributed Rainfall–Runoff Models

Moreno, Matías; Mendoza, Pablo; Muñoz-Castro, Eduardo; Zambrano-Bigiarini, Mauricio; Pizarro, Alonso

doi:10.5194/egusphere-2026-2912

Preprints

05 Jun 2026

Status: this preprint is open for discussion and under review for Hydrology and Earth System Sciences (HESS).

Abstract. We introduce the Spatial Efficiency and Kmoments (SPEAK) metric, a novel objective function for the spatial calibration of hydrological models. SPEAK is built on Kmoment-based statistics, including (i) a Kmoment-based correlation, (ii) a Kmoment-based coefficient-of-variation ratio, and (iii) a Kmoment-based probability density function. This formulation is explicitly designed to overcome key limitations of existing spatial performance metrics, such as sensitivity to binning strategies, grid resolution, and sample heterogeneity. By relying on distributional properties rather than grid-to-grid correspondence, SPEAK provides a statistically robust framework for evaluating spatial patterns in gridded hydrological variables. The proposed metric is implemented in both semi-distributed and fully distributed configurations of the TUW hydrological model and tested across 99 near-natural Chilean catchments that encompass strong climatic and physiographic gradients. Actual evapotranspiration (ETa) from GLEAM v4.2a is used as an independent spatial benchmark, allowing the assessment of model performance beyond streamflow reproduction. Calibration using SPEAK is compared with a conventional streamflow-only calibration based on the Kling-Gupta Efficiency (KGE) and an ETa-only calibration based on the Spatial Efficiency metric (SPAEF). Model performance is evaluated using the normalised root-mean-square error (NRMSE), the spatial Pearson correlation coefficient, the Fraction Skill Score (FSS), and sensitivity to catchment attributes. Results demonstrate that while streamflow-only calibration leads to satisfactory runoff simulations (KGE ≥ 0.25 for all catchments and cases analysed, with mean and median KGE of 0.80 and 0.85, respectively), it fails to reproduce the spatial patterns of ETa. When ETa is used as a calibration target, SPEAK consistently outperforms SPAEF, exhibiting lower NRMSE (number of catchments with lower NRMSE: 85 and 92 in the fully and semi-distributed configurations, respectively), reduced internal component dispersion, and improved representation of spatial patterns across seasons and hydroclimatic zones. Importantly, SPEAK shows limited dependence on catchment characteristics. These findings highlight SPEAK as a methodologically robust spatial performance metric, with clear potential for improving the calibration and diagnosis of distributed hydrological models and other gridded environmental variables.

Received: 20 May 2026 – Discussion started: 05 Jun 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 17 Jul 2026)

RC1: 'Comment on egusphere-2026-2912', Anonymous Referee #1, 25 Jun 2026

The paper is well structured and proposes a novel approach to improve the SPAEF metric.
I have the following comments:
- Did the authors apply a sensitivity analysis before model calibration? I couldn't find the details. Apparently, all TUW parameters are included in the spatial calibration. A robust calibration framework starts with a sensitivity analysis to reduce the parameter search space, thereby benefiting the model runs and the convergence towards globally optimal metric values.
- Line 250: "Fully distributed model configuration: Model inputs are provided at the grid-cell (0.05° x 0.05°) level across each catchment with uniform-in-space parameters’ values (i.e., not depending on the spatial dimension)." It would be excellent if the authors could apply pedotransfer functions to the fully distributed version of the model. Some models, such as mHM, offer this setting together with a multi-parameter regionalization approach, which helps to create robust patterns instead of the weak patterns caused by uniform parameter values.
- Line 490: "Furthermore, the TUW model employed spatially uniform parameter values, potentially limiting its ability to represent local heterogeneity". It is good that the authors mention this limitation in the text.
- Line 65: "In recent years, the Spatial Efficiency (SPAEF; Koch et al., 2018)". In that paper, the authors state: "Following the multiple-component idea of KGE we present a novel spatial performance metric denoted SPAtial EFficiency (SPAEF), which was originally proposed by Demirel et al. (2018a, b)." (https://gmd.copernicus.org/articles/11/1873/2018/). This sentence can help to trace the origin of the metric.
- The sample size (number of grid cells) in Figs. 4, 5 and 6 seems small compared with, for example, a basin of 100 x 100 grid cells. Readers may be curious why the authors selected small catchments, or why they did not provide gridded maps for continental Chile (99 near-natural catchments).

Citation: https://doi.org/10.5194/egusphere-2026-2912-RC1
RC2: 'Comment on egusphere-2026-2912', Anonymous Referee #2, 02 Jul 2026

The manuscript by Moreno et al. presents a variation of an existing metric used for hydrological model calibration. The proposed metric is intended to improve the spatial calibration of evapotranspiration (ET) patterns in distributed hydrological models.
While I agree with the authors that there is room for improving existing spatial performance metrics, I am not convinced that the presented modelling experiments are sufficient to demonstrate the advantages and general applicability of the proposed approach. Furthermore, I find that several of the presented results focus on grid-based ET errors, although the spatial performance metrics investigated (SPAEF and SPEAK) are insensitive to biases. Consequently, some of the presented evaluations appear to be of limited relevance.
I am also not convinced that the selected hydrological model setup/code is appropriate for developing and testing the proposed metric. Although the model is described as semi-distributed or fully distributed, only precipitation and air temperature are spatially distributed forcing variables. Most of the remaining catchment characteristics appear to be spatially uniform because they are derived from the catchment-aggregated CAMELS-CL dataset. Ideally, a study aiming to evaluate spatial ET performance metrics should employ a model with spatially distributed parameters and process representations that are sensitive to ET variability, such as spatially distributed soils, vegetation, and land use.
In addition, the spatial resolution of GLEAM is relatively coarse (approximately 100 km² per grid cell). Many of the investigated catchments contain only around 20 GLEAM cells (most catchments are substantially smaller than 2,000 km²). This considerably limits the amount of spatial information available for evaluating spatial ET patterns.
Overall, while the manuscript addresses an interesting topic, I do not believe that the presented experiments provide sufficient evidence to support the claimed benefits of the proposed metric. Therefore, I cannot recommend the manuscript for publication in HESS in its current form.
Specific comments
Section 2.1
The introduction of this section is unnecessarily broad. The manuscript ultimately evaluates spatial patterns of long-term averaged ET, and therefore statements such as:
"This phenomenon is particularly pronounced in hydrology, where observations are typically limited to a few decades, whereas the processes of interest may operate on centennial or millennial time scales."
do not appear directly relevant to the scope of this study.
Section 2.2
To better illustrate the advantages of the proposed metric (rkm) over conventional correlation measures, the manuscript would benefit from simple synthetic examples presented as scatter plots.
Likewise, the KPDF component could be explained using synthetic examples. Such illustrations would make the methodology considerably more accessible to the broader hydrological community.
Section 2.3
What does the variable Q represent in Equation (7)? This should be explicitly defined.
Section 3.1
The CAMELS-CL dataset provides only catchment-aggregated attributes. However, a spatially distributed hydrological model is applied. I therefore miss a description of the spatial datasets used to construct the distributed model.
Only the CR2MET precipitation and temperature products are referenced. Please specify the data sources for elevation, soils, land use, and any other spatial information required by the hydrological model.
I also question whether bilinear interpolation is an appropriate method for resampling the GLEAM ET product. Bilinear interpolation introduces artificial spatial variability that is not present in the original data. A nearest-neighbour resampling approach would preserve the original observations. Alternatively, the simulated ET could be aggregated to the native GLEAM resolution before calculating the spatial performance metrics.
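The two alternatives named in this comment can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that the coarse and fine grids are perfectly aligned and differ by an integer factor of 2 (the 0.10° and 0.05° resolutions quoted elsewhere in this discussion); the function names and the example values are hypothetical and are not taken from the manuscript or its code.

```python
import numpy as np


def refine_nearest(coarse: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbour refinement: copy each coarse cell into factor x factor fine cells."""
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)


def aggregate_mean(fine: np.ndarray, factor: int = 2) -> np.ndarray:
    """Block-average a fine grid to a coarser one, ignoring NaN cells (e.g. outside the catchment)."""
    rows, cols = fine.shape
    if rows % factor or cols % factor:
        raise ValueError("grid shape must be divisible by factor")
    blocks = fine.reshape(rows // factor, factor, cols // factor, factor)
    return np.nanmean(blocks, axis=(1, 3))


# Hypothetical 2 x 2 reference field on the coarse grid (illustrative values only)
eta_ref_coarse = np.array([[1.0, 2.0], [3.0, 4.0]])
eta_ref_fine = refine_nearest(eta_ref_coarse)   # 4 x 4 grid, no new values introduced
eta_back = aggregate_mean(eta_ref_fine)         # block means recover the coarse field
```

Nearest-neighbour refinement only duplicates existing values, whereas block averaging moves the simulated field to the reference grid, so in neither case are values created that the reference product does not contain.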
Section 3.2
Please justify why modelling scenarios (b) and (c) are appropriate for evaluating the proposed metric.
Given that the objective is to assess spatial ET patterns, I am particularly surprised that spatially uniform model parameters are employed. ET patterns are expected to be strongly influenced by spatial variability in land use, vegetation, and soils. Consequently, I question whether the chosen model configuration is suitable for evaluating a spatial ET performance metric.
Section 3.3
Why is each catchment calibrated independently? One of the motivations for spatial performance metrics is to improve the simulation of large-scale spatial patterns. It would therefore seem more appropriate to calibrate a common parameter set capable of reproducing realistic ET gradients across multiple catchments.
Please also clarify how the ET metrics were calculated. Were they computed from daily/monthly ET maps, or from long-term averages/climatologies? How were the ET maps aggregated temporally before calculating the metrics?
Was the discharge KGE also included as an objective during the SPAEF and SPEAK calibrations?
Please specify whether the NRMSE and FSS metrics were calculated for both discharge and ET, or only for one of these variables.
Section 4
Figure 3: Panels (b) to (f) are missing units on the y-axis.
Figure 4: Since SPAEF and SPEAK are designed to be insensitive to bias, I do not see the relevance of presenting absolute ET errors. It is commonly the discharge that is used for water balance closure at basin scale. Also, absolute values of ET products are associated with uncertainties but often contain useful spatial pattern information. This has motivated the definition of the bias-insensitive SPAEF metric.
Figure 6: Similar to Figure 4, the discussion focuses on relative errors, although SPAEF and SPEAK are intentionally insensitive to bias. I therefore question the usefulness of these comparisons.
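For reference, the bias insensitivity invoked in the Figure 4 and Figure 6 comments can be checked with a minimal sketch of SPAEF following the three-component form given by Koch et al. (2018): Pearson correlation, coefficient-of-variation ratio, and histogram intersection of z-scored fields. The number of bins, the synthetic gamma-distributed field, and the function name are assumptions made here for illustration; this is not the authors' implementation.

```python
import numpy as np


def spaef(sim, obs, bins=100):
    """SPAEF = 1 - sqrt((alpha-1)^2 + (beta-1)^2 + (gamma-1)^2); bins is an assumed default."""
    sim = np.asarray(sim, dtype=float).ravel()
    obs = np.asarray(obs, dtype=float).ravel()
    keep = ~(np.isnan(sim) | np.isnan(obs))
    sim, obs = sim[keep], obs[keep]

    alpha = np.corrcoef(sim, obs)[0, 1]                        # spatial correlation
    beta = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # CV ratio

    # Histogram intersection of the z-scored fields on a shared set of bins
    z_sim = (sim - sim.mean()) / sim.std()
    z_obs = (obs - obs.mean()) / obs.std()
    lo, hi = min(z_sim.min(), z_obs.min()), max(z_sim.max(), z_obs.max())
    h_sim, _ = np.histogram(z_sim, bins=bins, range=(lo, hi))
    h_obs, _ = np.histogram(z_obs, bins=bins, range=(lo, hi))
    gamma = np.minimum(h_sim, h_obs).sum() / h_obs.sum()

    return 1.0 - np.sqrt((alpha - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


# Synthetic, strictly positive "reference" field (illustrative only)
rng = np.random.default_rng(0)
eta_obs = rng.gamma(shape=4.0, scale=0.5, size=400)
eta_sim = 1.3 * eta_obs            # same pattern with a uniform multiplicative bias

score_biased = spaef(eta_sim, eta_obs)
rmse_biased = np.sqrt(np.mean((eta_sim - eta_obs) ** 2))
```

Under these assumptions a uniform multiplicative bias leaves the correlation, the CV ratio, and the z-scored histograms essentially unchanged, so `score_biased` stays close to 1 while `rmse_biased` is clearly non-zero.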

Citation: https://doi.org/10.5194/egusphere-2026-2912-RC2
RC3: 'Comment on egusphere-2026-2912', Juraj Parajka, 03 Jul 2026
General comments
The manuscript proposes and evaluates a new spatial performance metric (SPEAK) for calibrating (semi)distributed rainfall–runoff models. The new framework is tested across 99 Chilean catchments, using ETa as a spatial calibration dataset. The presented objectives fit very well to the scope of HESS.
The Introduction clearly presents the context, and the research gap (histogram sensitivity in SPAEF-type metrics) is reasonably well articulated. The proposed framework, which replaces histograms with K-moment-based smooth densities (KPDF) and a K-moment-based correlation (K-correlation) is a clear, well-motivated methodological contribution. The structure and presentation of the manuscript are generally clear, though some equations (rkm, KPDF derivation) lack an intuitive explanation for readers unfamiliar with the K-moment framework. The main weakness of the manuscript is that, in its current form, some interpretations would benefit from some additional analyses and supporting evidence related to independence and sensitivity of validation metrics, and the impact of the GLEAM resolution (and its refinement) on the evaluation of results:
The applied "independent" validation metrics do not appear to be fully independent of the calibration objective. SPEAK and SPAEF both explicitly optimize a correlation-type term (rkm, r) and a CV-ratio term (βkm, β) during calibration. The paper then compares and validates using NRMSE and Pearson correlation, but Pearson correlation is structurally very similar to SPAEF's own r term. Showing that SPEAK-calibrated fields have a higher Pearson correlation with the reference is not strong independent evidence, given that correlation-like objectives were already used directly during calibration. A more independent test would be to use metrics that were not part of either objective function, e.g., semivariogram/variogram range comparison, Moran's I (spatial autocorrelation structure), or a mutual information index between the simulated and reference fields.
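As an illustration of one of the suggested alternatives, the sketch below computes global Moran's I for a single gridded field. It assumes binary rook-contiguity weights (cells sharing an edge are neighbours) and a regular grid in which NaN marks cells outside the catchment; both choices, and the function name, are assumptions made here rather than anything prescribed in the manuscript or the comment.

```python
import numpy as np


def morans_i(field: np.ndarray) -> float:
    """Global Moran's I on a 2-D grid with binary rook-contiguity weights; NaN cells are skipped."""
    valid = ~np.isnan(field)
    z = np.where(valid, field - np.nanmean(field), 0.0)   # deviations from the spatial mean

    # Horizontally and vertically adjacent pairs in which both cells are valid
    h_pairs = valid[:, :-1] & valid[:, 1:]
    v_pairs = valid[:-1, :] & valid[1:, :]
    cross = (z[:, :-1] * z[:, 1:])[h_pairs].sum() + (z[:-1, :] * z[1:, :])[v_pairs].sum()
    n_pairs = h_pairs.sum() + v_pairs.sum()

    # Each pair is counted twice in both the numerator and the weight sum, so the factor 2 cancels
    n = valid.sum()
    return float(n * cross / (n_pairs * (z ** 2).sum()))


checkerboard = np.indices((4, 4)).sum(axis=0) % 2 * 1.0   # alternating 0/1 cells
gradient = np.tile(np.arange(4.0), (4, 1))                # smooth west-east trend

i_checker = morans_i(checkerboard)   # -1.0: perfect negative spatial autocorrelation
i_gradient = morans_i(gradient)      # positive: neighbouring cells are similar
```

Comparing such a statistic between the simulated and reference ETa fields would test spatial structure without reusing a correlation or CV-ratio term from either objective function.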

Moreover, the NRMSE improvements may be driven mostly by bias/magnitude correction, not spatial pattern. NRMSE is likely sensitive to the combined effects of bias and pattern errors. So, the interpretations will be better supported if the NRMSE is decomposed into bias and unbiased/pattern components.
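The decomposition requested here follows from the identity MSE = (mean bias)² + centred MSE. The sketch below applies it to the un-normalised mean squared error, because the normalisation used for NRMSE in the manuscript is not specified in this discussion; the function name and the toy values are illustrative only.

```python
import numpy as np


def mse_decomposition(sim, obs):
    """Split MSE into a squared mean-bias term and a centred (pattern) term."""
    sim = np.asarray(sim, dtype=float).ravel()
    obs = np.asarray(obs, dtype=float).ravel()
    keep = ~(np.isnan(sim) | np.isnan(obs))
    sim, obs = sim[keep], obs[keep]

    bias_sq = (sim.mean() - obs.mean()) ** 2
    centred = np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2)
    return {"mse": np.mean((sim - obs) ** 2), "bias_sq": bias_sq, "centred_mse": centred}


# Toy example: a simulated field with the right pattern but a constant offset of 0.5
obs = np.array([1.0, 2.0, 3.0, 4.0])
parts = mse_decomposition(obs + 0.5, obs)
# parts["mse"] == 0.25, parts["bias_sq"] == 0.25, parts["centred_mse"] == 0.0
```

In this toy case the entire error is attributed to bias, which is the situation the comment warns could be mistaken for a difference in spatial pattern skill.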

The interpretations based on using GLEAM for validation potentially have the following limitations. Every native 0.10° GLEAM pixel becomes up to 4 correlated 0.05° pixels after interpolation. When SPEAK/SPAEF are computed "concatenated on time" across thousands of grid cells per catchment, a large fraction of those cells are not independent observations of spatial ETa variability. K-moment-based order statistics (Eq. 1) are computed over sample size n, and the true effective n is much smaller than the nominal n once interpolation redundancy is accounted for. This can affect interpretations that favor SPEAK, as it is unclear whether its advantage reflects superior handling of hydrological heterogeneity or is a computational artifact. Another concern is that a number of catchments are smaller than a single native GLEAM 0.10° pixel. For such catchments, the reference dataset contains essentially no independent spatial information, yet these catchments are included in the summarized statistics. The manuscript, results, and interpretations would benefit from stratifying results by catchment area relative to GLEAM's native pixel footprint (e.g., number of native 0.10° pixels per catchment), which would test whether SPEAK's advantage holds or increases as spatial information increases. A further useful robustness check would be to repeat the SPEAK-vs-SPAEF comparison using GLEAM at its native 0.10° resolution (i.e., aggregating simulated ETa up to 0.10° rather than disaggregating GLEAM down to 0.05°).

Specific comments
- Equations 1–8 are dense math with inconsistent notation between the main text and equations (e.g., sub/superscripts for K-moments are hard to follow without more explanation of what K′₁, K′₂ physically represent beyond Table 2).
- Eq. 4 (KCDF) is introduced, but its role in deriving Eq. 5 (KPDF) via "numerical differentiation" is not clear. Perhaps a worked example (even in supplementary material) showing how KPDF is computed from a sample would help readers unfamiliar with Koutsoyiannis's framework.
- The rkm formula (Eq. 3) is presented without much intuition — it is not obvious why a "max ratio" formulation produces something behaving like a correlation coefficient bounded appropriately; a brief justification or reference to boundedness proofs would help.
- The discussion connects reasonably well to previous literature on spatial calibration; however, the manuscript would benefit from additional comparisons of SPEAK against the other SPAEF variants listed in Table 1. Given that Table 1 lists alternative competing metrics, benchmarking against 1–2 of them (e.g., WSPAEF, which also addresses histogram sensitivity) would substantially strengthen the novelty claims.
- The results and interpretations do not include an assessment of calibration uncertainty/equifinality. The results (especially catchment-count comparisons like "85 of 99 catchments") could be sensitive to the calibration method. Please consider discussing whether the SPEAK/SPAEF differences reflect the difference between the metrics rather than variability in the calibration approach.

Citation: https://doi.org/10.5194/egusphere-2026-2912-RC3
Data sets

Codes And Data For "SPatial Efficiency And Kmoments (SPEAK): Evaluating Spatial Consistency In (Semi)Distributed Rainfall–Runoff Models" M. Moreno et al. https://doi.org/10.17605/OSF.IO/86VQ3

Short summary

We developed a new method called SPEAK to better evaluate how well hydrological models reproduce spatial patterns such as evapotranspiration across different regions. We tested it in 99 Chilean catchments with different climates and terrain types. Compared with existing methods, SPEAK more accurately represented spatial patterns and was less affected by map resolution or catchment characteristics. This could help improve water resource modelling and planning across diverse local conditions.

