Technical note: Euclidean Distance Score (EDS) for algorithm performance assessment in aquatic remote sensing
Abstract. In the absence of community consensus, there remains a gap in standardized, consistent performance assessment of remote-sensing algorithms for water-quality retrieval. Although the use of multiple metrics is common, whether reported individually or combined into scoring systems, approaches are often constrained by statistical limitations, redundancy, and dataset- and context-dependent normalizations, leading to subjective or inconsistent interpretations. To address this, we propose the Euclidean Distance Score (EDS), which integrates five statistically appropriate and complementary metrics into a composite score. Capturing three core aspects of performance (regression fit, retrieval error, and robustness), EDS is computed as the Euclidean distance from an idealized point of perfect performance, providing a standardized and interpretable measure. We demonstrate the applicability of EDS in three scenarios: assessing a single algorithm for different retrieved variables, comparing two algorithms on shared retrievals, and evaluating performance across contrasting trophic conditions. By offering an objective framework, EDS supports consistent validation of aquatic remote sensing algorithms and transparent comparisons in varied contexts.
The paper proposes a strategy for algorithm comparison and evaluation by designing a single metric that combines multiple metrics. This is a solid progression from previous (referenced) work on metrics for algorithm assessment. The “Euclidean Distance Score” (EDS) is a strong approach to summarizing the data. A critical objective of the authors is to identify only the metrics that are relevant, and summarize those, rather than including many (often closely related) metrics and leaving it to the reader to make sense of them. I will say that this paper was a pleasure to review, and it will become an excellent paper that should be quite important (and hopefully well used). But it does need revision to make sure it is correct.
A concern with combining metrics is how to “normalize” metrics that have quite disparate ranges. This approach addresses that by using ratios and proportions, which are unitless. That is a good, non-arbitrary approach. While it does not force results to fall between 0 and 1, it is set up with two strong conditions: an EDS = 1 is “perfect”, and any EDS < 0 is unacceptably poor. Each of the input parameters to the EDS will typically lie between 0 and 1; the ones that need not (proportional slope deviation, proportional error, and proportional bias) indicate genuinely unacceptable performance if their values exceed 1.
I have two large concerns, both of which should be directly solvable. First, the choice of input parameters. Second, whether the parameters are correctly configured in the EDS equation.
The inputs are R (Pearson correlation coefficient), linear regression slope calculated in log space (m), median ratio error (e ~ epsilon), median ratio bias (B ~ beta), and valid retrieval ratio (n).
The question is: are these all robust and independent?
Of these, e, B, and n are quite good. It is true that e and B are not actually independent, but since there appears to be no robust means of separating the two (de-biasing the error would require calculating mean rather than median errors, which brings back non-robust methods), we can go with it. As a practical matter, a competent product should tend toward a bias ratio of 1. If it does not, it is punished relatively severely, since e >= B: a biased “low-error” model will probably score worse than an unbiased, relatively high-error model. This should be noted in the paper.
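To make the e >= B relationship concrete, here is a minimal sketch of median ratio statistics computed in log space, in the style of the Seegers et al. metrics the paper builds on. The function name `median_ratio_stats` and the exact log-space formulation are my illustrative assumptions; the paper's own definitions may differ in detail.

```python
import numpy as np

def median_ratio_stats(model, obs):
    """Median ratio bias (B) and median ratio error (e), computed in
    log space so over- and under-estimation are treated symmetrically.
    Sketch of a common Seegers-et-al.-style formulation; the paper's
    exact definitions may differ."""
    logr = np.log10(np.asarray(model) / np.asarray(obs))
    B = 10 ** np.median(logr)           # ratio bias; 1 is perfect
    e = 10 ** np.median(np.abs(logr))   # ratio error; its deviation from 1 >= B's
    return e, B

# A severely biased but otherwise perfect model: e equals B, so the
# bias is "punished" through both terms, as the review notes.
obs = np.array([1.0, 2.0, 4.0, 8.0])
e, B = median_ratio_stats(2 * obs, obs)
print(e, B)  # both ~2 for a uniform 2x bias
```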
At lines 24-28 the paper notes the problem with root-mean-square error metrics. This is a critical point. Basically, the paper sets out that robust metrics should be used, which is why it proposes the median-based e and B. However, Pearson correlation and the linear regression slope are least-squares solutions. A Theil–Sen slope, or an equivalent robust estimator, should be used for the slope. This is necessary because many optical models (or for that matter, many models) often deviate at very low or very high values. That statistical leverage will severely alter a least-squares regression slope, but not a robust slope metric.
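The leverage effect is easy to demonstrate. The sketch below (synthetic data, not from the paper) compares an ordinary least-squares slope with a Theil–Sen slope when a handful of high-end retrievals go wrong:

```python
import numpy as np
from scipy.stats import linregress, theilslopes

# Illustrative synthetic retrieval: nearly 1:1 over most of the range,
# but deviating badly at the high end (a few 3x outliers).
rng = np.random.default_rng(0)
obs = np.linspace(0.1, 1.0, 50)
model = obs + rng.normal(0, 0.01, 50)
model[-3:] = obs[-3:] * 3              # high-value leverage points

ols_slope = linregress(obs, model).slope   # pulled well above 1 by leverage
ts_slope = theilslopes(model, obs)[0]      # robust: stays near 1
print(ols_slope, ts_slope)
```

The three outliers only affect a minority of the pairwise slopes Theil–Sen takes the median over, so the robust estimate stays near 1 while the least-squares slope is dragged upward.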
Regression as a metric has an additional critical flaw: it normalizes to the standard deviation of the data. Therefore, an exact subset of a population that has a smaller range will have a lower R value than the full population. (Worse, as observed in Seegers et al., a low-error method applied to a small range of data will have a lower R value than a higher-error method applied to a much larger range of data.) This problem is also seen in Figure 3: oligotrophic water has the smallest error but a low R value, because the range of the data is narrow. Conversely, if the range is large enough, R provides no useful information; both good and poor models can have high R values. Because of this problem, including R means that EDS values are not comparable across different data sets. (There is a good discussion of the problem with R by a top statistician: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf )
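The range dependence can be shown directly: in the sketch below (synthetic, not the paper's data), the same additive error is applied to a wide-range population and to a narrow-range subset-style sample, and R drops for the narrow range even though the error is identical.

```python
import numpy as np

# Same noise, different dynamic range: R rewards range, not accuracy.
rng = np.random.default_rng(1)
x_full = np.linspace(0, 10, 500)   # wide-range population
x_sub = np.linspace(4, 6, 500)     # narrow-range sample
noise = rng.normal(0, 0.5, 500)    # identical error in both cases

r_full = np.corrcoef(x_full, x_full + noise)[0, 1]
r_sub = np.corrcoef(x_sub, x_sub + noise)[0, 1]
print(r_full, r_sub)  # r_full is much higher despite identical error
```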
By the way, R and the linear regression slope are not independent: slope = (S_y / S_x) * R.
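A quick numerical check of that identity (synthetic data, purely illustrative) confirms that the OLS slope carries no information beyond R once the two standard deviations are known:

```python
import numpy as np

# Verify: OLS slope = (S_y / S_x) * R, i.e. the two metrics are
# deterministically linked given the data's standard deviations.
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = 0.7 * x + rng.normal(0, 0.3, 200)

r = np.corrcoef(x, y)[0, 1]
slope = np.polyfit(x, y, 1)[0]
identity = (np.std(y) / np.std(x)) * r
print(slope, identity)  # agree to floating-point precision
```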
As to the input metrics, if they are to be appropriate and consistently robust, the suitable ones would then appear to be:
1. median (Theil–Sen) slope, to capture whether the data generally behave well across the range. (I will say that I don't really like slope, but I do not see a better option, as the alternatives would involve more complex partitioning schemes that are difficult to standardize.)
2. median error
3. median bias
4. valid retrieval ratio n.
Median error and bias do not appear to be correctly specified in EDS equation (7). As these are ratios, shouldn't the terms be (e − 1)² and (B − 1)²? Both are defined as a ratio of E/O (expected/observed), so a value of 1 is perfect and should reduce the term to zero. The equation would be:
EDS = 1 − sqrt[ (m − 1)² + (e − 1)² + (B − 1)² + (n − 1)² ]
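As a sketch only, the suggested form can be written as a small function (the name `eds` and this exact form are my proposal, not the paper's equation (7)); the second call works the bias thought experiment below:

```python
import numpy as np

def eds(m, e, B, n):
    """Suggested EDS form: Euclidean distance of the four robust,
    unitless metrics (Theil-Sen slope m, median ratio error e, median
    ratio bias B, valid retrieval ratio n) from the perfect point
    (1, 1, 1, 1), subtracted from 1. This is the reviewer-suggested
    correction, not the paper's equation (7)."""
    return 1 - np.sqrt((m - 1)**2 + (e - 1)**2 + (B - 1)**2 + (n - 1)**2)

print(eds(1, 1, 1, 1))   # perfect algorithm -> 1.0
print(eds(1, 2, 2, 1))   # slope 1 but uniformly 2x biased (e = B = 2)
```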
The authors might ponder thought experiments as examples (suggestion only). I did only one: an algorithm whose results all fall on an exact line with a slope of 1, but which is severely biased. Error (e ~ epsilon) and bias (B ~ beta) will be equal. If the bias is 2x, which is a low performance, then e = B = 2 and the EDS would be 1 − sqrt(2) ≈ −0.41, i.e., below zero, correctly flagging the algorithm as unacceptable.