This work is distributed under the Creative Commons Attribution 4.0 License.
Technical note: Euclidean Distance Score (EDS) for algorithm performance assessment in aquatic remote sensing
Abstract. In the absence of community consensus, there remains a gap in standardized, consistent performance assessment of remote-sensing algorithms for water-quality retrieval. Although the use of multiple metrics is common, whether reported individually or combined into scoring systems, approaches are often constrained by statistical limitations, redundancy, and dataset- and context-dependent normalizations, leading to subjective or inconsistent interpretations. To address this, we propose the Euclidean Distance Score (EDS), which integrates five statistically appropriate and complementary metrics into a composite score. Capturing three core aspects of performance (regression fit, retrieval error, and robustness), EDS is computed as the Euclidean distance from an idealized point of perfect performance, providing a standardized and interpretable measure. We demonstrate the applicability of EDS in three scenarios: assessing a single algorithm for different retrieved variables, comparing two algorithms on shared retrievals, and evaluating performance across contrasting trophic conditions. By offering an objective framework, EDS supports consistent validation of aquatic remote sensing algorithms and transparent comparisons in varied contexts.
Status: open (until 17 Dec 2025)
- RC1: 'Comment on egusphere-2025-4343', Richard Stumpf, 20 Oct 2025
- RC2: 'Comment on egusphere-2025-4343', Anonymous Referee #2, 19 Nov 2025
Review of the “Technical note: Euclidean Distance Score (EDS) for algorithm performance assessment in aquatic remote sensing” by Amanda de Liz Arcari et al.
The manuscript addresses an issue of great importance, which is the assessment and comparison of remote sensing algorithms, given the non-normal distributions of most bio-optical variables. Generally, the text is brief, and the points made by the authors are clear and relevant. I have recognized the need for a robust assessment method for some time, and I see the advantages of the proposed one. However, I think the manuscript is too brief at times, and some sections may benefit from more in-depth explanation.
Firstly, regarding the assumptions made, the reduced major axis regression is mentioned several times, but I find that additional information is needed to clarify the significance of the problem. To illustrate my point, I found an interesting publication by Bilal et al. (2022) in the Encyclopedia of Mathematical Geosciences (https://doi.org/10.1007/978-3-030-26050-7_270-1). This work discusses the presence of errors in both the dependent and independent variables in geosciences, which is exactly what I find missing in this text to highlight the value of this study. Furthermore, I am curious as to why the Pearson correlation coefficient was selected instead of the Mann-Kendall test, which does not have such strict assumptions, particularly when not all variables have ideal log-normal distributions and log-transformation does not always ensure normality.
Secondly, a paper of this nature, aiming to establish a certain assessment standard, should provide a broader explanation of the somewhat arbitrary nature of the logarithm selection mentioned in line 106. I remember being quite confused about this when I was a beginning researcher, and I believe that a methods paper should explain it more thoroughly. Similarly, the definition of the number of valid retrievals in Equation 6 seems rather vague. I would expect a more specific definition of what "valid" means here and how it may affect the results.
Lastly, I appreciate presenting real-life examples. However, I believe that adding a few more commonly used metrics, such as the root mean squared error, and discussing their limitations could help illustrate why the proposed approach is more robust.
To summarize, I find this work to be much needed and valuable, and it is already well-written. However, to convince sceptics and encourage broader application, I recommend providing additional explanations for those entering the field who may not understand the jargon or have not yet grasped all the challenges related to assessing optical algorithms.
Citation: https://doi.org/10.5194/egusphere-2025-4343-RC2
- RC3: 'Comment on egusphere-2025-4343', Anonymous Referee #3, 21 Nov 2025
1. Conceptual and Statistical Foundations
The EDS aggregates correlation, slope, two error metrics, and a retrieval-success ratio into a single Euclidean distance from an ideal point. This mathematical construction assumes that all components behave like orthogonal dimensions with comparable scales and variances, but none of these conditions are satisfied. The five variables have different statistical ranges, distributions, sensitivities, and units. Correlation is bounded between −1 and 1, slope is unbounded, the two error metrics are semi-unbounded positives, and the valid-retrieval ratio is constrained between 0 and 1. Without normalization, the Euclidean combination gives disproportionate weight to whichever component naturally has the largest numerical variance. As a result, the EDS does not behave as a standardized metric despite the authors’ claims.
The manuscript acknowledges possible correlations among components but dismisses them by appealing to conceptual distinctiveness. This is a flawed justification. Independence in a Euclidean-based score must be statistical, not conceptual. Otherwise, redundant information inflates the distance metric and misrepresents differences among algorithms.
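The scale mismatch described above is easy to sketch numerically. The component values below are purely illustrative (not taken from the manuscript), and the distance function is the generic unnormalized Euclidean form the comment refers to:

```python
import math

# Hypothetical ideal point: r = 1, slope = 1, error = 0, bias = 0,
# valid-retrieval ratio = 1 (illustrative only).
ideal = {"r": 1.0, "slope": 1.0, "error": 0.0, "bias": 0.0, "valid": 1.0}

def euclid(components):
    """Unnormalized Euclidean distance from the ideal point."""
    return math.sqrt(sum((components[k] - ideal[k]) ** 2 for k in ideal))

# Algorithm A: good correlation, but an exaggerated (unbounded) slope.
alg_a = {"r": 0.9, "slope": 3.5, "error": 0.2, "bias": 0.1, "valid": 0.95}
# Algorithm B: weak correlation, slope near unity.
alg_b = {"r": 0.3, "slope": 1.1, "error": 0.2, "bias": 0.1, "valid": 0.95}

d_a, d_b = euclid(alg_a), euclid(alg_b)
# The slope term alone contributes (3.5 - 1)^2 = 6.25 to A's squared
# distance, dwarfing every bounded component, so A scores far worse
# than B even though its correlation is much better.
assert d_a > d_b
```

Without per-component normalization, whichever term happens to have the widest numerical range sets the score.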
2. Misinterpretation of Reduced Major Axis (RMA) Regression
The manuscript incorrectly asserts that the RMA slope is not directly dependent on the correlation coefficient. Although the sign is taken from the correlation, the slope is indirectly tied to it because both slope and correlation originate from the same joint distribution of the variables. In log-transformed bio-optical data, which often display heteroscedasticity and skewness, the underlying assumptions of RMA are violated. Treating correlation and slope as independent axes in a distance metric is therefore mathematically incorrect.
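The tie between slope and correlation can be made explicit with a small sketch. The data here are synthetic and assumed for illustration; the identities used (OLS slope = r * s_y/s_x, |RMA slope| = s_y/s_x with the sign of r) are standard:

```python
import math
import random
import statistics

random.seed(0)
# Synthetic "log-transformed" data: y correlated with x (illustrative).
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.8 * xi + random.gauss(0, 0.5) for xi in x]

sx, sy = statistics.stdev(x), statistics.stdev(y)
mx, my = statistics.mean(x), statistics.mean(y)
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((len(x) - 1) * sx * sy)

slope_ols = r * sy / sx                  # ordinary least squares
slope_rma = math.copysign(sy / sx, r)    # reduced major axis

# Both slopes are built from the same moments (sx, sy, r); RMA merely
# drops |r| from the magnitude, so slope and correlation arise from the
# same joint distribution rather than being independent quantities.
assert math.isclose(slope_ols, abs(r) * slope_rma, rel_tol=1e-9)
```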
3. Arbitrary Ideal Point
The EDS defines a single “ideal” point: perfect correlation, a slope of unity, zero error, zero bias, and full retrieval success. These values do not reflect physical or algorithmic realities. Optical water-quality inversions do not inherently aim for a slope of exactly one, because systematic scaling offsets are common due to sensor geometry, illumination, and non-linear IOP–Rrs relationships. Similarly, requiring all retrievals to converge is unrealistic and conceptually meaningless; in optically extreme or highly absorbing waters, algorithm failure reflects physical non-identifiability rather than algorithmic inadequacy.
Because the ideal point is not grounded in physical or statistical theory, the EDS becomes a measure of deviation from an arbitrary and often irrelevant target.
4. Mathematical Instability and Lack of Normalization
Because the score does not normalize its components, it behaves unpredictably across different datasets. Deviations in slope dominate the score because slope is unbounded and can vary widely between algorithms. In contrast, correlation deviations typically contribute very little because the correlation coefficient is tightly bounded and usually relatively high in aquatic data. The error metrics contribute substantially in some cases and minimally in others. The retrieval-success ratio contributes very little because it tends to remain close to unity, except in extreme conditions.
The result is a score dominated by the slope term, which is readily seen in the manuscript’s own examples. This contradicts the authors’ assertion that regression and error contributions “typically weigh equally,” which is demonstrably incorrect.
5. Redundancy Among Metrics
The EDS treats correlation and median accuracy as separate dimensions, but both describe the dispersion between retrieved and observed values. Likewise, slope and symmetric bias both reflect systematic deviations. Treating these as independent components double-counts aspects of performance and violates the assumption that each axis captures unique information. This redundancy undermines the validity of using Euclidean distance.
6. Behaviour Exposed in the Manuscript’s Own Examples
The examples reveal fundamental problems. In the bbp retrieval case, the relative errors and biases are modest and largely within community standards, yet the EDS is extremely low because of one exaggerated slope. This demonstrates that the metric can declare a retrieval nearly worthless even when traditional performance metrics indicate acceptable behavior. Conversely, the Kd example for oligotrophic waters shows low error and very poor correlation, yet the EDS produces a moderate score, despite the algorithm clearly failing to track the variability in the data. These outcomes contradict the stated goals of the metric and show that EDS is not aligned with accepted interpretations of retrieval performance.
7. Inadequate Theoretical Justification for Euclidean Geometry
The authors cite studies that apply Euclidean metrics in other domains, but those works typically include normalization, weighting, or variance scaling—none of which are implemented here. Distance geometry requires justified isotropy; otherwise, the metric becomes arbitrary and misleading. The manuscript does not provide any analysis demonstrating that the proposed five dimensions satisfy the conditions for Euclidean aggregation.
8. Absence of Sensitivity, Stability, or Uncertainty Analyses
A composite score derived from log-normal, skewed data requires uncertainty propagation and sensitivity testing. The manuscript provides neither. Without such analyses, the robustness of EDS under different water types, sample sizes, spectral conditions, or sensor noise cannot be validated. The metric’s behavior could vary dramatically with no clear interpretation.
9. Physically Inconsistent Treatment of Variables
Treating all IOPs as equally suited to the same composite score ignores the optical physics governing remote sensing reflectance. Different IOPs have different dynamic ranges, non-linear sensitivities, and model dependencies. A single, unweighted score cannot uniformly characterize algorithm performance across such heterogeneous variables. This contradicts the manuscript’s claim of general applicability.
Citation: https://doi.org/10.5194/egusphere-2025-4343-RC3
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 619 | 93 | 24 | 736 | 17 | 17 |
The paper proposes a strategy for algorithm comparison/evaluation by designing a single metric that combines multiple metrics. This is a solid progression from previous work (referenced) that looked at metrics for algorithm assessment. The “Euclidean Distance Score” (EDS) is a strong approach to summarize the data. A critical objective of the authors is to identify only the metrics that are relevant, and summarize those, rather than to include lots of (often closely related) metrics and leave it to the reader to make sense of them. I will say that this paper was a pleasure to review, and it will become an excellent paper that should be quite important (and hopefully well used). But it does need revision to make sure it is correct.
A concern with combining metrics is how to “normalize” those metrics that have quite disparate ranges. This approach addresses it by using ratios and proportions, which are unitless. That provides a good approach that is not arbitrary. While it does not force results to be between 0 and 1, it is set up with two strong conditions. An EDS = 1 is “perfect”. Any EDS < 0 is unacceptably poor, and each of the input parameters to the EDS is typically going to be between 0 and 1. The ones that are not (proportional slope deviation, proportional error, and proportional bias) are really unacceptable if the values exceed 1.
I have two large concerns that should be directly solvable. First: the parameters to input. Second is whether the configuration of the equation parameters is correct.
The inputs are R (Pearson correlation coefficient), linear regression slope calculated in log space (m), median ratio error (e ~ epsilon), median ratio bias (B ~ beta), and valid retrieval ratio (n).
The question is: are these all robust and independent?
Of these, e, B, and n are quite good. It is true that e and B are not actually independent, but there appears to be no robust means of separating the two (de-biasing the error means calculating mean errors, rather than median errors, which gets into non-robust methods), so we will go with it. As a practical matter, a competent product should tend toward a bias ratio of 1. If it does not, then it is punished relatively severely, as e >= B. A biased “low error” model will probably do worse than an unbiased, relatively high error model. This should be noted in the paper.
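The e >= B relation can be sketched as follows. The implementation assumes Seegers et al. (2018)-style median-ratio definitions (error from the median absolute log ratio, bias from the signed median log ratio); it is an assumed formulation for illustration, not the manuscript's code:

```python
import math
import statistics

def median_ratio_metrics(modeled, observed):
    """Median symmetric error and bias in the style of Seegers et al.
    (2018); an assumed implementation, shown for illustration only."""
    logs = [math.log10(m / o) for m, o in zip(modeled, observed)]
    y = statistics.median([abs(v) for v in logs])   # error term
    z = statistics.median(logs)                     # bias term
    err = 100 * (10 ** y - 1)
    bias = 100 * math.copysign(10 ** abs(z) - 1, z)
    return err, bias

# A biased "low-scatter" model: every retrieval is 1.8x the observation.
obs = [0.5, 1.0, 2.0, 4.0, 8.0]
mod_biased = [1.8 * o for o in obs]
err_b, bias_b = median_ratio_metrics(mod_biased, obs)

# An unbiased model with moderate scatter (ratios 0.7x / 1.0x / 1.3x).
mod_noisy = [0.7 * obs[0], 1.3 * obs[1], obs[2], 1.3 * obs[3], 0.7 * obs[4]]
err_n, bias_n = median_ratio_metrics(mod_noisy, obs)

# err >= |bias| always, because the median of |log ratios| bounds the
# |median log ratio|; a strongly biased model is thus hit through both
# terms at once, while an unbiased noisy model is hit through err only.
assert err_b >= abs(bias_b) and err_n >= abs(bias_n)
```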
At lines 24-28 the paper notes the problem of using root-mean-square error metrics. This is a critical point. Basically, the paper sets out that robust metrics should be used, which is why the paper proposed median e and B. However, the Pearson correlation and the linear regression slope are least-squares solutions. The Theil-Sen slope, or an equivalent, should be used for the slope. This is necessary, as many optical models (or for that matter, many models) often deviate at very low or very high values. That statistical leverage will severely alter a least-squares regression slope, but not a robust slope metric.
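The leverage effect can be sketched with a pure-Python Theil-Sen estimator (median of all pairwise slopes) against an ordinary least-squares slope; the data are synthetic and chosen only to illustrate a single extreme-value deviation:

```python
import statistics

def ols_slope(x, y):
    """Ordinary least-squares slope."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def theil_sen_slope(x, y):
    """Median of all pairwise slopes, robust to leverage points."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))
              if x[j] != x[i]]
    return statistics.median(slopes)

# Synthetic log-space data on a 1:1 line, plus one high-value outlier
# of the kind optical models often produce at the range extremes.
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 6.0]  # last point deviates badly

# The single leverage point drags the least-squares slope well above 1,
# while the Theil-Sen slope stays at the bulk behaviour of the data.
assert theil_sen_slope(x, y) == 1.0
assert ols_slope(x, y) > 1.3
```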
Regression as a metric has an additional critical flaw: it normalizes to the standard deviation of the data. Therefore, an exact subset of a population that has a smaller range will have a lower R value than the population. (Worse, as observed in Seegers et al., a low-error method with a small range of data will have a lower R value than a higher-error method with a much larger range of data.) This problem is also seen in Figure 3: oligotrophic water has the smallest error, but a low R value. The problem is the narrow range of data. Conversely, if the range is large enough, R provides no useful information; both good and poor models can have high R values. Because of this problem, including R means that EDS values are not comparable across different data sets. (There is a good discussion of the problem of R by a top statistician: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf )
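The range dependence of R can be sketched directly: the same model error applied to a narrow subset of the same data yields a lower R, with no change in actual performance. The data below are constructed for illustration only:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Same model error everywhere: y = x plus a fixed alternating residual.
x_full = [float(i) for i in range(20)]
resid = [0.5 if i % 2 == 0 else -0.5 for i in range(20)]
y_full = [a + r for a, r in zip(x_full, resid)]

# An exact subset restricted to a narrow range (e.g. oligotrophic waters).
x_sub, y_sub = x_full[:5], y_full[:5]

r_full, r_sub = pearson_r(x_full, y_full), pearson_r(x_sub, y_sub)
# Identical residuals, but the narrow-range subset returns a lower R
# simply because R normalizes by the data's own variance.
assert r_sub < r_full
```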
By the way, R and linear regression slope are not independent, slope = S_y/S_x * R.
As to the input metrics, restricting to appropriate and consistently robust choices, the inputs would then appear to be:
1. median (Theil-Sen) slope, to capture whether the data generally behave well across the range. (I will say that I don't really like slope, but I do not see a better option, as the alternatives would involve more complex partitioning schemes that are difficult to standardize.)
2. median error
3. median bias
4. retrievals n.
Median error and bias do not appear to be correctly specified in EDS equation (7). As these are ratios, shouldn't they be (e − 1)² and (B − 1)²? Both are defined as a ratio of E/O (expected/observed), so a value of 1 is perfect and should reduce the term to zero. The equation would be:
EDS = 1 − sqrt[ (m − 1)² + (e − 1)² + (B − 1)² + (n − 1)² ]
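A minimal sketch of this suggested form follows; the function name `eds_proposed` is a hypothetical label for the reviewer's equation, not the manuscript's implementation:

```python
import math

def eds_proposed(m, e, B, n):
    """Reviewer's suggested form: every input is a ratio whose ideal
    value is 1, so each deviation term is (value - 1) squared."""
    return 1 - math.sqrt((m - 1) ** 2 + (e - 1) ** 2
                         + (B - 1) ** 2 + (n - 1) ** 2)

# Perfect algorithm: every ratio at its ideal value of 1 gives EDS = 1.
assert eds_proposed(1, 1, 1, 1) == 1.0

# Slope-only deviation of 1.5 gives EDS = 1 - 0.5 = 0.5.
assert math.isclose(eds_proposed(1.5, 1, 1, 1), 0.5)
```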
The authors might ponder thought experiments as examples (suggestion only). I did only one: an algorithm that has all results on an exact line with a slope of 1, but is severely biased. Error (e ~ epsilon) and bias (B ~ beta) will be equal. If the bias is 2x, which is low performance, both (e − 1)² and (B − 1)² equal 1, and the EDS would return 1 − sqrt(2) ≈ −0.41, flagging the algorithm as unacceptable.