Evaluating uncertainty and predictive performance of probabilistic models devised for grade estimation in a porphyry copper deposit
Abstract. Probabilistic models are used extensively in geoscience to describe random processes as they allow prediction uncertainties to be quantified in a principled way. These probabilistic predictions are valued in a variety of contexts ranging from geological and geotechnical investigations to understanding subsurface hydrostratigraphic properties and mineral distribution. However, there are no established protocols for evaluating the uncertainty and predictive performance of univariate probabilistic models, and few examples that researchers and practitioners can lean on. This paper aims to bridge this gap by developing a systematic approach that targets three objectives. First, geostatistics are used to check if the probabilistic predictions are reasonable given validation measurements. Second, image-based views of the statistics help facilitate large-scale simultaneous comparisons for a multitude of models across space and time, for instance, spanning multiple regions and inference periods. Third, variogram ratios are used to objectively measure the spatial fidelity of models. In this study, the model candidates include ordinary kriging and Gaussian processes, with and without sequential or correlated random field simulation. A key outcome is a set of recommendations that encompass the FLAGSHIP statistics, which examine the fidelity, likelihood, accuracy, goodness, synchronicity, histogram, interval tightness and precision of the model predictive distributions. These statistics are standardised, interpretable and amenable to significance testing. The proposed methods are demonstrated using extensive data from a real copper mine in a grade estimation task, and accompanied by an open-source implementation. The experiments are designed to emphasise data diversity and convey insights, such as the increased difficulty of future-bench prediction (extrapolation) relative to in-situ regression (interpolation). This work presents a holistic approach that enables modellers to evaluate the merits of competing models and employ models with greater confidence by assessing the robustness and validity of probabilistic predictions under challenging conditions.
Status: closed
RC1: 'Comment on egusphere-2024-4051', Anonymous Referee #1, 27 Feb 2025
This manuscript proposes a set of metrics for probabilistic model validation and selection. This approach is applied to a synthetic example and a real case study, with a strong focus on mining applications.
Overall the manuscript is well written, although it could be shortened and the vocabulary is sometimes confusing. Analysis 2 shows some valuable work on metric visualization, but the manuscript lacks a thorough theoretical base and, in the end, the analysis isn't designed in a way that can prove that the metrics - most of which are not new - improve our ability to validate and select models compared to the current best practices. Another objective was to compare kriging and Gaussian processes, but the analysis isn't robust enough to conclude anything there. I think there is potential to make valuable scientific contributions on model comparison and validation based on this application, but this will require a more careful reflection on the study's goals and how to best achieve them.
Major comments:
I find the way you've introduced Gaussian processes (GPs), kriging, and Gaussian simulation confusing, with some comments being factually incorrect. To be clear, kriging and Gaussian processes are both based on the random function concept with Gaussian distributions, so they have the same theoretical basis, and Gaussian process and simple kriging estimate and use their parameters in the same way. This is made clear by Williams & Rasmussen (1995, https://proceedings.neurips.cc/paper_files/paper/1995/file/7cce53cf90577442771720a370c3c723-Paper.pdf):
"Gaussian processes have also been used in the geostatistics field (e .g. Cressie, 1993), and are known there as "kriging"."
And also by Rasmussen & Williams (2006, https://gaussianprocess.org/gpml/chapters/RW.pdf), who repeat that comment on p. 30 and introduce GPs as a linear predictor on p. 17 (which corresponds to the usual way kriging is introduced in geostatistics and the way you've introduced it in the manuscript). The key practical difference between kriging and GPs is the inference of the hyper-parameters:
- In the subsurface we often don't have enough data to robustly estimate a semivariogram model, so for kriging the process is done manually based on an experimental semivariogram. It's been proposed to fit the semivariogram model automatically to the experimental semivariogram (e.g., Pardo-Igúzquiza, 1999, https://doi.org/10.1016/S0098-3004(98)00128-9), but some authors have discouraged it because of risk of biases.
- In machine learning (where GPs have been formalized), the goal is to automate as much as possible, so hyperparameters are optimized based on the negative log likelihood.
So when comparing kriging and GPs, you're not comparing two different approaches, you're comparing two different ways of inferring the hyper-parameters of the same approach (at least for simple kriging, ordinary kriging with a local neighborhood is different, but equivalent techniques exist in the GP world, see Nearest Neighbor Gaussian Processes).
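As a minimal illustrative sketch of the automated (negative log likelihood) hyperparameter inference described above, scikit-learn's GaussianProcessRegressor fits kernel hyperparameters by maximising the log marginal likelihood; the data, kernel choice and variable names below are hypothetical, not taken from the manuscript:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))                          # sample coordinates (easting, northing)
y = np.sin(X[:, 0] / 15.0) + 0.1 * rng.standard_normal(200)     # synthetic grade-like values

# Kernel hyperparameters (analogues of sill, range and nugget) are optimised by
# maximising the log marginal likelihood, i.e. minimising the negative log likelihood.
kernel = ConstantKernel(1.0) * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=5)
gp.fit(X, y)

print(gp.kernel_)                          # fitted hyperparameters
print(gp.log_marginal_likelihood_value_)   # maximised log marginal likelihood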
Then comes the question of transforming the data. As long as this transformation is linear (e.g., simple standardization), there's no problem. But when it is non-linear, back-transforming the mean prediction and any confidence interval from kriging/GP can lead to biases, so it shouldn't be done. There are specific techniques to deal with this (e.g., log-normal kriging), but a simple and generic way of doing this is multi-Gaussian kriging (see Deutsch & Journel, 1997, p.81-82, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf, although I think the original author is probably Verly), where kriging is applied on the transformed data, then we simulate multiple realizations and we back-transform them (so not directly back-transforming the kriging mean as you seem to do). This is equivalent to using SGS directly, and taking the mean of the realizations at each location.
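To make the back-transformation bias concrete, here is a per-location sketch of that procedure, assuming a normal-score transform via scikit-learn's QuantileTransformer and hypothetical Gaussian-space kriging/GP moments (the joint spatial correlation between blocks, which SGS or LU simulation would provide, is ignored for brevity):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
grades = rng.lognormal(mean=0.0, sigma=0.8, size=500)        # skewed raw grades (hypothetical)

# 1. Normal-score transform fitted on the raw grades.
nst = QuantileTransformer(output_distribution="normal", n_quantiles=200).fit(grades.reshape(-1, 1))

# 2. Hypothetical kriging/GP moments at three target blocks, in Gaussian (normal-score) space.
mu_g = np.array([0.3, -0.5, 1.1])
sd_g = np.array([0.4, 0.6, 0.5])

# 3. Simulate many realisations, back-transform EACH realisation, then average.
n_real = 1000
sims_g = rng.normal(mu_g, sd_g, size=(n_real, mu_g.size))
sims_raw = nst.inverse_transform(sims_g.reshape(-1, 1)).reshape(n_real, -1)
unbiased_mean = sims_raw.mean(axis=0)

# Directly back-transforming the Gaussian-space mean is biased under a non-linear transform.
biased_mean = nst.inverse_transform(mu_g.reshape(-1, 1)).ravel()
print(unbiased_mean, biased_mean)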
Following up on simulation, Gaussian simulation is a way of sampling from the normal distributions of a Gaussian process while preserving the covariance function away from the data. It can be done in one generative step based on LU decomposition or on an FFT when using a grid, or sequentially based on the chain rule of probability. This rule tells us that the fields generated in one step or sequentially are equivalent. What breaks this equivalency is when we use a neighborhood in the sequential predictions, which is often done in practice for efficiency.
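For reference, a minimal sketch of the one-step ('LU'/Cholesky) route for unconditional Gaussian simulation; conditioning on data (e.g., via kriging of the residuals) and the sequential alternative are omitted, and the grid and covariance model below are hypothetical:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import cholesky

rng = np.random.default_rng(2)
coords = rng.uniform(0, 50, size=(300, 2))        # simulation nodes (hypothetical grid)

def exp_cov(h, sill=1.0, a=15.0):
    # Exponential covariance model with practical range a (an assumed choice).
    return sill * np.exp(-3.0 * h / a)

C = exp_cov(cdist(coords, coords)) + 1e-8 * np.eye(len(coords))   # jitter for numerical stability
L = cholesky(C, lower=True)

# One generative step: each realisation reproduces the covariance model by construction.
realisations = (L @ rng.standard_normal((len(coords), 100))).T    # 100 unconditional realisations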
The entire manuscript needs to be reshaped to properly account for theory. This includes not introducing kriging and GPs as two different methods, being clear on how hyper-parameters are inferred, and not testing cases that we know from theory are not optimal (e.g., using the back-transformed mean of kriging/GP, using too few realizations) or the same (comparing one step generative method with a sequential approach, unless you use a restricted neighborhood in the sequential scheme, but then this needs to be detailed). The methods also need to be clearly explained (e.g., what you're doing in GP(L) isn't explained with enough details, what SK-SGS and OK-SGS mean isn't self-explanatory: are you first using SK or OK then sampling from the distributions using SGS, or are you directly sampling using SGS with SK or OK? Those aren't quite the same).
On that note, many references are not the original sources but more recent ones. Besides being unfair to the original authors, I strongly encourage you to have a proper look at the original studies of the methods you're using.
I also miss a proper literature study and discussion around what has been done on model validation and selection in statistics, geostatistics, and machine learning. You state in the abstract that "there are no established protocols for evaluating the uncertainty and predictive performance of univariate probabilistic models", which is a baseless, and even false, claim. A lot has been done on evaluating predictive performance, and scikit-learn has a whole documentation about it:
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data
Cross-validation is also the standard method for validation in geostatistics, and is used in Deutsch (1997), which you refer to multiple times in your work. In statistics, you can have a look at the work of Aki Vehtari, who has done many studies on validation, including of GPs. Validation of uncertainty is less often done in practice indeed, but that doesn't mean that no work has been done on it. In the end, I remained unconvinced that the proposed FLAGSHIP approach improves our ability to validate probabilistic models. The key problem here is that you don't have any baseline to compare to, so you can't prove that your method improves anything.
I think a simple strategy with a cross-validation (based on group k-fold to assess extrapolation to different domains and intervals), and as metrics R2 to assess predictive performance and expected calibration error for uncertainty (what Deutsch (1997) calls precision basically) would be enough. Those two metrics can also be easily visualized in a similar manner (like in your figure 19), which helps with interpretation and decision making. The log likelihood or negative log probability might be worth testing as a complement. But using too many metrics (including some that can be biased, as you mention in the conclusion) is just counterproductive.
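A rough sketch of the kind of baseline suggested above: group k-fold cross-validation scored with R2 and a simple interval-coverage calibration error; the data, domain labels and kernel settings are hypothetical:

import numpy as np
from scipy.stats import norm
from sklearn.model_selection import GroupKFold
from sklearn.metrics import r2_score
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(300, 2))
y = np.sin(X[:, 0] / 15.0) + 0.1 * rng.standard_normal(300)
domains = rng.integers(0, 5, size=300)        # hypothetical geological domain labels (groups)

nominal = np.linspace(0.1, 0.9, 9)            # nominal two-sided interval probabilities
r2s, cal_errs = [], []
for train, test in GroupKFold(n_splits=5).split(X, y, groups=domains):
    gp = GaussianProcessRegressor(RBF(10.0) + WhiteKernel(0.1), normalize_y=True)
    gp.fit(X[train], y[train])
    mu, sd = gp.predict(X[test], return_std=True)
    r2s.append(r2_score(y[test], mu))
    # Empirical coverage of symmetric predictive intervals vs their nominal probability.
    z = norm.ppf(0.5 + nominal / 2.0)
    covered = np.abs(y[test] - mu)[:, None] <= sd[:, None] * z[None, :]
    cal_errs.append(np.mean(np.abs(covered.mean(axis=0) - nominal)))

print(f"R2 = {np.mean(r2s):.3f}, calibration error = {np.mean(cal_errs):.3f}")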
Specific comments:
Abstract
Line 2: I'm not sure that "valued" is the right term there, maybe "useful" or "essential" would be more appropriate.
Line 4: That is a bold statement that statisticians won't enjoy reading. Regarding predictive performance it's just wrong, a lot of work has been done on this in statistics and data science, and cross-validation is mentioned in the classical geostatistics textbooks as well (see Deutsch & Journel, 1997, p.94, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf). Regarding uncertainty it is more debatable in the sense that it is less often done in practice, but techniques based on cross-validation also exist for this.
Line 10: The "with or without sequential or correlated random field simulation" isn't really clear at this stage. I suggest rephrasing this sentence.
Line 11: Fidelity to what? What's goodness? How is it different from accuracy or precision? What is synchronicity? How is histogram an evaluation metric? This feels like a somewhat random list of concepts to get a nice acronym, but what it actually means (and whether it is any useful) remains unclear.
Line 14: Data diversity isn't a property of the experiments in this case, it's a property of the datasets used in the experiments. If you use a single dataset with a single data type, then the data diversity is pretty poor. But the abstract isn't clear on that.
1. Introduction
Line 27: A single realization isn't enough to quantify uncertainty, you need several ones.
Line 28: "It requires geostatistical simulation of each hydrostratigraphic layer using boundary points specified by geologists." I'm not sure what you mean by that here, nor how it is helpful to understand what you are trying to do.
Line 32: This is the first paragraph of your introduction, yet you start by mentioning studies that are not relevant to your own work? Your goal here is to catch readers' interest, so better go straight to why your specific work is important.
Line 31: Three studies are a little light for an "intense interest", which only started in the 2020s it seems.
Line 34: "Rapid development"? Interest around probabilistic modeling for the subsurface started in the 1950s, so this statement combined with citations from the 2020s leaves me puzzled. I hope you realize the long history behind probabilistic modeling, both in statistics and geostatistics.
Line 36: It would be much better to cite the original studies on kriging here (so Matheron's work, although Krige's work could also be cited).
Line 40: GPs aren't an alternative to kriging, they are the same approach (the only difference comes from what is considered a hyperparameter and how they are estimated).
Lines 41-42: Kriging doesn't require a sequential scheme for predictions, this is just a way to alleviate the extra computational cost that comes with more data and larger grids. And a similar scheme has been proposed with GPs (see Nearest Neighbor Gaussian Processes). Sampling from the posterior distributions (i.e., simulation) is required by both kriging and GPs in case of non-linear transformation (something that is very clear in geostatistics, less so in machine learning unfortunately). And GPs suffer from the exact same limitations as kriging regarding the non-reproduction of the covariance function (again, they are the same method).
Line 47: At this stage the motivation and goals of the paper are becoming blurry. It would be better to reorganize the introduction to bring the different goals together.
Line 49: The focus on mining is important and should be better highlighted in the introduction and abstract.
Line 63: This is the third time that this objective is mentioned as such in the introduction, which highlights an issue with the flow of ideas and the logic of the introduction.
Line 66: RMSE is just a validation metric. Considering errors as independent or dependent isn't related to the validation metric, it's a modeling decision. And both kriging and GPs assume that errors are independent. If validation shows correlation in the errors, then either there is an issue with the prior model (i.e., the covariance function) or kriging/GP isn't appropriate for the dataset at hand.
Line 67: What is the link between errors being dependent and uncertainty quantification? And to any limitations of RMSE as a metric? This whole paragraph is really unclear.
2. Geostatistical modelling
Line 81: That's a fundamental viewpoint of random processes, not necessarily of probabilistic models.
Line 112: What past experience? This sounds oddly unscientific, why not do hyperparameter tuning based on a cross-validation, as is standard in machine learning (and should be standard in geostatistics but unfortunately isn't always done)?
Line 130: Indeed, so why introduce kriging and GPs as two different techniques?
Line 162: It's important to mention there that Gaussian simulation is required to preserve the spatial structure of the covariance function (at least away from the data, as the covariance function is part of the prior and will get updated by the data) but not sequential Gaussian simulation. The former can be done through LU decomposition, which implies building a very large covariance matrix, so the latter alleviates computational costs by using the chain rule of probability and building instead multiple smaller covariance matrices. And Gaussian simulation refers to a set of techniques to sample from a Gaussian process, so it isn't independent from kriging in that sense.
Line 163: It would be much better to refer to the original study introducing SGS (I suppose that might be Gomez-Hernandez & Journel, 1993, Joint Sequential Simulation of MultiGaussian Fields).
Line 180: That definition was already given, since random process, random field, and random function refer to the same thing essentially.
Line 190: At this stage we have no explanation of what SK-SGS and OK-SGS mean, so further explanation is required of what they mean and why you picked those options specifically.
Line 190: How do you condition the mean on the local neighborhood in the GP? The previous sections were quite generic, and in the end they give little detail as to what you do exactly.
Line 192: Why use an isotropic covariance function for kriging and an anisotropic one for the GP? This creates a bias in your experiments, especially for your goal of comparing the performance of kriging and GPs.
Lines 196-199: This is the wrong procedure when dealing with transformed data, you should sample from the predicted distributions (using SGS for instance), back-transform the realizations, then take the mean. Otherwise your predictions will be biased (see Deutsch & Journel, 1997, p.81-82, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf).
3. Geostatistical measures
Line 202: Before defining the metrics used for validation, it would be better to define the validation strategy: how do you assess the generalization error and the uncertainty quantification? Do you use a holdout validation? A cross-validation? Or do you just do some form of residual analysis? It seems that you're doing a mix, but that's not so clear.
Line 208: How are the ground-truth histograms defined? In a real case we won't have a ground truth, so how can we translate your method to real cases?
Line 209: Why just with the mean predictions? Why not with the full distributions?
Lines 210-211: You need to explain exactly why you picked those approaches, and more importantly why you picked four approaches doing essentially the same thing and not just one. What extra insight do we get by using all those?
Line 242: It's a loss of spatial fidelity only if you assume that a covariance function (or semivariogram model) can be robustly estimated from the data.
Line 246: The range and nugget would be more appropriate parameters to assess a reduction in spatial variability. So why use the sill?
Equation 22: Is r_model(d) defined anywhere? What is that spatial fidelity supposed to be measuring? And what's its theoretical justification?
Line 256: There we touched upon how to use those metrics in practice, but this is too little too late. We need clearer explanations on the overall validation methodology and its theoretical justification.
Line 260: Better to stay consistent with the notation than creating an unnecessary source of potential confusion.
Line 305: Why not use the negative log probability of the target under the model as a metric that can also capture the accuracy of the mean prediction (see Rasmussen & Williams, 2006, p.23, https://gaussianprocess.org/gpml/chapters/RW.pdf)?
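For concreteness, the Gaussian negative log predictive probability referred to here is, per validation point (a sketch; the array names are hypothetical):

import numpy as np

def negative_log_prob(y, mu, sd):
    # -log p(y | mu, sd) under a Gaussian predictive distribution
    # (cf. Rasmussen & Williams, 2006, p. 23); it penalises both a misplaced
    # mean and a mis-stated predictive standard deviation.
    return 0.5 * np.log(2.0 * np.pi * sd**2) + (y - mu) ** 2 / (2.0 * sd**2)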
Line 314: What happens when you don't have a ground truth and a small, potentially biased validation set?
Line 333: "Model 1 is simulated using a uniform distribution": What does that mean exactly?
Figure 3: The figures are hard to read unless one zooms in a lot. It would be better if the font size of the figure was closer to the font size of the text. Also shouldn't there be an uncertainty estimate?
Line 340: So here you're doing a residual analysis, is that correct? That doesn't tell you anything about generalization error and about the ability of the models to extrapolate though, and the ability of your metrics to reflect that.
Table 2: Having so many non-standard metrics is just counterproductive in my opinion, because figuring out what each one means gets complicated, so comparing the models gets complicated. Having just two or maybe three well-chosen metrics would be much more efficient.
4. Experiments
Line 374: So here you're looking at extrapolation, although the previous synthetic example wasn't. Being consistent between the synthetic case study where we have a ground truth for comparison and the real case study would make the analysis of the proposed method more robust.
Line 383: I would add a link to the repository here.
5. Results
Line 412: That part is incredibly confusing, because it states basic properties of simulations (the more simulations, the closer to the kriging/GP predictions), but also suggests that fewer simulations get you closer to the ground truth, which is wrong, this is just an artifact from using too few simulations.
Figures 7 & 9: The problem here is that the first row is the wrong way of proceeding, while the bottom row is the right way. But this is nothing new, it's basic geostatistical knowledge (see Deutsch & Journel, 1997, p.81-82, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf). I suppose "from_32" means from 32 realizations, but in general you need a few hundred realizations to have a robust estimate of the mean, and a few thousand for the standard deviation.
Figures 8 & 10: I'm not sure what is the point of those figures, they illustrate a very basic property of simulations.
Line 421: Estimation of the variance in GP is the same as in kriging, so it isn't based on the observed values either. Yet there is a difference between the uncertainty quantification from kriging and from the GP, why is that?
Line 425: Any heterogeneity you see there comes from the insufficient number of realizations that you're using. It's a bias, not a feature.
Line 434: At this stage it remains unclear to me what it means to couple SGS or CRF with a GP. Are you sampling from the GP distributions using SGS or CRF?
Table 4: This is quite hard to understand in the end. Some kind of bar plot would already be an improvement.
Line 447: How did you infer the semivariogram model parameters though? That needs to be explained in the method section.
Line 460: This is a basic property of kriging and GPs. You will observe the same for the simulations if you simulate enough realizations. The only difference will be that you'll do the back-transformation correctly, which should impact the results, although I'm not sure to what degree.
Line 469: A key question is: do you need to do so? Reproducing semivariogram models per se isn't particularly valuable, since a model can be wrong anyway. Assessing generalization error is much more valuable.
Line 475: What do you mean by "32 iterations"? I've assumed realizations so far, but now I'm confused. If it's indeed realizations, then you shouldn't cherry pick a number of realizations because you get a better value on your metrics: more realizations mean a more accurate approximation of the full predicted distribution, that predicted distribution might be wrong, but that issue won't be solved by sampling less.
Table 6: This is also tricky to decipher. In the end figure 19 is much easier to interpret.
Line 505 and figures 20 & 21: I don't see what's so special about the synchronicity. In the end you could get the same insights from a map of mean error.
Figure 22: I would put the two subfigures in a single column and increase their size, including font size. Otherwise readability is quite low. It took me some time to figure out what the labels on the horizontal axis meant, I would suggest to move the axis labels to the left-hand side of the axis, with "inference period" above "domain label".
Figure 23: I get what you want to achieve with the dotted cells, but they stand out so much that they attract attention too much, whereas you want people to still focus on the values of your metrics. Maybe filling out the cells in light grey would work better?
Line 677: A reasonable number of realizations is as high as possible, since your approximation of the full distribution improves with the number of realizations.
Line 680: If you didn't use a neighborhood in the SGS, then this is just what theory predicts, hence why it has received little attention.
Line 694: This is why people in statistics and machine learning usually use R2. What Deutsch called accuracy and precision is the same as the expected calibration error in machine learning. While you're right that this isn't as used as it should be, this isn't something new. What I'm missing at this point is actual proof that your proposed FLAGSHIP gives us more insight than those standard metrics.
Line 703: The medical literature has shown that hypothesis testing isn't as robust as we'd like to think, and p-hacking is real. In the end plotting the mean value of your metric and the standard error for each model, and looking at overlap between the standard errors of different models, is a simpler but no-less robust way of assessing the value of each model.
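A minimal sketch of that simpler comparison, using hypothetical per-fold metric values for two competing models:

import numpy as np

rng = np.random.default_rng(5)
# Hypothetical per-fold values of some validation metric for two competing models.
scores = {"OK-SGS": rng.normal(0.62, 0.05, 10), "GP-CRF": rng.normal(0.66, 0.05, 10)}

for name, s in scores.items():
    mean, se = s.mean(), s.std(ddof=1) / np.sqrt(s.size)
    print(f"{name}: mean = {mean:.3f}, standard error = {se:.3f}")
# Overlapping mean +/- SE intervals suggest the difference between models may not be meaningful.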
6. Discussion
Table 16: It's still not clear how your procedure relates to the best practices in statistics, machine learning/data science, and geostatistics. The absence of any literature on those in the entire manuscript is really problematic. From that perspective, the discussion isn't really a discussion (there's no comparison to what has been suggested in the literature, nor discussions on the strengths and limitations of your approach and of your study, nor perspectives for improvement or future work) but more a conclusion. I'm also missing a discussion on kriging vs GP, although that was mentioned as an objective of this work.
7. Conclusions
Line 735: That's highly debatable. Personally, I find that your approach has too many metrics, which blurs the analysis and leads to less clarity.
Lines 736-737: Then why use those metrics? How are they helping to get clarity if they can be biased and another, single metric can do better?
Lines 745-746: You're mixing things up here, the sequential scheme won't improve anything, the final predictions will be the same (which is exactly what the chain rule is saying).
Line 759: You would need a more robust analysis to make such a general conclusion. You can only say it's true for your particular case study and setting.
Citation: https://doi.org/10.5194/egusphere-2024-4051-RC1
AC1: 'Reply on RC1', Raymond Leung, 01 Apr 2025
AC4: 'Reply on RC1', Raymond Leung, 03 Apr 2025
The manuscript was revised to address a large number of review comments with the goal of making its scope, objectives, contributions and application settings clearer while staying true to its course. A list of changes is included as a PDF and attached to https://doi.org/10.5194/egusphere-2024-4051-AC3. The authors have also prepared a "diff" version highlighting the changes made between 04 Feb 2025 and 31 Mar 2025. Unfortunately, the submission system would not allow us to upload this with our comments at this time.
Citation: https://doi.org/10.5194/egusphere-2024-4051-AC4
RC2: 'Comment on egusphere-2024-4051', Anonymous Referee #2, 19 Mar 2025
This manuscript proposes a suite of probabilistic prediction validation measures named "FLAGSHIP" intended for use in evaluation of interpolations in the mining industry.
The presentation and writing are both very polished and I personally could not find any spelling or grammatical errors. There is a significant number of experiments on both real and synthetic data which are all adequately documented. I think some of the concepts here, such as comparing histograms and variograms, have promise, but the execution is poor. Overall the work lacks a principled approach to justifying performance metrics, the experiments are not set up in a way that can lead to insights, there are major theoretical mistakes, and the literature review on predictive performance metrics and paradigms is lacking.
To do this topic justice I advise that the authors do more focused work on a smaller subset of measures and analyze them more thoroughly. I also suggest not focusing on the kriging vs GP comparison, but instead comparing variogram/kernel model choices and fitting methods within each paradigm separately.
Major Comments:
I am recommending that this work be rejected for several reasons, listed here in order of importance:
- The experiments lack purpose. For any experiment, including computer algorithm experiments, there needs to be some prior concept of what the potential outcomes are and what different conclusions would be drawn in each case. In this manuscript we are simply presented with different measures applied to different interpolations and are told one method performs better than another. Some principle needs to be defined for how metrics are assessed and what the experiments are meant to contribute to our understanding.
- There is no theoretical justification of the metrics proposed. Performance metrics are supposed to be abstract proxies for what is valued in the real world (e.g., cost of recovering minerals). There is no discussion here of how the metrics relate to the real world setting or objective.
- The literature review on predictive performance measures is inadequate. There is an enormous amount published on this topic. The authors need to broaden their search outside of geoscience and geostatistics. Consider literature in machine learning, statistics, Bayesian methods, meta-science, and philosophy of science. For things specific to spatial predictions, I know there is a lot in environmental science.
- Trading off different metrics is not discussed. Some of these measures (e.g., accuracy) can be trivially maximized by simply making the prediction standard deviation as wide as possible. Others (e.g., fidelity) can be maximized by over-fitting. How are the different properties to be traded off against each other? The problem is not even acknowledged. There are plenty of existing metrics, such as cross-entropy, which have the tradeoff built in. I get that you may want to measure parts of the objective separately, but a validation framework that does not even acknowledge the trade-off issue can only mislead.
- The distinction between a predicted distribution and a sample from such a distribution is not reflected in the experiments. For example, one cannot simply compare the mean image obtained from OK directly with a single sample or finite sample set obtained from OK_SGS from_n (where n is small), as one is a distribution and the other a realization (or set of realizations) from that distribution. Similarly, comparing a real data histogram to a histogram of a predicted mean is not meaningful. The same also applies to variogram comparison.
- The first reviewer has already detailed the equivalence between kriging and GPs. I will agree here that comparing them is effectively a comparison of how the variogram is fitted and how the posterior distribution is calculated/approximated. I believe many of the conclusions drawn from comparing these two methods are over-generalizations resulting from this lack of theoretical understanding.
Specific Comments:
Line 41: Kriging also provides a predicted mean and variance and the covariance can be calculated as well. The apparent over-smoothing is likely due to specifics of how the variograms are fitted and will be sensitive to details of how hyperparameters are fit. Note, for both kriging and GPs there are multiple methods.
Line 160: The statement that kriging does not reproduce variability between pairs of test points seems wrong to me. A mean predicted image will necessarily be smoother than the true image. The roughness of the mean prediction should not be directly interpreted as the expected roughness of the truth. The variance between two test points can be calculated simply by applying the variogram to their relative lag, thus any bias towards underestimating variance would be due to the variogram fitting process or its restrictions (e.g. stationarity and isotropy), or due to the true process being significantly non-Gaussian.
Line 180: This definition is for a finite field. In general, spatial processes are defined over an infinite number of variables.
Line 191: Both kriging and GP can use isotropic or non-isotropic variograms/kernels. The way it is written here suggests that these are limitations of the methods and not of the specific implementations.
Line 248: Fidelity feels like the wrong word here given that accuracy has no effect on it. Also this measure can be easily maximized by simply over-fitting, so there needs to be some discussion on how it is to be traded off with other measures.
Line 264: This sentence makes no sense to me. What does "once the validation measure is revealed" mean? Conditioning is done on random variable outcomes, not measure types. What does "likelihood that the model is correct" mean? Likelihood is the probability of data given a model as a function of the model; the shaded area is proportional to the probability of data being in that interval given the model as an assumption.
Line 267: The rationale behind 'S' as a measure is not clear to me at all and needs elaboration.
Line 271: Likelihood is the probability density of the known data assuming the candidate model. Why is it defined as cumulative density here?
Section 3.3: I think the kappa statistics are interesting but without discussion on how they can be traded off against each other there is no clear way to use them. There is no discussion on what it is you are looking for from them. There is no mention of the fact that some can be easily maximized by arbitrarily over-fitting or under-fitting predictions.
Line 418: Estimated variance does depend on the data as the variogram is estimated from it. The uniformity you see is due to the regular spacing of the data used combined with the stationarity assumption. Questions relating to legitimacy should be about the stationarity assumption, which is not strictly needed for kriging or GPs. Analysis of the appropriateness of the stationarity assumption should precede the interpolation.
The observed higher spatial variability of GP over SK and OK is simply due to it inferring a different variogram with higher variance at short lags.
I do not understand the logic in comparing nst with SGS or CRF. Variance estimates from single samples are not comparable to whole posterior distributions. The apparent differences are due to misinterpreting what they produce.
Line 430: The ground truth has thicker tails because it is a realization of the random variable which is being compared here to mean predictions. Again, the mistake here is to expect distribution means to have the same properties as realizations from those distributions. The correct thing to do here would be to convolve each prediction mean with its predicted standard deviation Gaussian kernel to obtain a correct posterior expected histogram.
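A small sketch of that convolution, using hypothetical predicted means and standard deviations: sampling each location's predictive Gaussian and pooling the draws gives a posterior expected histogram with appropriately thicker tails than the histogram of the means alone:

import numpy as np

rng = np.random.default_rng(4)
mu = rng.normal(1.0, 0.3, size=2000)          # hypothetical predicted block means
sd = np.full_like(mu, 0.25)                   # hypothetical predicted standard deviations

# Pool draws from each location's predictive distribution N(mu_i, sd_i).
draws = rng.normal(mu, sd, size=(200, mu.size)).ravel()
hist_of_means, edges = np.histogram(mu, bins=40, density=True)
posterior_expected_hist, _ = np.histogram(draws, bins=edges, density=True)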
Section 5.1.2: Again, properties of distributions are being compared to properties of samples. SK_SGS_single is a sample from the SK distributions. Their variograms are not comparable. One could compare the SGS_single variogram with the ground truths, or the SK_SGS_from_highestnumberthatispractical with SK, but not across those groups.
I am leaving out my notes for the remaining sections because they are all about the same point: properties of distributions and samples from those distributions should not be expected to be the same and are not directly comparable.
Citation: https://doi.org/10.5194/egusphere-2024-4051-RC2
AC2: 'Reply on RC2', Raymond Leung, 01 Apr 2025
AC5: 'Reply on RC2', Raymond Leung, 03 Apr 2025
The manuscript was revised to address a large number of review comments with the goal of making its scope, objectives, contributions and application settings clearer while staying true to its course. A list of changes is included as a PDF and attached to https://doi.org/10.5194/egusphere-2024-4051-AC3. The authors have also prepared a "diff" version highlighting the changes made between 04 Feb 2025 and 31 Mar 2025. Unfortunately, the submission system would not allow us to upload this with our comments at this time.
Citation: https://doi.org/10.5194/egusphere-2024-4051-AC5
EC1: 'Comment on egusphere-2024-4051', Thomas Poulet, 20 Mar 2025
First, I would like to sincerely thank both reviewers for their thorough and detailed evaluations of the manuscript. Their careful assessments and constructive feedback are greatly appreciated.
To the authors, as you prepare your response, I encourage you to focus on addressing the key concerns raised in the reviews, particularly the fundamental issues that have been highlighted. At this stage, rather than engaging with every minor point, it would be most constructive to consider the overarching critiques and how they impact the manuscript’s core contributions.
I look forward to your response.
Best regards,
Thomas Poulet
(Topical editor)
Citation: https://doi.org/10.5194/egusphere-2024-4051-EC1
AC3: 'Reply on EC1', Raymond Leung, 01 Apr 2025
Dear Dr. Poulet,
Thank you for coordinating this review. The authors are grateful for the opportunity to respond to the reviewers' comments and share our perspective. We would like to draw your attention to the three items that have been uploaded.
1. Response to Reviewer 1's critique (https://egusphere.copernicus.org/preprints/egusphere-2024-4051#AC1)
- The authors have responded fully to all of the comments as this was prepared before we received the editor's feedback on 20 Mar 2025.
2. Response to Reviewer 2's critique (https://egusphere.copernicus.org/preprints/egusphere-2024-4051#AC2)
- Following your advice to "focus on addressing the key concerns raised in the reviews, particularly the fundamental issues that have been highlighted", we responded mostly to the overarching issues (main comments) as instructed.
3. Revised manuscript
- The manuscript was modified with the goal of making its scope, objectives, contributions and application settings much clearer, to address reviewers' concerns and accommodate where we can while staying true to its course. This revision was near completion by the time the editor's feedback had reached us.
- A list of changes is provided in the PDF (attached). The full version which highlights the differences between 04 Feb 2025 and 31 Mar 2025 is available for upload. However, we are unable to attach it with these comments at this time due to the regulations of the Copernicus submission system.
Best regards,
Raymond Leung
Corresponding author
-
AC3: 'Reply on EC1', Raymond Leung, 01 Apr 2025
Status: closed
-
RC1: 'Comment on egusphere-2024-4051', Anonymous Referee #1, 27 Feb 2025
This manuscript proposes a set of metrics for probabilistic model validation and selection. This approach is applied to a synthetic example and a real case study, with a strong focus on mining applications.
Overall the manuscript is well written, although it could be shortened and the vocabulary is sometimes confusing. Analysis 2 shows some valuable work on metric visualization, but the manuscript lacks a thorough theoretical base and, in the end, the analysis isn't designed in a way that can prove that the metrics - most of which being not new - improve our ability to validate and select models compared to the current best practices. Another objective was to compare kriging and Gaussian processes, but the analysis isn't robust enough to conclude anything there. I think there is potential to make valuable scientific contributions on model comparison and validation based on this application, but this will require a more careful reflection on the study's goals and how to best achieve them.
Major comments:
I find the way you've introduced Gaussian processes (GPs), kriging, and Gaussian simulation confusing, with some comments being factually incorrect. To be clear, kriging and Gaussian processes are both based on the random function concept with Gaussian distributions, so they have the same theoretical basis, and Gaussian process and simple kriging estimate and use their parameters in the same way. This is made clear by Williams & Rasmussen (1995, https://proceedings.neurips.cc/paper_files/paper/1995/file/7cce53cf90577442771720a370c3c723-Paper.pdf):
"Gaussian processes have also been used in the geostatistics field (e .g. Cressie, 1993), and are known there as "kriging"."
And also by Rasmussen & Williams (2006, https://gaussianprocess.org/gpml/chapters/RW.pdf), which repeat that comment p.30, and introduce GP as a linear predictor p.17 (which corresponds to the usual way kriging is introduced in geostatistics and the way you've introduced it in the manuscript).The key practical difference between kriging and GPs is the inference of the hyper-parameters:
- In the subsurface we often don't have enough data to robustly estimate a semivariogram model, so for kriging the process is done manually based on an experimental semivariogram. It's been proposed to fit the semivariogram model automatically to the experimental semivariogram (e.g., Pardo-Igúzquiza, 1999, https://doi.org/10.1016/S0098-3004(98)00128-9), but some authors have discouraged it because of risk of biases.
- In machine learning (where GPs have been formalized), the goal is to automate as much as possible, so hyperparameters are optimized based on the negative log likelihood.
So when comparing kriging and GPs, you're not comparing two different approaches, you're comparing two different ways of inferring the hyper-parameters of the same approach (at least for simple kriging, ordinary kriging with a local neighborhood is different, but equivalent techniques exist in the GP world, see Nearest Neighbor Gaussian Procceses).
Then comes the question of transforming the data. As long as this transformation is linear (e.g., simple standardization), there's no problem. But when it is non-linear, back-transforming the mean prediction and any confidence interval from kriging/GP can lead to biases, so it shouldn't be done. There are specific techniques to deal with this (e.g., log-normal kriging), but a simple and generic way of doing this is multi-Gaussian kriging (see Deutsch & Journel, 1997, p.81-82, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf, although the original author is probably Verly I think), where kriging is applied on the transformed data, then we simulate multiple realizations and we back-transform them (so not directly back-transforming the kriging mean as you seem to do). This is equivalent to using SGS directly, and taking the mean of the realizations at each location.
Following up on simulation, Gaussian simulation is a way of sampling from the normal distributions of a Gaussian process while preserving the covariance function away from the data. It can be done in one generative step based on LU decomposition or based on a FFT when using a grid, or sequentially based on the chain rule of probability. This rule tells us that the fields generated in one step or sequentially are equivalent. What breaks this equivalency is when we use a neighborhood in the sequential predictions, which is often done in practice for efficiency.
The entire manuscript needs to be reshaped to properly account for theory. This includes not introducing kriging and GPs as two different methods, being clear on how hyper-parameters are inferred, and not testing cases that we know from theory are not optimal (e.g., using the back-transformed mean of kriging/GP, using too few realizations) or the same (comparing one step generative method with a sequential approach, unless you use a restricted neighborhood in the sequential scheme, but then this needs to be detailed). The methods also need to be clearly explained (e.g., what you're doing in GP(L) isn't explained with enough details, what SK-SGS and OK-SGS mean isn't self-explanatory: are you first using SK or OK then sampling from the distributions using SGS or are you using directly sampling suing SGS with SK or OK? Those aren't quite the same).
On that note, many references are not the original sources but more recent ones. Beside being unfair to the original authors, I strongly encourage you to have a proper look at the original studies of the methods you're using.
I also miss a proper literature study and discussion around what has been done on model validation and selection in statistics, geostatistics, and machine learning. You state in the abstract that "there are no established protocols for evaluating the uncertainty and predictive performance of univariate probabilistic models", which is a baseless, and even false, claim. A lot has been done on evaluating predictive performance, and scikit-learn has a whole documentation about it:
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data
Cross-validation is also the standard method for validation in geostatistics, and is used in Deutsch (1997), which you refer to multiple time in your work. In statistics, you can have a look at the work of Aki Vehtari, who has done many studies on validation, including of GPs. Validation of uncertainty is less often done in practice indeed, but that doesn't mean that no work has been done on it.In the end, I remained unconvinced that the proposed FLAGSHIP approach improves our ability to validate probabilistic models. The key problem here is that you don't have any baseline to compare to, so you can't prove that your method improves anything.
I think a simple strategy with a cross-validation (based on group k-fold to assess extrapolation to different domains and intervals), and as metrics R2 to assess predictive performance and expected calibration error for uncertainty (what Deutsch (1997) calls precision basically) would be enough. Those two metrics can also be easily visualized in a similar manner (like in your figure 19), which helps with interpretation and decision making. The log likelihood or negative log probability might be worth testing as a complement. But using too many metrics (including some that can be biased, as you mention in the conclusion) is just counterproductive.
Specific comments:
Abstract
Line 2: I'm not sure that "valued" is the right term there, maybe "useful" or "essential" would be more appropriate.
Line 4: That is a bold statement that statisticians won't enjoy reading. Regarding predictive performance it's just wrong, a lot of work on this has been done on this in statistics and data science, and cross-validation is mentioned in the classical geostatistics textbooks as well (see Deutsch & Journel, 1997, p.94, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf). Regarding uncertainty it is more debatable in the sense that it is less often done in practice, but techniques based on cross-validation also exist for this.
Line 10: The "with or without sequential or correlated random field simulation" isn't really clear at this stage. I suggest to rephrase this sentence.
Line 11: Fidelity to what? What's goodness? How is it different from accuracy or precision? What is synchronicity? How is histogram an evaluation metric? This feels like a somewhat random list of concepts to get a nice acronym, but what it actually means (and whether it is any useful) remains unclear.
Line 14: Data diversity isn't a property of the experiments in this case, it's a property of the datasets used in the experiments. If you use a single dataset with a single data type, then the data diversity is pretty poor. But the abstract isn't clear on that.
1. Introduction
Line 27: A single realization isn't enough to quantify uncertainty, you need several ones.
Line 28: "It requires geostatistical simulation of each hydrostratigraphic layer using boundary points specified by geologists." I'm not sure what you mean by that here, nor how it is helpful to understand what you are trying to do.
Line 32: This is the first paragraph of your introduction, yet you start by mentioning studies that are not relevant to your own work? Your goal here is to catch readers' interest, so better go straight to why your specific work is important.
Line 31: Three studies are a little light for an "intense interest", which only started in the 2020s it seems.
Line 34: "Rapid development"? Interest around probabilistic modeling for the subsurface started in the 1950s, so this statement combined with citations from the 2020s leaves me puzzled. I hope you realize the long history behind probabilistic modeling, both in statistics and geostatistics.
Line 36: It would be much better to cite the original studies on kriging here (so Matheron's work, although Krige's work could also be cited).
Line 40: GPs aren't an alternative to kriging, they are the same approach (the only difference comes from what is considered a hyperparameter and how they are estimated).
Lines 41-42: Kriging doesn't require a sequential scheme for predictions, this is just a way to alleviate the extra computational cost that comes with more data and larger grids. And a similar scheme has been proposed with GPs (see Nearest Neighbor Gaussian Procceses). Sampling from the posterior distributions (i.e., simulation) is required by both kriging and GPs in case of non-linear transformation (something that is very clear in geostatistics, less so in machine learning unfortunately). And GPs suffer from the exact same limitations as kriging regarding the non-reproduction of the covariance function (again, they are the same method).
Line 47: At this stage the motivation and goals of the paper are becoming blurry. It would be better to reorganize the introduction to bring the different goals together.
Line 49: The focus on mining is important and should be better highlighted in the introduction and abstract.
Line 63: This is the third time that this objective is mentioned as such in the introduction, which highlights an issue with the flow of ideas and the logic of the introduction.
Lines 66: RMSE is just a validation metric. Considering errors as independent or dependent isn't related to the validation metric, it's a modeling decision. And both kriging and GPs assume that errors are independent. If validation shows correlation in the errors, then either there is an issue with the prior model (i.e., the covariance function) or kriging/GP isn't appropriate for the dataset at hand.
Line 67: What is the link between errors being dependent and uncertainty quantification? And to any limitations of RMSE as a metric? This whole paragraph is really unclear.
2. Geostatistical modelling
Line 81: That's a fundamental viewpoint of random processes, not necessarily of probabilistic models.
Line 112: What past experience? This sounds oddly unscientific, why not do hyperparameter tuning based on a cross-validation, as is standard in machine learning (and should be standard in geostatistics but unfortunately isn't always done)?
Line 130: Indeed, so why introducing kriging and GPs as two different techniques?
Line 162: It's important to mention there that Gaussian simulation is required to preserve the spatial structure of the covariance function (at least away from the data, as the covariance function is part of the prior and will get updated by the data) but not sequential Gaussian simulation. The former can be done through LU decomposition, which implies building a very large covariance matrix, so the later alleviates computational costs by using the chain rule of probability and building instead multiple smaller covariance matrices. And Gaussian simulation refers to a set of techniques to sample from a Gaussian process, so it isn't independent from kriging in that sense.
Line 163: It would be much better to refers to the original study introducing SGS (I suppose that might be Gomez-Hernandez & Journel, 1993, Joint Sequential Simulation of MultiGaussian Fields).
Line 180: That definition was already given, since random process, random field, and random function refer to the same thing essentially.
Line 190: At this stage we have no explanation of what SK-SGS and OK-SGS mean, so they require further explanations of what they mean and why picking those options specifically.
Line 190: How do you conditioned the mean on the local neighborhood in the GP? The previous sections were quite generic, and in the end give little detail as to what you do exactly.
Line 192: Why using an isotropic covariance function for kriging and an anisotropic one for the GP? This creates a bias in your experiments, especially for your goal of comparing the performance of kriging and GPs.
Lines 196-199: This is the wrong procedure when dealing with transformed data, you should sample from the predicted distributions (using SGS for instance), back-transform the realizations, then take the mean. Otherwise your predictions will be biased (see Deutsch & Journel, 1997, p.81-82, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf).
3. Geostatistical measures
Line 202: Before defining the metrics used for validation, it would be better to define the validation strategy: how do you assess the generalization error and the uncertainty quantification? Do you use a holdout validation? A cross-validation? Or do you just do some form of residual analysis? It seems that you're doing a mix, but that's not so clear.
Line 208: How are the groundtruth histograms define? In a real case we won't have a ground truth, so how can we translate your method to real cases?
Line 209: Why just with the mean predictions? Why not with the full distributions?
Lines 210-211: You need to explain exactly why you picked those approaches, and more importantly why you picked four approaches doing essentially the same thing and not just one. What extra insight do we get by using all those?
Line 242: It's a loss of spatial fidelity only if you assume that a covariance function (or semivariogram model) can be robustly estimated from the data.
Line 246: The range and nugget would be more appropriate parameters to assess a reduction in spatial variability. So why using the sill?
Equation 22: Is r_model(d) defined anywhere? What is that spatial fidelity supposed to be measuring? And what's its theoretical justification?
Line 256: There we touched upon how to use those metrics in practice, but this is too little too late. We need clearer explanations on the overall validation methodology and its theoretical justification.
Line 260: Better to stay consistent with the notation than creating an unnecessary source of potential confusion.
Line 305: Why not use the negative log probability of the target under the mode as a metric that can also capture the accuracy of the mean prediction (see Rasmussen & Williams, 2006, p.23, https://gaussianprocess.org/gpml/chapters/RW.pdf)?
Line 314: What happens when you don't have a ground truth and a small, potentially biased validation set?
Line 333: "Model 1 is simulated using a uniform distribution": What does that mean exactly?
Figure 3: The figures are hard to read unless one zoom in a lot. It would be better if the font size of the figure was closer to the font size of the text. Also shouldn't there be an uncertainty estimate?
Line 340: So here you're doing a residual analysis, is that correct? That doesn't tell you anything about generalization error and about the ability of the models to extrapolate though, and the ability of your metrics to reflect that.
Table 2: Having so many non-standard metrics is just counterproductive in my opinion, because figuring out what each one means gets complicated, so comparing the models gets complicated. Having just two or maybe three well-chosen metrics would be much more efficient.
4. Experiments
line 374: So here you're looking at extrapolation, although the previous synthetic example wasn't. Being consistent between the synthetic case study where we have a ground truth for comparison and the real case study would make the analysis of the proposed method more robust.
Line 383: I would add a link to the repository here.
5. Results
Line 412: That part is incredibly confusing, because it states basic properties of simulations (the more simulations, the closer to the kriging/GP predictions), but also suggest that fewer simulations get you closer to the ground truth, which is wrong, this is just an artifact from using too few simulations.
Figures 7 & 9: The problem here is that the first row is the wrong way of proceeding, while the bottom row is the right way. But this is nothing new, it's basic geostatistical knowledge (see Deutsch & Journel, 1997, p.81-82, http://claytonvdeutsch.com/wp-content/uploads/2019/03/GSLIB-Book-Second-Edition.pdf). I suppose "from_32" means from 32 realizations, but in general you need a few hundred realizations to have a robust estimate of the mean, and a few thousand for the standard deviation.
Figures 8 & 10: I'm not sure what is the point of those figures, they illustrate a very basic property of simulations.
Line 421: Estimation of the variance in GP is the same as in kriging, so it isn't based on the observed values either. Yet there is a difference between the uncertainty quantification from kriging and from the GP, why is that?
Line 425: Any heterogeneity you see there comes from the insufficient number of realizations that you're using. It's a bias, not a feature.
Line 434: At this stage it remains unclear to me what does it mean to couple SGS or CRF with a GP. Are you sampling from the GP distributions using SGS or CRF?
Table 4: This is quite hard to understand in the end. Some kind of bar plot would already be an improvement.
Line 447: How did you infer the semivariogram model parameters though? That needs to be explained in the method section.
Line 460: This is a basic property of kriging and GPs. You will observe the same for the simulations if you simulate enough realizations. The only difference will be that you'll do the back-transformation correctly, which should impact the results, although I'm not sure to what degree.
Line 469: A key question is: do you need to do so? Reproducing semivariogram models per se isn't particularly valuable, since a model can be wrong anyway. Assessing generalization error is much more valuable.
Line 475: What do you mean by "32 iterations"? I've assumed realizations so far, but now I'm confused. If it's indeed realizations, then you shouldn't cherry-pick a number of realizations because you get a better value on your metrics: more realizations mean a more accurate approximation of the full predicted distribution; that predicted distribution might be wrong, but that issue won't be solved by sampling less.
Table 6: This is also tricky to decipher. In the end, Figure 19 is much easier to interpret.
Line 505 and Figures 20 & 21: I don't see what's so special about the synchronicity. In the end you could get the same insights from a map of mean error.
Figure 22: I would put the two subfigures in a single column and increase their size, including font size. Otherwise readability is quite low. It took me some time to figure out what the labels on the horizontal axis meant; I would suggest moving the axis labels to the left-hand side of the axis, with "inference period" above "domain label".
Figure 23: I get what you want to achieve with the dotted cells, but they stand out so much that they attract too much attention, whereas you want people to still focus on the values of your metrics. Maybe filling the cells in light grey would work better?
Line 677: A reasonable number of realizations is as high as possible, since your approximation of the full distribution improves with the number of realizations.
Line 680: If you didn't use a neighborhood in the SGS, then this is just what theory predicts, which is why it has received little attention.
Line 694: This is why people in statistics and machine learning usually use R2. What Deutsch called accuracy and precision is the same as the expected calibration error in machine learning. While you're right that this isn't used as much as it should be, it isn't something new. What I'm missing at this point is actual proof that your proposed FLAGSHIP gives us more insight than those standard metrics.
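For reference, the accuracy/precision assessment of Deutsch and the expected calibration error both reduce to comparing the empirical coverage of central predictive intervals against their nominal probability; a minimal sketch, assuming Gaussian predictive distributions (the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def coverage_calibration(y_true, pred_mean, pred_std,
                         probs=np.linspace(0.1, 0.9, 9)):
    """Empirical coverage of central predictive intervals versus their
    nominal probability, plus the mean absolute miscalibration."""
    coverage = []
    for p in probs:
        half_width = norm.ppf(0.5 + p / 2.0) * pred_std
        coverage.append(np.mean(np.abs(y_true - pred_mean) <= half_width))
    coverage = np.array(coverage)
    ece = np.mean(np.abs(coverage - probs))  # expected calibration error
    return coverage, ece
```

Plotting coverage against the nominal probabilities gives the familiar accuracy plot; the ece value summarises it in a single number.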
Line 703: The medical literature has shown that hypothesis testing isn't as robust as we'd like to think, and p-hacking is real. In the end, plotting the mean value of your metric and the standard error for each model, and looking at the overlap between the standard errors of different models, is a simpler but no less robust way of assessing the value of each model.
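The comparison described here is straightforward to implement; a minimal sketch with placeholder model names and per-block scores (all values are illustrative):

```python
import numpy as np

def summarise(scores_by_model):
    """Report mean and standard error of a per-block metric per model;
    non-overlapping intervals flag a clear difference between models."""
    for name, scores in scores_by_model.items():
        s = np.asarray(scores, dtype=float)
        se = s.std(ddof=1) / np.sqrt(len(s))
        print(f"{name}: {s.mean():.3f} +/- {se:.3f}")

summarise({"model_A": [0.41, 0.38, 0.45, 0.40],
           "model_B": [0.36, 0.40, 0.34, 0.37]})
```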
6. Discussion
Table 16: It's still not clear how your procedure relates to the best practices in statistics, machine learning/data science, and geostatistics. The absence of any literature on those in the entire manuscript is really problematic. From that perspective, the discussion isn't really a discussion (there's no comparison to what has been suggested in the literature, nor any discussion of the strengths and limitations of your approach and of your study, nor perspectives for improvement or future work) but more a conclusion. I'm also missing a discussion on kriging vs GP, although that was mentioned as an objective of this work.
7. Conclusions
Line 735: That's highly debatable. Personally, I find that your approach has too many metrics, which blurs the analysis and leads to less clarity.
Lines 736-737: Then why use those metrics? How are they helping to get clarity if they can be biased and another, single metric can do better?
Lines 745-746: You're mixing things up here: the sequential scheme won't improve anything, and the final predictions will be the same (which is exactly what the chain rule says).
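The chain-rule point can be stated explicitly. Assuming exact conditionals (no neighbourhood truncation), the joint predictive distribution factorises as

\[
p(z_1, \dots, z_n \mid \mathbf{y}) \;=\; \prod_{i=1}^{n} p(z_i \mid z_1, \dots, z_{i-1}, \mathbf{y}),
\]

so drawing the nodes sequentially from these conditionals reproduces exactly the same joint distribution as sampling it directly; the sequential scheme changes the mechanics of sampling, not the distribution being sampled.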
Line 759: You would need a more robust analysis to make such a general conclusion. You can only say it's true for your particular case study and setting.
Citation: https://doi.org/10.5194/egusphere-2024-4051-RC1
AC1: 'Reply on RC1', Raymond Leung, 01 Apr 2025
AC4: 'Reply on RC1', Raymond Leung, 03 Apr 2025
The manuscript was revised to address a large number of review comments with the goal of making its scope, objectives, contributions and application settings clearer while staying true to its course. A list of changes is included as a PDF and attached to https://doi.org/10.5194/egusphere-2024-4051-AC3. The authors have also prepared a "diff" version highlighting the changes made between 04 Feb 2025 and 31 Mar 2025. Unfortunately, the submission system would not allow us to upload this with our comments at this time.
Citation: https://doi.org/10.5194/egusphere-2024-4051-AC4
RC2: 'Comment on egusphere-2024-4051', Anonymous Referee #2, 19 Mar 2025
This manuscript proposes a suite of probabilistic prediction validation measures named "FLAGSHIP", intended for use in the evaluation of interpolations in the mining industry.
The presentation and writing are both very polished and I personally could not find any spelling or grammatical errors. There is a significant number of experiments on both real and synthetic data which are all adequately documented. I think some of the concepts here, such as comparing histograms and variograms, have promise, but the execution is poor. Overall the work lacks a principled approach to justifying performance metrics, the experiments are not set up in a way that can lead to insights, there are major theoretical mistakes, and the literature review on predictive performance metrics and paradigms is lacking.
To do this topic justice I advise that the authors do more focused work on a smaller subset of measures and analyze them more thoroughly. I also suggest not focusing on the kriging vs GP comparison, but instead comparing variogram/kernel model choices and fitting methods within each paradigm separately.
Major Comments:
I am recommending that this work be rejected for several reasons, listed here in order of importance:
- The experiments lack purpose. For any experiment, including computer algorithm experiments, there needs to be some prior concept of what the potential outcomes are and what different conclusions would be drawn in each case. In this manuscript we are simply presented with different measures applied to different interpolations and are told one method performs better than another. Some principle needs to be defined for how metrics are assessed and what the experiments are meant to contribute to our understanding.
- There is no theoretical justification of the metrics proposed. Performance metrics are supposed to be abstract proxies for what is valued in the real world (e.g., cost of recovering minerals). There is no discussion here of how the metrics relate to the real world setting or objective.
- The literature review on predictive performance measures is inadequate. There is an enormous amount published on this topic. The authors need to broaden their search outside of geoscience and geostatistics. Consider literature in machine learning, statistics, Bayesian methods, meta-science, and philosophy of science. For things specific to spatial predictions, I know there is a lot in environmental science.
- Trading off different metrics is not discussed. Some of these measures (e.g. accuracy) can be trivially maximized by simply making the prediction standard deviation as wide as possible. Others (e.g. fidelity) can be maximized by over-fitting. How are the different properties to be traded off against each other? The problem is not even acknowledged (a numerical sketch of the issue follows this list). There are plenty of existing metrics, such as cross-entropy, which have the tradeoff built in. I get that you may want to measure parts of the objective separately, but a validation framework that does not even acknowledge the trade-off issue can only mislead.
- The distinction between a predicted distribution and a sample from such a distribution is not reflected in the experiments. For example, one cannot simply compare the mean image obtained from OK with a single sample or finite sample set obtained from OK_SGS from_n (where n is small), as one is a distribution and the other a realization (or set of realizations) from that distribution. Similarly, comparing a real data histogram to a histogram of a predicted mean is not meaningful. The same also applies to variogram comparison.
- The first reviewer has already detailed the equivalence between kriging and GPs. I will agree here that comparing them is effectively a comparison of how the variogram is fitted and how the posterior distribution is calculated/approximated. I believe many of the conclusions drawn from comparing these two methods are over-generalizations resulting from this lack of theoretical understanding.
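To make the trade-off concern above concrete: inflating the predictive standard deviation trivially raises interval coverage while degrading a proper scoring rule such as the negative log likelihood; a minimal sketch with synthetic, illustrative values only:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 5000)   # synthetic targets
mean = np.zeros_like(y)          # well-centred predictions

for std in (1.0, 3.0, 10.0):     # progressively over-dispersed models
    cover90 = np.mean(np.abs(y - mean) <= norm.ppf(0.95) * std)
    nll = -np.mean(norm.logpdf(y, loc=mean, scale=std))
    print(f"std={std:>4}: 90% coverage={cover90:.3f}, NLL={nll:.3f}")
# Coverage saturates at 1.0 as std grows, but the negative log likelihood
# (a proper score) keeps increasing, exposing the over-wide intervals.
```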
Specific Comments:
Line 41: Kriging also provides a predicted mean and variance, and the covariance can be calculated as well. The apparent over-smoothing is likely due to specifics of how the variograms are fitted and will be sensitive to details of how hyperparameters are fitted. Note that for both kriging and GPs there are multiple methods.
Line 160: The statement that kriging does not reproduce variability between pairs of test points seems wrong to me. A mean predicted image will necessarily be smoother than the true image. The roughness of the mean prediction should not be directly interpreted as the expected roughness of the truth. The variance between two test points can be calculated simply by applying the variogram to their relative lag; thus any bias towards underestimating variance would be due to the variogram fitting process or its restrictions (e.g. stationarity and isotropy), or due to the true process being significantly non-Gaussian.
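For reference, the relation invoked here is the defining property of the semivariogram under the intrinsic hypothesis:

\[
\operatorname{Var}\!\left[Z(\mathbf{x}_1) - Z(\mathbf{x}_2)\right]
= \mathrm{E}\!\left[\bigl(Z(\mathbf{x}_1) - Z(\mathbf{x}_2)\bigr)^2\right]
= 2\,\gamma(\mathbf{x}_1 - \mathbf{x}_2),
\]

so the expected variability between any two test locations follows directly from the fitted variogram evaluated at their separation.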
Line 180: This definition is for a finite field. In general, spatial processes are defined over an infinite number of variables.
Line 191: Both kriging and GPs can use isotropic or non-isotropic variograms/kernels. The way it is written here suggests that these are limitations of the methods and not of the specific implementations.
Line 248: Fidelity feels like the wrong word here given that accuracy has no effect on it. Also, this measure can be easily maximized by simply over-fitting, thus there needs to be some discussion on how it is to be traded off against other measures.
Line 264: This sentence makes no sense to me. What does "once the validation measure is revealed" mean? Conditioning is done on random variable outcomes, not measure types. What does "likelihood that the model is correct" mean? Likelihood is the probability of the data given a model, viewed as a function of the model; the shaded area is proportional to the probability of the data being in that interval given the model as an assumption.
Line 267: The rationale behind 'S' as a measure is not clear to me at all and needs elaboration.
Line 271: Likelihood is the probability density of the known data assuming the candidate model. Why is it defined as a cumulative density here?
Section 3.3: I think the kappa statistics are interesting, but without a discussion of how they can be traded off against each other there is no clear way to use them. There is no discussion of what you are looking for from them. There is no mention of the fact that some can be easily maximized by arbitrarily over-fitting or under-fitting predictions.
Line 418: The estimated variance does depend on the data, as the variogram is estimated from it. The uniformity you see is due to the regular spacing of the data used combined with the stationarity assumption. Questions relating to legitimacy should be about the stationarity assumption, which is not strictly needed for kriging or GPs. Analysis of the appropriateness of the stationarity assumption should precede the interpolation.
The observed higher spatial variability of GP over SK and OK is simply due to it inferring a different variogram with higher variance at short lags.
I do not understand the logic in comparing nst with SGS or CRF. Variance estimates from single samples are not comparable to whole posterior distributions. The apparent differences are due to misinterpreting what they produce.
Line 430: The ground truth has thicker tails because it is a realization of the random variable, which is being compared here to mean predictions. Again, the mistake here is to expect distribution means to have the same properties as realizations from those distributions. The correct thing to do here would be to convolve each prediction mean with a Gaussian kernel of its predicted standard deviation to obtain a correct posterior expected histogram.
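The correction described here amounts to treating the validation-location posteriors as a mixture of Gaussians rather than a set of point values; a minimal sketch, assuming a Gaussian posterior per location (the function and array names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def posterior_expected_histogram(pred_mean, pred_std, bin_edges):
    """Expected bin probabilities when each location has a Gaussian
    posterior N(pred_mean, pred_std**2): average the per-location bin
    probabilities, giving a mixture histogram that is comparable to the
    ground-truth histogram."""
    mu = np.asarray(pred_mean)[:, None]      # (n_locations, 1)
    sd = np.asarray(pred_std)[:, None]
    edges = np.asarray(bin_edges)[None, :]   # (1, n_edges)
    cdf = norm.cdf(edges, loc=mu, scale=sd)  # (n_locations, n_edges)
    return np.diff(cdf, axis=1).mean(axis=0)
```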
Section 5.1.2: Again, properties of distributions are being compared to properties of samples. SK_SGS_single is a sample from the SK distributions. Their variograms are not comparable. One could compare the SGS_single variogram with the ground truth's, or the SK_SGS_from_highestnumberthatispractical with SK, but not across those groups.
I am leaving out my notes for the remaining sections because they are all about the same point: properties of distributions and samples from those distributions should not be expected to be the same and are not directly comparable.
Citation: https://doi.org/10.5194/egusphere-2024-4051-RC2
AC2: 'Reply on RC2', Raymond Leung, 01 Apr 2025
AC5: 'Reply on RC2', Raymond Leung, 03 Apr 2025
The manuscript was revised to address a large number of review comments with the goal of making its scope, objectives, contributions and application settings clearer while staying true to its course. A list of changes is included as a PDF and attached to https://doi.org/10.5194/egusphere-2024-4051-AC3. The authors have also prepared a "diff" version highlighting the changes made between 04 Feb 2025 and 31 Mar 2025. Unfortunately, the submission system would not allow us to upload this with our comments at this time.
Citation: https://doi.org/10.5194/egusphere-2024-4051-AC5
EC1: 'Comment on egusphere-2024-4051', Thomas Poulet, 20 Mar 2025
First, I would like to sincerely thank both reviewers for their thorough and detailed evaluations of the manuscript. Their careful assessments and constructive feedback are greatly appreciated.
To the authors, as you prepare your response, I encourage you to focus on addressing the key concerns raised in the reviews, particularly the fundamental issues that have been highlighted. At this stage, rather than engaging with every minor point, it would be most constructive to consider the overarching critiques and how they impact the manuscript’s core contributions.
I look forward to your response.
Best regards,
Thomas Poulet
(Topical editor)
Citation: https://doi.org/10.5194/egusphere-2024-4051-EC1
AC3: 'Reply on EC1', Raymond Leung, 01 Apr 2025
Dear Dr. Poulet,
Thank you for coordinating this review. The authors are grateful for the opportunity to respond to the reviewers' comments and share our perspective. We would like to draw your attention to the three items that have been uploaded.
1. Response to Reviewer 1's critique (https://egusphere.copernicus.org/preprints/egusphere-2024-4051#AC1)
- The authors have responded fully to all of the comments as this was prepared before we received the editor's feedback on 20 Mar 2025.
2. Response to Reviewer 2's critique (https://egusphere.copernicus.org/preprints/egusphere-2024-4051#AC2)
- Following your advice to "focus on addressing the key concerns raised in the reviews, particularly the fundamental issues that have been highlighted", we responded mostly to the overarching issues (main comments) as instructed.
3. Revised manuscript
- The manuscript was modified with the goal of making its scope, objectives, contributions and application settings much clearer, to address reviewers' concerns and accommodate where we can while staying true to its course. This revision was near completion by the time the editor's feedback had reached us.
- A list of changes is provided in the PDF (attached). The full version which highlights the differences between 04 Feb 2025 and 31 Mar 2025 is available for upload. However, we are unable to attach it with these comments at this time due to the regulations of the Copernicus submission system.

Best regards,
Raymond Leung
Corresponding author
Data sets
EUP3M: Evaluating uncertainty and predictive performance of probabilistic models—Python code for model construction and statistical analysis Raymond Leung and Alexander Lowe https://github.com/raymondleung8/eup3m/blob/main/data
Model code and software
EUP3M: Evaluating uncertainty and predictive performance of probabilistic models—Python code for model construction and statistical analysis Raymond Leung and Alexander Lowe https://doi.org/10.5281/zenodo.14533140
Viewed
HTML | PDF | XML | Total | Supplement | BibTeX | EndNote
---|---|---|---|---|---|---
195 | 39 | 14 | 248 | 28 | 3 | 4