Technical note: Machine learning metamodelling for global sensitivity analysis
Abstract. Global sensitivity analysis (GSA) plays a central role in hydrologic modelling by supporting model understanding, diagnosis, and decision-making through the identification of influential and non-influential parameters and their interactions. Variance-based methods provide a rigorous framework for GSA but are often computationally expensive, as their estimation requires a large number of model evaluations. Metamodelling has therefore been widely adopted as a strategy to alleviate this issue, with recent advances in machine learning (ML) offering new opportunities to construct accurate and flexible surrogates for complex models. This technical note examines the practical relationship between Sobol’ total-effect indices (Ti) and feature importance measures derived from ML metamodels within a hydrologic modelling context. Building on theoretical results that link Ti to permutation variable importance (PVIi) under independence assumptions, we provide systematic numerical evidence using three conceptual hydrologic models of varying complexity (HBV, HyMod, and VIC) applied to three headwater catchments in northern Germany, together with three ML metamodels: a random forest (RF), a neural network (NN), and a linear model (LM). The three metamodels were trained on Monte Carlo samples and used to estimate sensitivities through PVIi and SHapley Additive exPlanations (SHAPi). The results demonstrate that RF and NN metamodels reliably reproduce both the ranking and relative magnitude of Ti using PVIi across all hydrologic models, providing clear empirical support for the theoretical connection between the two measures. In contrast, the performance of LM-based estimates depends strongly on the degree of linearity in the underlying model response. 
Mean absolute SHAPi values exhibit a consistent monotonic relationship with Ti and preserve parameter rankings, while sample-specific SHAPi values enable a distributed evaluation of sensitivities across both the parameter space and the target variable space. Overall, this study highlights ML metamodelling as a computationally efficient and conceptually sound framework for GSA in hydrologic modelling and beyond.
The paper uses hydrological models to generate simulations, then trains ML surrogate models and performs sensitivity and explainability analyses of the parameters. These parameters are often already calibrated or studied in the hydrological modelling literature, but sensitivity analysis is still meaningful because the parameters are treated as uncertain within feasible ranges. This is a sound framework, as opposed to training only standalone ML models such as RF/ANN/LSTM/Transformer and then explaining those models. Sensitivity analysis (SA) of hydrological simulations can be computationally expensive because its cost depends heavily on the sample size and the number of parameters. For hydrological models like HBV/VIC, traditional SA becomes extremely expensive, mainly because of the high number of interactions and uncertain input factors, which cause the curse of dimensionality. The alternative is to run the hydrological model a limited number of times, train an ML surrogate, and use the surrogate for sensitivity analysis, because ML inference is very fast. This is standard surrogate modelling logic. From my understanding, the authors attempt to encourage hydrological modellers to consider surrogate modelling approaches over advanced standalone ML frameworks, mainly because of the computational efficiency offered by surrogate models in the current experiment.
#Q1--The authors should explain why they chose these two models (RF & ANN): why not “XGBoost instead of RF” and “LSTM instead of ANN”?
#Q2--If these parameter ranges are already well established, what new hydrological insight is obtained from surrogate explainability? Or is the aim only to compare SHAP/PVI with traditional Sobol’ SA?
#Q3--It would be valuable to discuss whether the reported agreement between SA and SHAP-based XAI remains consistent under substantially different parameter ranges. In this experiment, RF and ANN provide the same feature rankings. If the setup changes, why would RF and ANN still produce similar sensitivity rankings, even though RF relies on tree splits, recursive partitioning, and ensemble averaging, whereas ANN relies on weighted neurons, nonlinear activations, and gradient optimization? RF and ANN may behave similarly and converge toward similar functional approximations, but this does not necessarily imply methodological equivalence or robustness. Results may vary across random seeds, training samples, architectures, hyperparameters, and basin characteristics. I understand this is a big ask when the paper concentrates on the SA application (PVI & Sobol’) and its comparison with XAI. But one paragraph should be included to help the reader understand: “Do these tools truly provide valuable insights into where ML models ‘align with’ or ‘diverge from’ theoretical expectations or hydrological system understanding?”, as discussed in these papers: https://doi.org/10.1029/2024WR037398, https://doi.org/10.1016/j.envsoft.2026.107007
#Q4--Considering the idea of ML as a surrogate: is it a practical solution when I do not know which surrogate best reproduces the hydrological model behaviour? Can I trust the ML model? Since the sensitivity structure may depend strongly on the prescribed parameter bounds, it is unclear whether the reported consistency between RF/NN-based importance measures and XAI would remain stable under broader or alternative parameter ranges.
#Q5--In Figure 1, although a sanity check has been performed to demonstrate robustness across configurations, the near-complete overlap of the importance curves makes interpretation difficult. Quantitative stability measures would strengthen the conclusions drawn from this figure, as would additional robustness analyses using different random seeds and sampling strategies. For robustness checking, Kendall τ / Spearman ρ, top-k Jaccard, rank variance / entropy, and bootstrap confidence intervals could be considered. Including even a few of these metrics may be sufficient to better justify the robustness claims.
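As a hedged illustration of the rank-agreement metrics suggested in Q5, the following minimal Python sketch computes Spearman ρ, Kendall τ, and top-k Jaccard overlap between two importance vectors. The function names and the values in `sobol_ti` and `pvi` are hypothetical placeholders of my own, not results from the manuscript, and the sketch assumes there are no ties.

```python
from itertools import combinations


def ranks(scores):
    """Return ranks (1 = most important) for a list of scores; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r


def spearman_rho(a, b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(a, b):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(a)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (a[i] - a[j]) * (b[i] - b[j])
        s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def topk_jaccard(a, b, k):
    """Jaccard overlap between the sets of the k highest-scoring parameters."""
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical importance values for five parameters (placeholders only).
sobol_ti = [0.50, 0.30, 0.15, 0.04, 0.01]
pvi = [0.48, 0.33, 0.03, 0.14, 0.02]

rho = spearman_rho(sobol_ti, pvi)      # about 0.9: one adjacent swap in the ranking
tau = kendall_tau(sobol_ti, pvi)       # about 0.8: one discordant pair out of ten
jac = topk_jaccard(sobol_ti, pvi, 3)   # 0.5: two of the top-3 parameters are shared
print(rho, tau, jac)
```

Reporting such numbers across random seeds or resampled training sets would make the degree of overlap in Figure 1 explicit rather than visual.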
#Q6: In lines 385-395 the authors state: “The ability of SHAPi values to characterize how parameter influence varies across both the parameter space and the target….. such as DELSA, while naturally integrating with ML metamodelling frameworks.” The discussion comparing SHAPi with distributed sensitivity approaches such as DELSA (Rakovec et al., 2014; Razavi & Gupta, 2015, 2016) may require additional nuance.
--Here, the technical note appears to implicitly position SHAP-based XAI and GSA (Sobol’ & PVI) as equivalent frameworks. However, while SHAP provides valuable sample-specific feature attributions, it fundamentally operates as a local prediction-explanation method rather than a global sensitivity propagation framework. The two are better described as complementary rather than as the same approach.
--Although aggregation in GSA (DELSA & VARS, as cited by the authors) may obscure localized behaviour, SHAP-based global importance derived from mean absolute SHAP values should not be interpreted as mathematically equivalent to classical GSA. In simpler terms, SHAP and GSA are different but may provide similar feature rankings and complement each other.
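To make this distinction concrete, here is a small worked example of my own (not taken from the manuscript). It assumes an additive model with independent inputs and interventional (marginal) SHAP values; the symbols g_i, a_i, and V are introduced only for this illustration.

```latex
% Assumptions (illustrative only): Y = f(X) = \sum_i g_i(X_i),
% inputs mutually independent, V = \mathrm{Var}(Y).
\begin{align}
T_i &= \frac{\mathbb{E}\left[\mathrm{Var}(Y \mid X_{\sim i})\right]}{V}
     = \frac{\mathrm{Var}\left(g_i(X_i)\right)}{V}, \\
\phi_i(x) &= g_i(x_i) - \mathbb{E}\left[g_i(X_i)\right], \qquad
\mathbb{E}\left|\phi_i(X)\right|
     = \mathbb{E}\left|g_i(X_i) - \mathbb{E}\left[g_i(X_i)\right]\right|.
\end{align}
% Special case: g_i(x_i) = a_i x_i with X_i \sim U(0,1),
% so \mathrm{Var}(g_i) = a_i^2 / 12 and \mathbb{E}|X_i - 1/2| = 1/4.
\begin{align}
T_i = \frac{a_i^2}{\sum_j a_j^2}, \qquad
\mathbb{E}\left|\phi_i(X)\right| = \frac{|a_i|}{4}.
\end{align}
```

With a = (1, 2), the total-effect indices are 0.2 and 0.8 (ratio 1:4), while the mean absolute SHAP values are 0.25 and 0.5 (ratio 1:2). The ranking is identical, but one measure is a variance share and the other a mean absolute deviation in the units of the output, so the magnitudes are not interchangeable.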
#Q7: Positioning SHAPi as a computationally efficient analogue to distributed sensitivity approaches may overstate the equivalence between these methodologies. In particular, the computational efficiency of SHAP depends strongly on the underlying ML architecture (e.g., TreeSHAP versus ANN-based explainers), and its cost may grow substantially with increasing feature dimensionality and sample size. I would refer the authors to this paper: https://doi.org/10.1016/j.envsoft.2026.107007
#Q8: The authors should clarify the SHAP implementation details used for the different surrogate architectures (e.g., TreeExplainer for RF, LinearExplainer for LM, and Kernel/Deep Explainer for ANN, or KernelExplainer for all ML models). Please specify the SHAP explainer used for each ML model, as this information is important for interpreting both the computational cost and the resulting feature attributions.
#Q9: Both the computational efficiency and the resulting feature attributions may depend strongly on the chosen explainer. PVI provides a computationally efficient sensitivity-analysis framework; SHAP-based methods are generally more computationally demanding than PVI, except for optimized implementations such as TreeSHAP. Both KernelSHAP and DeepSHAP are computationally more expensive than TreeSHAP. The authors should explain these points in detail to strengthen the conclusion.
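To illustrate why explainer choice drives cost, the sketch below computes exact interventional Shapley values by brute-force coalition enumeration and counts the model evaluations required. It is my own toy example under stated assumptions: `toy_model`, the single-row `background`, and the explained point `x` are hypothetical, and this is not the algorithm used by any particular SHAP explainer.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, background):
    """Exact Shapley values of f at x by enumerating every coalition.

    Features outside a coalition are replaced by background rows (interventional
    value function). Returns the attributions and the number of calls to f.
    """
    d = len(x)
    n_calls = 0
    cache = {}

    def v(coalition):
        nonlocal n_calls
        key = frozenset(coalition)
        if key not in cache:
            total = 0.0
            for z in background:
                hybrid = [x[j] if j in key else z[j] for j in range(d)]
                total += f(hybrid)
                n_calls += 1
            cache[key] = total / len(background)
        return cache[key]

    phi = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        contrib = 0.0
        for r in range(d):
            # Shapley weight for coalitions of size r that exclude feature i.
            w = factorial(r) * factorial(d - r - 1) / factorial(d)
            for subset in combinations(others, r):
                contrib += w * (v(set(subset) | {i}) - v(subset))
        phi.append(contrib)
    return phi, n_calls


def toy_model(z):
    # Hypothetical response with one interaction term (z[0] * z[2]).
    return z[0] + 2 * z[1] + z[0] * z[2]


x = [1.0, 1.0, 1.0]
background = [[0.0, 0.0, 0.0]]
phi, n_calls = exact_shapley(toy_model, x, background)
print(phi, n_calls)  # attributions sum to f(x) - f(background); 2**3 = 8 model calls
```

Per explained point, the number of model calls is 2^d times the background size, which is why KernelSHAP samples coalitions rather than enumerating them and why its accuracy depends on that sampling budget. PVI, by contrast, needs on the order of d permuted passes over the evaluation set per repeat.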
#Q10: If KernelExplainer is used for the ANN, the authors should state this clearly, as KernelSHAP is highly sensitive to choices such as the background dataset size, the number of explained examples, and nsamples. The study should therefore report how the background set was selected, the nsamples value and other KernelExplainer arguments, and the number of explained points.
#Q11: It is unclear why SHAP-derived global importance rankings were not included in Figure 4 alongside the PVI-based comparisons with Sobol’ Ti. Including normalized mean absolute SHAP rankings could help readers better assess the extent to which SHAP-based interpretations align with classical GSA results.
#Q12: Why are Sections 2.1 & 2.2 presented separately when GSA & Sobol’ appear to describe the same methodology? Similarly, why is there separate section numbering for PVI and the estimation of PVI?