Technical note: Machine learning metamodelling for global sensitivity analysis

Yeste, Patricio; Melsen, Lieke A.; Brêda, João Paulo L. F.; Tacoronte, Nicolás; Saltelli, Andrea; Vannucci, Giulia; Siciliano, Roberta; Bronstert, Axel

doi:10.5194/egusphere-2026-1787

Preprints

https://doi.org/10.5194/egusphere-2026-1787

14 Apr 2026

Abstract. Global sensitivity analysis (GSA) plays a central role in hydrologic modelling by supporting model understanding, diagnosis, and decision-making through the identification of influential and non-influential parameters and their interactions. Variance-based methods provide a rigorous framework for GSA but are often computationally expensive, as their estimation requires a large number of model evaluations. Metamodelling has therefore been widely adopted as a strategy to alleviate this issue, with recent advances in machine learning (ML) offering new opportunities to construct accurate and flexible surrogates for complex models. This technical note examines the practical relationship between Sobol’ total-effect indices (T_i) and feature importance measures derived from ML metamodels within a hydrologic modelling context. Building on theoretical results that link T_i to permutation variable importance (PVI_i) under independence assumptions, we provide systematic numerical evidence using three conceptual hydrologic models of varying complexity (HBV, HyMod, and VIC) applied to three headwater catchments in northern Germany, together with three ML metamodels: a random forest (RF), a neural network (NN), and a linear model (LM). The three metamodels were trained on Monte Carlo samples and used to estimate sensitivities through PVI_i and SHapley Additive exPlanations (SHAP_i). The results demonstrate that RF and NN metamodels reliably reproduce both the ranking and relative magnitude of T_i using PVI_i across all hydrologic models, providing clear empirical support for the theoretical connection between the two measures. In contrast, the performance of LM-based estimates depends strongly on the degree of linearity in the underlying model response. 
Mean absolute SHAP_i values exhibit a consistent monotonic relationship with T_i and preserve parameter rankings, while sample-specific SHAP_i values enable a distributed evaluation of sensitivities across both the parameter space and the target variable space. Overall, this study highlights ML metamodelling as a computationally efficient and conceptually sound framework for GSA in hydrologic modelling and beyond.
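
The link between T_i and PVI_i described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes independent uniform inputs, a squared-error loss, and a hypothetical three-parameter function `toy_model` that stands in for a perfectly fitted metamodel. Under those assumptions, the increase in mean squared error after permuting input i is approximately 2·Var(Y)·T_i, so dividing PVI_i by 2·Var(Y) should roughly recover the Jansen estimate of T_i.

```python
import numpy as np


def toy_model(x):
    # Hypothetical stand-in for a model or metamodel (assumption, not from the paper):
    # an additive part plus an interaction between inputs 0 and 2.
    return 2.0 * x[:, 0] + x[:, 1] + 4.0 * x[:, 0] * x[:, 2]


def total_effect_jansen(f, n, d, rng):
    """Jansen estimator of Sobol' total-effect indices for independent U(0,1) inputs."""
    a = rng.random((n, d))
    b = rng.random((n, d))
    ya = f(a)
    var_y = ya.var()
    t = np.empty(d)
    for i in range(d):
        ab = a.copy()
        ab[:, i] = b[:, i]  # resample only input i, keep all others fixed
        t[i] = 0.5 * np.mean((ya - f(ab)) ** 2) / var_y
    return t


def permutation_importance(f, x, y, rng):
    """Increase in mean squared error after shuffling each input column."""
    base = np.mean((y - f(x)) ** 2)
    pvi = np.empty(x.shape[1])
    for i in range(x.shape[1]):
        xp = x.copy()
        xp[:, i] = rng.permutation(xp[:, i])
        pvi[i] = np.mean((y - f(xp)) ** 2) - base
    return pvi


rng = np.random.default_rng(0)
t = total_effect_jansen(toy_model, 20000, 3, rng)

x = rng.random((20000, 3))
y = toy_model(x)
pvi = permutation_importance(toy_model, x, y, rng)
pvi_scaled = pvi / (2.0 * y.var())  # comparable to T_i under the stated assumptions

print("T_i        :", np.round(t, 3))
print("PVI_i / 2V :", np.round(pvi_scaled, 3))
```

With a real metamodel the agreement also depends on how well the surrogate fits the hydrologic model, which this sketch does not cover.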

Received: 30 Mar 2026 – Discussion started: 14 Apr 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: final response (author comments only)

RC1:
'Comment on egusphere-2026-1787', Anonymous Referee #1, 30 May 2026

The paper uses hydrological models to generate simulations, then trains ML surrogate models and performs sensitivity and explainability analyses of the parameters. These parameters are often already calibrated or studied in the hydrological modelling literature, but sensitivity analysis is still meaningful because the parameters are treated as uncertain within feasible ranges. This approach is a good framework compared with using only ML models such as RF/ANN/LSTM/Transformer and explaining those models. Sensitivity analysis (SA) of hydrological simulations can be computationally expensive because its cost depends heavily on the sample size and the number of parameters. For hydrological models like HBV/VIC, traditional SA becomes extremely expensive, mainly because the high number of interactions and uncertain input factors cause the curse of dimensionality. The alternative is to run the hydrological model a limited number of times, train an ML surrogate, and use that surrogate for sensitivity analysis, because ML inference is very fast. This is standard surrogate modelling logic. From my understanding, the authors encourage hydrological modellers to consider surrogate modelling approaches over advanced standalone ML frameworks, mainly because of the computational efficiency that surrogate models offer in the current experiment.

#Q1: The authors should explain why they chose these two models (RF and ANN). Why not XGBoost instead of RF, or LSTM instead of ANN?
#Q2: If these parameter ranges are already well established, what new hydrological insight is obtained from surrogate explainability? Or is the aim only to compare SHAP/PVI with traditional Sobol’ SA?
#Q3: It would be valuable to discuss whether the reported agreement between SA and SHAP-based XAI remains consistent under substantially different parameter ranges. In this experiment, RF and ANN yield the same feature rankings. If the setup changes, why would RF and ANN still produce similar sensitivity rankings, even though RF relies on tree splits, recursive partitioning, and ensemble averaging, whereas ANN relies on weighted neurons, nonlinear activations, and gradient optimization? RF and ANN may behave similarly and converge toward similar functional approximations, but this does not necessarily imply methodological equivalence or robustness. Results may vary across random seeds, training samples, architectures, hyperparameters, and basin characteristics. I understand this is a big ask when the paper concentrates on the SA application (PVI and Sobol’) and its comparison with XAI. Still, one paragraph should be included to help the reader understand whether these tools truly provide valuable insights into where ML models “align with” or “diverge from” theoretical expectations or hydrological system understanding, as discussed in these papers: https://doi.org/10.1029/2024WR037398, https://doi.org/10.1016/j.envsoft.2026.107007
#Q4: For a moment, I considered the idea of ML as a surrogate, but do ML models offer a real practical solution when I do not know which surrogate best reproduces the hydrological model's behaviour? Can I trust the ML model? Since the sensitivity structure may depend strongly on the prescribed parameter bounds, it is unclear whether the reported consistency between RF/NN-based importance measures and XAI would remain stable under broader or alternative parameter ranges.

#Q5: In Figure 1, although a sanity check has been performed to demonstrate robustness across configurations, the near-complete overlap of the importance curves makes interpretation difficult. Quantitative stability measures would strengthen the conclusions drawn from this figure. Additional robustness analyses using different random seeds and sampling strategies would strengthen the experiment. For robustness checking, Kendall τ / Spearman ρ, top-k Jaccard, rank variance / entropy, and bootstrap confidence intervals could be considered. Including even a few of these metrics may be sufficient to better justify the robustness claims.
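
Some of the rank-agreement metrics named in Q5 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the two importance vectors `reference` and `estimate` are made-up numbers, not results from the paper, and the helper names are hypothetical. The functions assume no tied values.

```python
from itertools import combinations


def ranks(values):
    """Rank positions with 1 = largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r


def spearman_rho(a, b):
    """Spearman rank correlation via the squared rank-difference formula."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_jaccard(a, b, k):
    """Jaccard overlap between the k most important parameters of a and b."""
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / len(top_a | top_b)


# Made-up importance values for five parameters (illustration only).
reference = [0.62, 0.25, 0.08, 0.03, 0.01]  # e.g. a Sobol' total-effect estimate
estimate = [0.58, 0.27, 0.03, 0.09, 0.02]   # e.g. a normalized PVI estimate

print("Spearman rho :", spearman_rho(reference, estimate))
print("Kendall tau  :", kendall_tau(reference, estimate))
print("Top-3 Jaccard:", top_k_jaccard(reference, estimate, 3))
```

Repeating such a comparison over several random seeds or bootstrap resamples would give the kind of spread the referee asks for.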

#Q6: In lines 385-395, the authors state: “The ability of SHAPi values to characterize how parameter influence varies across both the parameter space and the target….. such as DELSA, while naturally integrating with ML metamodelling frameworks.” The discussion comparing SHAPi with distributed sensitivity approaches such as DELSA (Rakovec et al. (2014) and Razavi & Gupta (2015, 2016)) may require additional nuance.

--Here, the technical note appears to implicitly position SHAP-based XAI and GSA (Sobol’ and PFI) as equivalent frameworks. However, while SHAP provides valuable sample-specific feature attributions, it fundamentally operates as a local prediction-explanation method rather than a global sensitivity propagation framework. The two are perhaps better described as complementary rather than as the same approach.

--Although aggregation in GSA (DELSA and VARS, as cited by the authors) may obscure localized behaviour, SHAP-based global importance derived from mean absolute SHAP values should not be interpreted as mathematically equivalent to classical GSA. In simpler terms, SHAP and GSA are different, but they may provide similar feature rankings and are complementary to each other.
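
The point that mean absolute SHAP values and total-effect indices can agree in ranking without being equal can be seen in a small worked case. The sketch below is an illustration under stated assumptions, not an analysis from the paper: it uses a hypothetical linear model with made-up coefficients `beta` and independent uniform inputs. For such a model the SHAP value of input i at point x is beta_i·(x_i − E[x_i]), and because the model is additive its total-effect indices equal its first-order variance shares.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([3.0, 2.0, 1.0])  # hypothetical coefficients (assumption)
x = rng.random((50000, 3))        # independent U(0,1) inputs

# SHAP values of a linear model with independent inputs:
# phi_i(x) = beta_i * (x_i - E[x_i]), one value per sample and input.
shap_values = beta * (x - x.mean(axis=0))
mean_abs_shap = np.abs(shap_values).mean(axis=0)
shap_share = mean_abs_shap / mean_abs_shap.sum()

# Additive model: total-effect index = share of output variance from input i.
var_terms = beta**2 * x.var(axis=0)
t = var_terms / var_terms.sum()

print("normalized mean|SHAP|:", np.round(shap_share, 3))  # scales with |beta_i|
print("total-effect index   :", np.round(t, 3))           # scales with beta_i**2
```

In this case the normalized mean absolute SHAP values scale with |beta_i| while T_i scales with beta_i squared, so the two measures order the inputs identically but assign different relative magnitudes.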
#Q7: Positioning SHAPi as a computationally efficient analogue to distributed sensitivity approaches may overstate the equivalence between these methodologies. In particular, the computational efficiency of SHAP is highly dependent on the underlying ML architecture (e.g., TreeSHAP versus ANN-based explainers), and its cost may grow substantially with increasing feature dimensionality and sample size. I would refer to this paper: https://doi.org/10.1016/j.envsoft.2026.107007

#Q8: The authors should clarify the SHAP implementation details used for the different surrogate architectures (e.g., TreeExplainer for RF, LinearExplainer for LM, and KernelExplainer or DeepExplainer for ANN, or KernelExplainer for all ML models). Please specify the SHAP explainer used for each ML model, as this information is important for interpreting both the computational cost and the resulting feature attributions.

#Q9: The computational efficiency and the resulting feature attributions may depend strongly on the chosen explainer. PVI provides a computationally efficient sensitivity-analysis framework; however, SHAP-based methods are generally more computationally demanding than PVI, except for optimized implementations such as TreeSHAP. Both KernelSHAP and DeepSHAP are more expensive than TreeSHAP. The authors should explain these points in detail to strengthen the conclusion.

#Q10: If KernelExplainer was used for the ANN, the authors should state this clearly, because KernelSHAP is highly sensitive to choices such as the background dataset size, the number of explained examples, and nsamples. The study should therefore report how the background set was selected, the nsamples value and other KernelExplainer arguments, and the number of explained points.

#Q11: It is unclear why SHAP-derived global importance rankings were not included in Figure 4 alongside the PVI-based comparisons with Sobol’ Ti. Including normalized mean absolute SHAP rankings could help readers better assess the extent to which SHAP-based interpretations align with classical GSA results.

#Q12: Why are Sections 2.1 and 2.2 presented separately when GSA and Sobol’ appear to describe the same methodology? Similarly, why do PVI and the estimation of PVI have separate section numbers?

Citation: https://doi.org/10.5194/egusphere-2026-1787-RC1
- AC1: 'Reply on RC1', Patricio Yeste, 22 Jul 2026
  
  Dear reviewer,
  Our responses can be found in the attached PDF document.
  On behalf of all the authors,
  Patricio Yeste
  
  Citation: https://doi.org/10.5194/egusphere-2026-1787-AC1
RC2:
'Comment on egusphere-2026-1787', Anonymous Referee #2, 24 Jun 2026

The present contribution elaborates on the numerical validation of the theoretical relationship between variance-based Sobol’ total index and Permutation Variable Importance (PVI), considering diverse hydrological models and three (humid and groundwater driven) catchments and Random Forest, Neural Network and Linear Model as surrogates. Additionally, the Authors explore empirically the connection between Sobol’ total index and Shapley Additive exPlanations (SHAP) index.
The work is sound and convincing, and shows how surrogate modelling can contribute to sensitivity analysis in the hydrological context. I have only a few very minor comments.
Comment
The math behind SHAP_i should be better defined in the work (maybe in an appendix).

Comment
In Figure 6, the label in panel (a) should be S_m instead of S.

Comment
Why is Section 3.4 (Sensitivity measures) not included in Section 2, where all the other definitions of sensitivity are given?

Comment
A non-aggregated SA measure has also been proposed by Dell’Oca et al. (2020), in which sensitivity can be assessed across the output range of variability.

Aronne Dell'Oca, Alberto Guadagnini, Monica Riva, 2020. Copula density-driven metrics for sensitivity analysis: Theory and application to flow and transport in porous media. Advances in Water Resources, 145, https://doi.org/10.1016/j.advwatres.2020.103714.

Citation: https://doi.org/10.5194/egusphere-2026-1787-RC2
- AC2: 'Reply on RC2', Patricio Yeste, 22 Jul 2026
  
  Dear reviewer,
  Our responses can be found in the attached PDF document.
  On behalf of all the authors,
  Patricio Yeste
  
  Citation: https://doi.org/10.5194/egusphere-2026-1787-AC2

Supplement

https://doi.org/10.5194/egusphere-2026-1787-supplement

Latest update: 26 Jul 2026

Short summary

Understanding which factors most influence streamflow is key for accurate hydrologic modelling. This study shows that machine learning methods, like random forests and neural networks, can efficiently identify the most important model inputs, producing results similar to traditional sensitivity analysis but with far less computation. This approach helps scientists explore complex models faster and more reliably, improving insights into how catchments respond to changing conditions.

