This work is distributed under the Creative Commons Attribution 4.0 License.
Optimizing Gaussian Process Emulation and Generalized Additive Model Fitting for Rapid, Reproducible Earth System Model Analysis
Abstract. Causes of model uncertainty in complex modeling systems can be identified using large perturbed-parameter ensembles (PPEs), combined with statistical emulators to increase sample size and enable variance-based sensitivity analyses and observational constraint. In global climate models such as the UK Earth System Model (UKESM), these approaches are typically applied at the global or regional mean scale for a limited set of variables. Accelerating progress in understanding the multi-faceted causes of climate model uncertainty requires implementing such workflows at the model grid-box scale, to enable analyses across variables that reveal how uncertainties propagate and interact spatially. However, this approach requires training millions of Gaussian process (GP) emulators and fitting an equal number of generalized additive models (GAMs), a major computational bottleneck. We present a high-performance, open-source pipeline that introduces optimisations for this workflow. For GP emulation, we implement task-level parallelism and streamlined data handling on high-performance computing systems. For GAM fitting, we integrate a parallelized pyGAM interface with R's mgcv::bam() back end, using fast fREML estimation with discrete smoothing, memory-efficient batching, and improved input–output routines. These changes reduce GP training time by 97.5 % (6177 → 154 s) and GAM fitting time by 95.2 % (10623 → 511 s), yielding a ~ 25 times faster end-to-end workflow (96 % total runtime reduction) and cutting peak memory use by a factor of 12. Outputs are numerically identical to the baseline implementation (Pearson correlation = 1.00 for both GP and GAM predictions). We demonstrate the approach using a UKESM PPE comprising 221 members, scaled up to 1 million using GP emulators, with GAM fits applied to output for a single target variable, showing that the improved performance enables multi-variable, higher-resolution, and potentially multi-model analyses that were previously impractical. These improvements pave the way for PPE studies to scale in scope without compromising statistical fidelity, enabling more comprehensive exploration of model parameter uncertainty within feasible HPC budgets.
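As a pointer to the GAM-fitting configuration named in the abstract, the following is a minimal sketch, not the authors' pipeline code: it shows a single grid box's GAM being fitted with R's mgcv::bam() using fast fREML estimation and discretised covariates; the data frame, response, and parameter names are placeholders.

```r
# Minimal sketch (placeholder data, not the paper's variables): one per-grid-box
# GAM fit with mgcv::bam(), using fast REML ("fREML") and discretised covariates.
library(mgcv)

set.seed(1)
n <- 10000                                    # stand-in for GP-emulated samples
emulated <- data.frame(p1 = runif(n), p2 = runif(n), p3 = runif(n))
emulated$response <- sin(2 * pi * emulated$p1) + emulated$p2^2 + rnorm(n, sd = 0.1)

fit <- bam(response ~ s(p1) + s(p2) + s(p3),
           data     = emulated,
           method   = "fREML",                # fast restricted maximum likelihood
           discrete = TRUE)                   # discretise covariates for speed and memory
summary(fit)
```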
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2025-5533', Anonymous Referee #1, 10 Mar 2026
- AC1: 'Author Comment on egusphere-2025-5533', Kunal Ghosh, 10 Apr 2026
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-5533/egusphere-2025-5533-AC1-supplement.pdf
- RC2: 'Comment on egusphere-2025-5533', Anonymous Referee #2, 15 Mar 2026
This paper discusses how the speed of training of Gaussian Process emulators and associated fitting of Generalised Additive Models can be improved by an optimised workflow, use of in-memory I/O, and (I think) an alternative algorithm (as opposed to an alternative implementation) for computing the necessary prediction matrix.
The overall outcome is a significant improvement in performance, although it is not clear to me whether the scale-up goes beyond 200 cores. Is Figure 5 showing the results of running 10K tasks through 200 cores? In which case, how does the speedup go beyond what can be achieved on 200 cores alone? There is no point showing more than 200 tasks if the rest is linear.
This little example shows the problem I have with the paper. I think the work done is very interesting, and worthy of publication, but the write-up doesn't quite do it justice. The authors assume a lot of their readers and the material is not always well ordered.
I would like to see this appear as it is a good piece of work, and a nice example of optimising a workflow, but would recommend a substantial revision of the material in terms of order and clarity (I say substantial in that I think it would be of substantial benefit, but it probably wouldn't take that long and I would expect the editor could decide on this without coming back to me).
In particular, I'd like to see
- Some clarity in section 3.1 as to whether figure 4 is necessary at all? It appears that the innovation here was to take out a loop over grid points and use slurm job arrays? Is that it? The comment about "spawning competing rscript processes" is presumably the key part of this? But if this is just making better use of the parallelism and isn't changing the algorithm, why do we need figure 4? And if it is changing the algorithm, how is it doing it? There are no details.
- A better explanation in section 3.2 of what is going on, one that doesn't mix together algorithm and implementation. I found this really hard to understand. Pseudo function calls are mixed with algorithmic notation. (I said "I think" in my summary because I have only vaguely understood what was going on here, but I would like to actually understand it based on the text.)
- The tables in B1-B4 need some thinking about. I understand that these are exemplars showing the difference between the two implementations at four points, but this sort of thing is notorious. Why not show scatter plots (or whatever) of the difference across the domain (of however many points it was), for some of the parameters, not all, so we can have confidence that these are not the "best" results and are typical?
I should say that Appendix A is excellent and very useful.
Minor comments:
P2, line 28: the sentence beginning "However" implies that somehow the size of PPEs is linked to the complexity of the model. Might be cleaner to say something like: "However, complex models are expensive, and costs are prohibitive beyond o(100) members".
(There is also an assumption in this sentence about how big PPEs need to be, which is surely problem related.) The paragraph then carries on and conflates PPEs with the assumption that they are only useful if they have to be expanded in size ... this paragraph probably just needs reordering, since the justification for the expansion in size follows the assumption.
P2, line 39/40: "bottlenecks are" or "bottleneck is"?
P6: On page 5 we were told that each task had 16 GB of memory, but here we are told that peak memory use exceeded 80 GB. I've gotten confused as to what is going on. Is this an example of the "otherwise noted"?
P7: It is not necessary to keep citing Lawrence et al. after the first time JASMIN was mentioned.
Citation: https://doi.org/10.5194/egusphere-2025-5533-RC2
- AC1: 'Author Comment on egusphere-2025-5533', Kunal Ghosh, 10 Apr 2026
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-5533/egusphere-2025-5533-AC1-supplement.pdf
Data sets
gp-gam-optimisation-dataset: Example input and output data for Gaussian Process and GAM workflow (H₂SO₄, January 2017). Kunal Ghosh and Leighton A. Regayre. https://doi.org/10.5281/zenodo.17543623
Model code and software
gp-gam-optimisation-pipeline: High-performance workflow for Gaussian Process emulation and GAM fitting. Kunal Ghosh and Leighton A. Regayre. https://github.com/Kunal198/gp-gam-optimisation-pipeline
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 302 | 197 | 36 | 535 | 50 | 44 |
This manuscript describes an approach to computer model emulation and sensitivity analysis using parallel and distributed computing on high-performance computing (HPC) systems. Specifically, they fit scalar Gaussian process (GP) emulators to grid-cell-level Earth system model output in an embarrassingly parallel fashion, and use the fitted emulators to conduct main-effects-style sensitivity analysis using generalized additive models (GAMs). Each model fit occurs on a core independently of the other fits, and speedup is realized by running many such fits simultaneously on many cores, with no between-node communication required.
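For concreteness, here is a minimal sketch of that embarrassingly parallel pattern, in which each Slurm array task fits one grid cell's GP emulator independently. The GP library (DiceKriging), member count, data, and output path are assumptions for illustration only, not the authors' actual script.

```r
# Hedged sketch of one array task: the Slurm-provided task index selects a grid
# cell, a scalar GP emulator is fitted, and the result is written out. Submit as
# e.g.  sbatch --array=1-10000 <wrapper script>  so Slurm sets SLURM_ARRAY_TASK_ID.
library(DiceKriging)                           # assumed GP library, for illustration

task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

# Placeholder data: in the real workflow, the 221-member PPE design and this grid
# cell's model output would be read from file using task_id.
set.seed(task_id)
design   <- data.frame(p1 = runif(221), p2 = runif(221), p3 = runif(221))
response <- sin(design$p1) + design$p2 * design$p3 + rnorm(221, sd = 0.01)

# Fit the scalar GP for this single cell; no communication with other tasks.
gp <- km(design = design, response = response, covtype = "matern5_2")
saveRDS(gp, sprintf("gp_fit_%05d.rds", task_id))   # placeholder output path
```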
The resulting parallel speedups seem reasonable and are significant in magnitude. It might be easier to see the parallel efficiency if plotted differently, e.g. on a log-log plot, or with a theoretical maximum speedup curve (equal to the number of tasks) superimposed, or as speedup expressed as a fraction of the theoretical maximum. I am not sure I would say that optimized walltime is nearly constant up to three orders of magnitude in task count; we'd expect it to be constant when the number of tasks is less than the concurrency cap of 200 cores, but it should (and does) increase after that. The speedup graphs are not as informative as they could be, for that reason, because the baseline was never run (and the speedup calculated) for task sizes exceeding the number of cores, where different speedup behavior is expected (of course this is because of the increasing expense of running the baseline); perhaps at least one long baseline analysis at (say) 400 cores could be run?
Other than that, a few other points to consider:
Could the GAM have been fit directly to the data instead of to GP emulator samples? The GAM essentially is a simple emulator (neglecting parameter interactions).
It may be worth pointing out that the GAM sensitivity analysis is like an (unnormalized) first-order Sobol' sensitivity analysis with the assumption that there are no between-parameter interactions (a minimal sketch of this correspondence is given after these points).
Could this software be extended to analyze time series at a grid cell? This would require a different form of emulator and different distribution of data onto tasks. (Obviously a spatial analysis would be harder since communication across boundaries would become inevitable, although spatially-partitioned Gaussian processes can be applied.)
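The correspondence mentioned in the second point above can be illustrated with a short, hedged sketch using synthetic inputs (not the authors' code): with independent inputs and an additive model, the variance of each fitted smooth term behaves like an unnormalised first-order Sobol' contribution.

```r
# Illustration only: variance of each GAM smooth term f_j(x_j) as an unnormalised
# first-order Sobol' contribution; dividing by the total output variance gives
# approximate first-order indices (assuming independent inputs, no interactions).
library(mgcv)

set.seed(1)
n <- 5000
d <- data.frame(p1 = runif(n), p2 = runif(n), p3 = runif(n))
d$y <- sin(2 * pi * d$p1) + 0.5 * d$p2 + rnorm(n, sd = 0.1)

fit <- bam(y ~ s(p1) + s(p2) + s(p3), data = d, method = "fREML", discrete = TRUE)

# Per-term contributions, one column per smooth (discrete = FALSE uses the
# standard, non-discretised prediction path).
terms_mat <- predict(fit, newdata = d, type = "terms", discrete = FALSE)

main_effect_var <- apply(terms_mat, 2, var)    # unnormalised first-order contributions
main_effect_var / var(d$y)                     # approximate first-order Sobol' indices
```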