This work is distributed under the Creative Commons Attribution 4.0 License.
QuadTune version 1: A regional tuner for global atmospheric models
Abstract. When a new, better-formulated physical parameterization is introduced into a global atmospheric model, aspects of the global model solutions are sometimes degraded. Then, in order to use the new global model to address science questions, there is an incentive to restore its accuracy. Oftentimes this restoration is achieved by tuning of model parameter values. Unfortunately, the retuning process is expensive because characterizing the parameter dependence requires numerous time-consuming global simulations.
To reduce the cost of tuning, this manuscript introduces a "poor man's" model tuner, "QuadTune". QuadTune carves the globe into regions and approximates the model parameter dependence by use of an uncorrelated quadratic emulator. The simplicity of the emulator reduces the required number of global model simulations and aids explainability.
Tuning removes parametric error but leaves behind model structural error. Structural error manifests itself as regional residual biases, such as stubborn biases and tuning trade-offs. To visualize these residual biases, QuadTune's software includes a set of diagnostic plots. This paper illustrates the use of the plots for characterizing residual biases with an example tuning problem.
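The scheme the abstract describes, fitting an independent quadratic per regional metric from a handful of perturbed simulations and then minimising squared bias, can be sketched as follows. This is a minimal one-parameter, one-metric illustration with made-up numbers; the function names and values are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

# Hypothetical sketch of the core idea: for each regional metric, fit an
# independent ("uncorrelated") quadratic in each parameter from simulations
# at the default value and at perturbed-up/perturbed-down values, then
# minimise the squared mismatch with observations.

def fit_quadratic(p_lo, p_0, p_hi, m_lo, m_0, m_hi):
    """Fit m(p) = a*(p - p_0)**2 + b*(p - p_0) + c through three points."""
    d_lo, d_hi = p_lo - p_0, p_hi - p_0
    # Solve a 2x2 linear system for a and b; c is the default metric value.
    A = np.array([[d_lo**2, d_lo], [d_hi**2, d_hi]])
    rhs = np.array([m_lo - m_0, m_hi - m_0])
    a, b = np.linalg.solve(A, rhs)
    return a, b, m_0

# One parameter, one regional metric, with made-up simulation output:
a, b, c = fit_quadratic(0.5, 1.0, 1.5, m_lo=2.0, m_0=1.0, m_hi=1.5)

def emulate(p, p_0=1.0):
    return a * (p - p_0)**2 + b * (p - p_0) + c

# Least-squares tuning against a made-up observed value m_obs = 0.8,
# by brute-force grid search for transparency:
m_obs = 0.8
grid = np.linspace(0.5, 1.5, 1001)
loss = (emulate(grid) - m_obs)**2
p_opt = grid[np.argmin(loss)]
```

With many metrics and parameters the loss would sum weighted squared biases over regions, but the per-metric quadratic fit above is the basic building block.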
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-1593', Anonymous Referee #1, 06 Aug 2025
The paper presents QuadTune, a software package that can be used to do multi-parametric calibration of climate models. The paper is well written and clearly structured, the presentation is clear, and the methodology is explained in detail and with useful pedagogical descriptions. The method is applied to a development version of E3SM. I have thoroughly enjoyed reading the paper, and I would recommend publication after minor revisions. Please see my specific comments below.

COMMENTS
Figure 3. I think it would be worth reiterating in the caption that the midpoint of the parabola is the default parameter value. It would be interesting to see the simulated values for these regions, in addition to the QuadTune optimised predictions.

L441-2. The loss function is based on least squares, which prioritises regions with large biases. I'd like to see a brief explanation of the effect of choosing a different functional form of the loss function. Is this easily configurable in QuadTune?

I don't think the computational cost of QuadTune (the optimisation process) is described in the paper. What is it? How does it scale with respect to parameters and target metrics?

Citation: https://doi.org/10.5194/egusphere-2025-1593-RC1
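The effect of swapping the functional form of the loss, as asked about above, can be illustrated with a toy example. This is a hedged sketch under invented sensitivities, not QuadTune's actual interface: a Huber loss grows linearly for large residuals, so a single large regional bias dominates the optimum less than under plain least squares.

```python
import numpy as np

def least_squares(r):
    return np.sum(r**2)

def huber(r, delta=1.0):
    # Quadratic for small residuals, linear for large ones (robust loss).
    a = np.abs(r)
    return np.sum(np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta)))

# Two regional biases, both linear in a single tuning parameter p
# (made-up sensitivities): region 1 has a large, hard-to-remove bias.
def residuals(p):
    return np.array([5.0 - 0.5 * p,   # region 1: large stubborn bias
                     0.0 + 1.0 * p])  # region 2: bias grows with p

grid = np.linspace(-3, 3, 6001)
p_ls = grid[np.argmin([least_squares(residuals(p)) for p in grid])]
p_hu = grid[np.argmin([huber(residuals(p)) for p in grid])]
# Least squares pulls p strongly toward region 1's large bias; the
# Huber optimum sits much closer to removing region 2's bias.
```

In this toy case the least-squares optimum lands near p = 2 while the Huber optimum stays near p = 0.5, showing how the choice of loss redistributes the tuning trade-off across regions.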
RC2: 'Comment on egusphere-2025-1593', Anonymous Referee #2, 08 Aug 2025
The paper approaches the tuning problem in a way that aims to use as few expensive runs as possible, in order to learn about structural error / parametric uncertainty / biases.
I’m not currently convinced that all assumptions have been justified (e.g., ignoring the errors of the emulator), and I’m not wholly convinced that the method is working well based on the current example. However, I think these issues could be fixable with additional justifications, perhaps an example where it works better, and additional exploration of sensitivities in the method due to the assumptions being made:
Main comments:
1. The emulator used here is based on a 2nd order Taylor approx around the default value, in order to minimise the number of simulations required. A calibration method is only as good as the emulator underpinning it, and it’s not clear to me that the emulator in the example is good (perhaps because it’s being used to extrapolate too far from the default), which will be leading to biases in the ‘optimal’ values, and hence incorrect conclusions about what is structural error, and subsequent analysis/conclusions based around this – it is unclear whether a lot of the results and comments in Section 7 would still be true if the emulator were more accurate:
- There needs to be some validation of the emulator, to check its accuracy. This is partially done, in recognising post-hoc that it predicts it will remove the bias for some regions, but doesn’t get close (line 405: ‘QuadTune’s prediction is imperfect’), and in the preceding few lines, where we see it predicts almost zero bias for some locations but in reality hasn’t halved it. Emulator validation should be done before calibration rather than discovering this error after doing a new GCM run, as the inadequacies of the emulator will affect the optimisation process, and the results shouldn’t be trusted.
- The example shown has a poorly tuned default (line 392), but this isn’t meant to be the use case as the emulator is only good locally - is there an issue here of extrapolating too far from the default, where the Taylor expansion is not going to be valid? Some validation of the emulator is critical (often out-of-sample predictions or leave-one-outs when the emulator is a GP, something appropriate needs showing here). If the predictions are poor, the emulator needs changing before being used for optimisation.
- Line 551, elsewhere – ‘nearby location’ in parameter space: how is ‘nearby’ being defined? How is it checked that you are not extrapolating too far? Line 227 mentions it can’t be large as this violates the Taylor assumption (this sentence would be good in Section 5, when the emulator is introduced). What is large?
- The problem in the example could be because the optimisation is allowing it to extrapolate too far from the default; it could be because the parameter perturbation used was too large for the Taylor approx to be valid; or some other reason – but whatever the case, there’s error in the emulator, and this will affect optimisation of the loss function, and hence optimal parameters and conclusions about patterns of structural error (and what is in fact structural error – this is likely to result in a map that is combination of structural error and parametric uncertainty, as demonstrated by the fact that the hand-tuned version is ‘better’ in terms of RMSE).
- The emulator seems to be poor for 6_14, 6_18. Perhaps a way to overcome or explore this issue is to remove or downweight these (and any other) regions where the emulator is poor, and see what the suggested optimal is.
- The model is ‘badly out-of-tune’, and some bias is removed by QuadTune ‘despite the simplicity of the emulator’. The model being badly out-of-tune suggests it’s relatively trivial to find some better parameter estimates, and given the identified issues that the emulator has at making predictions in at least some regions, is this ‘despite the inaccuracies’ of the emulator? I.e., it’s found better parameters, despite being inaccurate when extrapolating, because it was straightforward to do so? It might be better to also show a use-case where the default is already much better: in this case, the choice of emulator should be more valid and accurate as only looking locally (as was the initial assumption), and any improvement will be less down to chance.
- The inaccuracy for 6_14 and 6_18 might be hiding/drowning out biases that *can* be tuned out (but that don’t lead to as large reductions as the incorrect predictions for these regions). E.g. line 434 ‘QuadTune strives to reduce the bias in 6_14, 6_18 at the expense of other regions’, but it only does so because the emulator predicts this bias to be (incorrectly) reduced by much more than it actually can be. So these conclusions and discussion are perhaps only relevant conditional on the poor emulator. It’s also ‘prioritising bias reduction’ in these regions because of the choice of weightings.
- With a more accurate emulator, parametric uncertainty might be reduced much more, and in a different way, trading off other biases – e.g. Figure 4 shows that getting increased biases at some locations as trade-offs for improvements that don’t actually exist, which is problematic. The fact that the ‘hand-tuned’ version has found a better set of parameters (which itself are likely not optimal) show that QuadTune has not removed parametric uncertainty.
- Figure 2 could demonstrate the accuracy of emulator in a clearer way – e.g., plot QuadTune predictions (b) vs true values (c), or predicted reduction vs actual reduction. Is it just poor for the 2 mentioned regions, or systematically wrong elsewhere?
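The validation step suggested in the comments above could look something like the following. This is a hedged sketch with synthetic numbers (nothing here is from the paper): compare the emulator's predicted change in each regional metric against the actual change seen in a held-out GCM run, and flag regions where the emulator error is large relative to the signal.

```python
import numpy as np

# Synthetic stand-ins for 20 regional metric changes: "actual" is what a
# held-out GCM run would show, "predicted" is the emulator's forecast.
rng = np.random.default_rng(0)
actual = rng.normal(0.0, 1.0, size=20)
predicted = actual + rng.normal(0.0, 0.3, size=20)  # emulator with some error

# A simple skill score: 1 - MSE(pred, actual) / Var(actual).
# Skill near 1 means the emulator tracks the true response; skill near 0
# means it is no better than predicting the mean change everywhere.
mse = np.mean((predicted - actual)**2)
skill = 1.0 - mse / np.var(actual)

# Flag regions where the emulator error exceeds half the typical signal,
# as candidates for down-weighting in the loss (or for a smaller
# perturbation size so the Taylor approximation holds).
suspect = np.abs(predicted - actual) > 0.5 * np.std(actual)
```

A predicted-vs-actual scatter of these changes (as suggested for Figure 2) would make it immediately visible whether the emulator is poor only for 6_14 and 6_18 or systematically elsewhere.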
2. For identifying structural error, need to ensure have removed parametric error. The purpose of tuning is described in the abstract as removing parametric error and leaving structural error behind, and line 37 talks about needing ‘guidance on what structural errors remain after parametric errors tuned out’. Slightly caveated by ‘gives hints about nature of structural error’, but this can only be done with confidence if have removed parametric error. Throughout, the two sources of error are combined (eq 3, 27), and results are a sum of the two. The example given suggests that all parametric error has not been removed, and so the observed biases are some mix of the 2.
- Usually in calibration exercises using best-input based methods such as Bayesian calibration and History matching, parametric, observational, and structural uncertainties are treated separately. Better justification of why it’s fine to combine these here (and ignore observational error) would be helpful.
- Line 81 says the model doesn’t match the observations for any p, and e.g. eq (3) includes this error term. But line 139 assumes that ‘near the optimal values, the model output is an approximate match’ and ignores the error. Is this a valid assumption?
- Other emulators used in these types of methods use things such as Gaussian processes, to quantify the parametric uncertainty and have some understanding of how accurate the emulator is. Here, ignoring this error and treating as a perfect model (equivalently, assuming constant error at all p). This might be valid very locally around the default, but not generally. Further justification of the choice of emulator, and how ‘local’/’nearby’ it is valid, is needed, as in the example this is clearly not true (perturbing by too much?). Is the example a good illustration, or is it being used in a way it shouldn’t be?
3. Explanation of the steps in QuadTune in Section 3 could be clearer, and possibly a general re-ordering would be helpful for this. Currently the outline of QuadTune is given in Section 3, but relies on things not yet mentioned, in particular the main descriptions are not given until Section 5. It might be better to first define the key elements, then give overall algorithm, then demonstrate, e.g. Section 2, then 5, then 3, then 4, 6 etc. More specifically:
- ‘linearly added to the loss function’ in step 2 – up to here there’s no mention of a ‘loss function’ except in the description of the sections in the intro, and I don’t think it’s addressed properly until Section 5.
- Similarly, the description of step 3b made me question why the simulations are being designed in this way, and this only became clear in Section 5 when the emulator was explained.
- Some of the comments within the steps of the algorithm aren’t really required in line as they’re not parts of the general method, and might be better discussed after, so that the steps of QuadTune are concisely and clearly communicated. E.g., ‘for illustration, this paper tunes a single field’ is not required for #2 of QuadTune.
- Step 6 of QuadTune is given as ‘re-run’, but this contradicts its mention as a possible extension of QuadTune in Section 9 – is this already part or not?
- In step 5, ‘run QuadTune’ – it is not explained at this point what this actually means, which makes me think this overall algorithm should come much later after ‘QuadTune’ has been developed.
- Line 116 mentions ‘the quadratic emulator’ but this has not really been mentioned yet. Similarly line 133. ‘QuadTune’s emulator’ is explained in Section 5, but in the QuadTune algorithm there should be some clearer explanation of this – could be as simple as ‘emulate model output, use this to tune parameters/minimise loss function’ – I didn’t feel this was spelt out until much later, and overall clarity would be aided by this.
- Step 3 is the first mention of ‘default’ – are these standard, or another user choice?
4. The main example would benefit from slightly more explanation at the start – some of the assumptions being used here are spread throughout other parts of the paper, whereas it would be clearer to describe the assumptions you’re making in order to run QuadTune in this particular case at the start of Section 7. E.g.,
- Fig 3 – mentions the tunable parameters are defined in B1. Say this at the start of the example section, make clear what the parameters are, what P is, etc.
- What are the perturbations? I don’t think these are mentioned. Stating the default values and the range they vary in in Table B2 might be helpful.
- Could be more specific about the hand-tuned version – what are the optimal parameters chosen here, how compare to QuadTune optimal? Are they at least in a similar region of parameter space or moved in different directions from the default? Do the bias patterns (when aggregated to regions, like in Fig 2) look the same?
- Can you be more specific about the number of simulations done for hand-tuning than ‘dozens’ (was it done specifically for this comparison or existed already?) These other simulations exist and so could possibly be used as out-of-sample points for assessing accuracy of the emulator.
5. I think consideration of other sensitivities is important to demonstrate the method works. Section 8 considers sensitivity to size of regions and duration, but I think there would be larger sensitivities to other assumptions in the method:
- Weighting
- Choice of emulator
- Size of parameter perturbation
- Choice of default
The example presented does worse than hand-tuning, and the emulator is extrapolating inaccurately. Does changing the emulator/perturbations make this more accurate and out-perform hand-tuning? Does weighting differently remove different patterns of bias and hence lead to different conclusions about structural error? This seems important as the purpose of the method is to give insight about this. Even in the context of the comparisons already made, with little difference in the RMSE, do these lead to very different parameter estimates / patterns of bias? I think there’s work to convince that the method can lead to reasonably robust results, and is not being influenced by an inaccurate emulator or other strict assumptions (re. structure of errors).
The paper would also benefit from an example where the emulator + optimisation are demonstrated to work well, as this combination is ‘QuadTune’. Currently the optimal results, and hence downstream explanations, are being driven by the emulator predicting it removes biases where it can’t. Perhaps the better use case is when the default is already reasonably good, so the choice of simulation design and Taylor approximation is valid as truly looking locally only, but I'm not sure.
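The weighting-sensitivity experiment proposed above can be sketched in a few lines. All regional sensitivities below are invented; only the workflow (re-run the optimisation under different weightings and compare the optima) is meant to be illustrative.

```python
import numpy as np

def residuals(p):
    # Biases of three regions as linear, made-up functions of one parameter.
    return np.array([2.0 - 1.0 * p,    # region 1: large bias, tunable
                     1.0 + 0.5 * p,    # region 2: bias grows with p
                     -1.0 + 0.3 * p])  # region 3: small opposing bias

def tune(weights, grid=np.linspace(-5, 5, 10001)):
    # Weighted least-squares tuning by brute-force grid search.
    losses = [np.sum(weights * residuals(p)**2) for p in grid]
    return grid[np.argmin(losses)]

p_equal = tune(np.array([1.0, 1.0, 1.0]))
p_down1 = tune(np.array([0.1, 1.0, 1.0]))  # down-weight region 1
# If p_equal and p_down1 differ a lot, the resulting maps of residual
# bias, and hence conclusions about structural error, are sensitive to
# the weighting choice.
```

In this toy case, down-weighting region 1 moves the optimum substantially, so the pattern of residual bias left behind (and what one would call "structural") changes with the weighting.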
6. Notation – there’s a few places where there’s inconsistencies or undefined terms, including:
- In Section 2, line 90 N refers to the number of regions. In Section 3, line 109 it’s this multiplied by number of fields. Easier to follow if don’t re-define.
- Line 91 – a region is defined as x. In Section 4 they’re defined a bit differently, with e.g. x \in Sc, ideally do so consistently.
- Eq 10 – bold p, b not been explicitly defined. Similarly, the other vectors in (11) and (12) not explicitly defined.
- Line 140 – the equations here are assuming p_1, p_2 are the ‘optimal’ values, but should these be distinguished by e.g. p_1*, p_2* (see e.g. best input approaches, optimal commonly written x* or \theta*).
- Section 5.1 – defines regional metrics for the first time as m_i. The same type of regional metrics were written in a different way in Section 4, and for consistency should probably use the m_i style notation there as well.
- Eq 19 – m_obs isn’t defined
- Eq 21 – j, k not defined. Compared to Eq 22, this equation is missing sums?
- Eq 26, and several places thereafter – (no sum over i) – no need to say this, self-evident from the equations. Could write i = 1, … N to be clear doing for each metric.
- Eq 26, elsewhere - no need for …
Other comments:
- Line 402 – phrases like ‘QuadTune thinks’ are used in several places. I’d much prefer phrasings like ‘QuadTune predicts’, as that more accurately describes what’s actually happening.
- Line 122 - ‘A possibly weighted version of Eq (5)’ – I think everywhere it is assumed that this is weighted geographically, so Eq 5 should probably include weights. Can comment that the equal-weighting case is then a special case of that, if it’s actually useful.
- Line 153 – ‘similarly for \delta p_2’ – could just define in terms of p_i, so that then works for general case later as well.
- The phrasing ‘stubborn bias’ is defined/explained on page 10, but has already been used on pages 1, 5 and 9. I don’t think it’s obvious how this is being defined until p10, so perhaps just be explained earlier.
- Line 255 – because of the ordering, with Section 5 after the algorithm is given, ‘requiring an extra P global simulations’ could be read as on top of the 2P+1 simulations that were mentioned earlier, rather than these already being part of it.
- Line 275 – ‘simplicity of the approximation helps us better understand structural errors’ – this is only true if the approximation is accurate, which the example suggests it is not (perhaps because in this case ‘nearby’ is defined too widely). Need some mention about sizes of perturbations, checking accuracy of local approximation?
- Fig 3 – each row should have a common y axis range.
- Line 513 – ‘region 3_6 cannot be improved by tuning’ is quite a strong statement. It might be if you varied parameters jointly, fit a different emulator? I.e. it’s conditional on your assumptions that you can’t improve it.
- Line 515 – ‘if wish to remove a residual bias…remains after tuning’ – I don’t think you have to find another metric or change your model. Within this method, you could also change the weighting of that region? You could also improve the emulator.
- 1st paragraph of Section 7 is a description of QuadTune’s properties, rather than part of the example – better in the QuadTune section?
- Section 8 is extremely short, can probably include as a subsection if remains this short.
Minor edits/typos:
- Line 8 – ‘through the use of’?
- Line 9 – ‘explainability’ of what?
- Line 21 – ‘improvement in the overall…’
- Line 26 – ‘Big gains…often come from structural model improvements’ should probably be referenced
- Line 50-51 – quite similar to lines 46-48, probably combine these 2 paragraphs and be less repetitive.
- Line 77 – ‘particular’ said twice, only needs one.
- Line 80 – does f_obs need x dependence here? Often does in what follows.
- Line 88 – ‘at fine resolution’ – I’m not sure you’ve defined x as being ‘fine’ resolution, just lon/lat. Maybe say something like across all lon/lat instead.
- Eq (4) – x in
- Line 103: ‘steps’ – missing colon
- Line 166 – ‘we regard as knowns…’ would sound better as ‘we regard the…as knowns’
- Line 186: ‘perturbing the jth parameter by \delta p_j’ ?
- Line 273 – ‘will we’ should be ‘we will’
- Line 285 – what is ‘typical in size’?
- Line 293 – ‘ithe’
- Line 298: ‘in the ith…’
- Line 369 – ‘off of’ could be something like ‘from’
- Line 393 – ‘much of the bias’ – perhaps too optimistic – only reduced by ~20%
- Line 458 – ‘SVD has’
- Line 459 – ‘the first singular vector’ of what? (Mentioned in Figure caption, but not clear from text).
- Line 466 – ‘is c8 so important’
Citation: https://doi.org/10.5194/egusphere-2025-1593-RC2
Model code and software
Tuning code used in 2025 QuadTune v1 GMD paper V. Larson et al. https://doi.org/10.5281/zenodo.15132492
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
731 | 27 | 14 | 772 | 20 | 32