the Creative Commons Attribution 4.0 License.
Stochastic perturbation of inputs to parametrisation schemes machine-learnt from high-resolution model variability
Abstract. Stochastic parametrisation schemes represent sources of uncertainty in atmospheric models, and several types of these schemes are in widespread use in general circulation models across a variety of temporal and spatial resolutions. We introduce a new stochastic scheme for use in global atmospheric models, which uses a machine learning model trained on high-resolution convection-permitting simulation data to estimate properties of the distribution of subgrid variability in potential temperature. This estimate then informs the profile of stochastic perturbations applied to the inputs of traditional parametrisation schemes. The scheme is tested in single column model experiments over the tropical west Pacific and is shown to improve model performance in this case.
Status: open (extended)
- CEC1: 'Comment on egusphere-2025-6312', Juan Antonio Añel, 28 Mar 2026
- AC1: 'Reply on CEC1', Helena Reid, 30 Mar 2026
We have uploaded these to Zenodo. The GitHub references can be revised to use https://doi.org/10.5281/zenodo.19331887 and https://doi.org/10.5281/zenodo.19331816 for ENNUF and LFRic respectively.
Citation: https://doi.org/10.5194/egusphere-2025-6312-AC1
- RC1: 'Comment on egusphere-2025-6312', Anonymous Referee #1, 20 Apr 2026
The authors present a novel approach that leverages machine learning (ML) to generate perturbed inputs for a suite of physical parameterizations. Using output from several limited-area model simulations, the proposed method emulates subgrid-scale variability in key thermodynamic variables. The resulting framework, referred to as PAPILLON, produces stochastic perturbations that are then applied to the inputs of conventional physical parameterizations.
The ML emulator is evaluated in a single-column model configuration against the ERA5 dataset, based on a single test case. The results suggest that using PAPILLON to perturb the inputs leads to slightly improved performance compared to an ensemble generated using the SPPT perturbation scheme.
I appreciate the originality of the proposed approach. The framework introduced here provides an interesting way to combine machine learning with existing physical parameterizations. However, the conclusions would be strengthened by the inclusion of additional test cases to assess the robustness of the results.
While the manuscript is generally well written, I found parts of it difficult to follow. In particular, the level of detail provided in some sections tends to obscure the main message. Streamlining the presentation and improving the overall structure, e.g. by introducing additional subsections and clearer signposting, would significantly enhance readability.
General comments
- The introduction contains a substantial amount of useful background information. However, I found it somewhat difficult to follow, and the main line of reasoning is not always clear. Clarifying or simplifying the introduction would strengthen the argument.
- The manuscript would benefit from substantial restructuring to improve clarity and focus. At present, the content is somewhat diffuse, with a level of detail in places that tends to obscure the main message. Streamlining the text and emphasizing the key ideas more directly would significantly improve readability.
- It is not entirely clear to me why perturbations are applied only to potential temperature. While the introduction highlights the importance of this variable for convection, it is presented more as an example than as a justification for this specific choice. The authors are encouraged to clarify this point more explicitly. It may also be more appropriate to move this discussion to the Methods section.
- Although I understand that running a large number of limited-area model simulations is computationally expensive, I wonder whether sampling variability over only one month is sufficient to ensure robustness. Some discussion of this limitation would be helpful.
- It would be useful to include a brief description in the main text of the numerical implementation in the single-column model (SCM), in particular regarding the use of ENNUF.
Specific comments
- l. 141-142: Was the model trained using randomly selected samples across all spatial domains and timesteps? To improve independence between training, validation, and test datasets, the authors might consider leaving out entire simulations (e.g. some LAM runs) or contiguous time periods.
- l. 270: How is the height of the troposphere diagnosed?
- Figure 8: When either the length scale or the time parameter is varied, what value is used for the other parameter that is held fixed?
Citation: https://doi.org/10.5194/egusphere-2025-6312-RC1
- AC2: 'Reply on RC1', Helena Reid, 27 May 2026
We thank the reviewer for their time and valuable comments, which we address point by point in turn as follows:
General comments:
- The introduction contains a substantial amount of useful background information. However, I found it somewhat difficult to follow, and the main line of reasoning is not always clear. Clarifying or simplifying the introduction would strengthen the argument.
- The manuscript would benefit from substantial restructuring to improve clarity and focus. At present, the content is somewhat diffuse, with a level of detail in places that tends to obscure the main message. Streamlining the text and emphasizing the key ideas more directly would significantly improve readability.
- We have moved the parts of the introduction that discuss cases in which stochastic perturbations that are linear (or similar) w.r.t. parametrisation scheme outputs are unable to capture uncertainty (approx. lines 60 to 90 and figure 1) to a new subsection “motivation” at the start of the methods section. The introduction can then go directly from an overview of stochastic schemes present in the literature to a brief description of the key differences in the stochastic scheme presented in this paper, leaving detailed justification for these choices to the later subsection. We hope this restructuring improves the flow of the paper.
- It is not entirely clear to me why perturbations are applied only to potential temperature. While the introduction highlights the importance of this variable for convection, it is presented more as an example than as a justification for this specific choice. The authors are encouraged to clarify this point more explicitly. It may also be more appropriate to move this discussion to the Methods section.
- We have rephrased the presentation of the examples so that they read more as a justification. Our reasoning about the examples mentioned (the triggering and vertical extent of convection) was as follows: we certainly do not expect these to be the only cases in which the uncertainty in the effects of the subgrid state on the gridscale state does not scale linearly with the predicted parametrised tendency, but they already seem justification enough to try a stochastic scheme that permits more complicated relationships between the uncertainty and the stochastic perturbation, and to check whether relaxing this requirement is beneficial in practice. We will include some discussion of why potential temperature was chosen, and give an explanation here too. From a naïve parcel theory examination of convection (that is, in the absence of entrainment), the main variables to consider when estimating the height and vigour of convection are the potential temperature and (except for dry convection) the humidity of the initial parcel, generally near the surface, and the vertical profile of potential temperature. We would thus expect a convection scheme to be very sensitive to perturbations in the near-surface values of both of these variables, and also sensitive to perturbations in the vertical profile of potential temperature, so we narrowed our choices to these. If we were to perturb both humidity and temperature, we would need to make additional choices in the structure of our scheme, namely how to correlate perturbations in the two variables (should a cooling perturbation always be paired with a drying one? The inverse? Should they be independent? Something in between?), and each would need its own scaling factors, all of which would need tuning to values that produce sensible model behaviour, necessitating more experiments.
To avoid this, we chose to perturb only one of the two, and because we hypothesised that the impact of perturbing potential temperature aloft might be greater than that of perturbing humidity aloft, we chose potential temperature. A different choice would still have been valid to investigate, and indeed the effects of a scheme that perturbs surface humidity instead are discussed in Tompkins and Berner 2008 (https://doi.org/10.1029/2007JD009284).
- Although I understand that running a large number of limited-area model simulations is computationally expensive, I wonder whether sampling variability over only one month is sufficient to ensure robustness. Some discussion of this limitation would be helpful.
- We will include some discussion of this limitation, as it is indeed a limitation. We do not have good evidence that the machine learning model’s training data covers a sufficient variety of states for it to perform well in all possible unseen conditions, nor do we know how much data this particular problem requires for a model to be predominantly interpolating rather than extrapolating when given unseen data. That said, we do not believe that this limitation detracts from the overall conclusions of the paper, firstly because we only show that the inclusion of the ML model in the scheme is beneficial in this specific test case (which is correct regardless of whether poor generalisability makes it detrimental in other cases, something which could be true but is beyond the scope of the paper), and secondly because the proposed type of scheme is shown to still be beneficial even if the ML model is removed entirely (albeit less so). Running more kilometre-scale simulations similar to the ones used here would be computationally prohibitive, though existing datasets, e.g. from the DYAMOND project, could be drawn on to create a larger dataset (differences in resolution between datasets would need addressing), and we are happy to mention this in the conclusions of the paper. If future work examining less idealised cases, or further examples of idealised cases, finds that the ML model performs poorly in different regions, times, or weather conditions, then this limitation may be responsible, so it is important to be wary of that possibility at this stage, and we thank the reviewer for this point.
- It would be useful to include a brief description in the main text of the numerical implementation in the single-column model (SCM), in particular regarding the use of ENNUF.
- When we say ENNUF translated the neural network, which we trained in Python using TensorFlow/Keras, to Fortran, we mean that ENNUF contains Fortran code for the components of a neural network (dense layers, convolutional layers, etc.) and can automatically generate a subroutine which calls those components in the correct order (which may be more complex than a sequential structure) with the correct arguments and the correct weights. The weights are printed into the Fortran subroutine as constants. The files containing this subroutine and the neural network components can then be placed in the Fortran project of your choice and called at the appropriate point, in our case the “fast physics” section of the timestep of the LFRic SCM.
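To make the idea of generating a Fortran subroutine with the weights embedded as constants more concrete, here is a minimal Python sketch. It is not ENNUF's code or API: the function names, the `nn_components` module, and the `dense` routine are all hypothetical, only sequential dense layers are handled, and Fortran line-length limits are ignored.

```python
import numpy as np


def fortran_constant(name, array):
    """Render a NumPy array as a Fortran `parameter` (compile-time constant)."""
    array = np.asarray(array, dtype=float)
    # Fortran's reshape fills arrays in column-major order, hence order="F".
    values = ", ".join(f"{v:.8e}" for v in array.flatten(order="F"))
    shape = ", ".join(str(n) for n in array.shape)
    return (f"  real, parameter :: {name}({shape}) = "
            f"reshape([{values}], [{shape}])")


def generate_subroutine(name, layers):
    """Emit Fortran source that applies dense layers in sequence.

    `layers` is a list of (weights, biases, activation) tuples, with weights
    of shape (n_out, n_in). `dense` and `nn_components` are hypothetical
    names standing in for a library of hand-written layer routines.
    """
    n_in = layers[0][0].shape[1]
    n_out = layers[-1][0].shape[0]
    declarations, calls = [], []
    previous = "x"
    for i, (weights, biases, activation) in enumerate(layers, start=1):
        declarations.append(fortran_constant(f"w{i}", weights))
        declarations.append(fortran_constant(f"b{i}", biases))
        output = "y" if i == len(layers) else f"h{i}"
        if output != "y":
            # Hidden activations need their own local work array.
            declarations.append(f"  real :: {output}({weights.shape[0]})")
        calls.append(
            f"  call dense({previous}, w{i}, b{i}, '{activation}', {output})")
        previous = output
    lines = [
        f"subroutine {name}(x, y)",
        "  use nn_components, only: dense",
        "  implicit none",
        f"  real, intent(in) :: x({n_in})",
        f"  real, intent(out) :: y({n_out})",
        *declarations,
        *calls,
        f"end subroutine {name}",
    ]
    return "\n".join(lines)
```

Calling `generate_subroutine("predict_sigma", layers)` with two small layers would return a string of Fortran source in which every weight appears as a literal constant, so the host model needs no file I/O or Python runtime to evaluate the network.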
Specific comments:
- l. 141-142: Was the model trained using randomly selected samples across all spatial domains and timesteps? To improve independence between training, validation, and test datasets, the authors might consider leaving out entire simulations (e.g. some LAM runs) or contiguous time periods.
- Yes, it was. This is a fair point: the analysis of the ML model’s performance on the test data (Fig. 5) may give an R2 score higher than it would be on test data that differ more substantially from the training data. This has a similar effect on the conclusions of the paper to the point made above concerning whether this dataset is sufficient to make the ML model robust to the wider variety of conditions it might see if deployed in a global model. That is, we don’t know how well it might perform on any data other than what we’ve tested it against here, and this could mean the beneficial effects of the ML model on the scheme are reduced or nullified in less idealised or other test cases. We can amend the wording so that the answer to this question is clear. Our result that this scheme is beneficial may not continue to hold when it is deployed in a global model, and these points about the ML model’s ability to generalise may be the cause if that turns out to be so. Our results are an encouraging sign that the effect of the scheme in other scenarios is worth investigating.
- l. 270: How is the height of the troposphere diagnosed?
- It is not; this was not made clear in the text. The perturbations are actually multiplied by a factor that is reduced linearly in height from 1 at model level 45 (about 14 km) to 0 at model level 50 (about 18 km). Saying “no perturbations in the stratosphere” is thus only approximately true. The model top is at 80 km, so perturbations are zero from 18 km to 80 km, which covers most or all of the stratosphere depending on the exact height of the tropopause; this could be above 18 km in the tropics, but we do not actually diagnose it. This linear tapering is also applied in the existing SPPT-type stochastic scheme in LFRic, and we included it in our scheme for similar reasons: if there ought generally to be little to no tendency from parametrised convection and boundary layer processes at these altitudes, and this is thought to be correct w.r.t. reality, then all random perturbations can do is make things worse, by very occasionally introducing erroneously inflated tendencies. We could have left this out and it probably would not have changed anything in this test case, because a random perturbation accidentally causing, say, moist convection that rises well above the tropopause is extremely unlikely (or maybe even impossible, since we cap perturbations at three times the predicted standard deviation of potential temperature). The chance of detrimental impacts may be more significant when the scheme is called more frequently, since in a global model the scheme could easily be called ~10^6 times per timestep. We have rephrased this part of the text to state that a linear-in-height taper was used, rather than saying that perturbations above the tropopause were removed.
- Figure 8: When either the length scale or the time parameter is varied, what value is used for the other parameter that is held fixed?
- 10 km and 6 hours respectively. These values are given on line 220, but we have now repeated this information in the figure 8 caption so that the reader doesn’t have to refer back to the text while interpreting the figure.
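The taper and cap described in the reply on l. 270 can be sketched as follows. This is an illustration only, not the LFRic implementation: it assumes 1-indexed model levels, interpolates linearly in level index as a stand-in for height, and assumes the perturbation is formed as a standard-normal draw multiplied by the predicted standard deviation. The function names and the 70-level column in the usage note are hypothetical.

```python
import numpy as np


def taper_profile(n_levels, start_level=45, end_level=50):
    """Multiplier equal to 1 up to start_level, falling linearly to 0 at end_level.

    Levels are assumed to be 1-indexed, and the ramp is linear in level index,
    which only approximates a ramp that is linear in height.
    """
    levels = np.arange(1, n_levels + 1)
    ramp = (end_level - levels) / (end_level - start_level)
    return np.clip(ramp, 0.0, 1.0)


def apply_perturbation(theta, sigma, noise, max_sigmas=3.0):
    """Add a capped, tapered perturbation to a potential temperature profile.

    theta: potential temperature on each level (K)
    sigma: predicted subgrid standard deviation on each level (K)
    noise: random draws in units of sigma (assumed standard normal)
    """
    # Cap the perturbation at max_sigmas times the predicted standard deviation.
    capped = np.clip(noise, -max_sigmas, max_sigmas) * sigma
    # Switch the perturbation off smoothly towards the upper levels.
    return theta + taper_profile(theta.shape[0]) * capped
```

For a hypothetical 70-level column, `taper_profile(70)` equals 1 at and below level 45, 0.6 at level 47, and 0 at level 50 and above, so any draw, however large, leaves the upper levels untouched.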
In addition, the reviewer points out that the evidence in favour of the use of this scheme may be strengthened by additional test cases beyond the ensembles run over the 1 month period in the tropical west Pacific presented here. This particular test case was chosen for the frequent convective activity in the region (which means we ought to see a measurable effect from our stochastic perturbations to the convection scheme) and the quantity and quality of observations in this time and place (which makes the comparison to reanalysis more meaningful), but other idealised test cases could also fulfil these criteria.
Although the scheme might benefit from using extra training data, it is not clear how much would be required to ensure that all possible scenarios have been covered. Similarly, it is not clear how many additional single-column locations would need to be simulated before asking how this scheme would perform in a 3D model. It is very much our plan to deploy and test this scheme in a 3D climate model in the future.
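The reviewer's suggestion of leaving out entire simulations, rather than sampling randomly across all domains and timesteps, could be sketched as below. This is a hypothetical illustration and not the procedure used in the manuscript: the function name, the fractions, and the idea of labelling each sample with an integer run identifier are all assumptions.

```python
import numpy as np


def split_by_group(group_ids, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Assign whole groups (e.g. one ID per LAM run) to train, validation or test.

    Returns three boolean masks over the samples. No group contributes
    samples to more than one of the three subsets.
    """
    group_ids = np.asarray(group_ids)
    rng = np.random.default_rng(seed)
    groups = rng.permutation(np.unique(group_ids))
    # Hold out at least one whole group for each of test and validation.
    n_test = max(1, int(round(test_fraction * len(groups))))
    n_val = max(1, int(round(val_fraction * len(groups))))
    test_groups = groups[:n_test]
    val_groups = groups[n_test:n_test + n_val]
    test_mask = np.isin(group_ids, test_groups)
    val_mask = np.isin(group_ids, val_groups)
    train_mask = ~(test_mask | val_mask)
    return train_mask, val_mask, test_mask
```

With, say, ten runs of fifty samples each, this holds out two complete runs for testing and two for validation, so a test score reflects performance on simulations the model never saw during training. The same function works for contiguous time periods if `group_ids` labels each sample with its period instead of its run.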
Citation: https://doi.org/10.5194/egusphere-2025-6312-AC2
Data sets
CRMML Cyril Morcrette https://doi.org/10.5281/zenodo.13332843
Model code and software
LFRic atmospheric model UK Met Office https://github.com/MetOffice/lfric_apps/
ENNUF machine learning translator Helena Reid, Theano Xirouchaki, Joana Rodrigues, and Cyril Morcrette https://github.com/MetOffice/ennuf
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 719 | 207 | 63 | 989 | 60 | 66 |
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
You have archived your code on GitHub. However, GitHub is not a suitable repository for scientific publication. GitHub itself instructs authors to use other long-term archival and publishing alternatives, such as Zenodo. Therefore, the current situation with your manuscript is irregular. Please publish your code in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy.
In addition, you must include a modified 'Code and Data Availability' section in a potentially reviewed manuscript, containing the information of the new repositories.
I must note that if you do not fix this problem, we cannot continue with the peer-review process or accept your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor