Tuning a Climate Model with Machine-learning based Emulators and History Matching
Abstract. In climate model development, tuning refers to the important process of adjusting uncertain free parameters of subgrid-scale parameterizations to best match a set of Earth observations, such as the global radiation balance or global cloud cover. This is traditionally a computationally expensive step, as it requires a large number of climate model simulations to create a Perturbed Parameter Ensemble (PPE), which becomes increasingly challenging with increasing spatial resolution and complexity of climate models. In addition, this manual tuning relies strongly on expert knowledge and is thus not independently reproducible. Here, we develop a Machine Learning (ML)-based tuning method with the goal of reducing subjectivity and computational demands. The method consists of three steps: (1) creating a PPE of limited size with randomly selected parameters, (2) fitting an ML-based emulator to the PPE and generating a large PPE with the emulator, and (3) shrinking the parameter space with history matching. We apply this method to the atmospheric component of the Icosahedral Nonhydrostatic Weather and Climate Model (ICON) to tune for global radiative and cloud properties. With one iteration of this method, we achieve a model configuration yielding a global top-of-atmosphere net radiation budget in the range of [0,1] W/m2, and global radiation metrics and water vapor path consistent with the reference observations. Furthermore, the resulting ML-based emulator allows us to identify the parameters that most impact the outputs targeted by the tuning. The parameters identified as most influential for the physics output metrics are the critical relative humidity in the upper troposphere and the coefficient of conversion from cloud water to rain, which influence the radiation metrics and global cloud cover, together with the coefficient of the sedimentation velocity of cloud ice, which has a strong non-linear influence on all the physics metrics. The existence of non-linear effects further motivates the use of ML-based approaches for parameter tuning in climate models.
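The three-step workflow summarized in the abstract can be sketched in a few lines of Python. This is only an illustrative outline under assumed names: run_climate_model is a toy stand-in for the expensive ICON runs, and scikit-learn's GaussianProcessRegressor stands in for whichever emulator the authors actually use; it is not the authors' implementation.

```python
# Illustrative sketch of the three-step tuning workflow (not the authors' code).
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def run_climate_model(X):
    """Toy stand-in for the expensive ICON simulations (illustration only)."""
    rng = np.random.default_rng(0)
    return 2.0 * X[:, 0] - X[:, 1] ** 2 + 0.1 * rng.normal(size=len(X))


# (1) Small PPE: Latin-hypercube sample of normalized free parameters,
#     one (expensive) model run per member yielding a scalar output metric.
n_members, n_params = 30, 5
X_ppe = qmc.LatinHypercube(d=n_params, seed=0).random(n_members)
y_ppe = run_climate_model(X_ppe)

# (2) Fit a Gaussian-process emulator to the PPE and use it to predict the
#     metric (with uncertainty) for a much larger candidate parameter set.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_ppe, y_ppe)
X_candidates = qmc.LatinHypercube(d=n_params, seed=1).random(100_000)
mean, std = gp.predict(X_candidates, return_std=True)

# (3) History matching: rule out candidates whose emulated metric is
#     implausibly far from the observed target; keep the remaining space.
obs, obs_err = 0.5, 0.1                      # hypothetical target and its uncertainty
implausibility = np.abs(mean - obs) / np.sqrt(std**2 + obs_err**2)
not_ruled_out = X_candidates[implausibility < 3.0]   # conventional cutoff of 3
```

In the setting of the paper there would be one such emulator per output metric, and the not-ruled-out region would be the intersection over all metrics.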
Status: final response (author comments only)
CEC1: 'Comment on egusphere-2024-2508', Astrid Kerkweg, 06 Sep 2024
Dear authors,
in my role as Executive editor of GMD, I would like to bring to your attention our Editorial version 1.2:
https://www.geosci-model-dev.net/12/2215/2019/
This highlights some requirements of papers published in GMD, which is also available on the GMD website in the ‘Manuscript Types’ section:
http://www.geoscientific-model-development.net/submission/manuscript_types.html
In particular, please note that for your paper, the following requirements have not been met in the Discussions paper:
- "The main paper must give the model name and version number (or other unique identifier) in the title."
- "If the model development relates to a single model then the model name and the version number must be included in the title of the paper. If the main intention of an article is to make a general (i.e. model independent) statement about the usefulness of a new development, but the usefulness is shown with the help of one specific model,the model name and version number must be stated in the title. The title could have a form such as, “Title outlining amazing generic advance: a case study with Model XXX (version Y)”.''
- "Code must be published on a persistent public archive with a unique identifier for the exact model version described in the paper or uploaded to the supplement, unless this is impossible for reasons beyond the control of authors. All papers must include a section, at the end of the paper, entitled "Code availability". Here, either instructions for obtaining the code, or the reasons why the code is not available should be clearly stated. It is preferred for the code to be uploaded as a supplement or to be made available at a data repository with an associated DOI (digital object identifier) for the exact model version described in the paper. Alternatively, for established models, there may be an existing means of accessing the code through a particular system. In this case, there must exist a means of permanently accessing the precise model version described in the paper. In some cases, authors may prefer to put models on their own website, or to act as a point of contact for obtaining the code. Given the impermanence of websites and email addresses, this is not encouraged, and authors should consider improving the availability with a more permanent arrangement. Making code available through personal websites or via email contact to the authors is not sufficient. After the paper is accepted the model archive should be updated to include a link to the GMD paper."
All these rules apply to your paper. As you use ICON as a test case, please expand the title along the lines of: "Tuning a Climate Model with Machine-learning based Emulators and History Matching: a case study with ICON X.y"
Furthermore, your code availability section is insufficient. All code used for the publication needs to be available already at the time of the review / public discussion. Thus a promise of future availability, as made in your code availability section, is not acceptable. Please provide the code and data you used as soon as possible. This includes the ML algorithm used, the training data and the exact version of the ICON code used.
Yours,
Astrid Kerkweg (GMD executive Editor)
Citation: https://doi.org/10.5194/egusphere-2024-2508-CEC1
RC1: 'Comment on egusphere-2024-2508', Qingyuan Yang, 11 Sep 2024
Summary:
This work uses a Gaussian Process emulator together with history matching to estimate the parameters of the atmospheric component of the ICON model. It is well written, with high-quality tables and figures. The following aspects are considered and analyzed in the ML-based tuning approach: (1) the optimal set of parameters to tune; (2) different weights on the target variables to tune (priority is given to radiation). These two points and the discussion of them, together with the fact that very limited ensemble members are available, are the innovative parts of this manuscript, in my opinion. The work shows that (1) one iteration of the method can converge relatively well to a model configuration that is generally consistent with observations, and (2) the temporal variability in each model run can be of the same magnitude as the variability from the varied parameters for the dynamics outputs, which has implications for future studies on climate model PPEs. The analysis in this work is based on limited ensemble members, which is a challenge in building emulators, but I find the analysis to evaluate the performance of the emulator and to identify the sensitive parameters robust and convincing.
The confusing part of this manuscript is how the parameter sets are determined (e.g., Pp1, Pp2, and Ppd), which seems a bit arbitrary to those who do not know much about the model. The corresponding part of the text could also benefit from better organization. I recommend moderate to minor revision for this manuscript. Please see the comments below for more detailed questions.
Main comments
1. It is difficult to understand why and how parameter sets Pp1, Pp2, and Ppd are selected. It seems that Pp2 is selected partially because it has a strong influence on global cloud cover (also pointed out in Line 260), which is shown in Appendix C, but the way it is described suggests that Pp2 is first selected and then shown to do better for cloud cover. What about other possible parameter combinations?
I wouldn't consider Pp2 an extension of Pp1, because some parameters that are varied in Pp1 are kept fixed in PPE3 and PPE4, and some are varied in these two PPEs. What is the justification for this? Figs. 2 and 3 provide some explanation, but more discussion is needed. For example, the black points (PPE2) cluster around a certain area in Fig. 3c, which seems to suggest that Pr0 and crt do not need to be varied (hence fixed in PPE3 and PPE4), yet they are included and varied in Pp2. Another example could be csatsc: it belongs to the cloud cover scheme, but why is it not varied in Pp1 and Pp2, and why is its value fixed while other cloud cover parameters are varied? It is varied in Ppd but only briefly mentioned in Appendix C. The justification for why some parameters are included in Pp2 and Ppd needs to be stated more explicitly (references, analysis, knowledge of the simulated processes, or limited computational resources?).
2. In addition to emulator construction and history matching, this work does another level of analysis/comparison, which is to compare which parameter set (Pp1 or Pp2) is better to tune. This seems more like part of the method. Regardless, I think this level (i.e., the parameter set selection between Pp1 and Pp2) should be pointed out more explicitly somewhere in the manuscript. It would also help with the organization of the manuscript.
3. Some content in the Appendices, especially Appendices C and F, is tightly tied to the logic and flow of the manuscript. This is most pronounced at Lines 263-266, which are strongly supported by Fig. D1 and Fig. F1. Similarly, statements in the last few sentences of the abstract also seem to be supported by content in the Appendices. I think moving some content from the Appendices to the main text (e.g., presenting a subset of Fig. F1 for the parameters that are mentioned in the main text) would make the manuscript flow better.
Minor comments
1. Because of the main comments, I think Figure 1 and the workflow described in Lines 7-10 and 135-165 somewhat oversimplify (or undersell) the work done in this manuscript.
2. How many samples are generated from the emulator for history matching?
3. If something is mentioned earlier (as suggested in the Main comment #1), it would be easier to justify (or understand) Lines 224-235.
4. Fig. 3: it is interesting that Figs. 3a and 3b have two clusters of PPE2 points. Should this be pointed out?
Details
Lines 70-75: A bit confusing here, because the text seems to say either Pp1 + Ppd or Pp2 + Ppd; the comparison between Pp1 and Pp2 is not mentioned here.
Lines 75-77: the second tuning targets the dynamics outputs, but Lines 76-77 (starting from where ...) seem to suggest that the criterion is based only on achieving a nearly balanced global annual net radiation flux at TOA. Please clarify (although it is clarified later in Line 175). Maybe something like "keep the highest priority ... meanwhile trying to match ...".
Line 111: Why is the averaging period different (1980 in Line 105 and 1980-1989 in Line 111)?
Line 155: the symbol n is used too many times (here, for the number of ensemble members earlier, and in point 5, too).
Eq. 3: can you explain or provide a reference for why these are perturbed? (similar question to Main comment #1).
Line 192: do the "previous PPEs" here refer to PPEs from previous studies or to PPE1 and PPE2 done in this work? It seems to be PPE1 and PPE2, but this is not clear. Please clarify. This is related to Main comment #2.
Line 215 and Lines 219-223: I assume what Line 192 refers to is Lines 219-223? It might be better to move Lines 219-223 to Line 215.
Section 3.3.1 helps explain why some observations corresponding to the dynamics output cannot be matched. I think a sentence or two pointing this out would make the logic flow more nicely. For example, at the end of Line 295, add a sentence saying that the output variability is also a factor.
Line 347: I recommend deleting the sentence "since the required size of the PPEs .... The tuning parameter space.", as this is only true in practice. We know that, ideally, to fill the parameter space with points, one more parameter means significantly more points (e.g., the difference between 10^7 and 10^8).
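(A minimal numerical illustration of the scaling referred to here, assuming a naive design with $n$ samples per parameter and $d$ parameters, so $N = n^{d}$ ensemble members: with $n = 10$, going from $d = 7$ to $d = 8$ raises the required ensemble size from $10^{7}$ to $10^{8}$.)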
Tables and Figures
Table 5: the parameters fixed in PPE1 and in PPE3 are fixed for different reasons (default values for PPE1, but values apparently taken from PPE2 for PPE3). Maybe worth pointing that out? Besides that, I think this is a great table that summarizes what is done in this work.
Fig. 2: there are only 29 squares (PPE2, including the selected run). Is one outside the plotted extent or overlapping?
Fig. 4: there are fewer than 30 points for PPE3 and PPE4. Please specify why they are not all plotted here (maybe some are outside the extent).
Line 233: I recommend adding boxes in Fig. 4 showing the extent of Fig. 2 to highlight the point in this line.
Fig. 5: I think this is a figure with robust results, but I recommend setting the y-axis limits to 0-1 or -0.1 to 1. The way it is presented now, i.e., with the lower y-axis limit at -0.5, seems a bit unnecessary.
Fig. 7a: there are another two points just above the yellow triangle, and they also seem to fit Fig. 7a well. Maybe it would help to highlight where they are in Figs. 7b-d (where I assume they are far from the observations), so that the selection of the two runs is more convincing?
Qingyuan Yang
Citation: https://doi.org/10.5194/egusphere-2024-2508-RC1
RC2: 'Comment on egusphere-2024-2508', Frédéric Hourdin, 12 Sep 2024
The paper entitled "Tuning a Climate Model with Machine-learning based Emulators and History Matching" describes a tuning protocol applied to the Icon global climate model.
Although the paper is very clear, well written and easy to read, and probably useful for the community as well, I have major concerns which make it not suitable for publication at this stage.
Major concerns
My first major concern is about the novelty of the work and the way it is presented within the ongoing literature on the subject. Since this point concerns in particular scientific papers I was involved in (most of which are cited, so this is not an issue), and to avoid any ambiguity, I decided to sign my review, although I usually prefer not to. The title, the abstract and the introduction suggest or state that the originality of the paper is to use History Matching and Emulators to tune a climate model (for instance in the abstract, line 6, "Here, we develop a ML-based tuning method ..."). However, this method is exactly the one that was proposed by Daniel Williamson and first applied to an oceanic model in Williamson et al., 2017. Proofs of concept of the potential of the method to tune a global climate model were given in two papers (Hourdin et al., 2021, 2023). In the first paper, we showed how, with a combination of single-column simulations and global climate simulations, using History Matching with Gaussian Process (GP) based emulators, we were able to automatically retune the model’s free parameters and automatically reach a tuning as good as that of the previous 6A version of the model (used for CMIP6 production). The second paper, which is cited here in a general sentence about uncertainty quantification, presents a successful automatic tuning of the IPSL global climate model: 18 parameters are varied, and we go so far as to show that 2 of the finally selected simulations could have been used as reference configurations for CMIP6 in place of the IPSL-CM6A configuration, which was obtained after a long and fastidious phase of manual tuning. If we go into even more detail, the fact that the "physics tuning" is done considering the second year of 2-year-long forced-by-SSTs simulations (lines 94-95) is exactly the protocol we already published (Hourdin et al., 2021, 2023). In fact, the authors mention that it was a protocol already used for manual tuning (which is the case as well for the IPSL model). But it is interesting to underline that it seems to be a relevant and shared protocol.
This is not to say that the work itself is not interesting and does not deserve publication. I am actually quite convinced that we need many more publications of this type, and I am glad that this paper was submitted. As we have written in the conclusions of several papers on the subject, this History Matching approach for model tuning is not the end of the story. Rather, and this is something that has gained in strength and depth as we keep working with it, it is opening a new area for climate modeling, one with a lot of room and questions to investigate. Indeed, the approach only provides a framework. There are so many possible ways to implement it, concerning for instance the choice of metrics and model configurations. Depending on these choices, the approach may be more or less efficient or successful, in ways we do not know or understand very well yet. I think that this is the interesting part of the work, and therefore that the paper should really focus on the specificities of the protocol. The approach is not novel, but as it is quite recent, every new implementation of it brings new insights into modeling and models’ behavior. The authors could for instance reflect on: What is common with or different from previous studies? Why did they make a particular choice instead of another? For instance, from my perspective, one specificity is to propose successive phases of tuning in which some parameters are set to their “best values” and new parameters are varied. This might be an interesting choice and it would be very interesting to discuss it in more depth. It may be cheaper than varying all the parameters at once. It also probably makes tuning experiments easier to interpret, by separating questions from the beginning. On the other hand, it clearly forbids some compensation between parameters of the two phases, as clearly seen when looking at the TOA global net radiative budget in the various experiments. One thing I would wish to find in a revised version of the manuscript is the authors’ view on these questions, and more generally, I wish that this paper and other publications on tuning would be thought of and written as contributions to building this new science, beyond just reporting results (which is of course also an important aspect of publication).
My second major concern is about the way history matching is presented. It really is a major concern, since one important aspect of this particular moment in the history of climate modeling science is to clarify the concepts, establish a common vocabulary, and so on. There are two points I wish to make. First, the approach is presented as an optimization problem (see e.g. line 42 p2, line 54 p2, line 31 p7). Daniel Williamson, when promoting history matching, insisted rather heavily on the fact that it is not an optimization approach. After years of working with this approach, I am convinced that this is one of the most important aspects of his proposal. The approach consists in finding the region of the free-parameter space (a hypercube defined as the product of [min, max] segments for each free parameter) that is compatible with observations for a series of chosen metrics. “Compatibility” is defined through a set of tolerances to error, which should in principle include at least the uncertainty on the target (often observational uncertainty) and the model structural errors (generally unknown). A state that was reached by optimizing beyond this tolerance should not, and theoretically cannot, be preferred to another one. Of course, in practice, climate modelers may (and may have to) choose their “best” setup (which can, for instance, be selected using metrics not already used in the tuning procedure). But they should clearly motivate that choice and be conscious that this is outside the history matching philosophy. The second point is tightly related to the first one. It concerns the definition of Implausibility. When formulating history matching in a Bayesian framework, the Implausibility should include in the denominator the uncertainty of the emulator (as proposed here, Eq. 1) but also the tolerance to error associated with the other sources of uncertainty. One of the goals of the iterative refocusing usually associated with history matching is to reduce the uncertainty of the emulator and thus reduce the denominator to the a priori tolerance to error. Is the choice of not including the tolerance to error in the denominator related to the idea of seeing history matching as an optimisation problem? If yes, this should be discussed in much more depth and the mathematical foundation presented. It should be acknowledged that this is not what is proposed in the papers cited on History Matching. If it is a misunderstanding, I recommend modifying the text to correct this and interpreting the results with this in mind. In fact, in practice it is probably possible to discuss the results presented including the idea of tolerance to error.
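(For illustration only, with generic notation rather than the manuscript's Eq. 1: the implausibility commonly used in the History Matching literature referred to here takes the form

\[
I(\lambda) = \frac{\left| z - \mathbb{E}\!\left[f(\lambda)\right] \right|}{\sqrt{\operatorname{Var}\!\left[f(\lambda)\right] + \sigma_{\mathrm{obs}}^{2} + \sigma_{\mathrm{disc}}^{2}}},
\]

where $z$ is the observed target, $\mathbb{E}[f(\lambda)]$ and $\operatorname{Var}[f(\lambda)]$ are the emulator mean and variance at parameter setting $\lambda$, and $\sigma_{\mathrm{obs}}^{2}$ and $\sigma_{\mathrm{disc}}^{2}$ are the tolerance-to-error terms for observational uncertainty and model structural error; parameter settings with $I(\lambda)$ above a chosen cutoff, often 3, are ruled out.)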
The last concern, less fundamental, is that I am missing an evaluation of whether the proposed approach was conclusive in the end. Would you choose your final best simulation instead of the previous ICON-aes-1.3 configuration? The text and Figure 7 suggest that at least one metric is farther from the observations for all the simulations of the ensemble. This question could be addressed by computing some classical RMS metrics (see for instance Fig. 3 and the SI in Hourdin et al., 2023), independent from those used for tuning, for both the ensemble and the ICON-aes-1.3 configuration. The answer can be no, in which case it is still important to discuss the reasons for this relative failure in the conclusion (which is partly done already).
Specific comments
Abstract, line 6: change the wording to better acknowledge that it is an application of already published work.
Line 37, p2: it would be good to have a citation here, for instance one concerning the GFDL results reproduced in Fig. 3 of Hourdin et al. 2017.
Line 44, p2: I am not sure what you have in mind when saying "most commonly used one in climate model tuning". Citations?
Line 47, p2: I think you should give more details on the work done in the citations, to better position your work with respect to it.
Line 53, p2: the citation of Hourdin et al. 2023 should also be listed among the examples of use of history matching with GP emulators for climate model tuning.
Lines 64-67, p3: worth making this clearer.
Line 21: is 1 W/m2 not a very optimistic value for errors on TOA fluxes?
Lines 80-82, p8: these are worth expanding a little bit.
Table 6: is there some argument, statistical or physical, for saying that a value of R^2 > 0.75 is enough? It could be interesting to discuss this a little bit more.
Figure 5: why use only five samples? It should be quite cheap to run more to obtain more robust estimates, no?
Line 8, p48: you say that the approach can guide the sensitivity analyses [...] as we did with [...] Sobol indices. I would rather say that history matching is a way to make a much more complete sensitivity analysis than local linearization or Sobol indices.
Lines 65-66, p14: you could mention that cvtfall was identified as a tuning parameter widely shared among climate models in the Hourdin et al. 2017 synthesis paper.
Figure 6: why does the numbering of the panels start from "b, c, d ..." rather than "a, b, c ..."?
Figure 7 : other choices of marker color and thickness could make the figure easier to read.
line 96 p16 : "the effects of" can be removed
Figure 8: it took me some time to fully understand this (relevant and interesting) figure. Changing "against annual mean (1980" to "against the mean of one particular year (here 1980" could help.
Line 23, p19: "aided by an emulator for the outputs" could be phrased a little better, e.g. "aided by building and using emulators for each output metric".
Line 24, p19: I do not like the idea of using the term PPE to describe an ensemble of metrics computed with the emulator. For me, a PPE is an ensemble of (real) GCM runs.
Line 30, p19: it could be interesting to mention that these results may depend strongly on the setup used, with, in particular here, a small number of parameters.
Lines 43-44: this discussion is interesting and important. You could spend more time on it and make the link with the spread in the global radiative metrics in PPE5, even though the physics parameters were fixed beforehand.
Line 51, p19: it is good to remind the reader that the number of real simulations is still the limiting factor for tuning, but it would be worse with any other available method.
End of conclusion: I partly disagree, but on a rather fundamental level, with the last paragraph “[...] the seamless integration of such methods within the specific climate modeling framework - to practically enable a largely automatic application - is an aspect that needs to be addressed in further studies. We foresee that incorporating the other tuning steps, such as sensitivity analysis and choice of tuning parameters, their exploration and the evaluation of the outcomes in an automated approach will lead to more accurate and potentially computationally cheaper model tuning, also making this important step in climate model development more objective and reproducible.” Of course history matching allows us to make, in an objective, efficient and reproducible way, things which were very hard to conceive and formalize before. However, the choice of metrics has to and will remain subjective, given the dimension and complexity of the system. Using history matching with different metrics, either process oriented or end-user oriented, not only may be relevant depending on the target applications, but also may help to go much more in depth into the link between physics content and climate simulation performance. This is why we are convinced that History Matching is opening a new area in climate research and model tuning. And being able to make subjective choices more objective, or at least quantifiable (through parameter ranges, metrics choices, targets and tolerances), will allow sharing, improving, and making tuning more efficient in the future. But I am convinced we should absolutely avoid starting to propose standardizing and automating those choices. We need diversity. We should promote different teams trying different ways of using it, not only for improving simulations but also to understand the climate system better through numerical modeling.
Frédéric Hourdin
Citation: https://doi.org/10.5194/egusphere-2024-2508-RC2