Differentiable Programming for Earth System Modeling
Abstract. Earth System Models (ESMs) are the primary tools for investigating future Earth system states at time scales from decades to centuries, especially in response to anthropogenic greenhouse gas release. State-of-the-art ESMs can reproduce the observational global mean temperature anomalies of the last 150 years. Nevertheless, ESMs need further improvements, most importantly regarding (i) the large spread in their estimates of climate sensitivity, i.e., the temperature response to increases in atmospheric greenhouse gases, (ii) the modeled spatial patterns of key variables such as temperature and precipitation, (iii) their representation of extreme weather events, and (iv) their representation of multistable Earth system components and their ability to predict associated abrupt transitions. Here, we argue that making ESMs automatically differentiable has huge potential to advance ESMs, especially with respect to these key shortcomings. First, automatic differentiability would allow objective calibration of ESMs, i.e., the selection of optimal values with respect to a cost function for a large number of free parameters, which are currently tuned mostly manually. Second, recent advances in Machine Learning (ML) and in the amount, accuracy, and resolution of observational data promise to be helpful with at least some of the above aspects because ML may be used to incorporate additional information from observations into ESMs. Automatic differentiability is an essential ingredient in the construction of such hybrid models, combining process-based ESMs with ML components. We document recent work showcasing the potential of automatic differentiation for a new generation of substantially improved, data-informed ESMs.
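To make the calibration argument concrete, the following is a minimal sketch (not taken from the manuscript; the toy model, parameter values, and cost function are purely illustrative) of how automatic differentiation enables gradient-based tuning of a free parameter against observations, here using JAX.

```python
# Illustrative sketch only: objective calibration of one free parameter of a toy
# zero-dimensional energy-balance model by gradient descent, with the gradient of
# the cost function supplied by automatic differentiation (JAX).
import jax
import jax.numpy as jnp

def ebm_step(T, lam, forcing, dt=1.0, C=8.0):
    # dT/dt = (forcing - lam * T) / C, with lam the tunable feedback parameter
    return T + dt * (forcing - lam * T) / C

def simulate(lam, forcings, T0=0.0):
    def step(T, f):
        T_new = ebm_step(T, lam, f)
        return T_new, T_new
    _, trajectory = jax.lax.scan(step, T0, forcings)
    return trajectory

def cost(lam, forcings, observations):
    # mean squared misfit between the simulated trajectory and the "observations"
    return jnp.mean((simulate(lam, forcings) - observations) ** 2)

forcings = jnp.linspace(0.0, 4.0, 100)        # idealised, steadily increasing forcing
observations = simulate(1.2, forcings)        # synthetic truth generated with lam = 1.2

lam = 0.5                                     # first guess for the free parameter
grad_cost = jax.jit(jax.grad(cost))           # AD provides d(cost)/d(lam)
for _ in range(1000):                         # plain gradient descent; a real setup would use a proper optimiser
    lam = lam - 0.005 * grad_cost(lam, forcings, observations)
```

The same pattern scales, in principle, to the many free parameters of a full ESM, which is what makes automatic differentiability attractive for objective calibration.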
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2022-875', Samuel Hatfield, 26 Oct 2022
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2022/egusphere-2022-875/egusphere-2022-875-RC1-supplement.pdf
- RC2: 'Comment on egusphere-2022-875', Anonymous Referee #2, 04 Nov 2022
This paper makes the case that differentiable programming and automatic differentiation can greatly improve the utility and accuracy of earth system models. I agree with the premise and I think the paper is doing a service by urging researchers in the earth system sciences to use the best tools available. That said, I have a few concerns.
Many relevant works are not cited. There are several PDE software packages that employ differentiable programming: FEniCS, Firedrake, and devito, to name a few in chronological order. While not originally built for differentiable programming, it is also possible in deal.II using dual numbers. There are yet more software packages built on these toolkits for modeling individual components of the earth system, for example Gusto, Thetis, icepack, and VarGlaS. Granted, no one has built, say, a coupled atmosphere-ocean GCM using these tools, but they're worth mentioning nonetheless. The biggest omission regarding differentiable programming for PDE solvers is Farrell et al. (2013), Automated derivation of the adjoint of high-level transient finite element programs. This paper won the SIAM Wilkinson Prize for Numerical Software in 2015.
* 32, "Modern AD systems are able to differentiate most typical operations that appear in ESMs": What about flux or slope limiters? Do you believe in discretize then optimize, or the other way around?
* 40, "Third, additional information from observations can be integrated into ESMs with Machine Learning (ML) models." I'd say that ML tools enable you to construct very complex statistical models and train them with the data you have, but ML as such does not somehow enable you to integrate more information from this data into process-based physical models of the earth system than you could with a more old-school statistical system identification or parameter estimation viewpoint. This is classic information theory, see Kullback's 1958 book.
* 91-94, "Artificial neural networks (ANNs) can be seen as a subset of these models, but differentiable programming goes far beyond these building blocks": A lot of the wording here is conflating what problem you're trying to solve with how you're trying to solve it. Fitting the parameters of a model, whether it's an ANN or process-based physics model, is the answer to the "what" question. There are many ways you could solve this fitting problem. You could use derivative-free optimization methods -- it's not a very good idea, but you could do it. Using gradient-based optimization methods is the answer to a "how" question, and using AD to compute the gradient as opposed to deriving it on pen and paper (which you can still do for some PDE models) is a subset of that "how" question. The fact that you can differentiate through control flow or user-defined types is definitely a compelling reason to use AD. You do address this and quite well in section 4, but it's really important to make the distinction clear.
* 170: I think it's worth making a bigger deal out of the fact that you can get the second derivative so easily with AD. It's often painful but still possible to manually derive a first-order adjoint model, but going to second order by hand is really atrocious.
* 175: Here it's worth citing some of Noemi Petra's work, including her paper on stochastic Newton MCMC as well as her more recent work on hIPPYlib.
Citation: https://doi.org/10.5194/egusphere-2022-875-RC2
- AC1: 'Response to the Comments RC1 and RC2', Maximilian Gelbrecht, 25 Nov 2022
We would like to thank both reviewers for their reviews of our manuscript and the editor for handling it. In the following we address both reviews.
Response to RC1 / Review 1
We thank the reviewer for the positive assessment of our article.
Regarding the suggestion to optionally include the benefits for Data Assimilation (DA) and Numerical Weather Prediction (NWP), we fully understand the reviewer's point; in fact, we had discussed this within our author group before. Ultimately, in this article we want to focus on Earth System Modelling and the benefits and challenges of differentiable programming therein. In the revised version we would add a paragraph to the benefits section that mentions the potential and benefits for DA and NWP, similar to the references already included in the adjoint section, without going into too much detail. However, we would welcome the opportunity to extend this into a possible follow-up article, for which we would need the expertise of the reviewer or other experts such as Alan Geer, as DA and NWP are not within our core areas of expertise.
We are aware of the ongoing research of the team at Google that the reviewer mentioned; to our knowledge it has not been published yet. The scientists involved, such as Stephan Hoyer and Dmitrii Kochkov, have however published noteworthy papers on scientific machine learning and on integrating prior knowledge into machine learning methods. We would include those in a revised manuscript in the section on Challenges of Differentiable ESMs.
Response to RC2 / Review 2
We thank the reviewer for their detailed assessment of our article, which will certainly help to improve the revised version of the manuscript.
The anonymous reviewer's comments are reproduced as bullet points; the authors' answers follow below in regular paragraphs.
- Many relevant works are not cited. There are several PDE software packages that employ differentiable programming: FEniCS, Firedrake, and devito, to name a few in chronological order. While not originally built for differentiable programming, it is also possible in deal.II using dual numbers. There are yet more software packages built on these toolkits for modeling individual components of the earth system, for example Gusto, Thetis, icepack, and VarGlaS. Granted, no one has built, say, a coupled atmosphere-ocean GCM using these tools, but they're worth mentioning nonetheless. The biggest omission regarding differentiable programming for PDE solvers is Farrell et al. (2013), Automated derivation of the adjoint of high-level transient finite element programs. This paper won the SIAM Wilkinson Prize for Numerical Software in 2015.
We thank the reviewer for this comment. So far, our article does not go into detail on the different discretisation techniques that ESMs use. As far as we know, all of the projects that the reviewer lists here concern finite element methods (FEM). FEM is just one possible discretisation technique: (pseudo-)spectral, finite difference, finite volume, and other forms of discretisation also all play a role for ESMs. We therefore consider it outside the scope of the article to go into much detail on FEM solvers. However, in the revised article we would add a paragraph on discretisation techniques in general. In this paragraph we will mention that differentiable ESMs can be realized independently of the chosen discretisation method, as demonstrated for example by the Farrell et al. (2013) paper that the reviewer suggested, which shows that differentiable programming can also be applied to FEM models. Additionally, in the revised article we will include additional references showcasing prior research on combining ML techniques with ice-sheet models.
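As a minimal, purely illustrative sketch of this discretisation-independence (not taken from the manuscript; the scheme, grid, and values are assumptions for illustration), automatic differentiation applies just as well to a simple finite-difference scheme as to an FEM model:

```python
# Illustrative sketch: differentiating a first-order upwind finite-difference
# advection step with respect to the advection velocity, using JAX.
import jax
import jax.numpy as jnp

def upwind_step(q, c, dx, dt=0.1):
    # first-order upwind step for dq/dt + c dq/dx = 0 (periodic domain, c > 0)
    return q - c * dt / dx * (q - jnp.roll(q, 1))

def misfit(c, q0, q_target, dx, n_steps=50):
    q = q0
    for _ in range(n_steps):          # the time loop is simply unrolled by the AD system
        q = upwind_step(q, c, dx)
    return jnp.mean((q - q_target) ** 2)

x = jnp.linspace(0.0, 2.0 * jnp.pi, 64, endpoint=False)
dx = x[1] - x[0]
q0 = jnp.sin(x)
q_target = jnp.sin(x - 0.25)          # profile advected by a "true" velocity of 0.05

# sensitivity of the model-data misfit to the advection velocity at a first guess
dmisfit_dc = jax.grad(misfit)(0.02, q0, q_target, dx)
```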
- 32, "Modern AD systems are able to differentiate most typical operations that appear in ESMs": What about flux or slope limiters? Do you believe in discretize then optimize, or the other way around?
In general, slope limiters can also be part of differentiable ESMs. The practical implementation of an ESM component that includes a slope limiter will depend on the model in question, its solvers, discretisation and the choice of slope limiter. In the revised manuscript we would add comments on slope limiters near the paragraphs about enforcing constraints in the Challenges of Differentiable ESMs section.
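For illustration only (one possible implementation we assume here, not necessarily how any particular ESM does it), a minmod slope limiter written with element-wise selection operations remains compatible with automatic differentiation; the limiter is piecewise differentiable, and at the kinks AD returns one of the one-sided derivatives:

```python
# Illustrative sketch (one assumed implementation): a minmod slope limiter built from
# element-wise selections, which automatic differentiation can propagate through.
import jax
import jax.numpy as jnp

def minmod(a, b):
    same_sign = a * b > 0.0
    return jnp.where(same_sign, jnp.sign(a) * jnp.minimum(jnp.abs(a), jnp.abs(b)), 0.0)

def limited_slopes(q):
    # limited cell slopes from left and right differences on a periodic grid
    dq_left = q - jnp.roll(q, 1)
    dq_right = jnp.roll(q, -1) - q
    return minmod(dq_left, dq_right)

q = jnp.array([0.0, 1.0, 3.0, 3.5, 2.0, 0.5])
# The limited reconstruction is piecewise differentiable in the cell averages,
# so AD yields a well-defined (one-sided at kinks) Jacobian:
jacobian = jax.jacobian(limited_slopes)(q)
```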
- 40, "Third, additional information from observations can be integrated into ESMs with Machine Learning (ML) models." I'd say that ML tools enable you to construct very complex statistical models and train them with the data you have, but ML as such does not somehow enable you to integrate more information from this data into process-based physical models of the earth system than you could with a more old-school statistical system identification or parameter estimation viewpoint. This is classic information theory, see Kullback's 1958 book.
We thank the reviewer for this comment, but it is not quite clear to us which old-school statistical system identification or parameter estimation methods they refer to. Machine learning methods like artificial neural networks are also not really new; they build upon statistics and optimisation theory like many other methods. ANNs do, however, provide extremely flexible universal function approximators that, through their very high capacity, are able to model more complex behaviour than many other methods. In our article we also cite various papers, e.g., the work by Um et al., Yuval et al., and Rasp et al., which showcase how ANNs can be used to improve a more traditional subgrid parametrisation. The point we were trying to make here is that, once a process-based ESM (component) is formulated such that it is automatically differentiable, it is much easier to seamlessly combine it with ML components; moreover, jointly optimizing the parameters of the process-based component and the parameters of the ML part is only possible if both are automatically differentiable.
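A minimal sketch of that joint optimization (not the manuscript's implementation; the model, network, and parameter values are hypothetical): a process-based core with one physical parameter plus a small neural-network closure, with a single AD gradient taken with respect to both at once.

```python
# Hypothetical sketch of a hybrid, end-to-end differentiable model: a process-based
# tendency with a physical parameter `lam` plus a neural-network closure, with one
# AD gradient covering both the physical and the ML parameters.
import jax
import jax.numpy as jnp

def nn_closure(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2

def hybrid_step(state, lam, nn_params, dt=0.1):
    physics = -lam * state                        # known, process-based part
    correction = nn_closure(nn_params, state)     # learned subgrid-scale correction
    return state + dt * (physics + correction)

def loss(theta, state0, target):
    lam, nn_params = theta
    state = state0
    for _ in range(20):
        state = hybrid_step(state, lam, nn_params)
    return jnp.mean((state - target) ** 2)

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
nn_params = (0.1 * jax.random.normal(key1, (3, 16)), jnp.zeros(16),
             0.1 * jax.random.normal(key2, (16, 3)), jnp.zeros(3))
theta = (0.5, nn_params)                          # physical and ML parameters in one pytree

state0, target = jnp.ones(3), jnp.zeros(3)
grads = jax.grad(loss)(theta, state0, target)     # gradients for lam and for all network weights
```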
- 91-94, "Artificial neural networks (ANNs) can be seen as a subset of these models, but differentiable programming goes far beyond these building blocks": A lot of the wording here is conflating what problem you're trying to solve with how you're trying to solve it. Fitting the parameters of a model, whether it's an ANN or process-based physics model, is the answer to the "what" question. There are many ways you could solve this fitting problem. You could use derivative-free optimization methods -- it's not a very good idea, but you could do it. Using gradient-based optimization methods is the answer to a "how" question, and using AD to compute the gradient as opposed to deriving it on pen and paper (which you can still do for some PDE models) is a subset of that "how" question. The fact that you can differentiate through control flow or user-defined types is definitely a compelling reason to use AD. You do address this and quite well in section 4, but it's really important to make the distinction clear.
We thank the reviewer for their careful review of this section; indeed, we should have made this clearer. In the revised manuscript we will revise this section and more clearly distinguish the "what" from the "how" from the perspective of an Earth system modeler.
- 170: I think it's worth making a bigger deal out of the fact that you can get the second derivative so easily with AD. It's often painful but still possible to manually derive a first-order adjoint model, but going to second order by hand is really atrocious.
We agree with the reviewer that computing second derivatives can have considerable benefits in theory, and we do mention this in a number of places in the manuscript, e.g., in the overview figure. However, computing second derivatives also comes at a cost. In particular, computing and storing the full Hessian can consume too much memory to be practical. Often, methods instead estimate a Hessian-vector product in ways that do not require forming the full matrix of second derivatives at all. That being said, if more models and tools are able to compute second derivatives easily, more algorithms may be developed that avoid the huge memory cost of the full Hessian. We will therefore add another sentence on Hessians to the manual vs. automatic adjoint section.
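As an illustration of this point (a generic AD pattern, not specific to the manuscript), a Hessian-vector product can be obtained by forward-mode differentiation of a reverse-mode gradient, at roughly the cost of a few gradient evaluations and without ever storing the full Hessian:

```python
# Illustrative sketch: Hessian-vector product via forward-over-reverse AD in JAX.
# The full Hessian of a million-parameter cost function would have 10^12 entries;
# the product H @ v below needs only O(n) memory.
import jax
import jax.numpy as jnp

def cost(theta):
    return jnp.sum(jnp.sin(theta) ** 2 + 0.1 * theta ** 4)

def hvp(f, theta, v):
    # forward-mode JVP applied to the reverse-mode gradient: returns H(theta) @ v
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

theta = jnp.linspace(-1.0, 1.0, 1_000_000)
v = jnp.ones_like(theta)
Hv = hvp(cost, theta, v)    # same shape as theta; the Hessian itself is never formed
```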
- 175: Here it's worth citing some of Noemi Petra's work, including her paper on stochastic Newton MCMC as well as her more recent work on hIPPYlib.
We thank the reviewer for pointing us to this work and will add it to the revised manuscript.
On behalf of the authors,
With best regards,
Maximilian Gelbrecht
Citation: https://doi.org/10.5194/egusphere-2022-875-AC1
Viewed
- HTML: 629
- PDF: 211
- XML: 16
- Total: 856
- BibTeX: 7
- EndNote: 4
Maximilian Gelbrecht
Alistair White
Sebastian Bathiany
Niklas Boers