This work is distributed under the Creative Commons Attribution 4.0 License.
Representation learning with unconditional denoising diffusion models for dynamical systems
Abstract. We propose denoising diffusion models for data-driven representation learning of dynamical systems. In this type of generative deep learning, a neural network is trained to denoise and reverse a diffusion process, where Gaussian noise is added to states from the attractor of a dynamical system. Iteratively applied, the neural network can then map samples from isotropic Gaussian noise to the state distribution. We showcase the potential of such neural networks in experiments with the Lorenz 63 system. Trained for state generation, the neural network can produce samples that are almost indistinguishable from those on the attractor. The model has thereby learned an internal representation of the system that is applicable to tasks other than state generation. As a first task, we fine-tune the pre-trained neural network for surrogate modelling by retraining its last layer while keeping the remaining network as a fixed feature extractor. In these low-dimensional settings, such fine-tuned models perform similarly to deep neural networks trained from scratch. As a second task, we apply the pre-trained model to generate an ensemble out of a deterministic run. Diffusing the run and then iteratively applying the neural network conditions the state generation, which allows us to sample from the attractor in the run's neighbouring region. To control the resulting ensemble spread and Gaussianity, we tune the diffusion time and, thus, the sampled portion of the attractor. While easier to tune, this proposed ensemble sampler can outperform tuned static covariances in ensemble optimal interpolation. These two applications show that denoising diffusion models are a promising way towards representation learning for dynamical systems.
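The two procedures summarized in the abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the linear variance schedule, step count, and the `denoiser` callable (standing in for the trained neural network) are all assumptions made for the sketch.

```python
import numpy as np

# Linear variance schedule (an assumption; the paper's schedule may differ).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # signal fraction remaining after t steps

def forward_diffuse(x, t, rng):
    """Sample z_t ~ q(z_t | x): the state x after t Gaussian noising steps."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bars[t]) * x + np.sqrt(1.0 - alpha_bars[t]) * eps

def reverse_step(denoiser, z, t, rng):
    """One DDPM ancestral sampling step from z_t to z_{t-1}."""
    eps_hat = denoiser(z, t)  # network's noise prediction
    mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(z.shape)
    return mean

def sample(denoiser, dim, rng):
    """State generation: map isotropic Gaussian noise to a state sample
    by iterating the denoising network from t = T-1 down to 0."""
    z = rng.standard_normal(dim)
    for t in reversed(range(T)):
        z = reverse_step(denoiser, z, t, rng)
    return z

def ensemble_from_run(denoiser, x_run, tau, n_members, rng):
    """Ensemble generation around a deterministic state x_run: diffuse to
    pseudo-time tau, then denoise each member back to t = 0. A smaller tau
    keeps members near the run; a larger tau samples a wider portion of
    the attractor (the tuning knob mentioned in the abstract)."""
    members = []
    for _ in range(n_members):
        z = forward_diffuse(x_run, tau, rng)
        for t in reversed(range(tau + 1)):
            z = reverse_step(denoiser, z, t, rng)
        members.append(z)
    return np.stack(members)
```

For the Lorenz 63 case, `dim` would be 3 and `denoiser` the trained network; here any callable of signature `(z, t) -> array` can be plugged in to exercise the loops.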
Status: closed

RC1: 'Comment on egusphere-2023-2261', Sibo Cheng, 12 Mar 2024
This research paper presents a study on using denoising diffusion models for data-driven representation learning of dynamical systems. The research demonstrates the utility of such networks with the Lorenz 63 system, showing that the trained network can produce samples almost indistinguishable from those on the attractor, indicating the network has learned an internal representation of the system. This representation is then used for surrogate modeling and generating ensembles out of a deterministic run.
Overall, I found this paper very well written, and the contribution of introducing diffusion models into dynamical systems in geoscience novel and clear. Listed below are my comments to be addressed before I can recommend acceptance of this manuscript:
Comments:
1. If I understand correctly, the objective of this study is to explore the possibility of using diffusion models for high-dimensional systems in geoscience. The numerical experiments are carried out using a three-dimensional Lorenz model. To enhance the discussion, it would be beneficial if the authors could explain how generalizable their approach is to a high-dimensional spatio-temporal system (e.g. by adding CNN or transformer layers for feature extraction (encoding) and decoding, etc.).
2. As a consequence of the small dimension, the 'latent space' in your diffusion model (256) is much larger than that of the physics space (3). Therefore, you have little risk of losing any information when using the denoising network for surrogate modelling. The authors may consider adding a baseline of transfer learning from an untrained (randomly initialized) denoising NN in Fig. 7. The authors have shown the results of an untrained NN in Table 3, but only with linear fine-tuning. What happens if you fine-tune an untrained denoising NN with a nonlinear head?
Minor questions:
- In Figure 7, it seems that the dense neural network with two layers trained from scratch outperforms your transfer learning from the diffusion model. Is that the case? In fact, results in Table 3 also show that the models trained from scratch (dense *3 and ResNet) perform similarly to the fine-tuning from your diffusion model. The authors may want to add some comments regarding this.
- Page 3, 'generative training is rarely used for pretraining and representation learning of high-dimensional systems'. There are some works that have tried to use diffusion models as representation learners, e.g.:
Yang, X. and Wang, X., 2023. Diffusion model as representation learner. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 18938-18949).
Mittal, S., Abstreiter, K., Bauer, S., Schölkopf, B. and Mehrjou, A., 2023, July. Diffusion based representation learning. In International Conference on Machine Learning (pp. 24963-24982). PMLR.
The authors may want to include these references and discuss the differences/similarities compared to the method used in this paper. This paper is probably the first to propose diffusion-based representation learning in dynamical systems(?)
3. Page 9, 'show that this representation is entangled': why is it important for the learned features to be entangled?
4. Page 11, check the sentence ‘As we will see later, the bigger the Because of the statedependency, the resulting distribution is implicitly represented by the ensemble and could extend beyond a Gaussian assumption’
5. Page 13, it seems that you have used a lot of training samples (1.6e7) for your diffusion model for the Lorenz system of dimension 3. I was wondering if a standard surrogate model would require that much. That is, maybe a standard surrogate model can outperform the diffusion-based one with less training data. I am curious to hear the authors' thoughts.
6. Fig. 5(a) and 1(b): if I understand correctly, the x-axis is the pseudo-time instead of the real time in the dynamical system. If that is the case, it would be beneficial to add an x-axis label to avoid any confusion.
Citation: https://doi.org/10.5194/egusphere-2023-2261-RC1 | AC1: 'Reply on RC1', Tobias Finn, 24 May 2024

RC2: 'Comment on egusphere-2023-2261', Anonymous Referee #2, 03 Apr 2024
This is a very interesting and novel study on the use of denoising diffusion models for representation learning. The manuscript is well written, describes very nicely the context and how these approaches (rooted in image applications) can be adapted to the geosciences, and illustrates two distinct, relevant applications, surrogate modelling and ensemble generation, that are both extremely important in high-dimensional settings.
I think the manuscript can be accepted almost as it is, but I have a few minor comments I would encourage the Authors to look at.
1) While there is little space for doubt, I would strongly suggest the Authors specify that their approach applies to ergodic chaotic dynamics for which an invariant distribution exists that describes the state distribution on the system's attractor. An obvious counterexample would be a stable system having an equilibrium point (or a limit cycle) as attractor.
2) When mentioning the Schrödinger bridge (page 2), you may want to refer to Reich S. 2019 (doi:10.1017/S0962492919000011) as an exemplar study of the same analogy but in the area of data assimilation.
3) Line 27. "..dynamical systemS ..."
4) In the caption of Fig. 1b, use (left/right) to point the reader.
5) Line 44. I think you should always order references chronologically.
6) Lines 53-59. While I understand and like the Authors' narrative and choice of references, particularly for the readers of NPG it would be appropriate to also mention the large body of work on the generation of ensemble members based on dynamical systems theory and data assimilation. A good recent reference is 10.1029/2021MS002828.
7) I am a bit uncomfortable with the use of the term "latent". On the one side, I agree with a comment from the other Reviewer. On the other, I do also see in line 100 that you state z=x, which makes one deduce that the latent and actual state have the same dimension. Finally, while it is true that latent variables are defined in relation to their indirect (often hidden) relation with the observable quantities, with no reference to their number (or space dimension), in many practical applications the latent space is assumed/defined/used as being of smaller dimension.
8) Line 115. I would add ".... prior distribution FOR THE DENOISING PROCESS."
9) Equations (8). Wouldn't be better to (re)state clearly that we do not have access to x in practice?
10) Line 145. Is that because they do not depend on x?
11) Line 153. I think "Equation" must be written at the beginning of the sentence.
12) Line 176. Instead of "normally" I would suggest "most of the times".
Citation: https://doi.org/10.5194/egusphere-2023-2261-RC2 | AC2: 'Reply on RC2', Tobias Finn, 24 May 2024
Model code and software
cerea-daml/ddm-attractor, Tobias Sebastian Finn, https://doi.org/10.5281/zenodo.8406184
Viewed
- HTML: 558
- PDF: 294
- XML: 36
- Total: 888
- BibTeX: 34
- EndNote: 27