This work is distributed under the Creative Commons Attribution 4.0 License.
Data-driven Reconstruction of Partially Observed Dynamical Systems
Pierre Tandeo
Pierre Ailliot
Florian Sévellec
Abstract. The state of the atmosphere, or of the ocean, cannot be exhaustively observed. Crucial parts might remain out of reach of proper monitoring. Also, defining the exact set of equations driving the atmosphere and ocean is virtually impossible because of their complexity. Hence, the goal of this paper is to obtain predictions of a partially observed dynamical system without knowing the model equations. In this data-driven context, the article focuses on the Lorenz-63 system, where only the second and third components are observed and access to the equations is not allowed. To account for those strong constraints, a combination of machine learning and data assimilation techniques is proposed. The key aspects are the following: the introduction of latent variables, a linear approximation of the dynamics, and a database that is updated iteratively, maximising the innovation likelihood. We find that the latent variables inferred by the procedure are related to the successive derivatives of the observed components of the dynamical system. The method is also able to accurately reconstruct the local dynamics of the partially observed system. Overall, the proposed methodology is simple, easy to code, and gives promising results, even in the case of small numbers of observations.
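The setting described in the abstract can be reproduced in a few lines. The sketch below is hypothetical, not the authors' code: it integrates the classical Lorenz-63 equations, keeps only the second and third components as observations, and then checks the abstract's observation that the latent variable relates to derivatives of the observed components, using the known relation dx2/dt = x1(rho - x3) - x2. All function and variable names are illustrative.

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the classical Lorenz-63 system."""
    x1, x2, x3 = state
    return np.array([sigma * (x2 - x1),
                     x1 * (rho - x3) - x2,
                     x1 * x2 - beta * x3])

def rk4_step(f, state, dt):
    """One fourth-order Runge-Kutta integration step."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate(n_steps, dt=0.01, x0=(1.0, 1.0, 1.0)):
    """Integrate Lorenz-63 and return the full state history."""
    states = np.empty((n_steps, 3))
    state = np.array(x0, dtype=float)
    for t in range(n_steps):
        states[t] = state
        state = rk4_step(lorenz63, state, dt)
    return states

dt = 0.01
states = simulate(1000, dt=dt)
observations = states[:, 1:]  # only x2 and x3 are observed; x1 stays latent

# Sanity check of the abstract's finding: from dx2/dt = x1*(rho - x3) - x2,
# the unobserved x1 can be recovered as (dx2/dt + x2) / (rho - x3), with a
# finite-difference estimate of the derivative of the observed x2.
x2, x3 = observations[:, 0], observations[:, 1]
dx2dt = np.gradient(x2, dt)            # second-order finite differences
valid = np.abs(28.0 - x3) > 5.0        # avoid the singularity where x3 is near rho
x1_reconstructed = (dx2dt + x2) / (28.0 - x3)
```

Away from the singularity at x3 near rho, the reconstructed x1 tracks the true latent component closely, illustrating why derivatives of the observed components can serve as latent variables.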

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Preprint
(4992 KB)

Journal article(s) based on this preprint
Pierre Tandeo et al.
Interactive discussion
Status: closed

RC1: 'Comment on egusphere-2022-1316', Anonymous Referee #1, 16 Jan 2023

AC1: 'Reply on RC1', Pierre Tandeo, 14 Apr 2023
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2022/egusphere-2022-1316/egusphere-2022-1316-AC1-supplement.pdf

RC2: 'Comment on egusphere-2022-1316', Anonymous Referee #2, 31 Jan 2023
This work presents a data-driven method to infer a linear stochastic model from a partially observed system. It is well written and contains interesting parts, especially the not-so-common effort to explain the dynamics of the latent (embedding) space. Nevertheless, the simplifications made in the work greatly reduce the impact of this paper. Also, there is very little novelty in the approach. The principle of alternating between DA and a data-driven model has already been applied, in more challenging settings (noisy/sparse observations, models with more dimensions). Having a variable that is never observed has also already been tested. The originality of the approach, namely having a stochastic model and explaining the latent space, is not very developed.

Other general comments:
- The justification of the setting and the approach is not convincing to me (see my comments about the abstract and the introduction), and I fail to foresee the real application of the approach. Maybe rephrasing the last part of the conclusion and putting it in the introduction instead could help regarding that matter.
- The data-driven model used is linear. This is acknowledged by the authors in the conclusion, but it is one limit of the approach. Maybe the linear approach works because the setting is simple enough (low dimension, weakly nonlinear). But also, I wonder if the interpretability of the latent space is precisely related to the choice of the linear model (maybe with a nonlinear model, there is no need for a latent space to emulate observed variables...).
- The experiment is done on the Lorenz 63 model, which is very low-dimensional (3) and weakly nonlinear. See for example: https://raspstephan.github.io/blog/lorenz-96-is-too-easy/ There are toy models (L96, QG) that could display more interesting behaviors for this methodology.
- The forecast is evaluated only at the next time step, which is again a very easy case. How would the forecast behave over several time steps?
- The method is interestingly stochastic, but no ensemble metrics are used to evaluate the work, which would have been interesting.
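The last two general comments, on multi-step forecasts and on ensemble metrics, can be made concrete with a small sketch. The example below is hypothetical and not tied to the paper's code: it scores a synthetic stochastic forecast with the continuous ranked probability score (CRPS, in its empirical energy form) as a function of lead time; the `truth` and `ensemble` arrays are fabricated stand-ins whose spread grows with lead time.

```python
import numpy as np

def crps_ensemble(ensemble, truth):
    """Empirical CRPS of a 1-D ensemble against a scalar truth:
    mean |X - y| - 0.5 * mean |X - X'| (energy form; always >= 0)."""
    ensemble = np.asarray(ensemble, dtype=float)
    term1 = np.mean(np.abs(ensemble - truth))
    term2 = 0.5 * np.mean(np.abs(ensemble[:, None] - ensemble[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
n_members, n_leads = 50, 10

# Synthetic reference trajectory and an unbiased ensemble whose spread
# increases with lead time, mimicking growing forecast uncertainty.
truth = np.sin(0.3 * np.arange(n_leads))
spread = 0.1 * (1.0 + np.arange(n_leads))
ensemble = truth[None, :] + rng.normal(0.0, spread, size=(n_members, n_leads))

# One CRPS value per lead time: a probabilistic analogue of RMSE-vs-lead-time.
crps_per_lead = np.array([crps_ensemble(ensemble[:, k], truth[k])
                          for k in range(n_leads)])
```

Plotting `crps_per_lead` against lead time would show how the stochastic forecast degrades beyond the first step, which is the kind of diagnostic the review asks for.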
Specific comments:

Abstract: The first two sentences of the abstract are a justification of the approach. Due to the limited size of the abstract, this justification cannot be extended, making it too simplistic: 1) It is true that defining a set of equations is difficult, but I would say that a bigger issue is the resolution of the existing set of equations, given that coefficients are unknown and that a discretization is needed for the numerical resolution, which introduces some errors. 2) If we follow the narrative, it is well justified that we should cope with "imperfect equations". But here, the choice is to assume that no equation is known. The fact that those two points are overlooked makes the narrative a bit too simple to be convincing for me. I would suggest starting right away with what you want to achieve in the abstract and having an extended justification in the introduction.

L13, "governing differential equations are not necessarily known": I would like to see examples of that. I think that, even if some equations are known, a fully data-driven system can be justified, but here this core question is eluded: what is the range of applications of a purely data-driven model from partial observations?

L21, "All the approaches cited above are assuming that the full state of the system is observed, which is a strong assumption.": This is misleading. The papers above (at least Fablet, Bocquet and Brajard) assume that observations are noisy and sparse, but indeed each variable has a non-null probability of being observed. Is that what you mean by "the full state is observed"? There are also many works done in the case where a variable is never observed, e.g.: https://arxiv.org/pdf/2102.07819.pdf

Figure 1: My understanding is that the paper aims at going one step forward into learning a data-driven model in a realistic setting (by assuming that the state is not fully observed), but it assumes later on that a part of the state is always observed with a very small error. To me, this is a very strong assumption, even stronger than the assumptions made by the existing cited papers. So again, I don't see what application is targeted by this work.

L110, the "sequential methodology": Is there a theoretical reason to add the hidden components sequentially, or is this mainly practical? How do you see that applied to high-dimensional systems in which, e.g., 10^5 variables are non-observed?

L140: This part is, in my opinion, the most interesting part. But I miss some details to fully understand what is done (see below).

Eq. 7: How do you derive those equations? Is it by trial and error, or does it correspond to theoretical reasons?

L150, "correspond to a3 ≈ 0 and to a2 ≈ 0, respectively": sorry, I don't get the "respectively" here; in which case is a3 equal to 0, and in which case is a2 equal to 0?

L150-151, "This suggests that xdot3 is more important than xdot2": Why is that? You still have the b2 coefficient associated with xdot2...

L153-155: sorry, I have read this part several times and I still don't understand. What does it mean that "the algorithm focuses on the estimation of a_2"? I don't see where the estimation of a_2 happens in the algorithm, and I don't understand what is meant by "focus".

L158: The term "model-driven" is misleading. The data-driven model is also a model.

L175, "the dynamical evolution of the system is retrieved with our methodology": This is a strong assertion, since by construction the evolution of x2 and x3 is observed and you test the forecast skill over only one time step.

End of the introduction: I think it would be nice to have part of these comments in the introduction, to justify the approach.

Citation: https://doi.org/10.5194/egusphere-2022-1316-RC2
AC2: 'Reply on RC2', Pierre Tandeo, 14 Apr 2023
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2022/egusphere-2022-1316/egusphere-2022-1316-AC2-supplement.pdf
Peer review completion
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
281 | 132 | 22 | 435 | 11 | 9