the Creative Commons Attribution 4.0 License.
Ensemble analysis and forecast of ecosystem indicators in the North Atlantic using ocean colour observations and prior statistics from a stochastic NEMO/PISCES simulator
Abstract. This study is anchored in the H2020 SEAMLESS project (www.seamlessproject.org), which aims to develop ensemble assimilation methods to be implemented in Copernicus Marine Service monitoring and forecasting systems, in order to operationally estimate a set of targeted ecosystem indicators in various regions, including uncertainty estimates. In this paper, a simplified approach is introduced to perform a 4D (space-time) ensemble analysis describing the evolution of the ocean ecosystem. An example application is provided, which covers a limited time period in a limited subregion of the North Atlantic (between 31° W and 21° W, between 44° N and 50.5° N, between March 15 and June 15, 2019, at a 1/4° and a 1 day resolution). The ensemble analysis is based on prior ensemble statistics from a stochastic NEMO/PISCES simulator. Ocean colour observations are used as constraints to condition the 4D prior probability distribution.
As compared to classic data assimilation, the simplification comes from the decoupling between the forward simulation using the complex modelling system and the update of the 4D ensemble to account for the observation constraint. The shortcomings and possible advantages of this approach for biogeochemical applications are discussed in the paper. The results show that it is possible to produce a multivariate ensemble analysis continuous in time and consistent with the observations. Furthermore, we study how the method can be used to extrapolate analyses calculated from past observations into the future. The resulting 4D ensemble statistical forecast is shown to contain valuable information about the evolution of the ecosystem for a few days after the last observation. However, as a result of the short decorrelation time scale in the prior ensemble, the spread of the ensemble forecast is increasing quickly with time. Throughout the paper, a special emphasis is given to discussing the statistical reliability of the solution.
Two different methods have been applied to perform this 4D statistical analysis and forecast: the analysis step of the Ensemble Transform Kalman Filter (with domain localization) and a Monte Carlo Markov Chain (MCMC) sampler (with covariance localization), both enhanced by the application of anamorphosis to the original variables. Despite being very different, the two algorithms produce very similar results, thus somehow validating each other. As shown in the paper, the decoupling of the statistical analysis from the dynamical model allows us to restrict the analysis to a few selected variables and, at the same time, to produce estimates of additional ecological indicators (in our example: phenology, trophic efficiency, downward flux of particulate organic matter). This approach can easily be appended to existing operational systems to focus on dedicated users' requirements, at small additional cost, as long as a reliable prior ensemble simulation is available. It can also serve as a baseline to compare with the dynamical ensemble forecast, and as a possible substitute whenever useful.
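As an illustration of the first method named above, the analysis step of an ensemble transform Kalman filter can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a linear observation operator and a diagonal observation error covariance, and it omits the domain localization and the anamorphosis transformation described in the abstract. The function name `etkf_analysis` and the toy AR(1) prior ensemble are hypothetical. The state vector stacks all time steps, so observing the first days also updates the later, unobserved days through the time correlations of the prior ensemble.

```python
import numpy as np

def etkf_analysis(X, H, y, r_var):
    """Ensemble transform Kalman filter analysis step (no localization).

    X     : (n, m) prior ensemble, one member per column
    H     : (p, n) linear observation operator (assumption of this sketch)
    y     : (p,)   observations
    r_var : (p,)   observation error variances (diagonal R, an assumption)
    Returns the (n, m) analysis ensemble.
    """
    n, m = X.shape
    x_mean = X.mean(axis=1, keepdims=True)
    Xp = X - x_mean                       # prior perturbations
    Y = H @ X
    y_mean = Y.mean(axis=1, keepdims=True)
    Yp = Y - y_mean                       # perturbations in observation space
    C = Yp.T / r_var                      # Yp^T R^{-1}, shape (m, p)
    A = (m - 1) * np.eye(m) + C @ Yp      # inverse analysis covariance in ensemble space
    evals, evecs = np.linalg.eigh(A)
    Pa = (evecs / evals) @ evecs.T        # A^{-1}
    # Symmetric square root of (m - 1) * A^{-1}: transform of the perturbations
    Wa = np.sqrt(m - 1) * (evecs / np.sqrt(evals)) @ evecs.T
    w_mean = Pa @ C @ (y[:, None] - y_mean)   # weights for the mean update
    return x_mean + Xp @ (w_mean + Wa)

# Toy prior: 50 members of a 10-day AR(1) time series (hypothetical data).
rng = np.random.default_rng(0)
n_days, n_members = 10, 50
X_prior = np.zeros((n_days, n_members))
X_prior[0] = rng.normal(size=n_members)
for t in range(1, n_days):
    X_prior[t] = 0.8 * X_prior[t - 1] + 0.6 * rng.normal(size=n_members)

# Observe the first 7 days only; days 8 to 10 play the role of the forecast.
H = np.eye(n_days)[:7]
y = np.full(7, 1.0)
X_post = etkf_analysis(X_prior, H, y, r_var=np.full(7, 0.1))
spread = X_post.std(axis=1, ddof=1)       # posterior ensemble spread per day
```

Inspecting `spread` for the unobserved days gives a simple way to see how the time correlations of the prior control the growth of the forecast uncertainty in this toy setting.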

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Interactive discussion
Status: closed

RC1: 'Review of egusphere-2023-2026', Anonymous Referee #1, 17 Oct 2023
The authors present ensemble-based statistical experiments, akin to data assimilation, in which a static ensemble provides the prior statistics, and in which the posterior statistics are computed for only a part of the full model state. The study is interesting and presents some carefully worded, compelling results. The manuscript is easy to follow and mostly well written, and I have only a few general comments.
# general comments
The use of an offline or static ensemble to generate various statistics is an interesting one, but maybe not as novel as the manuscript suggests. Line 66 states that "For these reasons, there are certainly practical situations in which it would be interesting to append such a 4D statistical analysis and forecast to existing ensemble data assimilation systems. They may serve as a baseline to compare with the dynamical ensemble forecast, and as a possible substitute whenever useful." Such an approach has been implemented before, and is often referred to as ensemble optimal interpolation (EnOI). Evensen presents the idea as a computationally cheaper alternative to the EnKF (Evensen, 2003). More recent implementations are, for example, Oke et al. (2010) and Mattern and Edwards (2023), who use it for data assimilation with ocean color observations. They all rely on performing data assimilation with a static ensemble based on existing model output.
The manuscript is mostly well written, but sometimes assumes too much knowledge about the methodology from the readers. A few more sentences in the right places would be helpful to readers who are not familiar with Brankart (2019) and the few other papers that this manuscript's methodology is based on. I have highlighted specific instances in my comments below.
The figures in this manuscript are very useful, but they are missing labels, units and information. Generally, axis and color bar labels are missing. Further adding legends (e.g., "prior", "analysis") and labels (e.g., "MCMC" or "LETKF") would let many readers get most of the information without having to study the caption. Again, I have highlighted specific instances in my comments below, but all figures require labels.
Evensen G., 2003: The Ensemble Kalman Filter: theoretical formulation and practical implementation. https://doi.org/10.1007/s10236-003-0036-9
Oke P.R., Brassington G.B., Griffin D.A., Schiller A., 2010: Ocean data assimilation: A case for ensemble optimal interpolation. https://doi.org/10.22499/2.5901.008
Mattern J.P., Edwards C.A., 2023: Ensemble optimal interpolation for adjoint-free biogeochemical data assimilation. https://doi.org/10.1371/journal.pone.0291039
# specific comments
L 21: "thus somehow validating each other.": The "somehow" makes it sound too casual, I would suggest something like: "thus providing evidence for each other's estimates."
L 47: "to make an additional use of these expensive data": While I know what is meant here, "data" was previously used to refer to observational data. I would suggest adding "model" to make it clear that this is referring to model output. Furthermore, "obtained from a prior ensemble model simulation" could become "obtained from prior (ensemble) model simulations".
L 148: "multiplicative noise in the metrics of the model grid": What exactly does this mean, please explain in a bit more detail.
L 151: "two-dimensional maps of autoregressive processes": Are these horizontal maps that are smoothed in some way? A bit more information could be useful to the reader.
Figure 1, 2 and others: It would be helpful to add the property and units to the color bars of each figure (especially in Fig. 2, where they are not mentioned in the caption). Also, it would be useful to mention the date.
L 279: "..., thus ensemble members.": This statement is difficult to understand, please rephrase.
L 288: Why not mention right away (in the first or second sentence of this paragraph) that this implementation is based on Brankart (2019). This comment may also apply to the previous paragraph if the LETKF implementation is very similar to that in Brankart et al. (2003) which is mentioned only at the end of that paragraph.
L 295: "zero correlations are approximated by non-zero correlation": Perhaps generalize this statement a bit by saying that low correlations are typically overestimated with a small ensemble?
L 315: "... (with a normalized Gaussian random factor).": What is a normalized Gaussian random factor? Is it multiplied with the vector? This is already a long sentence, I suggest dividing it into two sentences and providing a bit more context.
Figure 4: It would be useful for the reader to add labels to the x- and y-axes. Furthermore, turning the x labels into dates, adding "LETKF" and "MCMC" into the top-left corner of the respective panel, and a legend for "prior", "analysis" and "forecast" would also make the plots much more accessible.
L 385: "remote observation is missed, ..." → "remote observations are missed" (or ignored)
L 410: I would suggest moving this paragraph to the next section, following the introduction of the CRPS score.
L 418: Mention that CRPS stands for Continuous Ranked Probability Score when it is first used, and provide a reference for it.
Figure 7 and others: additional panels with the interquantile range (e.g. 80%ile - 20%ile) could be useful to better visualize differences in the ensemble spread.
L 542: Out of curiosity, how bad would be the use of a log-transformation (plus perhaps a normalization) to perform the anamorphosis?
L 560: "the first date at which the chlorophyll concentration reaches half of its maximum value over the whole time period": half of the maximum value at that particular location or half of the maximum value in the domain?
L 570: What is the phenology in the observations, do the LETKF and MCMC solutions get closer to the observed values?
L 645: "This happens to be a location at which the 4-day LETKF forecast (...) is biased ...": Mention that this is the chlorophyll forecast (I presume) explicitly.
L 657: "But the same difficulty can be expected for any complex system, in which the confidence in the assumptions is bound to be low at the beginning, as long as little information is available, and then progressively enhanced." I don't quite understand this sentence, what is being enhanced here? I entirely agree with the point that in complex systems, sources of uncertainty are often ignored or not modeled adequately, and that can lead to artificially low uncertainty estimates in certain indicators.
L 698: "The main theoretical shortcoming of this approach is that the complex dynamical model is no more directly used to constrain the solution." But doesn't the dynamical model provide the prior ensemble which does affect/constrain the posterior estimates?
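The question raised on L 560 (local versus domain-wide maximum) can be made concrete with a short sketch of the phenology indicator as quoted, "the first date at which the chlorophyll concentration reaches half of its maximum value over the whole time period". This is only an illustration under assumptions: the function names, the array layout (members, time, locations) and the convention of returning -1 when the threshold is never reached are hypothetical and are not taken from the manuscript. The code computes both readings, so that the difference between them is explicit.

```python
import numpy as np

def bloom_onset_index(chl, reference_max=None):
    """Index of the first time step at which chl reaches half of a reference maximum.

    chl           : 1D time series of chlorophyll at one location, for one member
    reference_max : maximum used for the threshold; if None, the maximum of
                    this time series is used (the "local" reading)
    Returns -1 if the threshold is never reached (convention of this sketch).
    """
    chl = np.asarray(chl, dtype=float)
    ref = chl.max() if reference_max is None else reference_max
    above = np.nonzero(chl >= 0.5 * ref)[0]
    return int(above[0]) if above.size else -1

def ensemble_onset(chl_ens, use_domain_max=False):
    """Onset index for every member and location.

    chl_ens        : array of shape (members, time, locations)
    use_domain_max : if True, the threshold is half of each member's maximum
                     over the whole domain and period (the "domain" reading)
    Returns an integer array of shape (members, locations).
    """
    n_members, _, n_locations = chl_ens.shape
    onset = np.empty((n_members, n_locations), dtype=int)
    for i in range(n_members):
        domain_max = chl_ens[i].max() if use_domain_max else None
        for j in range(n_locations):
            onset[i, j] = bloom_onset_index(chl_ens[i, :, j], domain_max)
    return onset
```

Because the indicator is evaluated member by member, the result is an ensemble of onset dates at each location, whose spread can be read as the uncertainty of the phenology estimate.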
Citation: https://doi.org/10.5194/egusphere-2023-2026-RC1
AC1: 'Reply on RC1', Jean-Michel Brankart, 07 Dec 2023

RC2: 'Comment on egusphere-2023-2026', Michael Dowd, 22 Oct 2023
This study proposes a methodology for combining numerical model output with observations. The framework is that of Bayesian analysis. In contrast to most data assimilation, however, the emphasis is on using offline pre-generated numerical model ensembles as the prior information, rather than embedding a numerical model in the analysis scheme. The application is for multiple biogeochemical variables (while observing only one). I played around with a similar approach years ago (using an enKF, but with explicit time evolution), but never did a real application nor proper assessment (we decided to take another approach for our analysis of ocean carbonate variables and so it ended there). Hence, I like the approach, and am pleased that someone has taken it forward to the community, as I always thought it would be useful. My opinion is that there is a real need for these types of space-time analysis procedures that don't require explicit running of numerical models within the estimation procedure (but still use numerical model information). I note that lots of statistics people are doing problems like this with sophisticated spatio-temporal approaches (e.g. integrated nested Laplace approximations, INLA), but these tend not to be very accessible to the ocean community. The approach taken here is straightforward, builds on basic data assimilation principles, gets decent results, and should be accessible to most ocean data analysts. Hence I recommend publication with some minor revisions.
COMMENTS
There is a strong link with the foundational approach of optimal interpolation (which parallels the Kalman filter observation step). Since most people know about OI, it might be useful to make a quick note of it in order to make the approach more clear to the non-expert reader.
My main confusion in understanding the methodological development was how time was incorporated into the analysis. After a couple of rereads, I see this is made clear early on when you define the state vector (its dimension includes time). But this could be brought out more explicitly. When most people see that you have used the Kalman filter update machinery, they will wonder about evolution through time. Part of my confusion may also have arisen since when I did my version of this problem, I actually ran it sequentially in time with a daily time step, and did the (spatial) observation update whenever measurements were available. My time correlation model was an autoregressive one, and I used an enKF/enKS methodology. Your time correlation is implicitly embedded in the space-time covariance matrix that defines the multivariate state.
Lines 140-150. I found this discussion of uncertainties 2 and 3 confusing. I get that you are trying to account for unresolved scales. There is likely a better plain language way of saying what you are doing.
Do you think it is proper to equate a 4D inverse problem with a Bayesian estimation? I know there are links, but you have to handwave a lot to explain them. Why not just say you used a Bayesian method?
Nice job on highlighting the difficulties of using small ensembles, partial observation of the state, and the need to estimate a big multivariate state. The tricks to make this work (like localization) are appreciated by the reader. Similarly, nice job on trying out some “ecological indicators” which emphasize the multivariate state and how measurement on one variable can tell you about other variables (and project to depth). However, the ecological indicators are to me not so central, and if shortening the paper was required I would omit these.
The central quantity in such an estimation problem is the ratio of observation variance to the model (ensemble) variance. This will dictate how far the prior is moved by the observations in creating the posterior. This is captured in your Probabilistic Scores, I think. But with simple messaging, the point could be made clearer.
Figures need more details. You don't label the axes in some. You don't define which variables are being plotted in others. Etc.
An alternative approach is to use a parametric space-time covariance matrix. For example, a common approach is to use a Matérn covariance for space, and autoregression in time (and generally assume space-time separability for simplicity). This contrasts with your sample covariance matrix with post-processing (localization). Thoughts? Pros and cons?
The way you do MCMC would likely be called approximate Bayesian computation. That is, you make use of a cost function, rather than a more exact likelihood ratio.
Stationarity is, in general, likely the assumption that limits the forecast horizon. Training on one time period, and applying it to another time period is predicated on stationarity. However, your short-term forecasts of a few days mean this is not an issue since it is driven by decorrelation timescales.
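The comment above on the ratio of observation variance to ensemble variance can be illustrated with the textbook scalar case. This is a minimal worked example under assumptions, not a formula taken from the manuscript: a single Gaussian state variable is observed directly, with prior mean $x_b$ and prior (ensemble) variance $\sigma_b^2$, observation $y$ with error variance $\sigma_o^2$, and the ratio is written $\rho$ (all symbols are introduced here for illustration).

```latex
\begin{align}
  \rho &= \frac{\sigma_o^2}{\sigma_b^2}, &
  K &= \frac{\sigma_b^2}{\sigma_b^2 + \sigma_o^2} = \frac{1}{1 + \rho}, \\
  x_a &= x_b + K \, (y - x_b), &
  \sigma_a^2 &= (1 - K) \, \sigma_b^2 = \frac{\rho}{1 + \rho} \, \sigma_b^2 .
\end{align}
```

When $\rho \ll 1$ the posterior mean $x_a$ moves almost all the way to the observation and the posterior variance $\sigma_a^2$ approaches $\sigma_o^2$; when $\rho \gg 1$ the prior is left almost unchanged.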
Citation: https://doi.org/10.5194/egusphere-2023-2026-RC2
AC2: 'Reply on RC2', Jean-Michel Brankart, 07 Dec 2023

EC1: 'Comment on egusphere-2023-2026', Bernadette Sloyan, 01 Nov 2023
I strongly encourage the authors to provide details on how they intend to address the reviewers' comments. I look forward to receiving the revised manuscript and point-by-point reply to reviewers.
Citation: https://doi.org/10.5194/egusphere-2023-2026-EC1
Mikhail Popov
Jean-Michel Brankart
Arthur Capet
Emmanuel Cosme
Pierre Brasseur