the Creative Commons Attribution 4.0 License.
Hybrid machine learning data assimilation for marine biogeochemistry
Abstract. Marine biogeochemistry models are critical for forecasting, as well as for estimating ecosystem responses to climate change and human activities. Data assimilation (DA) improves these models by aligning them with real-world observations, but marine biogeochemistry DA faces challenges due to model complexity, strong nonlinearity, and sparse, uncertain observations. Existing DA methods applied to marine biogeochemistry struggle to update unobserved variables effectively, while ensemble-based methods are computationally too expensive for high-complexity marine biogeochemistry models. This study demonstrates how machine learning (ML) can improve marine biogeochemistry DA by learning statistical relationships between observed and unobserved variables. We integrate ML-driven balancing schemes into a 1D prototype of a system used to forecast marine biogeochemistry in the North-West European Shelf seas. ML is applied to predict (i) state-dependent correlations from free-run ensembles and (ii), in an "end-to-end" fashion, analysis increments from an Ensemble Kalman Filter. Our results show that ML significantly enhances updates for variables that were previously not updated, compared to univariate schemes akin to those used operationally. Furthermore, ML models exhibit moderate transferability to new locations, a crucial step toward scaling these methods to 3D operational systems. We conclude that ML offers a clear pathway to overcome current computational bottlenecks in marine biogeochemistry DA, and that refining transferability, optimizing training-data sampling, and evaluating scalability for large-scale marine forecasting should be future research priorities.
Status: final response (author comments only)
-
RC1: 'Comment on egusphere-2025-1676', Anonymous Referee #1, 28 May 2025
The manuscript presents machine learning-based techniques aimed at improving biogeochemical data assimilation, either by speeding it up or by increasing its performance through increments to unobserved variables. The results of this study are compelling, but the data assimilation setup is highly idealized, which limits its significance.
General comments
Machine learning (ML) techniques will likely gain more and more traction in marine biogeochemical data assimilation applications. The authors present an interesting study with compelling results, but I worry that the choice of using synthetic data strongly limits how much we can learn about how to implement ML-assisted DA in less idealized situations. The allure of synthetic data is apparent: there is a true model state, we know all about this truth and can thus examine DA improvement even regarding unobserved variables. However, in this study, the synthetic observations were generated from a nature run that uses the same biogeochemical and physical model as the DA systems. Thus, the setup is highly idealized, and the ML techniques can learn correlations between variables that directly relate to the dynamics that generated the observations. Consequently, the ML-based DA strongly benefits from accurately improving (some) unobserved variables. But is this still an advantage if the true unobserved variables do not follow model dynamics? I would suggest a follow-up experiment in which the DA techniques are confronted with "real" data. Various data are collected at the L4 station, and satellite data could be used. Do the improvements in chlorophyll forecast skill seen for the synthetic data translate to improvements for real data compared to the more standard DA approaches?
The authors single out zooplankton as unobserved variables that "do not update well" (Sec 5, par 3). However, it could be that this issue with zooplankton updates is due to the way the biogeochemical model is parameterized. Depending on the way zooplankton grazing, phytoplankton mortality and other processes are parameterized in the model, phytoplankton/PFTs and zooplankton can be more strongly or weakly linked. This effect could mean that the result that zooplankton, specifically, is difficult to estimate does not generalize to other model configurations. Yet, I agree with the authors that there can always be some variables that are difficult to estimate, perhaps even more so in a DA setup without synthetic observations.
When I first read the title, I was expecting an approach to better inform 3D biogeochemical models with observations. The abstract then mentions that 1D models are used -- which seems like a good choice given the complexity of 3D models. However, when I read that the DA was basically performed in 0D, I felt that an opportunity was missed to examine if ML approaches can yield improved spatial increments. Variational DA requires the specification of length scales, ensemble-based approaches typically rely on localization to limit the DA update spatially, but ML approaches could potentially learn how to best spread the increment spatially. But by eliminating all spatial dimensions from the DA and simply applying any DA update throughout and only in the mixed layer, this important DA aspect is not examined in this study at all. Why are only 0D increments used in this study, how much additional effort would be required to create a full 1D update?
Depending on the setup (and more obviously in a DA configuration with real observations), the surface-only DA approach could lead to worse performance of the DA systems. The study does not provide many specifics about the ensemble generation or the nature run, but if the model state below the mixed layer is sufficiently different in the data-generating nature run and the simulation used in the DA, then there is an error source that the DA system (with or without ML) cannot adequately correct. An example: Higher nitrate (or higher DOM, perhaps simply more C or N) below the mixed layer in the DA compared to the nature run could result in continuous overestimation and subsequent downward corrections of nitrate, chlorophyll and zooplankton in the mixed layer without addressing the underlying issue of too much nutrient input. This type of scenario could also manifest itself in zooplankton drifting away from the truth despite a series of beneficial corrective DA updates.
The authors trained their ML models in one location and show that this approach yields comparatively bad results when transferring the models to a new location. Here it would have been interesting to train the models on data from 2 or more (sufficiently different) locations to examine if these more generalized models would (1) perform similarly well on one of their training locations as the specialized models and (2) if the generalized training would make the models more transferable to new locations. This is likely beyond the scope of this study, but perhaps the authors could discuss this point.
As mentioned in the manuscript, the DA update in Eq. 1 represents the best linear unbiased estimator (BLUE) and the authors kept the ML-based DA approaches in the linear framework by estimating elements of Eq. 1. But one could imagine going beyond that and estimate nonlinear relationships between model state, observations, and increment, for example using neural networks. In how far do the authors think the DA framework presented here could be improved further using other ML techniques?
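For context, the linear (BLUE) update being discussed here can be sketched in a few lines of NumPy. This is a generic illustration with assumed toy dimensions and matrices, not the authors' implementation:

```python
import numpy as np

# Toy sizes: n state variables, m observed quantities (assumed for illustration)
n, m = 5, 1
rng = np.random.default_rng(0)

B = np.eye(n) + 0.1 * np.ones((n, n))  # background-error covariance (illustrative)
H = np.zeros((m, n))
H[0, 0] = 1.0                          # linear observation operator: observe variable 0
R = 0.2 * np.eye(m)                    # observation-error covariance

x_f = rng.normal(size=n)                        # background (forecast) state
y = H @ x_f + rng.normal(scale=0.1, size=m)     # synthetic observation

# BLUE / Kalman update: dx = K (y - H x_f), with K = B H^T (H B H^T + R)^{-1}
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
dx = K @ (y - H @ x_f)
x_a = x_f + dx                                  # analysis state
```

A nonlinear ML scheme of the kind suggested above would replace the fixed linear map K with a learned function of the model state and observations.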
Specific comments (based on the preprint document, which does not have numbered lines)
P 1, par 1: I would suggest changing "and lead" to "which can lead to".
P 2, par 1: "size-class chlorophyll [...] and other types of in situ data": This makes it sound like size-class chlorophyll is an in situ data product, when it typically is based on satellite estimates.
P 2, par 2: "The multivariate updates can happen in the DA step, through ensemble-informed background covariances (as in the EnKF), or, through balancing schemes, such as the scheme of Hemmings et al. (2008) based on nitrogen mass conservation ...": In variational DA, multivariate updates also happen through the tangent-linear and adjoint models. However, this type of update has issues as well and can lead to unrealistic updates because there are typically no prescribed covariance terms between different variables. For example, in Mattern et al. (2017; DOI: 10.1016/j.ocemod.2016.12.002), the authors had to reduce the updates to unobserved nitrate in order to avoid unrealistic nitrate accumulation at the ocean surface.
P 3, Sec 2.1: "It provides a sufficient balance between realism and computational cost": How "sufficient" a 1D model is, may be very dependent on the application. I would suggest motivating it a bit better based on the goals of this study.
Eq. 1: There is no error here, but it looks like the Δx was meant to be included in Eq 1.
P 5, Sec 2.4: "x^f is the background state": This notation is used in many studies, but for readers who may not know it, either motivate the superscript "f" by mentioning "forecast" or change it to "b" to match "background".
P 6, par 1: "In this work, the state of the system, both x^f and x^a, comprises the surface values of most pelagic variables in ERSEM.": Based on the abstract and previous text, I was expecting a 1D DA system, and not 0D. Reading a bit further, I see that this is not simply a 0D update: the full state vector is updated, but only within the mixed layer.
Eq. 4: Why is the summing of the 4 chlorophyll variables to total chlorophyll not included in H? That is, why doesn't H have the form [1, 1, 1, 1, 0, 0, ...] or similar, where the four non-zero entries are associated with the chlorophyll variables?
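The suggested form of H can be made concrete with a minimal sketch, assuming a hypothetical state ordering in which the four PFT chlorophyll variables come first:

```python
import numpy as np

n_state = 10                        # toy state size (assumed)
H = np.zeros((1, n_state))
H[0, :4] = 1.0                      # H = [1, 1, 1, 1, 0, ..., 0]: total chlorophyll
                                    # as the sum of the four PFT chlorophylls

x = np.arange(n_state, dtype=float)  # toy state vector
total_chl = (H @ x)[0]               # 0 + 1 + 2 + 3 = 6.0
```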
Eq. 7: Mention what "R" is here.
Table 2, line 3: I would consider a phrase like "ensemble-based" more useful than the "itself".
Eq. 9: This may be a Latex issue, but the "COR" looks very much like "CO*R" with the R from the previous equation. Why not use a lower-case rho, which is often used to denote correlations?
Eq. 9: Are these properties obtained from the long simulation or prescribed some other way? Reading on, I see that COR_i,c is being estimated using the ML techniques. This could be made explicit here.
Sec 3.2, par 1: "to predict the state-dependent correlations between observed and unobserved quantities in Eq. (9)." I would suggest adding the symbol (currently COR_i,c) so the reader knows immediately what is being estimated. Furthermore, I would suggest using "estimate" rather than "predict", as this is not done in a forecast sense.
Sec 3.2, par 1: "is scaled according its climatological maximum": Does this mean that COR_i,c is estimated with data from all locations, and it is then rescaled for each location? A better description and some more information would be useful here. Apparently COR_i,c is location-specific, is some time-dependence assumed, if so, what kind? How many parameters need to be estimated here? Stating these basic facts early on helps the reader understand the main idea before going into implementation details.
Sec 3.2, par 1: What motivates the use of the OI name here?
P 9, par 2 "In ML-EtE, we assume that the essential properties of each statistical object that creates an analysis increment, such as covariances and observation uncertainty, can be more effectively captured by directly predicting the analysis increment rather than predicting every component individually and allowing errors to compound across multiple independent predictions that are then combined into a single value.": This sentence is too long and difficult to understand. Also, "each statistical object that creates an analysis increment, such as covariances and observation uncertainty" sounds like the covariances create an analysis increment and also the observation uncertainty creates an analysis increment. Please rephrase. But furthermore, if I understand this sentence correctly, it seems to argue for directly estimating the Kalman gain matrix rather than its components?
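As a purely illustrative sketch of what "directly estimating the Kalman gain" could look like (the data are synthetic and the least-squares fit is an assumption for illustration, not the manuscript's ML-EtE method): given pairs of innovations and increments from an EnKF run, a gain can be fitted in one step rather than assembled from B, H and R separately:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_samples = 5, 200

# Synthetic (innovation, increment) training pairs, as an EnKF run could provide
K_true = rng.normal(size=(n, 1))                            # stand-in "true" gain
d = rng.normal(size=(n_samples, 1))                         # innovations y - H x^f
dx = d @ K_true.T + 0.01 * rng.normal(size=(n_samples, n))  # analysis increments

# Fit the gain directly by least squares on the pairs
K_hat, *_ = np.linalg.lstsq(d, dx, rcond=None)              # shape (1, n)

# Map a new innovation to a multivariate increment with the learned gain
dx_new = np.array([[0.3]]) @ K_hat
```

A neural network would generalize this linear fit to a state-dependent, nonlinear map from innovations to increments.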
Sec 3.4, par 1: "(1) a set-up where we choose to update only nitrate": I presume chlorophyll is updated as well? Maybe rephrase to "a set-up where the only unobserved variable that is updated is nitrate".
Sec 3.6.1: What are "cycles" here, and is the trajectory more than just a snapshot? Based on Eq. 10, one could assume a cycle is a time step.
Sec 3.6.1: "The expected RMSE": Why not call it the "ensemble average RMSE"?
Fig. 2: I think it is counter-intuitive to have nitrate shown in green and chlorophyll in black. I would suggest switching the colors. Also counter-intuitive, but maybe to a lesser extent: the light-limited time-period is shown in the brightest color. Finally, why show an unspecified "arbitrary" year instead of the climatology in the top panel, when the correlation is based on climatological values?
P 12, par 1: Make it explicit that these RMSE values are for the correlation estimates.
P 12, par 3: Explain better what the terms offline and online are referring to here. I would suggest introducing the offline term at the very start of the section when the experiment is introduced.
P 12, par 3: "so that any update to the system can have dynamical impact on later DA cycles as the model": This sentence ends abruptly without a period.
Fig. 4: The "relative ratio" label is not helpful, "relative to observational error" would be better. The title says "Nitrates"; there is a syntax error in the unit label of the second panel -- which also appears in Fig. 5.
Fig. 4: Why is the performance of the EnKF seemingly much worse than the RUS scheme for observed chlorophyll? Mention this in the text.
P 13, par 1: "as a percentage of the observation error": Are these truly percentage values?
P 13, par 1: "error. exceeds": There is an issue with the sentence.
Fig. 5: I find it difficult to see improvement in the plots, but perhaps I haven't stared at them long enough. I would think it would be more informative to show "forecast-truth" and "analysis-truth": it would clearly show when improvement occurs ("analysis-truth" closer to zero) and analysis-forecast is simply the space between the curves (which could be shaded in the figure).
Fig. 5 and others: Why aren't EnKF results shown for comparison?
P 17, par 1: "so values shown in Fig. 7 are RMSEs for 7-day forecasts relative to the RMSE of the RUS method": But Fig. 7 shows only one value for each RMSE; are these averages across several 7-day forecasts? Please explain better what is shown.
P 19: "(Sect. ??)": this cross-reference is unresolved.
Citation: https://doi.org/10.5194/egusphere-2025-1676-RC1
AC1: 'Reply on RC1', Ieuan Higgs, 01 Aug 2025
Thank you for taking the time and care to provide valuable feedback and contributions to this manuscript. Please see our responses to the comments in the attached PDF, which we are ready to implement for a future revision.
Best wishes,
Ieuan Higgs and the co-authors
-
RC2: 'Comment on egusphere-2025-1676', Anonymous Referee #2, 01 Jul 2025
-
AC2: 'Reply on RC2', Ieuan Higgs, 01 Aug 2025
Thank you for taking the time and care to provide valuable feedback and contributions to this manuscript. Please see our responses to the comments in the attached PDF, which we are ready to implement for a future revision.
Best wishes,
Ieuan Higgs and the co-authors