the Creative Commons Attribution 4.0 License.
Linking satellites to genes with machine learning to estimate major phytoplankton groups from space
Roy El Hourany
Juan Pierella Karlusich
Lucie Zinger
Hubert Loisel
Chris Bowler
Abstract. Ocean color remote sensing offers two decades-long time series of information on phytoplankton abundance. However, determining the structure of the phytoplankton community from this signal is not straightforward, and many uncertainties remain to be evaluated, despite multiple intercomparison efforts of the different available algorithms. Here, we use remote sensing and machine learning to infer the abundance of seven phytoplankton groups at a global scale based on a new molecular method from Tara Oceans. Our dataset is, to our knowledge, the most comprehensive and complete available to describe phytoplankton community structure at a global scale using a molecular marker that defines relative abundances of all phytoplankton groups simultaneously. The methodology performs satisfactorily, providing robust estimates of phytoplankton groups from satellite data, with few limitations regarding the global generalization of the method. Furthermore, this new satellite-based methodology allows a valuable global intercomparison with the pigment-based approach used with in-situ and satellite data to identify phytoplankton groups. Nevertheless, these datasets show different yet coherent information on the phytoplankton, valuable for the understanding of community structure. This makes remote sensing observations excellent tools to collect Essential Biodiversity Variables and provide a foundation for developing marine biodiversity forecasts.
Roy El Hourany et al.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2022-1421', Anonymous Referee #1, 11 Feb 2023
The work by El Hourany et al. describes machine learning techniques for application to (blue) ocean color data to determine the global distribution of phytoplankton functional types. A special focus is on the description of ML techniques with the identification of crucial features based on parameters of the merged GlobColour dataset. The details of the methods used are often cryptically written and difficult to follow, and reproduction of the methods and results is not possible. The methods section should be revised accordingly. Besides the application of ML methods in the context, the advantage of the method remains unclear and is not further specified; it could well be higher accuracy or computing speed. I recommend a thorough revision of the paper to describe the methods in a more understandable way and to prove the added value (also of future ML methods).
Specific comments:
- The title is a bit catchy and inaccurate. It is rather about pigments, which are typical for color groups, but which can be very different in type of phytoplankton and corresponding genes.
- The figures should all be revised, e.g. Fig. 5. Axis labels with units are often missing. Chlorophyll concentrations are sometimes given in log10; this is done better in Fig. 2.
- Line 87: Just a comment: size fractionation often damages the cells, so such data should be treated with caution.
- For the discrimination of absorption features, rather the central visible region is necessary (e.g. Xi et al. 2015). In this respect, the use of the GlobColour data set with Rrs only up to 555 nm is unfavorable, as the correlation plots show. The OC-CCI dataset has more (MERIS) bands here and corresponding differences could be underlined. References to GlobColour and matchup procedure are missing.
- It is a Case-1 approach for a medium range of chlorophyll concentrations, which should be communicated in a better way. Maybe flagging and an uncertainty product would be useful. However, in such open ocean conditions, HPLC methods are often at the limit (if low volumes of water are filtered) – extreme uncertainties may exist in the fundamental training data.
- Besides SST, salinity is actually a strong indicator for some PFTs.
- The method part is unclear, especially lines 163-212. Part of the problem could be that a less common naming convention is used, e.g. do you refer to neural network architecture if you optimize the size map? What does the final map or architecture look like?
- Line 269: The more parameters we use, the more we must trust the data quality. Nevertheless, seen over the global ocean, there are many uncertainties in all mentioned parameters and regions. Especially Rrs in blue bands and the retrieved chlorophyll concentration must be considered as critical, even more so because reflectances are derived from multi-mission merged data with sensor-specific atmospheric correction.
- 6: The marine model of ocean color algorithms for atmospheric correction and chlorophyll retrieval is mostly based on a diatom-like chlorophyll-specific absorption and scattering behavior (e.g. Bricaud et al., 1995). Thus, it is good that there is a relatively high correlation between diatoms and chlorophyll concentration. But what about features that are not captured, e.g. the specific optical properties of coccolithophores (e.g. Balch, 2018)? They are highly abundant, e.g. in the Great Calcite Belt, where Fig. 7 indicates high reliability of the model while the C2 distribution in Fig. 10 seems to be different. I see some question marks and would ask for a more careful discussion of the model uncertainty.
- It is unclear how the new method behaves compared to the mentioned operational model by Xi et al. (2020). What are the advantages of the presented method?
Citation: https://doi.org/10.5194/egusphere-2022-1421-RC1
RC2: 'Comment on egusphere-2022-1421', Anonymous Referee #2, 19 Mar 2023
The authors develop a machine learning approach to link ocean colour data and in situ omics to improve detection of phytoplankton functional types and groups from space. The topic they are dealing with is innovative. However, the methodology and algorithm development steps are hard to follow and need to be revised to make the workflow clearer to the reader. In this scope, a flowchart is essential.
I am not fully convinced by the validation approach of the method. The training is done using the whole omics database and cross-validation statistics show the good prediction capabilities of the model. Then, the validation is made with an external database built on HPLC-based information. From my point of view, this cannot be considered a proper validation because one quantity is based on HPLC data, the estimated one on omics data. Such a comparison thus implies that the two approaches bring the same level of information on phytoplankton taxonomy. In this case, there would be no need to develop a new approach based on omics. However, as discussed at the end of the paper, HPLC- and omics-based phytoplankton information have some degree of correlation, which is good because this means that OMICS information can be found in optical properties to some extent and OMICS based approaches are welcome because they will bring new and complementary information on phytoplankton from space.
I realize that the omics database used to develop the new ML approach is small, but the authors might consider training the model on 70% of the database and validating it with the remaining 30%.
The results need to be discussed more, and the text about the retrieved global distribution of phytoplankton and biomes needs to be thoroughly checked and revised.
The work thus needs to be deeply revised to improve the methodology and make the validation stronger as well as the text more readable.
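The 70%/30% hold-out split suggested above could be sketched as follows. This is only a minimal illustration, not the authors' procedure: the function name `holdout_split`, the sample count of 100, and the fixed seed are assumptions made for the example.

```python
import random


def holdout_split(n_samples, validation_fraction=0.3, seed=0):
    """Return (train_indices, validation_indices) for a random hold-out split."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_validation = round(validation_fraction * n_samples)
    validation = sorted(indices[:n_validation])
    train = sorted(indices[n_validation:])
    return train, validation


# Hypothetical database size; the real number of matchups is not given here.
train_idx, val_idx = holdout_split(n_samples=100)
```

With a small database, repeating the split with several seeds would show how sensitive the validation statistics are to the particular 30% held out.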
Specific comments:
Figure 1 is misleading as the same color palette has been used for both columns although the % axes differ from left to right. A quick reader could interpret the yellow dots of (e.g.) Cryptophytes as being as abundant as Green Algae or Diatoms.
Line 91: this statement means that we have phytoplankton also in the 180-2000 µm size class, which is possible in the case of diatom chains. Could you provide a frequency distribution of phytoplankton groups within each size class? This would help the reader to have a wider picture of the type of phytoplankton in the database (and especially for those chain-forming species and classes spanning a wide size range).
Line 115: why normalize omics data on Chl? Because Chl varies according to the physiological status of phytoplankton, a photoacclimation component is re-introduced (which is a major problem in the DPA analysis). Why not use the omics-based % of the whole population?
Lines 121-123: it is not clear which data are interpolated. In situ or satellites?
Table 2 contains mistakes in the coefficients. From Uitz et al. (2006), the coefficient for Chl-b is 1.01 while 0.35 is for 19-BF. In addition, 19-BF is here attributed only to the pelagophytes while it is also a pigment within haptophytes (except coccolithophores). So, with the current coefficients, all haptophytes contain only 19-Hex.
Line 155: which cross-validation procedure? Do these statistics refer to all pigments or is it a global indicator for the technique?
Line 163: please indicate and explain better which are the “several machine learning algorithms” you tested and why a SOM has been chosen. This will be very helpful for scientists approaching the same problem.
Section 3.1 needs to be rewritten and a flowchart added. It is strange to see 3.1.1 and 3.1.2 as two different sections when (if I have understood correctly) the work is done simultaneously. Figure 5: y- and x-axes should be the same, and the caption should name the solid and dashed lines.
Line 191: which several experiments? How many? Please explain better.
Section 3.1.3 needs to be clearer
Line 269: what is the impact of interpolation on bbp and Kd? (i.e., Interpolation declared in the methods)
Line 275: from Table 3, pelagophytes instead of cryptophytes
Line 314: generally speaking, are you referring to the surface-to-volume ratio?
Line 329 and Line 331: please check and discuss: C4, C5 and C6 are dominated by Prokaryotes, but these areas are generally known to be dominated by large phytoplankton. Same for C1, dominated by diatoms but in the subtropics. In addition, it would be nice to see these clusters plotted on a map in Figure 10.
Figure 10: How have the spectra been normalized? By the minimum? The spectral shape should be discussed.
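For readers unfamiliar with the diagnostic pigment analysis (DPA) behind the Table 2 comment, the weighted-sum form can be sketched as below. This is a hedged illustration only: the coefficients for Chl-b (1.01) and 19-BF (0.35) are the ones quoted in the comment, while the remaining weights and the size-class grouping are those commonly cited from Uitz et al. (2006) and should be checked against the original paper. The function name and the pigment concentrations are made up for the example, and the sketch does not address the pelagophyte/haptophyte attribution raised above.

```python
# Weights as commonly quoted from Uitz et al. (2006); verify before any real use.
DPA_WEIGHTS = {
    "Fuco": 1.41,
    "Perid": 1.41,
    "19-Hex": 1.27,
    "19-BF": 0.35,   # value quoted in the referee comment
    "Allo": 0.60,
    "TChl-b": 1.01,  # value quoted in the referee comment
    "Zea": 0.86,
}

# Conventional size-class grouping of the diagnostic pigments (assumption).
SIZE_CLASSES = {
    "micro": ("Fuco", "Perid"),
    "nano": ("19-Hex", "19-BF", "Allo"),
    "pico": ("TChl-b", "Zea"),
}


def size_class_fractions(pigments):
    """Return the fraction of the weighted pigment sum in each size class."""
    weighted = {p: w * pigments.get(p, 0.0) for p, w in DPA_WEIGHTS.items()}
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("weighted diagnostic pigment sum must be positive")
    return {
        cls: sum(weighted[p] for p in members) / total
        for cls, members in SIZE_CLASSES.items()
    }


# Made-up pigment concentrations (mg m-3), purely for illustration.
example_pigments = {"Fuco": 0.10, "19-Hex": 0.05, "19-BF": 0.02, "Zea": 0.04}
fractions = size_class_fractions(example_pigments)
```

Swapping the Chl-b and 19-BF weights, as the comment suggests may have happened, changes the balance between the pico and nano fractions, which is why the coefficient check matters.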
Citation: https://doi.org/10.5194/egusphere-2022-1421-RC2