A method for quantifying correlation in the shape of oceanographic profile data
Abstract. Vertical profiles are a common type of oceanographic observation, involving measurements of a variable across a range of depths, and are widely used to identify physical and biogeochemical features of the water column. Recent studies have shown that oceanographic profiles can be represented as functional data objects, where each profile is treated as a single datum and expressed as a function of pressure. This study applies a recently developed technique, which defines a scalar correlation coefficient for functional data, to the analysis of oceanographic profiles. The method represents each profile using basis functions, whose associated weightings are termed basis coefficients, and quantifies dependence through the variability of these coefficients. An important advantage of this method is that the resulting correlation coefficient reflects similarities in overall profile shape, not just correlations between values at specific depths. Two applications of this method are explored: calculating the correlation coefficient between two different oceanographic variables, and estimating the temporal autocorrelation function of a single variable. Each application is demonstrated using two case study datasets: (1) the Coastal Endurance Washington Offshore Profiler Mooring and (2) biogeochemical-Argo floats. The first case study demonstrates how the method can be used to identify physical drivers of variability in biogeochemical profile structure. The second case study reveals regional differences in relationships between profiled variables and their temporal autocorrelation characteristics. This technique has broad potential for application to data from moorings, autonomous platforms, and ocean models, with possible use in observing system optimisation, data assimilation, and the analysis of vertically structured ocean processes.
In general, this manuscript is well-written and comprehensive. It is effectively organized, with straightforward notation and two case studies that illustrate a wide range of use cases. The authors present the application of functional data analysis (FDA) to oceanographic profiles to calculate a scalar correlation coefficient for vertically-varying data. FDA significantly addresses a shortcoming of current oceanographic analyses, in which a specific depth level must be selected or some other dimension must be fixed in order to calculate correlation. Figure 1 was especially helpful for understanding how the method works! This technique has the potential to be invaluable in many different applications, and I look forward to using it myself in the future.
My main suggestion is to better emphasize the utility of FDA, both in the introduction and when drawing conclusions from the case studies. Presenting more concrete findings that cannot be easily drawn through the use of other correlation analyses will help to underscore the importance of this method. Below, I point to specific instances in the text where I think some further explanation or analysis may bolster your argument. In addition, I have a handful of other minor comments. At the end of this review, I include a couple of remaining questions, which could be taken into consideration when revising but may fall outside the immediate scope of the paper.
Thank you for the opportunity to review this paper; it was a pleasure to read.
Specific comments:
Lines 67-69: It could be meaningful to also mention conclusions drawn about the seasonal cycle, since the “seasonal strength was not quantified” in past studies (Line 61).
Lines 69-70: The interpretation that “relationships between temperature and chlorophyll profiles, as well as their temporal ACFs, vary spatially” seems quite generic. Would it be possible to say something more concrete about the takeaways from the BGC-Argo profiles, like the conclusions from Lines 265-266, or to discuss in more detail how the correlations between the profiles bring new insight to “relationships between environmental conditions and vertical chlorophyll structure” (Line 64)?
Lines 113-114 / Figure 2: In Example 4, it is stated that a strong negative correlation is because of the near-surface deviations in opposite directions between the two sets. Yet, in Example 2, it seems like the near-surface deviations are even more pronounced, such as the Chl (dark purple) decreasing near the surface and the Temp (dark purple) increasing near the surface. Including the “typical profile shape” in each of the Fig 2 examples would help justify these assertions, since readers could then visualize the deviations more easily.
Lines 148-149: It might be useful to state explicitly in the Methods section that the maximum number of Fourier basis coefficients equals the number of measurements within each profile.
Line 157: In the Case 2 analysis, the correlation between physical variables and oxygen is compared even though there are substantially fewer oxygen profiles. Could this lead to biases in the correlation between oxygen and physical variables, since the correlations among the physical variables have been sampled over different times? The conclusion in Line 157 states that the oxygen distribution is driven by salinity, but the correlations between (salinity and density) and (density and oxygen) are not over the same times, although maybe the difference is small since there is an overall large number of profiles.
Line 175: Why were the BGC-Argo profiles smoothed with a “15 m moving-median window” while the CE09OSPM profiles were not smoothed? Are there specific criteria that should be used to determine whether or not to smooth the data before performing FDA?
Lines 186-194: Other analyses might, for example, correlate the deep chlorophyll maximum with a specific isotherm (as mentioned in Line 48) when analyzing these profiles. It would be interesting to see whether the same conclusions are drawn when comparing correlations between this kind of conventional method and the vertically-resolved FDA method. The possible differences and additional insights that are drawn using FDA would strengthen the argument for using this technique.
Lines 233-234: It could be helpful to speculate on how well FDA works when applied to profiles with different vertical resolutions that are interpolated/preprocessed. I assume this would be highly dependent on the interpolation/preprocessing method, but as a potential user of FDA, it would be nice to know whether the authors recommend that this technique be mostly applied to profiles with the same vertical resolution, or if there is enough flexibility that users could perform cross-platform analyses with differing resolutions (which is alluded to in Line 249).
Lines 251-254: An additional application that could be worth mentioning is for validating GCMs and other models. In particular, FDA could help to assess model performance on depth-dependent characteristics such as the vertical extent of OMZs, which show discrepancies with the climatology (Cabré et al. 2015), although I imagine some difficulty might arise from differences in vertical resolution.
Line 227: Could you comment more generally on the data quality needed for this technique? Is there a possibility of spurious signals from FDA, e.g., spectral leakage, Gibbs phenomenon in a profile with sharper gradients than another, or issues with noisy data?
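To make the Gibbs concern in the comment on Line 227 concrete, here is a minimal sketch of what I have in mind. It is my own illustration and not the authors' implementation: the helper names (`fourier_design`, `fit_profile`), the synthetic profiles, the normalised pressure grid, and the choice of 21 basis functions are all assumptions made for the example. It fits a smooth profile and a profile with sharp gradients using the same truncated Fourier basis and reports how far the sharp fit overshoots the data.

```python
import numpy as np

def fourier_design(x, n_basis):
    """Fourier design matrix for x in [0, 1): a constant, then sin/cos pairs.

    Hypothetical helper for this sketch only.
    """
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < n_basis:
        cols.append(np.sin(2 * np.pi * k * x))
        if len(cols) < n_basis:
            cols.append(np.cos(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)

def fit_profile(x, y, n_basis):
    """Least-squares basis coefficients and the reconstructed profile."""
    design = fourier_design(x, n_basis)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

# Assumed synthetic data: normalised pressure on a uniform grid.
n_levels = 400
x = np.arange(n_levels) / n_levels

# A smooth subsurface maximum and a layer bounded by sharp gradients.
smooth = np.exp(-0.5 * ((x - 0.5) / 0.08) ** 2)
sharp = ((x >= 0.3) & (x < 0.7)).astype(float)

n_basis = 21  # same truncation for both profiles
_, smooth_fit = fit_profile(x, smooth, n_basis)
_, sharp_fit = fit_profile(x, sharp, n_basis)

# Fit quality: the smooth profile is recovered closely, while the sharp
# profile rings above its maximum and below its minimum near the gradients.
smooth_error = np.max(np.abs(smooth_fit - smooth))
sharp_overshoot = sharp_fit.max() - sharp.max()
sharp_undershoot = sharp.min() - sharp_fit.min()

print(f"max error, smooth profile: {smooth_error:.2e}")
print(f"overshoot, sharp profile:  {sharp_overshoot:.3f}")
print(f"undershoot, sharp profile: {sharp_undershoot:.3f}")
```

If ringing of this kind enters the basis coefficients of one variable but not the other, it seems plausible that it could affect the resulting correlation coefficient, which is why some guidance on data quality would be welcome.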
Typos & clarification:
Lines 23-24: “comprehensive profiling datasets, however they are” Semicolon or period might be better than the comma
Line 31: “continuous, shape based” As someone unfamiliar with FDA, I was unsure of what shape based meant when describing profiles. Maybe another term could be used, or a slight reorganization with Lines 32-33, which clarified the meaning of “shape based” for me.
Line 78: “described briefly here. and a” Comma instead of period
Line 91: “linear combination of the mean basis coefficients represent a mean function” I think it would be more accurate to say that the mean coefficients, each multiplied by its respective basis function and summed, give the mean function.
Line 105: “functional shape is maintained perfectly” “Maintained” is unclear to me; maybe “perfectly consistent” or “change in unison”?
Line 106: “out of phase” suggests a periodic signal to me; could this be rephrased? Alternatively, the phrase could be deleted so the sentence reads “suggests that any deviation in …”
Line 109: “profiles, and their correlation” No comma here
Line 111: Instead of a “clear relationship”, could you describe what that relationship is, as is done in Lines 112-114?
Line 174: “20m and 230 m” Space between 20 and m
Line 182: Include comma before “and” in “(5906204, 5904021 and 6901585)”
Line 207: “expected given the variables are” Should this be “given that”?
Line 229: Some clarification on why this method does not “characterise relationships between functional variables” might be worthwhile. What is the distinction between functional data and functional variables? Is a functional variable the same as the indexing variable, like depth?
Line 262: “coefficient that accounts describes the dependence” Either “accounts for” or “describes”
Line 264: “floats respectively,” Comma after “floats”
Figure 3: Could the x-axis of the variables be aligned so that it is easier to compare the times when there are no profiles of oxygen?
Figure 6: Not necessary, but since the float trajectories are associated with a specific color in Fig 5, maybe there could be a colored box behind the float numbers in Fig 6 for each float. That way, at a quick glance, readers can see which float corresponds to which trajectory without having to compare the numbers.
Questions:
How appropriate would it be to compare correlations of variables that are truncated differently? For instance, in the CE09OSPM case, if the physical variables had relatively low rates of missing data outside of 40-400 m but the oxygen had high rates of missing data, could one perform FDA with a larger depth range for the physical variables and then truncate later for correlation with the oxygen? Or would that lead to conclusions that are not physically consistent since the depth ranges have changed?
Would it be helpful to use fewer basis functions on noisier data to avoid overfitting?
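As a rough illustration of this last question, the sketch below fits a noisy synthetic profile with a small, a moderate, and the maximum number of Fourier basis functions. Everything in it is an assumption for the sake of the example (the helper `fourier_design`, the Gaussian-shaped "true" profile, the noise level, and the basis sizes) and it is not taken from the manuscript. With as many basis functions as measurements, the fit reproduces the noisy data exactly, which relates to the comment on Lines 148-149.

```python
import numpy as np

def fourier_design(x, n_basis):
    """Fourier design matrix for x in [0, 1): a constant, then sin/cos pairs.

    Hypothetical helper for this sketch only.
    """
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < n_basis:
        cols.append(np.sin(2 * np.pi * k * x))
        if len(cols) < n_basis:
            cols.append(np.cos(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)

rng = np.random.default_rng(0)

# Odd number of levels, so a constant plus 100 sin/cos pairs is full rank.
n_levels = 201
x = np.arange(n_levels) / n_levels  # normalised pressure (assumed uniform)

# Assumed smooth "true" profile plus white measurement noise.
truth = np.exp(-0.5 * ((x - 0.5) / 0.15) ** 2)
noisy = truth + rng.normal(scale=0.1, size=n_levels)

def rmse_to_truth(n_basis):
    """Fit the noisy profile and return (RMSE against the truth, fitted profile)."""
    design = fourier_design(x, n_basis)
    coef, *_ = np.linalg.lstsq(design, noisy, rcond=None)
    fit = design @ coef
    return np.sqrt(np.mean((fit - truth) ** 2)), fit

rmse_few, _ = rmse_to_truth(7)
rmse_many, _ = rmse_to_truth(101)
rmse_full, fit_full = rmse_to_truth(n_levels)  # one coefficient per measurement

print(f"RMSE vs truth,   7 basis functions: {rmse_few:.3f}")
print(f"RMSE vs truth, 101 basis functions: {rmse_many:.3f}")
print(f"RMSE vs truth, 201 basis functions: {rmse_full:.3f}")
print("full basis interpolates the noisy data:", np.allclose(fit_full, noisy))
```

In this synthetic setting the smaller basis sits closer to the underlying profile because it cannot follow the noise, but whether the same holds for real profiles with sharper features is exactly what I would be interested to hear the authors' view on.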