the Creative Commons Attribution 4.0 License.
Elucidating the interplay between catchment and in-stream processes using high frequency multivariate and multisite data
Abstract. Stream water quality data provide essential insights into catchment biogeochemical processes. Continuous high-resolution measurements offer significant potential to differentiate between catchment-scale inputs and in-stream biogeochemical processing. Nevertheless, extracting clear, meaningful information from complex, multivariate time series spanning multiple variables and locations poses significant challenges. Using Principal Component Analysis (PCA) on high-resolution multivariate water quality data, this study aims to (1) separate catchment-scale contributions from in-stream biogeochemical processes, and (2) evaluate the dominant environmental drivers of spatial and temporal variability. The data were collected at five monitoring stations located in the Bode River, Germany. At each station, six variables were measured at 15-minute intervals over a period of seven years (2013–2020). The first principal component (PC1) accounted for 46 % of the variance, capturing the typical seasonal impacts of stormflow dynamics. The second principal component (PC2) revealed the influence of saline groundwater upwelling, particularly during lowflow periods at specific sites. Diurnal fluctuations in pH, driven primarily by algal photosynthetic activity, were identified by the third component (PC3). Additional components highlighted localized processes: PC4, PC5, and PC6 were linked to turbidity variability during discharge peaks, while PC7 reflected anthropogenic influences, notably treated acid mine drainage entering the river. Lastly, PC8 described distinct nitrate dynamics observed at downstream monitoring sites. The application of PCA to high-resolution multivariate data proved to be very helpful in disentangling various catchment and within-stream effects on stream water quality. These findings emphasize the importance of advanced analytical techniques in unravelling complex hydrobiogeochemical dynamics.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2026-1203', Anonymous Referee #1, 19 May 2026
AC1: 'Reply on RC1', Kenneth Gutiérrez, 18 Jun 2026
General comments
The manuscript by Gutiérrez‑García et al. presents an interesting application of Principal Component Analysis (PCA) to a high-frequency multi-variable dataset collected on the Bode River catchment in Germany. Results from the PCA analysis were evaluated together with additional independent data, which helped to identify the major hydrological and biogeochemical processes occurring in the catchment or within the stream. Such an approach appears very valuable for extracting key trends from a complex multi-variate dataset, and could surely contribute to future research.
Overall, the manuscript is generally well written and the interpretations proposed by the authors are generally consistent with the observed spatial and temporal patterns. However, the authors were not sufficiently clear about the study’s limitations, particularly concerning the methodological and interpretation weaknesses.
My main concern involves the interpretation of PCA components, which is sometimes based on weak correlations with principal components (PCs) that themselves account for only a small portion of the total variance. Indeed, the authors chose to keep 8 principal components to reach a cumulative variance of 84%. However, only the first four PCs were associated with a variance greater than 5%. Such low-variance PCs are often considered weakly informative and potentially sensitive to noise. Most of the studies cited in the manuscript only used the first four PCs for interpreting their datasets. Although some independent validation is provided, the authors should present more clearly the robustness and limitations associated with the interpretation of low correlations and less informative PCs.
We thank the reviewer for this insightful comment and for highlighting the overall value of our approach. We completely agree that interpreting principal components with low total variance requires caution, as these components are typically more susceptible to noise.
However, the decision to retain eight principal components is driven by the physical relevance of these independent signals, rather than relying solely on the percentage of total variance they explain. In a catchment-scale study, there are highly localized processes that occur infrequently, such as the acid mine drainage treatment plant discharges at the SH station (PC7) or the wastewater treatment plant effluents at the HAU station (PC5). Because these events are so specific and rare, they naturally account for a very small fraction of the global variance. The key aspect of our analysis is that it successfully detected these infrequent processes, and this is exactly why they appear isolated within the higher components. We included them as independent signals that offer valuable hints of localized anthropogenic point sources.
To explicitly address the distinction between "global noise" and "independent localized signals", a dedicated paragraph will be added to the Discussion section detailing the methodological limitations and the necessity of independent validation when interpreting lower-variance components.
Several suggestions are presented below, that the authors should feel free to take into account or not.
Detailed comments
L14: it might be helpful to specify in the abstract which variables were used in this study.
The abstract will be updated to explicitly specify the variables used in this study.
L25: the authors mention broader implications for environmental monitoring and risk reduction that are not sufficiently developed or taken into account later in the paper. This approach could be applied in a variety of contexts, which it would be useful to mention here or in the discussion part.
We agree that the broader implications for environmental monitoring and risk reduction, mentioned early in the manuscript, require further context. A brief paragraph will be added to the Discussion to outline how this PCA-based approach might be applied in broader contexts. This will note its potential use in evaluating monitoring network design and supporting the detection of anomalous pollution events, while emphasizing that any application to different catchments requires careful consideration of local environmental conditions.
L28: high frequency datasets are common in an operational context. That means that your approach could be applied to a large number of catchments in the world, using datasets that are often undervalued. A few words on these implications might add value to your paper.
We agree that operational high-frequency datasets are frequently underutilized globally. A brief statement will be added to the Discussion section emphasizing how this approach can extract diagnostic value from these existing, undervalued datasets.
L48: I cannot see which of these references use PCA to differentiate catchment-scale processes from in-stream dynamics. Maybe you should separate references according to the main goals of each study. Furthermore, if, as you noted, previous studies have already applied PCA for this purpose, what is the added scientific value of your study? More generally, the introduction would benefit from providing a clearer explanation of the scientific challenges associated with PCA and high-frequency datasets.
We agree that the Introduction must better frame the literature and our study's novelty. The revised manuscript will reorganize references by their main goals, adding citations to differentiate studies using PCA for low-frequency spatial variability from those focusing on high-frequency temporal dynamics. Also, we will clarify that, unlike previous studies, we focused neither on time-averaged solute concentrations nor on detecting long-term monotonic trends. Instead, we aimed explicitly at extracting information from the short-term temporal patterns of various water quality parameters. The basic idea was that the observed short-term dynamics are not purely random patterns but are due to the interplay of various effects. We assume that a small number of effects prevails affecting all or most of the observables, although varying in severity between parameters and sampling sites. Applying PCA to this high-frequency dataset allows us to disentangle these complex interactions and identify the relevant causes. Finally, we will explicitly detail the methodological and scientific challenges associated with applying PCA to high-frequency datasets.
Table 1: I would have appreciated to see the proportion of each land use for each sub catchment summarized in this table.
We agree with the reviewer's suggestion. Table 1 will be revised to include the quantitative proportion of each land use for each sub-catchment, utilizing data from recent regional studies to provide a more precise spatial context for our analysis.
Figure 1: This figure should be improved to include:
Altitude information;
Some geographical terminology, such as for the Magdeburger Börde or the Harz Mountains limits;
Location of some features cited in the results and discussion, such as WWTPs and acid mine drainage treatment plants, the Rappbode Reservoir and the Nachterstedt open pit;
A geological map (maybe in appendix) may be also helpful.
We thank the reviewer for these suggestions. We will update Figure 1 to include the Harz Mountains and the specific sites explicitly mentioned in the text. Regarding the geological context, detailed and publicly accessible maps showing localized geological features of the study area are not readily available. However, we will attempt to incorporate an appropriate geological map into the supplementary materials to provide this context, supported by the relevant literature.
L60: It is not clearly stated that the Magdeburger Börde corresponds to the lowland plain area, if that is the case.
The text in Section 2 will be modified to explicitly state that the Magdeburger Börde corresponds to the lowland plain area of the catchment.
L105: first mention of the Central German Lowland area, is this area different from the previous one cited?
Yes, the Central German Lowland area refers to the same lowland region previously mentioned. To avoid any ambiguity, that will be explicitly clarified in the text.
L129: I would have appreciated knowing more about the hydrological regime of the Bode and Selke rivers (e.g. mean annual discharge) before reading about the discharge observed at each station. It may also be helpful to add the mean discharge when writing about the main differences between two stations.
A new table summarizing the mean, minimum, and maximum discharge for each of the five monitoring stations during the study period will be added to Section 3.1 in the revised manuscript.
L140: Could you justify the use of Spearman correlation in your study? I guess it is due to the non-normal distribution of the dataset or the presence of outliers. Maybe you could also specify that the very large dataset you are using automatically reduces the risk of false correlations.
Upon re-evaluation, the preliminary exploratory analysis based on Spearman correlations (including the categorization by low, medium, and high flow regimes) does not significantly contribute to the core PCA methodology or the primary conclusions of the study. To streamline the manuscript and maintain a strict focus on the multivariate PCA approach, this paragraph and the associated supplementary figures (Fig. S1, S2, and S3) will be removed from the revised manuscript.
L141: how did you define the low, medium and high regime? Any thresholds to provide?
As noted in the response to L140, the preliminary exploratory analysis using Spearman correlations, including the categorization into flow regimes, will be removed from the revised manuscript to maintain focus on the core PCA methodology. Therefore, these thresholds are no longer applicable.
L142-L146: I’m not sure if this part is relevant. Why mention the term “EOF” if it is not used in your context. Besides, you already introduced PCA in the introduction. If you have more information to provide on its general application, you should have mentioned it earlier.
The detailed description in this section is intentional. In disciplines such as climatology, the application of PCA to sets of multi-site time series is traditionally referred to as Empirical Orthogonal Functions (EOF). In contrast, standard PCA is most commonly associated with static, non-time-series datasets, typically visualized using PC1 vs. PC2 biplots. Because our study specifically applies PCA to high-frequency, continuous time series, explicitly distinguishing our approach from both climatological EOFs and classic static PCA is necessary to prevent methodological confusion.
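As a minimal sketch of what applying PCA to multi-site time series can look like, the Python example below arranges synthetic data as a matrix with one row per time step and one column per site-variable pair, then decomposes its correlation matrix. The layout (5 sites × 6 variables = 30 columns), the function name `multisite_pca`, and the synthetic data are assumptions made for illustration only; the manuscript's actual matrix structure and software are not described in this discussion.

```python
import numpy as np

def multisite_pca(X):
    """PCA of a (time steps x site-variable columns) matrix via its correlation matrix.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Standardize each column so variables with different units are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the standardized columns.
    R = (Z.T @ Z) / (Z.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: one time series per component (the "PC time series" in EOF terms).
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores

# Synthetic stand-in data (assumed shape): a shared periodic signal plus noise.
rng = np.random.default_rng(0)
n_steps, n_sites, n_vars = 2000, 5, 6
n_cols = n_sites * n_vars
shared = np.sin(np.linspace(0, 20 * np.pi, n_steps))[:, None]
weights = rng.uniform(0.5, 1.5, size=(1, n_cols))
X = shared * weights + rng.normal(size=(n_steps, n_cols))

eigvals, eigvecs, scores = multisite_pca(X)
explained = eigvals / eigvals.sum()  # fraction of total variance per component
```

In this arrangement each eigenvector holds one loading per site-variable column, which is what allows a single component to be strong at some stations and weak at others.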
L151: It might be useful to specify that the sum of all eigenvalues gives how much of the total variance is captured by PCA.
Add that the sum of all eigenvalues gives how much of the total variance is captured by PCA. Furthermore, could you please provide some clarification regarding the sentence on line 165: “first four principal component exhibited eigenvalues exceeding 1”? What does this mean?
A sentence will be added to the manuscript to explicitly state that the sum of all eigenvalues corresponds to the total variance of the dataset. Regarding line 165, an eigenvalue exceeding 1 indicates that the corresponding principal component accounts for more variance than a single original standardized variable, following the Kaiser criterion. A brief explanation of this criterion will be included in the revised text to clarify the statistical significance of this threshold.
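Both statements in this reply can be checked numerically. The sketch below uses synthetic data (an assumption, not the Bode dataset) to show that the eigenvalues of a correlation matrix sum to the number of standardized variables, how the Kaiser criterion counts components with eigenvalues above 1, and how many components are needed to reach a cumulative explained variance of 80 %.

```python
import numpy as np

# Synthetic stand-in: 500 observations of 6 partly correlated variables.
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 2))
X = base @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(500, 6))

# Correlation matrix = covariance matrix of the standardized variables.
R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # descending order

# Each standardized variable has variance 1, so the eigenvalues sum to
# the number of variables (here 6), i.e. the total variance.
total_variance = eigvals.sum()
explained = eigvals / total_variance
cumulative = np.cumsum(explained)

# Kaiser criterion: keep components that carry more variance than a
# single standardized variable (eigenvalue > 1).
kaiser_keep = int((eigvals > 1.0).sum())

# Smallest number of components whose cumulative share reaches 80 %.
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)

print(f"total variance: {total_variance:.3f}")
print(f"components with eigenvalue > 1: {kaiser_keep}")
print(f"components needed for 80 %: {n_for_80}")
```

The two selection rules need not agree: a cumulative-variance target can retain components whose eigenvalues fall below 1, which is the situation the referee raises for the higher components.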
L153: Do you have some references?
A reference supporting the 80% to 90% cumulative explained variance threshold will be added to the revised manuscript to formally justify this criterion.
L166: See comment 151. So, you have 84% of total variance explained by the eight principal components, so you reach your objective of 80-90%, but you didn’t mention for how many PCs it may be acceptable to have a minimum of 80%. PCs that explain between 2 and 5% are generally considered poorly informative, and the interpretations that you made from these might be overinterpreted, particularly if you chose to associate clear hydrological processes with each PC. I would have appreciated some words about this risk of overinterpretation in the Methods and in the Discussion sections.
We included these PCs as independent signals that provide valuable hints of localized anthropogenic point sources. We are nevertheless fully aware that, as addressed in the response to the General Comments, retaining PCs with lower variance (2-5%) carries a recognized risk of overinterpretation. To address this limitation, sentences explicitly acknowledging the risk of overinterpreting these poorly informative components and their potential sensitivity to noise will be added to both the Methods and the Discussion sections in the revised manuscript.
L172: your site description was already done in section 2. You should have mentioned these elements earlier.
The brief mention of the site locations at this line is not intended as a redundant site description, but specifically to explain the top-to-bottom ordering of the sampling sites presented in the corresponding figure.
L193: it might be useful to define a threshold for which a correlation coefficient is considered weakly or strongly informative for interpreting PCA. For example, you mentioned (L219) “strong” relationship with discharge and turbidity, for absolute correlation coefficient < 0.5. I understand that with such a large volume of data, correlation coefficients are often significant even if they are low, but you should be careful about the term used.
To ensure consistency and statistical rigor in describing the relationships between the original variables and the principal components, explicit thresholds for correlation coefficients (loadings) will be defined in the Methods section. Specifically, absolute values > 0.70 will be classified as "strong," 0.50–0.70 as "moderate," and 0.30–0.50 as "weak." The terminology throughout the Results and Discussion sections, including the instance at line 219, will be thoroughly revised to strictly adhere to these newly defined thresholds.
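The proposed thresholds can be written as a small classification rule. The sketch below is illustrative only: the function name `classify_loading` is hypothetical, the handling of the exact boundary values (0.50 and 0.70 counted as "moderate", 0.30 as "weak") is an assumption because the reply does not specify it, and the label used for absolute values below 0.30 is a placeholder the reply does not name.

```python
def classify_loading(r):
    """Label a correlation coefficient (loading) by its absolute value.

    Thresholds follow the reply above; boundary handling and the label
    for values below 0.30 are assumptions made for this sketch.
    """
    a = abs(r)
    if a > 0.70:
        return "strong"
    if a >= 0.50:
        return "moderate"
    if a >= 0.30:
        return "weak"
    return "below threshold"  # placeholder label, not defined in the reply

# Hypothetical loadings for illustration, not values from the manuscript.
examples = {"discharge": 0.85, "turbidity": -0.60, "nitrate": 0.45, "pH": 0.10}
labels = {name: classify_loading(value) for name, value in examples.items()}
print(labels)
```

Using the absolute value treats negative and positive loadings symmetrically, so a loading of -0.60 falls in the same class as +0.60.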
L243: maybe you should mention again in each subsection how much variance is explained by each PC. It may help to appreciate the robustness of each interpretation.
The explained variance for each principal component will be explicitly stated at the beginning of its corresponding subsection in the Results and Discussion.
L244: you might specify which stations are correlated with pH.
The text will be revised to explicitly specify which monitoring stations exhibit a correlation with pH in this context.
L261: I’m not sure about the purpose of this sentence? What does it imply?
The purpose of this sentence is to indicate that the biogeochemical and hydrological signals detected at the MEI station are primarily driven by local processes occurring within a 3 km radius, rather than by the long-distance transport of constituents from further upstream. The text will be revised to explicitly state this implication, clarifying that the corresponding principal component at this station reflects highly localized dynamics.
L262: “this effect”. You should specify which effect you are referring to.
The phrase "this effect" is ambiguous and refers to the strong variability associated with primary production and photosynthetic activity captured by PC3. The text will be revised to replace it with a precise description, explicitly stating "this strong primary production signal" to ensure clarity regarding the specific biogeochemical process occurring at the GGL and STF stations.
L279: again you mentioned “strong” correlation across all sites, but correlation coefficients are greater than 0.5 for SH and MEI only (and maybe GGL?).
The text will be corrected to accurately reflect the specific correlation coefficients at each site. In accordance with the newly defined thresholds (detailed in the response to L193), the description will be revised to specify that the relationship is moderate to strong only for the SH and MEI stations, while it is weak for the remaining sites. The inaccurate generalization "across all sites" will be removed.
L300-331: Components 7 and 8 are poorly informative and showed weak correlation with the variables, but you didn’t mention the potential weakness of your interpretation in the results or discussion.
Consistent with the responses to the General Comments and L166, explicit statements acknowledging the potential weakness and risk of overinterpreting these low-variance components (PC7 and PC8) will be added directly to this specific section in the Results and Discussion. The revised text will explicitly note that due to their weak correlations and low explained variance, the interpretations associated with PC7 and PC8 must be approached with caution.
L336: the discussion section might benefit from being organized by processes and not by PC. Indeed, you highlighted some major hydrological and biogeochemical processes which could constitute a sub-section on their own. These processes could therefore give their name instead of mentioning the PC. For example:
- Section 5.1: you named this section “hydrological processes” but wrote mainly about specific discharge contributions. You should make this title more precise;
- Section 5.2: PC1 is based on discharge peaks but you mentioned mainly nitrate export. Maybe a single sub-section referring to nitrates might be clearer, even if I understand that different processes are involved.
Furthermore, I’m not sure I see a clear separation between the results and discussion. Some discussion parts often provide another description of the results on PCs (e.g.: L440, L420, L425, L383).
Thank you for the feedback. We agree that organizing the discussion by processes rather than by PC would be more intuitive and easier to understand. On the other hand, we would like to emphasize the link between PCs and processes. To address the reviewer's concerns regarding clarity without altering the core structure, the following modifications will be made:
- The subsection titles will be revised to include both the component and the primary process it identifies.
- The title of Section 5.1 will be refined to be more precise.
- The Discussion section will be thoroughly reviewed to ensure a clear separation from the Results.
L343: but didn’t you mention that HAU was not in the mountainous region?
The reviewer correctly notes that HAU is located in the lowlands. The intent of the paragraph is to explain that the discharge at HAU hardly exceeds that of the upstream stations because the vast majority of the stream's water originates in the high-precipitation mountainous region. To avoid the misinterpretation that HAU is geographically classified within the mountain region, the text will be rephrased to explicitly clarify this hydrological dynamic.
L372: it might be helpful to provide the correlation coefficients. Otherwise, the figures S1, S2 and S3 do not appear to be fully exploited.
Consistent with the response to comment L140, the preliminary exploratory analysis based on Spearman correlations and the associated supplementary figures (Fig. S1, S2, and S3) will be removed from the revised manuscript to maintain a strict focus on the core PCA methodology. Consequently, the text at line 372 referencing these correlations will be removed accordingly.
L386-388: do you have a reference? This sentence appears more as results/interpretation than a discussion if you don’t cite any references.
A reference supporting the observation that electrical conductivity typically increases during low-flow conditions due to dominant groundwater contributions will be added to the revised manuscript.
L481: this paragraph should be included in another subsection where you could discuss the validation methods used and actually propose other, more robust validation methods. Furthermore, as mentioned earlier, this subsection and this study would benefit from being supplemented with some broader implications: what could these key environmental factors extracted from this dataset contribute in other contexts, particularly in operational contexts where such datasets are common?
A new subsection titled "Methodological limitations and future perspectives" will be added to the Discussion section. This dedicated subsection will explicitly address the methodological challenges of interpreting lower-variance components (PC5 to PC8) and emphasize the necessity of independent validation to distinguish physical signals from statistical noise. Additionally, it will propose more robust validation methods for future research, such as event-based analyses and multi-temporal scale evaluations. Finally, the paragraph will be expanded to detail the broader implications of this approach, specifically how applying PCA to undervalued, high-frequency operational datasets can optimize monitoring network design and facilitate the detection of anomalous pollution events in other geographical contexts.
L498: this reference should have been in the Discussion section.
The reference to Pasari (2022) and the associated statement will be moved from the Conclusions to the Discussion section.
Technical comments
L53: add a comma after “Then”.
L117: I don’t understand this sentence, any mistyping?
L134: missing space between point and “Correspondingly”.
L155: add a comma after “eigenvectors”.
L157: add a comma after “components”.
L186: lowercase after “:”
L249: wrong Figure.
L465: add a comma after “STF”.
We thank the reviewer for their careful reading. All the suggested technical corrections, missing punctuation, and the figure reference will be corrected in the revised manuscript.
Citation: https://doi.org/10.5194/egusphere-2026-1203-AC1
RC2: 'Comment on egusphere-2026-1203', Anonymous Referee #2, 19 May 2026
Summary
Elucidating the interplay between catchment and in-stream processes using high frequency multivariate and multisite data investigates and discriminates catchment-scale from in-stream processes that collectively affect water quality. The authors use principal component analysis (PCA) to separate signals by location and to better understand overlapping temporal signals. The authors found PCA to be a useful technique for extracting meaningful, process-based information from temporally dense, high-frequency sensor data.
Significance
The significance of this work is the use of a common multi-variate statistical technique (PCA) to pull data from high-frequency data to better understand spatial and temporal patterns. Specifically, the paper provides (1) a clear model on how to take advantage of often underutilized high-frequency sensor data, (2) interprets overlapping temporal signals, and (3) explores spatial patterns resulting from different land uses and environmental templates.
General Suggestions
Overall, this paper was clearly written and easy to read, although the organization of the paper is not easily digestible and in some cases makes parts feel repetitive or unjustified. I provide more general suggestions below, followed by more minor editorial items.
- There are inconsistencies in the hyphenation of words.
- Principal component analysis is one of the most common multi-variate statistical analyses. In some places too much emphasis is placed on PCA being an advanced analysis. For instance, I do not think that differentiating between PCA and EOF on L142-145 is necessary because of how common PCA is.
- A major strength of this work is the interpretation of spatial patterns (see Significance). The authors mostly attribute this strength to the use of PCA, although, this strength is more related to the distributed site/experimental design. The authors could strengthen this linkage to show how multi-variate analyses like PCA complement and enhance the benefits of high-quality study design.
- Several key aspects of the analyses are described in too little detail. For instance, the structure of the observation matrix is not completely clear. It would be beneficial to clearly state the matrix dimensionality and if all sites and time steps are analyzed together.
- The results section includes interpretations of the results that should appear in the discussion. This is confusing since the authors attribute specific PCs to processes before those connections are justified; the justification first appears in the discussion. This occurs in every paragraph of results and is particularly noticeable in Figure 3, where the PCs are attributed to specific processes well before any justification is provided. Other examples (not exhaustive) of this are in Section 4.2 (L199-206) and L249-255; this information precedes adequate explanation. Similarly, L288-234 is hard to interpret because concepts are presented before any evidence or justification is provided.
- There are too many figures in this paper. I found Figures 4, 5, 7, 10, 11, 12, 14, 16 to be very repetitive and difficult to digest. I highly recommend that the authors pursue a more consolidated presentation of the information to streamline the results section and facilitate more sensible ordering of results followed by interpretations in the discussion section. On another note, the position of the legend within the above figures is not legible and needs to be moved.
- In the results section the authors describe each component in order. This becomes quite repetitive, and it is unclear how important each of the PCs is. One clear solution would be to introduce or reference the variance explained as each PC is discussed. It is hard to jump back and forth to Table 2 to assess how important each PC is.
- A significant amount of work has been done in this region outside of this manuscript. It is frequently referred to in surficial or anecdotal terms. In some places being more specific and providing details would be beneficial. For example, on L237-239, are there any groundwater quality data that can be drawn upon to bolster this assertion?
- The authors mention that small gaps are filled by interpolation. The severity of missing data is blurred because the authors present an average of missing data across multiple sites and parameters. In Figure 15, for example, relatively long periods appear interpolated (see pink line) and in Figure 17, similarly sized gaps in nitrate (see orange) are not interpolated. It is unclear what criteria for filling were used. It is also unclear if the observation matrix for PCA only included time steps that were complete across all sites and parameters.
- The interpretation in section 5.1 about changes in discharge moving downstream is much more detailed than, and contradicts, an earlier statement that attributes the observations solely to groundwater contributions concentrated in the headwaters.
- The suggestions made on L508-513 are overly vague.
- Figure 1 does not provide a broad enough context of the study. For instance, it is unclear where the Harz Mountains are relative to the watershed and how much of the watershed is intersected.
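The gap-filling and complete-case questions raised in the list above can be made concrete with a short sketch. It assumes one possible rule, namely linear interpolation of interior gaps no longer than a fixed number of time steps, followed by dropping any time step that is still incomplete. The function name `fill_short_gaps`, the `max_gap` value, and the example data are hypothetical; the criteria actually used in the manuscript are not stated in this discussion.

```python
import numpy as np

def fill_short_gaps(values, max_gap):
    """Linearly interpolate interior NaN runs of at most max_gap steps.

    Longer runs, and runs touching either end of the series, stay NaN.
    Hypothetical rule for illustration; not the authors' procedure.
    """
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    is_nan = np.isnan(values)
    valid_idx = np.flatnonzero(~is_nan)
    i, n = 0, len(values)
    while i < n:
        if not is_nan[i]:
            i += 1
            continue
        j = i
        while j < n and is_nan[j]:
            j += 1  # the gap covers positions i .. j-1
        bounded = i > 0 and j < n  # valid neighbours on both sides
        if bounded and (j - i) <= max_gap:
            filled[i:j] = np.interp(np.arange(i, j), valid_idx, values[valid_idx])
        i = j
    return filled

nan = np.nan
# Two hypothetical 15-minute series: column a has a 1-step, a 4-step and a 2-step gap.
a = [1.0, nan, 3.0, nan, nan, nan, nan, 8.0, 9.0, nan, nan, 12.0]
b = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]

matrix = np.column_stack([fill_short_gaps(a, max_gap=2), fill_short_gaps(b, max_gap=2)])

# Complete-case selection: keep only time steps with no missing column.
complete = matrix[~np.isnan(matrix).any(axis=1)]
```

With this rule the 4-step gap is left empty, so those four time steps drop out of the complete-case matrix. Reporting the fraction of rows lost this way per site and parameter would answer the question about the severity of missing data.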
Specific Suggestions
L15: Consider replacing “over a period of seven years” with “seven-year period”.
L25: It is unclear if “Continuous monitoring” refers to uninterrupted data or a time series of data, which might be better described as high-frequency discrete time-interval data, rather than continuous.
L30: In-text citation is missing a closing bracket.
L36: I suggest replacing “This complexity can hinder watershed modelling, given the necessity to describe water and mass fluxes, as well as the transformation processes occurring in various compartments, such as soil, vadose, and saturated zones, including surface and groundwater interactions” with “This complexity can hinder watershed modelling, given the necessity to describe water and mass fluxes, as well as the transformation processes occurring within and between various compartments, such as soil, vadose, and saturated zones.”
L42: “spatial signatures” has not been used or described before. This is jargon, consider defining or using a more direct description.
L43: Consider replacing “has emerged as” with “is”.
L50: It is unclear if “in heterogenous datasets” refers to temporal heterogeneity or the combining of datasets into one.
L52-53: I was unable to interpret this sentence and think it should be revised.
L55: It is unclear what is meant by “watershed-scale contributions”. It is unclear how contributions could be discriminated from processes. Could “watershed-scale processes” be used instead while retaining the author’s intent?
Table 1: The column title “Dominant Land Use” is inaccurate; the authors provide additional details including secondary and tertiary land uses.
L94: Consider replacing “accounts” with “accounting”.
L116: “starting 2018” – missing word?
L118: “developed only after the end of the monitoring periods” is ambiguous.
L130-134: Is this observation exacerbated by land use and land cover? An earlier description described a land use gradient that may reasonably contribute.
L134: Missing space at start of sentence.
L134-135: This statement implies that source is more important than in-stream transformations without justification.
L196: “For pH, there is a gradual increase….” Are the authors referring to the correlation coefficients or pH values?
L224: Consider replacing “with the pH variable” with “with pH”.
L223-225: The authors claim that this statement is interesting, but no rationale is provided. I suggest more directly presenting this result and then exploring its interpretation in the discussion section.
L236: “groundwater contributes a great portion of the river discharge” is vague without contextualization or data.
L266-267: This statement requires a reference.
L275-278: I found it hard to reconcile the “highly significant” 4-6 components, considering they account for a relatively small amount of the variance described in Table 2.
L279: This statement about “strong positive loadings across all sites” seems to contrast with an earlier description about sediment trapping in the regulated part of the network.
L356-357: A reference or specific data should support this statement.
L362-363: The evidence for a pre-event water signal is ambiguous. I do not doubt that it is present, but it is not clearly explained with the data.
L367-370: Is this system Karst? If not, is this an appropriate example?
L384-388: References should support these statements.
Citation: https://doi.org/10.5194/egusphere-2026-1203-RC2
AC2: 'Reply on RC2', Kenneth Gutiérrez, 18 Jun 2026
Summary
Elucidating the interplay between catchment and in-stream processes using high frequency multivariate and multisite data investigates and discriminates catchment-scale from in-stream processes that collectively affect water quality. The authors use principal component analysis (PCA) to separate signals by location and to better understand overlapping temporal signals. The authors found PCA to be a useful technique for extracting meaningful, process-based information from temporally dense, high-frequency sensor data.
Significance
The significance of this work is the use of a common multi-variate statistical technique (PCA) to pull information from high-frequency data to better understand spatial and temporal patterns. Specifically, the paper (1) provides a clear model of how to take advantage of often underutilized high-frequency sensor data, (2) interprets overlapping temporal signals, and (3) explores spatial patterns resulting from different land uses and environmental templates.
We thank the reviewer for the positive feedback and the accurate summary of our manuscript. We appreciate the recognition of the significance of this work, particularly regarding the application of PCA to extract spatial and temporal patterns from high-frequency sensor data. We also thank the reviewer for the thorough evaluation and constructive suggestions. These comments have been helpful in identifying areas to improve the organization, clarity, and methodological descriptions of the paper. Below, we provide detailed, point-by-point responses to each comment and outline the specific modifications that will be incorporated into the revised manuscript.
General Suggestions
Overall, this paper was clearly written and easy to read, although the organization of the paper is not easily digestible and in some cases makes parts feel repetitive or unjustified. I provide more general suggestions below, followed by more minor editorial items.
There are inconsistencies in the hyphenation of words.
The manuscript will be thoroughly proofread to ensure consistent hyphenation throughout the text.
Principal component analysis is one of the most common multi-variate statistical analyses. In some places, too much emphasis is placed on PCA being an advanced analysis. For instance, I do not think that differentiating between PCA and EOF on L142-145 is necessary because of how common PCA is.
The text will be revised to remove any language implying that PCA is an inherently "advanced" or "novel" technique, acknowledging its status as a widely established multivariate statistical method. However, the brief distinction between PCA and Empirical Orthogonal Functions (EOF) at lines 142-145 will be retained. This description is intentional and necessary to prevent methodological confusion among an interdisciplinary readership. In fields such as climatology, the application of PCA to sets of multi-site time series is traditionally referred to as EOF analysis, whereas standard PCA is frequently associated with static datasets. Explicitly distinguishing our approach clarifies the specific methodology applied to this high-frequency, continuous time series for diverse scientific communities.
A major strength of this work is the interpretation of spatial patterns (see Significance). The authors mostly attribute this strength to the use of PCA, although, this strength is more related to the distributed site/experimental design. The authors could strengthen this linkage to show how multi-variate analyses like PCA complement and enhance the benefits of high-quality study design.
We agree with the reviewer's insight. The text will be revised to explicitly acknowledge that the ability to interpret complex spatial patterns is fundamentally rooted in the distributed, multi-site experimental design. The Discussion section will be expanded to emphasize how multivariate analyses like PCA complement and maximize the value of this high-quality spatial monitoring network, clarifying that the study's strengths emerge from the interaction between the robust experimental design and the statistical technique, rather than attributing the success solely to the PCA.
Several key aspects of the analyses are described in too little detail. For instance, the structure of the observation matrix is not completely clear. It would be beneficial to clearly state the matrix dimensionality and if all sites and time steps are analyzed together.
The Methods section will be revised to clarify the structure of the observation matrix. A sentence will be added to explicitly state the exact matrix dimensionality and to confirm that all sites, variables, and time steps were analyzed together simultaneously within a single, unified matrix.
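A minimal sketch of what such a single, unified observation matrix could look like is given below. It assumes one row per time step and one column per station-variable pair (5 stations x 6 variables = 30 columns); the station and variable labels, the number of time steps, and the random data are placeholders and are not taken from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: five stations and six variables, as in the study design.
stations = ["S1", "S2", "S3", "S4", "S5"]
variables = ["v1", "v2", "v3", "v4", "v5", "v6"]
n_steps = 1000  # placeholder; the real record has far more 15-minute steps

# One column per (station, variable) pair, one row per time step.
columns = [f"{s}_{v}" for s in stations for v in variables]
X = rng.normal(size=(n_steps, len(columns)))  # synthetic stand-in data

# Standardize each column so that the PCA acts on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized matrix.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
eigenvalues = s**2 / (n_steps - 1)           # variance carried by each component
explained = eigenvalues / eigenvalues.sum()  # fraction of total variance
scores = U * s                               # component time series (rows = time steps)
loadings = Vt.T                              # rows = station-variable columns

print(X.shape)  # (1000, 30)
```

With this layout, each component has one loading per station-variable pair, which is what allows a component to be read both across sites and across variables.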
The results section includes interpretations of the results that should appear in the discussion. This is confusing since the authors attribute specific PCs to processes before those connections are justified in the discussion. This occurs in every paragraph of results and is particularly noticeable in Figure 3, where the PCs are attributed to specific processes well before any justification is provided. Other examples (not exhaustive) of this are in Section 4.2 (L199-206) and L249-255; this information precedes adequate explanation. Similarly, L288-234 is hard to interpret because concepts are presented before any evidence or justification is provided.
We understand the reviewer’s concern regarding the separation of results and discussion. However, based on our experience and in agreement with a comment from Reviewer 1, relying solely on abstract identifiers (such as PC1, PC2) throughout the Results section makes the manuscript difficult for readers to follow. Instead, we adopted an approach where we assign short, concise, process-oriented names to the components early on to make the text more intuitive. The Results section provides the statistical arguments for these preliminary names, while the comprehensive justification, refinement, and detailed interpretation of these processes are strictly reserved for the Discussion section. We will, however, carefully review the Results section to ensure that these process assignments are explicitly framed as preliminary and that any overly deep interpretation is shifted to the Discussion.
There are too many figures in this paper. I found Figures 4, 5, 7, 10, 11, 12, 14, 16 to be very repetitive and difficult to digest. I highly recommend that the authors pursue a more consolidated presentation of the information to streamline the results section and facilitate more sensible ordering of results followed by interpretations in the discussion section. On another note, the position of the legend within the above figures is not legible and needs to be moved.
We agree with the reviewer that the current presentation of Figures 4, 5, 7, 10, 11, 12, 14, and 16 is repetitive and occupies too much space. To streamline the results section, we will consolidate all these loading plots into a single multi-panel figure (e.g., one figure with 8 panels). This consolidation will not only reduce the overall number of figures but will also facilitate a more direct visual comparison between the different Principal Components. Furthermore, combining these plots allows us to use a single, clearly positioned legend outside the data area, fully resolving the legibility and placement issues raised.
In the results section, the authors describe each component in order. This becomes quite repetitive, and it is unclear how important each of the PCs is. One clear solution would be to introduce or reference the variance explained as each PC is discussed. It is hard to jump back and forth to Table 2 to assess how important each PC is.
The text will be revised so that the percentage of variance explained by each principal component is explicitly stated at the beginning of its corresponding subsection.
A significant amount of work has been done in this region outside of this manuscript. It is frequently referred to in surficial or anecdotal terms. In some places being more specific and providing details would be beneficial. For example, on L237-239, are there any groundwater quality data that can be drawn upon to bolster this assertion?
We agree with the reviewer that this physical interpretation must be formally substantiated rather than relying on anecdotal regional knowledge. To address this, we will add references from the established literature on the regional geology and hydrochemistry of the basin. Unfortunately, to the best of our knowledge, information about saline groundwater (L237-239) is only available in non-scientific technical reports written in German.
The authors mention that small gaps are filled by interpolation. The severity of missing data is blurred because the authors present an average of missing data across multiple sites and parameters. In Figure 15, for example, relatively long periods appear interpolated (see pink line) and in Figure 17, similarly sized gaps in nitrate (see orange) are not interpolated. It is unclear what criteria for filling were used. It is also unclear if the observation matrix for PCA only included time steps that were complete across all sites and parameters.
We thank the reviewer for their careful observation. We apologize for the confusion regarding Figure 17, which mistakenly displayed the original, raw time series instead of the interpolated dataset. Figure 17 will be updated to display the correct interpolated data used in the analysis. As noted in the manuscript (L126-127), the overall proportion of missing data was very low (~2%). To provide further clarity and address your concern, we will add specific information regarding the median and maximum length of these gaps to the Methods section. We confirm that linear interpolation (via the na.interp() function in R) was used to fill these gaps, as this step was strictly necessary to provide the PCA with a complete observation matrix containing zero missing values.
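The gap statistics and the filling step described above can be illustrated with a short sketch. It is not the R `na.interp()` call used by the authors; it is a stand-in based on NumPy's `np.interp`, applied to a made-up series, and the helper names `gap_lengths` and `fill_linear` are hypothetical.

```python
import numpy as np

def gap_lengths(series):
    """Return the length of each run of consecutive missing values (NaN)."""
    lengths, run = [], 0
    for is_missing in np.isnan(series):
        if is_missing:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:  # series ends inside a gap
        lengths.append(run)
    return lengths

def fill_linear(series):
    """Fill NaNs by linear interpolation between the nearest valid neighbours."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    valid = ~np.isnan(series)
    # np.interp holds the first/last valid value constant for edge gaps.
    return np.interp(idx, idx[valid], series[valid])

# Made-up example: one gap of length 1 and one gap of length 3.
raw = np.array([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0])
lengths = gap_lengths(raw)
filled = fill_linear(raw)

print(lengths, np.median(lengths), max(lengths))  # [1, 3] 2.0 3
print(filled)
```

Reporting the median and maximum gap length per station and variable, rather than only an overall missing fraction, shows directly how long the longest interpolated stretch is.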
The interpretation in section 5.1 about changes in discharge moving downstream is much more detailed than, and contradicts, an earlier statement that attributes the observations solely to groundwater contributions concentrated in the headwaters.
We thank the reviewer for pointing out this apparent contradiction. The data presented in Section 5.1 refer to net changes in discharge. Recent work by Li et al. (2025), however, clearly shows a complex network of gaining and losing stream reaches. Thus, river water quality changes in the downstream direction despite rather constant discharge. This information will be added to the text.
The suggestions made on L508-513 are overly vague.
This paragraph will be substantially revised to replace these general statements with specific, actionable recommendations directly derived from our findings.
Figure 1 does not provide a broad enough context of the study. For instance, it is unclear where the Harz Mountains are relative to the watershed and how much of the watershed is intersected.
Figure 1 will be revised to provide a broader regional context.
Specific Suggestions
L15: Consider replacing “over a period of seven years” with “seven-year period”.
L25: It is unclear if “Continuous monitoring” refers to uninterrupted data or a time series of data, which might be better described as high-frequency discrete time-interval data, rather than continuous.
L30: In-text citation is missing a closing bracket.
L36: I suggest replacing “This complexity can hinder watershed modelling, given the necessity to describe water and mass fluxes, as well as the transformation processes occurring in various compartments, such as soil, vadose, and saturated zones, including surface and groundwater interactions” with “This complexity can hinder watershed modelling, given the necessity to describe water and mass fluxes, as well as the transformation processes occurring within and between various compartments, such as soil, vadose, and saturated zones.”
L42: “spatial signatures” has not been used or described before. This is jargon, consider defining or using a more direct description.
L43: Consider replacing “has emerged as” with “is”.
L50: It is unclear if “in heterogenous datasets” refers to temporal heterogeneity or the combining of datasets into one.
L52-53: I was unable to interpret this sentence and think it should be revised.
L55: It is unclear what is meant by “watershed-scale contributions”. It is unclear how contributions could be discriminated from processes. Could “watershed-scale processes” be used instead while retaining the author’s intent?
Table 1: The column title “Dominant Land Use” is inaccurate; the authors provide additional details including secondary and tertiary land uses.
L94: Consider replacing “accounts” with “accounting”.
L116: “starting 2018” – missing word?
L118: “developed only after the end of the monitoring periods” is ambiguous.
L130-134: Is this observation exacerbated by land use and land cover? An earlier description described a land use gradient that may reasonably contribute.
L134: Missing space at start of sentence.
L134-135: This statement implies that source is more important than in-stream transformations without justification.
L196: “For pH, there is a gradual increase….” Are the authors referring to the correlation coefficients or pH values?
L224: Consider replacing “with the pH variable” with “with pH”.
L223-225: The authors claim that this statement is interesting, but no rationale is provided. I suggest more directly presenting this result and then exploring its interpretation in the discussion section.
L236: “groundwater contributes a great portion of the river discharge” is vague without contextualization or data.
L266-267: This statement requires a reference.
L275-278: I found it hard to reconcile the “highly significant” 4-6 components, considering they account for a relatively small amount of the variance described in Table 2.
L279: This statement about “strong positive loadings across all sites” seems to contrast with an earlier description about sediment trapping in the regulated part of the network.
L356-357: A reference or specific data should support this statement.
L362-363: The evidence for a pre-event water signal is ambiguous. I do not doubt that it is present, but it is not clearly explained with the data.
L367-370: Is this system Karst? If not, is this an appropriate example?
L384-388: References should support these statements.
We sincerely thank the reviewer for this detailed and constructive line-by-line review. In the revised manuscript, we will implement all the suggested typographical, grammatical, and phrasing corrections. Additionally, we will clarify the terminology, add the requested references to properly contextualize our statements, and ensure that any interpretative text is systematically relocated to the discussion section as advised.
Citation: https://doi.org/10.5194/egusphere-2026-1203-AC2
General comments
The manuscript by Gutiérrez‑García et al. presents an interesting application of Principal Component Analysis (PCA) to a high-frequency multi-variable dataset collected in the Bode River catchment in Germany. Results from the PCA were evaluated together with additional independent data, which helped to identify the major hydrological and biogeochemical processes occurring in the catchment or within the stream. Such an approach appears very valuable for extracting key trends from a complex multi-variate dataset, and could surely contribute to future research.
Overall, the manuscript is generally well written and the interpretations proposed by the authors are generally consistent with the observed spatial and temporal patterns. However, the authors were not sufficiently clear about the study’s limitations, particularly concerning the methodological and interpretation weaknesses.
My main concern involves the interpretation of PCA components, which is sometimes based on weak correlations with principal components (PCs) that themselves account for only a small portion of the total variance. Indeed, the authors chose to keep 8 principal components to reach a cumulative variance of 84%. However, only the first four PCs were associated with a variance greater than 5%. Such low-variance PCs are often considered weakly informative and potentially sensitive to noise. Most of the studies cited in the manuscript only used the first four PCs for interpreting their datasets. Although some independent validation is provided, the authors should present more clearly the robustness and limitations associated with the interpretation of low correlations and less informative PCs.
Several suggestions are presented below, which the authors should feel free to take into account or not.
Detailed comments
L14: it might be helpful to specify in the abstract which variables were used in this study.
L25: the authors mention broader implications for environmental monitoring and risk reduction that are not sufficiently developed or taken into account later in the paper. This approach could be applied in a variety of contexts, which would be useful to mention here or in the discussion.
L28: high-frequency datasets are common in an operational context. That means that your approach could be applied to a large number of catchments in the world, using datasets that are often undervalued. A few words on these implications might add value to your paper.
L48: I cannot see which of these references use PCA to differentiate catchment-scale processes from in-stream dynamics. Maybe you should separate references according to the main goals of each study. Furthermore, if, as you noted, previous studies have already applied PCA for this purpose, what is the added scientific value of your study? More generally, the introduction would benefit from a clearer explanation of the scientific challenges associated with PCA and high-frequency datasets.
Table 1: I would have appreciated seeing the proportion of each land use for each sub-catchment summarized in this table.
Figure 1: This figure should be improved to include:
L60: It is not clearly stated whether the Magdeburger Börde corresponds to the lowland plain area.
L105: first mention of the Central German Lowland area. Is this area different from the one cited previously?
L129: I would have appreciated knowing more about the hydrological regime of the Bode and Selke rivers (e.g. mean annual discharge) before reading about the discharge observed at each station. It may also be helpful to add the mean discharge when writing about the main differences between two stations.
L140: Could you justify the use of Spearman correlation in your study? I guess it is because of the non-normal distribution of the dataset, or the presence of outliers. Maybe you could also specify that the very large dataset you are using automatically reduces the risk of false correlations.
L141: how did you define the low, medium and high regimes? Are there any thresholds to provide?
L142-L146: I’m not sure that this part is relevant. Why mention the term “EOF” if it is not used in your context? Besides, you already introduced PCA in the introduction. If you have more information to provide on its general application, you should have mentioned it earlier.
L151: It might be useful to specify that the sum of all eigenvalues gives how much of the total variance is captured by PCA. Furthermore, could you please clarify the sentence on line 165: “first four principal components exhibited eigenvalues exceeding 1”. What does this mean?
L153: Do you have any references?
L166: See comment 151. You have 84% of total variance explained by the eight principal components, so you reach your objective of 80-90%, but you did not mention how many PCs it is acceptable to retain in order to reach a minimum of 80%. PCs that explain between 2 and 5% are generally considered poorly informative, and the interpretations that you draw from them might be overinterpreted, particularly if you choose to associate clear hydrological processes with each PC. I would have appreciated some words about this risk of overinterpretation in the Methods and in the Discussion sections.
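The relationship between eigenvalues, variance explained, and retention rules raised in the two comments above can be sketched as follows. For a PCA on standardized variables, each variable contributes unit variance, so the eigenvalues sum to the number of variables, and an eigenvalue above 1 means the component carries more variance than any single original variable. The eigenvalues below are hypothetical and chosen only for easy arithmetic; they are not the values reported in the manuscript, and it is an assumption that the manuscript's "eigenvalues exceeding 1" refers to this kind of criterion.

```python
from itertools import accumulate

# Hypothetical eigenvalues of a correlation-matrix PCA with 8 variables.
eigenvalues = [3.0, 2.0, 1.5, 0.5, 0.5, 0.25, 0.125, 0.125]

total = sum(eigenvalues)                    # equals the number of variables (8)
share = [ev / total for ev in eigenvalues]  # fraction of variance per component
cumulative = list(accumulate(share))        # running total of explained variance

# Three common ways to decide how many components to keep.
kaiser = sum(ev > 1.0 for ev in eigenvalues)  # eigenvalue > 1
n_for_80 = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.80)
above_5pct = sum(s > 0.05 for s in share)     # individual share > 5 %

print(kaiser, n_for_80, above_5pct)  # 3 3 5
```

The three rules need not agree, which is why stating the criterion and the per-component share alongside each interpretation helps readers judge how much weight a low-variance component can bear.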
L172: your site description was already done in section 2. You should have mentioned these elements earlier.
L193: it might be useful to define a threshold above which a correlation coefficient is considered strongly informative for interpreting PCA, and below which it is considered weakly informative. For example, you mentioned (L219) a “strong” relationship with discharge and turbidity for absolute correlation coefficients < 0.5. I understand that with such a large volume of data, correlation coefficients are often significant even if they are low, but you should be careful about the terms used.
L243: maybe you should mention again in each subsection how much variance is explained by each PC. It may help to appreciate the robustness of each interpretation.
L244: you might specify which stations are correlated with pH.
L261: I’m not sure about the purpose of this sentence. What does it imply?
L262: “this effect”. You should specify which effect you are referring to.
L279: again, you mentioned a “strong” correlation across all sites, but correlation coefficients are greater than 0.5 for SH and MEI only (and maybe GGL?).
L300-331: Components 7 and 8 are poorly informative and showed weak correlations with the variables, but you did not mention the potential weakness of your interpretation in the results or discussion.
L336: the discussion section might benefit from being organized by process and not by PC. Indeed, you highlighted some major hydrological and biogeochemical processes, each of which could constitute a sub-section of its own. The sub-sections could then be named after these processes instead of the PCs. For example:
- Section 5.1: you named this section “hydrological processes” but mainly discussed specific discharge contributions. You should make this title more specific;
- Section 5.2: PC1 is based on discharge peaks, but you mainly discussed nitrate export. Maybe a single sub-section on nitrate would be clearer, even if I understand that the processes involved are not the same.
Furthermore, I do not see a clear separation between the results and discussion. Some parts of the discussion provide another description of the PC results (e.g., L440, L420, L425, L383).
L343: but didn’t you mention that HAU was not in the mountainous region?
L372: it might be helpful to provide the correlation coefficients. Otherwise, Figures S1, S2 and S3 do not appear to be fully exploited.
L386-388: do you have a reference? Without any cited references, this sentence reads more as a result or interpretation than as discussion.
L481: this paragraph should be included in another subsection where you could discuss the validation methods used and propose other, more robust validation methods. Furthermore, as mentioned earlier, this subsection and this study would benefit from being supplemented with some broader implications: what could these key environmental factors extracted from this dataset contribute in other contexts, particularly in operational contexts where such datasets are common?
L498: this reference should have been in the Discussion section.
Technical comments
L53: add a comma after “Then”.
L117: I don’t understand this sentence. Is there a typo?
L134: missing space between the period and “Correspondingly”.
L155: add a comma after “eigenvectors”.
L157: add a comma after “components”.
L186: lowercase after “:”.
L249: wrong Figure.
L465: add a comma after “STF”.