Utilizing Probability Estimates from Machine Learning and Pollen to Understand the Depositional Influences on Branched GDGT in Wetlands, Peatlands, and Lakes

Cromartie, Amy; De Jonge, Cindy; Ménot, Guillemette; Robles, Mary; Dugerdil, Lucas; Peyron, Odile; Rodrigo-Gámiz, Marta; Camuera, Jon; Ramos-Roman, Maria Jose; Jiménez-Moreno, Gonzalo; Colombié, Claude; Sahakyan, Lilit; Joannin, Sébastien

doi:10.5194/egusphere-2025-526

Preprints

https://doi.org/10.5194/egusphere-2025-526

28 Feb 2025

Abstract. Branched glycerol dialkyl glycerol tetraethers (brGDGTs) serve as critical molecular biomarkers for the quantitative reconstruction of past environments, ambient temperature and pH across various archives. Despite their success, numerous issues persist that limit their application. The distribution of brGDGTs varies significantly based on provenance, resulting in biases in environmental reconstructions that rely on fractional abundances and derived indices, such as the MBT’_5ME. This issue is especially significant in shallow lakes, wetlands, and peatlands within semi-arid and arid regions, where ecosystems are sensitive to diverse environmental and climatic factors. Recent advancements, such as machine learning techniques, have been developed to identify changes in sources; however, these techniques are insufficient for detecting mixed source environments. The probability estimates derived from five machine learning algorithms are employed here to detect provenance changes in brGDGT downcore records and to identify periods of mixed provenance. A new global modern database (n=2301) was compiled to train, validate, test, and apply these algorithms to two sedimentary records. Our findings are corroborated by pollen and non-pollen palynomorphs obtained from the identical records. These microfossil proxies are utilized to discuss changes in provenance, hydrology, and ecology that influence the distribution of brGDGTs. Probability estimates derived from Random Forest with a sigmoid calibration are most effective in detecting changes in brGDGT distribution. Minor changes in the relative contributions of brGDGTs provenance can significantly influence the distribution of brGDGTs, especially regarding the MBT'_5ME index. This study introduces a novel brGDGT wetland index aimed at monitoring potential biases arising from wetland development.

Received: 11 Feb 2025 – Discussion started: 28 Feb 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 2587 KB)

Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Supplement (1329 KB)

Journal article(s) based on this preprint

08 Dec 2025

Utilizing probability estimates from machine learning and pollen to understand the depositional influences on branched GDGT in wetlands, peatlands, and lakes

Biogeosciences, 22, 7687–7708, https://doi.org/10.5194/bg-22-7687-2025, 2025

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-526', Joseph B. Novak, 07 Mar 2025

Please see the attached file. My apologies that I cannot participate in any additional rounds of peer review due to planned travel.

Citation: https://doi.org/10.5194/egusphere-2025-526-RC1
- AC1:
  'Reply on RC1', Amy Cromartie, 03 May 2025
  Dear Reviewer Novak,
  Thank you for your thoughtful comments on our manuscript. We have read through your review and answered each point as best we can. Our responses are given below, and we hope they sufficiently address your comments and concerns.
  
  Review of Cromartie et al. for Biogeosciences
  Joseph B. Novak
  Recommendation
  Major revision.
  Summary
  Cromartie et al. present a new machine learning approach to probabilistically assess the provenance of brGDGTs in terrestrial sedimentary archives. This work improves upon previous work by Martinez-Sosa et al. (2023), the BIGMAC algorithm, by generating probability estimates that permit analysis of the likely relative contributions of brGDGTs from various sources to a sediment sample rather than discrete sample classifications. The improvement upon the BIGMAC algorithm is a contribution towards ongoing efforts to utilize brGDGTs as proxies of past climate change in the geologic record.
  The writing is mostly clear, although there are some places where I was confused by the word choice or sentence structure. Wherever possible, I provided suggestions to revise the wording for clarity. I urge the editors to find a machine learning expert to evaluate the methodology of this work, as this technique does not fall within my expertise. My recommendation for a major revision is based upon my concerns regarding section 4.2.4 where a new brGDGT wetlands index is proposed (see major comments).
  I look forward to the publication of this work after my comments are addressed.
  
  Major Comments
  Introduction
  The introduction would benefit from some clarification as to why it was necessary to use five machine learning techniques to generate the model described here. Did you try five machine learning methods and then settle on one as the best? Are you somehow combining the output of all five models? Machine learning is generally a confusing (and intimidating!) methodology for many people, including some who would want to use your algorithm. Clarity on why you took this approach will make people more likely to understand what you did and therefore more likely to use your algorithm (and cite your work!).
  
  I think an additional 1-3 sentences in the paragraph at lines 81-92 would be very helpful for clarifying this point.
  
  Response: Thank you for this comment. We have added more information to the introduction on why these models were chosen.
  Introduction: “We test five popular parametric and non-parametric machine learning models based on their ability to handle small tabular datasets and produce reliable probability estimates when calibrated (Malley et al., 2012; Wang et al., 2019). Models utilizing different structures were chosen, including simple tree-based algorithms (CART), ensemble trees (RF), linear models (LR), margin-based classifiers (SVM), and instance-based lazy learners (K-NN) to evaluate performance. The best-performing model was then chosen to apply to two down-core sedimentary sequences. ”
  
  Materials and Methods
  L220–221: Do you mean that you are using the probability estimates as a means of understanding changes in brGDGT provenance through time? Because that is a different thing than using them as an environmental proxy. Please clarify.
  Response: Yes, we are primarily looking at them to explain provenance change rather than as an environmental proxy in itself. We have gone ahead and removed environmental and added provenance. This sentence is now: “proxy for provenance change”
  
  Discussion
  Figure 6a and 7a: why is the pollen water depth reconstruction plotted on a log scale? This seems a bit odd, should this not be plotted on a linear scale? Please explain.
  
  Response: We plotted the water depth reconstruction on a logarithmic scale to follow the precedent set in the authors' original peer-reviewed articles (i.e., Robles et al., 2022) rather than deviate from it. Pollen-based reconstructions are commonly plotted this way so that changes can be visualized more clearly.
  
  Section 4.2.4: I question whether including this section distracts from the larger point of the paper. Is this index not redundant since you are tracking basically the same thing with the % peat probability? Perhaps more importantly, there should be some sort of validation regarding whether this index is useful for identifying wetlands in a modern dataset. For example, is this index value higher in modern samples from wetlands than in those from dry soils or lakes? I think this section generally should be expanded upon significantly if the authors want to claim that this index can be used this way.
  
  Response: After consideration, and since both reviewers flagged the index as problematic, we have removed it for now, because validating it would require major additions to the manuscript that would distract from the primary goal of utilizing machine learning. We will try to publish this index in the future with more information. To accommodate this change, we removed the discussion of the index from the abstract (lines 35-36), along with Figure 8 and section 4.2.4. The section now includes a more thorough discussion of utilizing brGDGTs in different environmental settings, and of confidence intervals for reviewer 2.
  
  Minor Comments
  Abstract
  L21: I think you mean “Branched glycerol dialkyl glycerol tetraethers (brGDGTs) are critical molecular biomarkers” rather than “…serve as critical molecular biomarkers.”
  Response: We removed the “serve as” and changed to “are critical biomarkers”
  L22–23: Is the sentence starting with “Despite their success…” necessary?
  Response: We have removed the phrase “Despite their success” and instead start the sentence with “A key challenge”.
  L25: here and throughout, make sure to use the proper prime symbol ′
  Response: This has been updated throughout the manuscript. Thank you for catching this.
  L25–26: “…where ecosystems are sensitive to diverse environmental factors.” Do you mean that depositional environments in arid and semi-arid regions are prone to change in response to water stress?
  Response: Yes, this is what we mean. We have added text so that the sentence now reads “where ecosystems are sensitive to diverse environmental factors including water stress from increased aridity”.
  L31: “…obtained from the identical records.” Do you mean from the same samples? Or from the same cores?
  Response: These samples are from the same cores. The text has been updated to reflect that: “taken from the same sedimentary core sequence”
  L34: Typo. “brGDGT provenance” not “brGDGTs provenance.”
  Response: Thank you, this typo has been corrected.
  L36: I think a word is missing here. Do you mean potential biases in brGDGT paleotemperature reconstructions?
  Response: This sentence was removed along with the wetland index.
  
  Introduction
  L39–43: These two sentences are largely redundant. Could they be combined?
  Response: Thank you for this comment. Although we can see some redundancy, we have kept the sentences in their current form so that the reader has the relevant literature.
  L47: “their potential” not “its potential." I would consider removing this initial clause in the sentence and starting with “A key challenge…” as this is a more focused start to the paragraph.
  Response: we have removed this phrase and start with “A key challenge” as you suggested.
  L47–48: I think you need a citation here since this thought is informed by previous work.
  Response: We have added the following citations to this line: (De Jonge et al., 2014; Naafs et al., 2017; Dearing Crampton-Flood, 2020; Martínez-Sosa et al., 2020; Raberg et al., 2022)
  L56–58: I think you mean “The MBT′5Me index is correlated to temperature in lake sediments, peats, and soils [CITATIONS].”
  Response: We have gone and updated this sentence to the following:
  Text added: The MBT′_5ME index has been successfully utilized as grounds for various global temperature calibrations, because of its strong correlation to temperature in modern samples, concerning lakes, peats, and soils (e.g., De Jonge et al., 2014a; Hopmans et al., 2016; Naafs et al., 2017a; Dearing Crampton-Flood et al., 2020; Martínez-Sosa et al., 2021; Véquaud et al., 2022).
  L59–65: Somewhere in here it would be good to mention that MBT′5Me is systematically higher in soils than in lakes. This is usually the major source of concern when it comes to dealing with brGDGTs from potentially mixed sources, at least in lake sediments.
  Response: We agree and have updated the text with the following:
  Text added: “Provenance changes may introduce bias to temperature reconstructions based on the MBT′_5ME index values generally being higher in soils than in lakes (Pablo Martínez-Sosa et al., 2021). ”
  
  L78: what do you mean by ecological changes? As in bacterial ecology? This may be a word choice issue, I was really surprised to see the word “ecology” here.
  Response: This refers to ecological changes within the wetland, peatland, and lake environments, not to bacterial ecology. We understand the confusion, however, so we have removed the word “ecology” from this sentence. The sentence now reads: “This paper presents a strategy for identifying provenance changes across”
  
  L99 vs L100: do you mean depositional or provenance? Because those are different things. The words cannot be used interchangeably.
  
  Response: We changed the wording to clarify this distinction by adding the following to the sentence: “we demonstrate how these complementary proxies can aid in identifying potential hydrological, ecological, and depositional changes that may cause provenance shifts, introducing bias in brGDGT reconstruction”
  
  Materials and Methods
  L123: I think you mean limited data, not limited publication.
  Response: removed publication and changed to “limited data”
  L135: “A C46 internal standard.”
  Response: This has been updated
  L139: “…the C46 internal standard.”
  Response: This has been updated.
  L176: Check the journal’s referencing policies. I am not sure that “ibid” is permitted within this citation style.
  Response: We have changed to include the reference removing ibid.
  L189–191: This sentence is missing a verb. Please clarify.
  Response: I was unable to find a missing verb in these sentences, in this case “are” and “demonstrated” are the verbs and “functions” is the verb in the next sentence.
  L192: same comment regarding ibid.
  Response: updated
  L196: same comment regarding ibid.
  Response: updated
  Figure 2: consider making the background color for the soil dataset symbol a lighter shade of brown. I found it hard to read the dark font against the dark brown background.
  Response: We have gone ahead and updated this figure to a lighter shade of brown.
  L222: do you mean “periods of mixed brGDGT provenance?”
  Response: Yes, this refers to mixed provenance. We have removed the word “sourcing” and changed it to “provenance”: “thus facilitating the identification of periods of mixed provenance.”
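One way to picture how class probabilities could flag periods of mixed provenance is sketched below. This is a hypothetical illustration, not the authors' code: the function name `flag_mixed`, the class order (soil, peat, lake), the 0.6 dominance threshold, and the sample probabilities are all assumptions invented for the example.

```python
# Hypothetical sketch: flag samples whose strongest class probability is weak.
CLASSES = ("soil", "peat", "lake")  # assumed class order


def flag_mixed(probs, threshold=0.6):
    """Return (dominant_class, is_mixed) for one sample.

    probs     -- calibrated probabilities, one per class, summing to 1
    threshold -- assumed cut-off below which no single source dominates
    """
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    top = max(range(len(probs)), key=lambda i: probs[i])
    return CLASSES[top], probs[top] < threshold


# Invented down-core probabilities, for illustration only.
downcore = [(0.05, 0.10, 0.85), (0.10, 0.50, 0.40), (0.80, 0.15, 0.05)]
for sample in downcore:
    print(sample, flag_mixed(sample))
```

A hard classifier would label the second sample simply as peat; keeping the probabilities shows that lake input is nearly as likely, which is the kind of information a discrete classification discards.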
  L244–245: why did you retain the original sample to sample curve?
  Response: We retained the original sample-to-sample curve because the originally published pollen record and water depth reconstruction span 200,000 years, which makes it difficult to see variations within the shorter (36,000-year) brGDGT record. We added additional text to clarify:
  “Instead of applying a smoothing technique to the water-depth reconstruction, as done by Camuera et al., (2019) on the original 200,000-year-old sequence, we retained the original sample-to-sample curve for clarity to compare to the shorter brGDGT sequence.”
  
  L252–254: I am not sure this is necessary to explain.
  Response: Although we agree we do not necessarily need to explain the plots, we have decided to keep the reference and description in the manuscript for clarity and to allow for the proper citations to the programs used for our analysis.
  
  Results
  L279–280: these sentences should be combined.
  Response: We have combined these sentences as follows: “The RF model with the SMOTE dataset had the highest accuracy and the lowest Log loss score for sigmoid calibrated probabilities and therefore was chosen for our analysis as the best performing model.”
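For readers unfamiliar with the log loss score mentioned here, a minimal sketch of the multi-class definition follows. It is an illustration under stated assumptions, not the authors' implementation: the function name, the class order (0 = soil, 1 = peat, 2 = lake), and the probability values are invented.

```python
import math


def log_loss(true_idx, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for y, p in zip(true_idx, probs):
        # Clip to avoid log(0) when a model is fully confident and wrong.
        p_true = min(max(p[y], eps), 1.0 - eps)
        total -= math.log(p_true)
    return total / len(true_idx)


y_true = [0, 1, 2]  # assumed encoding: soil, peat, lake
confident = [(0.90, 0.05, 0.05), (0.05, 0.90, 0.05), (0.05, 0.05, 0.90)]
hedged = [(0.50, 0.25, 0.25), (0.25, 0.50, 0.25), (0.25, 0.25, 0.50)]

print(log_loss(y_true, confident))  # lower: correct and confident
print(log_loss(y_true, hedged))     # higher: correct but uncertain
```

Both sets of predictions yield the same accuracy, since the true class always has the largest probability, yet their log loss differs. That is why log loss is a natural criterion when the probabilities themselves, rather than the class labels, are the quantity of interest.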
  
  Discussion
  L362: I think you mean that the two timeseries are qualitatively similar.
  Response: We have added the following to the text to reflect this change:
  “We evaluated the accuracy of our ML model for detecting provenance change by comparing the probability estimates it produced with published data on pollen, NPPs, XRF and water depth estimates derived from these proxies and to check the that the results for quantitative similarities (Fig. 5 and 6). ”
  L402: italicize in situ
  Response: this has been updated
  L429–431: I am not sure what you mean. The De Jonge et al. 2024 study found that MBT′5Me are generally reproducible between laboratories.
  Response: Thank you for this catch. The sentence is indeed incorrect and strangely phrased, and we have removed it from the manuscript.
  L450: the ′ symbol should not be subscripted
  Response: This has been updated
  L468: you can simply report this p-value as “p < 0.001”
  Response: This has been changed to “p < 0.001”
  L475–482: why are some of the brGDGTs written in square brackets sometimes but not other times?
  Response: We have updated the manuscript to include square brackets around all the brGDGTs where appropriate.
  
  Citation: https://doi.org/10.5194/egusphere-2025-526-AC1
RC2:
'Comment on egusphere-2025-526', Anonymous Referee #2, 08 Apr 2025
The manuscript presents an interesting strategy on using machine learning probability estimates and pollen data to understand how source changes affect the distribution of brGDGTs and particularly the MBT’5ME index in sedimentary records. The study applies probability estimates from machine learning models to an extended modern global brGDGT database, to detect brGDGT contributions from different sources (soil, peat, lake) over time in two sediment archives and validate these findings through comparisons with independent pollen and NPP data.
Although the study builds on previous research (Martinez-Sosa et al, 2023), it introduces several novel additions, including: addition of new modern br-GDGT samples to the training dataset, exploration of different probability calibration techniques and proposing a new “brGDGT wetland index”. This index, if validated further, may help distinguish wetland-influenced brGDGTs from other depositional sources, with potential applications beyond brGDGT provenance analysis.
However, I find some aspects unclear or insufficiently explained:
1) The study uses five different machine learning models, but the criteria for selecting these specific models are not well justified. Why were these models chosen over others, such as deep learning approaches or ensemble methods beyond Random Forest? Also, the manuscript could benefit from a discussion of why certain models performed better and why others underperformed.
2) The study primarily focuses on semi-arid and arid regions, but are the results generalizable to other climate settings? Note that one of the tested sediment records extends back to 35k years BP. Also, there is little discussion of how different environmental conditions might affect the model performance.
3) The results rely heavily on probability estimates to detect mixed-source environments, but their uncertainty and confidence intervals are not entirely clear. How robust are these estimates? How reliable would they remain in cases of overlapping depositional influences (ex. lake vs peat transitions)?
4) The study validates machine learning results using pollen and NPP data, but this comparison has uncertainties among which are variations in pollen/spore productivity and dispersal over time. These uncertainties could also affect the reliability of pollen-based reconstructions. A more extended comparison with established proxies of local relevance (e.g., geochemical elemental data, stable isotopes) would strengthen the argument that machine learning provides superior brGDGT provenance detection.
Minor points
Introduction:
Could the authors elaborate more on the specific limitations of past methods in br-GDGT provenance detection?

The discussion on machine learning techniques is somewhat broad; more details on how these approaches differ from traditional statistical methods would be very helpful.

Material and Methods:
The manuscript states that models were trained and tested using a new modern database, but it does not provide sufficient details on data preprocessing and splitting strategies. E.g., what was the rationale behind the dataset division (train-validation-test)?

The use of SMOTE seems suitable for handling class imbalance, but was there any testing to ensure it did not introduce artificial biases?
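For context on the question above, the core idea of SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the line segment between a real minority sample and one of its nearest minority-class neighbours. The code below is a simplified, hypothetical illustration and not the implementation used in the manuscript; the function name `smote_sample`, the choice of `k`, and the two-feature "peat" values are invented.

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a neighbour."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Squared Euclidean distance is enough for ranking neighbours.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    u = rng.random()  # position along the segment base -> neighbour
    return tuple(a + u * (b - a) for a, b in zip(base, neighbour))


# Invented two-feature minority class (e.g. two fractional abundances).
peat = [(0.10, 0.30), (0.12, 0.28), (0.20, 0.25), (0.15, 0.35)]
print(smote_sample(peat, rng=random.Random(42)))
```

Because every synthetic point is interpolated, it stays inside the range already covered by the minority class and adds no new information. A common precaution against the biases raised here is to oversample only the training split, so that validation and test scores are computed on real samples alone.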

Results:
The results section provides a detailed evaluation of model performance, but it lacks clarity on what is a meaningful improvement in classification accuracy. For example, what is the practical significance of a 0.72 vs. 0.90 F1 score in this context?
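To make the 0.72 versus 0.90 comparison concrete, the F1 score can be worked through from raw counts. The sketch below uses the standard definition (harmonic mean of precision and recall); the counts are invented so that the two scores come out at exactly those values and do not come from the manuscript.

```python
def f1_score(tp, fp, fn):
    """F1 for one class from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts for a class with 100 true samples.
print(f1_score(tp=90, fp=10, fn=10))  # precision = recall = 0.90
print(f1_score(tp=72, fp=28, fn=28))  # precision = recall = 0.72
```

Under these assumed counts, the lower score corresponds to 28 of every 100 samples of that class being missed and a similar number being wrongly assigned to it, compared with 10 in the higher-scoring case.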

The comparison of sigmoid and isotonic calibration functions is interesting, but it is unclear why certain calibrations improved some models but worsened others. More discussion on the underlying reasons for these differences is needed (at least these could be included in the supplementary material).
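As background to this comment, sigmoid calibration (Platt scaling) fits a two-parameter logistic curve that maps raw classifier scores to probabilities, whereas isotonic calibration fits a non-parametric monotone step function. The sketch below shows only the sigmoid case as a plain gradient-descent fit. It is a simplified assumption-laden illustration, not the manuscript's procedure: the function names, learning rate, step count, and the scores and labels are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_sigmoid(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. z
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Invented raw scores and binary labels (1 = sample belongs to the class).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_sigmoid(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
print([round(p, 3) for p in calibrated])
```

With only two parameters, the sigmoid map is constrained to a smooth S-shape, which tends to be more stable on small calibration sets. An isotonic fit is more flexible and can follow noise when few samples are available, which is one plausible reason the two methods may help some models and hurt others.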

Discussion:
The proposed "brGDGT wetland index" is an interesting addition, but more validation is required. The assumptions behind the brGDGT wetland index appear to be valid in modern datasets, but their applicability to fossil records is problematic mainly due to the fact that pollen productivity and dispersal (incl. source area) vary over time due to climatic, ecological, and taphonomic factors (incl. differential preservation). Because of these reasons, a water level reconstruction based solely on pollen/spores may also be problematic.
Citation: https://doi.org/10.5194/egusphere-2025-526-RC2
- AC2: 'Reply on RC2', Amy Cromartie, 03 May 2025
  
  Please see the attached pdf for our response
  
  Citation: https://doi.org/10.5194/egusphere-2025-526-AC2

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-526', Joseph B. Novak, 07 Mar 2025

Please see the attached file. My apologies that I cannot participate in any additional rounds of peer review due to planned travel.

Citation: https://doi.org/10.5194/egusphere-2025-526-RC1
- AC1:
  'Reply on RC1', Amy Cromartie, 03 May 2025
  Dear Reviewer Novak,
  Thank you for your thoughtful response on our manuscript. We have gone ahead and read through your reviews and answered them as best as we can. You will find our responses below. We hope that we sufficiently answered your comments and concerns below.
  
  Review of Cromartie et al. for Biogeosciences
  Joseph B. Novak
  Recommendation
  Major revision.
  Summary
  Cromartie et al. present a new machine learning approach to probabilistically assess the provenance of brGDGTs in terrestrial sedimentary archives. This work improves upon previous work by Martinez-Sosa et al. (2023), the BIGMAC algorithm, by generating probability estimates that permit analysis of the likely relative contributions of brGDGTs from various sources to a sediment sample rather than discrete sample classifications. The improvement upon the BIGMAC algorithm is a contribution towards ongoing efforts to utilize brGDGTs as proxies of past climate change in the geologic record.
  The writing is mostly clear, although there are some places where I was confused by the word choice or sentence structure. Wherever possible, I provided suggestions to revise the wording for clarity. I urge the editors to find a machine learning expert to evaluate the methodology of this work, as this technique does not fall within my expertise. My recommendation for a major revision is based upon my concerns regarding section 4.2.4 where a new brGDGT wetlands index is proposed (see major comments).
  I look forward to the publication of this work after my comments are addressed.
  
  Major Comments
  Introduction
  The introduction would benefit from some clarification as to why it was necessary to use five machine learning techniques to generate the model described here. Did you try five machine learning methods and then settle on one as the best? Are you somehow combining the output of all five models? Machine learning is generally a confusing (and intimidating!) methodology for many people, including some who would want to use your algorithm. Clarity on why you took this approach will make people more likely to understand what you did and therefore more likely to use your algorithm (and cite your work! )
  
  I think an additional 1-3 sentences in the paragraph at lines 81-92 would be very helpful for clarifying this point.
  
  Response: Thank you for this response we have added a bit more information in the introduction on why these models were chosen.
  Introduction: “We test five popular parametric and non-parametric machine learning models based on their ability to handle small tabular datasets and produce reliable probability estimates when calibrated (Malley et al., 2012; Wang et al., 2019). Models utilizing different structures were chosen, including simple tree-based algorithms (CART), ensemble trees (RF), linear models (LR), margin-based classifiers (SVM), and instance-based lazy learners (K-NN) to evaluate performance. The best-performing model was then chosen to apply to two down-core sedimentary sequences. ”
  
  Materials and Methods
  L220–221: Do you mean that you are using the probability estimates as a means of understanding changes in brGDGT provenance through time? Because that is a different thing than using them as an environmental proxy. Please clarify.
  Response: Yes, we are primarily looking at them to explain provenance change rather than as an environmental proxy in itself. We have gone ahead and removed environmental and added provenance. This sentence is now: “proxy for provenance change”
  
  Discussion
  Figure 6a and 7a: why is the pollen water depth reconstruction plotted on a log scale? This seems a bit odd, should this not be plotted on a linear scale? Please explain.
  
  Response: We decided to plot the water depth reconstruction on a logarithmic scale following the precedent of the author’s (i.e., Robles et al., 2022) original peer-reviewed articles rather than deviate. It is common to plot these pollen-based reconstructions in this way in order to visualize changes more clearly the results.
  
  Section 4.2.4: I question whether including this section distracts from the larger point of the paper. Is this index not redundant since you are tracking basically the same thing with the % peat probability? Perhaps more importantly, there should be some sort of validation regarding whether this index is useful for identifying wetlands in a modern dataset. For example, is this index value higher in modern samples from wetlands than in those from dry soils or lakes? I think this section generally should be expanded upon significantly if the authors want to claim that this index can be used this way.
  
  Response: After consideration, and since both reviewers mentioned the index as problematic, we have gone ahead and removed this index at this time since it will require major additions to the manuscript which distracts from the primary goal of utilizing machine learning. We will try to publish this index in the future with more information. To accommodate this change we have removed the discussion of this index from the abstract (lines 35-36). We removed figure 8 and section 4.2.4. This section has been changed to include a more through discussion of utilizing the brGDGTs in different environmental settings and confidence intervals for reviewer 2.
  
  Minor Comments
  Abstract
  L21: I think you mean “Branched glycerol dialkyl glycerol tetraethers (brGDGTs) are critical molecular biomarkers” rather than “…serve as critical molecular biomarkers.”
  Response: We removed the “serve as” and changed to “are critical biomarkers”
  L22–23: Is the sentence starting with “Despite their success…” necessary?
  Response: We have removed the phrase “Despite their success, and instead start the sentence with “A key challenge”
  L25: here and throughout, make sure to use the proper prime symbol ′
  Response: This has been updated throughout the manuscript. Thank you for catching this.
  L25–26: “…where ecosystems are sensitive to diverse environmental factors.” Do you mean that depositional environments in arid and semi-arid regions are prone to change in response to water stress?
  Response: Yes, this is indeed what we mean, we have added additional text that says “where ecosystems are sensitive to diverse environmental factors including water stress from increased aridity”
  L31: “…obtained from the identical records.” Do you mean from the same samples? Or from the same cores?
  Response: These samples are from the same cores. The texts has been updated to reflect that: “taken from the same sedimentary core sequence”
  L34: Typo. “brGDGT provenance” not “brGDGTs provenance.”
  Response: Thank you this has been updated to correct this typo
  L36: I think a word is missing here. Do you mean potential biases in brGDGT paleotemperature reconstructions?
  Response: This sentence was removed due to remove the wetland index.
  
  Introduction
  
  Introduction
  L39–43: These two sentences are largely redundant. Could they be combined?
  Response: Thank you for this comment, although we can see where there are some redundancies, we have kept the sentences in the current form in order to make sure that the reader has the relevant literature.
  L47: “their potential” not “its potential." I would consider removing this initial clause in the sentence and starting with “A key challenge…” as this is a more focused start to the paragraph.
  Response: we have removed this phrase and start with “A key challenge” as you suggested.
  L47–48: I think you need a citation here since this thought is informed by previous work.
  Response : We have added the following citations to this line: (De Jonge et al., 2014 ; Naafs et al., 2017 ; Dearing Crampton-Flood, 2020; Martínez-Sosa et al., 2020; Raberg et al., 2022)
  L56–58: I think you mean “The MBT′5Me index is correlated to temperature in lake sediments, peats, and soils [CITATIONS].”
  Response: We have gone and updated this sentence to the following:
  Text added: The MBT′_5ME index has been successfully utilized as grounds for various global temperature calibrations, because of its strong correlation to temperature in modern samples, concerning lakes, peats, and soils (e.g., De Jonge et al., 2014a; Hopmans et al., 2016; Naafs et al., 2017a; Dearing Crampton-Flood et al., 2020; Martínez-Sosa et al., 2021; Véquaud et al., 2022).
  L59–65: Somewhere in here it would be good to mention that MBT′5Me is systematically higher in soils than in lakes. This is usually the major source of concern when it comes to dealing with brGDGTs from potentially mixed sources, at least in lake sediments.
  Response: We agree we have updated the text with the following texts
  Text added: “Provenance changes may introduce bias to temperature reconstructions based on the MBT′_5ME index values generally being higher in soils than in lakes (Pablo Martínez-Sosa et al., 2021). ”
  
  L78: what do you mean by ecological changes? As in bacterial ecology? This may be a word choice issue, I was really surprised to see the word “ecology” here.
  Response: This refers to ecological changes within the wetland, peatland, and lake environments, not to bacterial ecology. However, we understand the confusion, so we have removed the word ecology from this sentence. The sentence now reads: “This paper presents a strategy for identifying provenance changes across”
  
  L99 vs L100: do you mean depositional or provenance? Because those are different things. The words cannot be used interchangeably.
  
  Response: We changed the wording to reflect that these are changes in provenance by adding the following to the sentence: “we demonstrate how these complementary proxies can aid in identifying potential hydrological, ecological, and depositional changes that may cause provenance shifts, introducing bias in brGDGT reconstruction”
  
  Materials and Methods
  L123: I think you mean limited data, not limited publication.
  Response: removed publication and changed to “limited data”
  L135: “A C46 internal standard.”
  Response: This has been updated
  L139: “…the C46 internal standard.”
  Response: This has been updated.
  L176: Check the journal’s referencing policies. I am not sure that “ibid” is permitted within this citation style.
  Response: We have changed to include the reference removing ibid.
  L189–191: This sentence is missing a verb. Please clarify.
  Response: I was unable to find a missing verb in these sentences, in this case “are” and “demonstrated” are the verbs and “functions” is the verb in the next sentence.
  L192: same comment regarding ibid.
  Response: updated
  L196: same comment regarding ibid.
  Response: updated
  Figure 2: consider making the background color for the soil dataset symbol a lighter shade of brown. I found it hard to read the dark font against the dark brown background.
  Response: We have gone ahead and updated this figure to a lighter shade of brown.
  L222: do you mean “periods of mixed brGDGT provenance?”
  Response: Yes, this refers to mixed provenance. We have removed the word “sourcing” and changed it to “provenance”: “thus facilitating the identification of periods of mixed provenance.”
  L244–245: why did you retain the original sample to sample curve?
  Response: We retained the original sample-to-sample curve because the originally published pollen record and water-depth reconstruction cover a 200,000-year sequence, which makes it difficult to see variations within the shorter (36,000 years) brGDGT record. We added additional text to clarify:
  “Instead of applying a smoothing technique to the water-depth reconstruction, as done by Camuera et al., (2019) on the original 200,000-year-old sequence, we retained the original sample-to-sample curve for clarity to compare to the shorter brGDGT sequence.”
  
  L252–254: I am not sure this is necessary to explain.
  Response: Although we agree we do not necessarily need to explain the plots, we have decided to keep the reference and description in the manuscript for clarity and to allow for the proper citations to the programs used for our analysis.
  
  Results
  L279–280: these sentences should be combined.
  Response: We have gone ahead and combined these sentences as the following “The RF model with the SMOTE dataset had the highest accuracy and the lowest Log loss score for sigmoid calibrated probabilities and therefore was chosen for our analysis as the best performing model.”
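
  To illustrate why Log loss is reported alongside accuracy when ranking calibrated models, the following is a minimal Python sketch and not the authors' code. The class names, the probability values, and the helper `log_loss` are hypothetical. It compares two sets of probability estimates that give the same accuracy but different Log loss.

```python
import math


def log_loss(true_labels, prob_rows, eps=1e-15):
    """Mean negative log-probability that the model assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, prob_rows):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_labels)


# Hypothetical true provenance of three samples.
labels = ["lake", "peat", "soil"]

# Model A: correct and confident (0.90 on the true class).
sharp = [
    {"lake": 0.90, "peat": 0.05, "soil": 0.05},
    {"lake": 0.05, "peat": 0.90, "soil": 0.05},
    {"lake": 0.05, "peat": 0.05, "soil": 0.90},
]

# Model B: also correct by argmax, but only 0.50 on the true class.
diffuse = [
    {"lake": 0.50, "peat": 0.30, "soil": 0.20},
    {"lake": 0.30, "peat": 0.50, "soil": 0.20},
    {"lake": 0.20, "peat": 0.30, "soil": 0.50},
]

print(f"Model A log loss: {log_loss(labels, sharp):.3f}")
print(f"Model B log loss: {log_loss(labels, diffuse):.3f}")
```

  Both hypothetical models classify every sample correctly, so accuracy cannot separate them. Log loss is lower for the model whose probabilities are concentrated on the true class.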
  
  Discussion
  L362: I think you mean that the two timeseries are qualitatively similar.
  Response: We have added the following to the text to reflect this change:
  “We evaluated the accuracy of our ML model for detecting provenance change by comparing the probability estimates it produced with published data on pollen, NPPs, XRF and water depth estimates derived from these proxies and to check the that the results for quantitative similarities (Fig. 5 and 6). ”
  L402: italicize in situ
  Response: this has been updated
  L429–431: I am not sure what you mean. The De Jonge et al. 2024 study found that MBT′5Me are generally reproducible between laboratories.
  Response: Thank you for this catch. The sentence is indeed incorrect and phrased strangely. We have removed this sentence from the manuscript.
  L450: the ′ symbol should not be subscripted
  Response: This has been updated
  L468: you can simply report this p-value as “p < 0.001”
  Response: This has been changed to “p < 0.001”
  L475–482: why are some of the brGDGTs written in square brackets sometimes but not other times?
  Response: We have updated the manuscript to include square brackets around all the brGDGTs where appropriate.
  
  Citation: https://doi.org/10.5194/egusphere-2025-526-AC1
RC2:
'Comment on egusphere-2025-526', Anonymous Referee #2, 08 Apr 2025
The manuscript presents an interesting strategy on using machine learning probability estimates and pollen data to understand how source changes affect the distribution of brGDGTs and particularly the MBT’5ME index in sedimentary records. The study applies probability estimates from machine learning models to an extended modern global brGDGT database, to detect brGDGT contributions from different sources (soil, peat, lake) over time in two sediment archives and validate these findings through comparisons with independent pollen and NPP data.
Although the study builds on previous research (Martinez-Sosa et al, 2023), it introduces several novel additions, including: addition of new modern br-GDGT samples to the training dataset, exploration of different probability calibration techniques and proposing a new “brGDGT wetland index”. This index, if validated further, may help distinguish wetland-influenced brGDGTs from other depositional sources, with potential applications beyond brGDGT provenance analysis.
However, I find some aspects unclear or insufficiently explained:
1) The study uses five different machine learning models, but the criteria for selecting these specific models are not well justified. Why were these models chosen over others, such as deep learning approaches or ensemble methods beyond Random Forest? Also, the manuscript could benefit from a discussion of why certain models performed better and why others underperformed.
2) The study primarily focuses on semi-arid and arid regions, but are the results generalizable to other climate settings? Note that one of the tested sediment records extends back to 35k years BP. Also, there is little discussion of how different environmental conditions might affect the model performance.
3) The results rely heavily on probability estimates to detect mixed-source environments, but their uncertainty and confidence intervals are not entirely clear. How robust are these estimates? How reliable would they remain in cases of overlapping depositional influences (ex. lake vs peat transitions)?
4) The study validates machine learning results using pollen and NPP data, but this comparison has uncertainties among which are variations in pollen/spore productivity and dispersal over time. These uncertainties could also affect the reliability of pollen-based reconstructions. A more extended comparison with established proxies of local relevance (e.g., geochemical elemental data, stable isotopes) would strengthen the argument that machine learning provides superior brGDGT provenance detection.
Minor points
Introduction:
Could the authors elaborate more on the specific limitations of past methods in br-GDGT provenance detection?

The discussion on machine learning techniques is somewhat broad; more details on how these approaches differ from traditional statistical methods would be very helpful.

Material and Methods:
The manuscript states that models were trained and tested using a new modern database, but it does not provide sufficient details on data preprocessing and splitting strategies. E.g., what was the rationale behind the dataset division (train-validation-test)?

The use of SMOTE seems suitable for handling class imbalance, but was there any testing to ensure it did not introduce artificial biases?
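
For readers unfamiliar with the technique, the interpolation step behind SMOTE can be sketched in a few lines of Python. This is a simplified, SMOTE-style illustration and not the implementation used in the manuscript. The function name `smote_like`, the two-feature samples, and the parameter values are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Random position on the segment joining base and neighbour.
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples described by two fractional abundances.
minority = [
    [0.10, 0.30],
    [0.12, 0.28],
    [0.20, 0.25],
    [0.15, 0.35],
]

for point in smote_like(minority, n_new=5):
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two real minority samples, so the method adds no values outside the observed range of that class. Any bias would come from over-representing the regions between existing samples, which is what the question above asks the authors to test.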

Results:
The results section provides a detailed evaluation of model performance, but it lacks clarity on what is a meaningful improvement in classification accuracy. For example, what is the practical significance of a 0.72 vs. 0.90 F1 score in this context?
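
As a worked illustration of what such a difference can mean, the Python sketch below converts two F1 scores into misclassification counts. The counts are hypothetical and are not taken from the manuscript. It assumes a class with 100 true samples and equal numbers of false positives and false negatives, in which case F1 equals both precision and recall.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical class with 100 true samples in each scenario.
high = f1_from_counts(tp=90, fp=10, fn=10)  # 10 missed, 10 wrongly assigned
low = f1_from_counts(tp=72, fp=28, fn=28)   # 28 missed, 28 wrongly assigned

print(f"F1 with 10 errors of each kind: {high:.2f}")
print(f"F1 with 28 errors of each kind: {low:.2f}")
```

Under these assumptions, moving from an F1 of 0.90 to 0.72 corresponds to nearly three times as many samples of that class being missed or wrongly assigned.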

The comparison of sigmoid and isotonic calibration functions is interesting, but it is unclear why certain calibrations improved some models but worsened others. More discussion on the underlying reasons for these differences is needed (at least these could be included in the supplementary material).

Discussion:
The proposed "brGDGT wetland index" is an interesting addition, but more validation is required. The assumptions behind the brGDGT wetland index appear to be valid in modern datasets, but their applicability to fossil records is problematic mainly due to the fact that pollen productivity and dispersal (incl. source area) vary over time due to climatic, ecological, and taphonomic factors (incl. differential preservation). Because of these reasons, a water level reconstruction based solely on pollen/spores may also be problematic.
Citation: https://doi.org/10.5194/egusphere-2025-526-RC2
- AC2: 'Reply on RC2', Amy Cromartie, 03 May 2025
  
  Please see the attached pdf for our response
  
  Citation: https://doi.org/10.5194/egusphere-2025-526-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Reconsider after major revisions (16 May 2025) by Petr Kuneš

AR by Amy Cromartie on behalf of the Authors (19 Jun 2025) Author's response Author's tracked changes Manuscript

ED: Publish subject to technical corrections (23 Jun 2025) by Petr Kuneš

AR by Amy Cromartie on behalf of the Authors (01 Jul 2025) Manuscript

Post-review adjustments

AA – Author's adjustment | EA – Editor approval

AA by Amy Cromartie on behalf of the Authors (28 Nov 2025) Author's adjustment Manuscript

EA: Adjustments approved (28 Nov 2025) by Petr Kuneš

Journal article(s) based on this preprint

08 Dec 2025

Utilizing probability estimates from machine learning and pollen to understand the depositional influences on branched GDGT in wetlands, peatlands, and lakes

Biogeosciences, 22, 7687–7708, https://doi.org/10.5194/bg-22-7687-2025, 2025

Supplement

https://doi.org/10.5194/egusphere-2025-526-supplement

Viewed

Total article views: 3,359 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
2,750	510	99	3,359	206	114	133

Views and downloads (calculated since 28 Feb 2025)

Month	HTML	PDF	XML	Total
Feb 2025	68	6	6	80
Mar 2025	346	58	6	410
Apr 2025	146	30	4	180
May 2025	106	26	8	140
Jun 2025	86	6	10	102
Jul 2025	96	14	2	112
Aug 2025	234	18	0	252
Sep 2025	1,050	22	18	1,090
Oct 2025	116	16	2	134
Nov 2025	100	28	8	136
Dec 2025	122	60	14	196
Jan 2026	92	74	4	170
Feb 2026	48	36	8	92
Mar 2026	66	46	4	116
Apr 2026	24	23	3	50
May 2026	32	36	0	68
Jun 2026	7	4	0	11
Jul 2026	11	7	2	20

Viewed (geographical distribution)

Total article views: 3,356 (including HTML, PDF, and XML) Thereof 3,356 with geography defined and 0 with unknown origin.

Latest update: 24 Jul 2026

Download

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary

BrGDGTs are molecular biomarkers utilized for paleotemperature reconstructions. One issue with utilizing brGDGTs, however, is that their distribution differs between sediment environments (i.e., peat, lake, soil), which change over time. We utilize the probability estimate outputs from five machine learning algorithms and a new modern brGDGT database to track these changes, and we apply the models to two downcore records, utilizing pollen and non-pollen palynomorphs to confirm the models' accuracy.

