the Creative Commons Attribution 4.0 License.
Refining data-data and data-model biome comparisons using the Earth Movers' Distance (EMD)
Abstract. Biome reconstructions are commonly used in data-data and data-model comparison studies to understand past vegetation dynamics. However, most of these assessments are based on the direct comparison of dominant biomes inferred from pollen samples or vegetation simulations. Dominant biomes are deduced from pollen samples using biome affinity scores, which aggregate pollen percentages of taxa assigned to the different biomes. While this approach generates good results over a large range of temporal and spatial scales, reducing pollen assemblages to a single dominant biome can substantially simplify the vegetation signal preserved in pollen samples and even bias conclusions when, for instance, minimal changes in pollen percentages can change the inferred dominant biome. To resolve these issues, we propose to use the Earth Movers’ distance (EMD) as a new metric to compare distributions of biome scores. The EMD has two main advantages: 1) the distributions of biome scores do not need to be reduced to their dominant biome, and the full breadth of the data is taken into account, and 2) different weights can be given to different types of disagreements to account for the ecological distance (e.g. reconstructing a temperate forest instead of a boreal forest is ecologically less wrong than reconstructing the temperate forest instead of a desert). We also introduce EMD-based statistical tests that determine if the similarity of two samples is significantly better than a random association. This paper illustrates the use of the EMD across a series of palaeoecological data-data and data-model case studies based on published data and simulations. These applications highlight the diverse types of analysis where the EMD adds value compared to analyses of the dominant biomes only. The EMD and the statistical tests are included in the paleotools R package (https://github.com/mchevalier2/paleotools).
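The two advantages listed in the abstract can be illustrated with a minimal sketch. It is not the paleotools implementation: the function name `emd`, the three-biome list and the ground-distance values are hypothetical and chosen only for illustration, and the scores are assumed to be non-negative with a positive sum, so that they can be normalised to sum to one. The EMD is computed here as a small transportation problem solved by successive shortest augmenting paths.

```python
import math

# Hypothetical biomes and ecological ground distances (illustrative values only).
BIOMES = ["boreal forest", "temperate forest", "desert"]
GROUND_DISTANCE = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]


def emd(p, q, cost):
    """Minimum work needed to turn score distribution p into q under `cost`."""
    n = len(p)
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]  # assumption: scores are normalised to sum to 1
    q = [x / total_q for x in q]

    # Nodes: 0 = source, 1..n = biomes of p, n+1..2n = biomes of q, 2n+1 = sink.
    source, sink, n_nodes = 0, 2 * n + 1, 2 * n + 2
    graph = [[] for _ in range(n_nodes)]

    def add_edge(u, v, capacity, unit_cost):
        # Each edge stores [target, residual capacity, cost, index of reverse edge].
        graph[u].append([v, capacity, unit_cost, len(graph[v])])
        graph[v].append([u, 0.0, -unit_cost, len(graph[u]) - 1])

    for i in range(n):
        add_edge(source, 1 + i, p[i], 0.0)
        add_edge(1 + n + i, sink, q[i], 0.0)
        for j in range(n):
            add_edge(1 + i, 1 + n + j, math.inf, cost[i][j])

    eps = 1e-12
    work = 0.0
    while True:
        # Bellman-Ford search for the cheapest residual path from source to sink.
        dist = [math.inf] * n_nodes
        parent = [None] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            changed = False
            for u in range(n_nodes):
                if dist[u] == math.inf:
                    continue
                for k, (v, capacity, unit_cost, _rev) in enumerate(graph[u]):
                    if capacity > eps and dist[u] + unit_cost < dist[v] - eps:
                        dist[v] = dist[u] + unit_cost
                        parent[v] = (u, k)
                        changed = True
            if not changed:
                break
        if dist[sink] == math.inf:
            return work  # all mass has been moved

        # Find the bottleneck along the path, then push that much mass.
        flow, v = math.inf, sink
        while v != source:
            u, k = parent[v]
            flow = min(flow, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = parent[v]
            graph[u][k][1] -= flow
            graph[v][graph[u][k][3]][1] += flow
            v = u
        work += flow * dist[sink]


# Two samples whose highest-scoring biome differs although the scores barely change.
sample_a = [0.51, 0.49, 0.0]
sample_b = [0.49, 0.51, 0.0]
print(emd(sample_a, sample_b, GROUND_DISTANCE))

# A temperate forest compared with a boreal forest, then with a desert.
print(emd([0, 1, 0], [1, 0, 0], GROUND_DISTANCE))
print(emd([0, 1, 0], [0, 0, 1], GROUND_DISTANCE))
```

With these made-up distances, the first pair yields an EMD of about 0.02 even though the biome with the highest score flips, while the temperate-versus-desert mismatch costs three times as much as the temperate-versus-boreal one.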
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
- RC1: 'Comment on egusphere-2022-489', Anonymous Referee #1, 26 Aug 2022
Publisher’s note: this comment and its supplement were edited on 29 August 2022. The following text is not identical to the original comment, but the adjustments do not affect the scientific meaning.
Please find attached my anonymous comments for the authors.
- AC1: 'Reply on RC1', Manuel Chevalier, 15 Feb 2023
- RC2: 'Comment on egusphere-2022-489', Louis François, 18 Nov 2022
Review of “Refining data-data and data-model biome comparisons using the Earth Movers’ Distance (EMD)” by Chevalier et al.
- Does the paper address relevant scientific questions within the scope of CP?
Yes, this contribution looks promising as a new method to analyse pollen data for the past and compare them with vegetation model reconstructions. It is thus directly connected with key aspects of paleoclimate reconstruction and, hence, falls fully within the scope of CP.
- Does the paper present novel concepts, ideas, tools, or data?
Yes. To my knowledge, the application of the EMD to the analysis of pollen data is completely new. This concept is quite interesting.
- Are substantial conclusions reached?
Yes. The authors prove the applicability of the EMD for analysing and comparing pollen data and vegetation model reconstructions. In particular, they show that the EMD has the advantage of conserving more information from the original pollen data. Reconstructions appear to be more stable through time than with classical methods based on the biome with the highest affinity score, because the latter do not offer a continuous measure of vegetation state (i.e., they yield discrete classes).
- Are the scientific methods and assumptions valid and clearly outlined?
Yes, the main method used, the EMD, is well explained and references are provided. The other methods are statistical analyses, which are relatively well explained.
- Are the results sufficient to support the interpretations and conclusions?
Yes.
- Is the description of experiments and calculations sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results)?
Yes, probably. The authors have even developed an R package that certainly helps with reproducing their results, as well as with the analysis of other pollen data.
- Do the authors give proper credit to related work and clearly indicate their own new/original contribution?
Yes.
- Does the title clearly reflect the contents of the paper?
Yes.
- Does the abstract provide a concise and complete summary?
Yes.
- Is the overall presentation well structured and clear?
Yes, it is very well-organised and generally clear.
- Is the language fluent and precise?
I am not a native English speaker, but the language looks fine to me.
- Are mathematical formulae, symbols, abbreviations, and units correctly defined and used?
Yes.
- Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated?
The paper has a reasonable length. All parts look necessary. The description of Test 2 (lines 201-216) could be improved. It is difficult to read, especially the method used to establish the second EMD distribution.
- Are the number and quality of references appropriate?
Yes.
- Is the amount and quality of supplementary material appropriate?
Yes.
Comments
In their paper, Chevalier et al. establish a new method based on the EMD to analyse pollen data (changes in space and time) or compare them to model vegetation reconstructions. This method is novel, quite interesting and promises to be of broad applicability. The paper is well structured and very well written. It can be published after very minor revision. I have only a few remarks and suggestions that the authors could consider in their revision:
1. The concept of biome is integrative, i.e., it is used to represent (within classes) the overall vegetation present at a given location. So, only one biome should exist at a given location. Thus, the words “dominant biome” should be avoided and replaced by something like “biome with the highest affinity score” (with the pollen data).
2. The authors claim that biomes are discrete quantities and, for that reason, the method based on the biome with the highest affinity score is presented as less robust than the use of the EMD, which is more continuous. However, with the EMD, the authors still use the biome concept and its affinity scores. So, some of the “discontinuities” associated with the discrete definition of biomes remain, especially when mega-biomes are used, as is done here. Actually, the EMD method developed here could equivalently be applied using plant functional types (PFTs) and PFT scores rather than biomes and biome scores. For instance, Henrot et al. (2017; Palaeogeogr. Palaeoclim. Palaeoecol. 467, 95-119) compared model reconstructions with vegetation data at the level of PFTs, i.e., using PFT scores. This retains more information from the original pollen data, which are provided at the genus level. In this case, biome maps are only created to illustrate or capture the vegetation distribution in a single map.
3. The EMD method makes it possible to measure a (continuous) distance in the multidimensional phase space defined by the scores of the different biomes. You can thus show, for instance (as in Figure 6), how the distance to a present-day biome has varied in the past. However, the biome phase space is multidimensional and the distance does not tell you in which direction you move. Are you moving towards more forests (and, if so, towards which type of forest) or towards more grasslands or deserts? This information is quite important to characterize past vegetation. So, the EMD alone is not sufficient to reconstruct past vegetation precisely. It must be combined with a measure of direction in the phase space. In the method presented here, this role is played by the change in the biome with the highest affinity score (i.e., the classical method). But could it be improved to achieve a continuous method to evaluate such directions?
4. I have not found many typos. Only on line 436, “toa” does not make sense. I guess the authors meant to write: “… as a metric to support …”
The authors may want to discuss topics (2) and (3) in the discussion section.
Citation: https://doi.org/10.5194/egusphere-2022-489-RC2
- AC2: 'Reply on RC2', Manuel Chevalier, 15 Feb 2023
Viewed
- HTML: 413
- PDF: 172
- XML: 15
- Total: 600
- BibTeX: 7
- EndNote: 8
Anne Dallmeyer
Nils Weitzel
Chenzhi Li
Jean-Philippe Baudouin
Ulrike Herzschuh
Xianyong Cao
Andreas Hense