the Creative Commons Attribution 4.0 License.
Potential of natural language processing for metadata extraction from environmental scientific publications
Abstract. Climate change will most likely lead to an increase in extreme weather events, including heavy rainfall with soil surface runoff and erosion. Adapting agricultural management practices so that they increase the infiltration capacity of soil has the potential to mitigate these risks. However, the effects of agricultural management practices (tillage, cover crops, amendment, …) on soil variables (hydraulic conductivity, aggregate stability, …) often depend on the pedo-climatic context. Hence, the only way to gather the information needed to advise stakeholders on suitable management practices is to quantify such dependencies using meta-analyses of studies investigating this topic. As a first step, structured information needs to be extracted from scientific publications to build a meta-database, which can then be analyzed so that recommendations can be given depending on the pedo-climatic context.
Manually building such a database by going through all publications is very time-consuming. Given the increasing amount of literature, this task is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, two corpora of documents were used, which we refer to as the OTIM corpus and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. We focused on three NLP techniques: topic modeling, tailored regular expressions and dictionaries, and the shortest dependency path. We used topic modeling to sort the individual source publications of the Meta corpus into 6 topics (e.g. related to cover crops, biochar, …) with a coherence metric Cv ranging from 0.7 to 0.9. Then, we used tailored regular expressions and dictionaries to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus. We found that the respective information could be retrieved with a recall (the share of all relevant information that was retrieved) between 56 % and 100 % and with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of practice keywords (e.g. ‘biochar’, ‘zero tillage’, …) and soil variables (e.g. ‘soil aggregate’, ‘hydraulic conductivity’, ‘crop yield’, …) from the abstracts of the source publications of the Meta corpus using the shortest dependency path between them. These relationships were further classified as positive, negative or absent correlations between the driver and the soil property.
This quickly provided an overview of the different driver-variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks such as selecting relevant publications on a topic, extracting specific information to build databases for meta-analysis and providing an overview of relationships found in the corpus. While human supervision remains essential, NLP methods have the potential to support fully automated evidence synthesis that can be continuously updated as new publications become available.
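The shortest-dependency-path idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sentence, its hand-written dependency edges, the word lists `POSITIVE` and `NEGATIVE`, and the function names are all assumptions made for illustration. A real workflow would obtain the dependency edges from a parser.

```python
from collections import deque

# Hypothetical, hand-written dependency parse of the invented sentence
# "Biochar application increased the hydraulic conductivity of sandy soils."
# Each pair is (head, dependent).
EDGES = [
    ("increased", "application"),
    ("application", "Biochar"),
    ("increased", "conductivity"),
    ("conductivity", "the"),
    ("conductivity", "hydraulic"),
    ("conductivity", "of"),
    ("of", "soils"),
    ("soils", "sandy"),
]

# Illustrative direction words (an assumption, not the authors' lexicon).
POSITIVE = {"increased", "improved", "enhanced"}
NEGATIVE = {"decreased", "reduced", "lowered"}


def build_graph(edges):
    """Turn (head, dependent) pairs into an undirected adjacency map."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    return graph


def shortest_dependency_path(graph, source, target):
    """Breadth-first search; returns the tokens from source to target, or None."""
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


def classify(path):
    """Label a path by the direction words it contains (positive checked first)."""
    words = {token.lower() for token in path}
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "absent"


graph = build_graph(EDGES)
path = shortest_dependency_path(graph, "Biochar", "conductivity")
print(path, classify(path))
```

Only the tokens that lie on the path between the practice keyword and the soil variable are inspected, so modifiers elsewhere in the sentence (here "sandy soils") do not influence the label.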
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2022-535', Anonymous Referee #1, 10 Aug 2022
Interesting study regarding the use of natural language processing methods to extract information from the growing volume of scientific literature. The authors not only illustrate the use of different algorithms but also try to evaluate them numerically. In general, a well written manuscript. However, I think there is a lack of discussion and some of their objectives/aims are weakly met. The "relationship extraction" section is interesting and well written and the authors might want to put the same effort in the rest of the sections.
Comments
- Abstract: The beginning abstract seems a bit disconnected with the rest of the manuscript. Climate change is a hot topic but the paper itself is not related to that. I would suggest re-framing the abstract to match the content of the manuscript.
- Assessing the ability of an algorithm such as regex: I find this evaluation a bit strange. The algorithm itself is infallible in the sense that it always finds what you tell it to find if it is present in the text. The algorithm is only restricted by the capacity of the user to generate valid regular expressions.
- Topic modelling: There is no discussion.
- How did you achieve your second aim (to illustrate the ability of topic classification to classify a new paper as relevant to a given topic)?
- You mention that topic modelling "can help identify knowledge gaps". How? Did you find any? If your aim is to present a practical workflow, perhaps you should guide the user to achieve that.
- Why did you select 6 topics instead of 9? You only mention that you are trying to maximise the coherence, which is higher for 9 topics.
- How might the number of topics affect your workflow? Is selecting the highest coherence score infallible?
- Could you elaborate on how excluding monograms increased the coherence? From the term frequencies (Fig 7) I do not see many soil related terms, which seems strange. Perhaps they were ignored since they appeared as monograms? I do agree that bi and even trigrams are important but I have usually seen them added to a selection of monograms.
Citation: https://doi.org/10.5194/egusphere-2022-535-RC1
AC1: 'Reply on RC1', Guillaume Blanchy, 12 Jan 2023
General:
Interesting study regarding the use of natural language processing methods to extract information from the growing volume of scientific literature. The authors not only illustrate the use of different algorithms but also try to evaluate them numerically. In general, a well written manuscript. However, I think there is a lack of discussion and some of their objectives/aims are weakly met. The "relationship extraction" section is interesting and well written and the authors might want to put the same effort in the rest of the sections.
We appreciate that you find the study interesting and we thank you for your useful comments on the content that will help to improve the manuscript. We would like to state that the primary aim of the study was to demonstrate a practical workflow of several NLP techniques for summarising a large body of scientific literature. This was not properly reflected in the aims of our study. We will modify the aims accordingly in the revised version of the manuscript.
We acknowledge that the “topic analysis” part is less developed and only weakly addresses objective 2, namely assessing whether a paper is relevant to a topic. We therefore plan to restructure the content around topic classification in the manuscript. Instead of classifying “new papers” into different topics, we will now demonstrate how to identify groups of manuscripts (in our case, groups around different types of “agricultural practices”) and observe which groups are less represented (or absent). In this way, we can show which practices are less studied and identify possible knowledge gaps. This also serves as a first classification to identify, for instance, the topics for which a meta-analysis would be well suited.
Specific comments:
- Abstract: The beginning abstract seems a bit disconnected with the rest of the manuscript. Climate change is a hot topic but the paper itself is not related to that. I would suggest re-framing the abstract to match the content of the manuscript.
We will rephrase the abstract so that the main focus is on NLP techniques for summarising a large body of scientific environmental literature, and then present the OTIM and Meta corpora as a case study to which we applied these techniques.
- Assessing the ability of an algorithm such as regex: I find this evaluation a bit strange. The algorithm itself is infallible in the sense that it always finds what you tell it to find if it is present in the text. The algorithm is only restricted by the capacity of the user to generate valid regular expressions.
We agree that the regex algorithm is infallible; in this case, however, we want to estimate how well user-defined regexes are able to recover specific information. We will make clear in the manuscript that we do not assess the ability of the regex algorithm but rather the ability of the user-generated regular expressions to match relevant content, considering the trade-off between their generality and their specificity.
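The trade-off between generality and specificity can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the text snippet, the two patterns `GENERAL` and `SPECIFIC`, and the expected values are invented for this example and are not taken from the OTIM corpus or from the authors' notebooks.

```python
import re

# Invented example text; the rainfall values to recover are 650 and 820.
TEXT = (
    "Mean annual rainfall was 650 mm. "
    "Infiltration was measured with a disk of 200 mm diameter. "
    "The site receives 820 mm of precipitation per year."
)
EXPECTED = {"650", "820"}

# General pattern: any number followed by "mm" (also catches the disk diameter).
GENERAL = re.compile(r"(\d+(?:\.\d+)?)\s*mm\b")
# Specific pattern: the number must directly follow "rainfall was/of/is".
SPECIFIC = re.compile(r"rainfall\s+(?:was|of|is)\s+(\d+(?:\.\d+)?)\s*mm\b")


def extract(pattern, text):
    """Return the set of captured values for all matches of pattern in text."""
    return {match.group(1) for match in pattern.finditer(text)}


def precision_recall(extracted, expected):
    """Precision and recall of extracted against the manually labelled set.

    Empty denominators are reported as 0.0 here, which is a convention
    chosen for this sketch.
    """
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


for name, pattern in [("general", GENERAL), ("specific", SPECIFIC)]:
    found = extract(pattern, TEXT)
    precision, recall = precision_recall(found, EXPECTED)
    print(f"{name}: found={sorted(found)} precision={precision:.2f} recall={recall:.2f}")
```

On this snippet the general pattern recovers both rainfall values but also the disk diameter, which lowers precision, while the specific pattern returns only a correct value but misses the second one, which lowers recall.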
- Topic modelling: There is no discussion.
Further discussion will be added, especially on how topic classification can be used as one of the first steps of the presented semi-automated NLP workflow for information summary and identifying groups of abundant literature where a meta-analysis can be useful.
- How did you achieve your second aim (to illustrate the ability of topic classification to classify a new paper as relevant to a given topic)?
(see general comment)
- You mention that topic modelling "can help identify knowledge gaps". How? Did you find any? If your aim is to present a practical workflow, perhaps you should guide the user to achieve that.
We agree that a practical interpretation will be a useful addition to the manuscript. We will give a few examples in the manuscript and develop how we identify them.
- Why did you select 6 topics instead of 9? You only mention that you are trying to maximise the coherence, which is higher for 9 topics.
That is a fair point and will be corrected in the next version of the manuscript.
- How might the number of topics affect your workflow? Is selecting the highest coherence score infallible?
It is not infallible. We found that choosing any number of topics between 6 and 9 tends to lead to the same groups. The variability in coherence for a given number of topics can be large, especially for a relatively small corpus such as ours. This will be discussed in the revised version of the manuscript.
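One way to take this variability into account is to repeat the topic model several times per number of topics and to treat every number of topics whose mean coherence lies within one standard deviation of the best mean as equally acceptable. The Python sketch below illustrates that rule. The rule itself, the function names and the coherence values are assumptions made for illustration: the values are placeholders, not results from the study.

```python
from statistics import mean, stdev


def summarise(scores_by_k):
    """Mean and sample standard deviation of coherence per number of topics."""
    return {k: (mean(values), stdev(values)) for k, values in scores_by_k.items()}


def candidates_within_one_sd(scores_by_k):
    """Numbers of topics whose mean coherence is within one SD of the best mean."""
    summary = summarise(scores_by_k)
    best_k = max(summary, key=lambda k: summary[k][0])
    best_mean, best_sd = summary[best_k]
    threshold = best_mean - best_sd
    return sorted(k for k, (avg, _) in summary.items() if avg >= threshold)


# Placeholder coherence scores from three hypothetical repeated runs per k.
PLACEHOLDER_SCORES = {
    3: [0.50, 0.55, 0.52],
    6: [0.78, 0.70, 0.82],
    9: [0.80, 0.74, 0.86],
}

print(candidates_within_one_sd(PLACEHOLDER_SCORES))
```

With these placeholder values, 9 topics has the highest mean coherence, but 6 topics falls within one standard deviation of it, so both would be retained as candidates and the final choice could rest on how interpretable the resulting groups are.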
- Could you elaborate on how excluding monograms increased the coherence? From the term frequencies (Fig 7) I do not see many soil related terms, which seems strange. Perhaps they were ignored since they appeared as monograms? I do agree that bi and even trigrams are important but I have usually seen them added to a selection of monograms.
In our case, including monograms led words like ‘soil’, ‘treatment’, ‘water’, ‘crop’ or ‘tillage’ to appear prominently in the different topics. This made it harder to differentiate the topics, and the average topic coherence in this case was Cv = 0.4. With only bi-grams, some of these words carried more meaning (“conventional tillage”, “soil water”, “cover crop”) and made it easier to see what each topic is about. This is why, in this case, we preferred to use only bi-grams. The remark is a good point, and we recognize that adding monograms, as seen in other work, can sometimes help. This will be discussed in the revised manuscript.
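The difference between monogram and bi-gram terms can be shown with a short Python sketch. It only illustrates how the candidate terms are built, not the topic model itself, and the two example sentences and the names `DOCS` and `bigrams` are invented for illustration.

```python
from collections import Counter

# Invented example sentences, already lower-cased and free of punctuation.
DOCS = [
    "cover crop residues increased soil water under conventional tillage",
    "conventional tillage reduced soil water compared with a cover crop",
]


def bigrams(tokens):
    """Join each pair of adjacent tokens into one bi-gram term."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


unigram_counts = Counter(token for doc in DOCS for token in doc.split())
bigram_counts = Counter(term for doc in DOCS for term in bigrams(doc.split()))

print(unigram_counts.most_common(5))
print(bigram_counts.most_common(5))
```

The monogram counts are dominated by generic words such as "soil" or "crop", whereas the bi-gram counts surface terms such as "cover crop", "soil water" and "conventional tillage", which say more about what a group of documents is about.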
Citation: https://doi.org/10.5194/egusphere-2022-535-AC1
RC2: 'Comment on egusphere-2022-535', Anonymous Referee #2, 28 Nov 2022
General:
Overall this manuscript fits well with SOIL, and the methodology as well as the results will be of interest to readers. The nature of the study, involving "natural language processing for metadata extraction from environmental {soil} scientific publications" is inherently multidisciplinary, and complex! The necessary methods are well discussed and well referenced, and the appendix of the NLP software will be a big help to researchers in this field. The results relating agricultural practices and soil and site properties are novel and important.
Specific:
Most SOIL readers are probably substantially unfamiliar with NLP and would benefit from more focused guidance by the authors, which can be accomplished perhaps most easily by a trimmed revision. For example the Abstract is overly complex; the Introduction states the objectives of the study on just four lines 96-100, and a trimmed Abstract could focus simply on the achieving of the objectives.
The Material and Methods section is appropriately long, given the emphasis on methods, but could be edited to be more uniformly coherent. Perhaps part of that could be fixed by reformatting the variety of figures, and relegating some of them to just the appendix.
Most of the figures in the Results section are important, but much of the other discussions in Results are really recommendations and can be eliminated or partly moved to Conclusions.
Technical:
I see Reviewer #1 listed some technical issues, most of which I believe can be handled by trimming as suggested.
Citation: https://doi.org/10.5194/egusphere-2022-535-RC2
AC2: 'Reply on RC2', Guillaume Blanchy, 12 Jan 2023
General:
Overall this manuscript fits well with SOIL, and the methodology as well as the results will be of interest to readers. The nature of the study, involving "natural language processing for metadata extraction from environmental {soil} scientific publications" is inherently multidisciplinary, and complex! The necessary methods are well discussed and well referenced, and the appendix of the NLP software will be a big help to researchers in this field. The results relating agricultural practices and soil and site properties are novel and important.
We appreciate that you find this manuscript well suited to the journal SOIL and, more specifically, to a multi-disciplinary topic related to agricultural practices. We are also glad to hear that our effort towards a reproducible workflow (by means of notebooks and a GitHub repository) is acknowledged.
Specific:
Most SOIL readers are probably substantially unfamiliar with NLP and would benefit from more focused guidance by the authors, which can be accomplished perhaps most easily by a trimmed revision. For example the Abstract is overly complex; the Introduction states the objectives of the study on just four lines 96-100, and a trimmed Abstract could focus simply on the achieving of the objectives.
Agree. As mentioned in reply to RC1, we will refocus the abstract around “NLP techniques” and the objectives we want to address in this work. Additionally, we will make sure that the NLP specific language is explained and simplified to make the abstract accessible to most.
The Material and Methods section is appropriately long, given the emphasis on methods, but could be edited to be more uniformly coherent. Perhaps part of that could be fixed by reformatting the variety of figures, and relegating some of them to just the appendix.
Figure 3 and Table 2 will be put in appendix to ease the flow through the Material and Methods section.
Most of the figures in the Results section are important, but much of the other discussions in Results are really recommendations and can be eliminated or partly moved to Conclusions.
Thank you for the feedback. We will edit the results and discussion accordingly and move recommendations to the conclusions section.
Technical:
I see Reviewer #1 listed some technical issues, most of which I believe can be handled by trimming as suggested.
See reply to RC1.
Citation: https://doi.org/10.5194/egusphere-2022-535-AC2
Model code and software
NLP Jupyter notebooks, Guillaume Blanchy, https://github.com/climasoma/nlp
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
343 | 190 | 17 | 550 | 5 | 5 |
Lukas Albrecht
John Koestel
Sarah Garré