https://doi.org/10.5194/egusphere-2022-535
05 Jul 2022

Potential of natural language processing for metadata extraction from environmental scientific publications

Guillaume Blanchy, Lukas Albrecht, John Koestel, and Sarah Garré

Abstract. Climate change will most likely lead to an increase in extreme weather events, including heavy rainfall with soil surface runoff and erosion. Adapting agricultural management practices to increase the infiltration capacity of soils has the potential to mitigate these risks. However, the effects of agricultural management practices (tillage, cover crops, amendment, …) on soil variables (hydraulic conductivity, aggregate stability, …) often depend on the pedo-climatic context. Hence, the only way to gather the information needed to advise stakeholders on suitable management practices is to quantify such dependencies using meta-analyses of studies investigating this topic. As a first step, structured information needs to be extracted from scientific publications to build a meta-database, which can then be analyzed so that recommendations can be given depending on the pedo-climatic context.

Building such a database manually by going through all publications is very time-consuming. Given the increasing amount of literature, this task is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, we used two corpora of documents, referred to below as the OTIM corpus and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. We focused on three NLP techniques: topic modeling, tailored regular expressions and dictionaries, and the shortest dependency path. First, we used topic modeling to sort the individual source publications of the Meta corpus into 6 topics (e.g. related to cover crops, biochar, …) with a coherence metric Cv ranging from 0.7 to 0.9. Then, we used tailored regular expressions and dictionaries to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus. We found that, depending on the type of information, 56 % to 100 % of all relevant information was retrieved (recall), with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of practice keywords (e.g. ‘biochar’, ‘zero tillage’, …) and soil variables (e.g. ‘soil aggregate’, ‘hydraulic conductivity’, ‘crop yield’, …) from the abstracts of the Meta corpus source publications, using the shortest dependency path between them. These relationships were further classified as positive, negative or absent correlations between the driver and the soil property. This quickly provided an overview of the different driver-variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks such as selecting relevant publications on a topic, extracting specific information to build databases for meta-analysis, and providing an overview of relationships found in the corpus. While human supervision remains essential, NLP methods have the potential to support fully automated evidence synthesis that can be continuously updated as new publications become available.
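
As a rough illustration of the second technique (tailored regular expressions) and of how recall and precision are computed, a minimal Python sketch follows. It is not the authors' code: the example sentence, both patterns, and the helper names `extract_rainfall` and `precision_recall` are hypothetical, chosen only to show why a generic pattern can lose precision and why tailoring it to its context can help.

```python
import re

# Hypothetical example text, invented for illustration only.
TEXT = (
    "The mean annual rainfall is 650 mm. "
    "Infiltration was measured with a disk of 200 mm diameter."
)

# Generic pattern (assumption): any number followed by "mm".
GENERIC_RE = re.compile(r"(\d+(?:\.\d+)?)\s*mm\b")

# Tailored pattern (assumption): the number must follow "rainfall" or
# "precipitation" within at most 40 non-digit characters.
TAILORED_RE = re.compile(
    r"(?:rainfall|precipitation)\D{0,40}?(\d+(?:\.\d+)?)\s*mm\b",
    re.IGNORECASE,
)


def extract_rainfall(pattern, text):
    """Return the set of values (as floats) captured by the pattern."""
    return {float(value) for value in pattern.findall(text)}


def precision_recall(extracted, expected):
    """Compare extracted values with manually annotated ones.

    precision = correct extractions / all extractions
    recall    = correct extractions / all expected values
    """
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Value a human annotator would record for this text (hypothetical).
EXPECTED = {650.0}

generic = extract_rainfall(GENERIC_RE, TEXT)
tailored = extract_rainfall(TAILORED_RE, TEXT)

print("generic :", precision_recall(generic, EXPECTED))
print("tailored:", precision_recall(tailored, EXPECTED))
```

On this sentence the generic pattern also captures the disk diameter, so its precision is 0.5 while recall stays at 1.0; the tailored pattern reaches 1.0 for both. Real publications are far more varied, which is why the recall and precision reported above differ between types of information.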

Journal article(s) based on this preprint

14 Mar 2023
Potential of natural language processing for metadata extraction from environmental scientific publications
Guillaume Blanchy, Lukas Albrecht, John Koestel, and Sarah Garré
SOIL, 9, 155–168, https://doi.org/10.5194/soil-9-155-2023, 2023

Interactive discussion

Status: closed

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2022-535', Anonymous Referee #1, 10 Aug 2022
    • AC1: 'Reply on RC1', Guillaume Blanchy, 12 Jan 2023
  • RC2: 'Comment on egusphere-2022-535', Anonymous Referee #2, 28 Nov 2022
    • AC2: 'Reply on RC2', Guillaume Blanchy, 12 Jan 2023

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
ED: Publish subject to minor revisions (review by editor) (13 Jan 2023) by Olivier Evrard
AR by Guillaume Blanchy on behalf of the Authors (27 Jan 2023): Author's response, Author's tracked changes, Manuscript
ED: Publish as is (27 Jan 2023) by Olivier Evrard
ED: Publish as is (03 Feb 2023) by Kristof Van Oost (Executive editor)
AR by Guillaume Blanchy on behalf of the Authors (13 Feb 2023): Manuscript

Model code and software

NLP jupyter notebooks Guillaume Blanchy https://github.com/climasoma/nlp

Viewed

Total article views: 550 (including HTML, PDF, and XML)
  • HTML: 343
  • PDF: 190
  • XML: 17
  • Total: 550
  • BibTeX: 5
  • EndNote: 5
Views and downloads, and cumulative views and downloads (calculated since 05 Jul 2022)

Viewed (geographical distribution)

Total article views: 492 (including HTML, PDF, and XML), thereof 492 with geography defined and 0 with unknown origin.
Latest update: 07 Oct 2023

The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.

Short summary
Adapting agricultural practices to future climatic conditions requires synthesizing the effects of management practices on soil properties with respect to local soil and climate. This study showcases different automated text processing methods to identify topics, extract metadata for building a database, and summarize findings from publication abstracts. While human intervention remains essential, these methods show great potential to support evidence synthesis from large numbers of publications.