05 Jul 2022
Status: this preprint is open for discussion.

Potential of natural language processing for metadata extraction from environmental scientific publications

Guillaume Blanchy1, Lukas Albrecht2, John Koestel2,3, and Sarah Garré1
  • 1Flanders Research Institute for Agriculture, Fisheries and Food (ILVO), Melle, Belgium
  • 2Agroscope, Reckenholzstrasse 191, 8046 Zürich, Switzerland
  • 3Institute for Soil and Environment, Swedish University of Agricultural Sciences, Box 7014, 75007 Uppsala, Sweden

Abstract. Climate change will most likely lead to an increase in extreme weather events, including heavy rainfall with soil surface runoff and erosion. Adapting agricultural management practices to increase the infiltration capacity of soil has the potential to mitigate these risks. However, the effects of agricultural management practices (tillage, cover crops, amendment, …) on soil variables (hydraulic conductivity, aggregate stability, …) often depend on the pedo-climatic context. Hence, the only way to gather the information needed to advise stakeholders on suitable management practices is to quantify such dependencies using meta-analyses of studies investigating this topic. As a first step, structured information needs to be extracted from scientific publications to build a meta-database, which can then be analyzed to derive recommendations that depend on the pedo-climatic context.

Manually building such a database by going through all publications is very time-consuming. Given the increasing amount of literature, this task is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, two corpora of documents were used, which we refer to as the OTIM and the Meta corpus in the following. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements. The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. We focused on three NLP techniques: topic modeling, tailored regular expressions and dictionaries, and the shortest dependency path. We used topic modeling to sort the individual source publications of the Meta corpus into 6 topics (e.g. related to cover crops, biochar, …) with a coherence metric Cv ranging from 0.7 to 0.9. Then, we used tailored regular expressions and dictionaries to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus. Depending on the type of information, recall (the share of all relevant information that was retrieved) ranged from 56 % to 100 % and precision from 83 % to 100 %. Finally, we extracted relationships between a set of practice keywords (e.g. ‘biochar’, ‘zero tillage’, …) and soil variables (e.g. ‘soil aggregate’, ‘hydraulic conductivity’, ‘crop yield’, …) from the abstracts of the source publications of the Meta corpus using the shortest dependency path between them. These relationships were further classified as positive, negative or absent correlations between the driver (practice) and the soil variable.
This quickly provided an overview of the different driver-variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks such as selecting relevant publications on a topic, extracting specific information to build databases for meta-analysis and providing an overview of relationships found in the corpus. While human supervision remains essential, NLP methods have the potential to support fully automated evidence synthesis that can be continuously updated as new publications become available.
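
To make the recall and precision figures concrete, the following minimal Python sketch extracts disk diameters with a regular expression and scores the result against a hand-labelled reference set. It is an illustration only, not the authors' implementation: the pattern `DIAMETER_RE`, the functions `extract_disk_diameters_cm` and `precision_recall`, the sample sentence and the reference values are all hypothetical.

```python
import re

# Hypothetical pattern covering two phrasings only:
#   "20 cm diameter disc"  and  "disk diameter of 45 mm"
DIAMETER_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(mm|cm)\s+(?:diameter\s+)?dis[ck]"
    r"|dis[ck]\s+diameter\s+of\s+(\d+(?:\.\d+)?)\s*(mm|cm)",
    re.IGNORECASE,
)


def extract_disk_diameters_cm(text):
    """Return the set of disk diameters found in `text`, converted to cm."""
    found = set()
    for m in DIAMETER_RE.finditer(text):
        # Groups 1-2 belong to the first phrasing, groups 3-4 to the second.
        if m.group(1) is not None:
            value, unit = m.group(1), m.group(2)
        else:
            value, unit = m.group(3), m.group(4)
        cm = float(value) / 10 if unit.lower() == "mm" else float(value)
        found.add(round(cm, 2))
    return found


def precision_recall(extracted, expected):
    """Precision: share of extracted items that are correct.
    Recall: share of expected items that were extracted."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up sentence and made-up manual labels (the 8 cm disk is assumed to be
# reported elsewhere in the paper in a form the pattern does not cover).
sample = (
    "Infiltration was measured with a 20 cm diameter disc. "
    "A second disk diameter of 45 mm was used on stony plots."
)
expected = {20.0, 4.5, 8.0}

extracted = extract_disk_diameters_cm(sample)
precision, recall = precision_recall(extracted, expected)
print(sorted(extracted), precision, round(recall, 2))  # [4.5, 20.0] 1.0 0.67
```

In this toy case every extracted value is correct (precision 1.0), but one labelled value is missed (recall 2/3), which mirrors the pattern reported above of high precision with more variable recall.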

Status: open (until 06 Dec 2022)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on egusphere-2022-535', Anonymous Referee #1, 10 Aug 2022
  • RC2: 'Comment on egusphere-2022-535', Anonymous Referee #2, 28 Nov 2022

Model code and software

NLP Jupyter notebooks, Guillaume Blanchy

Latest update: 29 Nov 2022
Short summary
Adapting agricultural practices to future climatic conditions requires synthesizing the effects of management practices on soil properties with respect to local soil and climate. This study showcases different automated text processing methods to identify topics, extract metadata for building a database and summarize findings from publication abstracts. While human intervention remains essential, these methods show great potential to support evidence synthesis from large numbers of publications.