From text to geoinformation – A modular approach for extraction of disaster information from web text data
Abstract. The implementation of effective disaster management measures requires comprehensive information about a given flooding situation. Text data from web news offer potentially large volumes of information for this purpose. However, the extraction and spatiotemporal analysis of flood event-related information is inherently demanding due to the immense volume of unstructured text. Addressing this challenge, we present a modular and scalable method that allows the extraction of disaster-relevant information from a large text corpus. This is accomplished by combining domain specific entity extraction with dictionaries, a machine learning model for toponym identification, and hand-crafted rules for entity linking in a modular workflow. The extracted information is augmented with geolocations in order to support spatial analysis. Using the West Germany flooding event 2021 as a case study, we evaluate the capacity of our approach to extract relevant geospatial information at a variety of spatial granularity levels and in the form of various thematic descriptors. By doing so, we outline the capabilities and limitations of this approach for text extraction and analysis. Furthermore, we demonstrate the potential for systematic utilization of text data for improved situational awareness and for disaster management support.
This paper presents a workflow for extracting disaster-related information from Web text. The authors demonstrate this workflow in a case study on the 2021 flooding event in West Germany. Overall, the paper reads more like a project report and lacks a clear research contribution.
- Web text is known to have various biases. Depending on the disaster, some aspects may be reported while others are ignored. This study specifically used Web news from the GDELT project, and the extracted information will inherit all these biases. In addition, the NLP and other methods used to process the text have their own algorithmic biases.
- How long is the time delay in the Web news in GDELT? If the news in GDELT is substantially delayed, then the extracted information is unlikely to be useful for disaster response.
- Related to the previous points, how would the information extracted from news be useful for disaster managers? Can the authors provide some concrete examples in which the information extracted from the news was not already known to disaster managers?
- This paper proposes a general workflow for extracting multiple types of information from Web text. However, there is no comparison with previous methods. A more focused research paper would concentrate on extracting one or two types of information (e.g., topics or locations) and compare the proposed method with existing methods.
- Some of the methods used in the workflow seem outdated, such as TF-IDF, which is an old information retrieval technique. It is unclear what the methodological innovation of this paper is.
- The performance of the workflow as reported in Figure 5 is low, with an F1-score of about 0.4. It is unclear whether this workflow can extract information accurately.
- This paper is based on a single case study on a flooding event. A stronger paper would have another case study, ideally on another type of disaster, to demonstrate the generalizability of this workflow.
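
To make the F1-score concern above concrete, the following minimal sketch shows which precision and recall combinations are compatible with an F1-score of 0.4, using the standard definition F1 = 2PR / (P + R). The function names and the precision values are illustrative assumptions; they are not taken from the paper or from Figure 5.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_for_f1(f1: float, precision: float) -> float:
    """Recall implied by a given F1 and precision.

    Rearranged from F1 = 2PR / (P + R):  R = F1 * P / (2P - F1).
    Only defined when 2P > F1.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("No valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


if __name__ == "__main__":
    reported_f1 = 0.4  # approximate value quoted in the review comment
    # Hypothetical precision values, chosen only for illustration.
    for precision in (1.0, 0.6, 0.4, 0.3):
        recall = recall_for_f1(reported_f1, precision)
        print(f"precision={precision:.2f} -> recall={recall:.2f}")
```

Because F1 is a harmonic mean, it cannot exceed the larger of precision and recall and cannot fall below the smaller one. An F1-score of 0.4 therefore means that at least one of the two is no higher than 0.4, and even with perfect precision the recall would be only 0.25. Reporting precision and recall separately would help readers judge which kind of error dominates.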