Unravelling information on impactful geo-hydrological hazard events with HazMiner, a multilingual text mining method developed through a global scale coverage application
Abstract. The incidence and impacts from geo-hydrological hazards (GH) such as floods, flash floods and landslides are changing globally due to anthropogenic environmental changes and increased exposure driven by population growth. Reliable datasets on GH are essential to deepen our understanding of these hazards and their impacts. However, existing GH datasets contain data gaps leading to biased interpretations, especially in the Global South where populations are commonly the most impacted. Text mining offers new opportunities in documenting GH by automatically extracting information from large text corpora. Despite its potential, current methodologies are not adapted to improve documentation in data-scared contexts. We present HazMiner, a paragraph-based text mining method designed to document the location, timing and impact of GH through large language models across multiple languages and at various scales. Applied here globally on 6,366,905 news articles published from 2017 through 2024 in 58 languages, HazMiner extracted 21,411 flood, 7,659 landslide and 3,606 flash flood events with known location and time information and, in some cases, impact data. Compared to existing hazard datasets, HazMiner significantly improved hazard documentation, reducing the data gaps in many regions, especially in the Global South. The new versatile multilingual method and its dataset advances both text mining and natural hazard research.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Natural Hazards and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
General Comments
This paper addresses a pressing research gap: the creation of reliable and validated (global) text-based datasets for natural hazard impact and adaptation research. The authors propose and implement an automated pipeline with 14 steps to cluster news paragraphs into temporally and geographically thematic clusters for three types of hazards: floods, flash floods and landslides. The work demonstrates a remarkable effort in coordinating all processing steps for large-scale multilingual text data and analyzing the resulting structures. Importantly, known limitations and biases are transparently and extensively discussed in Section 4. In that regard, it is a valuable contribution with good potential to pave the way for relevant further research in various fields. Still, a few drawbacks must be pointed out and require improvement, or at least a more detailed discussion of consequences and possible solutions.
Specific Comments
Clarifications / corrections needed with some suggestions
Abstract
- l1: Since the number of pages is not critically limited, the abbreviation GH could be avoided throughout the whole text. The full version "geo-hydrological hazards" does not take that much extra space and really facilitates reading.
- l15: What is "data-scared"? If this is a typo for "data-scarce", it should be corrected.
- l18: The 58 languages should be listed in the appendix.
- l20: The work does not really advance the text mining field itself, as no new text mining techniques or findings are presented.
Introduction
- Almost 2.5 pages is quite long for an introduction. It mentions many related works that deserve a dedicated related work section. I suggest splitting this part into two sections, one with a compact overview of the paper and the other for the list of related work, ideally extended to account for the various tasks in the pipeline.
- l63: Text mining methods are not perfect and cannot always extract the right information. It is important to phrase it in a way that makes it clear that these techniques do not always work.
- l65: And disadvantages.
- l74: LLMs do not "understand" texts. I recommend avoiding verbs related to human cognition to refer to how LLMs process language.
- l75-78: LLMs are mentioned before GPT and BERT, which makes it sound as if these two appeared after the popularization of LLMs. I suggest reversing the order of these sentences. First, refer to the advent of Transformer-based architectures like GPT and BERT, then to the popularization of LLMs as a tool for many NLP tasks.
- l82: The "While only a few..." needs to be connected to another sentence.
- l94: It is ambiguous whether "the dataset" refers to GDELT or your dataset.
- l96: The "Followed by..." needs to be connected to another sentence.
Methodology
- It is a matter of personal taste, but since Figure 1 already shows an overview of the pipeline, it would be better if the description of each subtask was already accompanied by its own evaluation. It is hard to read and move on without knowing how well each model performs at each step. Then Section 3 could focus only on describing the resulting dataset.
- l109: What does it mean that HazMiner was "specifically" configured to extract GH events, isn't it its main purpose?
- Figure 2: "hazardous text" sounds like the text itself is hazardous.
- Figure 2: An arrow connecting the output of a step to the input of the other would be useful to avoid the reliance on color, which may not be visible for color-blind readers.
- 2.1 should be named "text selection" or "document selection", as it involves steps that are not about extraction.
- l115: Subtasks rather than subprocesses.
- l116: Scraping news is usually not allowed by media outlet websites. Was it performed in accordance with the terms of use?
- l119: The simple heuristic of using line breaks to identify paragraphs can be problematic, as there is no guarantee that all websites follow this rule, and this information may be lost if the raw texts were already altered by news aggregators. This step should thus also be evaluated.
- l121: It is not really more efficient because it makes the number of instances to process much larger and removes important context needed for disambiguation.
- l122: It's not true that current LLMs are optimized for shorter inputs (unless these are not actually LLMs but text encoders, see comment above). The benefit of shorter inputs is fewer computations, not fewer parameters.
- l150: I don't know how these specific embeddings were optimized, but embeddings are generally trained to capture contextual relations, so it is expected that the representations of all these concepts will be close in the embedding space.
- l161: Publication date is not very reliable as a proxy for the timing of the hazard, because many important events resurface in the news much later, and some reports are delayed.
- l170: It's unlikely that a rule-based approach really solves this problem. There are many news articles about studies that refer to events as a general category, or about events that occur in movies or books, which are not always in future tense.
- l178: This assigns coordinates to all locations mentioned in the text, but how is the event location detected among them? Normally, there is a third step.
- l180: "Hallucination" is a common (although not very suitable) term for wrong content generated by text completers / chatbots. It seems that, here, it refers to ordinary translation errors instead.
- l196: A weighted average may end up at a location that is not even mentioned in the text, possibly even in a different country.
- l220: The sentence "As hazard events..." is not finished.
- l262: It is unclear what "until the number of false positives remained constant" means here.
- l290: The fact that the timing algorithm does not rely on neural network-based models does not mean it works well. It's a rule-based approach that will not always work.
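The weighted-average concern raised above (l196) is easy to demonstrate. Below is a minimal sketch, not the paper's actual aggregation method: the coordinates are approximate and the uniform weighting is a placeholder assumption. Averaging the coordinates of two mentioned cities can place the event somewhere in between, in neither location.

```python
# Minimal sketch (hypothetical weighting, approximate coordinates):
# averaging the coordinates of the locations mentioned in a paragraph
# can yield a point that lies in neither of them.

def weighted_centroid(points):
    """Return the weighted average of (lat, lon, weight) triples."""
    total = sum(w for _, _, w in points)
    lat = sum(la * w for la, _, w in points) / total
    lon = sum(lo * w for _, lo, w in points) / total
    return lat, lon

# A paragraph mentioning Madrid (~40.4 N, 3.7 W) and Rome (~41.9 N, 12.5 E)
# with equal weight averages to roughly (41.15 N, 4.4 E), a point in the
# Mediterranean Sea that is in neither Spain nor Italy.
mentions = [(40.4, -3.7, 1.0), (41.9, 12.5, 1.0)]
centroid = weighted_centroid(mentions)
```

This suggests the pipeline needs a rule for when to fall back to a single dominant location instead of averaging.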
Results
- For classification, traditional metrics like binary precision, recall and F1 scores should be reported in a Table. Only recall (true positive rate) was reported but, in such tasks, high precision is as (or even more) important, as "inventing" events that never happened can be a more serious mistake than missing real events depending on the use case.
- l345: Precision is quite low, though (only around 0.6 based on the confusion matrix). So the performance is mostly coming from identifying the vast majority of unrelated instances.
- l364: How exactly are the errors computed? Are the bounding boxes considered or only the centroid?
- l380: Very often, news reports give preliminary numbers that keep being updated as the event unfolds. How was this handled? If multiple paragraphs from different articles report on the same event, how are their estimates harmonized?
- l393: The mean duration is not of the GH itself but of its news coverage?
- l468: "one death but fewer than 10 fatalities" sounds like a death is not a fatality. Maybe rephrase.
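To make the metrics request above concrete: all three standard binary classification metrics can be derived directly from the confusion-matrix counts and reported side by side. The counts below are invented for illustration only, not taken from the paper; they show how high recall can coexist with low precision.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute binary precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented counts: 60 true positives, 40 false positives, 10 false negatives.
# Recall is ~0.86 while precision is only 0.6, i.e. many "invented" events,
# which is why reporting recall alone is not enough.
p, r, f1 = precision_recall_f1(tp=60, fp=40, fn=10)
```

Reporting all three per hazard class in one table would let readers judge the precision/recall trade-off for their own use case.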
Discussion
- l509: with the goal
- l520: The sentence "Mostly because..." should be connected to another sentence.
- l524: The term "only" is not correct here, as it can include events with fewer deaths if they affect enough people.
- l545: The sentence "As studies..." should be connected to another sentence.
- l574: country's
- l643: It's not the language that is non-alphabetical, it's the writing system (see https://en.wikipedia.org/wiki/List_of_writing_systems).
- l671: The sentence "Such as news..." should be connected to another sentence.
- l674: into the (or rather "as input to the"?)
- l716: Switching models is likely not that easy, and each may use different labels, output format, programming libraries or require a different input structure.
- l725: "Subconsciously" is not a suitable term, since the model/study is the agent in this sentence.