the Creative Commons Attribution 4.0 License.
Unravelling information on impactful geo-hydrological hazard events with HazMiner, a multilingual text mining method developed through a global scale coverage application
Abstract. The incidence and impacts of geo-hydrological hazards (GH) such as floods, flash floods and landslides are changing globally due to anthropogenic environmental changes and increased exposure driven by population growth. Reliable datasets on GH are essential to deepen our understanding of these hazards and their impacts. However, existing GH datasets contain data gaps leading to biased interpretations, especially in the Global South where populations are commonly the most impacted. Text mining offers new opportunities in documenting GH by automatically extracting information from large text corpora. Despite its potential, current methodologies are not adapted to improve documentation in data-scarce contexts. We present HazMiner, a paragraph-based text mining method designed to document the location, timing and impact of GH through large language models across multiple languages and at various scales. Applied here globally to 6,366,905 news articles published from 2017 through 2024 in 58 languages, HazMiner extracted 21,411 flood, 7,659 landslide and 3,606 flash flood events with known location and time information and, in some cases, impact data. Compared to existing hazard datasets, HazMiner significantly improved hazard documentation, reducing the data gaps in many regions, especially in the Global South. The new versatile multilingual method and its dataset advance both text mining and natural hazard research.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Natural Hazards and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2026-722', Anonymous Referee #1, 13 Mar 2026
AC1: 'Reply on RC1', Bram Valkenborg, 06 May 2026
We sincerely thank the reviewer for the detailed, constructive, and insightful review of our manuscript. The comments help clarify terminology and improve the storyline of the manuscript, including its methodological transparency and the formulation of the limitations, assumptions, and applicability of HazMiner. In particular, the reviewer's insights into natural language processing and evaluation strategies will significantly improve the clarity, robustness, and scientific value of the manuscript. Below, we briefly address the main comments of the reviewer.
Terminology
We agree with the reviewer’s clarification of the correct terminology. We would like to point out that HazMiner does not rely on prompt-based chatbots for text processing. Instead, HazMiner uses deep learning models with a transformer-based architecture, such as BERT encoder models, that were fine-tuned for a specific task. The manuscript will be revised accordingly to ensure consistent and accurate terminology throughout.
Propagation of text processing errors
One of the reviewers’ reoccurring concern is the propagation of text processing errors throughout HazMiner. To better quantify and communicate these uncertainties, we propose introducing an event level quality flag. This flag will represent the reliability of the information extracted from its paragraph and support a more informed downstream use of the dataset. The quality flag would reflect the known limitations of the dataset, such as text translation, the inclusion of non-time corrected paragraphs, and the location uncertainty. Where possible, the quality flag will also be evaluated against the validated paragraphs.
Time validation
Another concern raised by the reviewer was the lack of time validation. Similarly to the location validation, we propose using the Global Landslide Catalog (GLC). Specifically, the news articles used to construct the GLC will be processed through HazMiner. The time extracted by HazMiner will then be compared to the time documented by GLC.
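The proposed comparison could be sketched as follows; event IDs and dates are invented for illustration, and the real input would be the GLC source articles processed through HazMiner:

```python
from datetime import date
from statistics import median

def time_offsets(extracted: dict, catalog: dict) -> list:
    """Offset in days between HazMiner-extracted and GLC-documented dates,
    for events present in both (keys are shared event/article IDs)."""
    return [(extracted[k] - catalog[k]).days for k in extracted if k in catalog]

# Toy records; real IDs and dates would come from the GLC articles.
hazminer = {"ev1": date(2021, 7, 15), "ev2": date(2021, 7, 20)}
glc      = {"ev1": date(2021, 7, 14), "ev2": date(2021, 7, 20)}

offsets = time_offsets(hazminer, glc)
print(offsets)                     # [1, 0]
print(median(map(abs, offsets)))   # 0.5
```

Summary statistics such as the median absolute offset would then quantify the accuracy of the time extraction step.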
Non-event news articles
The reviewer also noted that news articles relating to non-events, such as warnings, may be included. To address this, we propose filtering such paragraphs using zero-shot classification using two labels, for example: ‘warning’ and ‘event’. The model can be restricted to output one label, i.e., the most relevant label for the paragraph.
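A minimal sketch of such a filter, with a crude keyword-overlap scorer standing in for a real zero-shot model; the cue lists and scoring are illustrative assumptions only:

```python
def score(paragraph: str, label: str) -> float:
    """Stand-in for a zero-shot classifier score (e.g., an NLI model's
    entailment probability); here a crude keyword overlap for illustration."""
    cues = {
        "warning": {"warning", "alert", "forecast", "expected", "may"},
        "event": {"struck", "killed", "destroyed", "flooded", "swept"},
    }
    words = set(paragraph.lower().split())
    return len(words & cues[label]) / len(cues[label])

def keep_event_paragraphs(paragraphs):
    """Keep only paragraphs whose single best label is 'event'."""
    return [p for p in paragraphs
            if score(p, "event") > score(p, "warning")]

docs = [
    "Authorities issued a flood warning as heavy rain is expected tomorrow.",
    "The river flooded the town and swept away several bridges.",
]
print(keep_event_paragraphs(docs))
```

In the actual proposal, `score` would be replaced by a zero-shot classification model constrained to return the single most relevant of the two labels.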
Manuscript narrative
The reviewer’s comments have highlighted areas where the manuscript’s storyline lacks clarity, which we will address through careful revision and refinement. For example, the application of GDELT on the HazMiner method was not clearly stated in the text except for in the title of Section 2.6, which may have complicated the interpretation of the URL sources underpinning the dataset. The revised manuscript will therefore clarify the role of GDELT in the creation of the HazMiner dataset and explicitly state that a full list of the URLs is available on Zenodo (https://zenodo.org/records/18483419).
Biases and duplicates
Finally, we wanted to clarify the goal of HazMiner. HazMiner was designed to provide a richer and more informative text-based observation system for hazard events rather than a bias-free dataset. HazMiner paragraphs represent snippets of information describing hazard-related information. We agree with the reviewer that the paragraphs dataset is prone to duplicates as explained in section 4.3.
Nevertheless, duplicates in the event dataset should be reduced through the spatiotemporal clustering of paragraphs occurring within the same location and time window. Duplicate paragraphs have a high probability of ending up in the same event, as they share the same location and timing. For example, the Central/Western Europe floods of July 14th-15th 2021 are represented in HazMiner as a single event with 31,992 paragraphs (Figure 9 of the manuscript), whereas EM-DAT represents them as 7 events, one per affected country. To better document our event clustering technique, we propose including a sensitivity analysis to assess how the clustering parameters influence the resulting number of detected events.
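To illustrate how spatiotemporal clustering collapses duplicate paragraphs into one event, here is a simplified greedy grouping; the actual method is density-based, and the 100 km and 2-day thresholds below are invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

def cluster_paragraphs(paragraphs, max_km=100.0, max_days=2):
    """Greedy single-pass grouping: a paragraph joins the first cluster whose
    seed lies within max_km and max_days; otherwise it seeds a new cluster."""
    clusters = []
    for p in paragraphs:
        for c in clusters:
            seed = c[0]
            if (haversine_km(p["loc"], seed["loc"]) <= max_km
                    and abs(p["day"] - seed["day"]) <= max_days):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Two near-duplicate reports of the same flood, one distant landslide.
paras = [
    {"loc": (50.8, 6.1), "day": 0},   # flood report
    {"loc": (50.7, 6.2), "day": 1},   # duplicate coverage of the same flood
    {"loc": (27.7, 85.3), "day": 0},  # unrelated event far away
]
print(len(cluster_paragraphs(paras)))  # 2
```

The two near-duplicate flood paragraphs share location and timing and therefore fall into the same cluster, which is the mechanism by which duplicates are absorbed at the event level.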
We clarify that HazMiner should be presented as another observation system, which is complementary to existing observation techniques and databases. Its objective is not to eliminate all biases, but to help collect more information and details on hazard events, while transparently acknowledging the associated limitations.
Citation: https://doi.org/10.5194/egusphere-2026-722-AC1
RC2: 'Comment on egusphere-2026-722', Anonymous Referee #2, 20 Mar 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-722/egusphere-2026-722-RC2-supplement.pdf
AC2: 'Reply on RC2', Bram Valkenborg, 06 May 2026
We thank the reviewer for the careful and critical evaluation of our manuscript. We appreciate the reviewer's detailed assessment of the HazMiner methodology and the emphasis placed on validation and robustness. Below, we briefly address the main comments of the reviewer.
Validation and accuracy reporting
We acknowledge the reviewer's concern regarding the absence of a robust presentation of accuracy and validation in the current manuscript. To better quantify and communicate these uncertainties, we propose adding an event-level quality flag. This flag will represent the reliability of the information extracted from each paragraph and support more informed downstream use of the dataset. It will reflect the known limitations of the dataset, such as text translation, the inclusion of non-time-corrected paragraphs, and location uncertainty. Where possible, the quality flag will also be evaluated against the validated paragraphs. In the following sections, we specifically address the different sources of error noted by the reviewer.
Classification errors
Classification errors in HazMiner mainly reflect an inherent limitation of zero-shot classification rather than a model-specific issue, as similar behavior is expected across different embedding spaces. Zero-shot classification was chosen to give HazMiner the ability to detect multiple natural hazard types without supervised training. While supervised models could address this issue, no suitable models were available and developing them was beyond the scope of this study. To improve transparency, we will quantify the impact of known misclassified and corrected paragraphs in the revised manuscript.
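The point that closely related hazard concepts sit near each other in embedding space, making zero-shot label assignment easy to confuse, can be illustrated with toy vectors. The 3-d "embeddings" below are invented for illustration, not taken from any real model:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy label "embeddings": related hazard concepts get nearby vectors.
emb = {
    "flood":       [0.9, 0.1, 0.1],
    "flash flood": [0.8, 0.3, 0.1],
    "landslide":   [0.1, 0.9, 0.2],
}

def classify(paragraph_vec, labels=emb):
    """Zero-shot assignment: pick the label embedding most similar to the text."""
    return max(labels, key=lambda l: cosine(paragraph_vec, labels[l]))

# 'flood' and 'flash flood' are far closer to each other than to 'landslide',
# so paragraphs between them are the ones most easily misclassified.
print(cosine(emb["flood"], emb["flash flood"]) >
      cosine(emb["flood"], emb["landslide"]))  # True
print(classify([0.8, 0.3, 0.1]))
```

Because the confusion stems from the geometry of the label concepts themselves, a different encoder with a different embedding space would be expected to exhibit similar flood/flash-flood ambiguity.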
Time validation
The reviewer specifically mentioned line 287 about time validation. We would like to clarify that this line refers to validation of the time extraction algorithm, and not to the whole HazMiner algorithm. Nevertheless, we acknowledge that a time validation is currently missing. Therefore, we propose using the Global Landslide Catalog (GLC) for the time validation. Specifically, the news articles used to construct the GLC will be processed through HazMiner. The time extracted by HazMiner will then be compared to the time documented by GLC.
Translation errors
Translation in HazMiner was performed using a relatively lightweight model (around 77 million parameters), as larger models (e.g., GPT-type) with fewer errors were not feasible given computational constraints; the translation step alone required six months of processing. To limit translation-induced errors, a multilingual model was used for location extraction, avoiding translation in the geoparsing stage. However, extending this approach to other components would require multilingual models fine-tuned for text classification and information extraction, which is beyond the scope of this study.
EM-DAT comparison
EM-DAT is a widely used global reference database on disasters, recognized for its high reliability due to manual reporting, but it contains data gaps, like any observation system. By comparing HazMiner with EM-DAT, our goal was to demonstrate the additional information on hazard events that can be captured from online news sources using automated text mining. We did not intend to suggest that such a comparison is impossible; rather, we aimed to highlight the specific rules and constraints applied when comparing impact data (L281–285). We will clarify this in the revised manuscript.
Non-event news articles
The reviewer also noted that news articles relating to non-events, such as warnings, may be included. To address this, we propose filtering such paragraphs using zero-shot classification and, for example, two labels: ‘warning’ and ‘event’. The model can be restricted to output one label, i.e., the most relevant label for the paragraph.
Sensitivity analysis of clustering parameters
We agree with the reviewer that a sensitivity analysis would improve transparency and inform the reader. We therefore propose performing clustering across a range of xi, epsilon, and “minimum paragraphs per event” values to assess their influence on the resulting number of clusters. The most relevant findings will be included in the manuscript to clarify the impact of these parameters.
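The proposed sensitivity analysis could be sketched as a simple parameter sweep. The 1-D chained grouping below is only a stand-in for the actual density-based spatiotemporal clustering, and all parameter values are illustrative:

```python
from itertools import product

def cluster_count(points, eps, min_size):
    """Count clusters from a chained 1-D grouping: consecutive sorted points
    within eps join the same group; groups smaller than min_size are dropped
    as noise. A stand-in for the xi/epsilon/min-paragraphs parameters."""
    pts = sorted(points)
    groups, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] <= eps:
            current.append(p)
        else:
            groups.append(current)
            current = [p]
    groups.append(current)
    return sum(1 for g in groups if len(g) >= min_size)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
# Sweep the two parameters and report the resulting number of clusters.
for eps, min_size in product([0.15, 0.5, 6.0], [1, 2]):
    print(eps, min_size, cluster_count(points, eps, min_size))
```

Tabulating the cluster count over such a grid makes it immediately visible which parameter the event count is most sensitive to, which is the information the sensitivity analysis would report.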
Data sources
We thank the reviewer for noting that the manuscript did not clearly indicate where the URLs used to generate the HazMiner dataset can be accessed. These URLs were retrieved from the GDELT dataset using a structured query (Table 1 of the manuscript), resulting in a large collection of sources. All URLs are publicly available in the article dataset hosted on Zenodo (https://zenodo.org/records/18483419), and the revised manuscript will explicitly state where readers can consult them.
Citation: https://doi.org/10.5194/egusphere-2026-722-AC2
General Comments
This paper addresses a pressing research gap: the creation of reliable and validated (global) text-based datasets for natural hazard impact and adaptation research. The authors propose and implement an automated pipeline with 14 steps to cluster news paragraphs into temporally and geographically thematic clusters for three types of hazards: floods, flash floods and landslides. The work demonstrates a remarkable effort in coordinating all processing steps for large-scale multilingual text data and analyzing the resulting structures. Importantly, known limitations and biases are transparently and extensively discussed in Section 4. In that regard, it is a valuable contribution with good potential to pave the way for relevant further research in various fields. Still, a few drawbacks must be pointed out and require improvement or at least a more detailed discussion of consequences and possible solutions.
Specific Comments
Clarifications / corrections needed with some suggestions
Abstract
- l1: Since the number of pages is not critically limited, the abbreviation GH could be avoided throughout the whole text. The full version "geo-hydrological hazards" does not take that much extra space and really facilitates reading.
- l15: What is data-scared?
- l18: The 58 languages should be listed in the appendix.
- l20: It does not really advance the text mining field itself as no new techniques or findings related to the text mining methods themselves are presented.
Introduction
- Almost 2.5 pages is quite long for an introduction. It mentions many related works that deserve a dedicated related work section. I suggest splitting this part into two sections, one with a compact overview of the paper and the other for the list of related work, ideally extended to account for the various tasks in the pipeline.
- l63: Text mining methods are not perfect and cannot always extract the right information. It is important to phrase it in a way that makes it clear that these techniques do not always work.
- l65: And disadvantages.
- l74: LLMs do not "understand" texts. I recommend avoiding verbs related to human cognition to refer to how LLMs process language.
- l75-78: LLMs are mentioned before GPT and BERT, which makes it sound as if these two appeared after the popularization of LLMs. I suggest reversing the order of these sentences. First, refer to the advent of Transformer-based architectures like GPT and BERT, then to the popularization of LLMs as a tool for many NLP tasks.
- l82: The "While only a few..." needs to be connected to another sentence.
- l94: It is ambiguous whether "the dataset" refers to GDELT or your dataset.
- l96: The "Followed by..." needs to be connected to another sentence.
Methodology
- It is a matter of personal taste, but since Figure 1 already shows an overview of the pipeline, it would be better if the description of each subtask was already accompanied by its own evaluation. It is hard to read and move on without knowing how well each model performs at each step. Then Section 3 could focus only on describing the resulting dataset.
- l109: What does it mean that HazMiner was "specifically" configured to extract GH events, isn't it its main purpose?
- Figure 2: "hazardous text" sounds like the text itself is hazardous.
- Figure 2: An arrow connecting the output of a step to the input of the other would be useful to avoid the reliance on color, which may not be visible for color-blind readers.
- 2.1 should be named "text selection" or "document selection", as it involves steps that are not about extraction.
- l115: Subtasks rather than subprocesses.
- l116: Scraping news is usually not allowed by media outlet websites. Was it performed in accordance with the terms of use?
- l119: The simple heuristic to use backspace to identify paragraphs can be problematic, as there is no guarantee that all websites follow this rule and this information may be lost if the raw texts were already altered by news aggregators. This step should thus also be evaluated.
- l121: It is not really more efficient because it makes the number of instances to process much larger and removes important context needed for disambiguation.
- l122: It's not true that current LLMs are optimized for shorter inputs (unless it's not actually LLMs but text encoders, see comment above). Not fewer parameters but fewer computations.
- l150: I don't know how these specific embeddings were optimized. But generally, embeddings are trained to capture contextual relations, it is expected that the representations of all these concepts will be close in the embedding space.
- l161: Publication date is not very reliable as the timing of the hazard because many important events resurface in the news much later, and some are delayed.
- l170: It's unlikely that a rule-based approach really solves this problem. There are many news stories about studies that refer to events as a general category, or about events that occur in movies or books, and these are not always in the future tense.
- l178: This assigns coordinates to all locations mentioned in the text, but how is the event location detected among them? Normally, there is a third step.
- l180: Hallucinations is a common (although not very suitable) term for wrong content generated by text completers / chatbots. It seems that, here, it refers to normal translation errors instead.
- l196: Weighted average may likely end up in a location that is not even in the text, maybe even in a different country.
- l220: The sentence "As hazard events..." is not finished.
- l262: It is unclear what "until the number of false positives remained constant" means here.
- l290: The fact that the timing algorithm does not rely on neural network-based models does not mean it works well. It's a rule-based approach that will not always work.
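The concern at l196, that a weighted average of mentioned coordinates can land somewhere the text never refers to, is easy to demonstrate with two equal-weight mentions (coordinates below are approximate and chosen for illustration):

```python
def weighted_centroid(points):
    """Weighted mean of (lat, lon, weight) triples; reasonable for nearby
    points, but can land somewhere none of the mentions refer to."""
    total = sum(w for _, _, w in points)
    lat = sum(la * w for la, _, w in points) / total
    lon = sum(lo * w for _, lo, w in points) / total
    return lat, lon

# Equal-weight mentions of London (51.5N, 0.1W) and Amsterdam (52.4N, 4.9E):
# their centroid falls in the North Sea, a place neither article mentions.
print(weighted_centroid([(51.5, -0.1, 1.0), (52.4, 4.9, 1.0)]))
```

This is why a selection step (picking one of the mentioned locations) rather than an averaging step is often preferred, or at least why the averaged location needs an explicit uncertainty estimate.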
Results
- For classification, traditional metrics like binary precision, recall and F1 scores should be reported in a Table. Only recall (true positive rate) was reported but, in such tasks, high precision is as (or even more) important, as "inventing" events that never happened can be a more serious mistake than missing real events depending on the use case.
- l345: Precision is quite low, though (only around 0.6 based on the confusion matrix). So the performance is mostly coming from identifying the vast majority of unrelated instances.
- l364: How exactly are the errors computed? Are the bounding boxes considered or only the centroid?
- l380: Very often, news report preliminary numbers that keep being updated as the event unfolds. How was this handled? If multiple paragraphs from different articles are reporting about the same event, how are their estimates harmonized?
- l393: The mean duration is not of the GH itself but of its news coverage?
- l468: "one death but fewer than 10 fatalities" sounds like a death is not a fatality. Maybe rephrase.
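Computing the binary metrics requested above from confusion-matrix counts is straightforward; the counts in the example are illustrative only, not taken from the manuscript:

```python
def prf(tp, fp, fn):
    """Binary precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts only: 60 true positives, 40 false positives,
# 10 false negatives gives precision 0.6 but recall 0.86.
p, r, f = prf(tp=60, fp=40, fn=10)
print(round(p, 2), round(r, 2), round(f, 2))
```

Reporting all three per class, rather than recall alone, would let readers judge how often the pipeline "invents" events versus how often it misses them.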
Discussion
- l509: with the goal
- l520: The sentence "Mostly because..." should be connected to another sentence.
- l524: The term "only" is not correct here as it can include events with fewer deaths if they affect enough people.
- l545: The sentence "As studies..." should be connected to another sentence.
- l574: country's
- l643: It's not the language that is non-alphabetical, it's the writing system (see https://en.wikipedia.org/wiki/List_of_writing_systems).
- l671: The sentence "Such as news..." should be connected to another sentence.
- l674: into the (or rather "as input to the"?)
- l716: Switching models is likely not that easy, and each may use different labels, output format, programming libraries or require a different input structure.
- l725: Subconsciously is not a suitable term since model/study are the agents in this sentence.