the Creative Commons Attribution 4.0 License.
Unravelling information on impactful geo-hydrological hazard events with HazMiner, a multilingual text mining method developed through a global scale coverage application
Abstract. The incidence and impacts of geo-hydrological hazards (GH) such as floods, flash floods and landslides are changing globally due to anthropogenic environmental changes and increased exposure driven by population growth. Reliable datasets on GH are essential to deepen our understanding of these hazards and their impacts. However, existing GH datasets contain data gaps leading to biased interpretations, especially in the Global South where populations are commonly the most impacted. Text mining offers new opportunities in documenting GH by automatically extracting information from large text corpora. Despite its potential, current methodologies are not adapted to improve documentation in data-scarce contexts. We present HazMiner, a paragraph-based text mining method designed to document the location, timing and impact of GH through large language models across multiple languages and at various scales. Applied here globally to 6,366,905 news articles published from 2017 through 2024 in 58 languages, HazMiner extracted 21,411 flood, 7,659 landslide and 3,606 flash flood events with known location and time information and, in some cases, impact data. Compared to existing hazard datasets, HazMiner significantly improved hazard documentation, reducing the data gaps in many regions, especially in the Global South. The new versatile multilingual method and its dataset advance both text mining and natural hazard research.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Natural Hazards and Earth System Sciences.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2026-722', Anonymous Referee #1, 13 Mar 2026
AC1: 'Reply on RC1', Bram Valkenborg, 06 May 2026
We sincerely thank the reviewer for the detailed, constructive, and insightful review of our manuscript. The comments help clarify terminology and improve the storyline of the manuscript, including its methodological transparency and the formulation of the limitations, assumptions, and applicability of HazMiner. In particular, the reviewer's insights into natural language processing and evaluation strategies will significantly improve the clarity, robustness, and scientific value of the manuscript. Below, we briefly address the main comments of the reviewer.
Terminology
We agree with the reviewer’s clarification of the correct terminology. We would like to point out that HazMiner does not rely on prompt-based chatbots for text processing. Instead, HazMiner uses deep learning models with a transformer-based architecture, such as BERT encoder models, that were fine-tuned for a specific task. The manuscript will be revised accordingly to ensure consistent and accurate terminology throughout.
Propagation of text processing errors
One of the reviewers’ reoccurring concern is the propagation of text processing errors throughout HazMiner. To better quantify and communicate these uncertainties, we propose introducing an event level quality flag. This flag will represent the reliability of the information extracted from its paragraph and support a more informed downstream use of the dataset. The quality flag would reflect the known limitations of the dataset, such as text translation, the inclusion of non-time corrected paragraphs, and the location uncertainty. Where possible, the quality flag will also be evaluated against the validated paragraphs.
Time validation
Another concern raised by the reviewer was the lack of time validation. Similarly to the location validation, we propose using the Global Landslide Catalog (GLC). Specifically, the news articles used to construct the GLC will be processed through HazMiner. The time extracted by HazMiner will then be compared to the time documented by GLC.
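The proposed comparison could be sketched as follows; event IDs and dates are invented for illustration, and the real input would be the GLC source articles processed through HazMiner:

```python
from datetime import date
from statistics import median

def time_offsets(extracted: dict, catalog: dict) -> list:
    """Offset in days between HazMiner-extracted and GLC-documented dates,
    for events present in both (keys are shared event/article IDs)."""
    return [(extracted[k] - catalog[k]).days for k in extracted if k in catalog]

# Toy records; real IDs and dates would come from the GLC articles.
hazminer = {"ev1": date(2021, 7, 15), "ev2": date(2021, 7, 20)}
glc      = {"ev1": date(2021, 7, 14), "ev2": date(2021, 7, 20)}

offsets = time_offsets(hazminer, glc)
print(offsets)                     # [1, 0]
print(median(map(abs, offsets)))   # 0.5
```

Summary statistics such as the median absolute offset would then quantify the accuracy of the time extraction step.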
Non-event news articles
The reviewer also noted that news articles relating to non-events, such as warnings, may be included. To address this, we propose filtering such paragraphs using zero-shot classification using two labels, for example: ‘warning’ and ‘event’. The model can be restricted to output one label, i.e., the most relevant label for the paragraph.
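A minimal sketch of such a filter, with a crude keyword-overlap scorer standing in for a real zero-shot model; the cue lists and scoring are illustrative assumptions only:

```python
def score(paragraph: str, label: str) -> float:
    """Stand-in for a zero-shot classifier score (e.g., an NLI model's
    entailment probability); here a crude keyword overlap for illustration."""
    cues = {
        "warning": {"warning", "alert", "forecast", "expected", "may"},
        "event": {"struck", "killed", "destroyed", "flooded", "swept"},
    }
    words = set(paragraph.lower().split())
    return len(words & cues[label]) / len(cues[label])

def keep_event_paragraphs(paragraphs):
    """Keep only paragraphs whose single best label is 'event'."""
    return [p for p in paragraphs
            if score(p, "event") > score(p, "warning")]

docs = [
    "Authorities issued a flood warning as heavy rain is expected tomorrow.",
    "The river flooded the town and swept away several bridges.",
]
print(keep_event_paragraphs(docs))
```

In the actual proposal, `score` would be replaced by a zero-shot classification model constrained to return the single most relevant of the two labels.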
Manuscript narrative
The reviewer’s comments have highlighted areas where the manuscript’s storyline lacks clarity, which we will address through careful revision and refinement. For example, the application of GDELT on the HazMiner method was not clearly stated in the text except for in the title of Section 2.6, which may have complicated the interpretation of the URL sources underpinning the dataset. The revised manuscript will therefore clarify the role of GDELT in the creation of the HazMiner dataset and explicitly state that a full list of the URLs is available on Zenodo (https://zenodo.org/records/18483419).
Biases and duplicates
Finally, we wanted to clarify the goal of HazMiner. HazMiner was designed to provide a richer and more informative text-based observation system for hazard events rather than a bias-free dataset. HazMiner paragraphs represent snippets of information describing hazard-related information. We agree with the reviewer that the paragraphs dataset is prone to duplicates as explained in section 4.3.
Nevertheless, duplicates in the event dataset should be reduced through the spatiotemporal clustering of paragraphs occurring within the same location and time window. Duplicate paragraphs have a high probability of ending up in the same event, as they share the same location and timing. For example, the Central/Western Europe floods of July 14th-15th 2021 are represented in HazMiner as a single event with 31,992 paragraphs (Figure 9 of the manuscript), whereas EM-DAT represents them as 7 events, one per affected country. To better document our event clustering technique, we propose including a sensitivity analysis to assess how the clustering parameters influence the resulting number of detected events.
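To illustrate how spatiotemporal clustering collapses duplicate paragraphs into one event, here is a simplified greedy grouping; the actual method is density-based, and the 100 km and 2-day thresholds below are invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

def cluster_paragraphs(paragraphs, max_km=100.0, max_days=2):
    """Greedy single-pass grouping: a paragraph joins the first cluster whose
    seed lies within max_km and max_days; otherwise it seeds a new cluster."""
    clusters = []
    for p in paragraphs:
        for c in clusters:
            seed = c[0]
            if (haversine_km(p["loc"], seed["loc"]) <= max_km
                    and abs(p["day"] - seed["day"]) <= max_days):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Two near-duplicate reports of the same flood, one distant landslide.
paras = [
    {"loc": (50.8, 6.1), "day": 0},   # flood report
    {"loc": (50.7, 6.2), "day": 1},   # duplicate coverage of the same flood
    {"loc": (27.7, 85.3), "day": 0},  # unrelated event far away
]
print(len(cluster_paragraphs(paras)))  # 2
```

The two near-duplicate flood paragraphs share location and timing and therefore fall into the same cluster, which is the mechanism by which duplicates are absorbed at the event level.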
We clarify that HazMiner should be presented as another observation system, which is complementary to existing observation techniques and databases. Its objective is not to eliminate all biases, but to help collect more information and details on hazard events, while transparently acknowledging the associated limitations.
Citation: https://doi.org/10.5194/egusphere-2026-722-AC1
RC2: 'Comment on egusphere-2026-722', Anonymous Referee #2, 20 Mar 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-722/egusphere-2026-722-RC2-supplement.pdf
AC2: 'Reply on RC2', Bram Valkenborg, 06 May 2026
We thank the reviewer for the careful and critical evaluation of our manuscript. We appreciate the reviewer's detailed assessment of the HazMiner methodology and the emphasis placed on validation and robustness. Below, we briefly address the main comments of the reviewer.
Validation and accuracy reporting
We acknowledge the reviewer's concern regarding the absence of a robust presentation of accuracy and validation in the current manuscript. To better quantify and communicate these uncertainties, we propose adding an event-level quality flag. This flag will represent the reliability of the information extracted from each paragraph and support more informed downstream use of the dataset. It will reflect the known limitations of the dataset, such as text translation, the inclusion of non-time-corrected paragraphs, and location uncertainty. Where possible, the quality flag will also be evaluated against the validated paragraphs. In the following sections, we specifically address the different sources of error noted by the reviewer.
Classification errors
Classification errors in HazMiner mainly reflect an inherent limitation of zero-shot classification rather than a model-specific issue, as similar behavior is expected across different embedding spaces. Zero-shot classification was chosen to give HazMiner the ability to detect multiple natural hazard types without supervised training. While supervised models could address this issue, no suitable models were available and developing them was beyond the scope of this study. To improve transparency, we will quantify the impact of known misclassified and corrected paragraphs in the revised manuscript.
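The point that closely related hazard concepts sit near each other in embedding space, making zero-shot label assignment easy to confuse, can be illustrated with toy vectors. The 3-d "embeddings" below are invented for illustration, not taken from any real model:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy label "embeddings": related hazard concepts get nearby vectors.
emb = {
    "flood":       [0.9, 0.1, 0.1],
    "flash flood": [0.8, 0.3, 0.1],
    "landslide":   [0.1, 0.9, 0.2],
}

def classify(paragraph_vec, labels=emb):
    """Zero-shot assignment: pick the label embedding most similar to the text."""
    return max(labels, key=lambda l: cosine(paragraph_vec, labels[l]))

# 'flood' and 'flash flood' are far closer to each other than to 'landslide',
# so paragraphs between them are the ones most easily misclassified.
print(cosine(emb["flood"], emb["flash flood"]) >
      cosine(emb["flood"], emb["landslide"]))  # True
print(classify([0.8, 0.3, 0.1]))
```

Because the confusion stems from the geometry of the label concepts themselves, a different encoder with a different embedding space would be expected to exhibit similar flood/flash-flood ambiguity.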
Time validation
The reviewer specifically mentioned line 287 about time validation. We would like to clarify that this line refers to validation of the time extraction algorithm, and not to the whole HazMiner algorithm. Nevertheless, we acknowledge that a time validation is currently missing. Therefore, we propose using the Global Landslide Catalog (GLC) for the time validation. Specifically, the news articles used to construct the GLC will be processed through HazMiner. The time extracted by HazMiner will then be compared to the time documented by GLC.
Translation errors
Translation in HazMiner was performed using a relatively lightweight model (around 77 million parameters), as larger models (e.g., GPT-type) with fewer errors were not feasible given computational constraints; the translation step alone required six months of processing. To limit translation-induced errors, a multilingual model was used for location extraction, avoiding translation in the geoparsing stage. However, extending this approach to other components would require multilingual models fine-tuned for text classification and information extraction, which is beyond the scope of this study.
EM-DAT comparison
EM-DAT is a widely used global reference database on disasters, recognized for its high reliability due to manual reporting, but it contains data gaps, like any observation system. By comparing HazMiner with EM-DAT, our goal was to demonstrate the additional information on hazard events that can be captured from online news sources using automated text mining. We did not intend to suggest that such a comparison is impossible; rather, we aimed to highlight the specific rules and constraints applied when comparing impact data (L281–285). We will clarify this in the revised manuscript.
Non-event news articles
The reviewer also noted that news articles relating to non-events, such as warnings, may be included. To address this, we propose filtering such paragraphs using zero-shot classification and, for example, two labels: ‘warning’ and ‘event’. The model can be restricted to output one label, i.e., the most relevant label for the paragraph.
Sensitivity analysis of clustering parameters
We agree with the reviewer that a sensitivity analysis would improve transparency and inform the reader. We therefore propose performing clustering across a range of xi, epsilon, and “minimum paragraphs per event” values to assess their influence on the resulting number of clusters. The most relevant findings will be included in the manuscript to clarify the impact of these parameters.
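The proposed sensitivity analysis could be sketched as a simple parameter sweep. The 1-D chained grouping below is only a stand-in for the actual density-based spatiotemporal clustering, and all parameter values are illustrative:

```python
from itertools import product

def cluster_count(points, eps, min_size):
    """Count clusters from a chained 1-D grouping: consecutive sorted points
    within eps join the same group; groups smaller than min_size are dropped
    as noise. A stand-in for the xi/epsilon/min-paragraphs parameters."""
    pts = sorted(points)
    groups, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] <= eps:
            current.append(p)
        else:
            groups.append(current)
            current = [p]
    groups.append(current)
    return sum(1 for g in groups if len(g) >= min_size)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
# Sweep the two parameters and report the resulting number of clusters.
for eps, min_size in product([0.15, 0.5, 6.0], [1, 2]):
    print(eps, min_size, cluster_count(points, eps, min_size))
```

Tabulating the cluster count over such a grid makes it immediately visible which parameter the event count is most sensitive to, which is the information the sensitivity analysis would report.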
Data sources
We thank the reviewer for noting that the manuscript did not clearly indicate where the URLs used to generate the HazMiner dataset can be accessed. These URLs were retrieved from the GDELT dataset using a structured query (Table 1 of the manuscript), resulting in a large collection of sources. All URLs are publicly available in the article dataset hosted on Zenodo (https://zenodo.org/records/18483419), and the revised manuscript will explicitly state where readers can consult them.
Citation: https://doi.org/10.5194/egusphere-2026-722-AC2
General Comments
This paper addresses a pressing research gap: the creation of reliable and validated (global) text-based datasets for natural hazard impact and adaptation research. The authors propose and implement an automated pipeline with 14 steps to cluster news paragraphs into temporally and geographically thematic clusters for three types of hazards: floods, flash floods and landslides. The work demonstrates a remarkable effort in coordinating all processing steps for large-scale multilingual text data and analyzing the resulting structures. Importantly, known limitations and biases are transparently and extensively discussed in Section 4. In that regard, it is a valuable contribution with good potential to pave the way for relevant further research in various fields. Still, a few drawbacks must be pointed out and require improvement or at least a more detailed discussion of consequences and possible solutions.
Specific Comments
Clarifications / corrections needed with some suggestions
Abstract
- l1: Since the number of pages is not critically limited, the abbreviation GH could be avoided throughout the whole text. The full version "geo-hydrological hazards" does not take that much extra space and really facilitates reading.
- l15: What is data-scared?
- l18: The 58 languages should be listed in the appendix.
- l20: It does not really advance the text mining field itself as no new techniques or findings related to the text mining methods themselves are presented.
Introduction
- Almost 2.5 pages is quite long for an introduction. It mentions many related works that deserve a dedicated related work section. I suggest splitting this part into two sections, one with a compact overview of the paper and the other for the list of related work, ideally extended to account for the various tasks in the pipeline.
- l63: Text mining methods are not perfect and cannot always extract the right information. It is important to phrase it in a way that makes it clear that these techniques do not always work.
- l65: And disadvantages.
- l74: LLMs do not "understand" texts. I recommend avoiding verbs related to human cognition to refer to how LLMs process language.
- l75-78: LLMs are mentioned before GPT and BERT, which makes it sound as if these two appeared after the popularization of LLMs. I suggest reversing the order of these sentences. First, refer to the advent of Transformer-based architectures like GPT and BERT, then to the popularization of LLMs as a tool for many NLP tasks.
- l82: The "While only a few..." needs to be connected to another sentence.
- l94: It is ambiguous whether "the dataset" refers to GDELT or your dataset.
- l96: The "Followed by..." needs to be connected to another sentence.
Methodology
- It is a matter of personal taste, but since Figure 1 already shows an overview of the pipeline, it would be better if the description of each subtask was already accompanied by its own evaluation. It is hard to read and move on without knowing how well each model performs at each step. Then Section 3 could focus only on describing the resulting dataset.
- l109: What does it mean that HazMiner was "specifically" configured to extract GH events, isn't it its main purpose?
- Figure 2: "hazardous text" sounds like the text itself is hazardous.
- Figure 2: An arrow connecting the output of a step to the input of the other would be useful to avoid the reliance on color, which may not be visible for color-blind readers.
- 2.1 should be named "text selection" or "document selection", as it involves steps that are not about extraction.
- l115: Subtasks rather than subprocesses.
- l116: Scraping news is usually not allowed by media outlet websites. Was it performed in accordance with the terms of use?
- l119: The simple heuristic to use backspace to identify paragraphs can be problematic, as there is no guarantee that all websites follow this rule and this information may be lost if the raw texts were already altered by news aggregators. This step should thus also be evaluated.
- l121: It is not really more efficient because it makes the number of instances to process much larger and removes important context needed for disambiguation.
- l122: It's not true that current LLMs are optimized for shorter inputs (unless it's not actually LLMs but text encoders, see comment above). Not fewer parameters but fewer computations.
- l150: I don't know how these specific embeddings were optimized. But generally, embeddings are trained to capture contextual relations, it is expected that the representations of all these concepts will be close in the embedding space.
- l161: Publication date is not very reliable as the timing of the hazard because many important events resurface in the news much later, and some are delayed.
- l170: It's unlikely that a rule-based approach really solves this problem. There are many news stories about studies that refer to events as a general category, or about events that occur in movies or books, and these are not always in the future tense.
- l178: This assigns coordinates to all locations mentioned in the text, but how is the event location detected among them? Normally, there is a third step.
- l180: Hallucinations is a common (although not very suitable) term for wrong content generated by text completers / chatbots. It seems that, here, it refers to normal translation errors instead.
- l196: Weighted average may likely end up in a location that is not even in the text, maybe even in a different country.
- l220: The sentence "As hazard events..." is not finished.
- l262: It is unclear what "until the number of false positives remained constant" means here.
- l290: The fact that the timing algorithm does not rely on neural network-based models does not mean it works well. It's a rule-based approach that will not always work.
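The concern at l196, that a weighted average of mentioned coordinates can land somewhere the text never refers to, is easy to demonstrate with two equal-weight mentions (coordinates below are approximate and chosen for illustration):

```python
def weighted_centroid(points):
    """Weighted mean of (lat, lon, weight) triples; reasonable for nearby
    points, but can land somewhere none of the mentions refer to."""
    total = sum(w for _, _, w in points)
    lat = sum(la * w for la, _, w in points) / total
    lon = sum(lo * w for _, lo, w in points) / total
    return lat, lon

# Equal-weight mentions of London (51.5N, 0.1W) and Amsterdam (52.4N, 4.9E):
# their centroid falls in the North Sea, a place neither article mentions.
print(weighted_centroid([(51.5, -0.1, 1.0), (52.4, 4.9, 1.0)]))
```

This is why a selection step (picking one of the mentioned locations) rather than an averaging step is often preferred, or at least why the averaged location needs an explicit uncertainty estimate.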
Results
- For classification, traditional metrics like binary precision, recall and F1 scores should be reported in a Table. Only recall (true positive rate) was reported but, in such tasks, high precision is as (or even more) important, as "inventing" events that never happened can be a more serious mistake than missing real events depending on the use case.
- l345: Precision is quite low, though (only around 0.6 based on the confusion matrix). So the performance is mostly coming from identifying the vast majority of unrelated instances.
- l364: How exactly are the errors computed? Are the bounding boxes considered or only the centroid?
- l380: Very often, news report preliminary numbers that keep being updated as the event unfolds. How was this handled? If multiple paragraphs from different articles are reporting about the same event, how are their estimates harmonized?
- l393: The mean duration is not of the GH itself but of its news coverage?
- l468: "one death but fewer than 10 fatalities" sounds like a death is not a fatality. Maybe rephrase.
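Computing the binary metrics requested above from confusion-matrix counts is straightforward; the counts in the example are illustrative only, not taken from the manuscript:

```python
def prf(tp, fp, fn):
    """Binary precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts only: 60 true positives, 40 false positives,
# 10 false negatives gives precision 0.6 but recall 0.86.
p, r, f = prf(tp=60, fp=40, fn=10)
print(round(p, 2), round(r, 2), round(f, 2))
```

Reporting all three per class, rather than recall alone, would let readers judge how often the pipeline "invents" events versus how often it misses them.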
Discussion
- l509: with the goal
- l520: The sentence "Mostly because..." should be connected to another sentence.
- l524: The term "only" is not correct here as it can include events with fewer deaths if they affect enough people.
- l545: The sentence "As studies..." should be connected to another sentence.
- l574: country's
- l643: It's not the language that is non-alphabetical, it's the writing system (see https://en.wikipedia.org/wiki/List_of_writing_systems).
- l671: The sentence "Such as news..." should be connected to another sentence.
- l674: into the (or rather "as input to the"?)
- l716: Switching models is likely not that easy, and each may use different labels, output format, programming libraries or require a different input structure.
- l725: Subconsciously is not a suitable term since model/study are the agents in this sentence.