Wikimpacts 1.0: A new global climate impact database based on automated information extraction from Wikipedia
Abstract. Climate extremes like storms, heatwaves, wildfires, droughts and floods significantly threaten society and ecosystems. However, comprehensive data on the socio-economic impacts of climate extremes remains limited. Here we present Wikimpacts 1.0, a global climate impact database built by extracting information from Wikipedia using natural language processing. Our method identifies relevant articles, extracts the information using GPT-4o, post-processes it and consolidates the database. Impact data is stored at the event, national, and sub-national levels, covering 2,928 events from 1034 to 2024, with 20,186 national and 36,394 sub-national entries. Evaluated against manually annotated data from 156 events, the database shows low error scores (on a scale from 0 to 1) for event-level information like timing (0.05), deaths (0.03), and economic damage (0.12), and slightly higher error scores for injuries (0.21), homelessness (0.25), displacement (0.29), and damaged buildings (0.28). Wikimpacts 1.0 provides broader impact coverage of storms than EM-DAT at the sub-national level. In a comparison of impact values, 38 out of 234 matched events have identical data for deaths, and 7 of 94 for injuries; however, there are notable discrepancies in information on homelessness and damage. Our public database highlights the potential of natural language processing to complement existing impact datasets and to provide robust information on climate impacts.
Status: open (until 10 Dec 2025)
- RC1: 'Comment on egusphere-2025-4891', Anonymous Referee #1, 17 Nov 2025
- RC2: 'Comment on egusphere-2025-4891', Anonymous Referee #2, 21 Nov 2025
I thank the Editor for the opportunity to review the manuscript “Wikimpacts 1.0: a new global climate impact database based on automated information extraction from Wikipedia”.
General assessment
The work is of high scientific interest, given the need for transparent, harmonized and scalable datasets for climate risk assessment and loss and damage research. Despite its obvious potential, the manuscript requires substantial revisions to improve the clarity, structure, and articulation of its scientific contribution. Several aspects of the article would benefit from clearer explanation, greater integration between methods and results, better organization of figures and tables, and a more explicit discussion of both scientific limitations and advances.
If the authors adequately address the points presented below, the manuscript may be reconsidered for publication.
Major comments
The abstract is clear and well structured, effectively presenting the objective, methodology, and main results. However, it does not fully convey the scientific relevance and broader implications of the work. The authors could slightly expand the final sentence to better highlight the contribution of Wikimpacts 1.0 to climate impact research and data integration.

The introduction is clear, comprehensive, and well supported by recent literature. It effectively presents the relevance of climate impact data, the limitations of existing databases, and the general methodological approach. However, the section could place greater emphasis on the scientific innovations of the work, for example by briefly explaining how Wikimpacts 1.0 improves on existing approaches or supports climate impact analysis. A brief description of the paper’s structure at the end of the section would also enhance readability. Finally, the reference “Li et al., 2025a” appears to correspond to the DOI of the Wikimpacts 1.0 dataset. While it is appropriate to cite the dataset associated with the paper, the current phrasing could confuse readers into thinking it refers to a separate publication. The authors could clarify by writing “the accompanying Wikimpacts 1.0 dataset (Li et al., 2025a)” or similar.
In Section 2.1, the normalization example (“more than 200 deaths…”) introduces methodological details that belong in Section 4 and distracts from the structural description of the database.
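To make the point concrete for the authors, such a normalization could be illustrated with a minimal sketch like the one below. The (min, max, approx) representation and the parsing rules are my assumption for illustration only; the paper's actual schema may differ.

```python
import re

# Sketch: normalizing a vague quantity phrase such as "more than 200 deaths"
# into a structured (min, max, approx) triple. Representation and rules are
# hypothetical, chosen only to illustrate the idea.
def normalize_quantity(phrase: str) -> dict:
    match = re.search(r"(\d[\d,]*)", phrase)
    if match is None:
        return {"min": None, "max": None, "approx": None}
    value = int(match.group(1).replace(",", ""))

    if "more than" in phrase or "at least" in phrase:
        return {"min": value, "max": None, "approx": value}
    if "up to" in phrase or "as many as" in phrase:
        return {"min": None, "max": value, "approx": value}
    return {"min": value, "max": value, "approx": value}

print(normalize_quantity("more than 200 deaths"))
# -> {'min': 200, 'max': None, 'approx': 200}
```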
Similarly, in Section 2.2, the forward reference to the case study later discussed in Section 6 is premature and breaks the logical flow; it could be rephrased more generally (e.g., “an example is provided later in the paper”).

In Section 3.1, it should be specified that the text classifier is the fine-tuned English BERT model itself, not a separate classifier. The numerical results (30,085, 4,900, and 5,046 articles) are intermediate outcomes and would be better presented in the results section; this part should focus on the procedure.
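For readers unfamiliar with this setup, the filtering step could be sketched roughly as follows (using the Hugging Face transformers API; the checkpoint path and label name are placeholders, not the authors' actual artifacts):

```python
from transformers import pipeline

# Placeholder checkpoint; the authors' fine-tuned English BERT model differs.
classifier = pipeline("text-classification", model="path/to/fine-tuned-bert")

def is_relevant(article_text: str) -> bool:
    """Keep a Wikipedia article if the classifier deems it impact-relevant."""
    result = classifier(article_text, truncation=True)[0]
    return result["label"] == "RELEVANT"  # placeholder label name
```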
Section 3.2 clearly describes the GPT-4o extraction process, though it is somewhat too detailed. Some prompt information could be moved to the Supplementary Material, and a brief justification of the model choice would improve clarity.
In Section 3.3.3, the term “version 3.1/3.2” appears to refer to different configurations of the extraction pipeline or prompt templates rather than model versions. This could be stated explicitly to avoid confusion.

Section 4 presents a solid and well-structured description of the validation framework used to assess the Wikimpacts 1.0 pipeline. The creation of a manually annotated gold standard and the definition of customized normalized error metrics are strong methodological points that demonstrate transparency and reproducibility. However, the section mixes methodological description and quantitative outcomes. The numerical results (e.g., error scores by level and field) would be more appropriately presented in the results section, while this part should focus on explaining the validation design. Separating the methods from the results would improve clarity and logical flow. The rationale for selecting 70 events for the development set and 156 for the test set should be briefly explained (e.g., availability, annotation workload, or statistical considerations). The description of the annotation process could also specify the number and background of annotators and how disagreements were resolved.
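As an example of the kind of worked definition that would help readers, a normalized error score bounded to [0, 1] could take a symmetric relative-difference form such as the sketch below. This is one plausible formulation; the paper's exact metric may differ.

```python
# Sketch of a normalized error score in [0, 1] for a numeric impact field:
# 0 for a perfect match, approaching 1 for a maximal mismatch.
def normalized_error(predicted: float, gold: float) -> float:
    denominator = max(abs(predicted), abs(gold))
    return abs(predicted - gold) / denominator if denominator else 0.0

print(normalized_error(190, 200))  # -> 0.05
```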
The statement regarding the evaluated database fields (those without asterisks in Table 3) could be clarified: only fields directly extracted by GPT-4o are assessed, while post-processed or derived fields are excluded. This distinction is important and should be explicitly stated.

The comparison with EM-DAT, reported in Section 5.4, is extremely useful but would benefit from a clearer explanation. The text should specify that the event-by-event comparison involves the main quantitative impact fields (deaths, injuries, damage, and homeless when available), and describe how events were matched between the two databases and how the percentage-difference classes (±10 %, ±30 %, ±50 %) were defined. Figure 10 (a–d) effectively summarizes these differences, but the caption and text should explicitly explain the calculation, the meaning of the color scheme (blue = Wikimpacts < EM-DAT; red = Wikimpacts > EM-DAT), and the interpretation of the observed discrepancies, particularly the large deviations for damage data.
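For instance, the percentage differences and their classes could be documented with a short sketch like the one below (the sign convention, here relative to the EM-DAT value, and the thresholds are my assumptions about what the figure shows):

```python
# Sketch: binning a Wikimpacts-vs-EM-DAT difference into the ±10/30/50 %
# classes mentioned in the text. Negative values mean Wikimpacts < EM-DAT.
def percent_difference(wikimpacts: float, emdat: float) -> float:
    return 100.0 * (wikimpacts - emdat) / emdat

def difference_class(diff: float) -> str:
    for bound in (10, 30, 50):
        if abs(diff) <= bound:
            return f"within ±{bound} %"
    return "beyond ±50 %"

print(difference_class(percent_difference(1200, 1000)))  # -> 'within ±30 %'
```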
Subsection 6.1.1 highlights the main sources of error at L1. To improve readability, the authors could:
• more clearly separate systematic error types from specific illustrative examples;
• explicitly link these error types to the corresponding error magnitudes in Table 6.
Subsection 6.1.2 provides a clear explanation for why error rates increase at L2 and L3.
Table 9 is central to this discussion, but its current placement after section 6.1.1 disrupts the logical flow. It should be moved to follow section 6.1.2. The distinction between “LLM” (model-generated output) and “Gold” (manually annotated reference) is introduced earlier in the methodology, but briefly restating it here would improve clarity. A brief clarification of the NULL penalty mechanism would also enhance comprehension.
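A compact statement of one plausible NULL-penalty convention, offered only as an illustration of what such a clarification might look like (the actual rule used in the paper may differ):

```python
# Sketch: if exactly one of the model output ("LLM") and the reference
# ("Gold") is missing, assign the maximal error of 1; if both are missing,
# count the field as correct; otherwise defer to the numeric error metric.
def field_error(llm_value, gold_value, error_fn) -> float:
    if llm_value is None and gold_value is None:
        return 0.0  # both NULL: agreement
    if llm_value is None or gold_value is None:
        return 1.0  # one-sided NULL: full penalty
    return error_fn(llm_value, gold_value)

print(field_error(None, 120, lambda a, b: abs(a - b) / max(a, b)))  # -> 1.0
```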
The consolidation step in Section 6.1.3 is presented as resolving many of the issues identified in 6.1.1 and 6.1.2, but the explanation is very brief. Since consolidation plays an important role in improving the final database, it would be useful to include at least one concrete example showing how an error is corrected during this process.
The qualitative comparison with EM-DAT presented in Section 6.2 is relevant and adds useful context regarding differences in structure, granularity and traceability. However, Section 6.2 does not introduce enough additional material to justify a standalone subsection titled “Comparison with existing impact databases,” especially since EM-DAT is the only database discussed. To improve coherence and avoid fragmentation, the authors might consider integrating this discussion directly into Section 5.4, making the EM-DAT comparison a single unified block, or expanding the subsection to briefly include other relevant databases (e.g., DesInventar), thereby matching the broader scope suggested by the title.
The limitations described in Section 6.3 are clearly articulated, particularly those related to the reliance on English-language Wikipedia and the uneven availability of impact information across regions and hazard types. However, the section could be further strengthened by briefly explaining the practical implications of these limitations for potential uses of the database (e.g., risk modelling, regional comparisons, or loss-and-damage assessments). Even a short statement would help readers understand how these constraints may affect downstream analyses.

The conclusions effectively summarize the structure and overall objectives of the Wikimpacts 1.0 database, but they do not explicitly acknowledge the main limitations discussed in Section 6.3, nor do they highlight potential future applications of the database (e.g., climate risk modeling, vulnerability assessments, benchmarking of other impact datasets).
Furthermore, although the conclusion states that Wikimpacts 1.0 presents several innovations compared to the state of the art, these innovations are not explicitly listed.
Comments on figures and tables
Several figures and tables could be improved.
Figures
• Figure 2 illustrates the three database levels (L1, L2, L3), but its current form is overly schematic and provides limited insight into the actual relational structure of the database. The figure does not clarify the type or direction of the relationships between levels (e.g., 1:N, N:N), nor does it explain the meaning of the arrows or how information is propagated or linked across levels. Improving the figure (e.g., clearer arrows, explicit relationship types, distinct fields at each level) and adding a more detailed explanation in the text would significantly help the reader understand the architecture of Wikimpacts 1.0.
• In Figures 7a, 7c, 7f and 8a–c, some color ranges are visually indistinguishable (e.g., 80–160 vs 160–350; 20–40 vs 40–75; 5–10 vs 10–15; 50–100 vs 100–200) and should be adjusted for better contrast.
• In Figure 9a, the legend appears to refer only to storm events, although the text suggests that both tropical and extratropical categories are included; the legend and caption should be revised accordingly.
Tables
• Table 1 presents the main information categories (base, temporal, spatial, impact, etc.) and their fields, but it is lengthy and not clearly referenced in the text. It should be explicitly cited and briefly discussed when introduced.
• Table 2 is informative but visually dense. Reformatting (e.g., grouping hazards by type or using multi-column or bullet layouts) would enhance clarity.
• Table 3 documents each database field, specifying data type, value format, mandatory status, and applicable level (L1–L3). While technically useful, its role relative to Table 1 is not explained. The authors should clarify that Table 1 summarizes data categories, whereas Table 3 provides the technical specifications.
Citation: https://doi.org/10.5194/egusphere-2025-4891-RC2
RC1: 'Comment on egusphere-2025-4891', Anonymous Referee #1, 17 Nov 2025
General assessment
The topic is timely and relevant. A global, open, LLM-based impact database is of clear interest to NHESS readers. The manuscript reads as a data-and-methods paper describing a new dataset, its structure, extraction pipeline, evaluation, and example applications. This fits well within the journal’s scope for data-oriented contributions. I recommend major revisions to improve clarity, strengthen the discussion of limitations, and help users understand how to interpret and apply the dataset.
Major practical comment on the database: event–article mapping and potential misclassification
While inspecting the public database, I noticed several cases where impacts appear to be drawn from broad multi-event Wikipedia pages (e.g. “Tropical cyclones in 2017”) even when a dedicated single-event article exists (e.g. “Cyclone Numa”). This can lead, and in the current release does lead, to incorrect country lists, the inclusion of non-impacted areas, or duplicate entries for the same event. The current manuscript does not fully explain how cross-references within multi-event articles are handled (for example, when one system contributes to another), nor how potential double-counting or mis-allocation of impacts is prevented. I recommend adding a subsection clarifying the filtering logic, giving examples of typical failure modes, and explaining whether any automatic or rule-based deduplication is applied during consolidation.
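To make the recommendation concrete, a rule-based preference of the kind I have in mind might look like the sketch below (the title patterns and matching heuristic are illustrative only, not a proposal for the authors' exact implementation):

```python
# Sketch: when both a dedicated single-event article and a broad multi-event
# page mention the same event, keep only the dedicated article.
MULTI_EVENT_PATTERNS = ("tropical cyclones in", "season", "list of")

def prefer_dedicated_articles(event_name: str, articles: list[str]) -> list[str]:
    dedicated = [a for a in articles
                 if event_name.lower() in a.lower()
                 and not any(p in a.lower() for p in MULTI_EVENT_PATTERNS)]
    return dedicated if dedicated else articles

print(prefer_dedicated_articles("Cyclone Numa",
                                ["Tropical cyclones in 2017", "Cyclone Numa"]))
# -> ['Cyclone Numa']
```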
Specific comments on the article
Regarding the role of Wikimpacts 1.0 relative to existing datasets, it would help to clarify early in the Introduction whether the authors position this work as complementary to curated impact databases such as EM-DAT and DesInventar, or as an alternative source. It would also be useful to specify for which types of analyses the dataset is most appropriate (for example, global multi-hazard comparisons or exploratory sub-national studies) and which applications require caution (such as completeness-sensitive accounting of national losses or other variables).
Coverage and Wikipedia reliance: the requirement that an English Wikipedia article must exist implies notability and language biases. Small-scale or local events, or those in regions with limited Wikipedia activity, are surely under-represented. I suggest adding a short paragraph addressing this and explaining how users should interpret the absence of events. This also helps contextualise the patterns shown in Fig. 4.
L1-L3 definitions (Sect. 2 and Sect. 3.3.3): the manuscript would benefit from a clearer and more consistent description of what “location” means at each level. At L80, levels are defined as event (L1), country (L2), and sub-national (L3). Later wording in Sect. 3.3.3 is less precise. Restating the definitions once, with consistent terminology, will help users interpret the later evaluation of location accuracy.
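For example, the nesting could be restated schematically, as in the sketch below (hypothetical field names, illustrating 1:N relations from event to country to sub-national unit):

```python
from dataclasses import dataclass, field

@dataclass
class SubnationalImpact:   # L3: one sub-national administrative unit
    region: str
    deaths: int | None = None

@dataclass
class CountryImpact:       # L2: one country, holding 1:N sub-national records
    country: str
    deaths: int | None = None
    subnational: list[SubnationalImpact] = field(default_factory=list)

@dataclass
class Event:               # L1: the whole event, holding 1:N country records
    name: str
    deaths: int | None = None
    countries: list[CountryImpact] = field(default_factory=list)
```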
Hazard types (Sect. 3.4): the exclusion of events that cannot be mapped to the seven main hazard categories, such as landslides, should be made explicit as a limitation. It would help readers to understand whether landslide impacts are entirely lost or whether some are absorbed under parent storm or flood events.
Phrase “extensive spatio-temporal coverage” (L58): this wording may overstate completeness given Wikipedia's known biases. Moderating this statement or directing the reader to the evaluation and limitations sections would avoid possible misinterpretation.
Regarding the example of the 2011 European floods, the manuscript notes that the main event was categorised as an extratropical cyclone, although a flood category may be more appropriate. This is a useful illustration of hazard-type ambiguity. It would help if the authors commented briefly on how common such cases are and whether simple rule-based corrections might reduce them.
Temporal coverage: the database extends from 1034 to 2024. The sharp rise in event counts after the 19th century makes it clear that older entries are sparse. Adding one sentence advising users how to interpret pre-1900 records (highly incomplete, not suitable for quantitative trend analysis) would improve clarity.
Evaluation (Sect. 4): its structure and intent are valuable. It would help the reader if Sect. 3 indicated that the extraction quality is assessed in Sect. 4. Furthermore, Sect. 4 would benefit from a short explanation of how the 70 + 156 gold-standard events were sampled (randomly or stratified), as this may affect the generalisability of the error rates.
Regarding the interpretation of field-specific error rates, Table 6 reveals that location has a much higher error rate than other fields. The manuscript would be strengthened by explaining what types of errors dominate (for example, administrative-level mismatches, NULL penalties, or coordinate issues) and by offering guidance on how users should interpret L2 and L3 location fields. A short paragraph identifying which extracted fields are robust for most applications and which require caution would be particularly helpful.
The sentence “Within L1, event and timing data are highly accurate, while location data is less robust,” is not intuitive because L1 represents the aggregated event level. It would be useful to clarify what “location” means at L1 and why its accuracy is lower.
Comparison with EM-DAT: the manuscript would be improved by a more cautious framing of discrepancies. Differences may arise not only from extraction errors but also from different event definitions, thresholds, and loss components. I recommend explicitly positioning Wikimpacts as complementary to curated sources and offering guidance on how users might use them together.
Fig. 4: The distribution of events confirms that Wikimpacts 1.0 reflects high-impact, media-visible disasters rather than a full record of climate extremes. Making this explicit in the Discussion would prevent users from interpreting the dataset as complete.
Discussion: The existing section is strong, but could more directly address certain limitations:
• notability and language biases linked to Wikipedia;
• the deliberate exclusion of particular hazards (for example, landslides) and implications for multi-hazard studies;
• systematic weaknesses in LLM extraction for multi-country or compound hazards beyond aggregated error rates.
A short concluding paragraph offering explicit user guidance, indicating suitable and unsuitable use cases, would be valuable. For example, the database appears well-suited for global comparative studies, exploratory sub-national analyses where data exist, or cross-hazard synthesis, but less suitable for completeness-sensitive applications or studies focused on small, local events.
Minor comments
Regarding language and clarity, a few specific examples illustrate areas where editing would improve readability:
• The phrase “administrative units at the same level can also be highly variable” (L35) needs clarification as to which kind of variability matters for impact analysis.
• The sentence beginning “Due to the categorization based on single hazards” (L40) would benefit from rephrasing for grammar and conceptual clarity.
• The reference to DesInventar at L42 would be clearer if the specific shared limitations were briefly stated.
• The last part of the Introduction (L55–70) blends methodological details that belong in Methods or Data Availability. Ending the Introduction with a clearer statement of objectives and contributions would strengthen the structure.
In Section 2, the opening paragraph focuses on repository and accessibility information rather than internal structure. Moving that material to a Data Availability section would allow Section 2 to begin more directly with the conceptual design. Similarly, some technical field-definition details could move to Supplementary Information.
Regarding abbreviations, SI at L88 should be defined on first use.
Regarding the example referring to 2025 Wikipedia data (L89), the manuscript could clarify how information “as of 2025” was obtained when the mining cut-off appears to be 2024. A brief explanation would prevent confusion. Consider a relevant note in Figure 1.