This work is distributed under the Creative Commons Attribution 4.0 License.
From manual classification to large language models: assessing the quality and consistency of historical convective event records
Abstract. Historical text sources represent a central, yet methodologically challenging basis for the reconstruction of convective weather events. This study examines the extent to which historical reports on thunderstorms and hailstorms contain reliable climatological information, despite heterogeneous sources, varying degrees of detail and linguistic diversity. Based on a corpus prepared using source criticism, qualitative descriptions are converted into structured evidence levels and intensity classes and analysed using statistical methods and a multilingual BERT language model.
The reconstructed time series show a distinctly stable seasonal signal with a dominant summer maximum that occurs independently of fluctuations in source density and is consistent both in the overall series and in a dense observation window. A comparison with modern observation data from the German Weather Service and with independent historical reconstructions shows a high degree of agreement in seasonal patterns despite different survey methods and time periods. Analysis of the intensity classes also shows that historical sources do not primarily document extreme events, but rather reflect a physically plausible ranking of event strengths.
The results of the automated classification demonstrate that the language model reliably reproduces seasonal and intensity-related patterns and implicitly captures source-specific reporting patterns without levelling them. Overall, the study shows that AI-supported methods can extract robust climatological information from historical texts when the sources are processed with methodological rigour, thus opening up new perspectives for quantitative historical climate research.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2026-199', Anonymous Referee #1, 06 Mar 2026
AC1: 'Reply on RC1', Franck Schätz, 23 Mar 2026
We would like to thank the reviewer for their thorough and constructive assessment. The comments have helped us to improve the manuscript. A recurring theme across several comments concerns the preprocessing pipeline and how the structured input to the classification procedure is generated from raw historical sources. We address this systematically below and refer the reader to our response to Comment 2, where we describe our general approach to this issue.
Comment 1 — Title
While we acknowledge that the term "Large Language Model" (LLM) is increasingly associated with generative, decoder-based architectures in contemporary discourse, its use in the scientific literature has not been strictly limited to this definition. Encoder-only models such as BERT and its derivatives have been widely referred to as LLMs in peer-reviewed publications, given that they are large-scale neural language models pre-trained on extensive text corpora.
Nevertheless, we appreciate that the terminology may cause confusion for some readers. To avoid ambiguity and ensure terminological precision, we have revised the title to:
"From manual classification to transformer-based language models: assessing the quality and consistency of historical convective event records"
We believe this formulation more accurately reflects the architecture employed while preserving the intended contrast between traditional manual approaches and modern deep learning methods.
Comment 2 — HISKLID / tambora.org / Self-contained nature of the article
We note that the key characteristics of the HISKLID data collection and the tambora.org research environment are already described in Section 2. However, we agree that the article would benefit from a clearer explanation of how the structured input to the classification procedure is generated from raw historical sources.
The present article deliberately focuses on one specific, well-defined step within a source evaluation workflow described in Schatz (2023): the intensity classification of convective events. This workflow comprises the following main stages: Historical-Climatological Query, Event Search, Source Characterisation, Specification - Origin, Biographical Level, Transformation, Contextualisation and Quantification, and Plausibility Check. Each stage is further decomposed into formally defined individual processing steps, each with explicit input and output domains, as documented in full in Schatz (2023). The classification procedure described in this article constitutes the Contextualisation and Quantification stage. The sole input required for this step is the normalised quotation produced by the preceding Transformation stage, which encompasses quote extraction, linguistic normalisation, temporal standardisation, and spatial georeferencing. A compact schematic figure of the workflow will be added to the revised manuscript (see Figure A in the attachment).
We will additionally acknowledge the inherent limitations of any historical corpus: the collection constitutes a sample of an unknown total population of sources, and representativeness cannot be established in the statistical sense. What can be demonstrated, and what the seasonal analysis in this study confirms, is that the corpus exhibits physically plausible and internally consistent patterns that are robust across different subsets of the data, different time windows, and independent comparative series. This consistency is the most meaningful available proxy for representativeness in a historical climatological context.
Comment 3 — Text examples / Figure 1
We agree that concrete text examples would substantially improve the accessibility and transparency of the methodology. We will include a small selection of representative quotations in the main text, presented in three columns: original historical language, normalised German, and English translation. These examples will illustrate the linguistic variability of the corpus, spanning Middle High German to Early New High German, and demonstrate how the classification procedure is applied in practice, including the assignment of evidence classes (C1 to C3) and intensity levels. An annotated example is provided at the end of this response document (see Annotated Example).
We also agree that Figure 1 requires more contextualisation and will expand the accompanying text accordingly. The documentation of authors' occupational backgrounds serves two complementary purposes. First, it enables a source-critical assessment of potential reporting biases: different professional groups observe and describe weather phenomena through different lenses. As demonstrated in Section 4.3, the chi-square analysis confirms that source type significantly influences intensity classification (e.g., chronicles overrepresenting severe events, weather records overrepresenting mild ones). The occupational profile of authors is directly linked to the type of document they produced, and therefore to these systematic biases.
Second, the occupational metadata establishes a basis for comparison with early instrumental observation networks of the 18th and 19th centuries. Early meteorological stations were frequently staffed not by professional meteorologists but by educated laypeople in civic roles, such as rectors, engineers, and works supervisors, as documented for example in the Physikatsberichte of 1869, the Lamont station network, and the meteorological stations of the Kingdom of Bavaria. The occupational spectrum documented in Figure 1 reflects a structurally comparable observer community, suggesting a degree of methodological continuity between pre-instrumental historical records and early systematic observation networks.
Comment 4 — Manual normalisation
Normalisation focused on lexical and orthographic standardisation rather than grammatical modernisation. Idiomatic expressions were retained in their original form to preserve semantic authenticity. Obsolete or dialect-specific vocabulary was resolved using the Wörterbuchnetz (covering Early New High German and New High German), with additional cross-referencing of sources from comparable dialectal and temporal contexts where individual terms remained ambiguous. Semantic shifts in weather-related terminology were addressed carefully, ensuring that terms were interpreted according to their period-specific meaning rather than their modern equivalent.
For temporal references, dating was standardised using Grotefend's reference work on historical calendars, including saints' days, feast days, and semi-automated calculation of moveable feasts such as Easter. Historical place names were resolved against GeoNames to enable spatial georeferencing.
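The semi-automated computation of moveable feasts mentioned above can be illustrated with a short sketch. This is not the workflow's actual implementation but a minimal example using the standard anonymous Gregorian computus; it is valid only for Gregorian-calendar dates (from 1582 onwards), and Julian-calendar dates in the earlier part of the corpus would require the Julian computus instead:

```python
from datetime import date

def gregorian_easter(year: int) -> date:
    """Easter Sunday for a given year (anonymous Gregorian computus).

    Valid for the Gregorian calendar only; pre-1582 Julian dates
    require the Julian computus instead.
    """
    a = year % 19                       # position in the 19-year Metonic cycle
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30  # epact-based offset of the paschal full moon
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451    # correction for late full moons
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

# Other moveable feasts follow from Easter by fixed offsets,
# e.g. Pentecost (Whitsun) falls 49 days later.
```

From the resulting anchor date, the period-specific feast-day references catalogued in Grotefend can then be converted to calendar dates.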
Quality assurance was ensured through iterative cross-checking of key event vocabulary, covering the core terms used for thunderstorms, thunder, hail, wind, snow and rain, against the compiled silver label lists. Problematic cases were rare and were resolved through systematic source comparison. No grammatical normalisation was applied, as the classification procedure operates at the lexical-semantic level and does not require syntactic standardisation.
These preprocessing steps correspond to the Transformation stage of the workflow described in Schatz (2023) and, as outlined in our response to Comment 2, will be made explicit in the compact workflow figure (see Figure A) to be added to the revised manuscript.
Comment 5 — Ehrmanntraut (2025)
While we indeed cite Ehrmanntraut (2025) as a methodological reference for transformer-based text normalisation, the system was not employed for the automatic normalisation of our corpus, for the following reasons.
The system proposed by Ehrmanntraut (2025) was specifically trained on the DTA EvalCorpus, a parallel corpus of German literary texts (novels, theatre plays, novellas, etc.) from c. 1780 to 1901. Our corpus, in contrast, consists of historical records of convective weather events, a highly domain-specific register that differs substantially from literary prose in both vocabulary and linguistic structure. Domain mismatch of this kind is well known to degrade normalisation performance, as Ehrmanntraut (2025) himself acknowledges when discussing the generalisability of such systems to other registers and datasets.
We therefore conducted the normalisation manually, which allowed us to ensure terminological consistency for the domain-specific vocabulary present in our records. The citation of Ehrmanntraut (2025) was intended to situate our approach within the broader methodological literature on historical text normalisation, and to confirm that transformer-based normalisation has been validated for comparable historical German material.
Furthermore, our normalisation task extended beyond purely linguistic standardisation to encompass temporal dating and spatial georeferencing, both of which are indispensable for historical climatological analysis but lie outside the scope of systems such as Ehrmanntraut (2025). As outlined in our response to Comment 2, these dimensions will be made explicit in the compact workflow figure (see Figure A) to be added to the revised manuscript.
A further practical limitation is that a substantial portion of our corpus predates 1700, which lies outside the temporal coverage of the DTA EvalCorpus on which Ehrmanntraut (2025) was trained. Applying an automated normalisation system to part of the corpus while normalising the remainder manually would introduce systematic inconsistencies in the lexical basis, which would in turn compromise the reliability of the silver label matching and the subsequent classification. A uniform manual approach was therefore not only preferable but necessary to ensure corpus-wide consistency.
Comment 6 — Figure 2 (events per source)
We thank the reviewer for this suggestion. However, a normalised representation of events per source is not straightforward in this corpus, as many sources are compilations or critical editions that aggregate observations spanning multiple years, decades, or even centuries. A simple events-per-source ratio would therefore be misleading rather than informative, as it would conflate fundamentally different types of sources with very different temporal scopes.
Furthermore, the distribution of event density over time is already addressed in the manuscript. As Figure 9 clearly shows, the density of documented events increases progressively over the centuries, which is consistent with the growth in source availability documented in Figure 2. The calendar heat maps in Figure 9 provide a more meaningful representation of this temporal distribution than a normalised events-per-source ratio would.
As an alternative, we could provide a breakdown of event counts by language stage (Middle High German, Early New High German, New High German) or by century, which would offer additional context on the temporal distribution of the corpus without the interpretive problems associated with a per-source normalisation. We will add this information either as a supplementary panel to Figure 2 or as a brief descriptive statement in Section 2.
Comment 7 — Workflow and classification procedure
The present article focuses exclusively on the Contextualisation and Quantification stage of a source evaluation workflow described in Schatz (2023), of which the intensity classification of convective events is the central component. The figure presents a simplified overview of the main stages of the workflow as applied in this study (see Figure X). The order of individual processing steps may vary depending on the characteristics of the source. The underlying theoretical framework, including the full formalisation of individual processing steps and their ordering conditions, is documented in Schatz (2023).
The sole input to the classification step described in this article is the normalised quotation produced by the preceding Transformation stage. The compact schematic figure (see Figure X) will also clarify the relationship between the workflow described in Schatz (2023) and the multi-stage classification process described at line 87, which refers to the classification sub-process within the Contextualisation and Quantification stage.
Comment 8 — Silver Labels / inductive development
We respectfully note that this question is already addressed in considerable detail in the manuscript. The vocabularies, referred to as Silver Labels, are not unstructured word lists but formally defined, rule-based classification resources derived systematically from the corpus. Their structure, including the three evidence classes (C1: direct and specific, C2: indirect, C3: relative/qualitative), is described in Section 3.2, and the complete vocabularies including all recorded variants and regular expression patterns are documented in full in Appendices B1 to B4.
The inductive development refers to the fact that candidate terms were identified directly from the historical source material rather than imposed from external sources, ensuring domain specificity. Iterative checking means that the vocabulary was refined through repeated application to the corpus, with ambiguous or borderline cases resolved through cross-referencing with sources from comparable dialectal and temporal contexts.
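As an illustration only, a rule-based silver-label lookup of this kind can be sketched as follows. The pattern lists and class assignments below are hypothetical, heavily abridged stand-ins; the actual vocabularies, regular expressions, and evidence-class definitions are those documented in Appendices B1 to B4:

```python
import re

# Hypothetical, abridged silver-label resource: each entry maps a regex
# pattern to a (category, evidence class) pair. The real vocabularies are
# documented in Appendices B1-B4 of the manuscript.
SILVER_LABELS = [
    (re.compile(r"\bwolkenbruch\b"), ("Rain (class 3)", "C3")),
    (re.compile(r"\bunter wasser\b|\bins wasser gesetzt\b"), ("Rain (class 3)", "C2")),
    (re.compile(r"\bhagelstein\w*\b"), ("Hail (evidence)", "C3")),
    (re.compile(r"\bbäume .{0,30}gerissen\b"), ("Wind (class 3)", "C2")),
]

def match_silver_labels(text: str):
    """Return all (matched span, category, evidence class) hits in a quotation."""
    text = text.lower()
    hits = []
    for pattern, (category, evidence) in SILVER_LABELS:
        for m in pattern.finditer(text):
            hits.append((m.group(0), category, evidence))
    return hits

hits = match_silver_labels(
    "ist zugleich ein Wolkenbruch gewesen, dadurch Häuser ins Wasser gesetzt worden"
)
```

Every hit is traceable to exactly one documented pattern, which is what makes the procedure auditable in the sense discussed below.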
Regarding subjectivity and ambiguity, we refer the reviewer to our general response on subjectivity below, which addresses this concern more fundamentally. A concrete annotated example is additionally provided at the end of this document (see Annotated Example).
Comment 9 — Test dataset / data availability
We respectfully note that both points raised in this comment are already addressed in the manuscript. The constitution and sampling of the test dataset are described in Section 3.5: the dataset was divided into training, validation and test portions in a ratio of 70:10:20, with stratified splitting to preserve class distributions.
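The stratified 70:10:20 split described in Section 3.5 can be sketched in a few lines. This is a generic illustration with a hypothetical label list, not the code used in the study:

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Split item indices into train/val/test, preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    splits = ([], [], [])  # train, validation, test index lists
    for indices in by_class.values():
        rng.shuffle(indices)          # shuffle within each class only
        n = len(indices)
        n_train = round(fractions[0] * n)
        n_val = round(fractions[1] * n)
        splits[0].extend(indices[:n_train])
        splits[1].extend(indices[n_train:n_train + n_val])
        splits[2].extend(indices[n_train + n_val:])
    return splits

labels = ["C1"] * 10 + ["C2"] * 20 + ["C3"] * 70  # hypothetical class labels
train, val, test = stratified_split(labels)
```

Because the shuffling and splitting happen per class, each portion inherits the overall class distribution, which is the property the held-out evaluation relies on.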
Regarding data and model availability, we refer the reviewer to page 24, lines 418ff., where all relevant resources are explicitly referenced and linked.
On the question of subjectivity
We would like to address the recurring concern about subjectivity in a more fundamental way. Although the term "subjectivity" is often used in the context of text-based classification, we argue that it requires more precise differentiation. Any transition from natural language to quantified categories presupposes linguistic understanding and contextual interpretation. This applies to manual classification, rule-based NLP systems and probabilistic modelling approaches equally. The relevant question is therefore not whether interpretation is involved, but whether that interpretation is transparent, operationally defined and reproducible.
In text-based research, complete objectivity is neither achievable nor a meaningful standard. A more appropriate benchmark is intersubjective traceability, defined as the extent to which interpretive decisions are made explicit, documented, and open to scrutiny. The formalisation developed in Schatz (2023) meets this standard precisely by decomposing the source evaluation workflow into formally defined processing steps with explicit input and output domains, making interpretive decisions visible and traceable rather than leaving them implicit.
It is also worth noting that quantitative approaches are not exempt from this principle. Feature selection, model architecture, and parameter choices all involve interpretive decisions that are frequently less explicitly documented than the classification procedure described here. This applies in particular to approaches that treat words or phrases as context-independent units, an assumption that is difficult to sustain for historical texts, where meaning is strongly tied to linguistic register, temporal context, and idiomatic usage.
Crucially, words and phrases in historical sources can only be understood in context. This principle motivated both the design of the evidence classes (C1 to C3) and the choice of a transformer-based language model, which captures semantic relationships within a given context rather than treating words as independent of their surroundings.
Ultimately, the criterion for any classification procedure is not the absence of interpretation but agreement with a validated standard. The high classification performance, stable seasonal patterns and consistent intensity hierarchies demonstrated in this study provide empirical confirmation that the formalised, context-sensitive approach yields reproducible, scientifically valid results, which we consider the most compelling response to concerns about subjectivity.
To further illustrate the transparency and reproducibility of the classification procedure, we include an annotated text example below (see Annotated Example), showing explicitly how individual terms are assigned to evidence classes and intensity levels based on formally defined rules. Ambiguous cases are explicitly flagged rather than resolved arbitrarily, demonstrating that the procedure meets the standard of intersubjective traceability described above.
Annotated Example: Cologne, 25 August (historical newspaper)
Source text (normalised German):
Köln, vom 25. August. Vergangenen Freitag hat man in dieser Gegend ein schreckliches Donnerwetter gehabt [...] ist zugleich ein Wolkenbruch gewesen, dadurch den Dörfer Schevren und Häuser ins Wasser gesetzt worden [...] In der Gegend Woringen seien Hagelstein eines Ei groß gefallen, alle Fenstern, Hafer völlig zerschlagen. Zu Hurt hat er Wind die Bäume aus der Erden gerissen, auch der Donnerschlag im bergischen Land 30 hin- und wiedergelegene Häuser in Brand geschlagen [...] Zu Würzburg hat das Donnerwetter das Magazin- oder Munitionshaus auf dem Schlossberg aus dem Grund geschlagen, alle Bäume mit den Wurzeln ausgerissen, die umliegenden Weinberge und Felder ganz ruiniert [...]
English translation:
Cologne, 25 August. Last Friday a terrible thunderstorm struck this region [...] at the same time there was a cloudburst, by which the villages of Schevren and houses were put under water [...] In the region of Woringen hailstones the size of an egg are said to have fallen, smashing all windows and the oats completely. At Hurt the wind tore the trees out of the earth; the thunderclap also set fire to 30 scattered houses in the Bergisches Land [...] At Würzburg the thunderstorm razed the magazine or munitions house on the Schlossberg to the ground, tore out all the trees by the roots, and completely ruined the surrounding vineyards and fields [...]
Annotation overview:
| Text passage | Category | Evidence class | Justification |
|---|---|---|---|
| Donnerwetter (x2) | Thunderstorm | – | thunderstorm_class |
| Wolkenbruch | Rain (class 3) | C3 | Pattern: wolkenbruch |
| Häuser ins Wasser gesetzt | Rain (class 3) | C2 | Pattern: unter wasser |
| Schaffe ersoffen | Rain (class 3) | C2 | Regex: ersoffen |
| Hagelstein | Hail (evidence) | C3 | Direct reference |
| eines Ei groß | Hail | C1 (direct size) | ei in direct_size_terms |
| alle Fenstern, Hafer völlig zerschlagen | Hail | C3 (indirect damage) | fenster eingeschlagen, erschlagen |
| Wind die Bäume aus der Erden gerissen | Wind (class 3) | C2 | Pattern: bäume ausgerissen |
| Donnerschlag | Lightning strike | – | lightning_strike |
| Häuser in Brand geschlagen | Lightning strike | C2 | lightning_strike |
| aus dem Grund geschlagen, alle Bäume mit den Wurzeln ausgerissen | Wind (class 3) | C2 | Pattern: wurzel gerissen, bäume ausgerissen |
| Weinberge und Felder ganz ruiniert | Hail | C3 (indirect damage) | ruiniert in indirect_size_terms* |
* Note: Weinberge und Felder ganz ruiniert refers to the Würzburg thunderstorm event. The cause (hail vs. wind/lightning) is not unambiguous from the text; however, the pattern ruiniert is automatically classified as Hail C3 based on the silver label rules. This case illustrates how the procedure handles ambiguity transparently: the rule is applied consistently, and the uncertainty is documented explicitly.
This example demonstrates that each classification decision is traceable to a formally defined rule. The silver label lists (Appendices B1-B4) document all patterns, regular expressions, and word lists used. Ambiguous cases are flagged and explained rather than silently resolved, ensuring full transparency of the classification process.
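The flag-rather-than-resolve behaviour described above can be sketched as follows. This is an illustrative simplification, not the study's code; the `AMBIGUOUS_PATTERNS` set and the rule names are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical set of patterns known to admit more than one physical cause
# (e.g. "ruiniert" may follow hail, wind, or lightning damage).
AMBIGUOUS_PATTERNS = {"ruiniert"}

@dataclass
class Annotation:
    passage: str
    category: str
    evidence_class: str
    rule: str
    ambiguous: bool = False
    note: str = ""

def annotate(passage: str, category: str, evidence_class: str, rule: str) -> Annotation:
    """Apply a silver-label rule consistently; flag ambiguity instead of resolving it."""
    ann = Annotation(passage, category, evidence_class, rule)
    if rule in AMBIGUOUS_PATTERNS:
        ann.ambiguous = True
        ann.note = ("Cause not unambiguous from the text; rule applied "
                    "consistently, uncertainty documented explicitly.")
    return ann

ann = annotate("Weinberge und Felder ganz ruiniert", "Hail", "C3", "ruiniert")
```

The rule still fires deterministically, but the ambiguity flag travels with the annotation, so downstream analyses can exclude or weight such cases.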
RC2: 'Comment on egusphere-2026-199', Anonymous Referee #2, 17 Mar 2026
This manuscript presents an innovative and timely approach for extracting climatological information from historical documentary sources using a combination of source-critical analysis, formal classification, and machine learning. The integration of linguistic interpretation with quantitative climatological validation is particularly noteworthy and represents a valuable contribution to historical climate research.
The study is well structured, methodologically ambitious, and demonstrates promising results, especially regarding the reconstruction of physically plausible seasonal patterns and the application of large language models for automated classification.
However, several fundamental methodological issues need to be addressed before the manuscript can be considered for publication. In particular, concerns related to the definition of ground truth, potential circular validation, treatment of uncertainty, and source-related biases limit the interpretability and robustness of the conclusions.
Citation: https://doi.org/10.5194/egusphere-2026-199-RC2
AC2: 'Reply on RC2', Franck Schätz, 23 Mar 2026
We thank the reviewer for their careful and constructive assessment and for the positive evaluation of the methodological approach.
The concerns raised — ground truth definition, circular validation, treatment of uncertainty, and source-related biases — are addressed in detail in our responses to Reviewer 1, to which we refer for the full argumentation. In summary: the evidence classes (C1–C3) provide an explicit, rule-based treatment of uncertainty; source-related biases are quantified and discussed in Section 4.3 using chi-square analysis and Cramér's V; and the question of subjectivity and intersubjective traceability is addressed in a dedicated section of the reviewer responses.
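For readers unfamiliar with the effect-size measure mentioned above, Cramér's V follows directly from the chi-square statistic of a contingency table. The sketch below is a generic illustration with invented counts, not the study's analysis (which, as noted, is reported in Section 4.3):

```python
from math import sqrt

def cramers_v(table):
    """Cramér's V for an r x c contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))      # min(r, c)
    return sqrt(chi2 / (n * (k - 1)))       # 0 = independence, 1 = perfect association

# Hypothetical counts: source types (rows) x intensity classes (columns)
v = cramers_v([[40, 30, 10], [15, 35, 50]])
```

V is bounded in [0, 1] regardless of table size, which is what makes it suitable for comparing the strength of source-type effects across differently sized cross-tabulations.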
Regarding the concern about circular validation specifically: the manual classifications used for fine-tuning the language model and the statistical validation of seasonal patterns are based on the same corpus, but address fundamentally different aspects of the data. The seasonal analysis validates the physical plausibility of the corpus as a whole — it does not validate the model output. The model is evaluated exclusively on a held-out test set (20% of the labelled data, stratified, not used during training), ensuring that the classification performance reported reflects genuine generalisation rather than overfitting. The two validation steps are therefore methodologically independent.
Citation: https://doi.org/10.5194/egusphere-2026-199-AC2
This work represents a highly relevant research topic based on a sound methodology. It is very well worth publishing, but requires some improvements and clarifications. Most of them concern the pre-processing of the data and the evaluation of the results.