Evaluation of nine gridded daily weather reconstructions for the European heatwave summer of 1807
Abstract. Recent research on early instrumental measurements, combined with numerical-statistical techniques, has contributed to global atmospheric reanalyses as well as regional products that cover pre-1850 weather. The advent of machine learning (ML) raises the question of how well we can reconstruct weather from the distant past using both established and emerging approaches. Here, we evaluate nine such approaches for reproducing the daily weather during Europe's hot summer of 1807. The datasets examined include the Twentieth Century Reanalysis (20CR) and enhanced versions of it (via additional assimilation or dynamical downscaling), an analog resampling product, and ML reconstructions that use neural networks (along with video-inpainting methods or variational auto-encoders). Validation is based on early station measurements, documentary information, statistical diagnostics, and a semi-quantitative assessment of atmospheric flow.
We find that the summer of 1807 can be considered a prototype pre-industrial heatwave summer, with three extremely hot episodes and maximum temperatures exceeding 30–35 °C in Central Europe. Most approaches achieve mean correlations (of anomalies from the seasonal cycle) above 0.75 for temperature and centered root mean square error (cRMSE) values below 3 °C, though variability tends to be underestimated. This indicates overall robust reconstructions, given the distant past and the scarcity of underlying weather information. Skill scores for almost all reconstructions indicate that they reliably discriminate very hot from cooler (high-pressure from lower-pressure) conditions. Improved spatial skill with respect to 20CR for stations in Central and Northeastern Europe can be attributed to the increased influence of newly ingested weather information on the atmospheric reconstructions.
The atmospheric flow-aware approaches reproduce plausible large-scale features such as ridges of high pressure and associated belts of hot air, whereas data-driven ML approaches excel statistically in replicating station variability but often produce less realistic circulation patterns. The analog method yields balanced but less intense reconstructions, and the high-resolution dataset aligns best with heat intensities in the Alpine region.
Such trade-offs require users to choose between computational efficiency, statistical performance, and physically coherent circulation. Future developments need to address uncertainties in the early measurements. In turn, the analyses also emphasize the value of high-quality early weather records for producing and validating gridded reconstructions.
This manuscript presents a detailed and useful study of an important event: the summer of 1807 in Central Europe, a “prototype heatwave summer within a pre-industrial context”. The study’s aim is to critically compare nine different gridded reconstructions of this event, comprising the Twentieth Century Reanalysis (20CR) and enhanced versions of it, an analog resampling product, and several machine-learning (ML) reconstructions.
The authors identify hot periods in the 1807 summer using a variety of data sources: qualitative accounts from a contemporary observer, daily station temperature series, and reconstructions using the 20CR ensemble. The bulk of the paper is devoted to a thorough evaluation of the nine methods, using both statistical criteria (Taylor diagrams and three extreme-specific metrics) and a semi-quantitative analysis of the spatial plausibility of the reconstructed fields. Finally, the authors critically evaluate the quality of certain stations and consider how this affects each method differently.
Various approaches are now emerging for reconstructing historical weather, and the authors cite these in the introduction. The authors show that several of these methods can produce plausible reconstructions of the 1807 summer, a well-chosen case study that is of increasing significance in the current changing climate. The manuscript also provides a rigorous framework for comparing reconstruction methods that will be useful beyond this single case study. Overall, the authors present an impressive amount of information that will be of broad interest to readers of Climate of the Past. A highlight of the paper is the novel inclusion of reconstructions using a variational auto-encoder (VAE), and the authors find interesting differences between the VAE and the other ML approaches when compared to the physics-based reconstructions.
The paper is well organised and well written, and the figures are clear and helpful. The title is accurate and the abstract gives a good overview of the findings. I have two general comments and a few science comments below which I feel would improve the manuscript; however, these are fairly minor and I do not feel they amount to major edits. Otherwise, I am pleased to recommend the manuscript for publication in Climate of the Past.
The comments below are organised into two general comments, several specific science comments, and technical/typing comments.
General comments
Ensemble-means vs best members
In L379, the authors state that “CRB performs slightly better than CRM”. My interpretation of the results was the opposite – I thought it was interesting that the best member datasets (CRB and CPB) generally performed no better than their corresponding ensemble means (CRM and CRP respectively). I think L379 refers to Figs 5 and S3; here, CRM has slightly lower COR for ta, but higher for p. It’s also not clear that CRB has “more balanced variability” – this may be true for p, but for ta the SDR looks closer to 1 in CRM than in CRB. I interpreted Figs 6 and S4-S6 similarly, where the TSS scores show little improvement in the best members compared to ensemble means (CPB in S5 is an exception here). This is not a huge point, but I think it detracts from an important conclusion that the authors draw (e.g. L586 in the summary): that CRM is a good “mid-performance reference point” that is not easily beaten even by concatenating the best individual ensemble members. As 20CR is such a widely used dataset, I think it is important to highlight this result for other users.
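For clarity, the quantities I am comparing here are just the standard Taylor-diagram statistics; a minimal sketch in Python (hypothetical variable names, not taken from the authors' code) of how COR, the standard-deviation ratio (SDR) and cRMSE relate to a reconstruction and a station series:

```python
# Minimal sketch of the Taylor-diagram statistics referred to above
# (hypothetical inputs: two 1-D anomaly series of equal length).
import numpy as np

def taylor_stats(recon, obs):
    """Return COR, SDR and centred RMSE for two anomaly series."""
    ra = np.asarray(recon, dtype=float)
    oa = np.asarray(obs, dtype=float)
    ra, oa = ra - ra.mean(), oa - oa.mean()      # centre both series
    cor = np.corrcoef(ra, oa)[0, 1]              # correlation (COR)
    sdr = ra.std() / oa.std()                    # variability ratio (SDR)
    crmse = np.sqrt(np.mean((ra - oa) ** 2))     # centred RMSE
    return cor, sdr, crmse
```

An SDR closer to 1 is what I mean by “more balanced variability” above.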
Temporal evolution
The flow fields in Fig 7 and Figs S7-S8 show the reconstructed circulation at specific snapshots in time (a single day for Fig 7, a few days' average for S7-S8). But if we were interested in the development of a system over time, we would want the fields to change smoothly from day t to day t+1. Do the ML approaches show this property, or do you occasionally see unrealistic “jumps” between days? For example, if there are multiple local minima (circulation patterns) that the model could end up in, it could conceivably settle in different minima on successive days, since the fitting is done separately for each day. This does not appear to occur in present-day ML weather models, but I wondered if it is more of an issue for historical periods due to the sparsity of input observations – I would guess that, as the observational constraints become weaker, there are more possible circulation patterns that could fit the input at each timestep. In the VAE approach, for example, is the model constrained in any way to produce fields that are smooth in time?
I am not asking for any extra work to address this query – I am just interested to see if the authors noticed any differences between the methods here, or if I have misunderstood some aspect of the ML methods. If they noticed interesting differences, it might be a nice addition to their discussion of the strengths of each method.
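For concreteness, the kind of day-to-day continuity check I have in mind could be as simple as the following sketch (hypothetical array layout, not the authors' code): compute the RMS change between consecutive daily fields and flag unusually large jumps.

```python
# Sketch of a day-to-day continuity diagnostic for a reconstructed field
# (hypothetical layout: fields has shape (n_days, n_lat, n_lon)).
import numpy as np

def daily_jumps(fields):
    """RMS difference between consecutive daily fields (one value per transition)."""
    diffs = np.diff(fields, axis=0)
    return np.sqrt(np.mean(diffs ** 2, axis=(1, 2)))

def flag_jumps(fields, n_sigma=3.0):
    """Indices of transitions whose RMS change exceeds mean + n_sigma * std."""
    jumps = daily_jumps(fields)
    return np.where(jumps > jumps.mean() + n_sigma * jumps.std())[0]
```

Comparing such a statistic between the ML reconstructions and, for example, CRM might already show whether the separate-day fitting occasionally produces discontinuities.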
Specific comments
L181: What are the two periods used to calculate the temperature offset? Is the past period a single year (1807) or an average? The difference could be sensitive to the start year, so averaging over a period (e.g. 1800-1810) may be best.
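To make the suggestion concrete, the offset could be taken as a difference of multi-year means rather than of a single year; a minimal sketch with hypothetical inputs (the averaging window and reference years are only examples, not taken from the manuscript):

```python
# Sketch: temperature offset computed from multi-year means rather than a
# single year, to reduce sensitivity to the choice of start year.
import numpy as np

def temperature_offset(temps_by_year, past_years, reference_years):
    """temps_by_year: dict {year: array of daily temperatures}."""
    past_mean = np.mean([temps_by_year[y].mean() for y in past_years])
    ref_mean = np.mean([temps_by_year[y].mean() for y in reference_years])
    return past_mean - ref_mean

# e.g. offset = temperature_offset(temps, range(1800, 1811), range(1961, 1991))
```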
L381-384: “With a few exceptions, the plotting positions of the 20CR ensemble members (including the three members which produced the hottest temperatures for a certain heat episode, cf. Figure 4) are detached from the CRM” – I don’t think the 20CR ensemble members (x80) are shown in Fig 5 or Fig S3? Do you mean the best-members methods (CRB and CPB)?
“In fact, (relative) over-estimation of temperature and pressure in association with potentially lower correlation and higher cRMSE can be expected from the nature of 20CR members due to more distinct fields of temperature and pressure” What do you mean by “more distinct fields of temperature and pressure”? More distinct than what (the ensemble mean)?
L410: I was unsure how to interpret this sentence. Does it mean the average *across all methods* is better than 0.5 for each of the three scores? I interpret this to mean the dashed line is better than 0.5 (higher or lower depending on the score). But then what do the values of 0.25 and 0.75 refer to? Also, do the values in this sentence refer only to ta in Fig 5 (the values for p in S3 seem different)?
L431: I think this is a helpful summary of the performance of each method. It looks like the ordering follows the ordering of the TSS score in the lower right panels of Figs 6 and S4-6 – if so, it might be helpful to state that, e.g.: “Overall, the methods can be ranked by their TSS scores: TNN and VAE….” etc.
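For reference, and assuming the TSS here is the true skill statistic (hit rate minus false-alarm rate), the ranking could be made explicit from a calculation along these lines (hypothetical event definitions, not the authors' code):

```python
# Sketch of a TSS (true skill statistic) calculation for discriminating
# very hot from cooler days (boolean event series are hypothetical inputs).
import numpy as np

def tss(predicted_event, observed_event):
    """TSS = hit rate - false-alarm rate."""
    p = np.asarray(predicted_event, dtype=bool)
    o = np.asarray(observed_event, dtype=bool)
    hits = np.sum(p & o)
    misses = np.sum(~p & o)
    false_alarms = np.sum(p & ~o)
    correct_neg = np.sum(~p & ~o)
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_neg)
    return hit_rate - false_alarm_rate
```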
Technical corrections
L20 (Table 1): What does "Time" mean here – e.g. does 14 mean 14:00 UTC? Could clarify in the caption.
L191: I’m not sure what the end of this sentence means – missing a word?
L268: Does [0,1] here mean any value between 0 and 1? It may help to say this in words as well.
L269: This sounds like the best performance is when MBR=2 – I think it should be when MBR=0?
L282: It allows *us* to summarize
L297: I would possibly avoid using “tendency” here, due to its other meaning (d/dt) which could be confusing. You could just end the sentence after “appear more rugged”.
L604: us --> use
Fig 2: Units are missing for the y-axis – can add these either in the figure or in the caption.