Evaluation of nine gridded daily weather reconstructions for the European heatwave summer of 1807
Abstract. Recent research on early instrumental measurements, combined with numerical-statistical techniques, has contributed to global atmospheric reanalyses as well as regional products that cover pre-1850 weather. The advent of machine learning (ML) raises the question of how well we can reconstruct weather from the distant past using both established and emerging approaches. Here, we evaluate nine such approaches for reproducing the daily weather during Europe's hot summer of 1807. The datasets examined include the Twentieth Century Reanalysis (20CR) and enhanced versions of it (via additional assimilation or dynamical downscaling), an analog resampling product, and ML reconstructions that use neural networks (along with video-inpainting methods or variational auto-encoders). Validation is based on early station measurements, documentary information, statistical diagnostics, and a semi-quantitative assessment of atmospheric flow.
We find that the summer of 1807 can be considered a prototype pre-industrial heatwave summer, with three extremely hot episodes and maximum temperatures exceeding 30–35 °C in Central Europe. Most approaches achieve mean correlations (of anomalies from the seasonal cycle) above 0.75 for temperature and centered root mean square error (cRMSE) values below 3 °C, though variability tends to be underestimated. This indicates overall robust reconstructions, given the distant past and the scarcity of underlying weather information. Skill scores for almost all reconstructions indicate that they reliably discriminate very hot from cooler (high-pressure from lower-pressure) conditions. Improved spatial skill with respect to 20CR for stations in Central and Northeastern Europe can be attributed to the increased influence of newly ingested weather information on the atmospheric reconstructions.
The atmospheric flow-aware approaches reproduce plausible large-scale features such as ridges of high pressure and associated belts of hot air, whereas data-driven ML approaches excel statistically in replicating station variability but often produce less realistic circulation patterns. The analog method yields balanced but less intense reconstructions, and the high-resolution dataset aligns best with heat intensities in the Alpine region.
Such trade-offs require users to choose between computational efficiency, statistical performance, and physically coherent circulation. Future developments need to address uncertainties in the early measurements. In turn, the analyses also emphasize the value of high-quality early weather records for producing and validating gridded reconstructions.
This manuscript presents a detailed and useful study of an important event: the summer of 1807 in Central Europe, a “prototype heatwave summer within a pre-industrial context”. The study’s aim is to critically compare nine different gridded reconstructions of this event, comprising the Twentieth Century Reanalysis (20CR) and enhanced versions of it, an analog resampling product, and several machine-learning (ML) reconstructions.
The authors identify hot periods in the 1807 summer using a variety of data sources: qualitative accounts from a contemporary observer, daily station temperature series, and reconstructions using the 20CR ensemble. The bulk of the paper is devoted to a thorough evaluation of the nine methods, using both statistical criteria (Taylor diagrams and three extreme-specific metrics) and a semi-quantitative analysis of the spatial plausibility of the reconstructed fields. Finally, the authors critically evaluate the quality of certain stations and consider how this affects each method differently.
Various approaches are now emerging for reconstructing historical weather, and the authors cite these in the introduction. The authors show that several of these methods can produce plausible reconstructions of the 1807 summer, a well-chosen case study that is of increasing significance in the current changing climate. The manuscript also provides a rigorous framework for comparing reconstruction methods that will be useful beyond this single case study. Overall, the authors present an impressive amount of information that will be of broad interest to readers of Climate of the Past. A highlight of the paper is the novel inclusion of reconstructions using a variational auto-encoder (VAE), and the authors find interesting differences between the VAE and the other ML approaches when compared to the physics-based reconstructions.
The paper is well organised and well written, and the figures are clear and helpful. The title is accurate and the abstract gives a good overview of the findings. I have two general comments and a few science comments below which I feel would improve the manuscript; however, these are fairly minor and I do not feel they amount to major edits. Otherwise, I am pleased to recommend the manuscript for publication in Climate of the Past.
The comments below are organised into two general comments, several specific science comments, and technical/typing comments.
General comments
Ensemble-means vs best members
In L379, the authors state that “CRB performs slightly better than CRM”. My interpretation of the results was the opposite – I thought it was interesting that the best member datasets (CRB and CPB) generally performed no better than their corresponding ensemble means (CRM and CRP respectively). I think L379 refers to Figs 5 and S3; here, CRM has slightly lower COR for ta, but higher for p. It’s also not clear that CRB has “more balanced variability” – this may be true for p, but for ta the SDR looks closer to 1 in CRM than in CRB. I interpreted Figs 6 and S4-S6 similarly, where the TSS scores show little improvement in the best members compared to ensemble means (CPB in S5 is an exception here). This is not a huge point, but I think it detracts from an important conclusion that the authors draw (e.g. L586 in the summary): that CRM is a good “mid-performance reference point” that is not easily beaten even by concatenating the best individual ensemble members. As 20CR is such a widely used dataset, I think it is important to highlight this result for other users.
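For clarity, the quantities I am comparing here are just the standard Taylor-diagram statistics; a minimal sketch in Python (hypothetical variable names, not taken from the authors' code) of how COR, the standard-deviation ratio (SDR) and cRMSE relate to a reconstruction and a station series:

```python
# Minimal sketch of the Taylor-diagram statistics referred to above
# (hypothetical inputs: two 1-D anomaly series of equal length).
import numpy as np

def taylor_stats(recon, obs):
    """Return COR, SDR and centred RMSE for two anomaly series."""
    ra = np.asarray(recon, dtype=float)
    oa = np.asarray(obs, dtype=float)
    ra, oa = ra - ra.mean(), oa - oa.mean()      # centre both series
    cor = np.corrcoef(ra, oa)[0, 1]              # correlation (COR)
    sdr = ra.std() / oa.std()                    # variability ratio (SDR)
    crmse = np.sqrt(np.mean((ra - oa) ** 2))     # centred RMSE
    return cor, sdr, crmse
```

An SDR closer to 1 is what I mean by “more balanced variability” above.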
Temporal evolution
The flow fields in Fig 7 and Figs S7-S8 show the reconstructed circulation at specific snapshots in time (a single day for Fig 7, a few days' average for S7-S8). But if we were interested in the development of a system over time, we would want the fields to change smoothly from day t to day t+1. Do the ML approaches show this property, or do you occasionally see unrealistic “jumps” between days? For example, if there are multiple local minima (circulation patterns) that the model could end up in, it could conceivably settle in different minima on successive days, since the fitting is done separately for each day. This does not appear to occur in present-day ML weather models, but I wondered if it is more of an issue for historical periods due to the sparsity of input observations – I would guess that, as the observational constraints become weaker, there are more possible circulation patterns that could fit the input at each timestep. In the VAE approach, for example, is the model constrained in any way to produce fields that are smooth in time?
I am not asking for any extra work to address this query – I am just interested to see if the authors noticed any differences between the methods here, or if I have misunderstood some aspect of the ML methods. If they noticed interesting differences, it might be a nice addition to their discussion of the strengths of each method.
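For concreteness, the kind of day-to-day continuity check I have in mind could be as simple as the following sketch (hypothetical array layout, not the authors' code): compute the RMS change between consecutive daily fields and flag unusually large jumps.

```python
# Sketch of a day-to-day continuity diagnostic for a reconstructed field
# (hypothetical layout: fields has shape (n_days, n_lat, n_lon)).
import numpy as np

def daily_jumps(fields):
    """RMS difference between consecutive daily fields (one value per transition)."""
    diffs = np.diff(fields, axis=0)
    return np.sqrt(np.mean(diffs ** 2, axis=(1, 2)))

def flag_jumps(fields, n_sigma=3.0):
    """Indices of transitions whose RMS change exceeds mean + n_sigma * std."""
    jumps = daily_jumps(fields)
    return np.where(jumps > jumps.mean() + n_sigma * jumps.std())[0]
```

Comparing such a statistic between the ML reconstructions and, for example, CRM might already show whether the separate-day fitting occasionally produces discontinuities.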
Specific comments
L181: What are the two periods used to calculate the temperature offset? Is the past period a single year (1807) or an average? The difference could be sensitive to the start year, so averaging over a period (e.g. 1800-1810) may be best.
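To make the suggestion concrete, the offset could be taken as a difference of multi-year means rather than of a single year; a minimal sketch with hypothetical inputs (the averaging window and reference years are only examples, not taken from the manuscript):

```python
# Sketch: temperature offset computed from multi-year means rather than a
# single year, to reduce sensitivity to the choice of start year.
import numpy as np

def temperature_offset(temps_by_year, past_years, reference_years):
    """temps_by_year: dict {year: array of daily temperatures}."""
    past_mean = np.mean([temps_by_year[y].mean() for y in past_years])
    ref_mean = np.mean([temps_by_year[y].mean() for y in reference_years])
    return past_mean - ref_mean

# e.g. offset = temperature_offset(temps, range(1800, 1811), range(1961, 1991))
```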
L381-384: “With a few exceptions, the plotting positions of the 20CR ensemble members (including the three members which produced the hottest temperatures for a certain heat episode, cf. Figure 4) are detached from the CRM” – I don’t think the 20CR ensemble members (x80) are shown in Fig 5 or Fig S3? Do you mean the best-members methods (CRB and CPB)?
“In fact, (relative) over-estimation of temperature and pressure in association with potentially lower correlation and higher cRMSE can be expected from the nature of 20CR members due to more distinct fields of temperature and pressure” What do you mean by “more distinct fields of temperature and pressure”? More distinct than what (the ensemble mean)?
L410: I was unsure how to interpret this sentence. Does it mean the average *across all methods* is better than 0.5 for each of the three scores? I interpret this to mean the dashed line is better than 0.5 (higher or lower depending on the score). But then what do the values of 0.25 and 0.75 refer to? Also, do the values in this sentence refer only to ta in Fig 5 (the values for p in S3 seem different)?
L431: I think this is a helpful summary of the performance of each method. It looks like the ordering follows the ordering of the TSS score in the lower right panels of Figs 6 and S4-6 – if so, it might be helpful to state that, e.g.: “Overall, the methods can be ranked by their TSS scores: TNN and VAE….” etc.
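For reference, and assuming the TSS here is the true skill statistic (hit rate minus false-alarm rate), the ranking could be made explicit from a calculation along these lines (hypothetical event definitions, not the authors' code):

```python
# Sketch of a TSS (true skill statistic) calculation for discriminating
# very hot from cooler days (boolean event series are hypothetical inputs).
import numpy as np

def tss(predicted_event, observed_event):
    """TSS = hit rate - false-alarm rate."""
    p = np.asarray(predicted_event, dtype=bool)
    o = np.asarray(observed_event, dtype=bool)
    hits = np.sum(p & o)
    misses = np.sum(~p & o)
    false_alarms = np.sum(p & ~o)
    correct_neg = np.sum(~p & ~o)
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_neg)
    return hit_rate - false_alarm_rate
```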
Technical corrections
L20 (Table 1): What does "Time" mean here – e.g. does 14 mean 14:00 UTC? Could clarify in the caption.
L191: I’m not sure what the end of this sentence means – missing a word?
L268: Does [0,1] here mean any value between 0 and 1? It may help to say this in words as well.
L269: This sounds like the best performance is when MBR=2 – I think it should be when MBR=0?
L282: It allows *us* to summarize
L297: I would possibly avoid using “tendency” here, due to its other meaning (d/dt) which could be confusing. You could just end the sentence after “appear more rugged”.
L604: us --> use
Fig 2: Units are missing for the y-axis – can add these either in the figure or in the caption.