the Creative Commons Attribution 4.0 License.
Understanding European Heatwaves with Variational Autoencoders
Abstract. Understanding the dynamics of heatwaves is critical for accurate climate risk assessment. Traditional definitions, based solely on surface temperature thresholds, often overlook the complex, multivariate nature of heatwaves. This study uses a spatiotemporal Variational Autoencoder (VAE), an unsupervised machine learning method, to identify compact representations of multivariate, year-round heatwave patterns. Focusing on key atmospheric variables (e.g., circulation, humidity, temperature, geopotential height, cloud cover, stream function, and radiation), we extract eleven-day heatwave samples from ERA5 reanalysis data over the North Atlantic, centered on near-surface temperature extremes in Western Europe. The VAE was trained on data from 1941–1990 and evaluated using 2001–2022 samples, and effectively clustered heatwave events by season, revealing known dynamical regimes such as summer blocking highs and winter omega blocks. The VAE model captures the interplay and temporal evolution between different atmospheric variables in their contributions to heatwaves over Western Europe. Notably, recent summer heatwaves form a distinct cluster within the latent space, pointing to a shift in atmospheric dynamics consistent with climate change. Composite anomaly maps further show coherent pre-onset patterns across variables. These results demonstrate the potential of VAEs to uncover meaningful structure in complex heatwave dynamics from data, and promise advances in understanding heatwaves.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2460', Anonymous Referee #1, 11 Jul 2025
Summary:
This work uses a non-linear dimensionality reduction method to study heat wave characteristics in Western Europe. More specifically, the authors train a 3D variational autoencoder to reconstruct 11-day windows of multiple atmospheric variables around historical heat wave onset dates. Afterwards, the trained VAE is used to embed heat waves from a later test period. The embeddings are then clustered, and a shift in cluster frequencies between the training and test periods is observed.
Strengths:
- Non-linear method to analyze heat wave characteristics and their spatio-temporal trends.
- Relevant study area: Western Europe has recently experienced a series of devastating heat waves, which are assumed to often have been preceded by atmospheric blocking events.
- Detailed description of the methodology, which enables reproducibility.
- If valid, the results are interesting: an observed shift of summertime heat wave characteristics in the past 20 years.
Major comments:
- The VAE may be applied in an out-of-distribution (OOD) scenario: data from 2001-2022 may lie outside the training distribution (heatwaves during 1941-1990). There is no guarantee that neural networks produce meaningful latent representations for out-of-distribution samples, so any interpretation of such representations may be flawed.
- The previous comment is particularly relevant to one of the main claims of this work: that there is a change in heat wave patterns after 2000. While this may very well be true, it cannot be ruled out that this is just an artifact of the way this study was set up.
- As a way forward, I’d suggest not splitting the train/val/test sets temporally, but purely at random. Ultimately you want to extract good representations for the entire time series 1941-2022.
- Why do you produce composite maps, instead of directly reconstructing the center of each cluster?
- Could your composite maps not lead to unphysical patterns? For instance, phenomena could cancel out if they are just shifted in space slightly?
- In this study you introduce atmospheric patterns that have been observed prior to heat waves. One open question remains: do these patterns arise exclusively prior to heatwaves, or are they not always followed by extreme temperature anomalies? Answering this would be very valuable for understanding the physical mechanisms and for assessing potential predictive skill.
- Why first t-SNE and then cluster? Why not directly cluster on the embeddings?
- Since your reconstruction accuracy is far from perfect (Table 2), I wonder if you have looked into the model residuals. What errors does the model make? Does it miss any significant patterns related to heat waves? Is it unbiased? How well does it capture the extremes (i.e. the grid cells with heat waves)? This may be especially tricky since you seem to train on MSE.
- Figure 5 would greatly benefit from actually including the maps of z500 from Lhotka & Kysely (2022) and plotting the differences.
- L. 342ff: can you elaborate a bit more on how this work improves over the previous work from Happé et al. (2024)? Also, I assume the “in their” should be replaced by “to their”?
- L. 386ff: the current limitations read as somewhat superficial. This approach builds on many assumptions, which should be clearly stated. Please extend the section on limitations.
- Overall the storyline of the discussion needs to be streamlined. It jumps between results, limitations and outlooks. For instance, why is L 395ff after the limitations section?
- The choice of splitting the dataset temporally also potentially influences attribution of the breakpoint you claim to find in heat wave characteristics. How robust is this finding if you change the periods over which you aggregate the latent space samples (in figure 4)?
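The concern above about composite maps cancelling spatially shifted patterns can be illustrated with a small synthetic sketch. Everything here is assumed for illustration (a 1-D "longitude" axis, a unit-amplitude wave, a hypothetical `anomaly` helper); it is not the authors' data or method. Two events with an identical pattern, displaced by half a wavelength, average to a composite with essentially no amplitude. Smaller displacements attenuate the composite rather than cancel it.

```python
import numpy as np

# Hypothetical 1-D "longitude" axis (radians); purely illustrative.
lon = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)


def anomaly(shift):
    """Synthetic anomaly field: a unit-amplitude wave displaced by `shift` radians."""
    return np.sin(lon - shift)


# Two events with the same pattern, displaced by half a wavelength.
events = np.stack([anomaly(0.0), anomaly(np.pi)])

# Composite = plain mean over events, as in a standard composite map.
composite = events.mean(axis=0)

peak_single = np.abs(events).max(axis=1)   # peak amplitude of each event
peak_composite = np.abs(composite).max()   # peak amplitude of the composite

print("per-event peak amplitude:", peak_single)
print("composite peak amplitude:", peak_composite)
```

Reconstructing a cluster centre through the decoder, as suggested in the comments above, would avoid this averaging step, although whether the decoded field is physically meaningful is a separate question.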
Minor comments:
- L 1-2 & L. 42f not sure I can follow the logic here. To me these are two separate aspects: one is defining heatwaves, which should be done using temperatures (or if you like, including also humidity) - and the other one is understanding their dynamics, for which other variables are also important.
- L. 64 reads as misleading; I would say the predominant use of VAEs is generative modeling, and anomaly detection is just one application.
- L. 87, just to be sure, you compute the climatology, after having applied the temporal operators from table 1, correct?
- Have you tried a more standard 2D VAE (e.g. using an ImageNet-pretrained ConvNeXt backbone)? In many applications the additional inductive bias on the temporal structure of the data may not translate into better performance, so it could be interesting to see if this is one case where it does.
- L. 191 not sure “temporal change” is the best term here; I first read it as if the temporal characteristics of heatwaves had changed.
- L. 200 not sure it proves that the patterns are necessarily multivariate - probably you could learn this from a single variable like air temperature also.
- L. 210: clusters 1 & 2 are clearly dominated by one season, but for 3 & 4 I am not sure I agree with your assignment, e.g. cluster 4 also has 29% DJF.
- L 356-363 this reads like methods and not discussion & conclusion
- Could it be that your latent space is just clustered by how extreme a heat wave was? Instead of different atmospheric patterns? In other words: how important is the non-linear multi-modal feature of your method?
- Maybe more a question of style, but to my taste, the introduction reads a bit alarmist
Citation: https://doi.org/10.5194/egusphere-2025-2460-RC1
AC1: 'Reply on RC1', Aytaç Paçal, 05 Sep 2025
RC2: 'Comment on egusphere-2025-2460', Anonymous Referee #2, 14 Jul 2025
Review of “Understanding European Heatwaves with Variational Autoencoders”
This research analyses heatwaves in western Europe, during the entire year, from a spatio-temporal multi-variate perspective. To this end the authors use ML and DL techniques. They find four clusters of heatwave patterns throughout the entire year, with dynamics consistent with previous literature. Notably, they use ERA5 and extended the variables to characterize heatwaves.
While this avenue of work (Deep Learning for heatwave understanding) is very interesting, both from the methodological and climate-scientific perspective, I have concerns regarding the novelty of the presented research. From the current work it seems that most of the methods are copied one-to-one from Happé et al. (2024), including the heatwave selection method, VAE, and the GMM clustering, including their respective (hyper)parameters. It needs to be clear throughout the entire manuscript what the novelty of the current work is and what has been reproduced or based on previous studies. Currently, the authors cite Happé et al. (2024) in some places but they do not contextualize their work as an application of the framework developed by Happé et al. (2024). If the authors see their work not as an application but rather as an extension of the framework, additional developments need to be made to the current AI framework. Generally, the Abstract, Introduction, and Discussion & Conclusion need to properly reflect which parts of this research are novel and which follow the framework from Happé et al. (2024). Please find below more detailed comments.
Major points of discussion
- Introduction, L60-70 Here it reads as if this is the first study that uses the framework of VAE+Clustering to characterize climate extremes (especially line 68-70). Since this is not the case, it needs to be framed clearly what the novelty is of this work with respect to previous works, and how this study is either an application or extension of previous works. Please also have a look at:
- Spuler FR, Kretschmer M, Kovalchuk Y, Balmaseda MA, Shepherd TG. Identifying probabilistic weather regimes targeted to a local-scale impact variable. Environmental Data Science. 2024;3:e25. doi:10.1017/eds.2024.29
- Methods 2.1: Why do the authors take this exact grid area? Or the 15d moving window? Crucially, why do the authors take a grid of 0.7 degrees spatial resolution if ERA5 has 0.25? If these parameters are chosen because those were used in Happé et al. (2024), that needs to be stated as such. Happé et al. worked with 0.7 degrees because it is the native resolution of EC-Earth, and hence appropriate for that study. It is unclear why one would work with that resolution for ERA5 instead of the native 0.25 degrees.
- Heatwave identification – the authors take the “1941-1980 daily” percentile, which will inherently cause more heatwaves in the last 4 decades, as thermodynamics lead to an increase in temperature everywhere. This is important to consider when studying dynamics of heat extremes – how meaningful are the dynamical types that are then found? Furthermore, the test-set also consists of heatwaves from the last two decades – how do the authors deal with this non-stationarity?
- Methods 2.2: Indeed, here the authors mention following the methods proposed by Happé et al. (2024). It would benefit the entire methods section if it were very clear which parts of the methodology deviate from Happé et al. (2024).
- Methods 2.3: As these methods also follow Happé et al. (2024), it would be transparent to state something like “following Happé et al. (2024) we use a 3d VAE …”. Then continue by explaining where your methods deviate and why the authors made those choices (e.g. improvement of training/framework/…). For example, the use of t-SNE is also done in Happé et al. (2024), yet this is not mentioned in your section (L153-160). Additionally, the choice of the 100 closest heatwaves to each centroid is also not cited as following Happé et al. (2024) (L161).
- Methods 2.3, the r2 scores: As this section talks about reconstruction errors, I would suggest it fits better in the results. Apart from that, are these r2 scores based on a latent dimension size of 128? Is this chosen because of Happé et al. (2024)? Why didn’t the authors take a higher latent space size, since the dimensions went from 2 to 9 variables and from 5 to 11 days? The latent dimension size should be properly justified and tested. Furthermore, I have concerns about these low r2 scores and would be curious to see the reconstructed maps for these variables. What happens if one goes to higher latent dimension sizes? Lastly, Table 2 only shows the r2 scores for the test subset; my suggestion would be to also include the scores of the train set, to show how well the authors’ model is able to generalize. I’m especially curious about this last point, as Happé et al. (2024) showed that data augmentation was needed to avoid overfitting.
- Results; I’m curious as to why the authors apply PCA to go down to 50 components in the latent space – why not use PCA directly on the heatwave data? Or why not go down to 50 dimensions in the VAE latent space? What happens to the r2 scores after doing this step?
- Results: I find it interesting that the authors find 4 clusters that correspond with each season. What does this mean for interpretation: did the latent-dimension clusters actually find dynamically different heatwaves, or rather the dynamics of the different seasons? Would it be possible to plot composite maps within a cluster of summer-only and winter-only heatwaves? Perhaps that could show us whether these patterns are indeed found year-round or whether you find the seasonal dynamics. This would also better underpin your speculative (“hints”) conclusion in L383-385. Answering this is not trivial, as dynamics leading to heatwaves in summer (e.g. blocking) do not necessarily lead to warm anomalies in winter. Rather, blocking-like systems cause cold anomalies in winter. I therefore find it interesting that cluster #1 is a blocking pattern in winter, while the authors compare this cluster to the UK High pattern in Happé et al. (2024) and the omega block in Rouges et al. (2023), which occur in summer [L328-241], even though the authors show in Figure 4 that 0 summer heatwaves are part of their cluster #1. Could it be that finding this pattern in winter is merely a result of the non-stationarity of the dataset? Could the authors explain this more?
- In the Discussion the authors state that the VAE/GMM is sensitive to hyperparameters; it would be good to see some of these experiments in this research. Especially the latent dimension size is essential for this research as ensuring that the latent representations are representative of your heatwave samples is not trivial – otherwise the clusters might be meaningless.
- Discussion & Conclusion: Again, it needs to be contextualized which parts of the framework are based on previous work and which parts are novel, using phrases such as “We confirm the results from Happé et al. (2024), by showing XYZ.” or “As opposed to Happé et al. (2024), we do/find XYZ.” This helps guide the reader and highlights the novelty of the authors’ work, e.g. in sentences 356-363, 392-394, and 369-400. It needs to be clear in the conclusion what the main scientific output of your contribution is.
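The non-stationarity point raised above (a fixed 1941-1980 baseline percentile flags more events in later, warmer decades) can be sketched with synthetic data. All numbers below are assumptions for illustration only: a linear warming trend of 0.02 K per year, unit-variance Gaussian noise, and a single 90th-percentile threshold in place of the daily, moving-window percentiles described in the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily temperature anomalies for 1941-2022 (365 days per year).
# Assumed: linear warming trend of 0.02 K/year plus unit-variance noise.
years = np.arange(1941, 2023)
trend = 0.02 * (years - years[0])
temps = trend[:, None] + rng.normal(0.0, 1.0, size=(years.size, 365))

# Fixed-baseline threshold: 90th percentile over 1941-1980 only (assumed level).
baseline = (years >= 1941) & (years <= 1980)
threshold = np.percentile(temps[baseline], 90)


def exceedance_rate(mask):
    """Fraction of days above the fixed threshold within the selected years."""
    return float((temps[mask] > threshold).mean())


early = exceedance_rate(baseline)
late = exceedance_rate(years >= 2001)
print(f"exceedance rate 1941-1980: {early:.3f}")
print(f"exceedance rate 2001-2022: {late:.3f}")
```

In this toy setup the later period exceeds the threshold far more often purely because of the imposed trend, with no change in variability or "dynamics". This is why a shift in heatwave frequency under a fixed baseline cannot by itself be read as a dynamical change.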
Minor points
- “This trend is projected to continue even at the lowest projected global warming scenario, and the intensity of extremes will increase proportionally with the amount of warming.” L19-20 Is there a reference for this? My understanding was that this is not necessarily a proportional increase.
- The motivation in the introduction seems to cover all types of extreme events and all over the globe, yet the focus of the manuscript is heatwaves over western Europe only.
- In the introduction the authors motivate that heat extremes cause mortality and increased costs, in summer mostly. Then why does the study focus on year-round heatwaves? I think this is important to motivate, as heatwaves in western Europe don’t cause impacts in winter.
- Methods 2.4: is the model trained using r2? Or MSE? Lines 144-152: would this not fit better in the results section? It is also mentioned here that it is difficult to capture the local surface conditions because of the coarse spatial resolution, but then why did the authors decide to re-grid from 0.25 to 0.7 degrees in spatial resolution?
- Why do the authors choose MSLP, Z500, and STREAM250? Rather than different levels of Z or stream?
- Figure 4. If I understand correctly these samples are from all year round? Is the t-SNE trained only on train-data or on all samples?
- Section 3.4 can use some more literature comparison, especially when discussing the dynamics leading to heatwaves (the causal pathways).
- Some sentences need rephrasing, for example:
- L195-297 “they show a negative tendency” --> positive? Tendency towards what?
- L342-344 “the key difference in their study …” --> our study? Now it reads as if they (Happé et al.) used 11d multivariate data instead of you.
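As background to the question above on whether the model is trained on r2 or MSE, the two quantities are directly related on a fixed set of target values. The notation here is assumed for illustration: $x_i$ are the target values, $\hat{x}_i$ the reconstructions, $\bar{x}$ the mean of the targets, and $N$ the number of values.

```latex
R^2
  = 1 - \frac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
  = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(x)},
\qquad
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2,
\quad
\mathrm{Var}(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 .
```

On a fixed evaluation set, a lower MSE therefore corresponds to a higher r2, so reporting r2 is compatible with training on MSE. A VAE objective usually adds a KL regularisation term to the reconstruction loss, so the training loss and the reported score are generally not the same quantity.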
Citation: https://doi.org/10.5194/egusphere-2025-2460-RC2
AC2: 'Reply on RC2', Aytaç Paçal, 05 Sep 2025
We are grateful for the reviewer’s constructive comments and suggestions. We carefully addressed each point in our response letter, with particular attention to clarifying the novelty of our work. Please find our detailed responses in the attached PDF.