Combining different views on internal climate variability of temperature over Europe
Abstract. Internal climate variability (ICV) estimates provide a useful benchmark for assessing climate model performance and the emergence of anthropogenically forced climate change. This study aims to quantify the magnitude of ICV using different types of data, representing both Earth System Model simulations and observation-based datasets. We focus on seasonal mean near surface air temperature over Europe utilizing different methodological approaches: assessment of variability inferred from pre-industrial control simulations, spread of a single-model initial-condition large ensemble, separation of uncertainty sources in CMIP6 transient simulations, and forcing attribution in observed time series. Across all methods and datasets, we found that ICV estimates decrease during the seasonal course from winter to autumn and spatially from north-eastern to south-western Europe. By comparing the results of the historical and scenario simulations of the large ensemble and selected CMIP6 models, we conclude that European ICV generally decreases under anthropogenically forced climate change. Moreover, our study suggests that applying ICV estimates as a benchmark for assessing regional climate simulations over Europe should be approached with caution. The estimate based on the pre-industrial control simulations offers an advantage since the simulations are not influenced by external forcings and their ensemble mean estimate encompasses the range of the other methods. When the focus is on future climate simulations, estimates from scenario simulations should be used, as they already account for the influence of anthropogenic forcings on ICV. Regarding ICV estimates from observational data, their advantage lies in accounting for true climate history, free of modelling uncertainty. Historical simulations also account for historical climate change and yield ICV estimates comparable to those from observational data.
General Comments
The study analyses internal climate variability (ICV) over the European domain, both on a gridded basis and aggregated over four European regions, for seasonal and annual surface air temperature. It uses climate model projections for different time periods in the historical and SSP5-8.5 scenarios, as well as piControl simulations from 16 climate models, a single-model initial-condition large ensemble (SMILE), and three reanalysis/observation-based datasets. The results from these different approaches to estimating ICV over Europe are then described and compared, with a focus on inter-method, inter-season, spatial, and inter-model differences. The analysis has potential value for the scientific community, for example by providing quantitative information on current and future ICV over Europe and by evaluating models in terms of their ability to represent it. It may also contribute by comparing different methods for estimating ICV. However, before the manuscript can be published I recommend several major revisions:
In my view, the paper would benefit from a clearer structure based on explicitly defined research questions. Based on my understanding, these could include, for example:
At present, the manuscript seems to me to lack a clearly defined set of objectives, which makes it difficult to understand the overarching goal. Clearer research questions would, in my view, also improve the focus of the results, conclusions, abstract, and introduction. The paper could also benefit from clearly defined metrics for model benchmarking and method comparison. Right now, the results and conclusions of the analysis appear very imprecise to me.
In my opinion, the text would benefit from more precise formulations and fewer filler expressions (examples below).
The abstract and conclusion appear to include statements that are either obvious or not clearly supported by the analysis (see comments below).
In my view, the methodology is not clearly presented, and as I interpret it, the setup hampers comparability between the ICV estimates from the different methods, as different ICV estimation approaches are applied to different datasets and time periods. The study also seems to lack a discussion of the limitations of the different ICV estimation methods and does not sufficiently address how much of the differences may arise from methodological choices.
The results section appears to largely restate what can already be seen in the figures without much interpretation.
Abstract
In my view, the abstract would benefit from being clearer about the motivation of the analysis: what do you want to achieve with the main innovation of this study, namely the calculation of ICV over Europe with the different methods, and what can readers learn from it or how can they apply it? The answers to those questions should focus on that innovation, i.e. the calculation and comparison of ICV estimates over Europe with the different methods, and not on ICV quantification in general. Also, in my view, the abstract would benefit from including some of the most relevant quantitative findings, for example the range of quantitative estimates of ICV over Europe and selected results from a model evaluation exercise. This could include the number of models analysed, as well as which models exhibit the highest, lowest, and most comparable ICV relative to the observational estimates (if I understand correctly that this is what you mean by “model benchmarking” in this study). The same, in my opinion, holds for the conclusions.
Some of the conclusions in the abstract could be formulated more cautiously and robustly. The statement “Across all methods and datasets, we found that ICV estimates decrease during the seasonal course from winter to autumn and spatially from north-eastern to south-western Europe. By comparing the results of the historical and scenario simulations of the large ensemble and selected CMIP6 models, we conclude that European ICV generally decreases under anthropogenically forced climate change.” is too general. There are several examples where ICV increases under forcing (e.g. the CMIP6 multimodel mean for JJA over Ukraine). Likewise, the decrease from winter to autumn is not consistently observed across all regions and datasets (e.g. Eastern Europe in the CMIP6 multimodel mean for 1971–2000 from JJA to SON), even though this tendency is present in many cases.
At the same time, I would suggest removing statements from the abstract that are not sufficiently explained or interpreted in the manuscript. For example, I don’t see where the statement “Moreover, our study suggests that applying ICV estimates as a benchmark for assessing regional climate simulations over Europe should be approached with caution.” is backed, explained or interpreted in the remainder of the study.
Similarly, some statements appear to reflect rather general methodological observations instead of conclusions emerging from the presented analysis. For example, the statement “The estimate based on the pre-industrial control simulations offers an advantage since the simulations are not influenced by external forcings and their ensemble mean estimate encompasses the range of the other methods.” mainly describes a general property of pre-industrial control simulations rather than a finding of the study itself. In addition, I could not find a clear interpretation or explanation of the statement that their ensemble mean estimate “encompasses” the range of the other methods, beyond a repetition in the conclusions.
Likewise, the statements “When the focus is on future climate simulations, estimates from scenario simulations should be used, as they already account for the influence of anthropogenic forcings on ICV. Regarding ICV estimates from observational data, their advantage lies in accounting for true climate history, free of modelling uncertainty.” appear to be fairly general and expected methodological observations rather than conclusions directly supported by the analyses conducted in the study. In addition, the statement that the observational data are “free of modelling uncertainty” is not supported by the results of the study, where three different observational datasets yield three different estimates of ICV. These datasets are curated differently, e.g. by using a model constrained by observations, as in the Twentieth Century Reanalysis dataset, or by interpolating between observations.
Also, you introduce the term “forcing attribution” in the abstract but do not mention it again in the methodology. The statement “Historical simulations also account for historical climate change and yield ICV estimates comparable to those from observational data.” is, in my view, far too general and not really backed by the results of the study if “comparable” is interpreted as “roughly equal”: there are many examples where the ICV of the historical simulations falls well outside the observed ICV range (e.g. MPI-ESM1-2-LR for DJF in NEU, but there are several other examples).
Introduction
The introduction discusses uncertainties in climate projections, defines ICV and its origins, and presents several methods previously used to quantify ICV and publications which use them. However, I think the introduction would benefit from the following improvements:
(1) The current discussion places comparatively strong emphasis on uncertainties in general (including model and scenario uncertainty), while the relevance of ICV quantification itself receives less attention. I would suggest reducing the broader uncertainty discussion, as this is not the focus of the paper, and instead elaborating more on why quantifying ICV is important, as currently only one concrete example is provided.
(2) In my view, the manuscript would also benefit from a clearer explanation of how the comparison of different methodologies and datasets contributes to a better understanding of ICV and to model benchmarking. At present, the introduction mainly states that several methods are combined to “compare and assess ICV magnitude from diverse perspectives” and to “provide a useful benchmark” for evaluating model performance, but it remains somewhat unclear to me how exactly these comparisons contribute to these goals.
In addition, I think it would be valuable to formulate more explicit research questions in the introduction that can then be addressed throughout the paper using the different methods and datasets.
Methods
In each paragraph describing the methods, or alternatively in a short overview at the beginning of the section, the order of the calculations could be stated more clearly. In particular, it would help to clarify in which order detrending, regridding, temporal aggregation, spatial aggregation, and standard deviation calculations are applied. For the HS09 method specifically, it is currently unclear whether the different time intervals are separated first and then detrended, or whether detrending is performed before the separation into periods.
If the periods are first separated and the 4th-order polynomial detrending is then applied to individual 30-year periods, the polynomial fit may smooth out part of the low-frequency variability together with the forced signal. This should be checked and discussed.
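A minimal sketch of how this could be checked, assuming a synthetic unforced AR(1) series as a stand-in for regional temperature anomalies (the AR(1) parameters, the helper names `ar1_series` and `detrended_std`, and the window layout are illustrative assumptions, not taken from the manuscript). Because the series contains no forced signal, any reduction in standard deviation after fitting a 4th-order polynomial to each 30-year window is internal variability absorbed by the fit.

```python
import numpy as np

def ar1_series(n, phi, sigma, rng):
    """Unforced AR(1) series: x[t] = phi * x[t-1] + white noise."""
    x = np.zeros(n)
    eps = rng.normal(0.0, sigma, size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

def detrended_std(window, order):
    """Standard deviation of a window after removing a polynomial fit."""
    t = np.arange(window.size, dtype=float)
    coeffs = np.polyfit(t, window, order)
    return (window - np.polyval(coeffs, t)).std(ddof=1)

rng = np.random.default_rng(0)
# Hypothetical 3000-year unforced series (assumed phi and sigma).
series = ar1_series(3000, phi=0.6, sigma=1.0, rng=rng)
windows = series.reshape(-1, 30)  # 100 non-overlapping 30-year windows

# Mean standard deviation per window, without and with detrending.
raw_std = windows.std(axis=1, ddof=1).mean()
det_std = np.mean([detrended_std(w, order=4) for w in windows])
ratio = det_std / raw_std
print(f"fraction of variability retained after detrending: {ratio:.2f}")
```

A ratio clearly below one would indicate that the detrending step itself lowers the ICV estimate. Repeating the test with the autocorrelation found in the actual model output would show how large the effect is in practice.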
It might be good to clarify whether area weighting is applied during the spatial aggregation, i.e. whether the differing spatial areas represented by individual grid cells are taken into account.
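For reference, a minimal sketch of cosine-latitude area weighting on a regular latitude-longitude grid, which is what I would expect here. The function name `area_weighted_mean`, the grid layout (latitude along the first axis), and the toy values are assumptions for illustration only.

```python
import numpy as np

def area_weighted_mean(field, lats_deg):
    """Mean of a (lat, lon) field weighted by cos(latitude), ignoring NaNs."""
    weights = np.cos(np.deg2rad(lats_deg))[:, None] * np.ones_like(field)
    valid = np.isfinite(field)
    total = np.sum(np.where(valid, field, 0.0) * weights)
    return total / np.sum(weights * valid)

# Toy example: two latitude rows (0 and 60 degrees), two longitudes.
lats = np.array([0.0, 60.0])
field = np.array([[1.0, 1.0],
                  [3.0, 3.0]])

weighted = area_weighted_mean(field, lats)
unweighted = field.mean()
print(weighted, unweighted)
```

In this toy case the row at 60 degrees receives half the weight of the equatorial row, so the weighted mean is lower than the simple mean. For regions spanning a wide latitude range, such as northern Europe, the difference may be non-negligible.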
It would also be useful to discuss to what extent the ICV estimates derived from the four methods are directly comparable in terms of which processes and forcings are included or excluded. For example, in the observational analysis, volcanic aerosols and solar variability appear to be implicitly removed together with the forced trend. Similarly, in the SMILE, these forcings would be removed because they are identical across ensemble members and therefore cancel out through subtraction of the ensemble mean. In the other methods, however, volcanic and solar forcing effects may still contribute to the estimated variability because the applied trend removal does not explicitly account for them. This may also contribute to the comparatively larger variability estimates obtained from the piControl simulations. In my opinion, the manuscript would benefit from a clearer discussion of which effects are included and excluded in the ICV quantification of each method and how this affects the comparability between methods.
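To illustrate the point about shared forcings, here is a minimal synthetic sketch. All numbers are assumptions: a 50-member ensemble, 30 years, unit noise, a weak linear trend, and a deliberately exaggerated two-year volcanic cooling common to all members. It compares an ensemble-spread estimate with a per-member estimate after linear detrending.

```python
import numpy as np

rng = np.random.default_rng(1)
n_members, n_years, sigma = 50, 30, 1.0
years = np.arange(n_years, dtype=float)

# Forced signal shared by all members: linear trend plus a volcanic dip.
forced = 0.03 * years
forced[10:12] -= 3.0  # exaggerated cooling, for illustration only

noise = rng.normal(0.0, sigma, size=(n_members, n_years))
ensemble = forced[None, :] + noise

# (a) Ensemble spread: std across members per year, then averaged.
# The shared forced signal cancels, including the volcanic dip.
spread_est = ensemble.std(axis=0, ddof=1).mean()

# (b) Per-member linear detrending: the dip stays in the residuals.
def linear_detrended_std(x, t):
    slope, intercept = np.polyfit(t, x, 1)
    return (x - (slope * t + intercept)).std(ddof=1)

linear_est = np.mean([linear_detrended_std(m, years) for m in ensemble])
print(f"ensemble spread: {spread_est:.2f}, linear detrending: {linear_est:.2f}")
```

Under these assumptions the ensemble-spread estimate stays close to the prescribed noise level, while the linearly detrended estimate is inflated by the episodic forcing. This is the kind of systematic difference between methods that the manuscript should quantify or at least discuss.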
The application of linear trend removal to the piControl simulations should also be discussed and checked, as this implicitly assumes that model drift is approximately linear. It is unclear whether this assumption is generally valid. If model drift contains nonlinear components, residual drift effects may remain in the ICV estimates, which could potentially contribute to the systematically larger ICV values obtained from the piControl simulations.
Similarly, the use of a linear assumption between forcing and regional temperature for the observational datasets may also be problematic. While the relationship between global forcing and global mean temperature is approximately linear, this does not necessarily hold at the regional or grid-point scale. Regional temperature responses may contain nonlinearities and path dependencies associated with circulation changes or feedback processes such as albedo changes from changing snow cover. This assumption should therefore be tested and discussed more explicitly.
It is also unclear to me whether the subset of models used for the ICV evaluation is representative of the broader CMIP6 model spread, for example with respect to TCRE or other measures of climate sensitivity and model uncertainty. Selecting models primarily based on the availability of piControl simulations over a given period appears somewhat arbitrary and could potentially introduce sampling biases into the ICV estimate from the ensemble mean (e.g. if only hot models are sampled).
Using different time periods across the datasets could further reduce the comparability of the results. For example, the observational datasets estimate variability over the 1901–2010 period, whereas the historical simulations and SMILEs are analysed over 1971–2000. Since the external forcing conditions differ substantially between these periods, differences in the estimated ICV are expected, which makes direct comparison and potential “model benchmarking” based on the provided data more difficult.
In addition, the large differences in the lengths of the analysed periods should be discussed more explicitly, as they may introduce substantial sampling uncertainty. Calculating a standard deviation from 30 years of data is considerably more uncertain than estimating it from several hundred years, especially if the ICV is correlated through time, an effect that could explain the differences between the ICV estimates based on piControl and the other datasets. One could also test this by additionally calculating the ICV from selected periods within piControl that match the other data in length.
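A minimal sketch of the suggested subsampling test, assuming a synthetic AR(1) series as a stand-in for a long control simulation (series length, AR(1) parameters, and the helper name `ar1_series` are illustrative assumptions). It compares the standard deviation of the full series with the distribution of standard deviations from non-overlapping 30-year chunks.

```python
import numpy as np

def ar1_series(n, phi, sigma, rng):
    """Unforced AR(1) series: x[t] = phi * x[t-1] + white noise."""
    x = np.zeros(n)
    eps = rng.normal(0.0, sigma, size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

rng = np.random.default_rng(2)
# Hypothetical 3000-year control-like series with year-to-year persistence.
control = ar1_series(3000, phi=0.6, sigma=1.0, rng=rng)

# Estimate from the full series versus 100 non-overlapping 30-year chunks.
full_std = control.std(ddof=1)
chunk_stds = control.reshape(-1, 30).std(axis=1, ddof=1)

print(f"full-length estimate:     {full_std:.2f}")
print(f"30-year chunks, mean:     {chunk_stds.mean():.2f}")
print(f"30-year chunks, min/max:  {chunk_stds.min():.2f} / {chunk_stds.max():.2f}")
```

With positive autocorrelation, the short-window estimates scatter widely and their mean falls below the full-length value, because variability on time scales longer than the window is not sampled. Applying the same procedure to the actual piControl output would show how much of the reported piControl excess is attributable to record length alone.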
To further improve the comparability between methods, it may also be useful to apply the different approaches more consistently across datasets. For example, the HS09 method could additionally be applied to the piControl simulations and observational datasets, while the forcing-attribution approach could also be applied to the piControl and historical simulations. Similarly, applying both the HS09 and forcing-attribution methods to the SMILE simulations could help disentangle how much of the differences reported in the study arise from the methodological choices themselves versus differences in the underlying datasets/models. This would substantially improve the interpretability of the different ICV estimates and enable more angles of comparison.
Results (4)
This section mainly describes patterns, quantifications, and observations that can be directly inferred from the figures. However, in my view, it would substantially benefit from more interpretation and discussion of the underlying physical mechanisms behind the reported results. Some of that follows in the conclusion section, but I think it would make sense to add it here already.
The subsection titles need to be revised. This is clearly part of the Results section rather than “Methods and data”. In addition, the title of Section 4.2, “Spatial distribution of ICV estimates”, does not fully reflect the actual content of the section. A title such as “Comparison of spatially aggregated ICV estimates across methods” or similar would better capture the focus of the analysis.
The methodological concerns raised in the Methods section would also directly affect the interpretation of the results. In particular, the different methodologies are hard to compare, as they include and exclude different sources of variability and forcing effects, are applied over different periods, and use substantially different temporal sample sizes. In my opinion, these limitations should be discussed more explicitly when comparing the resulting ICV estimates, or better, removed from the analysis by revising the methodology, e.g. by matching time periods.
More generally, several formulations throughout the Results section appear too broad or imprecise. One example among others is the statement “The annual analysis yields smaller ICV estimates than the seasonal analyses, while DJF shows the largest ICV magnitudes across all seasons. Pidata provides the highest ICV estimates among all the approaches and timescales, except during JJA. GISTEMP shows lower ICV estimates than the other results, while Berkeley and 20CR display comparable ICV estimates in most cases.”, which is only true in a broad overall sense; there are numerous exceptions to each of these claims. For example, there appear to be regions and methods where the annual analysis does not yield the lowest ICV values (e.g. northeastern Europe in the historical SMILE simulations). Likewise, DJF often exhibits the highest variability, but not consistently across all regions and datasets (e.g. Iberia in the SMILE scenario simulations for 2070–2099). Similarly, piControl simulations often show the highest ICV values, but there are exceptions (e.g. Iberia in MAM compared to the SMILE historical simulations). The statement regarding GISTEMP also appears too broad, as there are regions where GISTEMP produces comparable or even larger estimates than other datasets (e.g. DJF over northern Italy). Finally, the statement that Berkeley Earth and 20CR are “comparable” is somewhat unclear to me in its current form. Presumably, the intention is to state that they show relatively similar spatial and seasonal patterns and magnitudes across many regions.
The statement “In general, despite the differences between the methods, there is good agreement between the modeled datasets and the observational ones. However, it is noticed that most of the modeled datasets display noticeably higher ICV estimates than the observational ones during JJA over all the regions, except the NEU.” also appears difficult to justify in its current form, and I don’t think it is well defined what “good” means here. In many cases, the differences between models and observations are quite substantial (e.g. differences exceeding 50% in MAM over WCE for EC-Earth3-CC). Therefore, the conclusion of “good agreement” should, in my view, either be supported more quantitatively or formulated more cautiously.
Conclusions:
As in other sections, in my view, the conclusion contains several imprecisely formulated statements. An example is “Our results imply that the estimate based on the ensemble mean value of preindustrial control simulations (denoted here as Pidata ENSMEAN) encompasses the range of the other methods,” where it is unclear to me what is meant by “encompasses the range” in terms of metric or interpretation.
There are also statements that to me do not clearly follow as conclusions from the analysis but rather read as general observations about the datasets, for example “So, when the focus is on future climate simulations, the estimates involving scenario simulations should be used, as they already account for the influence of anthropogenic forcings on ICV.”
Overall, the conclusion would in my opinion benefit from more precise formulations and a clearer distinction between results derived from the study and more general statements and, like the rest of the paper, from clearly defined research questions and evaluation metrics. It should also discuss methodological limitations of the study that could influence the reported results.
Figures and Tables:
It is not clear to me which method is used to determine whether a grid point is displayed in Figures 2 and 3. In particular, some of the grid points in coastal regions do not appear to align consistently with the land–sea mask shown in the figures.
In Figures 4–8, it would be helpful to include the shading levels in the legend to improve readability and interpretation.
In Table 1, it is not always clear to me which model name corresponds to which modelling centre description, and some entries appear incomplete (e.g. MIROC-ES2L). Adding clearer separation between rows and columns, for example by using lines, could improve readability.
Code:
The code is not provided in a public repository, so I could not check it.
Conclusion
Overall, the manuscript addresses a relevant topic and has the potential to contribute useful quantitative insights into internal climate variability over Europe and its estimation using different methods, datasets and models. However, in its current form, the study suffers from a lack of clearly defined objectives, limited methodological comparability across datasets and approaches, and insufficient discussion of key limitations. In addition, several conclusions appear either insufficiently supported or too general/imprecise in their current formulation, and the results section would benefit from more interpretation rather than primarily descriptive summaries. I therefore recommend major revision before the manuscript can be considered further.