This work is distributed under the Creative Commons Attribution 4.0 License.
Standardising the "Gregory method" for calculating equilibrium climate sensitivity
Abstract. The equilibrium climate sensitivity (ECS) – the equilibrium global mean temperature response to a doubling of atmospheric CO2 – is a high-profile metric for quantifying the Earth system’s response to human-induced climate change. A widely applied approach to estimating the ECS is the ‘Gregory method’ (Gregory et al., 2004), which uses an ordinary least squares (OLS) regression between the net radiative flux and surface air temperature anomalies from a 150-year experiment in which atmospheric CO2 concentrations are quadrupled. The ECS is determined at the point where the net radiative flux reaches zero, i.e. where the system has returned to equilibrium. This method has been used to compare ECS estimates across the CMIP5 and CMIP6 ensembles and will likely be a key diagnostic for CMIP7. Despite its widespread application, there is little consistency or transparency between studies in how the climate model data are processed prior to the regression, leading to potential discrepancies in ECS estimates. We identify 20 alternative data processing pathways, varying by choices in global mean weighting, annual mean weighting, anomaly calculation method, and linear regression fit. Using 41 CMIP6 models, we systematically assess the impact of these choices on ECS estimates. While the inter-model ECS range is insensitive to the data processing pathway, individual models exhibit notable differences. Approximating a model’s native grid cell area with the cosine of latitude can decrease the ECS by 11 %, and some anomaly calculation methods can introduce spurious temporal correlations in the processed data. Beyond data processing choices, we also evaluate an alternative linear regression method – total least squares (TLS) – which appears to have a more statistically robust basis than OLS. However, for consistency with previous literature, and because physical reasoning suggests that TLS may further reduce the ECS compared to OLS, i.e. worsen a known bias in the Gregory method, we do not feel there is sufficient clarity to recommend a transition to TLS in all cases. To improve reproducibility and comparability in future studies, we recommend a standardised Gregory method: weighting the global mean by cell area, weighting the annual mean by the number of days per month, and calculating anomalies by first applying a rolling average to the piControl timeseries and then subtracting it from the CO2 quadrupling experiment. This approach implicitly accounts for model drift while reducing noise in the data to best meet the preconditions of the linear regression. While the CMIP6 multi-model mean ECS appears robust to these processing choices, similar assumptions may not hold for CMIP7, underscoring the need for standardised data preparation in future climate sensitivity assessments.
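The recommended pathway can be sketched in a few lines of NumPy (a minimal illustration, not the authors' code; helper names are hypothetical, and a real workflow would use the model's areacella field, calendar-aware month lengths, and branch-aligned piControl and abrupt-4xCO2 time axes):

```python
import numpy as np

def global_mean(field, cell_area):
    """Area-weighted global mean; field and cell_area share a (lat, lon) grid."""
    return np.sum(field * cell_area) / np.sum(cell_area)

def annual_mean(monthly, days_per_month):
    """Annual mean weighted by the number of days in each month."""
    return np.sum(monthly * days_per_month) / np.sum(days_per_month)

def rolling_mean(series, window=21):
    """Centred rolling average; edges handled by shrinking the window."""
    n, half = len(series), window // 2
    return np.array([series[max(0, i - half):min(n, i + half + 1)].mean()
                     for i in range(n)])

def gregory_ecs(tas_4x, rnet_4x, tas_pic, rnet_pic, window=21):
    """ECS from an OLS regression of annual global-mean net flux anomaly on
    temperature anomaly, with anomalies taken relative to a rolling mean of
    the branch-aligned piControl (which implicitly removes model drift)."""
    dT = tas_4x - rolling_mean(tas_pic, window)
    dN = rnet_4x - rolling_mean(rnet_pic, window)
    slope, intercept = np.polyfit(dT, dN, 1)  # OLS fit: N = slope*dT + intercept
    ecs_4x = -intercept / slope               # dT at which N crosses zero
    return ecs_4x / 2.0                       # halve to convert 4xCO2 to 2xCO2
```

The inputs to `gregory_ecs` are assumed to have already been reduced to annual global means with `global_mean` and `annual_mean`; the rolling piControl average removes drift without suppressing interannual variability in the forced run.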
Status: final response (author comments only)
- CC1: 'Comment on egusphere-2025-2252', Govindasamy Bala, 12 Jul 2025
- RC1: 'Comment on egusphere-2025-2252', Anonymous Referee #1, 23 Jul 2025
The authors evaluate how various choices for processing GCM output impact estimates of ECS using the standard “Gregory method” (regression of global top-of-atmosphere radiation on global surface temperature using years 1-150 of 4xCO2 simulations). In particular, they test different choices of area weighting, annual averaging, drift correction, and regression methods. They find that (i) the choice of area weighting method can affect ECS when the model output is not on a regular lat-lon grid; (ii) the choice of annual averaging method does not affect ECS; (iii) the choice of drift correction method can slightly affect ECS; and (iv) the choice of regression method can substantially affect ECS for a subset of models. They recommend that ECS calculations be standardized by always weighting the global mean by cell area (rather than cosine of latitude), weighting the annual average by the number of days in each month, and calculating anomalies relative to a 21-year rolling average in the pre-industrial control simulation. They also recommend continuing to use Ordinary Least Squares regression for continuity with most previous studies, even though other regression methods may have theoretical justification.
This is a well written paper that will make a nice contribution to the literature. While some of the results, including that area weighting and annual averaging methods make little to no difference to ECS estimates, are unsurprising to me, it’s nice to see this systematically demonstrated. It will be valuable to have this paper as a reference that the community can point to for best practices in calculating ECS using the Gregory method. I recommend publication after some revision.
That said, there are a number of things that I think could be improved. Most notably:
- Given that the choice of annual averaging method (weighting all months equally versus weighting by number of days) makes essentially no difference to the calculation of ECS (at most 0.02 K), I found it strange that this choice made its way into your recommendation for standardization. My takeaway is that this choice doesn’t matter, so my suggestion would be to note that and remove it from the recommended standardization list (e.g., Lines 22-24 in the abstract) so that there is one fewer thing readers will have to keep track of, making it more likely that they will actually follow your recommendations as well.
- Regarding the area weighting, my takeaway from your findings is that for models on a regular latitude-longitude grid, weighting by grid cell area (areacella) and weighting by cosine(latitude) makes no difference. This is of course expected. It’s really only those models with output on an irregular grid where weighting by cosine(latitude) produces a different global average than weighting by grid cell area. This is of course also expected, because for an irregular grid, the grid area does not scale like cosine(latitude), and to weight by cosine(latitude) is simply an error. I think it would be much more clear to frame it this way, rather than as a “choice” of area weighting method, which makes it seem like both are acceptable options. Your recommendation to always weight by grid cell area is good since that removes the need to check whether the output is on a regular grid or not, and because weighting by areacella is easier (with less to go wrong) than weighting by cos(latitude) anyway.
- Another common choice is calculating anomalies with respect to the long-term average of the piControl simulation. You note this in several places (Lines 101-104) and I thought this was what you meant when you described taking a climatology (Lines 292, 349, 464). But your results seem to only compare using anomalies relative to the raw piControl (including interannual variability), a 21-year rolling mean, and a linear trend (Fig. 3). You should add an analysis using the long-term average of the piControl simulation. This choice is far more common than subtracting the raw (annually varying) piControl data, I think.
- Lines 171-176: It is unclear what you mean by performing “branch alignment” or “correction” here, so please elaborate on what exactly you did and how much it matters. Overall, the need to correctly identify the branch point in order to accurately perform the drift correction you propose (subtracting 21-year rolling averages) should be emphasized more. It’s a step that needs accurate metadata or additional effort to make sure the assumed branch point is correct, and this should be noted and perhaps even made as a concrete recommendation for standardization. Alternatively, many authors just use the long-term average over the piControl in an attempt to avoid having to pay attention to the branch point.
- Does the choice of temperature variable matter? While most studies use near-surface air temperature (“tas”), others use surface temperature (“ts”) which represents the “skin temperature” (i.e., sea-surface temperature over open ocean, surface of sea ice, and surface of land). Does this choice make a difference to ECS, and if so can you make a recommendation to use “tas” rather than “ts”?
- I understand the focus on using the Gregory method applied to years 1-150 of 4xCO2 simulations, which has traditionally been the way people have estimated ECS in GCMs. However, it’s becoming increasingly common to estimate ECS using regression over different sets of years, for example (i) using years 21-150, which is thought to provide more accurate estimates of equilibrium warming by avoiding some of the initial curvature in the Gregory plot, or (ii) using years 1-300 when longer output is available (e.g., in LongRunMIP or, hopefully, in CMIP7). It would be good to comment on whether the choices you evaluate here also make a difference for ECS estimates using those different choices of years. You could use the available LongRunMIP simulations to test using years 1-300, for example. I imagine that the difference in OLS vs TLS regression methods might matter more when using years 21-150, but might matter less when using years 1-300. I’m not sure about the other choices you explore. But this analysis and associated set of recommendations will become important as alternative regression periods are chosen for evaluation of ECS in CMIP7 models.
- You could also consider testing the available 2xCO2 simulations. Do the same recommendations apply to calculating ECS in those, or do some choices become more, or less, important? This seems less pressing than my recommendations above.
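On the area-weighting point above, a toy example (all values invented) can show why cos(latitude) weights are simply an error once the latitude rows are uneven: the true cell area scales with the difference in sin(latitude) across each row's bounds, not with cos(latitude) at the row centre.

```python
import numpy as np

# A grid with deliberately uneven latitude rows.
lat_bounds = np.array([-90.0, -60.0, -20.0, 20.0, 60.0, 90.0])
lat_centres = 0.5 * (lat_bounds[:-1] + lat_bounds[1:])

exact = np.diff(np.sin(np.deg2rad(lat_bounds)))   # true relative row areas
approx = np.cos(np.deg2rad(lat_centres))          # cos(lat) "weights"

field = np.array([250.0, 270.0, 300.0, 270.0, 250.0])  # toy zonal-mean values

gm_exact = np.average(field, weights=exact)  # correct area-weighted mean
gm_cos = np.average(field, weights=approx)   # biased: ignores uneven row widths
```

On a regular grid the two weightings coincide, since Δsin(lat) ≈ cos(lat)·Δlat for a uniform Δlat, which is why the distinction only bites for irregular grids.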
Other comments:
- Line 24: You should define piControl
- Lines 40-41: I found this summary of ECS ranges confusing since they are comparing different things. The Charney estimate is an approximate range. The CMIP6 range quoted is simply the range of models. And the AR6 range quoted is the 5-95% range. All measure different things, so it is not correct to say that AR6 narrowed the range relative to the models (it is simply more narrow than the model range). AR6 did narrow the range relative to the 5-95% range reported in AR5 and previous reports. Please reword this to be more clear on these points.
- Line 49: Another paper to cite here is doi: 10.1175/2008JCLI2596.1
- Line 94-104: This felt out of place since it's not really needed at this point, and you discuss the anomaly calculation in more detail below. Later in the paper I found myself scanning back up to this section to see these details. I suggest moving this to where you discuss the anomaly method in more detail below.
- Lines 113-117: That’s reasonable not to evaluate how ECS values change when using different regression periods, as many other papers look into that already. However, as I noted above, it would be good to check whether your recommendations still apply when using different choices for regression period. The choice of OLS vs TLS regression in particular could matter more or less.
- Lines 134, 202: Baseline, Standard, and Alternative pathways are mentioned here, but not yet described. Only later on Lines 262-264 are they described.
- Lines 157-170: this feels redundant with the text on Lines 94-104. Also, you should include a fourth choice here (which is common in the literature): calculating anomalies with respect to the long-term climatological mean of the piControl simulation (either over the full length of that simulation, or over the century or so leading up to the branch point).
- Lines 179-181: This is framed as if it needed investigation. But I think it simply has to be true that calculating anomalies before or after summing the variables makes zero difference since these are linear operations.
- Lines 213-244: This should make more clear that using cosine(latitude) when the grid is irregular is simply an error (not a valid different choice).
- Figure 2: Do you have a sense of why OLS vs TLS regression matters so much for some models, but not for others? Can you comment on under what conditions the choice matters?
- Lines 274-282: I found this discussion confusing. Non-closure of the global energy budget does not necessarily cause model drift if the model is fully spun up. It instead just means that there will be a top-of-atmosphere energy imbalance maintained in the piControl state, which balances that non-closure within the model.
- Line 307: You mention calculating anomalies relative to the climatological mean here, but as I note above I don’t think you’ve tested that case?
- Line 307-309: I think aiming for consistency in how ECS was calculated in Zelinka et al. (2020) is a pretty strong argument. Could you expand this to explain what choices were made in Zelinka et al. with respect to global area and annual averaging methods?
- Lines 348-350: I do not follow how removing a climatological average (constant values) or linear trend would change the variability or the correlation between variables.
- You should cite doi: 10.1175/JCLI-D-17-0667.1, who discuss using Deming regression instead of OLS.
- It may be worth mentioning that choices of drift correction method will likely make much more of a difference for the calculation of anomalies in historical simulations (as a percent change), even if they don't matter for ECS calculation. This of course could use further study to compare those choices.
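On the OLS vs TLS question: the contrast between the two fits can be illustrated with synthetic data (a sketch only; 'TLS' here is taken to mean orthogonal regression with equal error variance in both variables, i.e. Deming regression with a variance ratio of one, and all numbers are invented):

```python
import numpy as np

def ols_fit(x, y):
    """Ordinary least squares: minimises vertical residuals only."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def tls_fit(x, y):
    """Orthogonal (total) least squares via the classic closed form."""
    xm, ym = x.mean(), y.mean()
    sxx = np.sum((x - xm) ** 2)
    syy = np.sum((y - ym) ** 2)
    sxy = np.sum((x - xm) * (y - ym))
    slope = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, ym - slope * xm

# Synthetic Gregory-plot data: true feedback -1 W m-2 K-1, forcing 8 W m-2,
# with comparable noise in both the temperature and the flux series.
rng = np.random.default_rng(1)
dT_true = np.linspace(0.0, 8.0, 150)
dN_true = 8.0 - dT_true
dT = dT_true + rng.normal(0.0, 0.4, 150)
dN = dN_true + rng.normal(0.0, 0.4, 150)

m_ols, b_ols = ols_fit(dT, dN)
m_tls, b_tls = tls_fit(dT, dN)
# Noise in dT attenuates the OLS slope toward zero (regression dilution);
# the TLS slope is always at least as steep, so the N = 0 intercept, and
# hence the inferred ECS, differs between the two fits.
```

The size of the OLS-TLS gap grows with the ratio of x-noise to x-spread, which may be part of why the choice matters for some models (large internal variability relative to the forced temperature range) and not others.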
Citation: https://doi.org/10.5194/egusphere-2025-2252-RC1
- RC2: 'Comment on egusphere-2025-2252', Anonymous Referee #2, 07 Aug 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-2252/egusphere-2025-2252-RC2-supplement.pdf
- RC3: 'Comment on egusphere-2025-2252', Anonymous Referee #3, 27 Aug 2025
Review of egusphere-2025-2252, entitled “Standardising the “Gregory method” for calculating equilibrium climate sensitivity” by Zehrung et al.
Considering the limited consistency and transparency across studies in how climate model data are processed prior to regression (application of the Gregory method) and the potential impact of these choices on ECS estimates, the authors identify 20 alternative data-processing pathways and systematically assess their influence. Their results suggest that these pathways have little effect on the range or central values of ECS estimates for most models. Nevertheless, they propose a preferred pathway for preparing air temperature and radiative flux anomalies, supported by both objective reasoning and some degree of subjective judgment. I note, however, that their recommendation may also make the ECS calculation process somewhat more complicated.
Overall, the manuscript is well-written, well-organized, and easy to follow as a technical note. I recommend acceptance with minor revisions. While the results do not yield significant differences from current estimates, I find their recommendations for clearer documentation of anomaly calculation pathways to be valuable for the community.
Minor Suggestions & comments:
Lines 9–10: Please clarify that ECS is obtained by extrapolating to N = 0 in the N–ΔT regression, i.e., the ΔT intercept.
Lines 10–11, 121–123: Given that abrupt-4xCO2 simulations may extend to 300 years, I wonder how the authors envision ECS estimation in CMIP7, and how best to compare results across phases.
Lines 16–18, 460: Please emphasize that these sensitivities occur only for a very small set of models (outliers), while most models are not affected.
Lines 22–24, 455–459: While I appreciate the recommendations, I find they may make the ECS estimation unnecessarily complicated, especially given the finding of “no statistically significant difference” and “unlikely to see a meaningful difference in results.” Related concerns include: (1) areacella is not commonly used; (2) leap-year treatment varies across calendars (e.g., noleap, julian, gregorian); (3) checking the branching date of each simulation is time-consuming—branch time metadata are usually included in CMIP6 but rarely in CMIP5.
Lines 26–27: Did the authors mean that the “CMIP6 multi-model mean ECS appears not sensitive to these processing choices”?
Line 60: Did you mean “the global mean radiative response λΔT”?
Lines 65–67: Please confirm whether Gregory et al. (2004) actually used 150-year simulations.
Lines 154–155, 248: Were leap years accounted for, given the different calendars used by models?
Lines 157–170, 269–270 & Fig. 1: Why were anomalies relative to climatological monthly means not considered? Including a climatology-based method could show how model drift affects results.
Lines 171–176: While aligning abrupt-4xCO2 and piControl at the prescribed branch time is useful, checking branch dates for each model is time-consuming. Would it be possible for modeling centers to standardize branching dates across experiments (e.g., r1) or to mark them more clearly in their piControl runs to simplify this step? The authors might consider recommending this.
Lines 178–181: Since the variable rtmt (TOA net radiative flux) is available for most models, I suggest computing rtmt directly, before preprocessing, for those models without the variable.
Lines 234–235: Could the authors provide further explanation here, particularly regarding the different distribution of MPI-ESM1-2-HAM compared with the other two models?
Lines 250–251: Given the conclusions, is annual-mean weighting strictly necessary?
Lines 265–267 and elsewhere: Why not also report differences in the multi-model mean?
Lines 306–309, 463–465: I find it odd to recommend anomalies relative to a climatological mean or linear fit initially and then just apply a 21-year running mean over the piControl, citing Zelinka et al. (2020). Widespread use is not, by itself, a sufficient justification.
Lines 335–337: Is this effect due to the imposed radiative forcing in the idealized abrupt-4xCO2 experiment? The argument could be clarified.
Line 361: It might be useful to report ECS values explicitly, rather than only feedback values.
Lines 363–364: In fact, previous studies (e.g., Forster et al., 2016; He et al., 2025; Lutsko et al., 2022; Smith et al., 2020) suggest that the Gregory method (OLS) underestimates effective radiative forcing.
Forster, P. M., et al. (2016). JGR: Atmospheres, 121, 12,460–12,475. https://doi.org/10.1002/2016JD025320
He, H., et al. (2025). ESSOAr. DOI: 10.22541/essoar.175157564.42459435/v1
Lutsko, N. J., et al. (2022). JGR: Atmospheres, 127, e2022JD037486. https://doi.org/10.1029/2022JD037486
Smith, C. J., et al. (2020). Atmos. Chem. Phys., 20, 9591–9618. https://doi.org/10.5194/acp-20-9591-2020
Lines 436–438: It seems CMIP7 is prioritizing longer simulations rather than additional ensemble members—could the authors comment?
Citation: https://doi.org/10.5194/egusphere-2025-2252-RC3
There is another method to estimate the climate sensitivity, called the two-point method, which is a derivative of the Gregory method. This method, developed by us, takes into account the land surface warming in the fast adjustments (prescribed SST simulations) and the potential TOA radiative imbalance in the equilibrium slab simulations. It is based on the Gregory method and provides a better (more accurate) estimate of climate sensitivity. The method was first discussed in a paper (by us) that estimates the efficacy of methane forcing (https://doi.org/10.1007/s00382-018-4102-x); please see equation 4 in that paper. It is also used in the estimation of BC efficacy in https://doi.org/10.1088/1748-9326/ab21e7 and in another paper on solar geoengineering (https://doi.org/10.1088/1748-9326/ab21e7). Only later was this method called the "two-point" method, in several successive papers from us: https://doi.org/10.5194/esd-10-885-2019; https://doi.org/10.1029/2019EF001326; https://doi.org/10.1029/2020JD033256; https://doi.org/10.1088/1748-9326/ad5e9d; https://academic.oup.com/oocc/article/4/1/kgae016/7737801
Interestingly, we have not popularized this two-point method, and hence it is likely that you have not discussed it. I suggest you provide a detailed discussion of our two-point method for the estimation of climate sensitivity and cite the papers that have used it. This will help make this method known to the broader community of researchers working on climate sensitivity.