RIME-X v1.0: Combining Simple Climate Models, Earth System Models, and Climate Impact Models into a Unified Statistical Emulator for Regional Climate Indicators
Abstract. Many tasks in climate science, including climate impact assessment, scenario analysis, and end-to-end attribution, require efficient methods to translate a wide range of emissions scenarios into regional-scale climate indicators while explicitly accounting for uncertainty. Climate and impact model emulators are statistical models that approximate selected outputs of comprehensive models and can perform this translation. The Rapid Impact Model Emulator (RIME) uses individual simulations from climate or impact models to empirically relate global mean surface air temperature (GMT) levels to regional-scale indicators, enabling the conversion of GMT trajectories, commonly derived from Simple Climate Models (SCMs), into time series of regional climate impacts. Here, we present the Rapid Impact Model Emulator Extended (RIME-X), an extension of the RIME framework that replaces deterministic emulation of individual models along single GMT trajectories with a probabilistic approach. RIME-X combines ensemble simulations of GMT derived from SCMs with warming-level-dependent regional indicator distributions estimated from weighted Model Intercomparison Project (MIP) data. This results in scenario-dependent, time-evolving probability distributions of regional indicators. By jointly quantifying global and regional sources of uncertainty from the start, RIME-X enables systematic exploration of the full space of plausible regional climate impact trajectories under different emissions scenarios. The method is applicable to regional indicators whose distributions are predominantly determined by warming level and provides a computationally efficient framework for uncertainty-aware regional indicator emulation. We provide an open-source Python implementation of RIME-X, including preprocessing workflows for data from the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) and support for user-defined indicators.
Status: final response (author comments only)
RC1: 'Review of Schwind et al', Benjamin Sanderson, 09 Feb 2026
AC2: 'Reply on RC1', Niklas Schwind, 04 May 2026
Dear Dr. Ben Sanderson,
We thank you for the thorough and thoughtful assessment of our manuscript and for recognizing the novelty and usefulness of the RIME-X framework. We particularly appreciate the positive evaluation of the method as a pragmatic and probabilistic approach that effectively leverages existing model ensembles while enabling exploration of a broader scenario space.
We are encouraged that you find the framework to be promising and elegant and agree on its potential value for climate impact emulation and downstream applications. We also appreciate the constructive suggestions provided, especially regarding further characterization of limitations, validation of impact-relevant indicators, and improvements to usability and robustness of the codebase.
We briefly summarize the main revisions corresponding to the comments addressed below:
- we clarify the methodological limitations of the RIME-X framework, particularly regarding assumptions and scope of applicability, and add additional validation in edge cases
- we expand the validation of impact-relevant indicators to better demonstrate robustness across use cases
- we pledge to enhance the usability and documentation of the codebase
- we incorporate additional discussion and examples to better illustrate practical applications and limitations
In the following, we address each comment in detail and outline how we will revise the manuscript accordingly.
Comment #1:
The key limitation of the approach (an assumed mapping between global mean temperature and impacts) is acknowledged by the authors as limiting applicability to questions of path dependence, such as deep overshoot where warming patterns may significantly differ from a non-mitigation scenario. The extended work on overshoot impacts by many of the coauthors exemplifies this concern. Though this limitation is acknowledged, a useful extension would be to apply the technique to a scenario like SSP534over and gauge which variables are impacted by path dependency, and to what degree. Similarly, it would be useful to operationally test with SSP370 as a test of applicability to strongly differing aerosol emissions pathways. The authors could also consider what the optimal training set should be - e.g. is it an asset to keep SSP585 in the training set, if that scenario exhibits high warming rates which are unlike policy-relevant scenarios and it might actually reduce performance?
Thank you for this feedback. We agree that, while path dependence and aerosol effects are explicitly acknowledged as limitations of the method, their quantitative impact on emulator performance warrants further analysis.
To address this, we extended the leave-one-out validation framework described in Section 3 by conducting additional experiments using SSP1-2.6 and SSP3-7.0. Those results will be additionally included and discussed in the revised paper. SSP1-2.6, with its slight overshoot and extended stabilization period, provides a useful test case for assessing biases arising from path dependence. SSP3-7.0 allows us to evaluate the sensitivity of the method to differing aerosol emission pathways relative to the training scenarios. The results can be seen in Figures 1 to 4 in the supplementary material to this response.
We chose not to include SSP5-3.4-OS in this validation, as the number of available CMIP6 simulations for this scenario is substantially smaller than for the other scenarios. This would introduce additional sampling uncertainty in the ground-truth distributions and limit the comparability of the error metrics to validation experiments with other scenarios.
The results show that, for SSP1-2.6, biases increase, particularly for precipitation-related indicators. Minimum errors remain approximately unchanged for temperature and increase slightly for precipitation (+0.3 pp). Median errors increase slightly for temperature (+0.1 pp) and more substantially for precipitation (+1.3 pp). Maximum errors increase slightly for temperature (+0.8 pp) and nearly double for precipitation (increase of up to +8 pp). The Q–Q plots indicate that the increase in maximum precipitation errors is associated with a progressive shift of the simulated distribution away from the emulated distribution over time, pointing to path-dependent effects. This is consistent with expectations, for example due to continued ocean warming influencing regional precipitation patterns even after global temperatures stabilize.
For SSP3-7.0, we observe a moderate increase in biases, particularly in regions strongly affected by aerosol changes, such as parts of Africa and Asia. Minimum errors decrease slightly for temperature (−0.1 pp) and increase slightly for precipitation (+0.3 pp). Median errors increase moderately for temperature (+0.7 pp) and precipitation (+0.9 pp). Maximum errors increase more noticeably for temperature (+2 pp) and again nearly double for precipitation (increase of up to +7 pp).
Overall, these results indicate that both path dependence and aerosol forcing can introduce systematic deviations, but that these remain moderate in most regions and are consistent with the limitations discussed in the manuscript.
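For concreteness, the sketch below shows one way such percentage-point (pp) distribution errors can be computed: an illustrative Kolmogorov-style distance between empirical CDFs on synthetic data. This is not the exact metric implementation of Section 3 of the manuscript, and all names and numbers here are placeholders.

```python
import numpy as np

def cdf_error_pp(emulated: np.ndarray, simulated: np.ndarray) -> float:
    """Maximum absolute difference between two empirical CDFs,
    evaluated on the pooled sample values, in percentage points (pp)."""
    grid = np.union1d(emulated, simulated)
    cdf_emu = np.searchsorted(np.sort(emulated), grid, side="right") / emulated.size
    cdf_sim = np.searchsorted(np.sort(simulated), grid, side="right") / simulated.size
    return 100.0 * np.max(np.abs(cdf_emu - cdf_sim))

# Placeholder per-region errors, summarized as min/median/max as in the text above
rng = np.random.default_rng(0)
errors = [cdf_error_pp(rng.normal(0.0, 1.0, 500), rng.normal(0.1, 1.0, 400))
          for _ in range(26)]
print(np.min(errors), np.median(errors), np.max(errors))
```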
Regarding the question on the optimal training set, we agree that this is an important consideration. In general, the training dataset should be as large and diverse as possible, covering a wide range of scenario characteristics. In particular, including high-warming scenarios is beneficial, as it ensures that models with lower transient climate responses also contribute information at higher GMT levels.
To clarify this point, we have expanded the discussion of training data limitations in the manuscript (Section 3.1) and added the following:
“As a result, higher-emission scenarios generally sample a broader span of GMT levels, which counteracts—at least to some extent—the reduction in available samples for any specific high-GMT bin. The inclusion of high-warming scenarios such as \textit{SSP5-8.5} in the training data also provides additional coverage at high-GMT levels.”
We have also clarified this point in the conclusions:
“Moreover, as fewer ESMs reach high warming levels, the robustness of emulations decreases with warming level due to reduced training data, making the inclusion of high warming scenarios like \textit{SSP5-8.5} in the training data particularly useful for a wide applicability of the method.”
Comment #2:
Out-of-sample validation of temperature/precip emulations is convincing and well presented. But while the model is demonstrated for its use in impact emulation (extremes, crop yields etc), and these results are highlighted in the abstract, validation is missing for these quantities. Some assessment of the out-of-sample performance of impact-relevant metrics would be very useful to demonstrate the model is viable for real-world applications. This is especially true given the model's use in the Climate Impact Explorer - where end-of-chain results are the focus, but they are not properly validated. For these end-user applications, an assessment of reliability as a function of impact would be invaluable. For the purposes of a development-model description paper, this is all understandable - but the title/abstract should be slightly adjusted to create a clearer impression of what is validated and what is not.
Thank you for this valuable feedback. We agree that validation of impact-relevant indicators is important, particularly in the context of end-user applications such as the Climate Impact Explorer.
Our primary validation focuses on CMIP6 temperature and precipitation, as these datasets provide sufficiently large ensembles to derive robust ground-truth distributions from the simulations (each simulation only provides one sample per timestep, so to retrieve a relatively robust time-dependent distribution for a particular scenario, we require a high number of simulations for that particular scenario) and to meaningfully quantify errors between simulated and emulated distributions. For many impact indicators (e.g., from ISIMIP), such validation is more challenging due to the limited number of available simulations, which can lead to substantial sampling uncertainty in the estimated ground-truth distributions.
Nevertheless, to address this point, we have added an additional evaluation of impact-relevant and extreme indicators included in the Climate Impact Explorer, which we will also add and discuss in the revised manuscript. Specifically, we compare time series of selected quantiles (median, 5th, and 95th percentiles) emulated by RIME-X against the corresponding simulated time series. To avoid cherry-picking from a large set of possible indicators, we restrict this analysis to the indicators already presented in the showcase section of the manuscript.
For each indicator, Figures 5–9 of the supplementary material to this comment now show results across 15 regions: the regions with the lowest, median, and highest normalized mean absolute error, as well as regions around the 5th and 95th percentiles of the error distribution. This provides a representative overview of emulator performance across a range of conditions. We chose to emulate SSP3-7.0 because data for this scenario was available for all indicators; notably, we did not exclude this scenario from the calibration data, given the already sparse amount of data available for calibration.
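For reproducibility, a minimal sketch of this region-selection logic follows; the synthetic error values and region codes are placeholders, and taking the anchor region plus its two nearest rank-neighbours (to reach 15 regions from 5 anchor points) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
regions = np.array([f"R{i:03d}" for i in range(180)])  # placeholder region codes
nmae = rng.gamma(2.0, 0.5, size=regions.size)          # placeholder per-region errors

order = np.argsort(nmae)                               # regions sorted by error
anchors = [0.0, 0.05, 0.50, 0.95, 1.0]                 # lowest, 5th pct, median, 95th pct, highest
picks = []
for a in anchors:
    centre = int(round(a * (order.size - 1)))
    lo = int(np.clip(centre - 1, 0, order.size - 3))   # anchor plus its two rank-neighbours
    picks.extend(order[lo:lo + 3].tolist())
print(regions[picks])                                  # 15 representative regions
```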
We note that some level of error is expected due to sampling uncertainty in the underlying simulations, particularly for indicators with limited ensemble sizes. Nonetheless, these results provide additional evidence that the emulator captures the main features of the simulated distributions.
We further investigated the particularly large maximum errors in the 95th percentile of the maize yield change indicator. These can be traced to the impact model DSSAT-Pythia, which does not provide outputs for SSP3-7.0 but exhibits comparatively large values under SSP5-8.5 (notably in regions such as PAK, LIE, and SVN). As a result, these values are included in the RIME-X–derived distributions but are absent from the corresponding “ground truth” SSP3-7.0 simulations, leading to inflated errors in the upper quantiles.
Finally, following your suggestion, we have clarified in the abstract that the primary validation focuses on temperature and precipitation from CMIP6, while results for impact indicators are demonstrated and qualitatively evaluated.
“By jointly quantifying global and regional sources of uncertainty from the start, RIME-X enables systematic exploration of the full space of plausible regional climate impact trajectories under different emissions scenarios. The method is conceptually applicable to all regional indicators whose distributions are predominantly determined by warming level and provides a computationally efficient framework for uncertainty-aware regional indicator emulation. We evaluate the method using out-of-sample validation on temperature and precipitation simulations from the Coupled Model Intercomparison Project 6 (CMIP-6) and demonstrate its applicability to a range of impact-relevant and extreme event indicators derived from the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP). We provide an open-source Python implementation of RIME-X, including preprocessing workflows for data from ISIMIP and support for user-defined indicators.”
Comment #3:
Code is generally well written - separation of preprocessing and emulation, well-structured config files and dependencies. However, the lack of unit testing is a significant concern for robustness of future development, and I would highly recommend the authors implement a formal test suite. More documentation & tutorials would also improve the usability of the codebase for new users.
Thank you for this helpful feedback. We agree that a more comprehensive testing framework and improved documentation are important for the robustness and usability of the codebase.
We are currently extending the repository to include a formal unit and regression testing suite, as well as additional documentation and user-oriented tutorials. These improvements, along with further functionality such as support for changing socioeconomic conditions throughout the emulations, applicability to more datasets, and preprocessing of user-defined region masks from shapefiles, will be incorporated into the public repository over the course of this year.
Comment #4:
The code also seems to slightly differ from the paper in the mapping of GMT distributions to impact distributions. The paper describes a Monte Carlo sampling of temperatures from the SCM ensemble distribution, for each sample drawing corresponding impact values from the conditional distribution. The code instead converts the GMT ensemble into quantiles, then uses linear interpolation between pre-computed GMT bins to obtain impact values. This is a sensible approach, but slightly conceptually different from the paper's description and relies on the validity of linear interpolation between GMT bins. This should be discussed.
Thank you for this comment. The apparent difference arises because there are two versions of the code implementing RIME-X.
The version referred to here (https://github.com/iiasa/rimeX/blob/main/rimeX/emulator.py) is the first explicit implementation of the method, which deterministically calculates the quantiles of the marginal distribution.
However, this version proved difficult to parallelize efficiently on a full grid, which is required for the Climate Impact Explorer. For this reason, we developed a second implementation (https://github.com/iiasa/rimeX/blob/main/rimeX/preproc/quantilemaps.py). In this version, the function make_quantile_map_array is a preprocessing step that uses regional averages and warming level information from the MIP to create “quantile maps” representing the conditional distributions given warming levels, as described in the Implementation section of the paper. The emulation is then performed by make_quantilemap_prediction, which applies the sampling procedure using the SCM distribution together with the precomputed quantile map.
We intend to retain both implementations of the method, as the explicit version proved valuable for potential future use cases of the emulator that require a sampling strategy from the RIME-X distribution. However, to avoid confusion about the implementation, in the repository extension outlined in response to Comment #3 we will provide more comprehensive documentation and tutorials to better explain the codebase and improve users’ ability to effectively apply it.
Apart from small sampling differences, which quickly become negligible (we tested with up to 10,000 samples and typically found <1% difference between outputs; see Figure 10 in the supplementary material to this comment for an example), both implementations are output-equivalent. The advantage of the quantile-map version is that it can be applied gridpoint-wise in parallel, enabling full spatial emulation and producing distribution quantiles for each grid point, as shown in Figure 5 in the Preprint.
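To make the relation between the two code paths concrete, the following self-contained sketch implements the Monte Carlo path described in the paper on top of a precomputed quantile map, with linear interpolation between GMT bins as discussed above. The Gaussian conditional distribution and all numerical values are toy stand-ins, not the rimeX code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Preprocessing (analogous to building a quantile map): empirical conditional
# quantiles of the indicator per GMT bin, from a toy conditional distribution.
gmt_bins = np.round(np.arange(1.0, 3.0 + 1e-9, 0.1), 1)   # warming-level bin centres (°C)
q_levels = np.linspace(0.0, 1.0, 101)                      # 101 equally spaced quantile levels
quantile_map = np.array([np.quantile(rng.normal(2.0 * t, 0.5, 2000), q_levels)
                         for t in gmt_bins])               # shape (n_bins, 101)

def sample_indicator(gmt: float, u: float) -> float:
    """One Monte Carlo draw: evaluate the conditional quantile function at a
    random level u, linearly interpolated between the two neighbouring GMT bins."""
    i = int(np.clip(np.searchsorted(gmt_bins, gmt) - 1, 0, gmt_bins.size - 2))
    w = (gmt - gmt_bins[i]) / (gmt_bins[i + 1] - gmt_bins[i])
    lo = np.interp(u, q_levels, quantile_map[i])
    hi = np.interp(u, q_levels, quantile_map[i + 1])
    return (1.0 - w) * lo + w * hi

# Emulation: sample GMT from the SCM ensemble for one year, then draw one
# conditional indicator value per GMT sample (values outside the bin range
# extrapolate linearly, which only matters in this toy setup).
gmt_ensemble = rng.normal(1.8, 0.2, size=10_000)           # toy SCM GMT ensemble
samples = np.array([sample_indicator(g, rng.uniform()) for g in gmt_ensemble])
print(np.quantile(samples, [0.05, 0.50, 0.95]))            # emulated marginal quantiles
```

The quantile-map implementation avoids the per-sample loop by operating on the precomputed quantiles directly, which is what makes gridpoint-wise parallel application feasible.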
Comment #5:
Compound events - RIME-X is demonstrated to produce marginal distributions for temperature and precip conditional on global temperature. A useful extension would be to think about joint conditional distributions of multiple variables - which cannot be trivially inferred from the single distributions. This should be discussed as a limitation (as in a caveat not to trust the implicit joint distributions from the current model), and a possible future extension could be to do this formally.
Thank you for this suggestion. This is indeed a very relevant point, as not capturing spatial and temporal dependencies between distributions is a limitation to the applicability of the method, although it does not affect the accuracy of the marginal emulations. We have therefore added this explicitly as a limitation in the Conclusion, including the following sentence:
“This epistemic uncertainty is, however, partly mitigated through model weighting and GMT-based constraints. Additionally, the method does not capture temporal, spatial, or inter-variable correlations between the emulated distributions of multiple variables, regions, or time steps, even when they originate from the same underlying data source, meaning that dependencies across regions, variables, and time steps are not preserved between several emulated distributions.”
At the same time, we note that there is a practical way to emulate compound indicators within the current RIME-X framework. Specifically, compound metrics can first be computed from the underlying climate model data, thereby fully capturing spatial, temporal, and cross-variable dependencies, and can then be emulated as a single derived indicator using RIME-X. For example, wet-bulb temperature can be calculated from the raw simulation output of temperature and humidity and subsequently emulated as a compound variable.
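As a concrete illustration of this workflow (not part of the current rimeX code; the Stull (2011) wet-bulb approximation is one possible choice, and the synthetic daily data are placeholders), the compound indicator is computed on the raw joint series first, so that the temperature–humidity dependence is preserved, and only the derived annual value enters the emulator:

```python
import numpy as np

def wet_bulb_stull(t_c, rh_pct):
    """Stull (2011) approximation of wet-bulb temperature (°C) from
    air temperature (°C) and relative humidity (%)."""
    return (t_c * np.arctan(0.151977 * np.sqrt(rh_pct + 8.313659))
            + np.arctan(t_c + rh_pct) - np.arctan(rh_pct - 1.676331)
            + 0.00391838 * rh_pct**1.5 * np.arctan(0.023101 * rh_pct)
            - 4.686035)

rng = np.random.default_rng(2)
tas = rng.normal(25.0, 5.0, size=365)       # toy daily near-surface temperature (°C)
hurs = rng.uniform(30.0, 90.0, size=365)    # toy daily relative humidity (%)

# One derived indicator value per simulation year, emulated like any other indicator
annual_max_tw = wet_bulb_stull(tas, hurs).max()
print(annual_max_tw)
```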
Finally, we agree that a more general treatment of joint distributions across multiple variables, regions, or time steps would be highly valuable and methodologically feasible. We have therefore added this as a direction for future work in the Conclusion:
“Future work could address some of these limitations—for instance, by conditioning indicator distributions on changing socioeconomic variables like GDP or population, improving robustness at high warming levels using training data with prescribed GMT shifts like, e.g., from the upcoming Tipping Points Modelling Intercomparison Project \citep[TIPMIP;][]{winkelmann2025tipping}, testing additional global predictor indicators such as the likelihood of being before or after peak GMT, or extending the framework to produce multivariate output that captures spatial, temporal, or cross-indicator correlations particularly relevant for the analysis of compound events.”
Comment #6:
The 21-year running mean is deeply embedded as part of the pre-processing step, but this step smooths over annual variability and infrequent features, potentially cutting off an important part of the impact space associated with even mildly extreme years. The implication of this should be discussed more - how can the framework inform about once-per-decade events, which might in fact be the dominant impacts on infrastructure - and this could be an avenue the authors explore in future versions.
Thank you for this comment. The 21-year running mean is only used as a definition for global mean temperature (GMT) in the framework. For individual regional indicators, the temporal aggregation is configurable via the running_mean_window parameter in the configuration files, which is set to 21 years by default, as this choice was made for the Climate Impact Explorer.
However, this is not a structural limitation of the method. Users can choose different aggregation windows, and in the implementation it is even possible to compute distributions for individual months. In principle, there is no constraint preventing even finer temporal resolutions, such as daily values, depending on the underlying input data and the intended application, although daily resolution is not currently included in the implementation.
At the same time, there is a high level of structural flexibility for the indicator definition within the RIME-X framework. In principle, even 1-in-20-year extreme events can be identified and mapped using a running-window approach. However, such approaches would require large training data sets, as there is a clear trade-off with stochastic natural variability. In practice, the available training data limits what is possible, and a running mean approach has proven to be numerically robust. This is a structural limit of a mapping approach, as opposed to a generative emulator.
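To sketch what such a running-window indicator could look like (hypothetical, not part of the current implementation; the 31-year window and the plain empirical quantile estimator are illustrative choices):

```python
import numpy as np

def running_return_level(annual_max: np.ndarray, window: int = 31,
                         period: int = 20) -> np.ndarray:
    """Empirical 1-in-`period`-year return level within a centred running window."""
    q = 1.0 - 1.0 / period                    # the 0.95 quantile for 1-in-20-year events
    half = window // 2
    out = np.full(annual_max.size, np.nan)    # edges of the series stay undefined
    for t in range(half, annual_max.size - half):
        out[t] = np.quantile(annual_max[t - half:t + half + 1], q)
    return out

annual_max = np.random.default_rng(4).gumbel(30.0, 3.0, size=150)  # toy annual maxima
levels = running_return_level(annual_max)     # one indicator value per interior year
```

Estimating a 0.95 quantile from roughly 30 samples per window is inherently noisy, which is precisely the trade-off with stochastic natural variability noted above.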
Comment #7:
It would be useful to provide guidance for end-users on where the model is currently reliable, where it's merely informative and where it's just hypothetical. The provided validation indicates that the model is relatively robust at coarse-scale: temperature/precip projections under non-overshoot scenarios. Downstream Impacts, and high-frequency outputs are perhaps less robust - and it would be useful for end-users to flag these as areas in development, as well as scenarios where the underlying assumptions would be expected to fail (large aerosol signals, SRM, large overshoots, impacts dominated by land use assumptions). Laying these out (especially in the impact explorer) would strengthen the model framework in terms of providing areas of operational applicability.
Thank you for this helpful suggestion. We agree that providing clearer guidance on the domain of applicability is important for end users. However, we consider a strict separation into “robust”, “informative”, and “hypothetical” usage categories not to be robustly definable in a general and transferable way, as applicability depends on the specific indicator, region, scenario, aggregation scale, and use case.
Instead, we prefer to give an extensive list of possible features of indicators, regions, and scenarios that could introduce biases or degrade the emulations. We have strengthened the manuscript by further expanding the discussion of limitations and scenarios where the underlying assumptions are expected to break down. Some of these limitations were already included, and we have now added the additional cases highlighted in your comment. Specifically, we have expanded the text as follows (bold additions):
“Though supported and applied in the literature for many indicators \citep{herger2015improved,schleussner2016differential,hausfather2022climate}, this assumption can introduce biases, particularly for overshoot or stabilization scenarios, and in regions or indicators affected by pattern effects, high regional aerosol concentrations, changing circulation patterns, land–atmosphere feedbacks, SST-driven pattern effects, monsoon dynamics, or changes in regional land use \citep{schleussner2024overconfidence,pfleiderer2024limited,wells2023understanding,giani2024origin}. These limitations also restrict the applicability of our approach to idealized model experiments deploying solar radiation modification.”
In addition, we have strengthened the conclusions by providing a consolidated list of limitations and applicability constraints:
“However, some limitations can affect RIME-X's accuracy. It assumes that the regional indicator depends primarily on GMT, excluding influences like aerosols, shifts in socioeconomic conditions, land use, or circulation patterns (including SST-driven pattern effects, monsoon dynamics, and land-atmosphere feedbacks). This limits applicability to regions or indicators where this assumption approximately holds. It also does not account for the path to reaching GMT levels, limiting applicability in scenarios or indicators where time-lag effects play a major role for the assessment of regional indicators, such as overshoot scenarios. In addition, the framework may be less robust in scenarios where regional aerosol forcing differs strongly from the training scenarios. Our approach is also not validated for idealized model experiments deploying solar radiation modification.”
RC2: 'Comment on egusphere-2025-5781', Anonymous Referee #2, 12 Feb 2026
Please see the attached pdf for my review.
AC1: 'Reply on RC2', Niklas Schwind, 03 May 2026
Dear Reviewer,
We thank you for the thorough and constructive assessment of our manuscript and for the positive evaluation of the RIME-X framework. We appreciate your recognition of the clarity of the manuscript, the flexibility of the approach, and the overall quality of the figures and presentation.
We are particularly grateful for your thoughtful comments regarding the evaluation of the model across a broader range of indicators, as well as your detailed and helpful line-by-line suggestions, which provide valuable guidance for improving the clarity and completeness of the manuscript.
In the following, we address the comments related to the content, methodology, and validation of the paper in detail and outline how we will revise the manuscript accordingly. We then conclude with a consolidated list of the requested textual and structural changes that will be implemented in the revised version.
Comments related to the Introduction
Comment 1:
L60+ the rest of the intro is really model description; I get that you want to preview the method here but this feels like overkill. Readers scanning through won’t come to the intro for this information.
Thank you for this comment. We understand the concern, but we prefer to retain this level of methodological preview in the introduction, as it helps readers quickly understand the core idea and positioning of the framework before entering the Methods section. To address the issue of clarity and scannability, we have added a short orienting sentence at the beginning of the methodological preview in the introduction of the revised manuscript to make explicit that this part serves as a brief overview rather than the full methodological presentation:
“To overcome these challenges, we present the Rapid Impact Model Emulator Extended (RIME-X) — a novel, top-down probabilistic emulator designed as a probabilistic extension to the deterministic Rapid Impact Model Emulator (RIME) \citep{byers2025fast}. Both approaches are briefly outlined here, while a detailed description of RIME-X is provided in Section~\ref{methodologysection}.”
Comment 2:
L80 I think “assumption-light” may be a bit much if you don’t include some of the assumptions this does rely on – e.g. that GWL timeslices are more appropriate than simple pattern scaling. You go into the assumptions later so maybe worth referring to that section here, or move this to that section.
Thank you for this suggestion. We will remove “assumption-light” from the sentence in the revised manuscript.
Comment 3:
I think it’d be worth noting more clearly other couplings of SCMs with regional emulators which are relevant to your work, e.g. the MESMER-MAGICC coupling https://gmd.copernicus.org/articles/15/2085/2022/ and STITCHES https://esd.copernicus.org/articles/13/1557/2022/ (which as I understand it is designed to be driven with any GMST; you cite this but not really in this context).
Thank you for this helpful suggestion. We agree that it is important to more clearly acknowledge existing approaches that couple SCMs with regional emulators, such as the MESMER–MAGICC framework, and to better place our work in this context.
In the revised manuscript, we will explicitly reference those approaches and clarify the role of SCM–emulator coupling in enabling bottom-up uncertainty analysis. In particular, we will revise the sentence in lines 53–55 as follows:
Original:
“These methods support a more holistic bottom-up exploration of scenario uncertainty, global climate response uncertainty, model uncertainty, and natural variability for a set of regional indicators by repeatedly emulating those indicators for multiple scenarios and GMT pathways using emulator instances calibrated on different ESMs.”
Revised:
“Coupling these methods with SCMs enables a more holistic, bottom-up exploration of uncertainties by deriving a probabilistic set of GMT pathways across scenarios from SCMs and repeatedly emulating regional indicators along these pathways using emulator instances calibrated on different ESMs \citep[e.g.][]{Beusch_Nicholls_2022, tebaldi2022stitches, Schwaab_al._2024}.”
Comments related to the Methodology
Comment 1:
Figure 2 – on panel b at the bottom, are dots the best way to show this? why not a distribution? If it must be dots, would suggest adding some noise to the y value so the density can be more easily seen; it doesn’t convey much info currently I think. Also, the cmap used is quite narrow; it’s hard to see much difference across the decades. Would consider using one with more contrast if possible. Panel a legend should be “ensemble” singular I think, and you don’t need to have “reached in” in each one, in fact I don’t think you need any of the decadal labels in either plot as the reader can see from the dots in a what they correspond to.
Thank you for this helpful suggestion. In the revised version of the manuscript, we will improve the figure by using a colormap with higher contrast to better distinguish the decadal differences. We will also improve the labels, as suggested.
The distributions shown in panel (b) are directly derived from the points shown in the same panel; therefore, a distribution-based visualization is already available. The dots are intentionally kept to provide a visual link between panel (a) and panel (b), allowing readers to trace how the ensemble samples map into the resulting distributions.
Comment 2:
L135 in the equation, why is it $T_i \le \mathrm{GMT} < T_i + \Delta T$ and not $T_i - \Delta T/2 \le \mathrm{GMT} < T_i + \Delta T/2$, i.e. why is the GMT of the bin not in the middle of the bin? Maybe as $\Delta T$ is small this isn’t important (though this can be changed by the user I understand, so it could be) but seems in theory to be the less obvious choice.
Thank you for this question. This is primarily an implementation choice driven by how warming level bins are defined in the configuration. In the current setup, users specify a minimum and maximum warming level together with a fixed warming level step size.
With this definition, bins are constructed directly from the specified bounds and step (e.g. min = 0, max = 5.0, step = 0.1), which yields bin edges aligned with these values (i.e. 0.0, 0.1, …, 5.0). If instead a centred bin definition of the form $T_i - \Delta T/2 \le \mathrm{GMT} < T_i + \Delta T/2$ were used, this would either shift the effective bin centres to values such as 0.05, …, 4.95, which is less intuitive given the user-defined configuration, or it would require including values outside the specified minimum and maximum warming levels.
For consistency with the configuration interface and interpretability of user-defined warming level grids, we therefore use the left-inclusive bin definition $T_i \le \mathrm{GMT} < T_i + \Delta T$.
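For illustration, a minimal sketch (not the actual rimeX code) of how the configured bounds and step translate into left-inclusive bins:

```python
import numpy as np

# User configuration: minimum / maximum warming level and step size (°C)
wl_min, wl_max, wl_step = 0.0, 5.0, 0.1

# Bin edges aligned with the configured values: 0.0, 0.1, ..., 5.0
edges = np.round(np.arange(wl_min, wl_max + wl_step / 2, wl_step), 1)

# Left-inclusive binning: bin i collects samples with edges[i] <= GMT < edges[i] + step
gmt_samples = np.array([0.93, 1.24, 1.98, 4.99])
bin_index = np.digitize(gmt_samples, edges) - 1
print(edges[bin_index])   # -> [0.9 1.2 1.9 4.9]
```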
Comment 3:
L152 “Note that we only have the ability to collect several values of IM in bins BIM(GMT) and extract the distribution because we discretized the GMT variable.” Is nearly a repeat of L139 “Note that we can only define probabilities for GMT levels like we do here because we discretized the GMT variable. ”- I think this could be rephrased or just kept once, and notedagain in the discussion.
Thank you for pointing this out. In the revised manuscript, we have consolidated the two sentences into a single statement at the end of Section 2.2.3 to avoid repetition and improve clarity.
Comment 4:
Fig 3 are the data points on the left below the lowest of the cbar? They seem lighter – which I guess could be the case as 1C GMT in Germany might be before 2005-15. I’m not clear on why the number of points in the distribution is decreasing as temperatures increase – what determines this? Fewer input model runs?
Thank you for this observation. We checked this, and the data points on the left correspond to a GMT level of 0.9°C and should therefore match the minimum of the colorbar.
Regarding the decreasing number of samples at higher GMT levels, this is indeed due to the reduced number of available simulated data points in higher warming ranges. This effect arises from the fact that fewer model simulations reach higher GMT levels. We have already discussed this limitation in the limitations and discussion sections of the manuscript.
Comment 5:
Fig 6 don’t think you need “value” on x axes. By the time we’re down to province scales (bottom graphs), we run into issues with the data, right? Since you use ESM data, the province won’t lie cleanly over gridcells, and may be much smaller than a gridcell? Just worth noting.
Thank you for this comment. We agree that the label “value” on the x-axis is unnecessary and will remove it.
Regarding the spatial scale, the province shown in Fig. 6 is Guangdong, which is relatively large (only slightly smaller than Cambodia and slightly larger than Uruguay). At this scale, the mismatch between administrative boundaries and ESM grid cells is limited, although not entirely negligible.
That said, we fully agree that this can become an issue for smaller provinces or regions, as well as for some countries with complex geometries or sizes comparable to or smaller than individual grid cells. Thus, we will add the following sentence to the list of limitations in the Conclusions of the revised manuscript:
“Finally, the applicability of the method at finer spatial scales is limited by the resolution of the underlying ESM input data, as coarse model grids may not adequately resolve smaller regions or complex boundaries.”
Comment 6:
Fig 5 unsure on the x labels – it’s the change in TXx in deg C right? So units should be something like “TXx cf 2020 (deg C)”
Thank you for the comment. Yes, the x-axis represents the change in TXx relative to the year 2020. We have revised the label accordingly to “Change in TXx relative to 2020 (°C)” to make both the variable and units explicit.
Comment 7:
When you use ISIMIP data, it’s bias-corrected, right? How might this affect things compared to raw CMIP output?
Thank you for this question. The climatic ISIMIP data used in this study are bias-corrected, while the ISIMIP output stemming from the impact models is not. As systematic biases present in, for example, raw CMIP6 output are reduced through this process, we expect that using bias-corrected data also reduces systematic biases in the emulated distributions, thereby improving their accuracy and applicability.
Comment 8:
Table A1 – the Experiments column isn’t clear to me – this is the total members across the scenarios listed? It’d be good to see this broken down by scenario for clarity, which you could do in the scenario column.
We thank you for pointing this out. The number reported in the “Experiments” column corresponds to the maximum number of ensemble members available for a single scenario, rather than the total across all listed scenarios. As this number can differ between scenarios, we agree that the current presentation may be unclear.
In the revised manuscript, we will therefore revise Table A1 to improve clarity by adding the number of ensemble members for each scenario directly in the “Scenario” column (in brackets). This will make the distribution of experiments across scenarios more transparent.
Comment 9:
Table A2 you also show TXx in section 2.4 – shouldn’t this be in the table too? Or clarify it’s just Fig 6.
Thank you for pointing this out. You are absolutely right; we will include TXx in Table A2 in the revised version to ensure consistency with Section 2.4.
Comments related to the Discussion and Validation
Comment 1:
L255 this is a great bit of discussion; but I think it’s worth noting this increased uncertainty only covers the training scenarios – so if someone looked at a scenario outside this range – e.g. extreme NETs or regional aerosols (even e.g. SRM), this uncertainty wouldn’t cover the effects because they’re not accounted for mechanistically. This isn’t a weakness of your model (if anything it shows its robustness) but this in-sample vs out-of-sample issue is key and I think you should bring this into the discussion. Similarly, one could argue that increased uncertainty is itself a limitation, if the variations driving that uncertainty could in theory be taken into account (again, just worth noting).
We thank you for this constructive addition. We agree that the distinction between in-sample and out-of-sample scenarios is important and should be made explicit in the discussion; we will thus add the following to L255 in the revised manuscript:
“However, when all scenarios and years of a training dataset are included in the calibration, any loss of predictability due to time lag or related effects will manifest as increased uncertainty (i.e., wider error bars) rather than as an overconfident estimate, as long as the emulated scenario is not a strongly out-of-sample case with respect to the training data, for example, when emulating scenarios involving high-overshoot pathways using a training dataset that does not include such processes.”
We agree with you that increased uncertainty can, in itself, be interpreted as a limitation—particularly in cases where the underlying sources of variability could in principle be explicitly represented rather than reflected as broader distributions. At the same time, the sentence at L255 is formulated in an informative manner, aiming to describe how such limitations manifest within the emulator (i.e., as increased uncertainty rather than overconfident estimates), rather than to evaluate them normatively. For this reason, we retain the original phrasing in the manuscript, while clarifying the scope of its validity as discussed above.
Comment 2:
L263 “This effect is partly mitigated by” I don’t think this is true – it’s a separate issue. I’m not sure what the argument here is to mean a wider GMT range from the SCMs counteracts the loss of samples – but am open to being convinced.
Thank you for this comment. The logic is that a wider GMT range means that more GMT bins are considered when constructing the final marginal distribution. As a result, even though individual conditional distributions are likely estimated from fewer samples per bin at higher warming levels, the total number of samples contributing to the marginal distribution does not decrease as strongly as the sampling density within each GMT bin, because more GMT bins are included. We therefore think that the larger GMT range at higher emission levels partly mitigates the loss of samples in each individual bin at higher warming levels.
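Schematically, using the notation of Section 2.2 (a restatement for this reply rather than a quote from the manuscript):

```latex
% Marginal distribution for scenario s and year y as a mixture of per-bin
% conditional distributions, weighted by the SCM-derived bin probabilities:
P_{s,y}(\mathrm{IM})
  = \sum_{i=1}^{N} P\left(\mathrm{IM} \mid T_i \le \mathrm{GMT} < T_i + \Delta T\right)
    \, P_{s,y}\!\left(T_i \le \mathrm{GMT} < T_i + \Delta T\right)
```

A wider GMT range means that more of the $N$ bins enter this sum with non-negligible weight, so the marginal draws on more training samples in total even where each individual bin is sparsely populated.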
Comment 3:
L281 worth being clear you don’t use 1500 runs (right?) – maybe say how many you use here. Also Table A1 mentions SSP119 which you don’t list here; but you don’t mention SSP460 in the table – so maybe these are mixed up? In any case needs clarifying.
Thank you for this comment. You are correct that this needs clarification, and also that there was a mix-up in the scenario naming in Table A1. We will correct this in the revised manuscript by replacing SSP1-1.9 with SSP4-6.0 and ensuring consistency across the text and tables.
Regarding the number of simulations: we do not use 1500 fixed runs, but instead include all available ensemble members from the models listed in the table. This results in a total of 1858 simulations for all scenarios combined and the historical period for tas, and 1850 simulations for pr. Of these, SSP2-4.5 contributes 337 simulations for tas and 336 for pr. For the calibration of the emulator, we use all simulations from models and ensemble members that also provide SSP2-4.5, excluding only those scenarios or ensemble members without corresponding SSP2-4.5 data. This yields approximately 1500 simulations (slightly fewer in practice due to these exclusions).
Comment 4:
This evaluation section is nice, but you only look at annual mean T and precip. I’d expect T to be pretty good, though precip does surprisingly well – although the importance of getting extremes right means the errors at those ends are more important than the other percentiles, and I think this should be noted. It’s nice to see this at the country level too. But it’s a shame to only look at these 2 variables as you make a point of how flexible RIME-X is in terms of temporal scale and the range of impact variables you can look at. I think it’d be very useful to have some analysis of a less standard variable – you looked at maize yield earlier so maybe some kind of ISIMIP output like that could be good – and/or a shorter timescale (you had some plots of TXx earlier but not here). This would greatly enhance the argument here I think.
Thank you for this valuable and constructive comment. We agree that extending the evaluation beyond mean temperature and precipitation would strengthen the argument, especially given the flexibility of RIME-X in terms of variables and timescales.
Our primary quantitative validation is based on CMIP6 temperature and precipitation, as these variables provide sufficiently large ensembles to construct robust ground truth distributions and to meaningfully assess distributional errors. For many ISIMIP impact indicators, ensemble sizes are substantially smaller, which can introduce considerable sampling uncertainty in the estimated ground truth distributions and complicate a comparable quantitative evaluation.
To address your suggestion, we have now designed an additional validation that includes all variables from the showcase section of the paper, including the maize yield change variable as an example of an impact variable and the TXx and extreme daily rainfall variables as examples of extreme variables on smaller timescales. We will also add this extra validation to the revised manuscript.
Specifically, we assess emulated time series of selected quantiles (5th, 50th, and 95th percentiles) against simulated data.
For each indicator, we present results in Figures 1 to 5 of the supplementary material to this response across 15 representative regions, including those with the lowest, median, and highest errors, as well as regions around the 5th and 95th percentiles of the error distribution. We use SSP3-7.0 for this evaluation due to consistent data availability across all selected indicators.
We further note that large maximum errors in the upper percentiles of the maize yield change indicator can be traced to structural differences in the availability of scenario-specific simulations in individual ISIMIP models (e.g. DSSAT-Pythia), where outputs that are available under SSP5-8.5 but not under SSP3-7.0 influence the emulated distribution tails and thus inflate apparent errors in the comparison.
Additionally, we have clarified in the abstract that the core quantitative validation is based on temperature and precipitation from CMIP6, while additional results for impact-relevant and extreme indicators are presented as demonstrative applications of the framework.
“By jointly quantifying global and regional sources of uncertainty from the start, RIME-X enables systematic exploration of the full space of plausible regional climate impact trajectories under different emissions scenarios. The method is conceptually applicable to regional indicators whose distributions are predominantly determined by warming level and provides a computationally efficient framework for uncertainty-aware regional indicator emulation. We evaluate the method using out-of-sample validation on temperature and precipitation simulations from the Coupled Model Intercomparison Project 6 (CMIP-6) and demonstrate its applicability to a range of impact-relevant and extreme event indicators derived from the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP). We provide an open-source Python implementation of RIME-X, including preprocessing workflows for data from ISIMIP and support for user-defined indicators.”
Finally, we will add a hint at the importance of the high and low percentiles in the description of Figure 8 in the revised manuscript:
“The expected deviation is quantified in the shaded area. While agreement across all percentiles is desirable, deviations in the tails are particularly important, as accurate representation of extremes is critical for many impact assessments.”
List of textual and structural revisions that will be implemented in the revised manuscript
At the end of this response, we provide a consolidated list of the requested textual and structural revisions that will be implemented in the revised manuscript, and do not require additional clarification:
L14 would say “applicable to [those] regional indicators” to ensure clarity that this wouldn’t apply to all indicators.
L33 this sentence on internal var feels a bit out of place but I can see you’re relating it to the model uncertainty in MIPs; maybe use “additionally” or something to link to the broader paragraph.
L41 this intro is all one paragraph – it needs breaking up really. This could be a good spot for a paragraph break.
L42 would say “enables [the] running [of]”
L59 another paragraph break maybe?
L137 should note it’s index i in the equation as you use j just before (and the equation should be numbered I think which would help here). Also I can’t see N defined (number of bins?).
L139 “levels like we” -> “levels as we” I think.
L146 “more on them” a bit casual I think – maybe just “see…”?
Fig 4 “which distribution” doesn’t read right. “the Climate Action” don’t need “the”? And “point hints at” implies ambiguity – just say “denotes” or similar. Don’t need to repeat “year” in the legend.
L195 “equally spaced 101 quantile levels” should be “101 equally spaced quantile levels” right? And worth clarifying this gives the integer percentiles 0-100?
Figure 6 is introduced before Figure 5 currently.
L208 “formats: In” → “formats. In”?
L212 I think the sentence order needs switching. “We highlight the median and the 90% confidence interval. ” seems to refer to the left hand side currently but this is on the right. Would intro the left hand side first to make it easier in any case.
L228 “section 3.1[,] which”
L242 this Leach 2021 paper doesn’t seem to look at calibration specifically – probably best to refer to the FaIR calibration paper here https://gmd.copernicus.org/articles/17/8569/2024/
L248 new paragraph please!
L251 “depend mainly on the level of GMT”– entirely depends, right?
L277 not sure why CMIP6 is italicised now; I don’t think it needs to be but just be consistent.
L285 ISIMIP italicised too.
RC3: 'Comment on egusphere-2025-5781', Anonymous Referee #3, 07 Apr 2026
The manuscript presents an extension of RIME, RIME-X, which combines multiple models to provide scenario-dependent, time-evolving probability distributions of regional indicators as a function of global mean surface air temperature (GMT). This work replaces deterministic emulation with a probabilistic approach designed to translate emissions scenarios into probabilistic distributions of regional climate indicators. The method aims to provide a computationally efficient framework for uncertainty-aware emulation for regional climate risk assessment.
The manuscript is well written, methodologically clear, and addresses an important gap in climate science applications. But the methodological assumption of dependency of the regional indicators on global warming levels requires further clarification, justification, and additional analysis prior to acceptance for publication.
Major Comments:
- The model’s validation is discussed only for temperature and precipitation (Figures 7 and 8). How does the model perform for other indicators? How about providing a table summarizing model performance for these?
- The entire framework relies on the assumption that regional climate impact indicator distributions are predominantly determined by global mean temperature. While this assumption is reasonable for some thermodynamic variables and some regions, it is problematic for tropical regions and for many regional indicators that are strongly influenced by SST gradients, circulation patterns, and land-atmosphere feedbacks: mostly precipitation, extreme rainfall, wind, and compound and multivariate extremes. So, the domain where the RIME-X framework can be applied needs to be clearly defined. Also, for the different indicators, the fraction of variance explained by global mean temperature could be mentioned. Limitations for hydroclimate and extremes, especially in monsoon regions, need to be discussed.
- Secondly, the method assumes that the regional responses depend on warming levels but not on the pathway taken to reach the warming level. The manuscript needs to explore the conditions of overshoot, heterogeneous anthropogenic aerosol forcing, and rapid reductions of aerosols.
- The validation is done based on the CMIP6 SSP2-4.5 scenario. The pathways for the scenarios in CMIP7 are different (https://doi.org/10.5194/egusphere-2024-3765, 2025). What is the expected deviation in results, if any, for this?
- The manuscript claims applicability to extremes. However, the methodology may not adequately capture the heavy tails, and compound extremes. It can smooth or underestimate tail risks, which are often the most relevant for impact assessments. Authors need to clarify whether the method is suitable for compound or cascading extremes.
- The use of 21-year running average removes important components of variability and limits applicability to interannual variability and event-scale extremes. Authors need to discuss implications of this for risk assessments.
Minor comments:
Abstract: The model’s performance (maybe for temperature and precipitation) should be added quantitatively.
L100-105: Mention what $P_{s,y}$ and $p$ denote to avoid any confusion for the reader.
L150: “… tas …” --> “… tas (temperature at surface) …”
L175: “… IM, we …”
L183: n=10000? Or 10.000 (fractional)?
L193: “… files, we call…”
Citation: https://doi.org/10.5194/egusphere-2025-5781-RC3
This paper describes the RIME-X framework, a novel approach to simulating distributional climate impacts as a function of emissions pathways, exploiting probabilistic simple climate model ensembles and a conditional methodology which expresses regional impacts as a function of global mean temperature, using existing databases of climate simulations to build a lookup table which can be weighted according to model performance and interdependence. The result is a model framework which can be used to produce quasi-probabilistic impact projections for novel scenarios with minimal computational effort.
The method is a smart approach to climate impact emulation, exploiting available data to make a reasonable approximation of probabilistic outcomes at the point level. The approach is pragmatic and natively probabilistic, the results are meaningful and easily interpretable. As such, the model is enormously useful as a way of utilizing ESM ensemble output while addressing ensemble biases and allowing exploration of a wider scenario space. The method also naturally pairs with existing work on model interdependency (further development on this front would be valuable, and I'd encourage the authors to do so).
In all - this is a comprehensive model description paper for a promising and elegant technique. I suggest some minor issues with the current submission which could be addressed in order to better frame the model's limitations for end users; these are quoted and addressed individually in the author reply above.