the Creative Commons Attribution 4.0 License.
CMIP6 data usage: Lessons learned from more than 200 million downloads
Abstract. Earth system simulations from the Coupled Model Intercomparison Project (CMIP) are considered the gold standard in terms of representation of the Earth’s climate system, its past and present states, and future evolution. As CMIP moves into its seventh phase, the increasing complexity of Earth system models (ESMs) means that there is a greater need for infrastructure resources to store, distribute and utilize CMIP simulations. Statistics on the usage of data during CMIP6 have the potential to offer guidance in preparing for CMIP7. Here, we analyse the usage of CMIP6 data and propose recommendations for optimising the production and accessibility of future CMIP data. Our analysis focuses on CMIP6 data usage statistics from the Earth System Grid Federation (ESGF), the main database of CMIP and other ESM simulation data. We analyse CMIP6 ESGF data usage statistics, with a focus on the usage of variables, experiments, individual thematic Model Intercomparison Projects (MIPs), sources and institutions, and related geographical usage trends. We further include statistics on usage from other sources hosting CMIP6 data, including some curated by community portals (Pangeo) through commercial clouds (Google Cloud and Amazon Web Services) and by climate services (Copernicus Climate Change Service). We conclude with recommendations for centres involved in the production and distribution of data to optimise resources based on usage statistics, and to implement improved approaches to track usage.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-1246', Stephan Kindermann, 27 Apr 2026
- RC2: 'Comment on egusphere-2026-1246', Mario Acosta, 29 Apr 2026
This manuscript analyses CMIP6 data usage across the Earth System Grid Federation (ESGF) and selected additional providers, with the aim of extracting lessons for future CMIP data production, curation and access. The topic is timely, relevant and of clear community interest, especially in the context of CMIP7 and the growing challenges associated with climate data volumes, storage costs, accessibility, and sustainability.
The manuscript’s main strength is that it brings together a uniquely valuable empirical basis, large-scale usage statistics, and turns them into practical reflections on archive design, metadata quality, storage strategies, and data tracking. This is useful for the GMD readership because it speaks directly to the usability and scientific reach of model output, rather than only to model development itself. In particular, the identification of substantial metadata inconsistencies in downloaded files, the analysis of differences between downloaded data volume and number of downloaded files, and the discussion of archive structure effects are all important contributions.
Overall, I find the manuscript valuable and suitable for publication after minor revision. My main concern is not the relevance of the study, which is strong, but rather the need to sharpen the interpretation of download statistics and to qualify some recommendations more carefully.
Main comments
1. Interpretation of downloads versus actual usage should be handled more carefully
The manuscript is appropriately transparent in stating that downloads provide only a partial view of usage and that many factors influence the statistics, including dataset size, time resolution, spatial resolution, ensemble size, archive strategy, and publication timing. This is one of the strengths of the manuscript. However, in the interpretation of the results, the language occasionally becomes stronger than the evidence fully supports.
For example, the paper draws conclusions about which variables, experiments, or sources are most “used”, but in many cases the statistics likely reflect a mixture of true user demand and archive-side effects. The variable-level ratio R, defined as downloaded data over available data, is a helpful step toward normalisation, but similar corrections are not developed with the same strength for experiments, models, or institutions.
I suggest that the authors tighten the language throughout the manuscript to distinguish more explicitly between:
- observed download activity,
- actual scientific use,
- archive availability and visibility,
- and access patterns conditioned by file structure or provider capabilities.
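To make the normalisation point concrete, the ratio of downloaded to available data can be sketched as follows. This is only an illustration: the variable names and volumes are hypothetical and are not taken from the manuscript, and the helper `download_ratio` is an assumed name. A ratio below 1 marks a variable for which more data were available than downloaded, which by itself says nothing about scientific value.

```python
def download_ratio(downloaded_tb, available_tb):
    """Return downloaded volume divided by available volume (dimensionless)."""
    if available_tb <= 0:
        raise ValueError("available volume must be positive")
    return downloaded_tb / available_tb


# Hypothetical (downloaded, available) volumes in TB, for illustration only.
volumes = {
    "var_a": (50.0, 10.0),  # downloaded several times over
    "var_b": (2.0, 8.0),    # more data available than downloaded
}

# Ratio per variable, then the variables whose ratio falls below 1.
ratios = {name: download_ratio(d, a) for name, (d, a) in volumes.items()}
below_one = sorted(name for name, r in ratios.items() if r < 1.0)
```

The same function could in principle be applied per experiment, model, or institution, provided the available volume is known at that level.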
2. The effect of archive structure is one of the most important results and deserves stronger treatment
The comparison between EC-EARTH3 and CESM2 for monthly historical tas is one of the most insightful parts of the manuscript. It clearly shows that archive choices, for example yearly files versus multi-decadal files, can dramatically alter the number of downloadable files and therefore the apparent usage statistics.
This point is central because it affects the interpretation of several later results, especially rankings by number of downloads and possibly also some by total downloaded volume. At present, this issue is discussed well, but mostly through one illustrative example. Since one of the paper’s practical recommendations is that more uniform file formatting could improve accessibility and interpretation, I think this argument would benefit from a slightly more systematic treatment.
For instance, the authors could:
- better discuss whether download volume is also substantially biased by storage conventions, not only download counts,
- clarify how representative the EC-EARTH3/CESM2 example is of broader CMIP6 publication practices,
- and emphasize more clearly that some source-level rankings cannot be interpreted independently of file chunking and storage decisions.
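The sensitivity of file counts to chunking can be shown with a small arithmetic sketch. It assumes the CMIP6 historical period of 1850 to 2014 (165 years) and compares yearly files with a hypothetical 50-year chunk length; the actual chunking used by any given model is not asserted here, and `files_per_series` is an illustrative helper.

```python
import math


def files_per_series(n_years, years_per_file):
    """Number of files needed to cover n_years at a given chunk length."""
    return math.ceil(n_years / years_per_file)


# CMIP6 historical experiment: 1850 to 2014 inclusive.
HISTORICAL_YEARS = 2014 - 1850 + 1

# One file per year versus a hypothetical 50-year chunk length.
yearly_files = files_per_series(HISTORICAL_YEARS, 1)
chunked_files = files_per_series(HISTORICAL_YEARS, 50)

# Factor by which one full download of the same series inflates the file count.
count_inflation = yearly_files / chunked_files
```

Under these assumptions, a single user retrieving one complete monthly series produces 165 file downloads in one layout and 4 in the other, while the downloaded volume is essentially the same.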
3. Recommendations regarding low-use variables should be framed more cautiously
One of the main conclusions is that, since for roughly a quarter of the variables more data was available than downloaded, future resource allocation could potentially be optimised by revisiting whether low-usage variables should be published at reduced frequency or with higher compression. This is a reasonable and useful discussion, especially in view of storage constraints and carbon footprint considerations.
However, I recommend that the authors frame this point more cautiously. Low download rates do not necessarily imply low scientific value. Some variables may serve specialised but essential user communities, model evaluation tasks, process studies, or reproducibility requirements. A variable can be scientifically important even if it is not widely downloaded. The paper does acknowledge some caveats, but the policy implication here remains stronger than the evidence warrants.
I therefore suggest explicitly stating that download statistics should inform prioritisation, but not be used as the sole criterion for decisions on what should or should not be archived in future CMIP phases.
4. The ESGF metadata inconsistency result is very important and could be highlighted even more
The finding that erroneous metadata combinations accounted for about 782,830 files and around 1000 TB of downloaded data is, in my opinion, one of the strongest and most actionable results of the manuscript. It directly illustrates the scientific and operational cost of imperfect metadata quality control, both for users and for the broader infrastructure ecosystem.
This result strongly supports the manuscript’s discussion of QA/QC improvements for CMIP7. I encourage the authors to highlight this even more clearly in the abstract and conclusions, because it is not merely a technical inconvenience but a real inefficiency affecting usability, reproducibility, and duplicated user effort.
5. Comparison across providers is useful, but should be presented as clearly asymmetric
The inclusion of AWS, Google Cloud and C3S broadens the relevance of the paper and is welcome. However, the resulting comparison is necessarily very uneven:
- Google does not provide usage statistics,
- AWS statistics are limited and difficult to interpret,
- C3S provides richer statistics but only for a constrained subset of variables, models and experiments,
- and ESGF remains by far the dominant quantitative basis for the analysis.
The manuscript already says this, but I think the framing should go slightly further. The paper is fundamentally an ESGF-based usage analysis, complemented by exploratory perspectives from other providers. Presenting it in this way more explicitly would help calibrate reader expectations and avoid overinterpreting cross-provider differences.
6. Citation analysis is interesting but should be even more clearly labelled exploratory
The attempt to compare download activity with citation counts is worthwhile, especially because it touches on the broader impact of CMIP data in science. However, the manuscript itself explains why this proxy is incomplete: dataset DOIs are inconsistently cited, some users cite model description papers instead, and the publication hub is sparse or unavailable.
Because of this, I recommend framing the citation analysis even more explicitly as exploratory and incomplete. The current discussion is mostly careful, but a reader could still come away with too strong a sense of relationship between downloads and scientific uptake. This section would benefit from a sentence stating directly that citation counts, in the present form, cannot support robust comparative conclusions across models or experiments.
Minor comments
- The manuscript would benefit from consistently distinguishing between “downloaded files”, “downloaded volume”, “requests”, and “users”, especially when comparing ESGF and C3S, since these metrics are not equivalent.
- The wording around “most used variables” or “most popular models” could be softened in some places to “most downloaded” unless normalisation or context is explicitly applied.
- The discussion of geographical patterns is interesting and potentially important, especially regarding differences between ESGF and C3S user locations. Still, I would encourage the authors to remain cautious in attributing these patterns to resource inequality alone, since untracked institutional mirrors and local infrastructures may play an important role.
- It would be useful to clarify slightly more how incomplete ESGF node coverage in the CMCC monitoring system may affect the interpretation of the results.
- The discussion of downscaled products is valuable and broadens the perspective of the paper beyond raw CMIP downloads. If possible, the authors could make even clearer that downstream impact through regional or statistically downscaled datasets is likely systematically underestimated by the statistics analysed here.
- The manuscript is generally well written and structured. I only suggest a final pass to ensure that the strongest conclusions always remain aligned with what the available statistics can actually demonstrate.
I recommend publication after minor revision. The manuscript is timely, relevant, and useful to the CMIP and wider climate data infrastructure community. Its novelty lies less in methodological sophistication than in the large-scale empirical analysis of actual CMIP6 data access patterns and in the translation of those patterns into practical lessons for future data publication and tracking. Once the minor recommendations above are addressed, I expect this paper to make a valuable contribution to GMD.
Citation: https://doi.org/10.5194/egusphere-2026-1246-RC2
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 1,180 | 578 | 59 | 1,817 | 142 | 52 | 68 |
Review: