the Creative Commons Attribution 4.0 License.
ClimateBenchPress (v1.0): A Benchmark for Lossy Compression of Climate Data
Abstract. The rapidly growing volume of weather and climate data, from both models and observations, is putting increasing pressure on data centers and restricting both scientific analysis and data distribution. For example, kilometre-scale climate models can generate petabytes of data per simulated month, making it generally infeasible to store all output. To address this challenge, numerous novel compression techniques have been proposed to ease data storage requirements. However, no well-defined benchmarks exist for rigorously evaluating and comparing the performance of these compressors, including their impact on the data's properties. This lack of benchmarks makes it difficult to design and standardize compressors for weather and climate data, and for scientists to trust that compression errors have no significant impact on their analyses. Here, we address this gap by presenting ClimateBenchPress, a benchmark suite for lossy compression of climate data, which defines both data sets and evaluation techniques. The benchmark covers climate variables following various statistical distributions at medium to very high resolution in time and space, from both numerical models and satellite observations. To ensure a fair comparison between compressors, each variable comes with a set of maximum error bound checks that the lossy compressors must pass. By evaluating an initial set of baseline compressors on the benchmark, we gather practical insights for the effective application of lossy compression. Our benchmark is open source and extensible: users can easily add new compressors, data sources, and evaluation metrics depending on their own specific use cases.
Status: open (until 06 Apr 2026)
- RC1: 'Comment on egusphere-2026-60', Anonymous Referee #1, 03 Mar 2026
Model code and software
ClimateBenchPress data-loader Tim Reichelt and Juniper Tyree https://doi.org/10.5281/zenodo.18015682
ClimateBenchPress Compressors Tim Reichelt and Juniper Tyree https://doi.org/10.5281/zenodo.18152639
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 254 | 145 | 10 | 409 | 19 | 33 |
Review of “ClimateBenchPress (v1.0): A Benchmark for Lossy Compression of Climate Data” by Tim Reichelt et al.
General comments
This new paper presents ClimateBenchPress, an open-source benchmark designed to standardize the evaluation of lossy compression algorithms for climate and weather data, addressing the current lack of consistent comparison frameworks in the field. As climate datasets grow to petabyte scale, lossy compression becomes essential, yet existing studies differ widely in datasets, error tolerances, and evaluation metrics. ClimateBenchPress includes diverse model and observational datasets, defines error bounds derived from uncertainty estimates, and provides distortion and compression metrics to enable fair comparison across compressors.
Testing several state-of-the-art methods (e.g., SZ3, ZFP, SPERR, JPEG2000, and rounding-based approaches), the authors show that while some achieve very high compression ratios, they may violate error bounds or mishandle edge cases such as NaNs. Simpler methods, such as bit rounding combined with optimized lossless compression, offer competitive and often more robust performance. Overall, the results demonstrate that no single compressor dominates across all variables and metrics, highlighting key trade-offs and the need for a standardized benchmark.
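To make the bit-rounding strategy referenced above concrete, here is a minimal numpy sketch (illustrative only; the function name and round-to-nearest details are assumptions, not taken from the manuscript):

```python
import numpy as np

def bit_round(a: np.ndarray, keepbits: int) -> np.ndarray:
    """Keep only `keepbits` mantissa bits of a float32 array (round to nearest).

    The zeroed trailing mantissa bits compress very well with a lossless
    backend. NaN/inf handling is omitted for brevity.
    """
    assert a.dtype == np.float32 and 0 <= keepbits <= 23
    bits = a.view(np.uint32)
    drop = 23 - keepbits
    half = np.uint32(1 << (drop - 1)) if drop > 0 else np.uint32(0)
    # add half an ULP of the kept precision, then truncate the dropped bits
    rounded = (bits + half) & ~np.uint32((1 << drop) - 1)
    return rounded.view(np.float32)

x = np.array([3.14159, 2.71828], dtype=np.float32)
y = bit_round(x, 10)  # relative error bounded by roughly 2**-11
```

A lossless codec (e.g., Zstandard) applied afterwards exploits the runs of zeroed bits, which is the essence of the "bit rounding combined with optimized lossless compression" approach mentioned above.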
In my view, this is a scientifically sound study with robust and well-substantiated results. The benchmark is thoughtfully designed, the methodology transparent, and the evaluation framework rigorous and reproducible. The manuscript is well written, clearly structured, and accessible to both compression specialists and climate scientists. I recommend publication in GMD, subject to the minor comments below.
Specific comments
Lines 26–28: The example given can be justified conceptually but would benefit from clarification. Global mean temperature is statistically robust to small, spatially uncorrelated compression errors that may cancel upon averaging. In contrast, local wind power estimates depend on nonlinear relationships and fine-scale variability, making them potentially more sensitive to small local errors. Clarifying this reasoning would avoid confusion.
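The distinction can also be illustrated numerically; the following sketch (entirely synthetic data, all values illustrative) shows uncorrelated errors cancelling in a global mean while being amplified pointwise by the cubic wind-power relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
noise = rng.uniform(-0.1, 0.1, n)        # uncorrelated "compression error"

# Global mean temperature: errors largely cancel upon averaging.
temp = 288.0 + rng.standard_normal(n)    # synthetic temperature field [K]
mean_err = abs((temp + noise).mean() - temp.mean())

# Wind power density scales as v**3, so local relative errors are roughly
# tripled and do not cancel pointwise.
v = 8.0 + rng.standard_normal(n).clip(-5.0, 5.0)  # synthetic wind speed [m/s]
power_rel_err = np.abs(((v + noise) ** 3 - v**3) / v**3)
```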
Lines 43–48: The term “variable characteristics” is vague. Please clarify which properties are meant (e.g., statistical distribution, intermittency/sparsity, smoothness vs. sharp gradients, spatial/temporal correlation scales, dynamic range, NaNs, extremes, etc.). A few examples would improve clarity.
Line 70: “Actionable insights” sounds generic. Consider specifying the practical guidance provided (e.g., recommendations for particular variable types or error tolerances).
Table 1: Cloud-related variables (e.g., liquid water content) are not included, although they represent a challenging case due to their 3-D structure, sharp gradients, and large near-zero regions. While sparsity and NaNs are partly represented by precipitation and SST, a brief comment on the exclusion of 3-D cloud condensates would be useful, possibly as a future extension.
Table 1: Since V is mostly 1 here, variables are effectively treated independently. While reasonable, it would help to state explicitly that multivariate compression is beyond the current scope. In practice, many variables are physically correlated (e.g., atmospheric chemistry tracers), and advanced methods may exploit such structure. A brief acknowledgment would strengthen the discussion.
Lines 79–88: The regridding discussion focuses on model output, but similar issues apply to satellite swath data provided in along-track/across-track coordinates. Regridding can alter statistical properties relevant for compression. A brief acknowledgment would clarify that restricting to regular grids simplifies comparability but does not reflect all real-world cases.
Line 101: “Linear packing” is mentioned in the context of the ERA5 data but not explained. A short definition or reference would be helpful.
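For reference, linear packing (as used for archived GRIB/ERA5 fields) maps each field linearly onto a fixed-width integer range via an offset and a scale; a minimal sketch (function names and values are illustrative assumptions):

```python
import numpy as np

def pack_linear(a: np.ndarray, nbits: int = 16):
    """Linear packing: map [min, max] of `a` onto nbits-wide integers."""
    lo, hi = float(a.min()), float(a.max())
    scale = (hi - lo) / (2**nbits - 1) or 1.0  # avoid div-by-zero for constants
    packed = np.round((a - lo) / scale).astype(np.uint16)
    return packed, lo, scale

def unpack_linear(packed: np.ndarray, offset: float, scale: float) -> np.ndarray:
    return packed.astype(np.float32) * scale + offset

x = np.linspace(250.0, 310.0, 1000, dtype=np.float32)  # e.g. temperature [K]
packed, offset, scale = pack_linear(x)
max_abs_err = float(np.abs(unpack_linear(packed, offset, scale) - x).max())
# the absolute error is bounded by about scale / 2
```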
Lines 128–129: The manuscript states that the absence of standard error bounds is “partly” due to application dependence. “Largely” may be more appropriate, as tolerable error levels are typically driven by downstream applications.
Lines 175–178 and Table 2: For several variables, the gap between the low (100th percentile) and mid (99th percentile) bounds significantly exceeds that between the mid (99th) and high (95th) bounds. This suggests sensitivity of the low bound to extreme outliers or heavy-tailed spread distributions. Please comment on this sensitivity and whether slightly lower percentiles (e.g., 99.9%) were considered.
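The sensitivity in question is easy to reproduce with a heavy-tailed sample (purely synthetic, not data from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(1)
spread = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)  # heavy-tailed spread

# Percentile-derived bounds: the 100th percentile (the sample maximum) sits
# far above the 99th, while the 99th and 95th remain comparatively close.
p95, p99, p999, p100 = np.quantile(spread, [0.95, 0.99, 0.999, 1.0])
```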
Lines 226–234: Instruction count is a useful reproducible metric, but wall-clock runtime remains highly relevant in practice. Parallelization (multi-threading, GPU support) and peak memory footprint can significantly affect scalability for large datasets. A brief discussion of these practical aspects would be valuable.
Figure 2: The scorecards are helpful. For example, SZ3 often achieves higher compression ratios but also larger error metrics and occasional bound violations. Although discussed later, more explicit guidance in the figure interpretation would help readers assess comparability across methods.
Figure 7: This figure shows that compressors can produce markedly different error distributions, even under identical nominal absolute bounds. While discussed, this reinforces that methods are not strictly comparable under a single bound alone. Future work could consider complementing the current protocol with additional tail metrics (e.g., p99/p99.9) or joint criteria on maximum and distributional error properties.
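The suggested tail metrics could be computed cheaply alongside the maximum; a sketch with two synthetic error distributions that satisfy the same nominal absolute bound but differ sharply in their tails (hypothetical compressors, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
bound = 0.1

# Two hypothetical compressors, both respecting |error| <= bound:
err_a = rng.uniform(-bound, bound, 100_000)                      # errors pushed to the bound
err_b = rng.normal(0.0, bound / 5, 100_000).clip(-bound, bound)  # concentrated near zero

# Same nominal bound, very different tails:
tail_a = np.quantile(np.abs(err_a), [0.99, 0.999])
tail_b = np.quantile(np.abs(err_b), [0.99, 0.999])
```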
Lines 468–471: Benchmarking full high-resolution model outputs (terabyte scale) would be highly valuable. Such tests would better reflect modern data volumes and assess compressors under realistic storage, I/O, and scalability constraints, complementing the current laptop-scale setup for HPC environments.
Technical corrections
Line 99: Remove extra “.”
Line 136: Index v runs from 0 to V, implying V+1 elements; is this intended?
Line 318: Rephrase as “This ensures …”
Figures 3 and 6: Adding distinct marker symbols alongside colors and linestyles would improve clarity.