This work is distributed under the Creative Commons Attribution 4.0 License.
ClimateBenchPress (v1.0): A Benchmark for Lossy Compression of Climate Data
Abstract. The rapidly growing volume of weather and climate data, both from models and observations, is increasing the pressure on data centers and restricting scientific analysis and data distribution. For example, kilometre-scale climate models can generate petabytes of data per simulated month, making it generally infeasible to store all output. To address this challenge, numerous novel compression techniques have been proposed to ease data storage requirements. However, there exist no well-defined benchmarks for rigorously evaluating and comparing the performance of these compressors, including their impact on the data's properties. The lack of benchmarks makes it difficult to design and standardize compressors for weather and climate data, and for scientists to trust that compression errors have no significant impact on their analysis. Here, we address this gap by presenting ClimateBenchPress, a benchmark suite for lossy compression of climate data, which defines both data sets and evaluation techniques. The benchmark covers climate variables following various statistical distributions at medium to very high resolution in time and space, from both numerical models and satellite observations. To ensure a fair comparison between different compressors, each variable comes with a set of maximum error bound checks that the lossy compressors need to pass. By evaluating an initial set of baseline compressors on the benchmark, we gather practical insights for effective application of lossy compression. Our benchmark is open source and extensible: users can easily add new compressors, data sources, and evaluation metrics depending on their own specific use cases.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-60', Anonymous Referee #1, 03 Mar 2026
- RC2: 'Comment on egusphere-2026-60', Anonymous Referee #2, 19 Apr 2026
General comments
I found this to be a useful and well-motivated paper. ClimateBenchPress addresses a real problem in the climate-data compression literature: reported compression ratios are often hard to compare because different studies use different datasets, error tolerances, metrics, and sometimes different definitions of compression ratio. The benchmark proposed here is a practical step toward making these comparisons more reproducible.
The paper is also clear about what the benchmark is and is not meant to do. I appreciated the authors' effort to define compression ratio in terms of raw array size rather than final file size, the use of numcodecs-rs and WebAssembly to improve reproducibility across platforms, and the inclusion of qualitative analyses of error distributions, NaN handling, and spatial error patterns in addition to the summary scorecards. The baseline results are useful precisely because they show trade-offs and failure cases rather than identifying a single universally best compressor. I support publication after minor revisions.
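As a side note for other readers, this is my understanding of that compression-ratio convention in a minimal sketch; the function name and the use of NumPy's `nbytes` are my own illustration, not taken from the manuscript.

```python
import numpy as np

def compression_ratio(raw: np.ndarray, compressed: bytes) -> float:
    # Ratio of the raw in-memory array size to the compressed stream size;
    # using nbytes rather than a container file size keeps format overhead
    # (headers, chunk indices, metadata) out of the comparison.
    return raw.nbytes / len(compressed)
```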
Specific comments
- Lines 157-164 and Eqs. (1)-(2): I may be misunderstanding the percentile convention, but the ordering in Eqs. (1)-(2) appears inconsistent with Table 2. The text describes P100%, P99%, and P95% as percentiles of the ensemble spread data; under the usual percentile convention, P100% >= P99% >= P95%. This would make b_low the largest bound and b_high the smallest, whereas Table 2 presents low, mid, and high as increasing error bounds. Please clarify the percentile convention or check whether the labels in Eqs. (1)-(2) are reversed.
- Definition 1 and Table 2: The manuscript does define a pointwise relative error bound, and it notes that this differs from the normalized ratio |X_i - Xhat_i| / |X_i|. I still think this distinction deserves more emphasis for readers from the compression community, because different papers and compressors use different relative-error conventions near zero. In this benchmark, the definition |X_i - Xhat_i| <= b_rel |X_i|, together with the exact-equality condition, makes the bound zero-preserving (see the sketch after this list). That matters for interpreting failures on biomass, precipitation, and other zero-heavy or near-zero fields.
- Lines 179-187 and Table 2: The text says that the expert bounds mostly lie between the low and high computed bounds, with exceptions only for sea-level pressure and precipitation. However, the expert bound for sea-surface temperature is 0.01 degrees C, which is much larger than all three computed SST bounds in Table 2. Please check either the table or the statement in the text.
- Table 2 and lines 179-187: The caption notes that some expert bounds are average rather than pointwise bounds. Since the benchmark bounds themselves are pointwise, I would also mention this in the text where the comparison to expert bounds is discussed. Otherwise the comparison can sound more like-for-like than it really is.
- Section 2.2.1: The spectral error metric is useful, but I could not tell exactly how it is computed for global longitude-latitude data. Is the spectrum computed directly on the native lon-lat grid? Is any latitude weighting used? Is there any reprojection? How is the radial averaging defined? A few implementation details would make this easier to reproduce.
- Section 2.2.1: Relatedly, please clarify how DSSIM and spectral error are aggregated for variables with multiple time steps and vertical levels. Are they computed per 2-D slice and then averaged, or on higher-dimensional fields directly?
- Lines 313-324 and 347-358: The conversion from relative to absolute bounds using the smallest non-zero right-hand side is intentionally conservative, but it can be dominated by near-zero or subnormal values (the sketch after this list illustrates the effect). The precipitation example later in the paper shows this well. I would make this point more explicit in the main text, especially for compressors that do not natively support the benchmark's relative-error definition or use a different relative-error convention.
- Figure 2 and Table 3: I found Table 3 helpful, especially the bracketed entries for partial relative-error support. In the scorecards, however, it is still easy to conflate two cases: compressors that do not natively support the tested pointwise constraint, and compressors that nominally support it but fail in particular cases. Some additional cue in the Figure 2 caption or surrounding text would help readers separate these cases.
- Lines 227-234, Figure 4, and Figure E1: The instruction-count metric is a nice reproducibility feature. The paper already notes that wall-clock measurements are noisy and that WebAssembly trades execution speed for reproducibility and portability. I would still bring a little more of this practical interpretation into the main text, especially how the single-threaded WebAssembly measurements should be read relative to native or parallel implementations.
- Lines 392-405: The chunking sensitivity analysis is valuable, especially because SZ3 behaves differently under relative-error settings. Since chunking is often important in practice, please report the actual chunk shapes used, perhaps in an appendix or supplement.
- Figure 3 and Appendix D: The normalized rate-distortion summaries are helpful, but they should not be read as a single global Pareto ranking. Because the values are normalized per metric and variable and then averaged, I would add a short caution in the main text about over-interpreting the aggregate ordering.
- Figure 8 and lines 426-434: Please explain in the Figure 8 caption why only a subset of compressors is shown. The text gives the context, but a reader looking directly at the figure may wonder why SPERR and JPEG2000 are absent, or why the two lossless backends are not separated.
- Section 2.1 / Figure 2 caption / lines 336-339: SST is described earlier as containing NaNs, but I only learned later that the air-temperature field also contains NaNs. Please mention this earlier in the dataset description.
- Introduction and Section 3.1: The introduction motivates the broader area partly by mentioning neural compression, but the baseline suite only includes conventional codecs. A short explanation of why learned methods are not included in the current baseline would help frame the benchmark and future extensions.
- Line 101: "Linear packing" may not be familiar to all readers. A short definition or reference would help.
- Conclusion, lines 448-449, versus Table 1: The phrase "below a couple of gigabytes" seems inconsistent with Table 1, which gives the evaluation set size as about 2.95 GB. "Around 3 GB" would be more accurate.
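To make the zero-preservation point and the relative-to-absolute conversion point above concrete, here is a minimal NumPy sketch of how I read the definitions; the function names are my own, and the benchmark's exact treatment of NaNs and subnormals may differ.

```python
import numpy as np

def passes_relative_bound(x, x_hat, b_rel):
    # Zero-preserving pointwise relative bound as I read Definition 1:
    # values that are exactly zero (or NaN) must be reproduced exactly,
    # all other values must satisfy |x_i - x_hat_i| <= b_rel * |x_i|.
    exact = (x == 0) | np.isnan(x)
    zeros_ok = np.array_equal(x[exact], x_hat[exact], equal_nan=True)
    rest_ok = np.all(np.abs(x[~exact] - x_hat[~exact]) <= b_rel * np.abs(x[~exact]))
    return zeros_ok and rest_ok

def conservative_absolute_bound(x, b_rel):
    # Conversion for compressors without native support for this relative bound:
    # an absolute bound b_abs = b_rel * min |x_i| over finite non-zero values
    # implies the relative bound everywhere, but a single near-zero or
    # subnormal value can make b_abs vanishingly small (cf. precipitation).
    finite_nonzero = np.abs(x[np.isfinite(x) & (x != 0)])
    return b_rel * finite_nonzero.min()
```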
Technical corrections
- Line 99: There is extra punctuation around the footnote marker in "outputs.3 ."
- Definition 1, line 136: The index 0 <= v <= V implies V+1 variables. I assume this should be 0 <= v < V, or an equivalent 1-based convention.
- Line 318: "This ensure" should be "This ensures".
- Line 413: "e.g Lindstrom (2017)" should read "e.g., Lindstrom (2017)".
- Figure 3 caption: The final averaging expression appears to drop the error-bound index. I suspect the intended notation is something like n_{c,b} = mean_v({n_{v,b,c}}).
- Figures 5-6: "Not chunked" sounds a little awkward; "unchunked" would be more idiomatic.
- Figures 3 and 6: Figure 3 would be easier to read with distinct marker symbols in addition to colors and line styles. Figure 6 already uses crosses and squares to distinguish chunked from unchunked runs, but compressor identity still relies heavily on color.
Citation: https://doi.org/10.5194/egusphere-2026-60-RC2
Model code and software
- ClimateBenchPress data-loader: Tim Reichelt and Juniper Tyree, https://doi.org/10.5281/zenodo.18015682
- ClimateBenchPress Compressors: Tim Reichelt and Juniper Tyree, https://doi.org/10.5281/zenodo.18152639
Review of “ClimateBenchPress (v1.0): A Benchmark for Lossy Compression of Climate Data” by Tim Reichelt et al.
General comments
This new paper presents ClimateBenchPress, an open-source benchmark designed to standardize the evaluation of lossy compression algorithms for climate and weather data, addressing the current lack of consistent comparison frameworks in the field. As climate datasets grow to petabyte scale, lossy compression becomes essential, yet existing studies differ widely in datasets, error tolerances, and evaluation metrics. ClimateBenchPress includes diverse model and observational datasets, defines error bounds derived from uncertainty estimates, and provides distortion and compression metrics to enable fair comparison across compressors.
Testing several state-of-the-art methods (e.g., SZ3, ZFP, SPERR, JPEG2000, and rounding-based approaches), the authors show that while some achieve very high compression ratios, they may violate error bounds or mishandle edge cases such as NaNs. Simpler methods, such as bit rounding combined with optimized lossless compression, offer competitive and often more robust performance. Overall, the results demonstrate that no single compressor dominates across all variables and metrics, highlighting key trade-offs and the need for a standardized benchmark.
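For readers unfamiliar with bit rounding: the idea is to round away trailing mantissa bits that carry no information beyond the tolerated error, so the resulting bit pattern compresses much better under a lossless backend. A minimal float32 sketch of the general technique (round-to-nearest, ties to even); this is not the authors' implementation, and it ignores NaN/Inf handling:

```python
import numpy as np

def bit_round(x, keep_bits):
    # Keep `keep_bits` of the 23 float32 mantissa bits (assumes 1 <= keep_bits <= 22),
    # rounding to nearest with ties to even; the zeroed trailing bits then
    # compress well with a lossless backend.
    u = np.asarray(x, dtype=np.float32).view(np.uint32).copy()
    drop = 23 - keep_bits
    mask = np.uint32((1 << drop) - 1)
    u += ((u >> drop) & np.uint32(1)) + (mask >> np.uint32(1))  # round to nearest, ties to even
    u &= ~mask                                                  # clear the dropped bits
    return u.view(np.float32)
```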
In my view, this is a scientifically sound study with robust and well-substantiated results. The benchmark is thoughtfully designed, the methodology transparent, and the evaluation framework rigorous and reproducible. The manuscript is well written, clearly structured, and accessible to both compression specialists and climate scientists. I recommend publication in GMD, subject to the minor comments below.
Specific comments
Lines 26–28: The example given can be justified conceptually but would benefit from clarification. Global mean temperature is statistically robust to small, spatially uncorrelated compression errors that may cancel upon averaging. In contrast, local wind power estimates depend on nonlinear relationships and fine-scale variability, making them potentially more sensitive to small local errors. Clarifying this reasoning would avoid confusion.
Lines 43–48: The term “variable characteristics” is vague. Please clarify which properties are meant (e.g., statistical distribution, intermittency/sparsity, smoothness vs. sharp gradients, spatial/temporal correlation scales, dynamic range, NaNs, extremes, etc.). A few examples would improve clarity.
Line 70: “Actionable insights” sounds generic. Consider specifying the practical guidance provided (e.g., recommendations for particular variable types or error tolerances).
Table 1: Cloud-related variables (e.g., liquid water content) are not included, although they represent a challenging case due to their 3-D structure, sharp gradients, and large near-zero regions. While sparsity and NaNs are partly represented by precipitation and SST, a brief comment on the exclusion of 3-D cloud condensates would be useful, possibly as a future extension.
Table 1: Since V is mostly 1 here, variables are effectively treated independently. While reasonable, it would help to state explicitly that multivariate compression is beyond the current scope. In practice, many variables are physically correlated (e.g., atmospheric chemistry tracers), and advanced methods may exploit such structure. A brief acknowledgment would strengthen the discussion.
Lines 79–88: The regridding discussion focuses on model output, but similar issues apply to satellite swath data provided in along-track/across-track coordinates. Regridding can alter statistical properties relevant for compression. A brief acknowledgment would clarify that restricting to regular grids simplifies comparability but does not reflect all real-world cases.
Line 101: “Linear packing” is mentioned in the context of the ERA5 data but not explained. A short definition or reference would be helpful.
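For concreteness, this is what I understand by linear packing in this context: GRIB-style scale-and-offset quantization onto fixed-width integers. The sketch below illustrates the general technique and is not necessarily exactly what the ERA5 archive does.

```python
import numpy as np

def linear_pack_16bit(x):
    # Scale-and-offset quantization onto unsigned 16-bit integers;
    # assumes finite, non-constant input for simplicity.
    offset = x.min()
    scale = (x.max() - offset) / (2**16 - 1)
    packed = np.round((x - offset) / scale).astype(np.uint16)
    return packed, scale, offset

def linear_unpack_16bit(packed, scale, offset):
    # Reconstruction; the quantization error is at most about scale / 2.
    return packed * scale + offset
```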
Lines 128–129: The manuscript states that the absence of standard error bounds is “partly” due to application dependence. “Largely” may be more appropriate, as tolerable error levels are typically driven by downstream applications.
Lines 175–178 and Table 2: For several variables, the gap between the low (100th percentile) and mid (99th percentile) bounds significantly exceeds the gap between the mid (99th) and high (95th) bounds. This suggests sensitivity of the low bound to extreme outliers or heavy-tailed spread distributions. Please comment on this sensitivity and whether slightly lower percentiles (e.g., 99.9%) were considered.
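As an illustration of the kind of sensitivity check I have in mind, here is a sketch under the assumption that the bounds are plain percentiles of the flattened ensemble-spread field; the manuscript's exact procedure may differ.

```python
import numpy as np

def spread_bounds(spread, percentiles=(100.0, 99.9, 99.0, 95.0)):
    # Percentile-derived error bounds from an ensemble-spread field; the
    # 100th percentile is the single largest spread value, so one extreme
    # outlier can move it far from the 99.9th or 99th percentile.
    return {p: float(np.nanpercentile(spread, p)) for p in percentiles}
```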
Lines 226–234: Instruction count is a useful reproducible metric, but wall-clock runtime remains highly relevant in practice. Parallelization (multi-threading, GPU support) and peak memory footprint can significantly affect scalability for large datasets. A brief discussion of these practical aspects would be valuable.
Figure 2: The scorecards are helpful. For example, SZ3 often achieves higher compression ratios but also larger error metrics and occasional bound violations. Although discussed later, more explicit guidance in the figure interpretation would help readers assess comparability across methods.
Figure 7: This figure shows that compressors can produce markedly different error distributions, even under identical nominal absolute bounds. While discussed, this reinforces that methods are not strictly comparable under a single bound alone. Future work could consider complementing the current protocol with additional tail metrics (e.g., p99/p99.9) or joint criteria on maximum and distributional error properties.
Lines 468–471: Benchmarking full high-resolution model outputs (terabyte scale) would be highly valuable. Such tests would better reflect modern data volumes and assess compressors under realistic storage, I/O, and scalability constraints, complementing the current laptop-scale setup with tests in HPC environments.
Technical corrections
Line 99: Remove extra “.”
Line 136: Index v runs from 0 to V, implying V+1 elements; is this intended?
Line 318: Rephrase as “This ensures …”
Figures 3 and 6: Adding distinct marker symbols alongside colors and linestyles would improve clarity.