Technical note: A Flexible Framework for Precision Truncation and Lossless Compression in WRF Simulations with Application over the United States

Wu, Shang; Wong, David C.; Wang, Jiandong; Jin, Yuzhi; Li, Junjun; Lu, Chunsong

doi:10.5194/egusphere-2025-4811

Preprints

https://doi.org/10.5194/egusphere-2025-4811

Preprints

29 Oct 2025

| 29 Oct 2025

Status: this preprint is open for discussion and under review for Atmospheric Chemistry and Physics (ACP).

Technical note: A Flexible Framework for Precision Truncation and Lossless Compression in WRF Simulations with Application over the United States

Shang Wu, David C. Wong, Jiandong Wang, Yuzhi Jin, Junjun Li, and Chunsong Lu

Abstract. As climate simulations generate increasingly large datasets, reducing storage demands without compromising scientific integrity has become a critical challenge. This study evaluates the effectiveness of precision truncation, applied prior to lossless compression, in balancing storage efficiency and fidelity within regional Weather Research and Forecasting (WRF) simulations over the United States. We examine input-only, output-only, and combined input–output truncation strategies across both routine meteorological variables and extreme precipitation indices. Results show that conventional atmospheric fields remain robust when outputs are truncated to 5 or 4 significant digits, keeping biases within acceptable limits. Wind speed is largely insensitive to truncation, temperature and humidity are more vulnerable under aggressive output truncation (3 significant digits). Precipitation shows mixed responses, with deviations dominated by input perturbations. Extreme precipitation indices display more complex sensitivities: percentile- and maximum-based indices are highly susceptible to nonlinear, regionally heterogeneous biases under input truncation, whereas frequency- and intensity-based indices respond more systematically to output truncation, with substantial distortions emerging at 3 digits. These findings demonstrate that truncation strategies cannot be applied uniformly but must be tailored to variable type and diagnostic. Within this study, output-only truncation emerges as the most reliable strategy, with 4 significant digits identified as a safe lower bound and 5 digits preferable when fidelity of extreme-event is critical. To implement this in practice, we introduce a flexible error-tolerance framework that applies a predefined threshold across all indices and adapts truncation levels by region and season, enabling substantial storage savings while safeguarding the integrity of climate diagnostics.

Received: 29 Sep 2025 – Discussion started: 29 Oct 2025

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Download & links

Preprint (PDF, 6041 KB)

Supplement (2478 KB)

Download & links

Shang Wu, David C. Wong, Jiandong Wang, Yuzhi Jin, Junjun Li, and Chunsong Lu

Status: open (extended)

Post a comment Subscribe to comment alert

RC1: 'Comment on egusphere-2025-4811', Anonymous Referee #1, 25 Nov 2025 reply

This manuscripts proposes to reduced storage size of weather and climate data without compromising scientific integrity, and investigates various precision truncation strategies (combined with lossless compression) with the data from the Weather Research and Forecasting (WRF) simulations. The authors choose 2016 for the WRF simulation period, with 4-D data assimilation. Results were compared with hourly 2m air temperature and humidity and 10m wind speed, and hourly precipitation.

Metrics of relative data compression are percentage of original data when further compressed using bzip2 or gzip.

Metrics on errors due to data compression consist of RMSE of the encoded values vs. reference values, Pearson Correlation R, and Normalized Mean Bias NMB. Additional metrics for assessing impacts on extreme precipitation include number of days exceeding the 95% or 99% percentile of wet days, the maximum 1-day or 5-day precipitation total, annual count of days with daily precipitation over 10mm, count and total precipitation in wet days over a year, simple daily intensity index derived from that.
The paper is generally well written. The results are encouraging but not new (see my point below on the literature review), and the authors are not providing final compression results for the optimal strategy; it is thus unclear why the reader should actually care about doing this extra work of data compression. The paper would stronly benefit from being improved for clarity.

The fundamental limitation of the paper is that it is not properly situated in the comprehensive literature of data truncation and data compression, beyond three references: Baker et al (2016), Poppick et al (2020, lossy), Walters and Wong (2023).

The following work extensively investigated truncation strategies:

M Klower, M Razinger, JJ Dominguez, PD D ¨ uben, TN Palmer, Compressing atmospheric ¨

data into its real information content. Nat. Comput. Sci. 1, 713–724 (2021).
Moreover several works have explored neural lossy compression:

L Huang, T Hoefler, Compressing multidimensional weather and climate data into neural

networks. ICLR (2023).

T Han, S Guo, W Xu, L Bai, , et al., Cra5: Extreme compression of era5 for portable global climate and weather research via an efficient variational transformer. arXiv preprint arXiv:2405.03376 (2024).

P Mirowski, D Warde-Farley, M Rosca, et al., Neural compression of atmospheric states. arXiv preprint arXiv:2407.11666 (2024).
How does this work differ from the conclusions in all these previous works - is it by using the compressed data as inputs to WRF simulation? This should be made explicit.

Several parts of the paper were unclear:
* The article would benefit from an illustration of what are the input and output variables for the Weather Research Forecasting models, and a schematic of how the data interact. Am I right that the truncation of input data to WRF has an impact on the output results coming from the WRF, and that this is the reason why, given the same output truncation, different input truncations can reduce the relative compression size of the outputs? Are input variables forcing variables, or are they also weather data? It is only on line 325 that we can infer that output variables are not recursively fed back into the model, since output-only truncation can happen after the model is run.
* The relationship between the 1622 stations and the surface data is unclear. Do the authors have access to simulated and data-assimilated dense surface data?
* What are the inputs and outputs to the WRF? Do the authors re-run the WRF at different input data truncation strategies?
* What is the N in the error formulas: is it the number of input/output datapoints in 2016? Or is it the number of discrete observation station measurements? How is the distribution of observation points compared to that of the data used in the WRF?
* It was unclear if 69% at 5 significant digits in the input data (WRF_5) meant that:

1) original data at full precision are further compressed using lossless gzip

2) 5-digit truncated input data are further compressed using lossless gzip

3) the ratio of the storage size 2) over 1) is computed.
* When the authors write that the baseline dataset has 837GB of input data, are these full precision data or data compressed using gzip / bzip2?
* A table summarising the effective storage space after various truncation strategies, as well as compression, would be very useful.
* The color scheme of Fig. 6 is confusing and does not correspond to previous figures.
* What is also missing is a visual showing one measurement (e.g., temperature at 2m) with the FORTRAN representation and corresponding truncations.

Reply

Citation: https://doi.org/10.5194/egusphere-2025-4811-RC1

Shang Wu, David C. Wong, Jiandong Wang, Yuzhi Jin, Junjun Li, and Chunsong Lu

Supplement

https://doi.org/10.5194/egusphere-2025-4811-supplement

Data sets

Dataset for the paper 'A Flexible Framework for Precision Truncation and Lossless Compression in WRF Simulations: Method and Application over the United States' Shang Wu https://doi.org/10.5281/zenodo.17139028

Model code and software

Truncation tool for the paper 'A Flexible Framework for Precision Truncation and Lossless Compression in WRF Simulations: Method and Application over the United States' David C. Wong https://doi.org/10.5281/zenodo.17156737

Shang Wu, David C. Wong, Jiandong Wang, Yuzhi Jin, Junjun Li, and Chunsong Lu

Viewed

Total article views: 236 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	Supplement	BibTeX	EndNote
160	56	20	236	33	21	23

HTML: 160
PDF: 56
XML: 20
Total: 236
Supplement: 33
BibTeX: 21
EndNote: 23

Views and downloads (calculated since 29 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	56	11	3	70
Nov 2025	72	31	11	114
Dec 2025	32	14	6	52

Cumulative views and downloads (calculated since 29 Oct 2025)

Month	HTML	PDF	XML	Total
Oct 2025	56	11	3	70
Nov 2025	72	31	11	114
Dec 2025	32	14	6	52

Viewed (geographical distribution)

Total article views: 231 (including HTML, PDF, and XML) Thereof 231 with geography defined and 0 with unknown origin.

Country	#	Views	%

Latest update: 31 Dec 2025

Short summary

Climate simulations create huge amounts of data that are difficult to store and share. In this study, we developed a simple method to reduce file sizes while keeping the scientific information accurate. By carefully shortening numbers before applying compression, we tested different settings on U.S. weather simulations and found ways to save space without losing key results. This approach helps scientists work more efficiently and supports better access to climate data for the wider community.


Total:	0
HTML:	0
PDF:	0
XML:	0