Technical note: A Flexible Framework for Precision Truncation and Lossless Compression in WRF Simulations with Application over the United States
Abstract. As climate simulations generate increasingly large datasets, reducing storage demands without compromising scientific integrity has become a critical challenge. This study evaluates the effectiveness of precision truncation, applied prior to lossless compression, in balancing storage efficiency and fidelity within regional Weather Research and Forecasting (WRF) simulations over the United States. We examine input-only, output-only, and combined input–output truncation strategies across both routine meteorological variables and extreme precipitation indices. Results show that conventional atmospheric fields remain robust when outputs are truncated to 5 or 4 significant digits, keeping biases within acceptable limits. Wind speed is largely insensitive to truncation, whereas temperature and humidity are more vulnerable under aggressive output truncation (3 significant digits). Precipitation shows mixed responses, with deviations dominated by input perturbations. Extreme precipitation indices display more complex sensitivities: percentile- and maximum-based indices are highly susceptible to nonlinear, regionally heterogeneous biases under input truncation, whereas frequency- and intensity-based indices respond more systematically to output truncation, with substantial distortions emerging at 3 digits. These findings demonstrate that truncation strategies cannot be applied uniformly but must be tailored to variable type and diagnostic. Within this study, output-only truncation emerges as the most reliable strategy, with 4 significant digits identified as a safe lower bound and 5 digits preferable when fidelity of extreme-event diagnostics is critical. To implement this in practice, we introduce a flexible error-tolerance framework that applies a predefined threshold across all indices and adapts truncation levels by region and season, enabling substantial storage savings while safeguarding the integrity of climate diagnostics.
This manuscript proposes to reduce the storage size of weather and climate data without compromising scientific integrity, and investigates various precision truncation strategies (combined with lossless compression) using data from Weather Research and Forecasting (WRF) simulations. The authors choose 2016 as the WRF simulation period, with four-dimensional data assimilation. Results were compared against hourly 2 m air temperature and humidity, 10 m wind speed, and hourly precipitation.
The metric of relative data compression is the percentage of the original data size remaining after further compression with bzip2 or gzip.
Metrics on errors due to data compression consist of the RMSE of the encoded values vs. reference values, the Pearson correlation R, and the normalized mean bias (NMB). Additional metrics for assessing impacts on extreme precipitation include the number of days exceeding the 95th or 99th percentile of wet days, the maximum 1-day or 5-day precipitation totals, the annual count of days with daily precipitation over 10 mm, the count of and total precipitation on wet days over a year, and the simple daily intensity index derived from these.
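For reproducibility, the error metrics could be stated explicitly in the manuscript; the following is a minimal sketch assuming the standard definitions of RMSE, Pearson R, and NMB (the authors should confirm their exact formulas, and the meaning of N, against the text):

```python
import numpy as np

def error_metrics(encoded, reference):
    """RMSE, Pearson R, and NMB between encoded and reference values.

    Standard definitions are assumed here; this is an illustrative
    sketch, not the manuscript's actual evaluation code.
    """
    encoded = np.asarray(encoded, dtype=float)
    reference = np.asarray(reference, dtype=float)
    rmse = np.sqrt(np.mean((encoded - reference) ** 2))  # root-mean-square error
    r = np.corrcoef(encoded, reference)[0, 1]            # Pearson correlation
    nmb = np.sum(encoded - reference) / np.sum(reference)  # normalized mean bias
    return rmse, r, nmb
```

Stating the formulas in this explicit form would also resolve the ambiguity about N raised below.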
The paper is generally well written. The results are encouraging but not new (see my point below on the literature review), and the authors do not provide final compression results for the optimal strategy; it is thus unclear why the reader should actually care about doing this extra work of data compression. The paper would strongly benefit from being improved for clarity.
The fundamental limitation of the paper is that it is not properly situated in the comprehensive literature on data truncation and data compression, beyond three references: Baker et al. (2016), Poppick et al. (2020, lossy), and Walters and Wong (2023).
The following work extensively investigated truncation strategies:
M. Klöwer, M. Razinger, J. J. Dominguez, P. D. Düben, T. N. Palmer, Compressing atmospheric data into its real information content. Nat. Comput. Sci. 1, 713–724 (2021).
Moreover several works have explored neural lossy compression:
L. Huang, T. Hoefler, Compressing multidimensional weather and climate data into neural networks. ICLR (2023).
T. Han, S. Guo, W. Xu, L. Bai, et al., CRA5: Extreme compression of ERA5 for portable global climate and weather research via an efficient variational transformer. arXiv preprint arXiv:2405.03376 (2024).
P. Mirowski, D. Warde-Farley, M. Rosca, et al., Neural compression of atmospheric states. arXiv preprint arXiv:2407.11666 (2024).
How does this work differ in its conclusions from all these previous works - is the novelty in using the compressed data as inputs to the WRF simulation? This should be made explicit.
Several parts of the paper were unclear:
* The article would benefit from an illustration of the input and output variables of the Weather Research and Forecasting (WRF) model, and a schematic of how the data interact. Am I right that truncation of the input data to WRF affects the output results of the WRF run, and that this is why, for a given output truncation, different input truncations can change the relative compression size of the outputs? Are the input variables forcing variables, or are they also weather data? It is only on line 325 that we can infer that output variables are not recursively fed back into the model, since output-only truncation can happen after the model is run.
* The relationship between the 1622 stations and the surface data is unclear. Do the authors have access to simulated and data-assimilated dense surface data?
* What are the inputs and outputs to the WRF? Do the authors re-run the WRF at different input data truncation strategies?
* What is the N in the error formulas: is it the number of input/output data points in 2016, or the number of discrete observation-station measurements? How does the spatial distribution of observation points compare to that of the data used in the WRF?
* It was unclear if 69% at 5 significant digits in the input data (WRF_5) meant that:
1) original data at full precision are further compressed using lossless gzip
2) 5-digit truncated input data are further compressed using lossless gzip
3) the ratio of the storage size 2) over 1) is computed.
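If interpretation (3)-over-(1) is what is meant, the metric could be stated unambiguously in code. The following minimal sketch (synthetic data, decimal significant-digit rounding, and the standard-library gzip as stand-ins; not the authors' actual pipeline) shows the computation I believe is intended:

```python
import gzip
import numpy as np

def truncate_sig(x, digits):
    """Round array values to a given number of significant decimal digits.

    Illustrative only; the manuscript's truncation routine may differ.
    """
    x = np.asarray(x, dtype=np.float32)
    exp = np.floor(np.log10(np.abs(np.where(x == 0, 1, x))))  # decimal exponent
    factor = 10.0 ** (digits - 1 - exp)
    return (np.round(x * factor) / factor).astype(np.float32)

def relative_size(data, digits):
    """Gzipped size of truncated data as a fraction of gzipped full-precision data."""
    full = len(gzip.compress(np.asarray(data, dtype=np.float32).tobytes()))
    trunc = len(gzip.compress(truncate_sig(data, digits).tobytes()))
    return trunc / full

# Hypothetical 2 m temperatures (K) standing in for WRF input fields
rng = np.random.default_rng(0)
temps = 280 + 15 * rng.random(100_000).astype(np.float32)
print(f"WRF_5-style relative size: {relative_size(temps, 5):.0%}")
```

Spelling out the numerator and denominator in this way (truncated-then-compressed size over full-precision-compressed size) would remove the ambiguity between the three readings above.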
* When the authors write that the baseline dataset has 837 GB of input data, are these full-precision data or data compressed using gzip / bzip2?
* A table summarising the effective storage space after various truncation strategies, as well as compression, would be very useful.
* The color scheme of Fig. 6 is confusing and does not correspond to previous figures.
* Also missing is a visual showing one measurement (e.g., 2 m temperature) with its FORTRAN representation and the corresponding truncations.
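For concreteness, such a visual could be produced along these lines (a minimal Python sketch with a hypothetical 2 m temperature value; the manuscript's truncation routine may of course differ):

```python
import struct

t2m = 293.17834  # hypothetical 2 m air temperature (K), stored as a 32-bit float

# Full-precision FORTRAN REAL(4) / IEEE-754 single-precision bit pattern
bits = struct.unpack(">I", struct.pack(">f", t2m))[0]
print(f"full precision : {t2m!r}  (IEEE-754 bits: {bits:032b})")

# The same value truncated to 5, 4, and 3 significant digits
for d in (5, 4, 3):
    print(f"{d} sig. digits : {float(f'{t2m:.{d - 1}e}')}")
```

A figure built from such an example (bit pattern plus the three truncated values) would make the truncation levels immediately tangible to the reader.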