The effect of lossy compression of numerical weather prediction data on data analysis: a case study using enstools-compression 2023.11

Tintó Prims, Oriol; Redl, Robert; Rautenhaus, Marc; Selz, Tobias; Matsunobu, Takumi; Modali, Kameswar Rao; Craig, George

doi:https://doi.org/10.5194/egusphere-2024-753

25 Apr 2024

Abstract. The increasing amount of data in meteorological science requires effective data reduction methods. Our study demonstrates the use of advanced scientific lossy compression techniques to significantly reduce the size of these large datasets, achieving reductions ranging from 5x to over 150x, while ensuring data integrity is maintained. A key aspect of our work is the development of the 'enstools-compression' Python library. This user-friendly tool simplifies the application of lossy compression for Earth scientists and is integrated into the commonly used NetCDF file format workflows in atmospheric sciences. Based on the HDF5 compression filter architecture, 'enstools-compression' is easily used in Python scripts or via command line, enhancing its accessibility for the scientific community. A series of examples, drawn from current atmospheric science research, shows how lossy compression can efficiently manage large meteorological datasets while maintaining a balance between reducing data size and preserving scientific accuracy. This work addresses the challenge of making lossy compression more accessible, marking a significant step forward in efficient data handling in Earth sciences.

Received: 13 Mar 2024 – Discussion started: 25 Apr 2024

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Status: final response (author comments only)

RC1: 'Comment on egusphere-2024-753', Anonymous Referee #1, 28 May 2024

In this work, the authors developed the lossy compression tools 'enstools-encoding' and 'enstools-compression'. The authors outline that the manuscript and tools enable scientists to compress existing datasets and to create new compressed datasets. The tool and manuscript also introduce a bisection method that automatically chooses an optimal lossy compression method and/or its parameters based on given metrics. The authors further evaluate the lossy compression using forecast kinetic energy spectra, fraction skill scores, and 3D visualisation of derived variables. The manuscript presents a practical tool for applying lossy compression algorithms. The evaluations of the lossy compression algorithms are stated clearly and are useful for practical applications. Therefore, I recommend publication of this manuscript once the following comments are addressed.
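
The bisection idea summarised above can be illustrated with a minimal sketch. It is not the enstools-compression implementation: the toy `quantize` "compressor", the `pearson` helper, and the `find_error_bound` function are hypothetical names introduced here, and the search assumes that quality degrades roughly monotonically as the error bound grows. The sketch looks for the loosest absolute error bound whose reconstruction still meets a Pearson correlation threshold.

```python
import math


def quantize(values, abs_error):
    """Toy lossy 'compressor' (assumption): snap values to a grid of spacing 2*abs_error."""
    step = 2.0 * abs_error
    return [round(v / step) * step for v in values]


def pearson(a, b):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def find_error_bound(values, min_correlation, lo=1e-8, hi=1.0, iterations=40):
    """Bisect on log10(error bound) for the loosest bound that keeps the metric above threshold."""
    if pearson(values, quantize(values, lo)) < min_correlation:
        raise ValueError("even the tightest error bound misses the quality threshold")
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    for _ in range(iterations):
        mid = 0.5 * (log_lo + log_hi)
        if pearson(values, quantize(values, 10.0 ** mid)) >= min_correlation:
            log_lo = mid  # quality is sufficient: try a looser bound
        else:
            log_hi = mid  # quality is too low: tighten the bound
    # log_lo only ever holds bounds that passed the quality check.
    return 10.0 ** log_lo


# Usage: a synthetic signal standing in for a model field.
values = [math.sin(0.1 * i) for i in range(200)]
bound = find_error_bound(values, min_correlation=0.999)
print(f"selected absolute error bound: {bound:.3e}")
```

Searching in log space keeps the number of iterations small when the admissible bounds span many orders of magnitude. A real compressor would replace `quantize`, and the achieved compression ratio would be measured for the selected bound.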

Major comments:

1. The proposed compression tool is intended to reduce the size of, presumably, large datasets. One concern is therefore its efficiency. As a perhaps extreme example, when I tried it on a 13 GB global ocean model output, the tool was very slow on my personal laptop. It would be instructive to indicate: 1) whether the compression algorithms are pure Python or rely on C/Fortran libraries; 2) whether the parallel capability of xarray can be used to speed up the processing; and 3) some general comments on the efficiency of the tool.

2. In Figures 1 and 2, the axis labels and legend labels are not shown. In Figure 3, the colourbar has neither labels nor values, and the subfigure labels a), b) and c) used in the caption are missing. These issues make some results difficult to interpret.

Minor comments:

1. L95, page 4, it says '...this work aims to ensure that scientists can seamlessly utilize the compressed data. Essentially, the intent...'. Does the first sentence have the same meaning as the second sentence starting from 'Essentially'? Otherwise, I feel that the meaning of "seamlessly utilize the compressed data" is unclear.

2. L136, page 5, would it be better to use 'ranging from x_0 to x_1 (x_1 > x_0)' than specific values?

3. In Section 2.2, the authors introduce the use of CSF. It would be good to refer to Appendix A, or to briefly explain where CSF is used, i.e. in Python function call arguments and command line arguments, so that readers do not lose the context of these specifications or how to use them.

4. In Section 2.4, "enstools-encoding" is introduced without explicitly distinguishing it from "enstools-compression", which is introduced at the start of Section 2. In L218, page 7, the authors state that "Additionally, we provide a command line interface...", which gives me the impression that "enstools-compression" can only be used as a command line interface, but the documentation of enstools-compression seems to suggest that it has a Python API as well.

5. In L254, page 9, it is not clear what is "see 2" in the bracket.

6. In L272, it might be useful to highlight the "coordinate directions" in Figure 3.

7. In the code availability section, although the text is correct and I am able to access the resources, the clickable links to both the GitHub repository (an additional bracket in the link) and the Zenodo entry (only linked to https://doi) are broken. Additionally, it is unclear whether "enstools-encoding" is part of the manuscript and should be included in the code availability section.

Citation: https://doi.org/10.5194/egusphere-2024-753-RC1

RC2: 'Comment on egusphere-2024-753', Anonymous Referee #2, 21 Jun 2024

-----------------------

General comments:

-----------------------

This manuscript presents a tool to facilitate the use of lossy compression with NetCDF files (via the enstools-compression Python library). The tool can straightforwardly apply compressors that are accessible via HDF5 filters. Further, the authors include a method of specifying quality metrics such that the tool chooses the optimal compressor that meets those metrics. A second contribution is a set of examples from atmospheric science datasets illustrating the effects of lossy compression.
1) The introduction section was quite well written and comprehensive.
2) There appears to be an issue with a number of the figures. For Figures 1-3, it appears as if the axes labels have been cut off and even missing labels. Figure 6 has a number of problems as well. I will list specifics below.
3) Please be clear on the version of SZ being used, as versions 1, 2, and 3 are quite different. In many places, the text or figure just says "sz". By the end of the paper, I was thinking that sz meant SZ2, but it should be more clear (more specifics below).
4) This paper presents a practical tool that is useful. The hdf5 filters are quite challenging to use for compressing NetCDF4 files, which motivated the development of this tool. The paper emphasizes that they are trying to make it easier for users, which is great. I do think they could emphasize even more that it is quite difficult, for instance, to use nccopy with the "right" parameters to customize the lossy compression desired, so their translation via this tool is quite nice.
5) In section 3, application examples were presented. While these were interesting, I am not sure that I learned anything new about the potential impacts of lossy compression on model data.
6) If a lossy compressor has a registered HDF5 plugin filter, does it also have to have an hdf5plugin interface to be useable by enstools-compression? If so, what is required for that? (Couldn't quite tell from Section 2.4.)
7) I experimented with the software quite a bit. I did have one issue. It seems that the *.nc files that I created with the enstools-compression could only be viewed with Python tools (e.g., enstools, netCDF4 python, xarray). Trying to view content with non-python tools like ncdump or h5dump gave me errors. Maybe I did something wrong, but if not, then this incompatibility is limiting from my perspective as nco tools (and other non-python tools), for example, are quite popular still. More generally, I would not want to force folks to use python to read a compressed file.
-----------------------

Specific comments:

-----------------------

1) Line 61: The compressor is FPZIP (not FZIP).
2) Lines 59-62: Two other well known lossy compressors that I think would be worth noting are MGARD (https://github.com/CODARcode/MGARD) and SPERR ( https://github.com/NCAR/SPERR)
3) Page 4, first paragraph: Might want to reference the libpressio software (https://github.com/robertu94/libpressio) as it has similar goals to make compressors easier to use/access.
4) Lines 122-123: The statement about ZFP precision mode is not exactly accurate. The precision specified refers to the number of bit planes encoded for the transform coefficients, which does not directly translate to the number of bits in the original data (like with bit grooming approaches). See https://zfp.readthedocs.io/en/release1.0.1/faq.html for a good explanation.
5) Line 127: In this paragraph, I'd be clear on which versions of SZ that you use. The differences between SZ1, SZ2, and SZ3 are quite notable.
6) Section 2.3: This paper could be of interest for a data SSIM (DSSIM): "On a structural similarity index approach for floating-point data" (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10319365)
7) Line 215: The hdf5plugin library *does* include SZ/SZ3 (text says that it doesn't)
8) Figure 1: no axes labels, no legend labels, axes tic labels appear to be cut off
9) Line 220: Can you write to a netCDF4 file? Please clarify.
10) line 253: Here I would probably refer to the DSSIM metric noted above as that is what Baker et al appear to be using now. Is the enstools software actually generating images to compute the ssim metrics?
11) line 254: What is "(see 2)" referring to?
12) Figure 2: Again, no axes labels, no legend labels, axes tic labels appear to be cut off
13) Figure 3: Only the middle plot has a title, there are no axes labels, and the vertical axes tics are all zero (cut-off?)
14) Figure 5 caption: The choice of 0.78% seems oddly specific. What is the reason for that?
15) Figure 6: Several things:
-For plot (b), the caption says "SSIM of 0.89" and in the title it is "0.86", also says CR of 12 in the caption and subplot title says CR:13.4
-For plot (c), I don't understand how with a fixed rate mode compression of 2.2, one ends up with a 13.4 CR ...
-For (c) and (d), Why were these modes and compressor chosen? Seems arbitrary ....
16) Line 351: Issues with compression of derived variables were also earlier discussed in detail in Baker 2016.
17) Figure 7: Does sz mean SZ2 or SZ1? (The Zhao, 2020 paper that introduces SZ3 explicitly states that SZ2 has a major drawback with its predictor.) I am wondering why not just use SZ3?
18) Figure 7: I think this figure is very much like a figure in the DSSIM paper mentioned above, so it would be good to look at that paper.
19) Line 410: Blocking artefacts in zfp can be mitigated by using the -DZFP_ROUNDING_MODE=ZFP_ROUND_FIRST option when compiling (https://zfp.readthedocs.io/en/release1.0.1/installation.html#c.ZFP_ROUNDING_MODE).
20) Line 428: Here it looks like SZ 2 and 3 are the two versions being used, so it would be helpful to clarify this sooner in the paper. Though I don't know what the motivation for using SZ2 is, seems like just using SZ3 would be fine.
21) Software comments:
-I personally think it would be easier to just specify the SSIM that you want instead of having to convert it with "-log10(1-ssim)"
-Why only include the ssim and the Pearson's correlation? I see a way to add custom metrics, but I would think the software should minimally include RMSE, PSNR, a max norm, and possibly K-S test.
-I was unable to successfully "Use Xarray to store a compressed netCDF" as described here: https://enstools-compression.readthedocs.io/en/latest/examples/examples_api/compress_dataset_without_enstools_write.html
-See the general comment above about how the use of python tools seems required for reading the data
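
To make the software comments above concrete, the following sketch shows the "-log10(1-ssim)" conversion and its inverse, together with three of the simple error metrics suggested (RMSE, PSNR, max norm). It is an illustration only, not code from enstools-compression: all function names are hypothetical, and the PSNR uses one common convention (value range of the original data divided by the RMSE), which is an assumption here.

```python
import math


def ssim_to_nines(ssim):
    """Convert an SSIM value below 1 to -log10(1 - ssim), i.e. the 'number of nines'."""
    if ssim >= 1.0:
        raise ValueError("ssim must be strictly below 1")
    return -math.log10(1.0 - ssim)


def nines_to_ssim(nines):
    """Inverse conversion: 2 -> 0.99, 3 -> 0.999, and so on."""
    return 1.0 - 10.0 ** (-nines)


def rmse(original, decoded):
    """Root-mean-square error between two equally long sequences."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, decoded)) / n)


def max_abs_error(original, decoded):
    """Max norm of the pointwise error."""
    return max(abs(a - b) for a, b in zip(original, decoded))


def psnr(original, decoded):
    """Peak signal-to-noise ratio in dB, using the value range of the original (assumed convention)."""
    err = rmse(original, decoded)
    if err == 0.0:
        return math.inf
    value_range = max(original) - min(original)
    return 20.0 * math.log10(value_range / err)


# Usage with a tiny hand-made example.
original = [0.0, 1.0, 2.0, 3.0]
decoded = [0.0, 1.1, 1.9, 3.0]
print(ssim_to_nines(0.99), nines_to_ssim(3.0))
print(rmse(original, decoded), max_abs_error(original, decoded), psnr(original, decoded))
```

The conversion spreads SSIM values that cluster near 1 over a wider scale, which is presumably why the tool asks for it, while accepting a plain SSIM would only require calling the inverse function internally.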
-----------------------

Typos, minor:

-----------------------

1) Several times in the paper, the opening quote is in the wrong direction (e.g., twice in the abstract).
2) Figure 5 caption: "0,78%" => "0.78%"
3) Line 371: This sentence is quite awkwardly written

Citation: https://doi.org/10.5194/egusphere-2024-753-RC2

Model code and software

enstools-compression. Oriol Tintó and Robert Redl. https://doi.org/10.5281/zenodo.10998676

Interactive computing environment

The effect of lossy compression of numerical weather prediction data on data analysis: software to reproduce figures using enstools-compression. Oriol Tintó Prims. https://doi.org/10.5281/zenodo.10998604

Viewed

Total article views: 337 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
231	87	19	337	13	10

Views and downloads (calculated since 25 Apr 2024)

Month	HTML	PDF	XML	Total
Apr 2024	92	31	7	130
May 2024	80	26	4	110
Jun 2024	38	19	6	63
Jul 2024	21	11	2	34

Latest update: 26 Jul 2024

Short summary

Advanced compression techniques can drastically reduce the size of meteorological datasets (by 5x to 150x) without compromising the data's scientific value. We developed a user-friendly tool called 'enstools-compression' that makes this compression simple for Earth scientists. This tool works seamlessly with common weather and climate data formats. Our work shows that lossy compression can significantly improve how researchers store and analyze large meteorological datasets.

