the Creative Commons Attribution 4.0 License.
A Comparison of Lossless Compression Algorithms for Altimeter Data
Abstract. Satellite data transmission rates are usually limited to between hundreds of kilobits per second (kb/s) and several megabits per second (Mb/s), while the space-to-ground data volume keeps growing as instrument resolution increases. The Surface Water and Ocean Topography (SWOT) altimetry mission is a partnership between the National Aeronautics and Space Administration (NASA) and the Centre National des Études Spatiales (CNES). It relies on the innovative KaRIn instrument, a Ka-band (35.75 GHz) synthetic aperture radar combined with an interferometer. Its launch is expected in 2022 for oceanographic and hydrological level measurements, and it will generate 7 terabytes per day, for a lifetime total of 20 petabytes. Data compression therefore needs to be implemented at both ends of the satellite link. This study compares the compression results obtained with 672 algorithms, mostly based on the Huffman coding approach, which constitute the state of the art for scientific data manipulation, including Computational Fluid Dynamics (CFD). We also incorporate data preprocessing steps such as shuffle and bitshuffle, and a novel algorithm named SL6.
Status: closed
-
CEC1: 'Comment on egusphere-2022-1094', Juan Antonio Añel, 13 Jan 2023
Dear authors,

Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy" on several levels. Indeed, it should never have been published in Discussions before solving the issues listed below. https://www.geoscientific-model-development.net/policies/code_and_data_policy.html

First, the little code that you have shared is archived on GitHub. However, GitHub is not a suitable repository: GitHub itself instructs authors to use other alternatives for long-term archival and publishing, such as Zenodo. Therefore, please publish your code in one of the appropriate repositories and reply to this comment with the relevant information (link and DOI) as soon as possible, as it should be available for the Discussions stage. Also, please include the relevant primary input/output data. You must then include in a potentially revised version of your manuscript a modified 'Code and Data Availability' section with the DOI of the code (and another DOI for the dataset if necessary).

Also, the GitHub repository does not contain a license. If you do not include a license, despite what you state, the code is not "open-source/libre"; it remains your property. Therefore, when uploading the model's code to Zenodo, you should choose a free software/open-source (FLOSS) license. We recommend the GPLv3: you only need to include the file https://www.gnu.org/licenses/gpl-3.0.txt as LICENSE.txt with your code. You can also choose other options that Zenodo provides: GPLv2, Apache License, MIT License, etc.

Also, we cannot accept that it is necessary to contact the authors or request permission to get access to code or data. Both kinds of assets must be published in a permanent repository, without the ability of the authors to remove them, and this must be done before submitting the manuscript.

Accordingly, you must reply to this comment with the link to the repository used in your manuscript, with its DOI. The reply and the repository must be available well in advance of the close of the Discussions stage (indeed, they should already be available), so that anyone has access to them for review purposes.

Please be aware that failing to comply promptly with this request will result in desk rejection of your manuscript for publication.

Juan A. Añel
Geosci. Model Dev. Exec. Editor

Citation: https://doi.org/10.5194/egusphere-2022-1094-CEC1
-
AC1: 'Reply on CEC1', Mathieu Thevenin, 26 Jan 2023
Dear Juan,
Thank you for your comment.
We have carefully read the conditions regarding the code used in the writing of articles. Of course, we can provide most of the code on a viable repository.
However, the SL6 code is under a license which does not allow open-source distribution.
Unfortunately, we cannot provide all the code allowing full reproduction of our study, since some codes and tools are not open source.

We understand the importance of validating our work as well as possible. However, as you know, not all experiments are necessarily reproducible, for reasons of equipment, skills or even time; and I imagine that you do not limit your review to that alone. Our question is therefore: is it possible to derogate from this rule for legitimate reasons?

Thank you for your reply and your interest in our work.
Mathieu THEVENIN

Citation: https://doi.org/10.5194/egusphere-2022-1094-AC1
-
RC1: 'Comment on egusphere-2022-1094', Anonymous Referee #1, 10 Mar 2023
Please see my comments in the file attached.
-
CC1: 'Reply on RC1', Stephane Pigoury, 02 Apr 2023
Thank you for taking the time to review our study and for your comments.
To respond to you on data-related matters: I emphasize that the data was not selected to favor SL6. Indeed, as specified, our study aims to extend the study "Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files" (https://gmd.copernicus.org/articles/12/4099/2019/), and uses the same data. Note also that the data in that publication is likewise only available on request, and that this posed no reproducibility problem for that study.

Citation: https://doi.org/10.5194/egusphere-2022-1094-CC1
-
RC2: 'Comment on egusphere-2022-1094', Anonymous Referee #2, 15 Mar 2023
This article compares the patented SL6 algorithm of the author(s) with existing Huffman-based lossless compression algorithms, and stresses the homogeneous results and constant compression time of the proposed SL6, properties which seem to be demanded by satellite missions. The structure is sound, and the methods and results sound promising. However, I still have some concerns regarding the contents.
1. Since the SL6 is patented and shows desirable properties, and might be suitable for space missions and satellite communications, the reason why it excels should be highlighted, which might be helpful for scientific research communities.
2. Compression algorithms are actively studied, and there are public competitions such as those held at CVPR. Though the onboard SL6 algorithm is purpose-built, I still wonder how it would compare with CNN/GAN-based compression.
3. The Abstract is loosely organized and should be improved.
4. The citations should be correctly and consistently formatted.

Citation: https://doi.org/10.5194/egusphere-2022-1094-RC2
-
CC2: 'Reply on RC2', Stephane Pigoury, 02 Apr 2023
Thank you for your interest in our study.
The reason we have not detailed the inner workings of the SL6 algorithm is that this is not a study dedicated to the operation of SL6 technology, but an analysis of the state of the art in lossless data compression. It seemed to us more coherent and more interesting to complete the previous study by introducing a new metric allowing better analysis.

Citation: https://doi.org/10.5194/egusphere-2022-1094-CC2
-
AC3: 'Reply on RC2', Mathieu Thevenin, 11 Apr 2023
The reviewer addresses an interesting point.
Since the aim of the study was to compare against the previous study cited in the introduction, we focused only on the compression algorithms that were previously considered for the SWOT mission. However, adding more compression algorithms would be very interesting; it could be the subject of another article or a conference paper.

Thanks
Citation: https://doi.org/10.5194/egusphere-2022-1094-AC3
-
RC3: 'Comment on egusphere-2022-1094', H. Xu, 05 Apr 2023
I thank the authors for presenting this comparison of compression algorithms to address the limited bandwidth of data transmission from satellite to ground. There are several good points presented to us:
- 1. The H-score metric measures compression ratio and compression throughput with a single value.
- 2. The SL6 compressor is tested.
However, there are several concerns that need to be addressed as well.
- 1. The time spent on each variable is very small and the standard deviation is large, so the measurement of the compression/decompression time may not be reliable.
- 2. Since the compression time is so small, can the authors describe the time measurement tool they used?
- 3. The SL6 compressor is shown to have the best performance among all tested compressors, but the authors did not explain why. It is not chunk-based and does not use any Huffman or entropy encoder, so what makes it compress so fast? In my experience, the FPZIP compressor has a similar compression scheme to SL6, but FPZIP does not show the same compression performance as far as I recall. Is SL6 a lossy or lossless compressor?
- 4. There are several typos in the paper. In Table 5, the values of the last two columns have no space in between and are hard to understand. Figure 9 says the parts marked in red are the most interesting, but several others also show SL6 performing poorly.
- 5. In Table 3, most time data is around 0.24 to 1.7 seconds. Why do the authors display time data as such large numbers with the unit ns?
- 6. Can we know the average compression rate obtained with SL6 for the whole SWOT dataset instead of only some fields?

Citation: https://doi.org/10.5194/egusphere-2022-1094-RC3
-
AC2: 'Reply on RC3', Mathieu Thevenin, 11 Apr 2023
Dear reviewer,
We are grateful for your comments and will address them.
The time measurement is based on the facilities the Linux kernel exposes through time.h and the standard library. You are right that the ns unit could be replaced by microseconds, which would be easier to read and would not really affect accuracy. We chose to keep the same unit (ns, or perhaps µs in a revision) for consistency.
Thanks
Mathieu
Citation: https://doi.org/10.5194/egusphere-2022-1094-AC2
Viewed
- HTML: 499
- PDF: 488
- XML: 43
- Total: 1,030
- BibTeX: 25
- EndNote: 21