This work is distributed under the Creative Commons Attribution 4.0 License.
Compressing high-resolution data through latent representation encoding for downscaling large-scale AI weather forecast model
Abstract. The rapid advancement of artificial intelligence (AI) in weather research has been driven by the ability to learn from large, high-dimensional datasets. However, this progress also poses significant challenges, particularly the substantial cost of processing extensive data and the limits of computational resources. Inspired by the Neural Image Compression (NIC) task in computer vision, this study seeks to compress weather data to address these challenges and enhance the efficiency of downstream applications. Specifically, we propose a variational autoencoder (VAE) framework tailored for compressing high-resolution datasets, here the High Resolution China Meteorological Administration Land Data Assimilation System (HRCLDAS) with a spatial resolution of 1 km. Our framework reduced the storage size of 3 years of HRCLDAS data from 8.61 TB to 204 GB while preserving essential information. In addition, we demonstrated the utility of the compressed data through a downscaling task, where the model trained on the compressed dataset achieved accuracy comparable to that of the model trained on the original data. These results highlight the effectiveness and potential of the compressed data for future weather research.
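As a quick check on the storage figures quoted in the abstract, the sketch below computes the compression ratio they imply. The unit convention is an assumption: the abstract does not say whether TB and GB are decimal (1 TB = 1000 GB) or binary (1 TB = 1024 GB), so both are shown.

```python
# Storage figures quoted in the abstract.
ORIGINAL_TB = 8.61
COMPRESSED_GB = 204.0


def compression_ratio(original_tb: float, compressed_gb: float, gb_per_tb: int) -> float:
    """Return original size divided by compressed size, both expressed in GB."""
    return original_tb * gb_per_tb / compressed_gb


# Assumption: the unit convention is not stated in the abstract, so try both.
ratio_decimal = compression_ratio(ORIGINAL_TB, COMPRESSED_GB, 1000)
ratio_binary = compression_ratio(ORIGINAL_TB, COMPRESSED_GB, 1024)

print(f"decimal units: {ratio_decimal:.1f}x")  # 42.2x
print(f"binary units:  {ratio_binary:.1f}x")   # 43.2x
```

Under either convention the ratio is roughly 42 to 43, meaning the compressed data occupy a little over 2 % of the original volume.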
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2024-3183', Anonymous Referee #1, 15 Nov 2024
This paper introduces a VAE-based data compression method for high-resolution weather data and shows its potential application to training a downscaling model using the latent representation. The method is hardly novel, since both VAE and UNet are commonly used in related fields. The claimed 43x compression ratio comes purely from the downsampling CNN in the VAE. A neural image compression method would usually use vector quantization and/or entropy encoding in combination with a VAE. Interestingly, none of the neural image compression methods is used as a baseline for compression. In fact, there is no baseline in the compression part. The authors are advised to use at least one established compression method as a baseline (some can be found in this repo: https://interdigitalinc.github.io/CompressAI/).
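To illustrate the remark about quantization and entropy encoding, the following minimal sketch quantizes a latent array and estimates its empirical Shannon entropy, which approximates the bits per value an ideal entropy coder would need if symbols were coded independently. It is not the authors' method and does not use the CompressAI API; the Gaussian latents, the array shape, and the step size are all hypothetical stand-ins.

```python
import numpy as np


def quantize(latent: np.ndarray, step: float) -> np.ndarray:
    """Uniform scalar quantization: map each value to an integer bin index."""
    return np.round(latent / step).astype(np.int64)


def empirical_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy (bits per symbol) of the empirical symbol distribution."""
    _, counts = np.unique(symbols, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


# Hypothetical stand-in for VAE latents: Gaussian noise, not real HRCLDAS data.
rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 64, 64)).astype(np.float32)

symbols = quantize(latent, step=0.5)  # step size is an arbitrary choice
bits_per_symbol = empirical_entropy_bits(symbols)

# An ideal entropy coder would need about this many bits per latent value,
# compared with 32 bits for the unquantized float32 representation.
print(f"{bits_per_symbol:.2f} bits/symbol vs 32 bits/value for float32")
```

The gap between the entropy estimate and 32 bits per value is the additional saving that quantization plus entropy coding could offer on top of spatial downsampling, at the cost of quantization error that would have to be evaluated separately.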
Minor points
- The HRCLDAS data is not openly available, so the results cannot be reproduced.
- It would be nice to have a power spectrum plot for the compression part (like Fig. 6).
- The evaluation only considers t2m, u10 and v10. Including other variables, especially in Table 3, would be better.
- ERA5 should be much larger than 226 TB (as claimed). The pressure level data is at least 2 PB and the model level data is at least 5 PB.
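The power spectrum comparison suggested in the minor points could be computed along the following lines. This is a sketch under stated assumptions: the fields are synthetic, the "reconstruction" is a hypothetical 3x3 periodic box blur standing in for a decoded field, and the function name is illustrative rather than taken from the paper's code.

```python
import numpy as np


def radial_power_spectrum(field: np.ndarray) -> np.ndarray:
    """Radially averaged 2D power spectrum of a square field.

    Returns the mean spectral power in integer wavenumber bins 0 .. n//2.
    """
    n = field.shape[0]
    assert field.shape == (n, n), "sketch assumes a square field"
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    # Integer wavenumber along each axis, with zero at the array centre.
    k = np.arange(n) - n // 2
    kx, ky = np.meshgrid(k, k)
    k_bin = np.rint(np.hypot(kx, ky)).astype(int)
    keep = k_bin <= n // 2  # drop the sparsely sampled corner modes
    sums = np.bincount(k_bin[keep], weights=power[keep])
    counts = np.bincount(k_bin[keep])
    return sums / counts


rng = np.random.default_rng(0)
original = rng.normal(size=(64, 64))

# Hypothetical stand-in for a decoded field: a periodic 3x3 box blur, which
# damps small scales the way an over-smooth reconstruction would.
reconstructed = sum(
    np.roll(original, (dy, dx), axis=(0, 1))
    for dy in (-1, 0, 1)
    for dx in (-1, 0, 1)
) / 9.0

spec_orig = radial_power_spectrum(original)
spec_recon = radial_power_spectrum(reconstructed)
ratio = spec_recon / spec_orig

print("power retained at lowest / highest wavenumber:", ratio[1], ratio[-1])
```

Plotting `spec_orig` and `spec_recon` against wavenumber on log-log axes would show at which spatial scales the reconstruction loses variance; a ratio well below 1 at high wavenumbers indicates smoothing of fine-scale structure.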
Citation: https://doi.org/10.5194/egusphere-2024-3183-RC1
CEC1: 'Comment on egusphere-2024-3183 - No compliance with the policy of the journal', Juan Antonio Añel, 02 Dec 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
To assure the replicability of your submitted work, you must publish in a permanent repository all the data that you use to train your model and the output data obtained with it. In the case of your work, this includes the HRCLDAS and the FuXi-2.0 data.
I should note that, given this lack of compliance with our policy, your manuscript should not have been accepted for Discussions, and therefore the current situation with your manuscript is irregular. Please publish the requested data in one of the appropriate repositories and reply to this comment with the relevant information (a link and a permanent identifier for it, e.g. a DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy.
Also, you must include in a potentially reviewed manuscript the modified 'Code and Data Availability' section, with the DOIs of the new repositories.
I have to note that if you do not fix this problem as soon as possible, we will have to reject your manuscript for publication in our journal.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-3183-CEC1
AC1: 'Reply on CEC1', Bing Gong, 16 Dec 2024
Dear Editor,
Thank you so much for your feedback. We are now preparing our data and code for publication and will address this issue as soon as possible.
Citation: https://doi.org/10.5194/egusphere-2024-3183-AC1
AC2: 'Reply on CEC1', Bing Gong, 20 Dec 2024
Dear Editor,
Due to confidentiality agreements and data privacy concerns, we are unable to publish the full original dataset. However, we have provided a subset of processed data, in compliance with ethical and legal guidelines. The exact version of the code and the data samples associated with this paper are archived on Zenodo at https://doi.org/10.5281/zenodo.14537263 (Liu et al., 2024) under an MIT license (http://opensource.org/licenses/mit-license.php, last access: 20 Dec 2024). Further guidelines to run the code, train the models, and generate the results presented in this paper are provided in the README.md file of the code repository.
We will add the above to our revised 'Code and Data Availability' section.
Thank you so much.
Best,
Bing
Citation: https://doi.org/10.5194/egusphere-2024-3183-AC2
CEC2: 'Reply on AC2', Juan Antonio Añel, 22 Dec 2024
Dear authors,
Thanks for your answer. Unfortunately, we cannot simply accept your word that "Due to confidentiality agreements and data privacy concerns, we are unable to publish the full original dataset". If somebody needs or wants to replicate your study, they need to get access to such data. Therefore, at minimum, we need you to provide information on a way to get access to the input data (even if signing a license is necessary), with instructions about where the data is deposited and, if necessary, contact information. On top of that, we need you to provide us with evidence (documentation, a law, a regulation, etc.) that proves that you are not allowed to share the data and that it is not your own decision.
Regarding the output data, we need to know why you are sharing only a part of it and not the full dataset. Also, if there is an acceptable reason for not sharing it all, we need to know what percentage of the total output the shared data represents.
Therefore, we are giving you some margin to comply with our requirements, but I have to be clear that if you do not provide as soon as possible the requested datasets, or instructions on how to get access to them, and documentation and evidence that you are not allowed to share them, we will have to reject your manuscript.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-3183-CEC2
AC3: 'Reply on CEC2', Bing Gong, 28 Dec 2024
Dear Editor,
Thank you for your response and guidance.
Unfortunately, after negotiating with our data provider, we are not permitted to share any legal documents or agreements related to the HRCLDAS data we obtained in this research. However, we can provide the website where the data is hosted, along with contact information for those interested in obtaining access to the HRCLDAS data. The data can be accessed (if the provider agrees to share it) or purchased from them.
You can find the data at the following website: https://data.cma.cn/data/detail/dataCode/NAFP_CLDAS2.0_NRT.html (unfortunately, there is no English version yet).
There is also a paper regarding the dataset, which can be found at http://www.cmalibrary.cn/amst/2018/201801/yjlw2/yjjz/201804/t20180419_102633.htm
Here is the contact information for one of the authors of this dataset: Shi Chunxiang (cshi@mail.iap.ac.cn; shicx@nsmc.cma.gov.cn)
Regarding the output data, as mentioned earlier, the HRCLDAS dataset was used as the ground truth for training our model. Unfortunately, we are not permitted to share the full dataset publicly.
I am unsure if the approach outlined above meets your requirements. If you find this acceptable, we can include this information in a potential revised version of the paper.
Please let us know.
Thank you so much.
Best,
Bing
Citation: https://doi.org/10.5194/egusphere-2024-3183-AC3
CEC3: 'Reply on AC3', Juan Antonio Añel, 28 Dec 2024
Dear authors,
We need access to the data, at least for editorial and review purposes. Therefore, please provide the Topical Editor and me with a copy of it. I have tried to download it from the webpage you have linked, and when I try to access the dataset I get a "502 Bad Gateway" error. I guess this could be because I am trying to access it from an IP address that is not allowed. Therefore, if you provide a copy of the data for review purposes (via email), we can study an exception to our policy and allow you to continue with the Discussions stage and peer review. Otherwise, we will have to reject your manuscript for publication.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-3183-CEC3
AC4: 'Reply on CEC3', Bing Gong, 28 Dec 2024
Dear Editor,
Thank you for all of your coordination and support throughout the submission process.
After consideration, we have decided to withdraw this manuscript submission to GMD.
We greatly appreciate your efforts and understanding.
Best regards,
Bing
Citation: https://doi.org/10.5194/egusphere-2024-3183-AC4
Viewed
- HTML: 321
- PDF: 95
- XML: 21
- Total: 437
- BibTeX: 6
- EndNote: 5