This work is distributed under the Creative Commons Attribution 4.0 License.
Architectural Insights and Training Methodology Optimization of Pangu-Weather
Abstract. Data-driven medium-range weather forecasts have recently outperformed classical numerical weather prediction models, with Pangu-Weather (PGW) being the first breakthrough model to achieve this. The Transformer-based PGW introduced novel architectural components, including the three-dimensional attention mechanism (3D-Transformer) in the Transformer blocks and an Earth-specific positional bias term that accounts for weather states being related to the absolute position on Earth. However, the effectiveness of the different architectural components is not yet well understood. Here, we reproduce the 24-hour forecast model of PGW based on subsampled 6-hourly data. We then present an ablation study of PGW to better understand its sensitivity to the model architecture and training procedure. We find that using a two-dimensional attention mechanism (2D-Transformer) yields a model that is more robust during training, converges faster, and produces better forecasts than the 3D-Transformer. The 2D-Transformer reduces the overall computational requirements by 20–30 %. Further, the Earth-specific positional bias term can be replaced with a relative bias, reducing the model size by nearly 40 %. A sensitivity study comparing the convergence of the PGW model and the 2D-Transformer model shows large batch-size effects; the 2D-Transformer model, however, is more robust to such effects. Lastly, we propose a new training procedure that increases the speed of convergence of the 2D-Transformer model by 30 % without any further hyperparameter tuning.
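For orientation, the following is a minimal sketch, not the authors' code, of how the two attention variants and the two positional-bias terms differ; the window sizes, tensor layout, and module interface are illustrative assumptions:

```python
# Minimal sketch (illustrative assumptions, not the PGW implementation):
# windowed multi-head self-attention with either a shared relative
# positional bias or an Earth-specific (per-window) bias table.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim, heads, window_tokens, n_windows=None):
        # window_tokens: tokens attended to jointly within one window, e.g.
        #   3D-Transformer: Z * H * W tokens (pressure levels inside the window)
        #   2D-Transformer: H * W tokens (attention acts within each level)
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        if n_windows is None:
            # Relative bias: one table shared by all windows (a full N x N
            # table here for brevity; Swin-style code indexes a smaller
            # table of relative offsets).
            self.bias = nn.Parameter(torch.zeros(heads, window_tokens, window_tokens))
        else:
            # Earth-specific bias: a separate table for every absolute window
            # position on the sphere -> roughly n_windows times more parameters.
            self.bias = nn.Parameter(torch.zeros(n_windows, heads, window_tokens, window_tokens))

    def forward(self, x, win_idx=None):
        # x: (batch * n_windows, window_tokens, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * (D // self.heads) ** -0.5
        bias = self.bias if win_idx is None else self.bias[win_idx]
        attn = (attn + bias).softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, D))

# e.g. 3D attention over (2 levels x 6 lat x 12 lon) windows vs. 2D over (6 x 12):
attn3d = WindowAttention(dim=192, heads=6, window_tokens=2 * 6 * 12)
attn2d = WindowAttention(dim=192, heads=6, window_tokens=6 * 12)
x = torch.randn(4, 6 * 12, 192)
print(attn2d(x).shape)  # torch.Size([4, 72, 192])
```

Because the Earth-specific variant stores one bias table per absolute window position, replacing it with a single shared relative table is what removes a large fraction of the parameters.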
Notice on discussion status: The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
CEC1: 'Comment on egusphere-2024-1714', Juan Antonio Añel, 07 Jul 2024
Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".
https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
The policy of our journal establishes that all the code and data necessary to reproduce a manuscript must be published in a permanent repository at submission time. You must also include in the "Code and Data Availability" section the information (e.g. DOI and link) for it. However, you have not done so. In your manuscript you have included a link to a Zenodo repository that does not contain the requested information. Indeed, your Zenodo repository seems to contain a set of scripts for the new 2D implementation of Pangu-Weather. However, these scripts point to local paths (e.g. "/hkfs/work/workspace/scratch/ke4365-pangu/pangu-weather") that have nothing to do with the Zenodo repository. Also, the Pangu-Weather code is linked to a GitHub repository (something that you do in the text of the manuscript too). However, GitHub is not a suitable repository for scientific publication. GitHub itself instructs authors to use other alternatives for long-term archival and publishing, such as Zenodo. Therefore, you must publish the Pangu-Weather code and all the code necessary (currently linked to local paths) in one of the appropriate repositories, and reply to this comment with the relevant information (link and DOI) as soon as possible, as we cannot accept manuscripts in Discussions that do not comply with our policy. The current situation with your manuscript is therefore irregular.
A similar issue occurs with the data. In the text you mention that you use several variables from WeatherBench2, and you provide scripts to download the data. We cannot accept this. The data necessary to reproduce your manuscript must be stored in the permanent repository too, and you have to reply to this comment with the relevant information.
Therefore, you must address and solve these issues, publishing the requested information. Otherwise we will have to reject your manuscript for publication in our journal.
Additionally, you have labelled your manuscript type as a "Model experiment description paper". This does not seem right. According to the submission types of our journal your manuscript should be a "Development and technical paper". The Handling Topical Editor and the office can change this for you. However, this means that in the title of the manuscript you must include a version number. This could mean that you need to use a modified name (e.g. Pangu-Weather 2D v1.0) for your model in a potentially reviewed version of your manuscript.
Juan A. Añel
Geosci. Model Dev. Executive Editor
Citation: https://doi.org/10.5194/egusphere-2024-1714-CEC1
AC1: 'Reply on CEC1', Deifilia To, 25 Jul 2024
Dear Dr. Añel,
Thank you for bringing the issue with the data to our attention. Following your suggestion, we have archived a standalone version of our code, along with approx. 40 GB of sample data, on Zenodo: doi.org/10.5281/zenodo.11400879. In the AI community, it is customary to provide links to GitHub repositories and a specific hash for the published version, such that readers can find future updates to the code. However, based on your comment, we have removed this link from the manuscript and substituted it with the Zenodo reference. The new version of the code is fully runnable without further modification and does not point to any external references. If users wish to download their own copy of the ERA5 data, they can change the data paths to their own repositories. We apologize for the mistake in the previously submitted version, which still contained leftover path links that slipped our attention during code submission preparation.
With respect to providing all data relevant for rerunning the experiments, we are unfortunately unable to provide a permanent archive of the entire dataset ourselves due to the immense size of the ERA5 training data, amounting to a total of 71 TB. However, we have included a small subset of the data used for training into the above described Zenodo repository, such that the code can be run without having to download further data.
In our revision, we will add the following statement to the Code and Data Availability section to cite the ERA5 dataset and to guide the reader to the original and publicly available data archives:
"The raw ERA5 climate reanalysis data (https:// doi.org/ 10.24381/ cds.adbb2d47; Hersbach et al., 2023) underlying this study are publicly available at https:// doi.org/10.24381/cds.adbb2d47 and https://doi.org/10.24381/cds.bd0915c6. The data were downloaded from the Weather Bench 2 API, which is a cloud-based benchmarking platform from Google that provides preprocessed data archives of the ERA5 database: https:// doi.org/ 10.48550/ arXiv.2308.15560. Our download script can be found archived in the Zenodo repository under data_download/download_era5.py. The code to replicate all experiments can be found under doi.org/10.5281/zenodo.11400879."
Furthermore, a Jupyter notebook to reproduce figures in the manuscript is also found in the Zenodo directory, under Paper_plots.ipynb.
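As a pointer for readers who want to fetch the data themselves, a minimal sketch of such a download against the public WeatherBench 2 cloud archive; the exact zarr store path and variable names below are assumptions on our part, not taken from the archived script, so please check the WeatherBench 2 documentation:

```python
# Minimal sketch, assuming the public WeatherBench 2 zarr archive layout and
# variable names (requires xarray, zarr, and gcsfs). The store path below is
# an assumption, not taken from the manuscript or the archived script.
import xarray as xr

WB2_ERA5 = "gs://weatherbench2/datasets/era5/1959-2023_01_10-wb13-6h-1440x721.zarr"

# Open the cloud-hosted zarr store lazily; nothing is downloaded yet.
ds = xr.open_zarr(WB2_ERA5)

# Select a few upper-air and surface variables and a single year,
# in the spirit of data_download/download_era5.py.
subset = ds[["geopotential", "temperature", "10m_u_component_of_wind"]].sel(
    time=slice("2017-01-01", "2017-12-31")
)
subset.to_netcdf("era5_2017_subset.nc")  # triggers the actual download
```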
With regards to the submission type, we classified the manuscript as a “Model Experiment Description Paper” because it replicates and experiments with variations of the Pangu-Weather model by Bi et al. However, upon considering your feedback, we apologize for this misclassification and are happy to change the submission type to a “Development and Technical Paper”. In our potentially reviewed version of the manuscript, the revised title would be “Architectural Insights and Training Methodology Optimization of Pangu-Weather (v1.0)”.
We hope this addresses your concerns and are happy to make any further changes that are required to meet the standards of the journal.
Best regards,
Deifilia To and co-authors
Citation: https://doi.org/10.5194/egusphere-2024-1714-AC1
RC1: 'Comment on egusphere-2024-1714', Tobias Weigel, 17 Jul 2024
The study presented in this article is of critical relevance to the informed future development and review of the abundance of models emerging within the domain. In addition to performing an ablation study to critically analyze several key design decisions of the original PGW model, the authors contribute notable improvements to the architecture and training procedures that overall make model training significantly more computationally efficient while maintaining comparable quality. Overall, this is a worthy effort not only to understand and analyze PGW, but it may also inspire similar work on other models in the domain.
There are some minor inconsistencies between what is written in analysis and procedures in the text and the corresponding plots. This does not invalidate the main conclusions of the paper per se, but needs to be checked.
That the apparently intuitive benefit of taking the vertical dimension into account with the 3D transformer appears not to be essential is a noteworthy and surprising finding. The speculations given in the discussion section about this call for further analysis and care with future models.
I feel that the most concerning shortcoming of the work is indeed the missing comparison of a (indeed costly) non-subsampled version with the original PGW, as also indicated in the discussion section; I agree it should probably not invalidate the benefits of the optimizations done, though some doubt remains. Still, the findings remain indicative even without this (still better RMSE than IFS).

Detailed comments:
- p. 2: new training procedure 30% faster - compared to 2D or original 3D?
- p. 3: what was the number of compute nodes? were local SSDs used in some form? if not, is mentioning them relevant to comparable studies (I believe such hybrid setups would be very peculiar to use)?
- p. 5, fig. 1b: These plots do not show perfect sin/cos functions; they are skewed. Reading the explanation in 2.6, I don't understand why. I believe this comes from tweaking them to match the original weight sums, but then I'm missing an explanation for this particular tweak (a sketch of such a renormalized schedule follows this comment). Also, they are not in phase as explained (l. 134); e.g., the maxima of U/V are slightly shifted (epoch 200 in (a), epoch ~185 in (b)).
- p. 8, l. 172: If I read figure 6 correctly, PGW-Lite failed to converge for sizes of 16, 32, and 480. For 64, it converged (hard to read the figure here but I believe there's a dashed red line just behind the solid red/orange line). This is in contradiction to what is written in the text.
- p. 8: The point of the minibatch study appears to be to 1.) analyze how minibatch sizes affect attainable loss/convergence and 2.) make a direct comparison on this between the 2D and 3D transformer approaches. For me as reader it would have been good to point this out here (it only became clear when reading discussion and conclusion) because it affects how one reads the text and plot.
- p. 8: Would different random seeds have a significant effect on convergence/attainable loss?
- in general, while zooming helps, the colour scheme (use of yellow) and size of figures 6 and 7 make them hard to read, particularly given that many lines relevant to the discussion overlap.

Citation: https://doi.org/10.5194/egusphere-2024-1714-RC1
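For concreteness, a minimal sketch of the kind of renormalized sinusoidal weight schedule discussed in the comment on fig. 1b; all variable names, amplitudes, phases, and the period are invented for illustration and are not the authors' actual schedule:

```python
# Minimal sketch, not the authors' schedule: illustrates how per-epoch
# renormalization of phase-shifted sinusoidal loss weights skews the
# plotted curves away from pure sin/cos shapes and shifts their maxima.
import numpy as np

static = {"T": 1.0, "U": 0.9, "V": 0.9, "MSLP": 1.5}  # placeholder base weights
total = sum(static.values())

def epoch_weights(epoch, period=200):
    phase = 2 * np.pi * epoch / period
    raw = {
        "T": 1 + 0.5 * np.cos(phase),
        "U": 1 + 0.5 * np.sin(phase),
        "V": 1 + 0.5 * np.sin(phase),
        "MSLP": 1 + 0.5 * np.cos(phase + np.pi),
    }
    s = sum(raw.values())
    # Rescaling so the weights sum to the original total each epoch is the
    # step that distorts the individual curves away from pure sinusoids.
    return {k: v * total / s for k, v in raw.items()}

for epoch in (0, 50, 100):
    print(epoch, epoch_weights(epoch))
```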
AC2: 'Reply on RC1', Deifilia To, 08 Aug 2024
Dear Dr. Weigel,
Thank you for your thoughtful consideration of our manuscript. We have taken the comments into consideration and present an improved manuscript. We agree that the comparison to a full training run of the original Pangu-Weather model would be very interesting and provide the ultimate proof of our findings. At the moment, due to the cost associated with training a full Pangu-Weather model in its originally published configuration, we have refrained from training a full replication of the Pangu-Weather model. This is particularly true given that the results of this study highlight the uncertainty in the effectiveness of the 3D-Transformer. However, if you feel that this is important to maintain the validity of our study, we can consider investing such resources into a replication experiment.
Regarding your detailed comments, you will find a point-by-point response to all of the concerns and questions raised in the attached PDF.
Best regards,
Deifilia To and Co-authors
RC5: 'Reply on AC2', Tobias Weigel, 12 Aug 2024
Dear Deifilia To and co-authors,
Thank you for this detailed reply. I fully understand your point concerning the additional HPC costs and the cost/benefit ratio - as you also remarked in the article - and I do not believe this impacts the results of your study sufficiently to prevent publication. The points raised are valid even with the smaller model.
I see my comments wholly addressed and from my point of view, the article is ready for publication.
Best, Tobias
Citation: https://doi.org/10.5194/egusphere-2024-1714-RC5
AC5: 'Reply on RC5', Deifilia To, 20 Sep 2024
Dear Dr. Weigel,
We are grateful for the time and effort you spent on reassessing the manuscript. Thank you.
Sincerely,
Deifilia To & Co-authors
Citation: https://doi.org/10.5194/egusphere-2024-1714-AC5
RC2: 'Comment on egusphere-2024-1714', Anonymous Referee #2, 18 Jul 2024
In the manuscript entitled "Architectural Insights and Training Methodology Optimization of Pangu-Weather", the authors present a two-dimensional attention mechanism (2D-Transformer) and replace the Earth-specific positional bias term, which accounts for weather states being related to the absolute position on Earth, with a relative bias. The 2D-Transformer performs more effectively, reducing computational requirements by 20-30%, decreasing the model size by nearly 40%, and significantly increasing the robustness of the Pangu-Weather model's convergence. The ablation study determined a new training process to accelerate the convergence of the 2D-Transformer model without any further hyperparameter tuning.
General comments:
Figure 3 is an important chart supporting the validity of this research. However, it does not include specific humidity, Z500, or V10, as mentioned earlier in this paper. Including these variables could strengthen the argument for the effectiveness of the 2D-Transformer. Due to the 6-hour subsample, readers ultimately do not know if the 2D-Transformer has improved the forecast accuracy of the original Pangu-Weather model. Including such comparisons could significantly increase the citation rate of this paper. There are still certain changes and clarifications that the authors should address prior to publication. For these reasons, I believe that the manuscript can be accepted for publication. Below, I have some specific comments to the authors.
Specific comments:
- Line #2 - #5, the sentence is too long and difficult to read. It can be revised to “The Transformer-based PGW introduced novel architectural components, including the three-dimensional attention mechanism (3D-Transformer) in the Transformer blocks. Additionally, it features an Earth-specific positional bias term that accounts for weather states being related to the absolute position on Earth.”
- Line #24, “the authors also admit” could be replaced with more specific wording, such as “previous studies have shown”. The same issue appears in Line #91, where the architecture described “by the authors” could be replaced with “in this study.” This sentence reads as if the ablation study is original to this paper and not derived from the model itself. If this is the case, some references could be cited here as evidence to support the experiment design.
- Line #27, the published model cannot be run; what is the reason? Is it also caused by the modularization issue? Does it conflict with the reproduction introduced in Section 2.3?
- Figures 1, 4, 5, and 7 could be appropriately enlarged. Some images are difficult to discern even when enlarged. For Figure 4(a), the authors could separate the lines by adjusting the y-coordinates.
- Figure 1, the cosine functions for each variable with weights from Bi et al. (2023) could be listed to explain the normalized process at each epoch. In Figure 1(b), it would be helpful to present the equation for the sloped MSP graph to facilitate understanding.
- Figure 6: Could the authors explain the reason for the failure to converge of the PGW-Lite structure with minibatch sizes of 32, 64, and 480? PGW-Lite with minibatch size 720 eventually converged. Could the authors explain this unexpected result? Part of the reason is explained in Line #229. It is not necessary to strictly separate the results and discussion sections; explaining part of the findings in the results section can enhance the content.
- Line #255, the sentence could be updated to “since wind vectors, acting as pressure gradients, can drive certain atmospheric processes, such as advection terms in atmospheric variables.”
Following are suggestions and do not affect the validity of the argument in this paper.
- Line #98: the models could be compared in more detail in Table 1, e.g. the hidden dimension, etc.
Technical corrections:
- Figure 5: the y axis could be updated to "normalized RMSE"

Citation: https://doi.org/10.5194/egusphere-2024-1714-RC2
AC3: 'Reply on RC2', Deifilia To, 08 Aug 2024
We thank the reviewer for the time they took to read and consider this work. They provided many valuable suggestions to improve the quality of our submission. We address specific comments in the following response.
Best regards,
Deifilia To and Co-authors
RC4: 'Reply on AC3', Anonymous Referee #2, 09 Aug 2024
The authors have diligently addressed all the comments and revisions I provided during the previous review. Consequently, I believe the manuscript is now ready for publication.
Citation: https://doi.org/10.5194/egusphere-2024-1714-RC4
AC6: 'Reply on RC4', Deifilia To, 20 Sep 2024
We are grateful for the time and effort you spent on reassessing the manuscript. Thank you.
Sincerely,
Deifilia To & Co-authors
Citation: https://doi.org/10.5194/egusphere-2024-1714-AC6
RC3: 'Comment on egusphere-2024-1714', Anonymous Referee #3, 23 Jul 2024
The study "Architectural Insights and Training Methodology Optimization of Pangu-Weather" contains an ablation study for a modification of the Pangu-Weather model named PanguLite. Furthermore, it presents insights on training strategies for PanguLite and a 2D variant of Pangu that performed best in the ablation study.
The paper is very well written and gives valued insight into the architecture and performance of the models described. Currently, research on ML models in numerical weather prediction is highly successful, and probably soon, AI methods will be used in the operational routines of national weather services. This work contributes to examining the performance and behavior of such models, making their design and training more effective and reliable. Additionally, the authors explain the function of individual components, thus bridging the gap to classical modeling, where a system understanding is a crucial aspect of mathematical-physical modeling.
I recommend the study for publication with minor modifications addressing the following questions and remarks.
- The model names used in the tables and in the text are not consistent. Table 1 and Table 2 might be merged with unique model names.
- Visualisation of the modification in model architecture: I find the modularised code in the repository very helpful. Could the code be included as pseudo code in the paper giving a comparative overview of the architectural details of the different models? This would be helpful in connection with the graphics in the original Pangu Publication (Fig. 2).
- Parameter numbers in Table 1: The number of parameters for the 2D attention model is larger than for the 3D attention (PanguLite) in Table 1. This is counterintuitive. Is it due to the fact that the hidden dimension C was enlarged? What was the reasoning behind that choice? Could it be chosen such that the overall parameter size would match that of PanguLite again? Is this dimension C the same for PanguLite and Pangu? Could the authors extrapolate the parameter numbers in Table 1 for the original Pangu model with the original batch size?
- Parameter numbers in Table 3 (relating to Remark 3): In paragraph 2.5 the authors state that reducing the model size allows for larger local batches. Hence, PanguLite should have larger local batches than the 2d version. In Table 3 it is the other way round. Could the authors please clarify this?
- Fig. 1: As the curves for U and V are indistinguishable, one colour for both curves would render the figure clearer. Furthermore, plot a) and b) should display temperature and wind in the same colour.
- Figure 7: What does it mean that the reference model is PanguLite? The text implies that the two curves show both the 2d model with different training losses.
Citation: https://doi.org/10.5194/egusphere-2024-1714-RC3
AC4: 'Reply on RC3', Deifilia To, 08 Aug 2024
The authors thank the reviewer for their detailed consideration of our work. We appreciate the time spent and thorough feedback provided that will improve the quality of our manuscript. Below, you will find a point-by-point response to the concerns raised by the reviewer.
Best regards,
Deifilia To and Co-authors
Viewed

| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 585 | 161 | 103 | 849 | 30 | 14 | 13 |
Deifilia Aurora To
Julian Quinting
Gholam Ali Hoshyaripour
Markus Götz
Achim Streit
Charlotte Debus