the Creative Commons Attribution 4.0 License.
Deep Learning Approach Towards Precipitation Nowcasting: Evaluating Regional Extrapolation Capabilities
Abstract. Precipitation nowcasting refers to the prediction of precipitation intensity in a local region within a short time frame of up to 6 hours. The evaluation of spatial and temporal information still challenges today's numerical weather prediction models. The increasing possibilities to store and evaluate data, combined with advances in artificial intelligence algorithms, make it natural to use these methods to improve precipitation nowcasting. In this work a Convolutional Long Short-Term Memory network (ConvLSTM) is applied to Radar data of the German Weather Service. The effectiveness of finetuning a network pretrained at a different location is demonstrated for different precipitation intensity thresholds. Furthermore, in two case studies the skill scores for the different thresholds are shown for prediction times up to 100 minutes. The results highlight promising regional extrapolation capabilities of such neural networks for precipitation nowcasting.
This preprint has been withdrawn.
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2022-440', Anonymous Referee #1, 13 Jul 2022
Artificial intelligence algorithms provide new technical methods for improving precipitation forecasts. The authors used finetuning to improve the prediction accuracy and generalization of TrajGRU and to reduce the cost of training.
The authors speed up the training of TrajGRU on the new dataset by pretraining. This is a transfer learning method, which has been extensively studied. In the manuscript, this study does not seem to improve the method.
I would like to share my specific comments and suggestions below.
1. Page 1 Line 6, “…(ConvLSTM) is applied to Radar data of the German…”. The experiment is based on TrajGRU. The description of this manuscript is inconsistent. Which model was used for the study?
2. There may be significant differences between multiple datasets. The authors may discuss whether this method leads to non-convergence of training.
3. Page 2 Line 56. What is the “notework”?
4. Page 2 Line 57, “…the Critical Success Index (CSI) and the Heidke Skill Score (HSS)…”. The evaluation is incomplete without also including the False Alarm Ratio = FP/(TP + FP) and the Probability of Detection = TP/(TP+FP) scores.
5. Page 4 Line 100. It is suggested to provide the number of pretrained iterations, which helps to more objectively compare the random initialization model and the pretrained model.
6. Page 7 Line 161. It is suggested to explain why the generalization of features performs better for heavy rainfall than non-heavy rainfall.
7. The results of experimental analysis have been verified in other papers. Did the author draw any other new conclusions?
Citation: https://doi.org/10.5194/egusphere-2022-440-RC1
AC1: 'Reply on RC1', Annette Rudolph, 14 Aug 2022
We would like to thank the anonymous referee #1 for their constructive comments and suggestions. We are in the process of revising the manuscript with the referee’s suggested changes. Point-by-point answers to the referee’s comments can be found further below.
- Page 1 Line 6, “…(ConvLSTM) is applied to Radar data of the German…”. The experiment is based on TrajGRU. The description of this manuscript is inconsistent. Which model was used for the study? This is indeed a mistake, it should say TrajGRU. This has been revised.
- There may be significant differences between multiple datasets. The authors may discuss whether this method leads to non-convergence of training. While we think that the features and the prediction problem are similar enough across datasets to avoid this, we will add some thoughts about datasets with significant property differences in the revised manuscript. For example, in a preliminary draft we also took a look at COSMO-REA2, which is a regional reanalysis dataset with a significantly lower temporal resolution. This didn’t cause non-convergence, but resulted in significantly worse scores across the board.
- Page 2 Line 56. What is the “notework”? This is a spelling error and has been revised.
- Page 2 Line 57, “…the Critical Success Index (CSI) and the Heidke Skill Score (HSS)…”. The evaluation is incomplete without also including the False Alarm Ratio = FP/(TP + FP) and the Probability of Detection = TP/(TP+FP) scores. The revised manuscript will contain the False Alarm Ratio and Probability of Detection skill scores for all experiments in the appendix.
- Page 4 Line 100. It is suggested to provide the number of pretrained iterations, which helps to more objectively compare the random initialization model and the pretrained model. The number of pretrained iterations will be provided in the revised manuscript.
- Page 7 Line 161. It is suggested to explain why the generalization of features performs better for heavy rainfall than non-heavy rainfall. This is explained on Page 7 Line 164 f. We have added a more detailed explanation in the revised manuscript: The HKO-7 data set is larger and has a significantly higher amount of heavy rainfall events compared to RADOLAN, while having a similar distribution of non-heavy rainfall events (cf. Table 1). Yosinski et al. (2014) show that transferring features (like having learned heavy rainfall events) between networks can improve generalization on data, even after finetuning the network.
- The results of experimental analysis have been verified in other papers. Did the author draw any other new conclusions? We believe the anonymous referee is asking if we came to new conclusions when compared to other finetuning studies. While other papers indeed already verified conclusions like improved generalization for smaller datasets, the main conclusion of this paper is that finetuning can be a helpful approach to enable regional extrapolation of neural networks for precipitation nowcasting.
Citation: https://doi.org/10.5194/egusphere-2022-440-AC1
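The four skill scores discussed in this exchange can all be derived from one contingency table. The following is a minimal sketch, not the authors' evaluation code: the function names and the sample values are hypothetical, and the HSS is written in its standard textbook form, which may differ by a constant factor from the form used in the manuscript. Note that the Probability of Detection is conventionally TP/(TP+FN); the expression TP/(TP+FP) quoted in the comment above equals 1 - FAR (the success ratio).

```python
from typing import Dict, Sequence, Tuple


def contingency(pred: Sequence[float], obs: Sequence[float], r: float) -> Tuple[int, int, int, int]:
    """Binarize prediction and observation at threshold r (value >= r counts as rain)
    and return the counts (TP, FP, FN, TN)."""
    tp = fp = fn = tn = 0
    for p, o in zip(pred, obs):
        p_rain, o_rain = p >= r, o >= r
        if p_rain and o_rain:
            tp += 1
        elif p_rain:
            fp += 1
        elif o_rain:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def skill_scores(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Return CSI, HSS, FAR and POD; a score with a zero denominator is NaN."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return {
        "CSI": ratio(tp, tp + fp + fn),
        "HSS": ratio(2 * (tp * tn - fp * fn), hss_den),
        "FAR": ratio(fp, tp + fp),
        "POD": ratio(tp, tp + fn),
    }


# Hypothetical rain rates in mm/h for six pixels (illustration only).
predicted = [0.0, 0.6, 1.2, 0.2, 3.0, 0.7]
observed = [0.0, 0.8, 0.3, 0.9, 2.5, 0.1]
counts = contingency(predicted, observed, r=0.5)
scores = skill_scores(*counts)
```

In this toy example the hits and false alarms balance such that the HSS is zero while the CSI is positive, which illustrates why the two scores are correlated but not interchangeable.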
-
RC2: 'Comment on egusphere-2022-440', Anonymous Referee #2, 13 Jul 2022
-
AC2: 'Reply on RC2', Annette Rudolph, 14 Aug 2022
We would like to thank the anonymous referee #2 for their constructive comments and suggestions. We are in the process of revising the manuscript with the referee’s suggested changes. Point by point answers to the referee’s comments can be found further below.
- This study mainly demonstrates the effectiveness of transfer learning with TrajGRU, yet the literature review regarding the transfer learning is very limited and should be much improved with recent studies. A more extensive literature review has been added in the revised manuscript.
- In the overall manuscript, the explanation of the methodology used by the author, such as the model structure, is considered a little insufficient. Although it is specified that the author conducted the research based on the paper of Shi et al. (2017), it would be better to add a more detailed explanation of the research methodology. A more detailed model structure and methodology explanation has been added in the revised manuscript.
- In abstract Line 5, the authors clarified that “In this work a Convolutional Long Short-Term Memory network (ConvLSTM) is applied to Radar data of the German Weather Service.” Although they used TrajGRU in this study and mentioned the comparison between ConvLSTM and TrajGRU in Section 2.2, I don't understand why the authors said "ConvLSTM" instead of "TrajGRU". This is indeed a mistake, it should say TrajGRU. This has been revised.
- In section 2.2, the authors described the main formulas of TrajGRU, but some notations of the equations are missing (e.g., ∗, f). In addition, there seems to be a lack of explanation for the comparison between ConvLSTM and TrajGRU, especially for figure 1. Also, please add more information in the caption of figure 1 (e.g., what do the colored lines mean?). Missing notations and missing captions for figure 1 have been added in the revised manuscript.
- In Line 118, “Because of this we freeze the weights of the outermost TrajGRU layer of both encoder and forecaster for the finetuned model and only train the two innermost layers on the German RADOLAN data afterwards.”, please provide additional information about how you fine-tuned model (e.g., learning rate setting, etc.). Furthermore, I wonder if the authors experimented with directly using pre-trained parameters in the new model on RADOLAN data. Although they mentioned that “Other finetuning configurations were tested, such as freezing more layers or none at all, but displayed worse performance.” I suggest adding more detailed explanation of the other possible fine-tuning approaches used. A detailed experimental setup of the finetuned model (for example learning rate) can be found on Page 5 Line 110 to Line 114. A more detailed breakdown of the other finetuning configurations was added in the revised manuscript.
- The general purpose of transfer learning (i.e., fine-tuning) is to solve the problem of model underfitting due to the limited availability of model input data. The authors also explained this, and information on the amount of RADOLAN data used is given, but there is a lack of information about the amount of pre-trained HKO-7 data. Therefore, it would be better to discuss not only the distribution of data according to rainfall intensity but also the difference in the overall amount of data between RADOLAN and HKO-7 data. A more detailed discussion of the differences in overall amount between RADOLAN and HKO-7 has been added in the revised manuscript.
- Since the model performance fluctuates with increasing the number of iterations, and there is a possibility that overfitting problems should seriously affect the model performance, it is difficult to say that the fully trained model result (i.e., performed initial set 100,000 iterations) is the optimal result. So, I wonder if the author used any method other than full training to reach the optimal model state used to obtain the highest scores and the number of iterations it takes to reach it, as mentioned in Tables 2 and 3. If not, it is suggested to use a methodology such as early stopping method to obtain optimal model performance. The results in the paper are obtained using the model iteration that had the best average CSI and HSS scores. Some explanation about this process as well as general thoughts about more optimal methods like early stopping or weight decay have been added in the revised manuscript.
- For the results of case studies, in figure 6, it has not been fully explained why two different pictures of the input data are needed. Are there any implications for each of the two pictures? If not, it would be better to remove one of the two. I suggest from the perspective of comparing model results, the picture in the second column is better to remove. Also, why don’t you compare the “train from the scratch” model results with “finetuned” model results in case studies? It would be interesting to see the effect of fine-tuning through qualitative comparison with the “train from scratch” model. We thank the anonymous referee for the suggestion. We agree that the second sample is redundant and a direct comparison with the model trained from scratch would be more interesting, so we replaced the second case study with a comparison to the model trained from scratch in the revised manuscript.
Citation: https://doi.org/10.5194/egusphere-2022-440-AC2
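The reply above states that results come from the model iteration with the best average CSI and HSS, and mentions early stopping as an alternative. The difference between the two can be sketched as follows. This is an illustration under assumptions: the function names, the `patience` parameter and the validation history are hypothetical and not taken from the manuscript.

```python
from typing import List, Tuple

# One entry per validation pass: (iteration, mean CSI, mean HSS).
History = List[Tuple[int, float, float]]


def best_iteration(history: History) -> Tuple[int, float]:
    """Post-hoc selection: return the iteration with the highest (CSI + HSS) / 2
    over the full training run, together with that score."""
    scored = [(it, (csi + hss) / 2) for it, csi, hss in history]
    return max(scored, key=lambda item: item[1])


def early_stop_iteration(history: History, patience: int) -> int:
    """Early stopping: return the iteration at which training would halt, i.e. after
    `patience` consecutive validation passes without a new best score."""
    best = float("-inf")
    stale = 0
    for it, csi, hss in history:
        score = (csi + hss) / 2
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return it
    return history[-1][0]


# Hypothetical validation history (illustration only).
history = [
    (10000, 0.30, 0.40),
    (20000, 0.34, 0.44),
    (30000, 0.33, 0.43),
    (40000, 0.32, 0.45),
    (50000, 0.31, 0.41),
]
selected, selected_score = best_iteration(history)
stopped_at = early_stop_iteration(history, patience=2)
```

Both approaches pick the same checkpoint here, but early stopping ends the run at iteration 40000 instead of training to completion. Either way, the scores used for selection should come from the validation set rather than the test set.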
-
RC3: 'Comment on egusphere-2022-440', Anonymous Referee #3, 28 Jul 2022
Please find attached the document with my comments and the manuscript with the markups.
-
AC3: 'Reply on RC3', Annette Rudolph, 14 Aug 2022
We would like to thank the anonymous referee #3 for their extensive constructive comments and suggestions. We are in the process of revising the manuscript with the referee’s suggested changes. Point by point answers to the referee’s comments can be found further below.
Content comments:
- 2.1 (data). This section lacks some important information:
- Did you do any pre-processing on the images or is 480x480 pixels the original size of the image?
The RADOLAN product is a composite of 17 radar stations which cover a bigger area. We cut out a 480x480 km region over central Germany. This information has been added in the revised manuscript.
- Did you use reflectivity or precipitation? In case of precipitation, what was the Z-R relation used?
RADOLAN is a reflectivity product. This information is given on page 3, line 67.
- Is it a CAPPI or PPI? Please, inform the height or sweep elevation;
We will include the information in the text.
- The used period from 2017 to 2021 comprises only three years: from Apr/2017 to Mar/2021. Make it clear to the reader;
The amount of days is already given on page 3, line 72. To make it more clear, the amount of years was also added in the revised manuscript.
- Did you use the complete sequence or only selected rain events, as Shi et al. (2017)?
Only days with precipitation were selected from the RADOLAN dataset. This information has been added in the revised manuscript.
- Explain the selection of training, validation and testing sets: summer to test, winter to validate;
Roughly the first 10% of frames were chosen for testing, the last 5% of frames for validation.
- Instead of “German RADOLAN”, use “RADOLAN”.
This has been changed in the revised manuscript.
- Please, if possible, give more information: What is the weather radar type: band, polarization, Doppler? Where can the reader find this type of information?
All DWD weather radars are Doppler radars. This information has been added in the revised manuscript.
- Line 120: “Other finetuning configurations were tested” The authors should comment more on this;
A more detailed breakdown of the other finetuning configurations was added in the revised manuscript.
- Table 1:
- The first threshold includes rain rate = 0. What do you consider as no rain?
The first threshold includes all rain events with a rain rate R larger or equal to 0 and smaller than 0.5.
- The RADOLAN column sum more than 100%;
This is a mistake due to rounding and has been corrected in the revised manuscript.
- How was this table calculated, with the complete sequence of the dataset or with selected rainy cases? Is it the distribution of pixel values in the image set?
The table was calculated with the complete sequence of the dataset, which only contains rainy days. The percentages indeed refer to the pixel distribution. This information has been added in the revised manuscript.
- Shi et al. (2017) used selected rainfall events. Is the HKO-7 column considering only these events? (You do not need to repeat Shi et al.’s paper, but you should provide enough information for your reader to understand what you are talking about.)
The HKO-7 column only considers the selected rainy days. This information has been added in the revised manuscript.
- Line 138: The authors introduced binary values (0, 1) based on thresholds. They must inform the meaning of the values above and below the thresholds;
We believe the anonymous referee is asking to clarify how the binary values are assigned. If a pixel is above or equal to the currently selected threshold r, it gets converted to a 1, otherwise to a 0. This clarification has been added in the revised manuscript.
- Line 145: What “measurements” do you refer?
“Measurements” refers to the two errors in Table 2.
- Line 145: Briefly describe Welch’s t-test in sec. 3;
A brief description of the t-test has been added in the revised manuscript.
- 4.1: The model predicts 20 images, from 5 to 100 min lead time. For which forecast times are the shown results?
The shown results use the average score of all 20 prediction frames. This information has been added in the revised manuscript.
- Fig. 2 is equal as Fig. 3, the same pattern, but with different values. The authors should verify that it is correct. If correct, what is the gain of using such metrics, what does this prove? Why not use another metric to explore more information?
Both CSI and HSS measure how accurate a prediction is to the ground truth, so both measurements correlate with each other. There is value in using both scores, as the CSI measures how many rain events were predicted correctly, whereas the HSS measures if our predictions are better than if we had made a random prediction. We agree though on using additional metrics to explore more information and have added the False Alarm Ratio (FAR) and Probability of Detection (POD) skill scores to the revised manuscript.
- Section 4.1 (lines 157-163):
- The values are too small to draw conclusions. The authors forced a conclusion mainly with the expression “big increase” (lines 163, 225)
We think that an approximate 2% difference for such a low performing threshold is indeed a significant increase. However, we agree that the expression “big increase” can be misleading and removed the word “big” from lines 163 and 225 in the revised manuscript.
- What about the analysis of the result evolution with forecast time?
Additional statistical analysis regarding prediction time, similar to the qualitative analysis in section 4.2, has been added to the revised manuscript.
- I suggest including other metrics, such as FAR and POD, and some metric to assess the image quality, since you are using a computer vision method;
See our answer to point 8.
- Lines 165-167: “We compare the model output for the real truth in the frame of two case studies. Using the RADOLAN data set, we consider frontal systems at 4 May 2017, 14:40:00 UTC and at 12 May 2017, 07:40:00 UTC. These are two exemplary dates of clusters of mainly moderate precipitation crossing Germany, where the data is not part of the training data set of the model.” The authors should comment on this first of all in sec. 2.3, Experimental setup;
Comments on the case study setup have been added to Section 2.3 in the revised manuscript.
- 4.2:
- You compare your results with Ravuri et al. (2021) and Ayzel et al. (2020). How many examples did these studies use to compute their statistics? Because yours considers just one case;
In Line 195 we added the information: "Ravuri et al. (2021) consider a single case study (...)" and in Line 210 ff. we changed the text to Ayzel et al. (2020) (...). The authors select 11 events during the summer months of the verification period (2016–2017) and evaluate the models RainNet and Rainymotion for the intensity thresholds 0.125 mm/h, 5 mm/h and 15 mm/h for prediction time up to 60 min.
- Line 199: “the positive effect of finetuning is clearly visible for higher precipitation intensities.” You are referring to a small gain of just one score;
This line has been changed in the revised manuscript.
- What is the merit of the model used in your predictions? The other models have different architectures; you should take this into account in your analysis;
This is a good point and a discussion on differences in model architectures has been added to the revised manuscript.
- Why the case studies weren’t done for both “finetuned” and “scratch”? How will you evaluate the gain of one against the other?
The second sample in the case study has been replaced with a comparison of the first sample to the model trained from scratch in the revised manuscript.
- Do not miss your goal as this is the scientific question you must answer. You need to structure your analyses so you do not mix up the results;
Yes, we will take this into account in the revised manuscript.
- Figs. 2-5, 8: R = 0.5 or R > 0.5? As in Tab. 2. (The same for the other thresholds);
This has been clarified in the revised manuscript.
- Lines 209-210: “However, a statistical analysis would be wishful to confirm an improving CSI over the prediction time for higher thresholds as shown by the case study in Fig. 8 (a).” Why haven't you done it yet? You already have the model outputs. This must be included in the results;
See our answer to point 9b.
- Lines 210-211: “It can be recognized that the HSS, shown in Fig. 8 (b) and (d), provides higher scores than the CSI depicted in Fig. 8 (a) and (c).” Fig. 8, as Figs. 2 and 3, shows the same pattern for CSI and HSS, with different values. You should take a careful look at your results in case you missed something;
See our answer to point 8.
- Fig. 6:
- Why did you put reflectivity images on the 2nd row? What is the point you want to show?
As explained in the caption of Figure 6, the second row is the raw input data. It is used to show what the raw data looks like, compared to our colored versions used for the case studies.
- Is it prediction of rain or reflectivity? (See comment 11d.)
The black and white images represent the rain rate R in mm/h, converted from the raw dBZ RADOLAN data. Clarification and formulas used for this have been added in the revised manuscript.
- In the color legend, what does it mean when the rain field is gray? The color legend is incomplete;
We believe the anonymous referee is referring to the grayscale borders of Germany that were underlaid under the precipitation data to show the 480x480 km cutout of central Germany. We attempt to make this clearer in the revised manuscript.
- Which forecast times are included in Fig. 6? You should comment this in the caption and in the text;
The forecast times used in the case study are explained on page 8, line 170. Clarification to caption and text has been added in the revised manuscript.
- Where are the "scratch" images?
See our answer to point 11d.
- What does negative reflectivity mean? Why didn’t you filter the raw data?
We added more information to the data section (lines 66 ff.); the original data is already gauged: This data is a reflectivity composite of 17 radar stations in Germany combined with hourly values measured at the precipitation stations. In order to achieve optimized estimates of precipitation, the data on the ground is calibrated with ombrometers. This combination provides high definition data in both temporal and spatial resolution. For more information see Deutscher Wetterdienst (2022).
-
- Again, what is the predicted variable, rain or reflectivity?
See our answer to point 15b.
- What is the range of your data so I can understand the differences in the images?
Range information for the RADOLAN data has been added in the revised manuscript.
- Line 220: Here you say “similar”, but before you said “slightly better” (line 155);
This has been corrected in the revised manuscript.
- Lines 226-228: “Comparing the here obtained results with recent publications on deep learning algorithms to precipitation nowcasting based on radar data (Ayzel et al., 2020; Ravuri et al., 2021) the finetuned TrajGRU shows slightly higher scores with less decrease with prediction time.” See comments 10 and 11;
We added the following sentence in the conclusion: "We notice that Ayzel et al. (2020) evaluate 11 case studies and (Ravuri et al., 2021) consider one case study for their analysis."
- Lines 228-230: “While Chen et al. (2020); Ayzel et al. (2020) show that their models are suitable to predict precipitation up to 60 minutes (Chen et al., 2020; Ayzel et al., 2020), we achieve comparable scores for a 100 minute prediction time.” I couldn't find this analysis throughout the text;
We included the information of the prediction time in the case study section.
- Lines 230-231: “A statistical analysis of the prediction time for more than the here presented two case studies would be wishful for future research.” You have not presented any analysis regarding forecast time (line 209). What you showed here is still not enough for a publication. You already have the data and the outputs of the models, you need to explore further analysis;
See our answer to point 9b.
- Lines 232-233: “To further optimize the results in future studies, different finetuning methods and finetuning hyperparameters could be taken into account.” You should add some examples in the text;
Examples for methods for possible future studies have been added to the revised manuscript.
- Line 236: “finally” This is an ambitious comment regarding the precipitation nowcasting problem itself. When completely described, your presented solution can help in some ways, but in my opinion, based on my research and experience, I think one product is not enough to solve the complex problem of precipitation nowcasting. I suggest changing to “contribute positively” to be more realistic.
The suggested change has been added in the revised manuscript.
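Since the referee asks for a brief description of Welch's t-test, the statistic can be sketched as follows. It compares the means of two samples without assuming equal variances, using the Welch-Satterthwaite approximation for the degrees of freedom. This is a generic illustration: the function name and sample values are hypothetical, and it is not known from this discussion exactly which samples the authors compared.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means (sample variance / n).
    se_a = variance(a) / na
    se_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df


# Hypothetical error samples from two models (illustration only).
errors_model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
errors_model_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(errors_model_a, errors_model_b)
```

The p-value follows from the Student t distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.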
Text comments:
Since most of these points refer to grammar, typographical errors or phrasing, we won’t answer on a point by point basis. Most of the suggested changes in this section have been added to the revised manuscript. We thank the anonymous referee again for their thorough comments and suggestions.
Citation: https://doi.org/10.5194/egusphere-2022-440-AC3
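The data split described in the reply above (roughly the first 10% of frames for testing, the last 5% for validation) can be sketched as a chronological split. The reply only gives approximate fractions, so the function name, the rounding and the exact boundaries below are assumptions.

```python
from typing import Tuple


def split_frames(n_frames: int, test_frac: float = 0.10, val_frac: float = 0.05) -> Tuple[range, range, range]:
    """Split frame indices 0..n_frames-1 chronologically: the first block is the
    test set, the last block is the validation set, the remainder is training."""
    n_test = round(n_frames * test_frac)
    n_val = round(n_frames * val_frac)
    test = range(0, n_test)
    train = range(n_test, n_frames - n_val)
    val = range(n_frames - n_val, n_frames)
    return train, val, test


# Hypothetical dataset of 1000 frames (illustration only).
train_idx, val_idx, test_idx = split_frames(1000)
```

Keeping each subset contiguous in time avoids placing neighbouring frames of the same rain event in both the training and the evaluation sets, which a random shuffle would do.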
-
AC3: 'Reply on RC3', Annette Rudolph, 14 Aug 2022
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2022-440', Anonymous Referee #1, 13 Jul 2022
Artificial intelligence algorithm provides new technical methods for improving precipitation forecast. The authors used finetuning to improve the prediction accuracy, the generalization of TrajGRU, and reduced the cost of training.
The authors speed up the training of TrajGRU on the new dataset by pretraining. This is a transfer learning method, which has been extensively studied. In the manuscript, this study does not seem to improve the method.
I would like to share with my specific comments and suggestions below.
1. Page 1 Line 6, “…(ConvLSTM) is applied to Radar data of the German…”. The experiment is based on TrajGRU. The description of this manuscript is inconsistent. which model was used for the study?
2. There may be significant differences between multiple datasets.The authors may discuss whether this method leads to non-convergence of training.
3. Page 2 Line 56. What is the “notework”?
4. Page 2 Line 57, “…the Critical Success Index (CSI) and the Heidke Skill Score (HSS)…”. The evaluation is incomplete without also including the False Alarm Ratio = FP/(TP + FP) and the Probability of Detection = TP/(TP+FP) scores.
5. Page 4 Line 100. It is suggested to provide the number of pretrained iterations, which helps to more objectively compare the random initialization model and the pretrained model.
6. Page 7 Line 161. It is suggested to explain why the generalization of features performs better for heavy rainfall than non-heavy rainfall.
7. The results of experimental analysis have been verified in other papers. Did the author draw any other new conclusions?
Citation: https://doi.org/10.5194/egusphere-2022-440-RC1 -
AC1: 'Reply on RC1', Annette Rudolph, 14 Aug 2022
We would like to thank the anonymous referee #1 for their constructive comments and suggestions. We are in the process of revising the manuscript with the referee’s suggested changes. Point by point answers to the referee’s comments can be found further below.
- Page 1 Line 6, “…(ConvLSTM) is applied to Radar data of the German…”. The experiment is based on TrajGRU. The description of this manuscript is inconsistent. which model was used for the study? This is indeed a mistake, it should say TrajGRU. This has been revised.
- There may be significant differences between multiple datasets.The authors may discuss whether this method leads to non-convergence of training. While we think that the features and the prediction problem are similar enough across datasets to avoid this, we will add some thoughts about datasets with significant property differences in the revised manuscript. For example in a preliminary draft we also took a look at COSMO-REA2, which is a regional reanalysis dataset with a significantly lower temporal resolution. This didn’t cause non-convergence, but resulted in significantly worse scores across the board.
- Page 2 Line 56. What is the “notework”? This is a spelling error and has been revised.
- Page 2 Line 57, “…the Critical Success Index (CSI) and the Heidke Skill Score (HSS)…”. The evaluation is incomplete without also including the False Alarm Ratio = FP/(TP + FP) and the Probability of Detection = TP/(TP+FP) scores. The revised manuscript will contain the False Alarm Ratio and Probability of Detection skill scores for all experiments in the appendix.
- Page 4 Line 100. It is suggested to provide the number of pretrained iterations, which helps to more objectively compare the random initialization model and the pretrained model. The number of pretrained iterations will be provided in the revised manuscript.
- Page 7 Line 161. It is suggested to explain why the generalization of features performs better for heavy rainfall than non-heavy rainfall This is explained on Page 7 Line 164 f. We have added a more detailed explanation in the revised manuscript: The HKO-7 data set is larger and has a significantly higher amount of heavy rainfall events compared to RADOLAN, while having a similar distribution of non-heavy rainfall events (cf. Table 1). Yosinski et al. (2014) show that transferring features (like having learned heavy rainfall events) between networks can improve generalization on data, even after finetuning the network.
- The results of experimental analysis have been verified in other papers. Did the author draw any other new conclusions? We believe the anonymous referee is asking if we came to new conclusions when compared to other finetuning studies. While other papers indeed already verified conclusions like improved generalization for smaller datasets, the main conclusion of this paper is that finetuning can be a helpful approach to enable regional extrapolation of neural networks for precipitation nowcasting.
Citation: https://doi.org/10.5194/egusphere-2022-440-AC1
-
AC1: 'Reply on RC1', Annette Rudolph, 14 Aug 2022
-
RC2: 'Comment on egusphere-2022-440', Anonymous Referee #2, 13 Jul 2022
-
AC2: 'Reply on RC2', Annette Rudolph, 14 Aug 2022
We would like to thank the anonymous referee #2 for their constructive comments and suggestions. We are in the process of revising the manuscript with the referee’s suggested changes. Point by point answers to the referee’s comments can be found further below.
- This study mainly demonstrates the effectiveness of transfer learning with TrajGRU, yet the literature review regarding the transfer learning is very limited and should be much improved with recent studies.
A more extensive literature review has been added in the revised manuscript.
- In overall manuscript, the explanation of the methodology used by the author, such as model structure, is considered a little insufficient. Although it is specified that the author conducted the research based on the paper of Shi et al. (2017), it would be better to add some more detailed explanation of the research methodology.
A more detailed model structure and methodology explanation has been added in the revised manuscript.
- In abstract Line 5, the authors clarified that “In this work a Convolutional Long Short-Term Memory network (ConvLSTM) is applied to Radar data of the German Weather Service.” Although they used TrajGRU in this study and mentioned the comparison between ConvLSTM and TrajGRU in Section 2.2, I don't understand why the authors said "ConvLSTM" instead of "TrajGRU".
This is indeed a mistake, it should say TrajGRU. This has been revised.
- In section 2.2., the authors described the main formulas of TrajGRU, some notations of the equations are missed (e.g., ∗, f). In addition, there seems to be a lack of explanation for the comparison between ConvLSTM and TrajGRU, especially for figure 1. Also, please add more information in caption of figure 1 (e.g., what is colored lines mean?).
Missing notations and missing captions for figure 1 have been added in the revised manuscript.
- In Line 118, “Because of this we freeze the weights of the outermost TrajGRU layer of both encoder and forecaster for the finetuned model and only train the two innermost layers on the German RADOLAN data afterwards.”, please provide additional information about how you fine-tuned model (e.g., learning rate setting, etc.). Furthermore, I wonder if the authors experimented with directly using pre-trained parameters in the new model on RADOLAN data. Although they mentioned that “Other finetuning configurations were tested, such as freezing more layers or none at all, but displayed worse performance.” I suggest adding more detailed explanation of the other possible fine-tuning approaches used.
A detailed experimental setup of the finetuned model (for example learning rate) can be found on Page 5 Line 110 to Line 114. A more detailed breakdown of the other finetuning configurations was added in the revised manuscript.
- The general purpose of transfer learning (i.e., fine-tuning) is to solve the problem of model underfitting due to the limited availability of model input data. The authors also explained this, and information on the amount of RADOLAN data used is given, but there is a lack of information about the amount of pre-trained HKO-7 data. Therefore, it would be better to discuss not only the distribution of data according to rainfall intensity but also the difference in the overall amount of data between RADOLAN and HKO-7 data.
A more detailed discussion of the differences in overall amount between RADOLAN and HKO-7 has been added in the revised manuscript.
- Since the model performance fluctuates with increasing the number of iterations, and there is a possibility that overfitting problems should seriously affect the model performance, it is difficult to say that the fully trained model result (i.e., performed initial set 100,000 iterations) is the optimal result. So, I wonder if the author used any method other than full training to reach the optimal model state used to obtain the highest scores and the number of iterations it takes to reach it, as mentioned in Tables 2 and 3. If not, it is suggested to use a methodology such as early stopping method to obtain optimal model performance.
The results in the paper are obtained using the model iteration that had the best average CSI and HSS scores. Some explanation about this process as well as general thoughts about more optimal methods like early stopping or weight decay have been added in the revised manuscript.
- For the results of case studies, in figure 6, it has not been fully explained why two different pictures of the input data are needed. Are there any implications for each of the two pictures? If not, it would be better to remove one of the two. I suggest from the perspective of comparing model results, the picture in the second column is better to remove. Also, why don’t you compare the “train from the scratch” model results with “finetuned” model results in case studies? It would be interesting to see the effect of fine-tuning through qualitative comparison with the “train from scratch” model.
We thank the anonymous referee for the suggestion. We agree that the second sample is redundant and a direct comparison with the model trained from scratch would be more interesting, so we replaced the second case study with a comparison to the model trained from scratch in the revised manuscript.
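The model selection mentioned in the reply on overfitting (choosing the iteration with the best average CSI and HSS scores) could look roughly like the following. This is only a minimal sketch: the function name `select_best_iteration`, the equal weighting of CSI and HSS, and all numbers are assumptions made for illustration and are not taken from the manuscript.

```python
from statistics import mean


def select_best_iteration(validation_scores):
    """Return the iteration whose mean of CSI and HSS is highest.

    validation_scores maps an iteration number to a (csi, hss) pair,
    each value already averaged over thresholds and forecast frames.
    Weighting CSI and HSS equally is an assumption of this sketch.
    """
    if not validation_scores:
        raise ValueError("no validation scores given")
    return max(validation_scores, key=lambda it: mean(validation_scores[it]))


# Hypothetical scores, for illustration only (not results from the paper).
scores = {
    20000: (0.31, 0.42),
    40000: (0.35, 0.47),
    60000: (0.34, 0.49),
    80000: (0.33, 0.45),
}
best = select_best_iteration(scores)
```

Picking the best stored iteration after the fact is close in spirit to early stopping, but it does not shorten training, since all iterations are still run before the comparison.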
Citation: https://doi.org/10.5194/egusphere-2022-440-AC2
-
RC3: 'Comment on egusphere-2022-440', Anonymous Referee #3, 28 Jul 2022
Find attached the document with my comments and the manuscript with the markups
-
AC3: 'Reply on RC3', Annette Rudolph, 14 Aug 2022
We would like to thank anonymous referee #3 for their extensive constructive comments and suggestions. We are in the process of revising the manuscript with the referee’s suggested changes. Point-by-point answers to the referee’s comments can be found below.
Content comments:
- 2.1 (data). This section lacks some important information:
- Did you do any pre-processing on the images or is 480x480 pixels the original size of the image?
The RADOLAN product is a composite of 17 radar stations which cover a bigger area. We cut out a 480x480 km region over central Germany. This information has been added in the revised manuscript.
- Did you use reflectivity or precipitation? In case of precipitation, what was the Z-R relation used?
RADOLAN is a reflectivity product. This information is given on page 3, line 67.
- Is it a CAPPI or PPI? Please, inform the height or sweep elevation;
We will include the information in the text.
- The used period from 2017 to 2021 comprises only three years: from Apr/2017 to Mar/2021. Make it clear to the reader;
The amount of days is already given on page 3, line 72. To make it more clear, the amount of years was also added in the revised manuscript.
- Did you use the complete sequence or only selected rain events, as Shi et al. (2017)?
Only days with precipitation were selected from the RADOLAN dataset. This information has been added in the revised manuscript.
- Explain the selection of training, validation and testing sets: summer to test, winter to validate;
Roughly the first 10% of frames were chosen for testing, the last 5% of frames for validation.
- Instead of “German RADOLAN”, use “RADOLAN”.
This has been changed in the revised manuscript.
- Please, if possible, give more information: What is the weather radar type: band, polarization, Doppler? Where can the reader find this type of information?
All DWD weather radars are Doppler radars. This information has been added in the revised manuscript.
- Line 120: “Other finetuning configurations were tested” The authors should comment more on this;
A more detailed breakdown of the other finetuning configurations was added in the revised manuscript.
- Table 1:
- The first threshold includes rain rate = 0. What do you consider as no rain?
The first threshold includes all rain events with a rain rate R larger or equal to 0 and smaller than 0.5.
- The RADOLAN column sum more than 100%;
This is a mistake due to rounding and has been corrected in the revised manuscript.
- How was this table calculated, with the complete sequence of the dataset or with selected rainy cases? Is it the distribution of pixel values in the image set?
The table was calculated with the complete sequence of the dataset, which only contains rainy days. The percentages indeed refer to the pixel distribution. This information has been added in the revised manuscript.
- Shi et al. (2017) used selected rainfall events. Is the HKO-7 column considering only these events? (You do not need to repeat Shi et al.’s paper, but you should provide enough information for your reader to understand what you are talking about.)
The HKO-7 column only considers the selected rainy days. This information has been added in the revised manuscript.
- Line 138: The authors introduced binary values (0, 1) based on thresholds. They must inform the meaning of the values above and below the thresholds;
We believe the anonymous referee is asking to clarify how the binary values are assigned. If a pixel is above or equal to the currently selected threshold r, it gets converted to a 1, otherwise to a 0. This clarification has been added in the revised manuscript.
- Line 145: What “measurements” do you refer?
“Measurements” refers to the two errors in Table 2.
- Line 145: Briefly describe Welch’s t-test in sec. 3;
A brief description of the t-test has been added in the revised manuscript.
- 4.1: The model predicts 20 images, from 5 to 100 min lead time. For which forecast times are the shown results?
The shown results use the average score of all 20 prediction frames. This information has been added in the revised manuscript.
- Fig. 2 is equal as Fig. 3, the same pattern, but with different values. The authors should verify that it is correct. If correct, what is the gain of using such metrics, what does this prove? Why not use another metric to explore more information?
Both CSI and HSS measure how accurate a prediction is to the ground truth, so both measurements correlate with each other. There is value in using both scores, as the CSI measures how many rain events were predicted correctly, whereas the HSS measures if our predictions are better than if we had made a random prediction. We agree though on using additional metrics to explore more information and have added the False Alarm Rate (FAR) and Probability of Detection (POD) skill scores to the revised manuscript.
- Section 4.1 (lines 157-163):
- The values are too small to draw conclusions. The authors forced a conclusion mainly with the expression “big increase” (lines 163, 225)
We think that an approximate 2% difference for such a low performing threshold is indeed a significant increase. However we agree that the expression “big increase” can be misleading and removed the word “big” from lines 163 and 225 in the revised manuscript.
- What about the analysis of the result evolution with forecast time?
Additional statistical analysis regarding prediction time, similar to the qualitative analysis in section 4.2, has been added to the revised manuscript.
- I suggest including other metrics, such as FAR and POD, and some metric to assess the image quality, since you are using a computer vision method;
See our answer to point 8.
- Lines 165-167: “We compare the model output for the real truth in the frame of two case studies. Using the RADOLAN data set, we consider frontal systems at 4 May 2017, 14:40:00 UTC and at 12 May 2017, 07:40:00 UTC. These are two exemplary dates of clusters of mainly moderate precipitation crossing Germany, where the data is not part of the training data set of the model.” The authors should comment on this first of all in sec. 2.3, Experimental setup;
Comments on the case study setup have been added to Section 2.3 in the revised manuscript.
- 4.2:
- You compare your results with Ravuri et al. (2021) and Ayzel et al. (2020). How many examples did these studies use to compute their statistics? Because yours considers just one case;
In line 195 we added the information: "Ravuri et al. (2021) consider a single case study (...)", and in lines 210 ff. we changed the text to: "Ayzel et al. (2020) (...). The authors select 11 events during the summer months of the verification period (2016–2017) and evaluate the models RainNet and Rainymotion for the intensity thresholds 0.125 mm/h, 5 mm/h and 15 mm/h for prediction time up to 60 min."
- Line 199: “the positive effect of finetuning is clearly visible for higher precipitation intensities.” You are referring to a small gain of just one score;
This line has been changed in the revised manuscript.
- What is the merit of the model used in your predictions? The other models have different architectures; you should take this into account in your analysis;
This is a good point and a discussion on differences in model architectures has been added to the revised manuscript.
- Why the case studies weren’t done for both “finetuned” and “scratch”? How will you evaluate the gain of one against the other?
The second sample in the case study has been replaced with a comparison of the first sample to the model trained from scratch in the revised manuscript.
- Do not miss your goal as this is the scientific question you must answer. You need to structure your analyses so you do not mix up the results;
Yes, we will take this into account in the revised manuscript.
- Figs. 2-5, 8: R = 0.5 or R > 0.5? As in Tab. 2. (The same for the other thresholds);
This has been clarified in the revised manuscript.
- Lines 209-210: “However, a statistical analysis would be wishful to confirm an improving CSI over the prediction time for higher thresholds as shown by the case study in Fig. 8 (a).” Why haven't you done it yet? You already have the model outputs. This must be included in the results;
See our answer to point 9b.
- Lines 210-211: “It can be recognized that the HSS, shown in Fig. 8 (b) and (d), provides higher scores than the CSI depicted in Fig. 8 (a) and (c).” Fig. 8, as Figs. 2 and 3, shows the same pattern for CSI and HSS, with different values. You should take a careful look at your results in case you missed something;
See our answer to point 8.
- Fig. 6:
- Why did you put reflectivity images on the 2nd row? What is the point you want to show?
As explained in the caption of Figure 6, the second row is the raw input data. It is used to show what the raw data looks like, compared to our colored versions used for the case studies.
- Is it prediction of rain or reflectivity? (See comment 11d.)
The black and white images represent the rain rate R in mm/h, converted from the raw dBZ RADOLAN data. Clarification and formulas used for this have been added in the revised manuscript.
- In the color legend, what does it mean when the rain field is gray? The color legend is incomplete;
We believe the anonymous referee is referring to the grayscale borders of Germany that were underlaid under the precipitation data to show the 480x480 km cutout of central Germany. We attempt to make this clearer in the revised manuscript.
- Which forecast times are included in Fig. 6? You should comment this in the caption and in the text;
The forecast times used in the case study are explained on page 8, line 170. Clarification to caption and text has been added in the revised manuscript.
- Where are the "scratch" images?
See our answer to point 11d.
- What does negative reflectivity mean? Why didn’t you filter the raw data?
We added more information to the data section (lines 66 ff.); the original data is already gauged: "This data is a reflectivity composite of 17 radar stations in Germany combined with hourly values measured at the precipitation stations. In order to achieve optimized estimates of precipitation, the data on the ground is calibrated with ombrometers. This combination provides high definition data in both temporal and spatial resolution. For more information see Deutscher Wetterdienst (2022)."
-
- Again, what is the predicted variable, rain or reflectivity?
See our answer to point 15b.
- What is the range of your data so I can understand the differences in the images?
Range information for the RADOLAN data has been added in the revised manuscript.
- Line 220: Here you say “similar”, but before you said “slightly better” (line 155);
This has been corrected in the revised manuscript.
- Lines 226-228: “Comparing the here obtained results with recent publications on deep learning algorithms to precipitation nowcasting based on radar data (Ayzel et al., 2020; Ravuri et al., 2021) the finetuned TrajGRU shows slightly higher scores with less decrease with prediction time.” See comments 10 and 11;
We added the following sentence in the conclusion: "We notice that Ayzel et al. (2020) evaluate 11 case studies and (Ravuri et al., 2021) consider one case study for their analysis."
- Lines 228-230: “While Chen et al. (2020); Ayzel et al. (2020) show that their models are suitable to predict precipitation up to 60 minutes (Chen et al., 2020; Ayzel et al., 2020), we achieve comparable scores for a 100 minute prediction time.” I couldn't find this analysis throughout the text;
We included the information of the prediction time in the case study section.
- Lines 230-231: “A statistical analysis of the prediction time for more than the here presented two case studies would be wishful for future research.” You have not presented any analysis regarding forecast time (line 209). What you showed here is still not enough for a publication. You already have the data and the outputs of the models, you need to explore further analysis;
See our answer to point 9b.
- Lines 232-233: “To further optimize the results in future studies, different finetuning methods and finetuning hyperparameters could be taken into account.” You should add some examples in the text;
Examples for methods for possible future studies have been added to the revised manuscript.
- Line 236: “finally” This is an ambitious comment regarding the precipitation nowcasting problem itself. When completed described, your presented solution can help in some ways, but in my opinion, based on my research and experience, I think one product is not enough to solve the complex problem of precipitation nowcasting. I suggest changing to “contribute positively” to be more realistic.
The suggested change has been added in the revised manuscript.
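To make the skill scores discussed in the content comments concrete, the following sketch binarizes predicted and observed rain rates at a threshold r (a pixel with value >= r becomes 1, otherwise 0, as described in the reply on Line 138) and derives CSI, HSS, POD and FAR from the resulting contingency table. The function names and the sample values are hypothetical. Two definitions are assumptions of this sketch rather than statements about the manuscript: FAR is taken as the false alarm ratio b / (a + b), and HSS uses the standard form with the factor 2, whereas some papers (including Shi et al., 2017) print it without that factor.

```python
import math


def contingency(pred, obs, threshold):
    """Count hits, false alarms, misses and correct negatives.

    A pixel counts as rain (1) if its value is >= threshold, else 0.
    """
    if len(pred) != len(obs):
        raise ValueError("pred and obs must have the same length")
    hits = false_alarms = misses = correct_negatives = 0
    for p, o in zip(pred, obs):
        p_rain, o_rain = p >= threshold, o >= threshold
        if p_rain and o_rain:
            hits += 1
        elif p_rain:
            false_alarms += 1
        elif o_rain:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def _ratio(numerator, denominator):
    # Scores are undefined when the denominator is zero.
    return numerator / denominator if denominator else math.nan


def skill_scores(pred, obs, threshold):
    """Return CSI, POD, FAR and HSS for flattened rain-rate fields."""
    a, b, c, d = contingency(pred, obs, threshold)
    return {
        "CSI": _ratio(a, a + b + c),
        "POD": _ratio(a, a + c),
        "FAR": _ratio(b, a + b),  # false alarm ratio (assumed definition)
        "HSS": _ratio(2 * (a * d - b * c),
                      (a + c) * (c + d) + (a + b) * (b + d)),
    }


# Hypothetical rain rates in mm/h for six pixels, for illustration only.
predicted = [0.6, 0.2, 0.7, 0.1, 0.5, 0.0]
observed = [0.5, 0.6, 0.1, 0.0, 1.0, 0.2]
scores = skill_scores(predicted, observed, threshold=0.5)
```

With these sample values the contingency table is 2 hits, 1 false alarm, 1 miss and 2 correct negatives. Averaging such per-frame scores over the 20 prediction frames would give a single number per threshold, as described in the reply to the comment on Section 4.1.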
Text comments:
Since most of these points refer to grammar, typographical errors or phrasing, we will not answer them point by point. Most of the suggested changes in this section have been added to the revised manuscript. We thank the anonymous referee again for their thorough comments and suggestions.
Citation: https://doi.org/10.5194/egusphere-2022-440-AC3
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
767 | 319 | 38 | 1,124 | 27 | 20 |