Deep learning of extreme rainfall events from convective atmospheres
Abstract. Our subject is a new Catalogue of radar-based heavy Rainfall Events (CatRaRE) over Germany, and how it relates to the concurrent atmospheric circulation. We classify daily atmospheric ERA5 fields of convective indices according to CatRaRE, using an array of conventional statistical and more recent machine learning (ML) algorithms, and apply them to corresponding fields of simulated present and future atmospheres from the CORDEX project. Due to the stochastic nature of ML optimization, there is some spread in the results. The ALL-CNN network performs best on average, with several learning runs exceeding an Equitable Threat Score (ETS) of 0.52; the single best result was from ResNet with ETS = 0.54. The best-performing classical scheme was a random forest with ETS = 0.51. Regardless of the method, increasing trends are predicted for the probability of CatRaRE-type events, from ERA5 as well as from the CORDEX fields.
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
RC1: 'Comment on egusphere-2022-1159', Anonymous Referee #1, 28 Nov 2022
In their manuscript, Bürger and Heistermann trained and applied multiple ML/DL models to the (binary) classification task of detecting convectively forced extreme rainfall in Germany. They compare a whole range of models, from regression models through random forests and shallow neural nets to well-known DL models for image classification such as AlexNet, GoogLeNet or ResNet. The study's classification task can be formulated as follows: given the (daily aggregated) ERA5 fields of CAPE, convective rainfall, and total column water over Germany, does CatRaRE contain at least one event that exceeds warning level 3 and lasts no longer than nine hours? Bürger and Heistermann used cross-entropy as the loss function during training and evaluated their final results based on the Equitable Threat Score (ETS). As their DL models use a stochastic gradient descent algorithm during the optimisation, they trained 20 individual realisations per DL model, resulting in an ensemble. Based on that (simplistic) ensemble, they report that the ALL-CNN model shows the highest mean ETS (across the ensemble) of 0.52, while the ResNet architecture provides the ensemble member with the highest individual ETS (0.54).
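For orientation, the following minimal sketch shows how an Equitable Threat Score could be computed from binary daily event series; the function name, toy series and values are illustrative assumptions, not taken from the manuscript or its code.

```python
import numpy as np

def equitable_threat_score(y_true, y_pred):
    """Equitable Threat Score (Gilbert skill score) for binary event series.

    y_true, y_pred: array-like of 0/1 labels, one entry per day.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    hits = np.sum(y_pred & y_true)
    false_alarms = np.sum(y_pred & ~y_true)
    misses = np.sum(~y_pred & y_true)
    n = y_true.size

    # Hits expected by chance for a forecast with the same marginal frequencies
    hits_random = (hits + misses) * (hits + false_alarms) / n
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom != 0 else np.nan

# Toy example: 10 days, 4 observed events, 3 of them detected, 1 false alarm
y_obs = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(round(equitable_threat_score(y_obs, y_hat), 3))  # about 0.41
```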
The manuscript is well structured, and I appreciate the extensive model selection used for comparison and acknowledge the effort spent to train all of these. Even though the intercomparison of different methods and architectures is interesting on its own, I have difficulties distilling the overall relevance (concrete use case) of the classification for meteorological applications. I have some more points of concern, as listed below, but I am confident that the authors can adequately address them and that the manuscript can provide an important contribution to the field. I wonder if the GMD/ESSD inter-journal SI "Benchmark datasets and machine learning algorithms for Earth system science data" (https://gmd.copernicus.org/articles/special_issue386_1147.html) might be better suited to the manuscript's scope. At least for me, the study's focus lies more on the intercomparison aspect than on the "monitoring of precursors of evolution".
Major Comments
- As mentioned above, it does not become clear to me what consequences a statement like "There is an extreme convective event (somewhere) over Germany" might have for a meteorologist, climatologist or decision-maker. L 229f somehow reflects the ultimate goal; however, it might be good to distil this gain further in the introduction as well.
- I wonder how a cross-entropy or ETS analysis might contribute to a better understanding of the influence of 'deep' in DL models, as stated in l. 41f. For such a statement, I would have expected some explainable AI (XAI) methods or some sensitivity analysis of each model type, like varying the number of inception blocks in the 'GoogLeNet-style' model. Here the introduction raises expectations that the conclusion does not reflect.
- As far as I understand, you are using ERA5 data (cape, cp, tcw) as input X and CatRaRE as target y for training (2001-2010) and validation (2011-2020). Finally, you apply the trained model to data from HIST and RCP85. In l 144, you correctly state that the second dataset is not independent of the DL models, as you use those for model selection. As overfitting can happen with respect to both parameters (training set) and hyperparameters (validation set), why do you not split your data into three sets (training, validation, test)? Especially as you apply the trained models to data from different sources that likely have different properties, I think it would be beneficial to compare the test set's performance against the same (sub-)period of RCP85. Thus, you could detect differences in model performance that might serve as a guide towards interpreting all RCP85 data where you do not have any labels.
- I suggest broadening the analysis of the predicted probabilities over the entire detection period. For example, replacing Fig. 5 with a reliability diagram, where the predicted probability is plotted against the observed relative frequency, might reveal model-specific differences (a minimal sketch of such a diagram follows after this list).
- Given the close range of ETS values across the different models, I suggest providing uncertainty quantifications and/or statistical tests to demonstrate the significance of your findings.
- How do existing 'classical' findings on the expected change of extreme precipitation align with your classification results? Can you discuss the concept drift in the data that the classifier faces?
- In that regard, which period do you use to calculate the mean and std for the z-transformation?
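To make the reliability-diagram suggestion above concrete, here is a minimal sketch of how predicted daily probabilities could be binned against observed relative frequencies; the bin count, variable names and synthetic data are illustrative assumptions only, not values from the study.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_curve(p_pred, y_obs, n_bins=10):
    """Return mean predicted probability and observed relative frequency per bin."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    mean_p, obs_freq = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_p.append(p_pred[mask].mean())
            obs_freq.append(y_obs[mask].mean())
    return np.array(mean_p), np.array(obs_freq)

# Synthetic, perfectly reliable forecasts; in practice p_pred would hold the
# model's daily event probabilities and y_obs the CatRaRE-derived labels.
rng = np.random.default_rng(0)
p_pred = rng.uniform(size=3650)
y_obs = (rng.uniform(size=3650) < p_pred).astype(float)

mp, of = reliability_curve(p_pred, y_obs)
plt.plot([0, 1], [0, 1], "k--", label="perfect reliability")
plt.plot(mp, of, "o-", label="model")
plt.xlabel("predicted probability")
plt.ylabel("observed relative frequency")
plt.legend()
plt.show()
```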
Minor Comments
- L. 22ff Besides the references to the 'classical' DL introductions, I encourage the authors to also focus on the recent discussions on ML/DL applications in atmospheric sciences like Reichstein et al. (2019) and Schultz et al. (2021).
- L. 158f How do you analyse the influence of cape? In l. 126 you state that you are using cape, cp and tcw as channels similar to RGB. Please clarify how you create the "non-cape" classifications. Do you train the models with two channels only? Do you replace the cape channel with zeros or another variable?
- Fig. 1 shows cape values jointly with the CatRaRE events used to define the extreme labels. The selected model domain contains pixels outside of Germany. CatRaRE, however, covers Germany only. Did you check (most likely with some other dataset) how often (if at all) extreme events occur outside of Germany but within your defined model domain? For me, that seems to be a potential source of labelling errors.
- Fig. 4: I suggest using a more colourblind-friendly palette.
- Even though Table S1 lists several tuned hyperparameters, how does the learning rate change under the poly policy? (A minimal sketch of the poly schedule follows after this list.)
- I suggest adding a column reporting the number of trainable parameters of your modified versions.
- Did you consider also using architectures already focussing on precipitation (for example (your) RainNet model (Ayzel et al., 2020)) and adjusting details for your classification task?
- L. 59 I am wondering if a log transformation for cp before applying the standardisation might be beneficial
- Please provide some more details on the EOF reduction. For example, how many components are you using?
- From the first sentence in your abstract, I expect this manuscript to focus on creating a new data set that can be used for ML/DL applications. In its current state, the abstract does not adequately transport the enormous (DL-)model comparison you performed.
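Regarding the 'poly' learning-rate policy mentioned above: under this common schedule the learning rate decays from its base value towards zero as a power of the training progress. A minimal sketch, with placeholder values for the base rate, power and iteration count (not the authors' settings):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """'Poly' policy: lr = base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Decay of an assumed base rate of 1e-3 over 10,000 iterations (placeholder values)
for it in (0, 2500, 5000, 7500, 10000):
    print(it, f"{poly_lr(1e-3, it, 10000):.2e}")
```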
Formal Comments
As a reviewer, I was asked to help ensure that manuscripts comply with the journal's guidelines. Therefore, I'd like to point out some formal aspects:
- Please add a "competing interests" statement as required by Copernicus Publication (see https://www.natural-hazards-and-earth-system-sciences.net/submission.html#manuscriptcomposition §16)
- Software Code: You refer to your GitHub repository, but to the best of my knowledge Copernicus journals prefer software provided through a DOI (e.g. through Zenodo)
- URLs: Please add the last access dates to all URLs
- A legend is missing in Fig. 3
References
- Ayzel, G., Scheffer, T., and Heistermann, M.: RainNet v1.0: a convolutional neural network for radar-based precipitation nowcasting, Geoscientific Model Development, 13, 2631–2644, https://doi.org/10.5194/gmd-13-2631-2020, 2020.
- Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., and Prabhat: Deep learning and process understanding for data-driven Earth system science, Nature, 566, 195–204, https://doi.org/10.1038/s41586-019-0912-1, 2019.
- Schultz, M. G., Betancourt, C., Gong, B., Kleinert, F., Langguth, M., Leufen, L. H., Mozaffari, A., and Stadtler, S.: Can deep learning beat numerical weather prediction?, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379, 20200097, https://doi.org/10.1098/rsta.2020.0097, 2021.
Citation: https://doi.org/10.5194/egusphere-2022-1159-RC1
AC1: 'Reply on RC1', Gerd Bürger, 10 Jan 2023
RC2: 'Comment on egusphere-2022-1159', Anonymous Referee #2, 21 Feb 2023
The authors applied a series of classification approaches to classify ERA5 data of convective indices into binary classes, with an existing event catalogue (CatRaRE) as the target. The series of classification approaches includes several conventional methods (Lasso, random forests, logistic regression, and a shallow neural network) and more sophisticated CNN-based approaches that have been used in the computer vision area (LeNet-5, AlexNet, ResNet, and so on). All the models are trained with daily data from 2001 to 2010 and validated with data from 2011 to 2020. The trained models were also applied to future projections from GCM data to predict extreme indices in the future.
The manuscript is relatively extensive. However, it basically lays out all the methods and is not driven by any research questions and objectives. Without finely tuning each method, especially for the sophisticated CNN models, it is unclear why to compare these methods and also the very similar results for each method give readers very limited insights from their studies. The manuscript appears more like a machine learning exercise instead of scientific research and it is not suitable for publication in this research journal.
Citation: https://doi.org/10.5194/egusphere-2022-1159-RC2
AC2: 'Reply on RC2', Gerd Bürger, 28 Feb 2023
We would like to thank the referee for taking the time to review our manuscript, and for the open and clear criticism. The central part of the comment is, as we understand it, the following:
„[…] [the manuscript] basically [...] is not driven by any research questions and objectives. Without finely tuning each method, especially for the sophisticated CNN models, it is unclear why to compare these methods and also the very similar results for each method give readers very limited insights from their studies.“
The referee's critique falls into three parts, which we rephrase according to our understanding and respond to as follows:
1. The study lacks research questions and objectives.
The overall objective of this study is to link the occurrence of impact-relevant convective rainfall events to the large-scale circulation, and to explore this link in order to quantify changes in the frequency of such events under past and future climate conditions. The specific research questions following from this objective are: (i) is there a sufficiently stable link between convective heavy rainfall events, as represented by the recently published event catalogue CatRaRE, and indices of the prevailing convective atmosphere? (ii) which set of indices should be chosen? (iii) can conventional methods still compete with up-to-date deep learning (DL) methods with regard to the previous question? (iv) if we can establish such a link and apply it to simulated atmospheres, how do past and future climate conditions affect the frequency of such impact-relevant convective rainfall events?
In our view, these research questions are relevant and, to date, not sufficiently addressed in the scientific literature. So while we disagree with the referee's notion, we admit that the objectives and research questions could be laid out more clearly and concisely, and we will revise the manuscript accordingly.
2. The results pertaining to the DL models are inconclusive because their application requires more fine-tuning.
With regard to DL tuning, we have described that we purposefully used the basic model structures as is from the corresponding image recognition tasks, but fine-tuned the settings to achieve convergence of the learning curves. This was successful, and the residual uncertainty is likely a genuine one. This topic is widely discussed in the DL community (cf. https://machinelearningmastery.com/randomness-in-machine-learning) and represents in itself an interesting scientific result for which we have not found any reference in the corresponding literature. Our handling of randomness is exactly as proposed in the above-linked document. We will discuss the issue more explicitly in the revised version of the manuscript.
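As a minimal illustration of that procedure (not the authors' actual training code; the training routine below is a placeholder), repeated runs with different random seeds could be summarized by the mean, spread and maximum of their validation ETS:

```python
import numpy as np

def train_and_score(seed):
    """Placeholder for one full training run with a given random seed.

    A real implementation would seed the framework, build and fit the network,
    and return its Equitable Threat Score on the validation period; here a
    synthetic score stands in so the sketch is runnable.
    """
    rng = np.random.default_rng(seed)
    return 0.50 + 0.02 * rng.standard_normal()

scores = np.array([train_and_score(seed) for seed in range(20)])
print(f"mean ETS {scores.mean():.3f}, std {scores.std(ddof=1):.3f}, max {scores.max():.3f}")
```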
3. The results are not relevant to the reader because the performance of all benchmarked models is very similar.
First, Fig. 4 clearly shows that the models are not all very similar. Second, we see the comprehensive assessment of a large collection of methods as part of a transparent and objective methodological approach, and also as a service to the scientific community in order to identify promising model architectures. While it may seem more exciting if a model had emerged as a clear "winner", we do not see why the lack of such an obvious discrimination should make our results less relevant.
Citation: https://doi.org/10.5194/egusphere-2022-1159-AC2
EC1: 'Comment on egusphere-2022-1159', Andreas Hense, 22 Feb 2023
There are now two reviews available: Reviewer 1 suggests major revisions based on an extensive list of suggestions; Reviewer 2 is more pessimistic and suggests rejection, but leaves open the possibility of stating in more detail the scientific questions behind the described approaches. As the responsible handling editor, I read this as a very major revision, which I would like to see incorporated into a revised version of the manuscript.
Therefore I am looking forward to receiving first the authors' comments and then a revised version according to the suggestions of both reviewers.
Citation: https://doi.org/10.5194/egusphere-2022-1159-EC1