This work is distributed under the Creative Commons Attribution 4.0 License.
DEUCE v1.0: A neural network for probabilistic precipitation nowcasting with aleatoric and epistemic uncertainties
Abstract. Precipitation nowcasting (local forecasting for the next 0–6 h) serves both public safety and industry, facilitating the mitigation of losses incurred due to, e.g., flash floods. It is usually done by predicting weather radar echoes, which provides better performance than NWP at that scale. Probabilistic nowcasts are especially useful, as they provide a desirable framework for operational decision-making. Many extrapolation-based statistical nowcasting methods exist, but they all suffer from a limited ability to capture the nonlinear growth and decay of precipitation, leading to a recent paradigm shift towards deep learning methods, which are more capable of representing these patterns.
Despite its potential advantages, the application of deep learning to probabilistic nowcasting has only recently started to be explored. Here we develop DEUCE, a novel probabilistic precipitation nowcasting method based on Bayesian neural networks with variational inference and the U-Net architecture. The method estimates the total predictive uncertainty of precipitation by combining estimates of the epistemic (knowledge-related, reducible) and heteroscedastic aleatoric (data-dependent, irreducible) uncertainties, and produces an ensemble of development scenarios for the following 60 minutes.
DEUCE is trained and verified using Finnish Meteorological Institute radar composites against established classical models. Our model is found to produce both skillful and reliable probabilistic nowcasts according to various evaluation criteria. It improves ROC area-under-the-curve scores by 1–5 % over the STEPS and LINDA-P baselines and comes close to the best-performing STEPS in terms of CRPS. The reliability of DEUCE is demonstrated by, for example, its having the lowest Expected Calibration Error at the 20 and 25 dBZ reflectivity thresholds and the second lowest at 35 dBZ. On the other hand, the deterministic performance of the ensemble mean is found to be worse than that of the extrapolation and LINDA-D baselines. Lastly, the composition of the predictive uncertainty is analysed, with the conclusion that aleatoric uncertainty is more significant and informative than epistemic uncertainty in the DEUCE model.
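The abstract's split of total predictive uncertainty into aleatoric and epistemic parts follows the standard decomposition for Bayesian networks sampled with multiple stochastic forward passes. A minimal sketch, assuming per-pass predicted means and heteroscedastic variances (illustrative names, not the paper's exact implementation):

```python
import numpy as np

def total_predictive_variance(means, variances):
    """Combine T stochastic forward passes of a Bayesian network.

    means, variances: arrays of shape (T, ...) holding the predicted
    mean and heteroscedastic (aleatoric) variance from each pass.
    Returns (total, aleatoric, epistemic) per-pixel variances.
    """
    aleatoric = variances.mean(axis=0)  # average predicted data noise
    epistemic = means.var(axis=0)       # spread of means across passes
    return aleatoric + epistemic, aleatoric, epistemic
```

The epistemic term shrinks as the posterior over weights concentrates (it is reducible), while the aleatoric term is a property of the data and persists.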
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Interactive discussion
Status: closed
-
RC1: 'Comment on egusphere-2023-1100', Anonymous Referee #1, 25 Sep 2023
The authors consider a Bayesian neural network for precipitation nowcasting that is based on the U-Net architecture (DEUCE). This method estimates the total predictive uncertainty of precipitation, subdividing it into epistemic and aleatoric uncertainties. The model provides development scenarios up to 60 minutes ahead. DEUCE is trained and evaluated on Finnish Meteorological Institute radar composites against established methods. The first results suggest a promising approach for improving precipitation nowcasting.
General comments:
The text is very well written and the illustrations are clear and helpful. The evaluation of the different uncertainties is valuable. The authors observe that most of the uncertainty is of an aleatoric nature and that the contribution of epistemic variance is universally small.
A challenging rainfall event is chosen as a case study. This is a good idea. But the reader might also be interested in how well the model performs for a more frequently occurring type of precipitation event.
The results are discussed carefully and in detail.
There is some information about the data preprocessing missing. How is the input data distributed? This is important in order to understand the plausibility of the application of the used mathematical method.
Small comments:
- Notation in 2.1, first paragraph: real value bold(y), do you mean a real-valued vector? Theta is not defined; is it also vector-valued? Could you give a hint what kind of parameter theta is?
- Please explain in more detail: Line 190: how is D_KL defined?
- Line 280: You normalize the data between zero and one. Maybe I missed it, but how is the input data distributed? Precipitation is usually skewed; is the data transformed into a normal distribution? Please check whether it is mathematically correct to apply all methods to (possibly) non-normally-distributed data.
Citation: https://doi.org/10.5194/egusphere-2023-1100-RC1 -
AC1: 'Reply on RC1', Bent Harnist, 06 Nov 2023
Dear reviewer,
We thank you for the valuable comments you provided. We have addressed the points mentioned, and our response to each of them is detailed below.
In response to general comments:
The case study chosen at first is indeed a rather intense, mostly convective rainfall event. Although the quantitative verification represents the average performance of the model across a diverse corpus of events, we added a second case study concerning a rather different large-scale stratiform precipitation event. There, we found many of the same features as in the first case study, and we believe that this addition gives the reader a better general view of how the model performs in different scenarios.
To address the concerns about the data distribution, we added a histogram of the dataset reflectivity, highlighting the fact that the reflectivity values most likely to represent precipitation are normally distributed, which motivates the modeling of the predictive distribution.
In response to specific comments:
We attempted to clarify the notation in subsection 2.1. What was meant by "real" was observed, or ground-truth. Next, the symbols x, y, ŷ were clarified to mean tensors, and Θ a list of tensors. We hope that these changes make the sentences concerned clearer and more understandable.
Line 190:
We expanded the KL divergence D_KL to its definition in Eq. 1.
Line 280:
It is correct to note that precipitation rates are not normally but log-normally distributed. However, in this work we apply the model to reflectivity, not precipitation rates. Radar reflectivity is known to follow a normal distribution because rain rate follows a log-normal distribution [1], and we verified with histograms that this also holds for our input data specifically. We found that the part of the reflectivity distribution most likely corresponding to precipitation, rather than clutter or noise, indeed has a Gaussian shape. We also clarified the language to make explicit that we approach precipitation nowcasting by working with radar reflectivity values, which may in turn be used to make quantitative precipitation rate or accumulation predictions.
1. Kedem, B. and Chiu, L. S.: On the lognormality of rain rate, Proceedings of the National Academy of Sciences, 84, 901–905, https://doi.org/10.1073/pnas.84.4.901, 1987.
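The log-normality argument above can be checked numerically: if rain rate R is log-normal and reflectivity follows a power-law Z-R relation, then dBZ = 10 log10(Z) is an affine function of ln R and hence Gaussian. A small sketch (the Marshall-Palmer coefficients a = 200, b = 1.6 are assumed for illustration; the exact Z-R relation used in the paper is not restated here):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 200.0, 1.6  # assumed Z-R coefficients, Z = a * R**b
R = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # log-normal rain rate (mm/h)
dBZ = 10.0 * np.log10(a * R**b)  # = 10*log10(a) + (10*b/ln 10) * ln R
# An affine transform of a Gaussian variable (ln R) is Gaussian, so the
# sample skewness of dBZ should be near zero.
z = (dBZ - dBZ.mean()) / dBZ.std()
skewness = np.mean(z**3)
```

The sample mean of dBZ sits near 10 log10(a) and the distribution is symmetric, consistent with the histogram-based check described in the reply.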
Citation: https://doi.org/10.5194/egusphere-2023-1100-AC1
-
RC2: 'Comment on egusphere-2023-1100', Anonymous Referee #2, 09 Oct 2023
Overall, the manuscript is well written and the main results are clearly highlighted throughout the text. All the figures are appropriately labeled and captioned; it is evident that the authors have devoted significant effort to effectively communicating their results. Consequently, I believe it absolutely deserves to be published in Geoscientific Model Development.
However, I am flagging this manuscript as a minor revision because there are a couple of important areas (see major comments below) that deserve a more careful examination, along with several minor grammatical and typographical errors. Once these are addressed, I will be happy to enthusiastically recommend the revised manuscript for publication.
Major comments:
While the authors address both points below at some length in Section 5, I am still not fully satisfied with their given explanations from a machine learning perspective.
1. The inability of DEUCE to accurately forecast precipitation at smaller scales (see Fig. 15) deserves additional discussion. For instance, how does the DGMR research of Ravuri et al. (2021) address this issue? How do the authors plan to augment their current model to improve its expressivity at small scales?
A simple starting point could be to train the model for more epochs. From L285-286, I infer that the current training procedure runs for only ~30 epochs. Given that the batch size is only N=2, this is lower than for most neural networks of a similar size. Another potential improvement could be to normalize the MSE in the first term of Eq. 5 with the forecast output (see, for example, Eq. 4 of https://arxiv.org/abs/2310.02994) to prevent errors at larger scales from dominating the loss.
2. A secondary area of concern for the DEUCE model is the under-forecasting of exceedance probabilities, especially at short lead times. Moreover, according to Fig. 10, this behavior holds for all reflectivity thresholds. Again, could this be connected to the model training procedure in some fashion? For example, would it be mitigated by training the model on weighted input samples, with weights determined according to the observed precipitation distribution quantiles?
Minor comments:
L7: Omit comma in "...deep learning methods, more capable..." and replace by "...deep learning methods which are more capable..."
L75: Unclear what "discriminative deep learning models" refers to here since, as far as I understand, all the models described in the previous paragraph have at least some generative component to them.
The post-processing procedure described in L240-252 for correctly approximating the spatio-temporal structure of ensemble members is quite impressive. I was wondering if the authors could add 1-2 lines in the Conclusion discussing how this step could be performed within a neural network setup.
L324-325: Omit "the" in "...the 9 July 2022...;" use "up to" instead of "at" in "...leading at 15:00 UTC..."
L353: Omit "all" in "...which we all computed..."
L421: Rephrase "...keeps open the possibility..."
L423: Rephrase "The suspicions are..."
L426: Replace "to" with "for" in "...explanation to this..."
L431: Replace "some very" with "a" in "...some very slight increase..."
L503: What does "variety" mean in "variety of the ensemble"?
L507-509: Rephrase the sentence: "Hence, despite ... results of Ravuri et al. (2021)." because it's unclear what the main point here is. See also major comment 1 above.
Citation: https://doi.org/10.5194/egusphere-2023-1100-RC2 -
AC2: 'Reply on RC2', Bent Harnist, 06 Nov 2023
Dear reviewer,
We thank you for the valuable comments you provided. We have addressed the points mentioned, and our response to each of them is detailed below.
In response to major comments:
We addressed the comments by adding discussion of the two facets mentioned from a machine learning perspective. Since minor revisions were requested, we did not include any supplementary experiments in the manuscript. We nevertheless carried out some experiments exploring the suggestions made, and our findings were:
1. We tried training for more epochs (60 in total) but did not find that increasing the number of training epochs improves validation performance in the present case. However, we cannot exclude the potential effect of external factors such as our learning rate schedule.
2. We tried weighting the likelihood part of the loss by the inverse of the density of each pixel's reflectivity value in the dataset distribution. Unfortunately, this only resulted in considerable over-forecasting and oddly behaving aleatoric uncertainties. While this type of weighting is difficult to get right, we acknowledge the data imbalance problem in precipitation nowcasting, which makes it important to weight samples such that all occurrences, even rare ones, can reasonably contribute to the gradients.
The discussion section was modified such that the paragraph on small-scale variability was reworked, and a new paragraph was added next to it describing potential ways to mitigate the under-forecasting and the lack of small-scale variability together from a machine learning perspective. We presented the two methods you suggested together with some of our own ideas.
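The inverse-density weighting described above could be sketched as follows (a hypothetical helper, not the code used in the experiments). The clip bound illustrates one way to tame the failure mode observed: without it, weights on rare high reflectivities dominate the gradients and push the model towards over-forecasting.

```python
import numpy as np

def inverse_density_weights(values, hist_counts, bin_edges, clip=100.0):
    """Per-pixel weights proportional to 1 / dataset density of the
    pixel's reflectivity value.

    hist_counts, bin_edges: a precomputed histogram of the training data.
    clip bounds the weight of very rare values so they cannot dominate
    the loss.
    """
    density = hist_counts / hist_counts.sum()
    # Map each pixel value to its histogram bin.
    idx = np.clip(np.digitize(values, bin_edges) - 1, 0, len(hist_counts) - 1)
    w = 1.0 / np.maximum(density[idx], 1e-12)
    w = np.minimum(w, clip)
    return w / w.mean()  # normalize so the overall loss scale is unchanged
```

These weights would multiply the per-pixel likelihood term of the loss before averaging.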
In response to minor comments:
L7:
This was fixed.
L75:
To the extent of our knowledge, all of the models presented above L75 are discriminative, as they only model the conditional probability of the predicted outputs given the input frames, maximizing its likelihood through the MSE loss. In contrast, generative models such as GANs model the joint probability distribution of the outputs and inputs. We have nevertheless removed the word "discriminative" from the sentence to reduce possible confusion, because the same point can be made without it.
L240-252:
After some thought, we indeed arrived at a possible method to better incorporate the correlated noise sampling procedure into the network, which is detailed in the conclusions section:
- "We could also think of directly appending the post-processing sampling with spatially correlated noise to the neural network, or even learning context-dependent spatiotemporal correlation structures. The sampled outputs could then be, e.g., fed to a GAN-like discriminator module, which would drive the processed outputs to be more realistic while retaining the uncertainty decomposition."
L324-325:
We fixed that.
L353:
This has been fixed too.
L421:
"...keeps open the possibility for..." was rephrased to "...means that we cannot exclude...".
L423:
Rephrased to "Such dependencies are...".
L426:
This has been fixed.
L431:
This has been fixed.
L503:
We meant "breadth of the ensemble" and replaced "variety" here, acknowledging the ambiguity of "variety" in this context.
L507-509:
The paragraph in question was restructured in an attempt to separate and clarify the different points made. See the major comment modifications.
Citation: https://doi.org/10.5194/egusphere-2023-1100-AC2
AC3: 'Supplement for the reply on RC2', Bent Harnist, 07 Nov 2023
Dear reviewer,
We have attached a supplement illustrating the points made in our response to major comments 1 and 2. The supplement is a figure showing the predictive mean and standard deviation of a base DEUCE checkpoint, a checkpoint with longer training, and a checkpoint trained with the loss weighting scheme described, on a case taken from the prediction split (2019-05-25 13:00:00 UTC).
For this run, we slightly modified the training procedure, as we found that the original DEUCE checkpoint sometimes experienced instabilities (NaN loss) when trained for longer. The base DEUCE checkpoint here did not experience the same issues but had similar performance to the original one; it was trained for 37 epochs with a batch size of 8 and a sample size of 4, using 256x256 px random crops from the 512x512 px area, with learning rate decay after 5 epochs of non-improving validation loss (starting at 1e-4). The "long" version started from the base checkpoint but was trained for up to 60 epochs. In the visualization, we show it after 57 epochs, where its validation performance peaked. The "weighted" version was again trained from scratch with the same hyper-parameters except for the loss function. It achieved its peak validation performance after 28 epochs, which is the checkpoint shown in the figure.
Data sets
Harnist, B., Pulkkinen, S., and Mäkinen, T.: Data for the manuscript "DEUCE v1.0: A neural network for probabilistic precipitation nowcasting with aleatoric and epistemic uncertainties" (2023), https://doi.org/10.23728/fmi-b2share.3efcfc9080fe4871bd756c45373e7c11
Model code and software
Harnist, B.: fmidev/deuce-nowcasting: Initial release of the source code for the manuscript, https://doi.org/10.5281/zenodo.7961955