Predicting thunderstorm risk probability at very short time range using deep learning
Abstract. Forecasting electrical activity within the atmosphere remains one of the most challenging prediction tasks, especially due to the chaotic nature of thunderstorms. Lightning strikes are precisely located and occur very quickly, which makes this task particularly difficult. Additionally, these phenomena pose a significant risk to aviation, as they statistically strike each aircraft more than once per year. Over the years, several techniques have been employed for very short-term lightning forecasting (less than one hour ahead, at five-minute intervals), such as observation-based methods and, more recently, deep learning methods. Previous studies often face difficulties in accurately forecasting lightning probability, and even with AI-driven methods, it is still difficult to obtain calibrated outputs. To address this limitation, we propose a methodology that successfully predicts lightning risk using Convolutional Neural Networks (CNNs) with attention mechanisms. The network is fed with satellite observations and Numerical Weather Prediction (NWP) outputs formatted as a spatio-temporal sequence. Results show an F1 score of 0.65 for 5-minute predictions and 0.5 for 30-minute predictions, with a very low Expected Calibration Error (ECE) of less than 10 %. Thanks to the well-calibrated outputs, risk probability maps can be plotted, showing areas with strong to low chances of electrical activity.
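For context on the calibration metric quoted above: the Expected Calibration Error compares predicted probabilities with observed lightning frequencies over probability bins. A minimal NumPy sketch of this reliability-style computation (illustrative only, not the authors' evaluation code; the equal-width binning is an assumption) is:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Reliability-style ECE: per probability bin, compare the mean predicted
    probability with the observed lightning frequency, weighted by bin size."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            conf = probs[in_bin].mean()   # mean forecast probability in the bin
            freq = labels[in_bin].mean()  # observed frequency of lightning pixels
            ece += in_bin.mean() * abs(freq - conf)
    return ece
```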
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2893', Anonymous Referee #1, 02 Oct 2025
AC1: 'Reply on RC1', Mélanie Bosc, 06 Nov 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-2893/egusphere-2025-2893-AC1-supplement.pdf
RC2: 'Comment on egusphere-2025-2893', Anonymous Referee #2, 19 Nov 2025
This work is a novel application of AI to lightning forecasts and has significant potential for impact.
The authors successfully adapt the ED-DRAP (Encoder-Decoder Deep Residual Attention Prediction) network, demonstrating that its architecture, particularly the spatial and sequential attention mechanisms, is superior to other spatio-temporal models like ConvLSTM and PredRNN for this specific task. The introduction of a composite loss function combining Cross-Entropy and Dice Loss, with an optimally tuned parameter, effectively addresses the severe class imbalance (only 1% lightning pixels on average) and contributes to the excellent calibration scores.
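To make the composite objective concrete for readers, here is a minimal PyTorch-style sketch of a weighted Cross-Entropy plus Dice loss; the weighting parameter `alpha`, the tensor shapes, and the default values are assumptions for illustration, not the authors' exact implementation or tuned parameter:

```python
import torch
import torch.nn.functional as F

def combined_ce_dice_loss(logits, targets, alpha=0.5, eps=1e-6):
    """Weighted sum of binary cross-entropy and Dice loss.
    logits, targets: tensors of shape (batch, H, W); targets are 0/1 lightning
    masks as floats. alpha (hypothetical here) balances the two terms."""
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets)
    intersection = (probs * targets).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * intersection + eps) / (denom + eps)
    return alpha * ce + (1.0 - alpha) * dice.mean()
```

The Dice term directly rewards overlap with the sparse lightning mask, which is why such combinations are commonly used when the positive class covers only a small fraction of pixels.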
Overall, the work is well done, but I have some concerns which, if addressed, could strengthen the results of the paper:
1) Since the models were trained separately for each forecast horizon, there can be concerns about incoherent forecasts between different forecast horizons. The authors should provide some discussion or visualizations of how the forecasts look between different timestamps.
2) Was the evaluation dataset fixed across the different models per forecast horizon, or was the 30% chosen separately for each horizon?
3) The training / evaluation dataset seems quite small; this also shows in the results, as they are quite jumpy from one forecast horizon to another. I wonder if there was any overfitting due to this as well.
4) To overcome the concerns around a small training / validation dataset, it might be interesting to see if the results generalize to a different part of CONUS, likely keeping the latitude boundaries the same but shifting the longitude bounding box further to the west. If the model trained over the Gulf of Mexico yields good evaluation results over a different region, the conclusions would be more robust.
5) It would be good to discuss the results separated by the diurnal cycle and any peaks at particular hours of the day.
6) The authors state they selected the 13th band of the ABI sensor (infrared at 10.3 µm) because it is "more sensitive to cloud classification". While this band's Brightness Temperature (BT) is correlated with high cloud tops (cumulonimbus), the argument for selecting only this single band out of 16 is not fully explored. The addition of other relevant channels (e.g., water vapor channels) could provide complementary information about the atmospheric column. The authors could at least outline any restrictions they faced in incorporating other bands.
7) The authors' use of NWP data is not entirely clear with regard to which initialization / forecast time is fed as input into the model. The authors state: "Specifically, the following configuration was adopted: 00:00 UTC forecasts were applied from 00:00 UTC to 01:30 UTC, 03:00 UTC forecasts from 01:30 UTC to 04:30 UTC, and 06:00 UTC forecasts from 04:30 UTC to 05:00 UTC".
My questions are: (a) How will this work in real time, since the forecasts initialized at 06:00 UTC appear to be applied to valid times before their initialization? (b) How will the operational latencies of GFS impact performance?
8) In Figure 11, it would be more useful to have a PR curve for a few forecast horizons instead of two different figures for precision and recall, with the impact of choosing different thresholds plotted on the curve. That would make the tradeoff much easier to understand.
9) The authors state that they use a 0.05 threshold to plot the risk probability map since they want high recall, but that can lead to a very low precision. I think a more robust explanation of the chosen thresholds and their impact on the metrics should be provided; see the threshold-sweep sketch after this list.
10) In Figure 11(a) and (b) the results for precision and recall jump quite a bit across the different horizons, sometimes lower and sometimes higher than for the other models. It is actually unclear whether the model truly performs better than the others. In 11(c) the ED-DRAP model actually performs worse than the others for the first 30 minutes and better afterwards. I think it would help to report more metrics here to better understand the performance at earlier horizons across the different baselines, and perhaps to visualize the probability maps for the different models.
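To make the threshold/metric tradeoff raised in points 8 and 9 concrete, a generic sweep over decision thresholds would produce the precision/recall pairs that a PR curve plots. This is a minimal NumPy sketch with hypothetical probability and label arrays, not tied to the authors' data or code:

```python
import numpy as np

def precision_recall_sweep(probs, labels, thresholds=np.linspace(0.05, 0.95, 19)):
    """Precision/recall pairs over a range of decision thresholds,
    i.e. the points one would place on a PR curve."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels).astype(bool).ravel()
    curve = []
    for thr in thresholds:
        pred = probs >= thr
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        precision = tp / (tp + fp) if (tp + fp) else np.nan
        recall = tp / (tp + fn) if (tp + fn) else np.nan
        curve.append((thr, precision, recall))
    return curve
```

Plotting this curve for a few horizons would show directly how much precision is sacrificed when a low threshold such as 0.05 is chosen to maximize recall.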
Citation: https://doi.org/10.5194/egusphere-2025-2893-RC2
Data sets
Availability of GOES-R data T. Schmit et al. https://www.ncei.noaa.gov/products/goes-terrestrial-weather-abi-glm
Availability of GFS data G. White et al. http://doi.org/10.5065/D65D8PWK
Review of “Predicting thunderstorm risk probability at very short time range using deep learning”
The preprint proposes a deep learning methodology for very short-term (5-60 minutes) probabilistic forecasting of lightning risk, motivated by aviation safety within the ALBATROS project. It adapts the ED-DRAP neural network, incorporating spatio-temporal sequences from satellite (GOES-16 ABI brightness temperature and GLM lightning groups) and NWP (GFS lifted index and relative humidity) data over a region centered on the Gulf of Mexico and Florida. A key focus is on achieving well-calibrated outputs through a combined cross-entropy and Dice loss function, enabling interpretable risk probability maps. Results report F1 scores of 0.65 at 5 minutes and 0.5 at 30 minutes, with ECE below 10%. In general, the manuscript is well-structured, the approach is innovative in emphasizing calibration for probabilistic lightning nowcasting without radar data, and the topic is highly relevant for natural hazards research, particularly in aviation and thunderstorm impacts. However, I have some concerns regarding the scope, comparisons, and generalizability.
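As a purely illustrative aside on the input formatting summarized above, the multi-source data can be pictured as a (time, channel, height, width) array. A minimal NumPy sketch, assuming hypothetical per-timestep 2-D fields already regridded to a common grid (not the authors' preprocessing code):

```python
import numpy as np

def build_input_sequence(abi_bt, glm_groups, gfs_li, gfs_rh):
    """Stack the four sources into a (T, C, H, W) array of past observations,
    the kind of spatio-temporal sequence fed to the network.
    Each argument is a list of 2-D fields, one per timestep, on the same grid."""
    frames = []
    for bt, glm, li, rh in zip(abi_bt, glm_groups, gfs_li, gfs_rh):
        frames.append(np.stack([bt, glm, li, rh], axis=0))  # (C, H, W)
    return np.stack(frames, axis=0)  # (T, C, H, W)
```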
My main concern is the limited scope and potential lack of generalizability of the dataset and results. The data is restricted to winter mornings (00:00-05:00 UTC, December-February) from 2020-2023, covering only 154 days with a balanced split of stormy and non-stormy periods. While this controls for variability, it may not capture seasonal, diurnal, or regional differences in thunderstorm dynamics (e.g., summer afternoons or other global hotspots). The study area is narrowed to a subset of CONUS, but no sensitivity analysis is provided for other regions. A discussion on how these choices affect broader applicability, perhaps with preliminary tests on extended data, would strengthen the contribution.
My second concern is the benchmarking and novelty assessment. The model is compared to ConvLSTM, PredRNN, persistence, and U-Net, showing superior F1 and calibration scores. However, it lacks direct comparison to recent lightning-specific DL models from the literature, such as those in Brodehl et al. (2022), Geng et al. (2021), or Leinonen et al. (2023), which also use satellite/radar data for nowcasting. While the intentional exclusion of radar data is well-justified for enhancing applicability to aircraft flight paths where radar coverage may be limited or absent, discussing how the proposed method might compare to radar-inclusive baselines would better contextualize its advantages and limitations.
Other comments
L90-95: Clarify why the smaller area (red rectangle in Fig. 1) was chosen beyond computational cost; does it represent typical thunderstorm regimes?
Fig. 2: Add coordinate axes (latitude/longitude) to subfigure (b) to match (a) for consistency and better spatial context.
L164-165: The effective training/testing area is further cropped to 256x256 pixels (17.3°N–37.7°N, 93°W–72°W) from the subselected red rectangle; consider adding this cropped boundary as an inner rectangle in Fig. 1 for clarity.
L175-180: The input sequence (6 timesteps) is justified by a comparative study, but I suggest including a table or figure summarizing F1 scores for 2/4/6/8 timesteps to support this.
L305-310: The example in Fig. 9 misses only 5 % of the lightning, but it is not clear which threshold is used in this case.