Predicting thunderstorm risk probability at very short time range using deep learning
Abstract. Forecasting electrical activity within the atmosphere remains one of the most challenging prediction tasks, especially due to the chaotic nature of thunderstorms. Lightning strikes are precisely located and occur very quickly, which makes this task particularly difficult. Additionally, these phenomena pose a significant risk to aviation, as they statistically strike each aircraft more than once per year. Over the years, several techniques have been employed for very short-term lightning forecasting (less than one hour ahead, at five-minute intervals), such as observation-based methods and, more recently, deep learning methods. Previous studies often face difficulties in accurately forecasting lightning probability, and even with AI-driven methods, it is still difficult to obtain calibrated outputs. To address this limitation, we propose a methodology that successfully predicts lightning risk using Convolutional Neural Networks (CNNs) with attention mechanisms. The network is fed with satellite observations and Numerical Weather Prediction (NWP) outputs formatted as a spatio-temporal sequence. Results show an F1 score of 0.65 for 5-minute predictions and 0.5 for 30-minute predictions, with a very low Expected Calibration Error (ECE) of less than 10 %. Thanks to the well-calibrated outputs, risk probability maps can be plotted, showing areas with strong to low chances of electrical activity.
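For context on the calibration metric quoted above: the Expected Calibration Error compares predicted probabilities with observed lightning frequencies over probability bins. A minimal NumPy sketch of this reliability-style computation (illustrative only, not the authors' evaluation code; the equal-width binning is an assumption) is:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Reliability-style ECE: per probability bin, compare the mean predicted
    probability with the observed lightning frequency, weighted by bin size."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            conf = probs[in_bin].mean()   # mean forecast probability in the bin
            freq = labels[in_bin].mean()  # observed frequency of lightning pixels
            ece += in_bin.mean() * abs(freq - conf)
    return ece
```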
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-2893', Anonymous Referee #1, 02 Oct 2025
AC1: 'Reply on RC1', Mélanie Bosc, 06 Nov 2025
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-2893/egusphere-2025-2893-AC1-supplement.pdf
RC2: 'Comment on egusphere-2025-2893', Anonymous Referee #2, 19 Nov 2025
This work is a novel application of AI to lightning forecasts and has significant potential for impact.
The authors successfully adapt the ED-DRAP (Encoder-Decoder Deep Residual Attention Prediction) network, demonstrating that its architecture, particularly the spatial and sequential attention mechanisms, is superior to other spatio-temporal models like ConvLSTM and PredRNN for this specific task. The introduction of a composite loss function combining Cross-Entropy and Dice Loss, with an optimally tuned parameter, effectively addresses the severe class imbalance (only 1% lightning pixels on average) and contributes to the excellent calibration scores.
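To make the composite objective concrete for readers, here is a minimal PyTorch-style sketch of a weighted Cross-Entropy plus Dice loss; the weighting parameter `alpha`, the tensor shapes, and the default values are assumptions for illustration, not the authors' exact implementation or tuned parameter:

```python
import torch
import torch.nn.functional as F

def combined_ce_dice_loss(logits, targets, alpha=0.5, eps=1e-6):
    """Weighted sum of binary cross-entropy and Dice loss.
    logits, targets: tensors of shape (batch, H, W); targets are 0/1 lightning
    masks as floats. alpha (hypothetical here) balances the two terms."""
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets)
    intersection = (probs * targets).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * intersection + eps) / (denom + eps)
    return alpha * ce + (1.0 - alpha) * dice.mean()
```

The Dice term directly rewards overlap with the sparse lightning mask, which is why such combinations are commonly used when the positive class covers only a small fraction of pixels.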
Overall, the work is well done, but I have some concerns which, if addressed, could strengthen the results of the paper:
1) Since the models were trained separately for each forecast horizon, there can be concerns about incoherent forecasts between different forecast horizons. The authors should provide some discussion or visualizations of how the forecasts look between different timestamps.
2) Was the evaluation dataset fixed across the different models per forecast horizon, or was the 30% chosen separately for each horizon?
3) The training / evaluation dataset seems quite small; this also shows in the results, as they are quite jumpy from one forecast horizon to another. I wonder if there was any overfitting due to this as well.
4) To overcome the concerns around a small training / validation dataset, it might be interesting to see if the results generalize to a different part of CONUS, likely keeping the latitude boundaries the same but shifting the longitude bounding box further to the west. If the model trained over the Gulf of Mexico yields good evaluation results over a different region, the conclusions would be more robust.
5) It would be good to discuss the results separated by the diurnal cycle and any peaks at particular hours of the day.
6) The authors state they selected the 13th band of the ABI sensor (infrared at 10.3 µm) because it is "more sensitive to cloud classification". While this band's Brightness Temperature (BT) is correlated with high cloud tops (cumulonimbus), the argument for selecting only this single band out of 16 is not fully explored. The addition of other relevant channels (e.g., water vapor channels) could provide complementary information about the atmospheric column. The authors could at least outline any restrictions they faced in incorporating other bands.
7) The authors' use of NWP data is not entirely clear with regard to which initialization / forecast time is fed as input into the model. The authors state: "Specifically, the following configuration was adopted: 00:00 UTC forecasts were applied from 00:00 UTC to 01:30 UTC, 03:00 UTC forecasts from 01:30 UTC to 04:30 UTC, and 06:00 UTC forecasts from 04:30 UTC to 05:00 UTC".
My questions are: (a) How will this work in real time, since the forecasts initialized at 06:00 UTC appear to be applied to valid times before their initialization? (b) How will the operational latencies of GFS impact performance?
8) In Figure 11, it would be more useful to have a PR curve for a few forecast horizons instead of two different figures for precision and recall, with the impact of choosing different thresholds plotted on the curve. That would make the tradeoff much easier to understand.
9) The authors state that they use a 0.05 threshold to plot the risk probability map since they want high recall, but that can lead to a very low precision. I think a more robust explanation of the chosen thresholds and their impact on the metrics should be provided; see the threshold-sweep sketch after this list.
10) In Figure 11(a) and (b) the results for precision and recall jump quite a bit across the different horizons, sometimes lower and sometimes higher than for the other models. It is actually unclear whether the model truly performs better than the others. In 11(c) the ED-DRAP model actually performs worse than the others for the first 30 minutes and better afterwards. I think it would help to report more metrics here to better understand the performance at earlier horizons across the different baselines, and perhaps to visualize the probability maps for the different models.
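To make the threshold/metric tradeoff raised in points 8 and 9 concrete, a generic sweep over decision thresholds would produce the precision/recall pairs that a PR curve plots. This is a minimal NumPy sketch with hypothetical probability and label arrays, not tied to the authors' data or code:

```python
import numpy as np

def precision_recall_sweep(probs, labels, thresholds=np.linspace(0.05, 0.95, 19)):
    """Precision/recall pairs over a range of decision thresholds,
    i.e. the points one would place on a PR curve."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels).astype(bool).ravel()
    curve = []
    for thr in thresholds:
        pred = probs >= thr
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        precision = tp / (tp + fp) if (tp + fp) else np.nan
        recall = tp / (tp + fn) if (tp + fn) else np.nan
        curve.append((thr, precision, recall))
    return curve
```

Plotting this curve for a few horizons would show directly how much precision is sacrificed when a low threshold such as 0.05 is chosen to maximize recall.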
Citation: https://doi.org/10.5194/egusphere-2025-2893-RC2
Data sets
Availability of GOES-R data T. Schmit et al. https://www.ncei.noaa.gov/products/goes-terrestrial-weather-abi-glm
Availability of GFS data G. White et al. http://doi.org/10.5065/D65D8PWK
Review of “Predicting thunderstorm risk probability at very short time range using deep learning”
The preprint proposes a deep learning methodology for very short-term (5-60 minutes) probabilistic forecasting of lightning risk, motivated by aviation safety within the ALBATROS project. It adapts the ED-DRAP neural network, incorporating spatio-temporal sequences from satellite (GOES-16 ABI brightness temperature and GLM lightning groups) and NWP (GFS lifted index and relative humidity) data over a region centered on the Gulf of Mexico and Florida. A key focus is on achieving well-calibrated outputs through a combined cross-entropy and Dice loss function, enabling interpretable risk probability maps. Results report F1 scores of 0.65 at 5 minutes and 0.5 at 30 minutes, with ECE below 10%. In general, the manuscript is well-structured, the approach is innovative in emphasizing calibration for probabilistic lightning nowcasting without radar data, and the topic is highly relevant for natural hazards research, particularly in aviation and thunderstorm impacts. However, I have some concerns regarding the scope, comparisons, and generalizability.
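As a purely illustrative aside on the input formatting summarized above, the multi-source data can be pictured as a (time, channel, height, width) array. A minimal NumPy sketch, assuming hypothetical per-timestep 2-D fields already regridded to a common grid (not the authors' preprocessing code):

```python
import numpy as np

def build_input_sequence(abi_bt, glm_groups, gfs_li, gfs_rh):
    """Stack the four sources into a (T, C, H, W) array of past observations,
    the kind of spatio-temporal sequence fed to the network.
    Each argument is a list of 2-D fields, one per timestep, on the same grid."""
    frames = []
    for bt, glm, li, rh in zip(abi_bt, glm_groups, gfs_li, gfs_rh):
        frames.append(np.stack([bt, glm, li, rh], axis=0))  # (C, H, W)
    return np.stack(frames, axis=0)  # (T, C, H, W)
```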
My main concern is the limited scope and potential lack of generalizability of the dataset and results. The data is restricted to winter mornings (00:00-05:00 UTC, December-February) from 2020-2023, covering only 154 days with a balanced split of stormy and non-stormy periods. While this controls for variability, it may not capture seasonal, diurnal, or regional differences in thunderstorm dynamics (e.g., summer afternoons or other global hotspots). The study area is narrowed to a subset of CONUS, but no sensitivity analysis is provided for other regions. A discussion on how these choices affect broader applicability, perhaps with preliminary tests on extended data, would strengthen the contribution.
My second concern is the benchmarking and novelty assessment. The model is compared to ConvLSTM, PredRNN, persistence, and U-Net, showing superior F1 and calibration scores. However, it lacks direct comparison to recent lightning-specific DL models from the literature, such as those in Brodehl et al. (2022), Geng et al. (2021), or Leinonen et al. (2023), which also use satellite/radar data for nowcasting. While the intentional exclusion of radar data is well-justified for enhancing applicability to aircraft flight paths where radar coverage may be limited or absent, discussing how the proposed method might compare to radar-inclusive baselines would better contextualize its advantages and limitations.
Other comments
L90-95: Clarify why the smaller area (red rectangle in Fig. 1) was chosen beyond computational cost; does it represent typical thunderstorm regimes?
Fig. 2: Add coordinate axes (latitude/longitude) to subfigure (b) to match (a) for consistency and better spatial context.
L164-165: The effective training/testing area is further cropped to 256x256 pixels (17.3°N–37.7°N, 93°W–72°W) from the subselected red rectangle; consider adding this cropped boundary as an inner rectangle in Fig. 1 for clarity.
L175-180: The input sequence (6 timesteps) is justified by a comparative study, but I suggest including a table or figure summarizing F1 scores for 2/4/6/8 timesteps to support this.
L305-310: The example in Fig. 9 misses only 5 % of the lightning, but it is not clear which threshold is used in this case.