Ensemble Random Forest for Tropical Cyclone Tracking
Abstract. Although tropical cyclones (TCs) are well documented during the intense part of their life cycle until they start to weaken, many of the physical and statistical properties governing them are not well captured by gridded reanalyses or simulated by Earth system models. The tracking of TCs therefore remains a matter of interest for the investigation of observed and simulated tropical cyclones. Two types of cyclone tracking schemes are available. On the one hand, there are trackers that rely on physical and dynamical properties of TCs and on user-prescribed thresholds, which makes them rigid; they also need numerous variables that are not always available in models. On the other hand, there are trackers based on deep learning which, by nature, need large amounts of data and computing power. Moreover, given the number of physical variables needed for the tracking, they can be prone to overfitting, which hinders their transferability to climate models. In this study, the ability of a Random Forest (RF) approach to track TCs with a limited number of aggregated variables is explored. The tracking is thus treated as a binary supervised classification problem separating TC-free (zero) from TC (one) situations. Our analysis focuses on the Eastern North Pacific and North Atlantic basins, for which, respectively, 514 and 431 observed tropical cyclone track records are available from the IBTrACS database over the 1980–2021 period. For each 6-hourly time step, the RF associates TC occurrence or absence (1 or 0) with atmospheric situations described by predictors extracted from the ERA5 reanalysis. Situations with TC occurrences are then joined to reconstruct TC trajectories. Results show the ability and performance of this method for tracking tropical cyclones over both basins, as well as good temporal and spatial generalization. The RF achieves a TC detection rate similar to that of trackers based on TC properties, with a significantly lower false alarm rate. The RF detects TC situations for a range of predictor combinations, which brings more flexibility than threshold-based trackers. Finally, this study sheds light on the most relevant variables for detecting tropical cyclones.
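For illustration only, the snippet below sketches the binary TC / TC-free classification framing described in the abstract, using scikit-learn; the predictor names, sample size, and class balance are hypothetical stand-ins, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code) of framing 6-hourly atmospheric situations as a
# binary TC (1) / TC-free (0) classification problem with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical design matrix: one row per 6-hourly box, columns are aggregated predictors
# (e.g. minimum MSLP, maximum relative vorticity, mean column water vapour, mean thickness).
X = rng.normal(size=(10_000, 4))
y = (rng.random(10_000) < 0.04).astype(int)   # ~4% of situations contain a TC

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Probability of a TC situation; a cutoff (50% in the paper) turns it into a detection,
# and detected situations are then joined into trajectories.
p_tc = rf.predict_proba(X)[:, 1]
detected = p_tc > 0.5
```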
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-252', Anonymous Referee #1, 05 May 2025
- AC1: 'Reply on RC1', Pradeebane Vaittinada Ayar, 21 Jul 2025
RC2: 'Comment on egusphere-2025-252', Anonymous Referee #2, 20 May 2025
Summary:
This study uses an ensemble of random forests (ERFs) to identify and track tropical cyclones (TCs) within ERA5 in the North Atlantic and East Pacific basins. The identified TCs and tracks are compared with observations taken from IBTrACS, and the ERF performance is compared with the Tempest Extreme tracking algorithm. Overall, the authors demonstrate that the ERF performs well in identifying and tracking observed TCs (high probability of detection and low false alarm ratio).
Beyond simply demonstrating that the ERF “works”, the authors also nicely examined the characteristics of false alarms and misses. The authors found that missed TCs and false alarms were generally associated with short-duration storms of marginal tropical storm intensity. In addition, the authors examined which of the chosen predictors from ERA5 had the largest Gini-based feature importance and the contribution of each to the outcome of the random forest prediction using SHAP values.
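For readers unfamiliar with these tools, a generic illustration (not the authors' code) of Gini-based importances and SHAP values for a fitted random forest:

```python
# Generic illustration of the two interpretation tools mentioned above: Gini-based feature
# importances and SHAP values for a random forest classifier (synthetic data throughout).
import numpy as np
import shap  # third-party package: pip install shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))                    # hypothetical aggregated predictors
y = (X[:, 0] + rng.normal(size=2_000) > 1).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity ("Gini") importance, one value per predictor.
gini_importance = rf.feature_importances_

# SHAP values: per-sample, per-feature contributions to the predicted probability.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)             # typically one array per class
```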
I personally found the manuscript an interesting and useful application of ERFs. I particularly appreciated the authors' discussion of the misses, false alarms, and predictors that most informed the random forest outcome. I also believe the manuscript can be further improved through both the comments below and more careful editing of the spelling and grammar within the text. I am specifically interested in encouraging the authors to more carefully consider the probability provided by the ERFs using traditional ensemble verification methods such as the Brier skill score and ROC diagrams. It would also be of interest to better understand whether the characteristics of the misses and false alarms from the ERF and Tempest Extreme exhibit any noteworthy differences in location, intensity, duration, or environmental conditions. Overall, I believe this study is worthy of publication after addressing the comments below.
Specific Comments:
- One of the main benefits of the ERFs is the probabilities provided. I wish the authors had examined this in more detail. I recommend the authors reconsider the use of a strict threshold, greater than 50% probability, as defining a TC event. There is no requirement for this to be the cutoff and the authors may wish to explore alternative thresholds. Furthermore, the authors may wish to examine the reliability of the ERFs by assessing whether the spread correctly represents the forecast uncertainty by examining the spread-error ratio. On average the ensemble spread should be equal to the error. In addition, I suggest the authors examine reliability diagrams which compare the forecasted probability with the observed frequency, ROC diagrams, and the Brier skill score. Each of these analyses will help determine the benefits of the probability provided by the ERFs and may reveal weaknesses of the ensemble design. (A sketch of these verification measures follows this comment list.)
- I struggled to fully understand the details of the calibration, validation, and test experiments (L174-182). I am still a bit confused by the overlap between the calibration, validation, and testing periods. It appears from point 3 that the whole 1980-2021 period is used for testing. This is not a fair testing dataset, as the ERF was also trained using much of this same period. I believe the authors should perform testing using an entirely new period that was not used during training.
- The authors also mentioned a potential change in the quality of IBTrACS with time in motivating their choice of validation data (every 6 years). I am curious if the authors tested how the performance of the ERF would change if they trained on an earlier period and then tested on a more recent period. This would be interesting for several reasons, including serving as an “easier” initial test for the potential application to future climate simulations the authors mention in the summary section. (A temporal hold-out sketch follows this comment list.)
- 300 km (L193) appears to be a generous threshold for the distance between an observed and identified TC to be considered a hit. This value is still probably small enough that it is identifying the same storm but large enough that the center location may be off by the approximate size of the TC. Why was this value chosen and how sensitive are the results to this threshold? (A distance-matching sketch follows this comment list.)
- It would be helpful to readers to define each of the predictors from ERA5 within a table in the supplementary material.
- I am interested in understanding if the characteristics of the false alarms and misses are similar with the ERFs and Tempest Extremes. I suggest the authors recreate Figures 5, 6, and 7 for Tempest Extreme within the supplemental figures. This analysis may help identify strengths and weaknesses of each approach.
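To make the probability comment above concrete, here is a rough sketch of the suggested probabilistic verification (Brier skill score, ROC, reliability, spread-error ratio); the labels and member probabilities are synthetic placeholders, not ERF output.

```python
# Sketch of probabilistic ensemble verification with synthetic stand-ins for the observed
# 0/1 TC labels and the per-member ERF probabilities.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_obs = (rng.random(5_000) < 0.05).astype(int)            # observed TC occurrence (0/1)
p_members = np.clip(y_obs[:, None] * 0.6 + rng.normal(0.2, 0.15, (5_000, 100)), 0, 1)
p_mean = p_members.mean(axis=1)                           # ensemble-mean probability

bs = brier_score_loss(y_obs, p_mean)                      # Brier score
bs_clim = brier_score_loss(y_obs, np.full_like(p_mean, y_obs.mean()))
bss = 1.0 - bs / bs_clim                                  # Brier skill score vs. climatology

auc = roc_auc_score(y_obs, p_mean)                        # area under the ROC curve

# Reliability diagram: observed frequency vs. mean forecast probability in bins.
obs_freq, fcst_prob = calibration_curve(y_obs, p_mean, n_bins=10)

# Simple spread-error check: on average the ensemble spread should match the error.
spread = p_members.std(axis=1).mean()
error = np.abs(p_mean - y_obs).mean()
spread_error_ratio = spread / error
```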
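Likewise, a minimal sketch of the temporal hold-out raised in the IBTrACS-quality comment: train on an earlier period and test on a strictly later one (all arrays are synthetic placeholders).

```python
# Temporal hold-out sketch: no overlap between training and testing years.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
year = rng.integers(1980, 2022, size=10_000)              # year of each 6-hourly sample
X = rng.normal(size=(10_000, 4))                          # aggregated predictors
y = (rng.random(10_000) < 0.05).astype(int)               # TC / TC-free labels

train = year <= 2009
test = year >= 2010                                       # strictly later period

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[train], y[train])
p_test = rf.predict_proba(X[test])[:, 1]
```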
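Finally, a sketch of the hit-matching criterion questioned in the 300 km comment; the helper names are hypothetical, and the threshold can be varied to test sensitivity.

```python
# A detected TC centre counts as a hit if it lies within max_dist_km (300 km in the
# manuscript) of an observed centre, measured as a great-circle distance.
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r_earth=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r_earth * np.arcsin(np.sqrt(a))

def is_hit(obs_centre, det_centre, max_dist_km=300.0):
    return haversine_km(*obs_centre, *det_centre) <= max_dist_km

# Example: observed centre at (25.0N, -75.0E), detected centre at (26.5N, -76.0E).
print(is_hit((25.0, -75.0), (26.5, -76.0)))   # True for the 300 km threshold
```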
Technical edits:
L2: change “evanesce” to “weaken”
L37-38: Another tracking algorithm the authors may be interested in referencing here is TRACK (Hodges, 1994). This algorithm differs from others in that it is more general: it tracks all vorticity maxima and only later filters out TCs using a warm-core threshold.
Hodges, K. I., 1994: A General Method for Tracking Analysis and Its Application to Meteorological Data. Mon. Wea. Rev., 122, 2573–2586, https://doi.org/10.1175/1520-0493(1994)122<2573:AGMFTA>2.0.CO;2.
L144: The authors should more carefully describe what is meant by “standardized”. This is important for the reproducibility.
L252: I suggest the authors replace “different subsampling of zeros” with language more physically intuitive.
Figure 6: The layout of the figure panels in Figure 6 are a bit confusing. I was repeatedly confusing panels (a) and (b). I suggest revising the layout to avoid this.
L311: remove “basin”
L314-315: What is the basis for this hypothesis?
L323: Change “with” to “which”.
L325: Change “since they are associated with the strong surface winds and the location of the cyclone eye, respectively”.
L332-333: A transition would be helpful here to emphasize the different information provided by each of these analyses.
L362-362: I suggest splitting this into two sentences. Ending the first sentence after “literature”.
L390-391: The end of this sentence, “indicating us to be…” should be revised.
Citation: https://doi.org/10.5194/egusphere-2025-252-RC2
- AC2: 'Reply on RC2', Pradeebane Vaittinada Ayar, 21 Jul 2025
Viewed
| HTML | PDF | XML | Total | Supplement | BibTeX | EndNote |
|---|---|---|---|---|---|---|
| 823 | 92 | 26 | 941 | 57 | 23 | 54 |
Review of “Ensemble Random Forest for Tropical Cyclone Tracking”
Overview
This work applies Random Forest (RF) models to track tropical cyclones using environmental variables from a global reanalysis (ERA5), with an eventual goal of using the RF tracker in long-running climate simulations. The Eastern Pacific and North Atlantic TC basins were chosen for investigation. Random Forests were trained by categorizing localized boxed regions in each basin as either containing a TC or not (TC-free) and associating statistics of environmental variables in each box from ERA5 with the binary events. Mean sea level pressure, relative vorticity, column water vapor, and thickness were used as they represent different facets of the physical mechanisms governing TCs. Statistics are computed for these variables and included as inputs during RF training.
Training is conducted with 6-fold cross-validation to generate a range of RF solutions that are then used to compute MCC, POD, and FAR over a series of subsampling experiments – the authors note a significant proportion of their samples are TC-free compared to TC samples. Generally, a ratio of 25-1 is seen as reasonable with POD and FAR tradeoffs as the ratio is increased/decreased. Detection skill is notably better than the baseline UZ method in both basins. Further investigation of skill suggests the model primarily misses TCs at low intensity and low duration. The authors also devise analyses to interpret physical meaning, although I have some comments on this aspect of the analysis below.
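For reference, the three scores discussed above can be written out from a 2x2 contingency table as follows (variable names are generic, not the authors').

```python
# MCC, POD, and FAR from hits (tp), false alarms (fp), misses (fn), correct negatives (tn).
import numpy as np

def detection_scores(tp, fp, fn, tn):
    pod = tp / (tp + fn)                          # probability of detection (hit rate)
    far = fp / (tp + fp)                          # false alarm ratio
    mcc = (tp * tn - fp * fn) / np.sqrt(          # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return pod, far, mcc

print(detection_scores(tp=420, fp=60, fn=80, tn=24_000))
```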
Overall, the authors have employed RFs in a unique and potentially innovative application area to track TCs in global reanalyses. The manuscript could benefit from improved grammar and clarity in places, along with consideration of additional analyses or methods to improve the scientific presentation. I look forward to seeing a revised manuscript after careful revision.
Comments
McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the Black Box More Transparent: Understanding the Physical Implications of Machine Learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
Technical Edits and Questions
Generally: the authors should spend a substantial amount of time proof-reading the document for lingering grammar issues.
Line 48: Change to “this study focuses on data-driven algorithms using machine learning”. Sometimes “so-called” can have a negative/inappropriate connotation, which I don’t believe was your intent.
Lines 93-96: While I understand it is a long-held tradition to include a “table of contents paragraph” in this manner, you can remove this paragraph – it has no particular value for readers. The scientific structure of manuscripts has remained unchanged for decades and every reader knows that methods will come next, results afterward, and so on. If a reader is interested in a particular section, they can seek out the section header to know what is contained within.
Line 99: Remove this single line
Line 103: Remove “cyclonic” – seasons are not “cyclonic”. Alternatively, can adjust to “cyclone seasons”
Line 106: “Track records that do not provide”
Lines 106-107: If a TC undergoes extratropical transition, how is the transition from TC to extratropical TC handled? Also, how is TC demise to depression stage handled? Only TC achievement is mentioned here (i.e., genesis).
Line 131: Moisture is misspelled
Lines 136-138: The description here appears to contain two statements in conflict with one another. First, the text says that for every box a vector of ones and zeros is constructed: is this for every grid point in the box? The next sentence says the box is encoded as a 1 or 0. Some additional clarity, and perhaps rewording of these sentences, is needed to clarify the approach. I suspect it is the latter, but the wording is a bit confusing.
Line 134: Why are the boxes not immediately adjacent to one another? Could a TC be missed if it lies outside of the boxes in the white areas of Figure 1?
Lines 139-140: What is the motivation for synthesizing the ERA5 data in the boxes to single-statistic values? Other works have used spatial regions to encode relevant spatial relationships into RFs (see Hill et al. 2020, 2021, 2023, 2024, Schumacher et al. 2021) and have had tremendous success, including deducing how those spatially oriented data contribute to forecast skill (Mazurek et al. 2025). Others tackling severe weather hazards have taken a synthesizing approach too (see Clark and Loken 2022, Loken et al. 2022). Were there any tests that also included the full box of ERA5 data to demonstrate that the single-value statistics were a better methodological choice? (A sketch of this single-statistic encoding follows the references below.)
Loken, E. D., A. J. Clark, and A. McGovern, 2022: Comparing and Interpreting Differently Designed Random Forests for Next-Day Severe Weather Hazard Prediction. Wea. Forecasting, 37, 871–899, https://doi.org/10.1175/WAF-D-21-0138.1.
Clark, A. J., and E. D. Loken, 2022: Machine Learning–Derived Severe Weather Probabilities from a Warn-on-Forecast System. Wea. Forecasting, 37, 1721–1740, https://doi.org/10.1175/WAF-D-22-0056.1.
Mazurek, A. C., A. J. Hill, R. S. Schumacher, and H. J. McDaniel, 2025: Can Ingredients-Based Forecasting Be Learned? Disentangling a Random Forest’s Severe Weather Predictions. Wea. Forecasting, 40, 237–258, https://doi.org/10.1175/WAF-D-23-0193.1.
Hill, A. J., R. S. Schumacher, and M. R. Green, 2024: Observation Definitions and their Implications in Machine Learning-based Predictions of Excessive Rainfall. https://doi.org/10.1175/WAF-D-24-0033.1.
Hill, A. J., R. S. Schumacher, and I. L. Jirak, 2023: A new paradigm for medium-range severe weather forecasts: probabilistic random forest-based predictions. https://doi.org/10.1175/WAF-D-22-0143.1.
Hill, A. J., and R. S. Schumacher, 2021: Forecasting excessive rainfall with random forests and a deterministic convection-allowing model. https://doi.org/10.1175/WAF-D-21-0026.1.
Schumacher, R. S., A. J. Hill, M. Klein, J. Nelson, M. Erickson, S. M. Trojniak, and G. R. Herman, 2021: From random forests to flood forecasts: A research to operations success story. https://doi.org/10.1175/BAMS-D-20-0186.1.
Hill, A. J., G. R. Herman, and R. S. Schumacher, 2020: Forecasting severe weather with random forests. https://doi.org/10.1175/MWR-D-19-0344.1.
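As a point of reference for the Lines 139-140 comment, a sketch of what such a single-statistic encoding might look like (the variable names and chosen statistics are illustrative assumptions, not taken from the manuscript):

```python
# Each box of ERA5 grid points is collapsed to a few summary values per variable before
# being fed to the random forest.
import numpy as np

def box_features(mslp_box, vort_box, tcwv_box, thick_box):
    """Aggregate 2-D arrays of grid points inside one box into scalar predictors."""
    return np.array([
        mslp_box.min(),     # deepest mean sea level pressure in the box
        vort_box.max(),     # strongest relative vorticity
        tcwv_box.mean(),    # mean total column water vapour
        thick_box.mean(),   # mean layer thickness
    ])

# The alternative raised in the comment would instead pass the full box, e.g.
# np.concatenate([mslp_box.ravel(), vort_box.ravel(), ...]), so the forest can
# exploit spatial structure directly.
```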
Lines 147-148: This sentence is not needed – can be removed. All of this information is contained in the section headers.
Line 174-175: To be consistent with both machine learning and atmospheric science literature, the “calibration” phase should be referred to as the “training” phase of the ERF. Then, you use cross-validation to validate the trained model on withheld periods – you don’t use those withheld periods to “calibrate” the models.
Line 188: Should RF actually be ERF?
Line 188: Did you consider alternative probability thresholds (beyond just 50%) to assign detected tracks (D)?
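As a sketch of what such a threshold exploration could look like (synthetic labels and probabilities stand in for ERF output):

```python
# Sweep the probability cutoff used to declare a detection and trace the POD/FAR trade-off.
import numpy as np

rng = np.random.default_rng(0)
y_obs = (rng.random(5_000) < 0.05).astype(int)
p_tc = np.clip(y_obs * 0.5 + rng.random(5_000) * 0.5, 0, 1)

for thr in np.arange(0.1, 0.91, 0.1):
    det = p_tc > thr
    tp, fp = np.sum(det & (y_obs == 1)), np.sum(det & (y_obs == 0))
    fn = np.sum(~det & (y_obs == 1))
    pod = tp / (tp + fn)
    far = fp / max(tp + fp, 1)
    print(f"threshold={thr:.1f}  POD={pod:.2f}  FAR={far:.2f}")
```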
Lines 251-253: This text is best reserved for the figure caption – please move there if not already. This text is just describing the figure, not the science.
Figure 3: It would be good to see the full distribution of MCC scores for the 100 RFs plotted as error bars, akin to a 95% confidence interval. Are the MCC values truly statistically indistinguishable? (It is hard to tell, but maybe this detail is plotted as light blue lines? If so, please try to make these lines clearer so they can be discerned, and provide a description in the figure caption.)
Lines 273-274: What is meant by “calibration experiments”? Are you just evaluating the model’s ability to detect storms over the testing period for which it was trained? It is to be expected that POD will be high and FAR low.
Line 283-284: Isn’t a missed track by definition lower probability? Aren’t hits/misses defined by probabilities greater than or less than 50%? These box plots in Figure 5b are being more or less constrained by the methods used, and don’t necessarily provide much scientific reasoning for “FA are less likely to happen than hits”. The authors should reconsider the usefulness of this analysis in regard to their methodological choices.
Lines 320-322: As mentioned earlier, they are also prescribed by the authors, so these results are not extremely surprising. See major comment above.
Lines 348-349: This information is once again best reserved for the figure caption.
Figure 10: This is an excellent figure that clearly demonstrates how the RFs are learning the relevance of each predictor to drive the yes/no predictions.