the Creative Commons Attribution 4.0 License.
Investigating the drivers of wintertime Southern Ocean sea-ice leads using random forest algorithms
Abstract. Sea-ice leads play a crucial role in regulating ocean-atmosphere energy exchange, yet the physical drivers controlling their variability across the Southern Ocean remain unquantified. This study uses a machine learning approach based on random forest regression with permutation importance analysis to identify predictors of Southern Ocean lead frequency during winters (April–September, 2003–2023). The model integrates nine predictors representing atmospheric (wind speed, wind divergence, sea-level pressure, 2 m temperature), oceanic (surface current speed), and sea-ice kinematic variables (ice velocity, divergence, concentration), together with a seasonal descriptor (month). Evaluated on independent test data, the model achieves a correlation of r = 0.70 at the pan-Antarctic scale and r = 0.68–0.78 across regional sectors (MAE = 0.016–0.024). Permutation analysis indicates that 2 m temperature (20 %), wind divergence (13 %), ice divergence (11.9 %), and ocean current speed (11.6 %) collectively explain approximately 57 % of the observed lead frequency variability. Regional analysis reveals sector-specific drivers: the Weddell Sea is controlled by wind and ice divergence; the Ross Sea exhibits contributions from air temperature, wind divergence, and ocean currents; the Indian and Pacific Ocean sectors show strong air temperature and ocean current influence; and the Bellingshausen–Amundsen Seas are dominated by seasonal wind forcing. However, the model does not fully resolve fine-scale structures evident in observations, so a notable portion of the lead frequency variance remains unexplained at the spatial resolution used in this study. This suggests the need for future work applying the random forest framework at higher spatial resolution to investigate small-scale regional lead hotspots, including bathymetrically controlled and coastal lead zones.
Status: final response (author comments only)
- RC1: 'Comment on egusphere-2026-514', Anonymous Referee #1, 05 Mar 2026
- RC2: 'Comment on egusphere-2026-514', Anonymous Referee #2, 09 Mar 2026
The manuscript presents a novel approach to determining the factors controlling the winter lead frequency in the Southern Ocean from a recent satellite product, using random forest regression with permutation importance analysis. Based on this analysis, the authors identify the main predictors of lead frequency for the whole Southern Ocean and separately for the different basins, and discuss the associated mechanisms. This provides very useful information and the study is thus of great interest. However, additional information would be required before publication, in particular on the methodology and on the impact of some of the modelling choices, as detailed below.
General points
- The manuscript should explain more explicitly the difference between lead frequency and ice concentration. The introduction discusses the impact of leads and openings within the ice pack in general, including studies that use ice concentration (from satellite products and models) as a measure of the surface area occupied by leads. Naively, one could expect lead frequency and ice concentration to be related. However, this does not seem to be the case, except in a few regions (e.g. Table 1). It would be useful if the authors explained why this naive view is not valid.
- It may be hard to evaluate the model performance from the absolute numbers that are provided. Would it be possible to compare the model performance to that of a much simpler model, assuming for instance a constant lead frequency in each basin or a climatological lead frequency at each point?
- The regions where the differences between observed and predicted lead frequency are largest are not easy to identify from Figure 1. From Figure 2, it is clear that the predictions are biased for large values of the lead frequency, but we cannot see in Figure 1 which regions this corresponds to. Would it be possible to plot the difference between observed and predicted lead frequency in Figure 1 in addition to the absolute values? This would facilitate the discussion. Additionally, it is mentioned several times in the manuscript that large biases are present close to topographic gradients, in bathymetrically-controlled and coastal zones, at the shelf break, etc., but this is not explicitly shown. A difference map in Fig. 1 may help identify those zones, but an additional diagnostic may be needed.
- The random forest model is based on monthly mean values at a relatively coarse resolution (2° latitude × 5° longitude grid). This can be justified to reduce the amount of data that has to be processed. However, lead opening and closing can occur on shorter timescales, and the link with some atmospheric variables could be obscured by monthly averages. For instance, the link between horizontal wind divergence and lead opening may be very strong during the passage of a storm but weak at lower frequency. If possible, repeating some of the diagnostics (such as Figure 2, and Figure 5 if possible) at two different resolutions would allow testing this. If this is too time-consuming, at least a discussion of the possible impacts of the resolution would be needed.
- The contribution of ocean processes to the explained lead frequency is relatively limited, but only one oceanic variable is used (ocean current). Why not use more oceanic variables (for instance the ocean current divergence, the oceanic equivalent of wind divergence)? Eddy kinetic energy could also be a good candidate, as the authors mention the role of eddies in the discussion, and eddy kinetic energy can be large in regions of rough topography and at the shelf break, where the model performance is apparently low. Another advantage is that, in contrast to individual eddies, which are at a smaller scale than the one used in the model, eddy kinetic energy can be interpolated to the scale of the model. Alternatively, if bathymetry is so important, it could also be a good predictor. This may be too much additional work for the current manuscript, but the possibility that the ocean's role is underestimated due to the limited number of variables used could be mentioned.
- The added value of the incremental predictor contribution (3.2) compared to the relative importance (3.3) is not clear. Does the relative contribution in the incremental method depend on the order in which the variables are added (while this is not the case for the relative importance)? Are there elements that are clear in 3.2 and not in 3.3? If so, this should be more clearly highlighted; alternatively, Section 3.2 could be removed or moved to the supplement.
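On the baseline question in point 2 above: such a comparison is cheap to set up. A minimal sketch with synthetic data (all shapes and values hypothetical; the real test would use the manuscript's gridded LF product and its train/test split) could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: lead frequency (LF) on a small set of grid points
# over several winters. Shape: (years, points); values in [0, 1].
n_years, n_points = 21, 50
lf = np.clip(rng.normal(0.05, 0.02, (n_years, n_points))
             + rng.normal(0.0, 0.01, n_points), 0, 1)

train, test = lf[:16], lf[16:]

# Baseline 1: a single constant LF (here for the whole domain; per basin
# in the reviewer's suggestion).
const_pred = np.full_like(test, train.mean())

# Baseline 2: the climatological LF at each grid point.
clim_pred = np.broadcast_to(train.mean(axis=0), test.shape)

mae_const = np.abs(test - const_pred).mean()
mae_clim = np.abs(test - clim_pred).mean()
print(f"constant-LF baseline MAE:    {mae_const:.4f}")
print(f"climatological baseline MAE: {mae_clim:.4f}")
```

Reporting the random forest's MAE alongside such baseline MAEs would make the quoted 0.016–0.024 values interpretable.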
Specific points.
- Line 32. The formulation could be read as implying that ice divergence is a purely atmospheric process.
- Lines 73-83. The introduction is nice and reviews the different processes well. However, the last paragraphs of the introduction are mainly a description of the work that has been done, while the specific motivation of the study is not yet clear; this could be developed more at this stage.
- Line 144. If this is a standard recommendation, adding a reference would be useful.
- Lines 257-258. The first sentence is very general and could be removed.
- Is Section 3.4 adding new information, or could it be moved to the supplement, for instance?
- Line 241. The discussion about whole-number percentages is confusing, since in Figure 5 and in most of the text the numbers are given with one decimal. Why not give all the numbers with one decimal? Same issue at line 311.
- It is not clear what is shown in Figure 8 and what is meant by sample-by-sample comparison. Is it the actual lead frequency and the predicted lead frequency at the same grid point? If so, what is the added information compared to Figure 7?
- Lines 396-399: Is there a good way to test the hypothesis of 2m temperature acting as proxy for offshore winds? Maybe by testing the correlation between temperature and wind speed/direction or something along those lines?
- Line 374. The previous paragraph states that 2 m air temperature is the main driver. How is this finding 'consistent with earlier work' showing that Antarctic leads tend to occur in areas influenced by bathymetric steering? Maybe 'this' refers to something else?
- Line 381. It would be useful to give the coordinates of Maud Rise and Gunnerus Ridge (or to show them on a figure), and maybe to show the biases of the model at those points.
- Line 471. The link with ‘bathymetric contrasts’ is not clear here.
- Lines 475-478: If r = 0.73 shows comparatively higher model skill, what is it in comparison to? And if the Ross Sea has many events on sub-monthly scales that the model cannot account for, should this not result in lower model skill?
- Line 506. Is this indirect dynamical effect due to the resolution of the reanalysis (and the spatial resolution of the model), which is not able to reproduce the winds and their divergence at the scale that matters for lead opening? (See also general point 4 above.)
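On the 2 m temperature proxy question raised above (comment on lines 396–399): the suggested correlation test is straightforward to implement. A minimal sketch with synthetic stand-in series (all values hypothetical; the real test would correlate the ERA-type T2m and wind fields used as predictors):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins: if 2 m temperature acts as a proxy for cold
# offshore winds, T2m should correlate with the offshore wind component.
n = 300
v_wind = rng.normal(0, 5, n)               # offshore wind component (m/s)
t2m = -1.2 * v_wind + rng.normal(0, 4, n)  # colder air with stronger offshore flow

r = np.corrcoef(t2m, v_wind)[0, 1]
print(f"r(T2m, offshore wind) = {r:.2f}")
```

A strong (here negative) correlation would support the proxy interpretation; a weak one would argue for a more direct thermodynamic role of T2m.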
Citation: https://doi.org/10.5194/egusphere-2026-514-RC2
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 216 | 239 | 20 | 475 | 50 | 62 |
Review comments
This manuscript applies a random forest regression framework to reconstruct and interpret wintertime Southern Ocean lead frequency (LF) using a gap-filled monthly lead-frequency product (Dubey et al., 2025a) together with a set of atmospheric, sea-ice kinematic, and ocean-current predictors.
The study of Dubey et al. (2025a) itself is highly valuable, and the dataset produced in that work represents an important contribution to the community. Compared with that work, however, the contribution of the present manuscript appears more limited and gives the impression of a preliminary or exploratory application of the method. The identified top predictors appear to be only apparent or indirect drivers, and the paper provides relatively limited new physical insight into the mechanisms controlling lead variability.
Using a random forest approach for this type of problem is potentially meaningful. However, for publication as a full paper, the study requires more in-depth analysis. The choice of predictors and the way they are treated should also be reconsidered. In particular, leads are small-scale phenomena, yet the analysis is conducted at a very coarse grid resolution (2° × 5°). In addition, further clarification and discussion are needed regarding the interpretation of the dominant predictor (2 m air temperature) and the regional classification based solely on longitude sectors.
For these reasons, I believe that substantial revision and additional analysis are required before the manuscript can be considered for publication in The Cryosphere.
Major comments
1. Need to address coastal polynyas explicitly
From reading the Introduction, the discussion includes not only leads but also the role of coastal polynyas. However, the term “coastal polynya” is not explicitly used in the manuscript, and leads appear to be treated in a way that implicitly includes coastal polynyas.
Although leads and coastal polynyas cannot always be strictly separated, the text should consistently refer to them as “leads and coastal polynyas…” if both are included in the analysis.
Furthermore, from examining Dubey et al. (2025a), it appears that coastal polynyas forming along ice shelves or along landfast ice may also be included in the lead counts. If this is the case, the manuscript should explicitly state that open water or thin ice regions forming along landfast ice, which are commonly referred to as coastal polynyas, are included in this study and treated as leads.
Some clarification on this point is necessary.
2. Differences between coastal and offshore regions, especially the role of landfast ice
Based on Dubey et al. (2025a), many leads appear to occur near the coast, particularly along the edge of landfast ice. In the present study, regional divisions are made only by longitude sectors. However, it seems reasonable to expect that lead variability differs between coastal regions and offshore pack ice. Despite this, no such regional distinction is made in the analysis. I suggest reconsidering the regional classification. In particular, leads near the coast are likely strongly influenced by the presence and variability of landfast ice. However, the manuscript contains no discussion of the relationship between leads and landfast ice. The landfast ice dataset of Fraser et al. (2020) is openly available and could be used in the analysis. Even if the dataset is not used directly, some discussion of the potential role of landfast ice would be appropriate.
Fraser, A. D., et al., 2020: High-resolution mapping of circum-Antarctic landfast sea ice distribution, 2000–2018. Earth System Science Data, 12, 2987–2999.
3. Causal interpretation of the dominance of 2 m air temperature
From a physical perspective, lower 2 m air temperature should promote freezing of open water and therefore act to close leads. However, the analysis indicates the opposite: lower air temperatures are associated with increased lead frequency, and this variable is identified as the top driver. The paper later argues that this reflects a proxy relationship with cold offshore winds. In other words, the predictor is not a direct physical driver but rather an indirect or apparent one. A conclusion in which an apparent or indirect predictor becomes the top driver unfortunately weakens the physical significance of the result. That said, such outcomes can occur in statistical analyses, and I do not dispute the result itself. However, in this case, the interpretation requires stronger support. The manuscript should provide clearer justification for this proxy interpretation and discuss whether it is possible to remove or isolate the apparent effect. As suggested in comment 2, separating coastal and offshore regions may help address this issue.
4. Gap between the spatial scale of leads and the analysis grid scale
As the authors themselves note, leads are small-scale phenomena. The original data used in the study have a spatial resolution of about 1 km², yet the analysis is conducted on a much coarser grid of 2° × 5°. The authors acknowledge that such aggregation smooths bathymetrically controlled hotspots and narrow coastal leads. Because many leads are controlled by coastal divergence, tides, and shelf-break dynamics, coarse resolution may bias the apparent importance of predictors away from kinematic drivers and toward broader thermodynamic patterns. Please discuss more explicitly how coarse gridding may suppress mechanical deformation signals and alter the apparent ranking of predictors. If possible, I would appreciate seeing a supplemental analysis at higher resolution for a subset region or time period to demonstrate the scale dependence of the results.
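One cheap way to probe the scale dependence asked for here is to block-average a fine-resolution field to successively coarser grids and track how much variance survives. A minimal sketch on a synthetic field (the real test would use the ~1 km LF product; all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "fine-resolution" lead-frequency field: a large-scale gradient
# plus small-scale structure standing in for narrow leads.
n = 256
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
fine = 0.05 * x + 0.03 * np.sin(40 * np.pi * y) * rng.random((n, n))

def block_average(field, k):
    """Coarsen a 2-D field by averaging k x k blocks."""
    m = field.shape[0] // k
    return field[:m * k, :m * k].reshape(m, k, m, k).mean(axis=(1, 3))

for k in (1, 8, 32):
    coarse = block_average(fine, k)
    frac = coarse.var() / fine.var()
    print(f"block size {k:3d}: variance retained = {frac:.2f}")
```

The fraction of variance lost to aggregation is a direct measure of how much lead signal the 2° × 5° grid cannot represent, and the same diagnostic could be repeated per sector.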
5. Unnecessary figures and analysis
Figures 4 and 9 should be removed. The relative importance of predictors is already shown in Figures 5 and 10, and these figures add little additional insight. Moreover, the results depend on the order in which predictors are added, and it is not clear why the present order was chosen. Removing these figures would not cause any essential loss to the paper. As noted elsewhere, there are several areas where additional analysis would be more useful. Therefore, these figures and the associated explanation should be removed in favor of more meaningful analysis.
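The order dependence noted here is easy to demonstrate, and the same experiment shows why permutation importance does not suffer from it. A small sketch with synthetic data (scikit-learn assumed; three stand-in features instead of the manuscript's nine predictors; all values hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic predictors: x0 and x1 drive the target, x2 is pure noise.
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one predictor at a time and measure the
# drop in test-set score. Independent of any predictor ordering.
pi = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importances:", np.round(pi.importances_mean, 3))

# Incremental contribution: refit with predictors added one at a time.
# The increments DO depend on the order in which predictors are added.
for order in ([0, 1, 2], [2, 1, 0]):
    scores = [RandomForestRegressor(n_estimators=100, random_state=0)
              .fit(X_tr[:, order[:k]], y_tr)
              .score(X_te[:, order[:k]], y_te) for k in range(1, 4)]
    print(f"order {order}: incremental R² = {np.round(scores, 3)}")
```

If the manuscript's incremental curves change materially under a different addition order, that would support dropping Figures 4 and 9 in favour of the permutation results alone.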
Minor comments
6. Description of the LF dataset
P3: “This study uses the monthly LF dataset by Dubey et al. (2025b) …”
It would be helpful to include a slightly more concise explanation of this dataset within the paper.
7. Use of climatological sea-ice concentration during AMSR-E/AMSR2 data gap
P4: “For April to June 2012 … we use the mean sea-ice concentration …”
Using climatological values during this period seems inappropriate. The analysis focuses on interannual variability, and therefore replacing missing data with climatology may distort the results.
Instead, it would be preferable either to use SSM/I data or to exclude this period from the analysis.
8. Reliability of the ocean current dataset
P4: “Ocean surface current speed data were obtained from ORAS5 …”
Ocean surface currents in the Southern Ocean are still poorly constrained. I am not convinced that this dataset reliably represents the variability of ocean currents on monthly and interannual timescales.
Datasets such as B-SOSE may be more appropriate. If the authors wish to use ORAS5, they should provide justification for its reliability in this context. Personally, I would suggest excluding this predictor rather than using highly uncertain data. Ocean surface currents are largely determined by winds and sea-ice conditions, which are already included as predictors.
9. Definition of ice divergence and wind divergence
Please clarify how ice divergence was calculated when grid cells include land or coastline. For example, offshore ice drift near a coast can produce strong divergence along the coast. Were coastline or topographic constraints considered when calculating ice divergence? If not, I recommend recalculating divergence while accounting for land boundaries. For wind divergence, it may be acceptable to compute divergence directly from wind fields without considering topography, although clarification would still be helpful.
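A minimal sketch of one way to guard against spurious divergence across a coastline, assuming a regular grid (all names and values hypothetical): mask velocities over land, so that centred differences straddling the coast yield NaN rather than an artificial divergence.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ice-drift components (m/s) on a regular grid with a land mask.
ny, nx, dx = 50, 50, 25e3           # dx: assumed grid spacing in metres
u = rng.normal(0.1, 0.05, (ny, nx))
v = rng.normal(0.0, 0.05, (ny, nx))
land = np.zeros((ny, nx), bool)
land[:, :5] = True                  # a "coast" along the left edge

# Mask velocities over land: finite differences that reach across the
# coastline then propagate NaN instead of producing a spurious divergence.
u = np.where(land, np.nan, u)
v = np.where(land, np.nan, v)
div = np.gradient(u, dx, axis=1) + np.gradient(v, dx, axis=0)

print("ocean cells with undefined divergence:",
      int(np.isnan(div[~land]).sum()))
```

Cells left as NaN could then either be excluded from the training data or recomputed with one-sided differences, but they would no longer feed a coastline artefact into the predictor.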
10. Sign of relationships between predictors and LF
Figures 5 and 10 show the relative contribution of each predictor to LF variability. However, it is not clear whether the relationships are positive or negative. For example, I initially assumed that higher 2 m air temperature would increase LF, but Section 4.2 later shows that the relationship is actually inverse. Similarly, the sign of relationships for SLP, ocean current speed, and ice velocity is not immediately obvious. At the beginning of the results section, the manuscript should clearly indicate whether each predictor has a positive or negative relationship with LF.
11. Mismatch in Ross Sea (Fig. 8e)
In Fig. 8, the mismatch between observations and predictions appears particularly large in the Ross Sea (Fig. 8e). What causes this discrepancy? If a clear explanation is not already provided in the manuscript (I may have overlooked it), it should be discussed.