Iceberg Detection Based on the Swin Transformer Algorithm and SAR Imagery: Case Studies off Prydz Bay and the Ross Sea, Antarctic
Abstract. Icebergs pose persistent hazards to maritime navigation and offshore operations. In Antarctica, grounded offshore icebergs may gradually melt, altering the local ocean stratification conditions. This in turn influences coastal ocean circulation, sea ice dynamics, and thermodynamics. Accurately identifying the spatiotemporal distribution of icebergs is essential for both maritime operations and oceanographic research. In this study, we developed an iceberg detection algorithm based on the Swin transformer model (IDAS-Transformer). The IDAS-Transformer, along with a support vector machine (SVM) and a residual network (ResNet18), was applied to four synthetic aperture radar (SAR) images acquired over Prydz Bay and the Ross Sea, which represented a landfast ice zone, a drift ice zone, and open ocean. The coverage area of each image was 80 km × 80 km. Manual interpretation was employed to generate reference data for algorithmic evaluation purposes. The iceberg concentration, defined as the area occupied by icebergs per grid unit, along with the total number of icebergs and their average size, was introduced to provide a quantitative iceberg detection assessment. We found that the IDAS-Transformer performed well across various sea ice conditions, and a total of more than 800 icebergs were detected. Both the F1 scores and the kappa coefficients of the model exceeded 85 %. The total number of identified icebergs and their area presented mean biases of +4.13 % and +3.65 %, respectively. The IDAS-Transformer outperformed the other two tested algorithms. The sea ice concentration affects the iceberg detection process, with the main challenge being the separation of icebergs from similarly textured pack ice in complex ice-covered regions. Furthermore, distinguishing icebergs that are smaller than 160 m × 160 m among large ice floes remains difficult.
Competing interests: Some authors are members of the editorial board of the journal TC.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: open (until 31 Oct 2025)
- RC1: 'Comment on egusphere-2025-3214', Anonymous Referee #1, 14 Oct 2025
- RC2: 'Comment on egusphere-2025-3214', Anonymous Referee #2, 14 Oct 2025
The manuscript “Iceberg Detection Based on the Swin Transformer Algorithm and SAR Imagery: Case Studies off Prydz Bay and the Ross Sea, Antarctic” introduces the IDAS-Transformer to segment Sentinel 1 images into icebergs and background. The authors train and apply this method, which is based on a Swin Transformer, to areas adjacent to two Chinese research bases in Antarctica and present maps of iceberg concentration, number and area. The results presented look encouraging and the fact that the method has already been applied on vessels is great to close the gap between science and applications. However, I am missing various details on the method and especially on the visual interpretation to follow the paper and to really assess the results.
General comments:
1. The introduction and data description sections are not fluent and some sentences are out of context – it sounds a bit AI-generated. Please fix and acknowledge the use of AI if applicable.
2. The visual interpretation is not explained at all, but forms the basis of all analysis and results
- How did you label the training and evaluation data?
- What is the estimated uncertainty of the visual interpretation?
- Have you checked the areas where only the automated methods detect icebergs to confirm that they weren't just missed visually? It would be nice to include some examples of those cases and also of cases where the automated methods miss visually detected icebergs to understand the challenges better; a sketch of how such an object-level comparison could be made follows this list.
- Figure 11 is very good for understanding what is going on. However, I disagree with some of the manual interpretation (e.g. the small spike in a, the lower left corner in a, and the general shapes in e and f). Please include details on how these ‘ground truth’ data are generated and what their uncertainty and limitations are!
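For the kind of object-level check asked for above, a minimal sketch of how it could be done (this is not the authors' code; the binary numpy masks, connected-component matching and the 10 % overlap threshold are all illustrative assumptions):

```python
import numpy as np
from scipy import ndimage

def compare_detections(pred_mask, ref_mask, min_overlap=0.1):
    """Count manually interpreted icebergs missed by the detector and
    detections with no counterpart in the manual interpretation."""
    ref_lab, n_ref = ndimage.label(ref_mask)
    pred_lab, n_pred = ndimage.label(pred_mask)

    # A reference iceberg is 'missed' if less than `min_overlap` of its
    # pixels are covered by any detection (and vice versa for 'extra').
    missed = sum(
        pred_mask[ref_lab == i].mean() < min_overlap for i in range(1, n_ref + 1)
    )
    extra = sum(
        ref_mask[pred_lab == j].mean() < min_overlap for j in range(1, n_pred + 1)
    )
    return {"reference": n_ref, "detected": n_pred,
            "missed": missed, "extra": extra}
```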
3. The description of the method is unclear
- How was the training, validation and test data split up? Using only 4 scenes is very little data; the test data should come from an independent scene and time and, if the method is supposed to be applicable across Antarctica, also from a different area.
- Where do the three channels come from? Isn’t the HV data just one channel?
- Did you use a land mask? I am surprised that none of the ice shelf got mixed up with icebergs.
- How were the hyperparameters optimized?
- Were they also optimized for the other two methods?
- Was a 4x4 patch size best for all?
- Did you calculate the statistics per pixel or per 4x4 patch?
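Since the per-pixel and per-patch choices can give noticeably different scores, here is a minimal sketch of the two evaluations (assuming non-overlapping 4x4 patches and an illustrative 50 % coverage rule for turning pixel labels into patch labels; neither is stated in the manuscript):

```python
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

def to_patch_labels(mask, p=4, thresh=0.5):
    """Aggregate a binary pixel mask to non-overlapping p x p patch labels;
    a patch counts as 'iceberg' if at least `thresh` of its pixels do."""
    h, w = mask.shape
    blocks = mask[: h - h % p, : w - w % p].reshape(h // p, p, w // p, p)
    return blocks.mean(axis=(1, 3)) >= thresh

def pixel_vs_patch_scores(pred, ref, p=4):
    """Compare F1 and kappa computed per pixel with the same scores per patch."""
    per_pixel = (f1_score(ref.ravel(), pred.ravel()),
                 cohen_kappa_score(ref.ravel(), pred.ravel()))
    pp, rp = to_patch_labels(pred, p), to_patch_labels(ref, p)
    per_patch = (f1_score(rp.ravel(), pp.ravel()),
                 cohen_kappa_score(rp.ravel(), pp.ravel()))
    return per_pixel, per_patch
```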
4. The results could benefit from restructuring, more details on the performance and a discussion of the limitations
- I would first compare the three methods and then apply the best performing one, so I suggest starting with sections 3.4 and 3.5. This might require some reshuffling of the text within the sections, but makes more sense to me.
- You only mention how many icebergs were identified and how many more these were compared to visual interpretation. It would also be useful to know how many of the visually detected icebergs were missed and how many of the automatically detected ones are missing from the visual interpretation.
- The maps of iceberg concentration, number and area are a nice idea. Could you please also comment on the limitations due to the time gaps between Sentinel-1 overpasses, whether you suggest using NRT data, how long the algorithm takes to run on each scene and how far you expect the icebergs to drift between overpasses or between the acquisition and the generation of the maps?
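For context on these map products, the gridded quantities themselves are straightforward to derive from a binary detection mask once icebergs are labelled as connected components; a minimal sketch (the 40 m pixel spacing and 10 km grid cells are placeholders, not values from the paper):

```python
import numpy as np
from scipy import ndimage

def iceberg_grid_products(mask, pixel_size_m=40.0, cell_size_m=10_000.0):
    """Per-grid-cell iceberg concentration (area fraction), iceberg count and
    mean iceberg area from a binary detection mask (illustrative sketch)."""
    p = int(round(cell_size_m / pixel_size_m))          # pixels per cell side
    h, w = mask.shape
    m = mask[: h - h % p, : w - w % p].astype(bool)
    nrows, ncols = m.shape[0] // p, m.shape[1] // p

    # Concentration: fraction of each grid cell covered by detected icebergs
    conc = m.reshape(nrows, p, ncols, p).mean(axis=(1, 3))

    # Count and mean area: assign each connected component (iceberg) to the
    # grid cell containing its centroid
    lab, n = ndimage.label(m)
    count = np.zeros((nrows, ncols))
    area = np.zeros((nrows, ncols))
    if n:
        areas = ndimage.sum(m, lab, index=range(1, n + 1)) * pixel_size_m ** 2
        centroids = ndimage.center_of_mass(m, lab, index=range(1, n + 1))
        for a, (ci, cj) in zip(areas, centroids):
            r, c = int(ci) // p, int(cj) // p
            count[r, c] += 1
            area[r, c] += a
    mean_area = np.divide(area, count, out=np.zeros_like(area), where=count > 0)
    return conc, count, mean_area
```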
Detailed comments:
L16: not just grounded?
L20-22 this sounds like you apply all three methods consecutively rather than for comparison.
L23 delete 'an'; also delete 'the' in L97
L51 carving or calving?
L62 What do you mean by this?
L64 the pdf of what?
L65, 68 replace classification with extraction, segmentation or detection.
L66 Context of this sentence? The previous one is on SAR, now you talk about altimetry
L110, 111 Translation error or AI-generated? 'European Earth' does not exist.
L125 Why did you not add June to really cover all seasons?
L143 By AMSR2 you mean the University of Bremen product?
L147 Are you using 4 or 50 scenes? What time or area do they cover?
L153 How do you annotate patches that are half or partly covered by an iceberg?
L156 Are these new scenes? What time and area do they cover?
Fig 3 Why are the several background labels not merged into one? They should be connected?
Fig 4 It looks like the background is gone in b. How does it come back in c if c is generated from b?
It also looks like the icebergs shrink from a to b/c.
L192 Why two successive Swin transformer blocks? It looks like 4 in Figure 5.
L357 Where is this shown?
L366-367 Isn’t the ice shelf shown in grey? Please use different colours for the ice shelf front and fast ice edge or clarify the caption.
L375-376 SVM is mainly worse for AOIs 2 and 3 but very similar for 1 and 4, so I wouldn’t say the margins are significant everywhere.
L391 Why is this number different from table 2? Which scene(s) are you talking about here?
L492-500 This is a very nice application, which strongly supports your method. I would therefore already mention it in the introduction and methods, as it also makes it clearer what the goal is.
L505 This sounds like you used AMSR2 as input to the neural network?
Citation: https://doi.org/10.5194/egusphere-2025-3214-RC2
The authors present a description of applying and adapting the Swin transformer architecture to the problem of detecting icebergs in satellite radar data. They compare their model performance against SVM and ResNet18 benchmarks, and train and evaluate in a range of sea ice concentrations, with a view to developing a detection system that functions well across contexts. The contribution is interesting and a useful addition to the discussion, exploring how novel model architectures perform on a long-standing, important and as yet unsolved environmental monitoring problem.

My opinion is that the manuscript contains some valuable contributions to the field but requires improvement before it is suitable to be published. My main concerns are the lack of detail provided on the methodological and design choices that have been made, and how these affect both this study itself and its utility for the wider community. Some choices require stronger justification, while others require more consideration and discussion of their implications. We also need to see full details of the machine learning training regimes to ensure that the models have been appropriately and comparably trained, particularly considering the relatively small training dataset used. I have attempted to highlight where further detail would be beneficial in my specific comments below.

Another general observation is that the literature cited is incomplete and somewhat outdated. I would recommend that the authors ensure they have read, and included where appropriate, all the relevant literature in this (still rather small) field. The figures are hard to interpret due to their scale; I would encourage the authors to show smaller areas in more detail where possible. Overall, I consider that major revisions are likely to be required before publication, but the work definitely has the potential to form a valuable contribution to the field.
Specific points:
Introduction: Many of these references look quite old and there is newer literature available to support most of these initial statements – can the authors update their literature search, maybe looking at authors like Coulon, Davison, etc., to provide more recent relevant insights?
L68 – There are more studies out there. Consider citing
Evans et al. (2023): https://www.sciencedirect.com/science/article/pii/S0034425723003310
and Chen et al. (2025): https://essd.copernicus.org/preprints/essd-2025-51/
and Jafari et al. (2025) for the DL side: https://www.mdpi.com/2072-4292/17/4/702
L85-89. These claims that SWIN transformer is better than other methods need to be substantiated by citations if they appear in this section. At present they are not evidenced, and even if citations were provided, a little more quantitative detail on exactly how much better SWIN is than other methods for certain tasks would be needed. These statements seem to be the foundational premise for the choice of architecture in this study, but are currently unsupported by evidence, which raises questions over the chosen approach. Indeed, having claimed that SWIN transformer is better, on L95 the authors say it hasn’t previously been applied to this question.
L115 – The authors select HV based on some evaluation of contrast. Please could they provide more detail on how this was conducted and present some of the data supporting their decision (possibly in supplementary material)? HV is not widely available across Antarctica compared to HH, so the choice to develop a system that may not be scalable because of the polarisations chosen probably needs a little more justification and some discussion of the trade-offs with generalisability that it implies.
Table 1 – I note that the AOIs represent different times of year. Backscatter and sea-ice contrast are known to vary seasonally, but this sampling varies the seasonality at the same time as the sea ice concentration. Can the authors please expand on the motivation for these particular time points, which do not allow for an independent assessment of the effect of sea ice concentration on classifier performance (i.e. controlling for seasonality while varying sea ice concentration)? Such an experimental setup would imply introducing greater latitudinal variation, which also creates complexities for any classifier. In general, I would like to see more recognition and explanation of the implications of the design decisions being made, both for the performance within this study and for the scope for generalising to other areas.
L149 - “Regions with high brightness levels in the SAR images were labelled icebergs.” – it is unclear at this stage how this was done, but this sentence implies it might have been thresholding, which is unlikely to be robust. On L153 the authors say that each patch was annotated. This implies a manual annotation. Can the authors explicitly describe the process? Was the annotator able to see the context of the patch? If so, how much context? At what resolution/zoom level were annotations decided upon? Were multiple human annotators used? Was any attempt made to assess annotator consistency?
As with much of the method description, the design choices need more explanation and justification. Why did the authors choose to annotate on the patch level rather than vectorising outlines? Why was 4*4 the chosen patch size, and what are the implications and trade-offs of this compared to other patch sizes? What was done with patches that were only partially iceberg? Does the patch approach limit the ability of the transformer to learn to make precise boundary delineations?
L151-3: Why should replicating the same data three times result in better noise resilience and improved performance when it isn’t providing any more information to the network? This may be another place where some supplementary material could be used to detail what tests were undertaken and elucidate and evidence why duplicating data helps. Alternatively, can the authors demonstrate from mathematical principles why the transformer architecture should perform better with three identical channels?
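For reference, the usual reason a single SAR band is replicated is simply to match the three-channel input expected by RGB-pretrained backbones; it adds no information. A minimal sketch of what this presumably looks like (the authors should confirm whether this is what was done):

```python
import numpy as np

def hv_to_three_channels(hv):
    """Duplicate a single HV backscatter band into three identical channels
    (channel-first layout) so it fits a 3-channel, RGB-pretrained backbone."""
    return np.repeat(hv[np.newaxis, ...], 3, axis=0)   # (3, H, W)
```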
L157 – the training data were balanced for iceberg/non iceberg, but what was the balance for sea ice conditions? Again, contextual information like a histogram would help the reader interpret what the transformer is being shown.
L163 – Did the authors not apply orbit files, sensor calibrations, thermal noise removal or radiometric terrain calibration? These are standard procedures for generating robust, reusable data from Sentinel 1. If the authors chose not to follow standard procedures then their models will not be transferrable to other Level 1 or analysis-ready datasets. If they can demonstrate that excluding these steps produces better detection performance then there is an argument for using non-standard products, but this paragraph fails to adequately support that choice. Furthermore, some of the SNAP preprocessing would likely help to mitigate the cross-scene and cross-swath variability that the authors then apply some bespoke processing to overcome. Can the authors provide a more robust justification for their pre-processing choices and show that they translate into better model performance?
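For comparison, the standard chain referred to above can be run with SNAP's gpt command-line tool roughly as follows; this is a sketch only, with operator parameters left at their defaults, and it is not the authors' processing chain:

```python
import subprocess

# Sketch of a standard Sentinel-1 GRD chain using SNAP's gpt tool.
# Operator names are SNAP's; parameters are left at defaults here and
# would need tuning for real use.
steps = [
    "Apply-Orbit-File",      # refine orbit state vectors
    "ThermalNoiseRemoval",   # remove the thermal noise floor (relevant for HV)
    "Calibration",           # convert DN to calibrated backscatter
    "Terrain-Correction",    # range-Doppler terrain correction / geocoding
]

src = "S1_EW_GRDM_scene.zip"          # placeholder input product
for i, op in enumerate(steps, start=1):
    dst = f"step{i}.dim"              # BEAM-DIMAP intermediate
    subprocess.run(["gpt", op, "-t", dst, src], check=True)
    src = dst
```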
L170 – What is this linear stretching process and how was it conducted?
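If "linear stretching" means the common percentile-based contrast stretch, it would be something like the following (the 2nd/98th percentiles are placeholders; the authors should state what was actually applied):

```python
import numpy as np

def linear_stretch(img, lo_pct=2, hi_pct=98):
    """Clip to the chosen percentiles and rescale linearly to [0, 1]."""
    lo, hi = np.percentile(img, [lo_pct, hi_pct])
    return np.clip((img - lo) / max(hi - lo, 1e-12), 0.0, 1.0)
```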
L175 – If pre-processing is manually customised on a per-scene or per-AOI basis, this introduces subjectivity that is highly detrimental to the wider applicability of any trained models arising from this work. Can the authors justify their choice to use a highly manual pre-processing stage, having argued in the introduction that they are testing the SWIN transformer to improve generalisability of detection methods and overcome issues like backscatter variability?
L219-220 – Again I feel that the design choice has not been fully justified. Are there examples of SVM used for iceberg detection on SAR (yes) that motivate its use as a benchmark here – or would picking a more recent/widely used approach from the iceberg detection literature as an initial benchmark allow for more meaningful evaluation of the SWIN transformer against the state-of-the-art in the iceberg detection field specifically?
L230 – 235 – There is some duplication here of the justification for use of ResNet18. Again, why not benchmark on a DL architecture that has already been deployed for iceberg detection?
L240-244 – These claims about the suitability of ResNet for polar and specifically iceberg tasks need references to support them.
L251 – “following segmentation” – Are the authors referring to patching here? Segmentation is surely the task of the models?
General: Where can I see the config for the various training runs? What hyperparameters were explored and how? How long was each model trained for? What was the stopping criterion? These need to be presented to convince the reader that the differences in performance arise from greater architectural suitability of the transformer rather than from hyperparameter choices or training regime/effort.
Figure 6 – These panels are a bit too small to easily see how well the classifier is detecting smaller icebergs and delineated boundaries – can the authors provide a zoomed-in panel in this or another figure showing detail of the annotations and predictions please?
L313 – I disagree that the authors introduce the concept of iceberg concentration for the first time (as implied in my reading of this line). Please rephrase.
Figure 7 – It may be related to the scale of these images again but it is hard for the reader to appreciate what benefit converting to iceberg concentration provides over simply giving presence/absence. Is this more evident on a larger-scale figure? Do the authors have an example of a standard route for a research vessel to a station and how the navigator would find the concentration product more useful than the raw iceberg map when traversing it? Something like that would give the reader a better appreciation of how this may be useful in support of safer navigation. I see something akin to this in Fig. 8, but even at that scale it is hard to appreciate what value the gridded concentration or count products provide over the raw detections.
L349 – Was this an application of the SWIN classifier to a completely previously unseen image? In this paragraph it is a little unclear. If so, this is potentially a good test of its generalisability, having been trained on the other four scenes, and arguably could offer more insight into the value of the approach than the patch-wise validation carried out on the same scenes as the training. If so, I would encourage the authors to provide evaluation metrics for this example too, and to elaborate more on what these say about the method's wider applicability.
Section 3.4: I was expecting to see this earlier – to me it would make sense to present this alongside the transformer evaluation, and before the use-case example. Please consider moving this results section up the manuscript to sit more comfortably alongside the other performance metric descriptions, although the evaluation/discussion should remain in the discussion section of the manuscript.
Figure 9 – Again, at this scale it is hard to see any differences between the three approaches – can the authors provide zoomed-in views to allow the reader to see the nuances of how the different methods classify icebergs?
Thank you for the explanation and detail on the failure modes – this provides useful context to those thinking about building on this work. In L421-424 the conflation of smaller touching icebergs into a single predicted object is argued to result in a positive prediction bias in terms of object count – surely this is the other way around and should reduce the number of predicted objects compared to manual delineation?
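A tiny, hypothetical example supports this direction of the bias: when a prediction bridges two touching icebergs, connected-component counting returns fewer objects, not more:

```python
import numpy as np
from scipy import ndimage

ref = np.zeros((6, 9), int)
ref[2:4, 1:4] = 1          # manually delineated iceberg A
ref[2:4, 5:8] = 1          # manually delineated iceberg B (separate)

pred = ref.copy()
pred[2:4, 4] = 1           # prediction bridges the gap between A and B

_, n_ref = ndimage.label(ref)
_, n_pred = ndimage.label(pred)
print(n_ref, n_pred)       # 2 1 -> merging touching objects lowers the count
```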
L429-430: "A rough iceberg surface, which is heavily jagged with cracks, is a clear sign of impending disintegration, whereas a smooth iceberg with no visible cracks is categorized as stable" – what is the evidence for this, and can the authors either provide references or remove this assertion? Surface cracking and texture are a product of the stress history of the ice and may not imply imminent fragmentation – some large tabular icebergs have substantial surface texture and crevassing but remain stable for decades, e.g. B-22. If the authors do have studies that link surface characteristics to a predictive capability regarding fracture, I would be fascinated to read them and would encourage them to cite them here.
Figure 11: This is a really useful figure and should form the basis for discussion of the effect of classifying on a patch basis on predictions.
L453-454 – This discussion comes back to the effect of selecting AOIs that vary both sea ice concentration and seasonality. By my reading, the authors have not structured their sampling to directly evaluate the effect of seasonality on their classifier and should probably acknowledge this when drawing comparison with other studies. See also the claims made in L470-471 that don’t feel to me to be robustly supported. In addition to Mazur et al. (2017), Evans et al. (2023) also explicitly evaluated the effect of seasonality on classifier performance, while the recent work of Chen et al. (2025) is an example of using October-only data.
L474 onwards – The authors claim that previous superpixel approaches require a lot of training data, but transformers are notoriously data-hungry yet have been selected here for a relatively small study. This leads me to a general observation about the proposed study – which is that it is reliant on a fairly small dataset for training (particularly in the context of transformer architectures), and is developed and tested on small AOIs. This is not necessarily problematic, but the reader needs to be convinced that the size of the model is appropriate to the relatively small training set and that claims of generalisability such as deployment on vessels are justified. The authors should therefore present more detail on how they have guarded against overfitting and overparameterisation during model selection and training. This could be in supplementary material but will be needed to convince the audience that the approach is appropriate to the scale of the data and robust.
L488-490 – Can the authors explain why iceberg displacement is a problem for the specific task being addressed here of detection? Surely they are where they are when the image is acquired? I can see that it is an issue for providing data to mariners or other science questions. Can the authors clarify this statement to be explicit about why the temporally sparse satellite acquisitions might be challenging and for what purposes?
L513 - This is also the setting where other methods (including CFAR) perform best, and where such methods may outperform this approach. Can the authors briefly contextualise their model’s performance against the best available previous study in similar sea ice contexts for each of their settings? (This may not be for the conclusion, but would be nice to see somewhere).
L530 – This discussion of modelling effects feels a bit ‘out of the blue’ – in that none of the previous discussion (except very early and briefly in the introduction) relates to model implementations. Does it need to be here? If so, should it be accompanied by a bit more detail about how this detection method contributes earlier in the discussion rather than appearing just in the conclusion?
Minor points:
L51 – carving → calving