Self-supervised learning reduces labelling requirements for sea ice segmentation in Sentinel-1 SAR imagery
Abstract. Monitoring Arctic sea ice variability is crucial for maritime safety. Synthetic Aperture Radar (SAR) imagery provides an effective means of achieving this through all-weather, day-and-night coverage of the Arctic. Navigation in the Canadian Arctic Archipelago currently relies on operational ice information services, including analyst-derived ice charts, satellite imagery, and ice routing products provided by national ice services. However, the development of machine-learning systems capable of automatically processing large volumes of satellite imagery and accurately identifying ice conditions is constrained by the need for extensive manually labelled datasets. To address this limitation, we developed a self-supervised learning (SSL) approach, which uses unlabelled data to learn general image representations. Specifically, we use Bootstrap Your Own Latent (BYOL), a non-contrastive SSL framework, to pretrain a UNet encoder on unlabelled dual-polarised Sentinel-1 Extra-Wide mode (EW) SAR scenes before fine-tuning with a small set of labelled images. We compare the BYOL-pretrained UNet (called UNet SSL in this study) to four baselines: a control UNet, a fully supervised UNet, a Random Forest classifier, and the Segment Anything Model (SAM). With only three labelled scenes, the BYOL-pretrained UNet achieved higher segmentation accuracy than the fully supervised model trained on seven images, more than twice the number of labelled scenes. The most significant gains occurred in Marginal Ice Zone (MIZ) scenes, where the BYOL-pretrained UNet achieved a Matthews Correlation Coefficient (MCC) of 0.2087, compared with 0.1685 for the fully supervised UNet trained on seven labelled scenes and 0.1449 for the control model trained on three scenes—representing an MCC increase of approximately 24 % and 44 %, respectively. 
These improvements were accompanied by a substantial reduction in false negatives and a marked increase in recall, indicating improved discrimination under low-contrast, fragmented floe conditions. Our findings demonstrate that SSL reduces annotation requirements for SAR-based sea ice segmentation, improving model generalisation in both consolidated and fragmented ice conditions. This approach offers a scalable solution to the labelling bottleneck in Arctic monitoring and highlights the potential of BYOL as a general pretraining strategy for SAR-based Earth observation image segmentation.
This study proposes a self-supervised learning (SSL) approach to improve ice-water mapping in Sentinel-1 (S1) wide-swath data (EW). The authors correctly identify accurate and abundant labelling of training data as a bottleneck for high-resolution sea ice mapping and hence investigate the benefits of SSL combined with limited training data (3 images). A comparison of their “UNet SSL” with four baseline models (UNet Control, UNet SL, RF, SAM) evaluated on two test images shows improvements in ice-water separation from the SSL.
The paper is overall quite well written, although clarification and rephrasing are needed in several places. The topic is interesting and worth exploring, and the detailed labelling of the 9 images used for training and evaluation appears to be accurate (based on the examples shown) and stands in clear contrast to other commonly used training sets, which are often based on ice charts.
However, there are several issues that should be addressed before publication.
Major / general comments
(1) Lack of thermal noise discussion:
Figures 7 and 8 show the classification results for both test scenes, obtained from the UNet SSL and the four baseline models. The most visually striking feature in almost all results is the S1 thermal noise pattern. Yet thermal noise is not explicitly mentioned at any point in the paper. The results for test scene 2 (Figure 8) in particular are dominated by thermal noise patterns (scalloping and sub-swath boundaries), and the classification results appear to depend strongly on how well each model learns these noise patterns. Although this is a well-known issue for S1 sea ice mapping, the authors do not include thermal noise correction in their pre-processing chain. At the very least, these patterns and their influence on the scores should be discussed.
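For illustration, the standard correction step I have in mind is a simple subtraction of the annotated noise-equivalent sigma zero (NESZ) in the linear-intensity domain before dB conversion. A minimal numpy sketch (the `nesz_lin` noise vectors are assumed to come from the S1 product annotations; the clipping floor is a placeholder choice, not from the manuscript):

```python
import numpy as np

def remove_thermal_noise(sigma0_lin, nesz_lin, floor=1e-6):
    """Subtract the annotated noise-equivalent sigma zero (NESZ)
    from linear-intensity backscatter; clip at a small floor so
    the subsequent dB conversion stays defined."""
    denoised = np.clip(sigma0_lin - nesz_lin, floor, None)
    return 10.0 * np.log10(denoised)

# A pixel well above the noise floor shifts only slightly,
# while a pixel close to it is pulled down strongly.
sigma0_db = remove_thermal_noise(np.array([1e-2, 1.2e-3]),
                                 np.array([1e-3, 1e-3]))
```

More refined EW denoising schemes exist (especially for the sub-swath boundaries), but even this first-order step would likely reduce the scalloping patterns visible in the results.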
(2) Figure quality:
The quality of some figures in their present form is not sufficient and should be improved (in particular Figures 2, 7, and 8). Suggestions for improvement are given in the detailed comments below. In general, I suggest including all figures as vector graphics instead of pixel graphics, as this will greatly improve the quality, especially when zooming in on details.
(3) Limited training and evaluation:
While the authors explain that “the primary contribution of the study lies in demonstrating relative performance gains under reduced annotation budgets, rather than maximising accuracy on a specific benchmark”, I am sceptical that the results (“MCC improvement of 24% and 44%”) are transferable to “more realistic scenarios” with significantly more training data. Generally, all performances presented in Figures 7 and 8 are still quite poor and heavily affected by thermal noise (see comment (1)), and any algorithm moving closer to operational use needs to produce much more accurate and reliable ice-water maps, requiring additional training for the UNet SSL as well as for the baseline methods. While it is worthwhile to demonstrate the potential improvements from the SSL approach (as done in this study), the authors should discuss the transferability/scalability of these improvements when much more training data is used to achieve overall sufficiently good results (with any method).
(4) Random Forest performance:
Although it is only a “baseline” algorithm for comparison, the performance of the RF algorithm is noticeably poor and concerning. The presented RF essentially maps everything as sea ice, without even identifying the dark leads in test scene 1 (Figure 7). Although the deep-learning approaches are probably expected to perform better than a “traditional” RF, this exceptionally poor performance is still quite surprising, since RFs have been used quite successfully for sea ice mapping in previous studies (e.g. Park 2020, Lu 2023). Without any additional explanation, this raises concerns (a) that the RF and its texture features are not well designed and/or (b) that the training data is simply not sufficient for any reasonable assessment. I think at least some discussion/explanation of this poor performance relative to the RFs in previous studies is needed.
(5) “Applicational” motivation of the study:
It is not entirely clear whether the main applicational focus (i.e., why we need improved automated sea ice mapping) is on navigation or environmental studies (or both). The initial mention of “maritime safety” and the explanation of operational ice charting suggest an operational focus on tactical navigation, but later this is repeatedly mixed with the importance of leads for ocean-atmosphere interactions, which suggests an environmental focus. Both applications are valid and important, but they should be clearly distinguished in their explanations.
This is relatively easy to fix, so more a “general” than a “major” comment.
(6) Roughness scales:
Roughness is always a relative term, depending on scale. Whenever mentioning roughness throughout the manuscript, the authors should specify the roughness scale they have in mind: small-scale (on the order of the radar wavelength) or large-scale (on the order of metres). Large-scale roughness could also be referred to as large-scale deformation and is directly related to ice type (deformed FYI or deformed MYI), whereas small-scale roughness can in fact vary significantly within one ice type, especially for young ice with/without frost flowers or finger rafting. It might be worth explicitly mentioning somewhere that the interplay of both small- and large-scale roughness effects contributes to the challenging interpretation of sea ice in SAR imagery.
Also easy to fix and rather “general” than “major”.
(7) Description of the 2 test scenes:
The two selected test scenes are described throughout the manuscript as “consolidated ice pack” (scene 1, Figure 7) and “MIZ” (scene 2, Figure 8). The MIZ is “traditionally” defined as the area with 20-80 % SIC; more recently, physics-based definitions such as “the area affected by waves” have become common. Based on visual inspection of Figure 8, neither definition makes me think that this image lies in the MIZ. The authors should explain the reasoning behind this description of the test scenes and perhaps consider changing it.
Detailed comments
Title and abstract: The term “sea ice segmentation” sometimes refers to ice-water mapping and sometimes to sea ice types (or sometimes both). Please consider a slight adjustment of your title to indicate that you are working towards separating ice and water (not sea ice types). Even after reading the abstract, this remains unclear.
Lines 21 and following: You introduce the term “UNet SSL” here but then keep referring to the “BYOL-pretrained UNet”. Unless I misunderstand, these two terms refer to the same algorithm/model in your study. Please consider sticking with one single term to avoid possible confusion.
Lines 41-42: The cited numbers are from 2018, which is by now 8 years ago. Please consider presenting more recent numbers, especially since the decline in September extent has significantly slowed in the past years (see e.g. https://www.meereisportal.de/en/maps-graphics/sea-ice-trends#gallery-1 or attached png)
Lines 49-50: Time lag is one issue, but so are the subjectivity of the analyst and the overall increase in data availability as more sensors are launched (-> more analysts needed).
Lines 50-53: This statement should be formulated more clearly. The ice charts don’t really lack details of leads or ridges because SAR products cannot resolve them, but rather because mapping individual leads manually is too time-consuming in the operational production chain of most services. Hence, many ice charts include lead information in the form of young ice fractions in the egg codes of each polygon in the chart. However, even rather simple automated products can in fact capture individual leads quite well, still limited, of course, by the sensor resolution (~90x90 m for S1 EW) (e.g. Johansson 2018, Murashkin 2019, Lohse 2024).
Also, while the authors are right that leads and deformation zones are important for ocean-atmosphere interaction, this statement does not really fit the context of ice charts. Here, the leads are important for safe and efficient navigation and route planning.
Line 57: Please quantify the size of fine-scale features. Compared to other sensors, SAR is very good at resolving fine spatial scales (although of course still limited by pulse and Doppler bandwidth, i.e. spatial resolution). If you are referring to the lack of individual leads in labelled data such as ice charts, consider specifying that this is an “ice chart issue” and not necessarily a “SAR issue”.
Line 71: Please specify: Do they struggle with the separation between these two ice types, or with separating these ice types from other types? Consider explaining why.
Line 103: Please specify the roughness scale. I assume you are talking about “small-scale” surface roughness here. Maybe add “(cm-scale or wavelength-scale)” or something similar to avoid confusion with large-scale deformation (sometimes also called roughness).
Lines 129-131: “heavily deformed”, “smoother”, “increases roughness”. Please clarify roughness scales for the different statements.
Lines 134-135: See general comment (5): Until here I was under the impression that the main application focus is on navigation support. If you want to keep both navigation (“automated ice charting”) and environmental studies (“lead detection to study energy balance”) for your motivation, I suggest mentioning both of them quite early on and explaining that “accurate ice type mapping is required for a range of applications, including support of safe navigation as well as environmental studies of ocean-ice-atmosphere interactions” (or something along those lines).
Line 138: Overlapping backscatter signatures from which surface types?
Lines 145-150: If I understand correctly, this reads like it should be a list of 2 research questions, but the paragraph/line break between them seems strange. I suggest listing them as two bullet points or even numbering them as research goals (1) and (2) which you can then explicitly refer to later.
Lines 152-156: See general comment (3): This “relative” comparison of the different models makes sense to some extent. However, I wonder whether the relative improvement you demonstrate later will also hold when much more training data is used overall, which will be needed to achieve better results. In practice, you would probably never use a deep-learning approach for ice-water mapping trained on only seven images. I would like to see this commented on in the discussion.
Lines 171-172: Something missing in the sentence; maybe a “-“ after pack ice?
Figure 1: Legend says “Sea Ice Concentration (m)”, should probably be “(%)”. I also find the legend entry “Label Extents” slightly confusing. I think you are showing the footprints of the S1 EW scenes used in the study? Please consider changing the label to “S1 footprints” or something similar.
Maybe also consider colour-coding footprints as “test scenes”, “full training set (7 images)”, and “small training set (3 images)” or similar.
Sentinel-1 SAR imagery:
This entire section needs some clarification.
Lines 229-230: Figures should be numbered in order of appearance. You refer to Figures 7 and 8 before Figure 2.
Lines 245-249: Please add the acquisition date (season) for the validation scenes. I am aware that they are in Table 1, but I think it is worth repeating them in the text here.
Figure 2: I appreciate that you are showing the example of the training data and compare to other already published sets. While the advantages of your detailed manual labelling compared to the other methods become clear, the figure in its current form needs multiple changes/improvements:
Figures 7 and 8: (commented on here because this is where they are first referred to; some of the comments below relate to the results part of the figures)
The figure quality in its current form is not good. Since you are showing a range of different panels, I do not see the need to rotate the figure by 90°, which makes the individual panels smaller and leaves half the page empty. Please also insert vector graphics to maintain better quality when zooming in on details.
The manually selected labels look convincing.
The performance of the RF makes me question the training and design of the RF, and whether it can be considered a fair comparison; please see general comment (4).
Finally, we see a lot of thermal noise effects across the classification results of all UNet approaches in Figure 8; please see general comment (1).
Lines 344-345: What exactly do you mean by selecting scenes based on “quality”?
Line 374: Consider rephrasing “raw HH and HV backscatter” to “HH and HV backscatter intensities” or a similarly more precise description. “Raw” backscatter in SAR usually refers to the unfocused image.
Lines 373-377: Good to see that you are using texture features in the RF; this makes the comparison fairer. Please add information on the choice and design of the texture features, e.g. GLCM parameters such as distance, angle, window size, and discretisation. These choices are critical for good ice type separation (e.g. Zakhvatkina 2017, Karvonen 2017, Park 2020, Lohse 2021, Khachatrian 2021).
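To make concrete which parameters I mean, here is a hand-rolled numpy sketch of a single GLCM statistic (contrast) with the critical choices exposed; all parameter values are placeholders, not the authors' settings, and in a full feature computation this would run in a sliding window (window size being another choice to report):

```python
import numpy as np

def glcm_contrast(patch, levels=16, offset=(0, 1)):
    """GLCM contrast for one patch. `levels` is the grey-level
    discretisation; `offset` = (row step, col step) encodes the
    distance and angle of co-occurring pixel pairs."""
    edges = np.linspace(patch.min(), patch.max(), levels + 1)[1:-1]
    q = np.digitize(patch, edges)               # grey levels 0 .. levels-1
    dr, dc = offset                             # assumed non-negative here
    a = q[:q.shape[0] - dr, :q.shape[1] - dc]   # reference pixels
    b = q[dr:, dc:]                             # offset neighbours
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (a.ravel(), b.ravel()), 1)  # count co-occurring pairs
    glcm /= glcm.sum()                          # normalise to probabilities
    i, j = np.indices((levels, levels))
    return float(((i - j) ** 2 * glcm).sum())
```

Reporting exactly these choices (levels, distance, angle(s), window size) would make the RF setup reproducible and allow comparison with the cited studies.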
Lines 386-388: Please specify the scaling (min/max values) used when mapping HH, HV, and HH/HV to 8-bit RGB channels.
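For clarity, I assume something like the following clipped linear rescaling is meant; the dB clip ranges below are hypothetical placeholders, and the manuscript should report its actual values:

```python
import numpy as np

def to_uint8(band_db, vmin, vmax):
    """Linearly rescale a dB band into [0, 255] with clipping."""
    x = np.clip((band_db - vmin) / (vmax - vmin), 0.0, 1.0)
    return np.round(x * 255).astype(np.uint8)

# Hypothetical clip ranges -- not taken from the manuscript.
hh = np.random.uniform(-35.0, 0.0, (64, 64))
hv = np.random.uniform(-40.0, -5.0, (64, 64))
rgb = np.stack([to_uint8(hh, -30.0, 0.0),
                to_uint8(hv, -35.0, -5.0),
                to_uint8(hh - hv, -5.0, 15.0)], axis=-1)
```

The chosen min/max values directly control how much ice-water contrast survives the 8-bit quantisation, so they matter for the SAM comparison in particular.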
Experiment design: Most commonly, I would expect the 9 labelled images to be split into 3 sets: train, test, and validation. What you call the test set (the two images kept aside) would then be the validation set, whereas the remaining 7 images would be split into training (to fit the model weights) and testing (to avoid overfitting). Please comment on/specify why you decided to split into only 2 sets and how you avoid overfitting.
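As a sketch of the 3-way split I have in mind (scene IDs and the random seed are placeholders; the naming follows my comment above, with the paper's held-out pair relabelled as validation):

```python
import random

# Placeholder IDs for the study's nine labelled S1 scenes.
scenes = [f"scene_{i}" for i in range(1, 10)]
rng = random.Random(42)
rng.shuffle(scenes)

validation = scenes[:2]  # kept fully aside, used once for final evaluation
test = scenes[2:4]       # monitored during training to detect overfitting
train = scenes[4:]       # five scenes used to fit the model weights
```

Splitting at the scene level (not the patch level) is important here, since patches from the same scene share noise patterns and ice conditions.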
Figure 3: The visualization seems to be missing some connections, e.g. the “labelled dataset” is also the input for the RF and SAM, not just for the UNet (SL).
Also, colours for “SAM” and “Compare Models” appear very similar, please consider adjusting one of them.
Figure 4: The label “HH, HV, 1024” for the input layer does not seem entirely accurate. I assume the three layers shown are HH, HV, and training labels, while the size of the input patches is 1024x1024?
Figures 5 and 6: The main message the reader is supposed to take from these figures is not entirely clear to me. You refer to them in the “model intercomparison” section on page 13 (lines 327-343), but I do not really understand what I am supposed to learn from them. I am sure there was a clear idea behind showing them; please consider stating the main message more explicitly.
Model performance across ice types and HH backscatter: In addition to the different ice regimes, I think you need to associate the backscatter bins with different IA regimes. E.g. in Figure 7, the overall decrease of HH sigma_0 across the swath is clearly visible. This should at least be included in the discussion of the results presented in Figure 9 (please see also the previous comment on the IA sensitivity of sigma_0 in the data section). Generally, you should be careful with any over-interpretation of this figure, since you do not account for HV at all in this analysis; however, based on the noise patterns in Figures 7 and 8, there is a clear influence/contribution of HV.
Lines 559-562: Please make clear that these interpretations are only valid for the test scene shown. I don’t think you can generally associate the ice-water transition with a strong contrast in HH sigma_0, as sigma_0 is ice-type dependent and, more importantly, highly wind-state dependent for open water.
Lines 563-572: The description and discussion of the MCC-vs-HH(dB) graphs for test scene 2 must include thermal noise, which is clearly visible in the results of all methods (Figure 8), both as scalloping in sub-swath EW1 and at the sub-swath boundaries. Due to its lower signal strength, HV is much more affected by thermal noise and should therefore be included in the visualizations in Figures 7 and 8. Many of the differences associated with the HH bins in Figure 9 may in fact be caused by noise effects or by variation in HV, which is neither shown nor discussed here.
Figure 9: Better figure quality than many of the other figures. However, while this presentation of the MCC may contain useful information, I do not think it can be interpreted without accounting for the HV channel.
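The same binned-MCC analysis could simply be repeated over HV (and incidence angle) to separate noise-driven effects from genuine HH dependence. A minimal sketch, with placeholder bin edges (the MCC formula itself is standard):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 = water, 1 = ice)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom else 0.0

def binned_mcc(y_true, y_pred, channel_db, edges):
    """MCC per backscatter bin; run once with HH and once with HV to see
    whether apparent HH effects are in fact driven by the HV channel."""
    idx = np.digitize(channel_db, edges)
    return [mcc(y_true[idx == k], y_pred[idx == k]) for k in range(len(edges) + 1)]
```

A two-panel (HH-binned vs HV-binned) version of Figure 9 would make the noise argument above testable directly.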
Lines 637-543: Although the limitations of RFs are pointed out correctly, I remain puzzled that the RF in this study almost completely fails to detect even the very dark (in HH) lead structures. Additional information on, and discussion of, the parameters used to compute the textural features might help to explain this.
Subsection 5.3: If I understand this section correctly, I would not call it a “comparison”. It rather provides reasoning for why the alternative self-supervised models are not implemented and tested in this study. Consider rephrasing it in the discussion, or moving it into the introduction and method sections to strengthen the reasoning for the choice of BYOL.
Line 690: The Park (2020) study cited here is a good example of a RF classifier producing much better results than the RF in this study.
Lines 742-755: Please choose your wording such that these statements apply only to your two example images. The general claim of better contrast between ice (“bright”) and leads (“dark”) in the consolidated pack ice region, compared to more overlapping signatures in the MIZ, may be true for the two examples discussed here but does not necessarily hold in general. Even within the pack ice, wind-roughened leads may appear bright in HH and overlap significantly with sea ice backscatter signatures; HV will then be critical for distinguishing ice and water. On the other hand, brash ice (heavily deformed -> strong backscatter) and calm water (in calm wind conditions) can be easily separable in the MIZ and close to the ice edge. I think you should be more careful with general statements on differences in model performance and ice-water separability based on the two examples selected here.