Self-supervised learning reduces labelling requirements for sea ice segmentation in Sentinel-1 SAR imagery
Abstract. Monitoring Arctic sea ice variability is crucial for maritime safety. Synthetic Aperture Radar (SAR) imagery provides an effective means of achieving this through all-weather, day-and-night coverage of the Arctic. Navigation in the Canadian Arctic Archipelago currently relies on operational ice information services, including analyst-derived ice charts, satellite imagery, and ice routing products provided by national ice services. However, the development of machine-learning systems capable of automatically processing large volumes of satellite imagery and accurately identifying ice conditions is constrained by the need for extensive manually labelled datasets. To address this limitation, we developed a self-supervised learning (SSL) approach, which uses unlabelled data to learn general image representations. Specifically, we use Bootstrap Your Own Latent (BYOL), a non-contrastive SSL framework, to pretrain a UNet encoder on unlabelled dual-polarised Sentinel-1 Extra-Wide mode (EW) SAR scenes before fine-tuning with a small set of labelled images. We compare the BYOL-pretrained UNet (called UNet SSL in this study) to four baselines: a control UNet, a fully supervised UNet, a Random Forest classifier, and the Segment Anything Model (SAM). With only three labelled scenes, the BYOL-pretrained UNet achieved higher segmentation accuracy than the fully supervised model trained on seven images, more than twice the number of labelled scenes. The most significant gains occurred in Marginal Ice Zone (MIZ) scenes, where the BYOL-pretrained UNet achieved a Matthews Correlation Coefficient (MCC) of 0.2087, compared with 0.1685 for the fully supervised UNet trained on seven labelled scenes and 0.1449 for the control model trained on three scenes—representing an MCC increase of approximately 24 % and 44 %, respectively. 
These improvements were accompanied by a substantial reduction in false negatives and a marked increase in recall, indicating improved discrimination under low-contrast, fragmented floe conditions. Our findings demonstrate that SSL reduces annotation requirements for SAR-based sea ice segmentation, improving model generalisation in both consolidated and fragmented ice conditions. This approach offers a scalable solution to the labelling bottleneck in Arctic monitoring and highlights the potential of BYOL as a general pretraining strategy for SAR-based Earth observation image segmentation.
This study proposes a self-supervised learning (SSL) approach to improve ice-water mapping in Sentinel-1 (S1) wide-swath data (EW). The authors correctly identify accurate and abundant labelling of training data as a bottleneck for high-resolution sea ice mapping and hence investigate the benefits of SSL combined with limited training data (3 images). A comparison of their “UNet SSL” with four baseline models (UNet Control, UNet SL, RF, SAM) evaluated on two test images shows improvements in ice-water separation from the SSL.
The paper is overall quite well written, although clarification and rephrasing are needed in several places. The topic is interesting and worth exploring, and the detailed labelling of the 9 images used for training and evaluation appears to be accurate (based on the examples shown) and stands in clear contrast to other commonly used training sets, which are often based on ice charts.
However, there are several issues that should be addressed before publication.
Major / general comments
(1) Lack of thermal noise discussion:
Figures 7 and 8 show the classification results for both test scenes, obtained from the UNet SSL and the four baseline models. The most visually striking feature in almost all results is the S1 thermal noise pattern. Yet thermal noise is not explicitly mentioned at any point in the paper. The results for test scene 2 (Figure 8) in particular are dominated by thermal noise patterns (scalloping and sub-swath boundaries), and the classification results appear to depend strongly on how well each model learns these noise patterns. Although this is a well-known issue for S1 sea ice mapping, the authors do not include thermal noise correction in their pre-processing chain. At the very least, these patterns and their influence on the scores should be discussed.
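For illustration, the standard correction step I have in mind is a simple subtraction of the annotated noise-equivalent sigma zero (NESZ) in the linear-intensity domain before dB conversion. A minimal numpy sketch (the `nesz_lin` noise vectors are assumed to come from the S1 product annotations; the clipping floor is a placeholder choice, not from the manuscript):

```python
import numpy as np

def remove_thermal_noise(sigma0_lin, nesz_lin, floor=1e-6):
    """Subtract the annotated noise-equivalent sigma zero (NESZ)
    from linear-intensity backscatter; clip at a small floor so
    the subsequent dB conversion stays defined."""
    denoised = np.clip(sigma0_lin - nesz_lin, floor, None)
    return 10.0 * np.log10(denoised)

# A pixel well above the noise floor shifts only slightly,
# while a pixel close to it is pulled down strongly.
sigma0_db = remove_thermal_noise(np.array([1e-2, 1.2e-3]),
                                 np.array([1e-3, 1e-3]))
```

More refined EW denoising schemes exist (especially for the sub-swath boundaries), but even this first-order step would likely reduce the scalloping patterns visible in the results.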
(2) Figure quality:
The quality of some figures in their present form is not sufficient and should be improved (in particular Figures 2, 7, and 8). Suggestions for improvement are given in the detailed comments below. In general, I suggest including all figures as vector graphics instead of pixel graphics, as this will greatly improve the quality, especially when zooming in on details.
(3) Limited training and evaluation:
While the authors explain that “the primary contribution of the study lies in demonstrating relative performance gains under reduced annotation budgets, rather than maximising accuracy on a specific benchmark”, I am sceptical that the results (“MCC improvement of 24% and 44%”) are transferable to “more realistic scenarios” with significantly more training data. Generally, all performances presented in Figures 7 and 8 are still quite poor and heavily affected by thermal noise (see comment (1)), and any algorithm moving closer to operational use needs to produce much more accurate and reliable ice-water maps, requiring additional training for the UNet SSL as well as for the baseline methods. While it is worthwhile to demonstrate the potential improvements from the SSL approach (as done in this study), the authors should discuss the transferability/scalability of these improvements when much more training data is used to achieve overall sufficiently good results (with any method).
(4) Random Forest performance:
Although it is only a “baseline” algorithm for comparison, the performance of the RF algorithm is noticeably poor and concerning. The presented RF essentially maps everything as sea ice, without even identifying the dark leads in test scene 1 (Figure 7). Although the deep-learning approaches are probably expected to perform better than a “traditional” RF, this exceptionally poor performance is still quite surprising, since RFs have been used quite successfully for sea ice mapping in previous studies (e.g. Park 2020, Lu 2023). Without any additional explanation, this raises concerns (a) that the RF and its texture features are not well designed and/or (b) that the training data is simply not sufficient for any reasonable assessment. I think at least some discussion/explanation of this poor performance relative to the RFs in previous studies is needed.
(5) “Applicational” motivation of the study:
It is not entirely clear whether the main applicational focus (i.e., why we need improved automated sea ice mapping) is on navigation or environmental studies (or both). The initial mention of “maritime safety” and the explanation of operational ice charting suggest an operational focus on tactical navigation, but later this is repeatedly mixed with the importance of leads for ocean-atmosphere interactions, which suggests an environmental focus. Both applications are valid and important, but they should be clearly distinguished in their explanations.
This is relatively easy to fix, so more a “general” than a “major” comment.
(6) Roughness scales:
Roughness is always a relative term, depending on scale. Whenever mentioning roughness throughout the manuscript, the authors should specify the roughness scale they have in mind: small-scale (on the order of the radar wavelength) or large-scale (on the order of metres). Large-scale roughness could also be referred to as large-scale deformation and is directly related to ice type (deformed FYI or deformed MYI), whereas small-scale roughness can in fact vary significantly within one ice type, especially for young ice with/without frost flowers or finger rafting. It might be worth explicitly mentioning somewhere that the interplay of both small- and large-scale roughness effects contributes to the challenging interpretation of sea ice in SAR imagery.
Also easy to fix and rather “general” than “major”.
(7) Description of the 2 test scenes:
The two selected test scenes are described throughout the manuscript as “consolidated ice pack” (scene 1, Figure 7) and “MIZ” (scene 2, Figure 8). The MIZ is “traditionally” defined as the area with 20-80 % SIC; more recently, physics-based definitions such as “the area affected by waves” have become common. Based on visual inspection of Figure 8, neither definition makes me think that this image lies in the MIZ. The authors should explain the reasoning behind this description of the test scenes and perhaps consider changing it.
Detailed comments
Title and abstract: The term “sea ice segmentation” sometimes refers to ice-water mapping and sometimes to sea ice types (or sometimes both). Please consider a slight adjustment of your title to indicate that you are working towards separating ice and water (not sea ice types). Even after reading the abstract, this remains unclear.
Lines 21 and following: You introduce the term “UNet SSL” here but then keep referring to the “BYOL-pretrained UNet”. Unless I misunderstand, these two terms refer to the same algorithm/model in your study. Please consider sticking with one single term to avoid possible confusion.
Lines 41-42: The cited numbers are from 2018, which is by now 8 years ago. Please consider presenting more recent numbers, especially since the decline in September extent has significantly slowed in the past years (see e.g. https://www.meereisportal.de/en/maps-graphics/sea-ice-trends#gallery-1 or attached png)
Lines 49-50: Time lag is one issue, but so are the subjectivity of the analyst and the overall increase in data availability as more sensors are launched (-> more analysts needed).
Lines 50-53: This statement should be formulated more clearly. The ice charts don’t really lack details of leads or ridges because SAR products cannot resolve them, but rather because mapping individual leads manually is too time-consuming in the operational production chain of most services. Hence, many ice charts include lead information in the form of young ice fractions in the egg codes of each polygon in the chart. However, even rather simple automated products can in fact capture individual leads quite well, still limited, of course, by the sensor resolution (~90x90 m for S1 EW) (e.g. Johansson 2018, Murashkin 2019, Lohse 2024).
Also, while the authors are right that leads and deformation zones are important for ocean-atmosphere interaction, this statement does not really fit the context of ice charts. Here, the leads are important for safe and efficient navigation and route planning.
Line 57: Please quantify the size of fine-scale features. Compared to other sensors, SAR is very good at resolving fine spatial scales (although of course still limited by pulse and Doppler bandwidth, i.e. spatial resolution). If you are referring to the lack of individual leads in labelled data such as ice charts, consider specifying that this is an “ice chart issue” and not necessarily a “SAR issue”.
Line 71: Please specify: Do they struggle with the separation between these two ice types, or with separating these ice types from other types? Consider explaining why.
Line 103: Please specify the roughness scale. I assume you are talking about “small-scale” surface roughness here. Maybe add “(cm-scale or wavelength-scale)” or something similar to avoid confusion with large-scale deformation (sometimes also called roughness).
Lines 129-131: “heavily deformed”, “smoother”, “increases roughness”. Please clarify roughness scales for the different statements.
Lines 134-135: See general comment (5): Until here I was under the impression that the main application focus is on navigation support. If you want to keep both navigation (“automated ice charting”) and environmental studies (“lead detection to study energy balance”) for your motivation, I suggest mentioning both of them quite early on and explaining that “accurate ice type mapping is required for a range of applications, including support of safe navigation as well as environmental studies of ocean-ice-atmosphere interactions” (or something along those lines).
Line 138: Overlapping backscatter signatures from which surface types?
Lines 145-150: If I understand correctly, this reads like it should be a list of 2 research questions, but the paragraph/line break between them seems strange. I suggest listing them as two bullet points or even numbering them as research goals (1) and (2) which you can then explicitly refer to later.
Lines 152-156: See general comment (3): This “relative” comparison of the different models makes sense to some extent. However, I wonder whether the relative improvement you demonstrate later will also hold when much more training data is used overall, which will be needed to achieve better results. In practice, you would probably never use a deep-learning approach for ice-water mapping trained on only seven images. I would like to see this commented on in the discussion.
Lines 171-172: Something missing in the sentence; maybe a “-“ after pack ice?
Figure 1: Legend says “Sea Ice Concentration (m)”, should probably be “(%)”. I also find the legend entry “Label Extents” slightly confusing. I think you are showing the footprints of the S1 EW scenes used in the study? Please consider changing the label to “S1 footprints” or something similar.
Maybe also consider colour-coding footprints as “test scenes”, “full training set (7 images)”, and “small training set (3 images)” or similar.
Sentinel-1 SAR imagery:
This entire section needs some clarification.
Lines 229-230: Figures should be numbered in order of appearance. You refer to Figures 7 and 8 before Figure 2.
Lines 245-249: Please add the acquisition date (season) for the validation scenes. I am aware that they are in Table 1, but I think it is worth repeating them in the text here.
Figure 2: I appreciate that you are showing the example of the training data and compare to other already published sets. While the advantages of your detailed manual labelling compared to the other methods become clear, the figure in its current form needs multiple changes/improvements:
Figures 7 and 8: (commented on here because this is where they are first referred to; some of the comments below relate to the results part of the figures)
The figure quality in its current form is not good. Since you are showing a range of different panels, I do not see the need to rotate the figure by 90°, which makes the individual panels smaller and leaves half the page empty. Please also insert vector graphics to maintain better quality when zooming in on details.
The manually selected labels look convincing.
The performance of the RF makes me question the training and design of the RF, and whether it can be considered a fair comparison; please see general comment (4).
Finally, we see a lot of thermal noise effects across the classification results of all UNet approaches in Figure 8; please see general comment (1).
Lines 344-345: What exactly do you mean by selecting scenes based on “quality”?
Line 374: Consider rephrasing “raw HH and HV backscatter” to “HH and HV backscatter intensities” or a similarly more precise description. “Raw” backscatter in SAR usually refers to the unfocused image.
Lines 373-377: Good to see that you are using texture features in the RF; this makes the comparison fairer. Please add information on the choice and design of the texture features, e.g. GLCM parameters such as distance, angle, window size, and discretisation. These choices are critical for good ice type separation (e.g. Zakhvatkina 2017, Karvonen 2017, Park 2020, Lohse 2021, Khachatrian 2021).
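To make concrete which parameters I mean, here is a hand-rolled numpy sketch of a single GLCM statistic (contrast) with the critical choices exposed; all parameter values are placeholders, not the authors' settings, and in a full feature computation this would run in a sliding window (window size being another choice to report):

```python
import numpy as np

def glcm_contrast(patch, levels=16, offset=(0, 1)):
    """GLCM contrast for one patch. `levels` is the grey-level
    discretisation; `offset` = (row step, col step) encodes the
    distance and angle of co-occurring pixel pairs."""
    edges = np.linspace(patch.min(), patch.max(), levels + 1)[1:-1]
    q = np.digitize(patch, edges)               # grey levels 0 .. levels-1
    dr, dc = offset                             # assumed non-negative here
    a = q[:q.shape[0] - dr, :q.shape[1] - dc]   # reference pixels
    b = q[dr:, dc:]                             # offset neighbours
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (a.ravel(), b.ravel()), 1)  # count co-occurring pairs
    glcm /= glcm.sum()                          # normalise to probabilities
    i, j = np.indices((levels, levels))
    return float(((i - j) ** 2 * glcm).sum())
```

Reporting exactly these choices (levels, distance, angle(s), window size) would make the RF setup reproducible and allow comparison with the cited studies.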
Lines 386-388: Please specify the scaling (min/max values) used when mapping HH, HV, and HH/HV to 8-bit RGB channels.
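For clarity, I assume something like the following clipped linear rescaling is meant; the dB clip ranges below are hypothetical placeholders, and the manuscript should report its actual values:

```python
import numpy as np

def to_uint8(band_db, vmin, vmax):
    """Linearly rescale a dB band into [0, 255] with clipping."""
    x = np.clip((band_db - vmin) / (vmax - vmin), 0.0, 1.0)
    return np.round(x * 255).astype(np.uint8)

# Hypothetical clip ranges -- not taken from the manuscript.
hh = np.random.uniform(-35.0, 0.0, (64, 64))
hv = np.random.uniform(-40.0, -5.0, (64, 64))
rgb = np.stack([to_uint8(hh, -30.0, 0.0),
                to_uint8(hv, -35.0, -5.0),
                to_uint8(hh - hv, -5.0, 15.0)], axis=-1)
```

The chosen min/max values directly control how much ice-water contrast survives the 8-bit quantisation, so they matter for the SAM comparison in particular.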
Experiment design: Most commonly, I would expect the 9 labelled images to be split into 3 sets: train, test, and validation. What you call the test set (the two images kept aside) would then be the validation set, whereas the remaining 7 images would be split into training (to fit the model weights) and testing (to avoid overfitting). Please comment on/specify why you decided to split into only 2 sets and how you avoid overfitting.
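As a sketch of the 3-way split I have in mind (scene IDs and the random seed are placeholders; the naming follows my comment above, with the paper's held-out pair relabelled as validation):

```python
import random

# Placeholder IDs for the study's nine labelled S1 scenes.
scenes = [f"scene_{i}" for i in range(1, 10)]
rng = random.Random(42)
rng.shuffle(scenes)

validation = scenes[:2]  # kept fully aside, used once for final evaluation
test = scenes[2:4]       # monitored during training to detect overfitting
train = scenes[4:]       # five scenes used to fit the model weights
```

Splitting at the scene level (not the patch level) is important here, since patches from the same scene share noise patterns and ice conditions.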
Figure 3: The visualization seems to be missing some connections, e.g. the “labelled dataset” is also the input for the RF and SAM, not just for the UNet (SL).
Also, colours for “SAM” and “Compare Models” appear very similar, please consider adjusting one of them.
Figure 4: The label “HH, HV, 1024” for the input layer does not seem entirely accurate. I assume the three layers shown are HH, HV, and training labels, while the size of the input patches is 1024x1024?
Figures 5 and 6: The main message the reader is supposed to take from these figures is not entirely clear to me. You refer to them in the “model intercomparison” section on page 13 (lines 327-343), but I do not really understand what I am supposed to learn from them. I am sure there was a clear idea behind showing them; please consider stating the main message more explicitly.
Model performance across ice types and HH backscatter: In addition to the different ice regimes, I think you need to associate the backscatter bins with different IA regimes. E.g. in Figure 7, the overall decrease of HH sigma_0 across the swath is clearly visible. This should at least be included in the discussion of the results presented in Figure 9 (please see also the previous comment on the IA sensitivity of sigma_0 in the data section). Generally, you should be careful with any over-interpretation of this figure, since you do not account for HV at all in this analysis; however, based on the noise patterns in Figures 7 and 8, there is a clear influence/contribution of HV.
Lines 559-562: Please make clear that these interpretations are only valid for the test scene shown. I don’t think you can generally associate the ice-water transition with a strong contrast in HH sigma_0, as sigma_0 is ice-type dependent and, more importantly, highly wind-state dependent for open water.
Lines 563-572: The description and discussion of the MCC-vs-HH(dB) graphs for test scene 2 must include thermal noise, which is clearly visible in the results of all methods (Figure 8), both as scalloping in sub-swath EW1 and at the sub-swath boundaries. Due to its lower signal strength, HV is much more affected by thermal noise and should therefore be included in the visualizations in Figures 7 and 8. Many of the differences associated with the HH bins in Figure 9 may in fact be caused by noise effects or by variation in HV, which is neither shown nor discussed here.
Figure 9: Better figure quality than many of the other figures. However, while this presentation of the MCC may contain useful information, I do not think it can be interpreted without accounting for the HV channel.
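The same binned-MCC analysis could simply be repeated over HV (and incidence angle) to separate noise-driven effects from genuine HH dependence. A minimal sketch, with placeholder bin edges (the MCC formula itself is standard):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 = water, 1 = ice)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom else 0.0

def binned_mcc(y_true, y_pred, channel_db, edges):
    """MCC per backscatter bin; run once with HH and once with HV to see
    whether apparent HH effects are in fact driven by the HV channel."""
    idx = np.digitize(channel_db, edges)
    return [mcc(y_true[idx == k], y_pred[idx == k]) for k in range(len(edges) + 1)]
```

A two-panel (HH-binned vs HV-binned) version of Figure 9 would make the noise argument above testable directly.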
Lines 637-543: Although the limitations of RFs are pointed out correctly, I remain puzzled that the RF in this study almost completely fails to detect even the very dark (in HH) lead structures. Additional information on, and discussion of, the parameters used to compute the textural features might help to explain this.
Subsection 5.3: If I understand this section correctly, I would not call it a “comparison”. It rather provides reasoning for why the alternative self-supervised models are not implemented and tested in this study. Consider rephrasing it in the discussion, or moving it into the introduction and method sections to strengthen the reasoning for the choice of BYOL.
Line 690: The Park (2020) study cited here is a good example of a RF classifier producing much better results than the RF in this study.
Lines 742-755: Please choose your wording such that these statements apply only to your two example images. The general claim of better contrast between ice (“bright”) and leads (“dark”) in the consolidated pack ice region, compared to more overlapping signatures in the MIZ, may be true for the two examples discussed here but does not necessarily hold in general. Even within the pack ice, wind-roughened leads may appear bright in HH and overlap significantly with sea ice backscatter signatures; HV will then be critical for distinguishing ice and water. On the other hand, brash ice (heavily deformed -> strong backscatter) and calm water (in calm wind conditions) can be easily separable in the MIZ and close to the ice edge. I think you should be more careful with general statements on differences in model performance and ice-water separability based on the two examples selected here.