the Creative Commons Attribution 4.0 License.
A comparative analysis of deep learning models for classifying shallow mesoscale cloud patterns in satellite images
Abstract. Representing clouds in climate models is challenging, not least because of their heterogeneous spatial structures and dynamic behavior. This study explores the potential of advanced machine learning (ML) techniques to identify and categorize mesoscale low-level cloud structures in satellite imagery, with particular emphasis on patterns that are frequently observed over the trade wind regions of the south Atlantic Ocean.
Rectified Level 1.5 satellite images from the Spinning Enhanced Visible and Infrared Imager (SEVIRI) for the year 2021 are used for the analysis. To assess the potential gains in classification accuracy when labeled data are limited, several deep learning approaches are evaluated. The analysis considers a custom-built convolutional neural network, a pre-trained 50-layer residual neural network adapted through transfer learning using EuroSat, and a self-supervised vision transformer framework known as DINOv2 (self-distillation with no labels, version 2). The embeddings, i.e. the feature representations yielded by DINOv2, are used in two separate approaches, one based on manually labeled data and the other using the k-means clustering algorithm.
The results show that combining the DINOv2 model with a multilayer perceptron and training on labeled data achieves the highest cloud pattern classification accuracy among the evaluated ML approaches.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2026-915', Anonymous Referee #1, 04 Jun 2026
The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2026/egusphere-2026-915/egusphere-2026-915-RC1-supplement.pdf
Citation: https://doi.org/10.5194/egusphere-2026-915-RC1
RC2: 'Comment on egusphere-2026-915', Anonymous Referee #2, 27 Jun 2026
This manuscript compares CNN, EuroSat transfer-learning ResNet50, supervised DINOv2+MLP, and DINOv2 feature-based clustering approaches for classifying shallow mesoscale cloud patterns in SEVIRI imagery. The research question is clear, the dataset and benchmark design are useful, and the results consistently suggest that DINOv2 features are more robust under domain shift. The manuscript is generally well structured and the figures and tables are comprehensive. However, several aspects require clarification, especially data splitting, annotation reliability, model comparability, and reproducibility. I recommend minor revision.
Major Minor-Revision Comments
- Clarify whether all models were evaluated on exactly the same test set.
  Sections 4.2 and 6.1 state that all models use a unified 15% hold-out test set, but the class counts in Tables 5/6 differ from those in Table 7. For example, Closed Cell has a count of 71 for CNN/ResNet50 but 98 for DINOv2+MLP. This affects direct comparison of Top-1/Top-2 and per-class metrics. The authors should explain this discrepancy or re-report all supervised model results on an identical test set.
- Provide more detail on the manual annotation process and label consistency.
  The dataset was annotated by a meteorologist, and the Discussion notes that some annotations are incomplete or inconsistent. The authors should describe the annotation criteria, how ambiguous cases were handled, whether any quality control or second review was performed, and whether inter-annotator agreement was assessed. If only one annotator was used, this should be explicitly acknowledged as a limitation.
- Describe the benchmark domain shift more concretely.
  The manuscript states that benchmark images are larger and differ in annotation conditions, causing domain shift. Please provide more information on the benchmark sampling strategy, date/month distribution, bounding-box size distribution, class distribution, and how these differ from the original dataset. If possible, a simple stratified analysis by bounding-box size or month would strengthen the claim.
- Add uncertainty estimates or significance testing for model comparisons.
  On the benchmark set, DINOv2+MLP achieves a Top-1 accuracy of 0.61, compared with 0.56 for both CNN and ResNet50. The direction is clear, but the margin is modest. Bootstrap confidence intervals, McNemar tests, or at least macro/weighted averages would make the comparison more convincing.
- Improve reproducibility details for DINOv2 and clustering.
  Please specify the exact DINOv2 version/model size, embedding dimension, input normalization, k-means initialization settings, number of runs, random seeds, and whether PCA or feature standardization was used. Also, Tables 11/12 report cluster-label prediction performance, which is not equivalent to semantic cloud-class accuracy; this distinction should be made clearer.
- Strengthen the code and data availability statement.
  The current statement says that the code will be made public after the AI4PEX project and can meanwhile be provided upon request. At minimum, the authors should provide model configurations, training/test split files, random seeds, annotation files, or a reproducibility package. If full release is not yet possible, the reason and expected timeline should be stated.
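As an illustration of the uncertainty estimates requested in the fourth comment, the following is a minimal sketch, not taken from the manuscript, of a paired bootstrap confidence interval for the Top-1 accuracy difference and a two-sided exact McNemar test. It assumes per-image predictions from two models on an identical test set; the function names and the toy labels are hypothetical.

```python
import math
import random


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B).

    Test items are resampled with replacement; the same resampled indices
    are used for both models so that the pairing is preserved.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    correct_a = [int(t == p) for t, p in zip(y_true, pred_a)]
    correct_b = [int(t == p) for t, p in zip(y_true, pred_b)]
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact (binomial) McNemar test on the discordant pairs.

    Returns (b, c, p_value), where b counts items only model A gets right
    and c counts items only model B gets right.
    """
    b = sum(t == a and t != p for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(t != a and t == p for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy labels: ten test images, all of true class 0.
y_true = [0] * 10
pred_a = [0] * 8 + [1] * 2   # model A: 8 of 10 correct
pred_b = [0] * 3 + [1] * 7   # model B: 3 of 10 correct

print(paired_bootstrap_diff(y_true, pred_a, pred_b))
print(mcnemar_exact(y_true, pred_a, pred_b))
```

Both procedures require that the two models be scored on the same images, which is why the test-set discrepancy raised in the first comment would need to be resolved before such a comparison is meaningful.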
Minor Comments
- The abstract would benefit from including the main benchmark result, such as the Top-1/Top-2 accuracy of DINOv2+MLP.
- Tables 13/14 should also report macro-F1 or balanced accuracy, since per-class performance varies substantially.
- Figure 10 is dense, and some Wasserstein/Silhouette labels are difficult to read. Please enlarge the font or move some panels to supplementary material.
- Please standardize terminology, such as “data set” vs. “dataset,” “DINOv2+MLP” vs. “supervised DINOv2+MLP,” and “Custom CNN” vs. “CNN.”
- Several minor language issues should be corrected, for example “Each metric capture” should be “Each metric captures,” and “class class” is repeated in the caption of Table 14.
- The Discussion would benefit from a clearer distinction between transfer learning from general remote-sensing imagery and transfer learning from cloud/atmospheric imagery.
- Please define how Top-2 accuracy is computed, especially in cases where class probabilities are close or tied.
- The conclusion that cloud patterns “can be effectively classified” should be softened, since the best benchmark Top-1 accuracy is only 0.61. A more cautious phrasing would be “moderately classified under domain shift, with DINOv2 showing the strongest robustness.”
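To make the requests on Top-2 accuracy and on macro-F1 or balanced accuracy concrete, here is a minimal sketch of one possible set of definitions. It is not the manuscript's method: the rule that ties are broken in favor of the lower class index is an assumption chosen only so that the result is deterministic, and the function names and toy probabilities are hypothetical.

```python
def top_k_accuracy(prob_rows, y_true, k=2):
    """Fraction of samples whose true class is among the k most probable classes.

    Assumed convention: classes are ranked by descending probability and
    ties are broken by the lower class index.
    """
    hits = 0
    for probs, true_idx in zip(prob_rows, y_true):
        ranked = sorted(range(len(probs)), key=lambda j: (-probs[j], j))
        hits += true_idx in ranked[:k]
    return hits / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (F1 = 0 for a class with no hits)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(t == c for t in y_true)
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)


# Hypothetical class probabilities for four images and three classes.
probs = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],  # tie between classes 0 and 1
    [0.2, 0.4, 0.4],  # tie between classes 1 and 2
    [0.1, 0.2, 0.7],
]
y_true = [0, 1, 2, 0]
# Top-1 predictions under the same tie-breaking rule.
y_pred = [min(range(len(p)), key=lambda j: (-p[j], j)) for p in probs]

print(top_k_accuracy(probs, y_true, k=1), top_k_accuracy(probs, y_true, k=2))
print(macro_f1(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

With tied probabilities, the tie-breaking rule can change which images count as Top-1 hits, which is one reason an explicit definition helps when comparing models.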
Citation: https://doi.org/10.5194/egusphere-2026-915-RC2
Viewed
- HTML: 224
- PDF: 131
- XML: 17
- Total: 372
- BibTeX: 15
- EndNote: 22