Exploring the effect of training set size and number of categories on ice crystal classification through a contrastive semi-supervised learning algorithm
Abstract. The shapes of ice crystals play an important role in global precipitation formation and in Earth's radiation budget. Classifying ice crystal shapes can improve our understanding of in-cloud conditions and of these processes. However, existing classification methods either rely on hand-picked features such as the aspect ratio of ice crystals or the environmental temperature, which makes the classification performance unstable, or employ supervised machine learning algorithms that depend heavily on human labeling. This poses significant challenges, including human subjectivity in classification and the substantial labor cost of manual labeling. In addition, previous deep learning algorithms for ice crystal classification are often trained and evaluated on datasets of varying sizes and with different classification schemes, each with its own criteria and number of categories, which makes a fair comparison of algorithm performance difficult. To overcome these limitations, a contrastive semi-supervised learning (CSSL) algorithm for the classification of ice crystals is proposed. The algorithm consists of an upstream unsupervised learning network that extracts meaningful representations from a large number of unlabeled ice crystal images, and a downstream supervised network that is fine-tuned on a small labeled subset of the entire dataset to perform the classification task. To determine the minimum number of ice crystal images that require human labeling, balancing algorithm performance against manual labeling effort, the algorithm is trained and evaluated on datasets of varying sizes and numbers of categories. The ice crystal data used in this study were collected during the NASCENT campaign at Ny-Ålesund and the CLOUDLAB project on the Swiss Plateau, using a holographic imager mounted on a tethered balloon system. In general, the CSSL algorithm outperforms a purely supervised algorithm in classifying 19 categories. Approximately 154 hours of manual labeling can be avoided by using just 11 % (2048 images) of the training set for fine-tuning, at the cost of only 3.8 % in overall precision compared to a fully supervised model trained on the entire dataset. In the 4-category classification task, the CSSL algorithm also outperforms the purely supervised algorithm: fine-tuned on 2048 images (25 % of the entire 4-category dataset), it achieves an overall accuracy of 89.6 %, comparable to that of the purely supervised algorithm trained on 8192 images (91.0 %). Moreover, when tested on the unseen CLOUDLAB dataset, the CSSL algorithm exhibits significantly stronger generalization than the supervised approach, with an average improvement of 2.19 % in accuracy. These results highlight the strength and practical effectiveness of CSSL compared to purely supervised methods, as well as its potential to perform well on datasets collected under different conditions.
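To make the two-stage design described above concrete, the following minimal sketch (in PyTorch) illustrates the general pattern of contrastive pretraining on unlabeled images followed by supervised fine-tuning on a small labeled subset. It is an illustration under stated assumptions, not the implementation used in this study: the SimCLR-style NT-Xent objective, the toy convolutional encoder, the noise-based augmentations, and all layer sizes and hyperparameters are placeholders.

```python
# Minimal sketch of a two-stage contrastive semi-supervised pipeline.
# All architecture and hyperparameter choices below are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy convolutional encoder standing in for the actual backbone."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Sequential(nn.Linear(64, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, feat_dim))

    def forward(self, x):
        h = self.conv(x).flatten(1)   # representation reused downstream
        z = self.proj(h)              # projection used only for the contrastive loss
        return h, z

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style NT-Xent loss: each view's positive is the other view."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2n, d)
    sim = z @ z.t() / tau                                     # cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Stage 1: unsupervised contrastive pretraining on unlabeled crystal images.
enc = Encoder()
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
unlabeled = torch.rand(64, 1, 64, 64)                     # placeholder images
view1 = unlabeled + 0.05 * torch.randn_like(unlabeled)    # stand-in augmentation
view2 = unlabeled + 0.05 * torch.randn_like(unlabeled)
_, z1 = enc(view1)
_, z2 = enc(view2)
nt_xent(z1, z2).backward()
opt.step()

# Stage 2: supervised fine-tuning of a classifier head on a small labeled
# subset (e.g., the ~2048 labeled images mentioned in the abstract).
head = nn.Linear(64, 19)                                  # 19 crystal categories
ft_opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()), lr=1e-4)
images, labels = torch.rand(32, 1, 64, 64), torch.randint(0, 19, (32,))
h, _ = enc(images)
F.cross_entropy(head(h), labels).backward()
ft_opt.step()
```

In this kind of pipeline, the projection head serves only the contrastive objective and is discarded afterwards; the classifier is fine-tuned on the encoder representation itself, which is why pretraining on abundant unlabeled images can reduce the number of labeled images the downstream task requires.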