CloudMViT: Cloud Classification Using Ground-Based Remote Sensing Imagery and a Lightweight Hybrid Architecture
Abstract. Ground-based remote sensing cloud image data can be used to analyze regional cloud type variation trends, thereby predicting future water resource supply capacity. However, existing cloud classification methods based on ground-based remote sensing imagery often suffer from limited recognition accuracy due to insufficient fine-grained feature extraction, and their large model parameter counts hinder deployment on embedded terminals. To address these issues, this study proposes CloudMViT, a lightweight hybrid network architecture fusing a dual-pooling channel attention module and cross-scale self-attention, which enhances both local and global feature representation of cloud images while optimizing computational efficiency. Specifically, the model suppresses sky background interference and strengthens cloud edge features via the dual-pooling channel attention module that combines global average pooling (GAP) and global max pooling (GMP); captures cross-channel detailed features (e.g., cirrus fibril structures and stratocumulus shadows) using depthwise separable convolution and a decoupling mechanism; and further reduces model parameters by introducing CloudGhost cascade compression technology through linear feature redundancy elimination. Experiments on the World Meteorological Organization (WMO)-compliant HBMCD (10 standard cloud genera) and GCD (7 sky conditions) datasets demonstrate that CloudMViT achieves classification accuracies of 98.81 % and 95.13 %, respectively, significantly outperforming lightweight models such as MobileViT and EfficientNet. Ablation experiments validate the effectiveness of the dual-pooling channel attention module (improving accuracy by 5.31 %) and the CloudGhost module (increasing inference speed by 50 %). When deployed on the RK3588 embedded platform, the INT8-quantized CloudMViT enables real-time inference, maintaining an accuracy of 94.79 % with only 0.47 MB of memory occupation.
The proposed cloud classification method and hardware acceleration strategy provide a feasible solution for the development of portable ground-based cloud observation and classification devices.
This manuscript presents impressive work on ground-based cloud classification. It proposes CloudMViT, an innovative lightweight hybrid architecture that effectively integrates a dual-pooling channel attention module and cross-scale self-attention. The model achieves high classification accuracies on two high-quality datasets, the fully WMO-compliant HBMCD and the GCD dataset, which contains multiple WMO-standard cloud genera, and it demonstrates practical viability through successful deployment on the RK3588 embedded platform. To further strengthen the manuscript's impact and clarity, the authors should add an introduction to the cloud categories and the cloud image datasets, provide a deeper physical interpretation of the model's behavior, give a detailed analysis of the quantization-induced accuracy loss, and resolve a few points of symbol and table consistency. I recommend publication after these minor enhancements.
Major Comments
Specific Comments
1. Many citation markers are missing from the text. For example, "Wang et al." at line 55 has no citation marker, and in Section 2 (Method) I could not find any references accompanying the introduction of each method.
2. In Section 3.2, the manuscript states "total number of iterations was 100." However, deep learning training is typically described in "epochs" rather than "iterations." The authors should clarify the intended meaning and unify the terminology to avoid confusion.
3. In Fig. 1, σ points to the output of the sigmoid activation function. In Eq. (1), η=σ(Vkγ), where σ already denotes the sigmoid function. However, Fig. 1 also shows a multiplication operation (⊗) after σ, which is not reflected in Eq. (1). The authors should check and correct this inconsistency.
4. The Conclusion states that the inference time is "only 0.006 s," but Table 10 lists an inference time of 30.14 s for CloudMViT (total time). Readers cannot easily derive the per-image inference time of 0.006 s. The authors should verify and correct the data, or explicitly state the calculation basis for the per-image inference time.
5. In Eq. (4) and Eq. (5), the parameters b and r are only stated as "set to 1 and 2" in the text, but their physical meaning or derivation is not explained. Readers cannot understand why these specific values were chosen. The authors should clarify the physical significance or the basis for determining these parameters.
6. Fig. 6 labels step1, step2, and step3. Section 2.5 describes them as "Step 1: Extraction of local detailed features," "Step 2: Extraction of global features," and "Step 3: Enhancement of local detailed features." However, step1 and step3 have identical structures in the figure, and step2 is positioned in the middle. The text states that step3 "repeats step1," but the spatial arrangement in the figure conflicts with the semantic meaning of "repetition." The authors should revise either the figure or the text to ensure consistency.
7. The "Inference Time/s" column in Table 10 shows large values (e.g., 30.14 s for CloudMViT), but the manuscript does not specify that this is the total time for the entire test set. Readers may mistakenly interpret it as the per-image inference time, leading to misunderstanding of the model's speed. The authors should clearly label the column as "Total Inference Time on test set" in the table header or footnote, and also provide the per-image inference time in the main text.
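To illustrate the calculation basis requested in comments 4 and 7, the short sketch below shows how a per-image time would follow from a total time. It assumes that the 30.14 s in Table 10 is the total over the test set and that the 0.006 s in the Conclusion is a per-image average; the function names are hypothetical, and the image count it prints is only what those two figures would imply, not a number reported in the manuscript.

```python
def per_image_time(total_seconds: float, num_images: int) -> float:
    """Average inference time per image, given the total time over a test set."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return total_seconds / num_images


def implied_image_count(total_seconds: float, per_image_seconds: float) -> int:
    """Number of images implied by a total time and a per-image time (rounded)."""
    if per_image_seconds <= 0:
        raise ValueError("per_image_seconds must be positive")
    return round(total_seconds / per_image_seconds)


if __name__ == "__main__":
    total = 30.14      # s, CloudMViT entry in Table 10 (assumed to be a test-set total)
    reported = 0.006   # s, value quoted in the Conclusion (assumed to be per image)

    # Image count that would make the two figures consistent (an inference, not a reported value).
    n = implied_image_count(total, reported)
    print(f"implied test-set size: {n} images")
    print(f"per-image time at that size: {per_image_time(total, n):.4f} s")
```

If the actual test-set size differs substantially from the implied count, the two reported figures cannot both be correct, which is the check the authors are asked to make explicit.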