CloudMViT: Cloud Classification Using Ground-Based Remote Sensing Imagery and a Lightweight Hybrid Architecture

Xu, Wei; Wu, Ningning; Feng, Lin

doi:10.5194/egusphere-2026-1512

Preprints

https://doi.org/10.5194/egusphere-2026-1512

11 May 2026

Status: this preprint is open for discussion and under review for Atmospheric Measurement Techniques (AMT).

Abstract. Ground-based remote sensing cloud image data can be used to analyze regional trends in cloud type variation and thereby predict future water resource supply capacity. However, existing cloud classification methods based on ground-based remote sensing imagery often suffer from limited recognition accuracy due to insufficient fine-grained feature extraction, and their large parameter counts hinder deployment on embedded terminals. To address these issues, this study proposes CloudMViT, a lightweight hybrid network architecture that fuses a dual-pooling channel attention module with cross-scale self-attention, enhancing both local and global feature representation of cloud images while optimizing computational efficiency. Specifically, the model suppresses sky background interference and strengthens cloud edge features via the dual-pooling channel attention module, which combines global average pooling (GAP) and global max pooling (GMP); captures cross-channel detailed features (e.g., cirrus fibril structures and stratocumulus shadows) using depthwise separable convolution and a decoupling mechanism; and further reduces model parameters by introducing CloudGhost cascade compression technology, which eliminates linear feature redundancy. Experiments on the World Meteorological Organization (WMO)-compliant HBMCD (10 standard cloud genera) and GCD (7 sky conditions) datasets demonstrate that CloudMViT achieves classification accuracies of 98.81 % and 95.13 %, respectively, significantly outperforming lightweight models such as MobileViT and EfficientNet. Ablation experiments validate the effectiveness of the dual-pooling channel attention module (improving accuracy by 5.31 %) and the CloudGhost module (increasing inference speed by 50 %). When deployed on the RK3588 embedded platform, the INT8-quantized CloudMViT enables real-time inference, maintaining an accuracy of 94.79 % with a memory footprint of only 0.47 MB. The proposed cloud classification method and hardware acceleration strategy provide a feasible solution for the development of portable ground-based cloud observation and classification devices.
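
The dual-pooling channel attention idea described in the abstract can be illustrated with a minimal sketch in plain Python. The abstract only states that the module combines GAP and GMP; the fusion by addition, the ECA-style 1D convolution across channels, and the names `dual_pool_channel_attention` and `kernel` are assumptions made for illustration and are not taken from the manuscript.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dual_pool_channel_attention(feature_map, kernel):
    """feature_map: list of C channels, each an H x W list of lists.
    kernel: odd-length list of 1D convolution weights shared across channels."""
    # Global average pooling and global max pooling: one scalar each per channel.
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    gmp = [max(map(max, ch)) for ch in feature_map]
    # Assumption: the two descriptors are fused by simple addition.
    desc = [a + m for a, m in zip(gap, gmp)]
    # Assumption: ECA-style 1D convolution over neighbouring channels, zero padded.
    pad = len(kernel) // 2
    padded = [0.0] * pad + desc + [0.0] * pad
    weights = [
        sigmoid(sum(w * padded[c + j] for j, w in enumerate(kernel)))
        for c in range(len(desc))
    ]
    # Rescale every channel by its attention weight.
    scaled = [[[v * wt for v in row] for row in ch]
              for ch, wt in zip(feature_map, weights)]
    return weights, scaled
```

In this sketch a channel with a strong peak response (large GMP) receives a weight closer to 1 than a flat channel with the same mean, which is one plausible way a max-pooled descriptor could emphasize cloud edges over uniform sky background.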

Received: 19 Mar 2026 – Discussion started: 11 May 2026

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 12 Jul 2026)

RC1: 'Comment on egusphere-2026-1512', Anonymous Referee #1, 08 Jun 2026
This manuscript presents impressive work on ground-based cloud classification. It proposes CloudMViT, an innovative lightweight hybrid architecture that effectively integrates a dual-pooling channel attention module and cross-scale self-attention. The model achieves high classification accuracies on two high-quality datasets, the fully WMO-compliant HBMCD and the GCD dataset, which contains multiple WMO-standard cloud genera, and demonstrates excellent practical viability through successful deployment on the RK3588 embedded platform. To further strengthen the manuscript's impact and clarity, it would be beneficial to add an introduction to the cloud categories and the cloud image datasets, provide a deeper physical interpretation of the model's behavior, include a detailed analysis of quantization-induced accuracy loss, and clarify a few points regarding symbol and table consistency. I recommend publication after these minor enhancements.
Major Comments
The manuscript should add a section to introduce the cloud categories for classification and the cloud image datasets.

The manuscript discusses the overall accuracy drop of 4.02% after INT8 quantization but does not analyze, from a physical perspective, which cloud categories suffer the most significant loss. Specifically, the recall of cirrocumulus (Cc) drops dramatically from 97.39% to 87.06%, yet the physical reason for this degradation is not explained. The authors should provide a detailed analysis of why cirrocumulus clouds are more sensitive to quantization errors, and how the quantization process destroys their physical texture features. A quantitative comparison of recall changes across all cloud categories is strongly recommended.
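
The per-category comparison requested here could be tabulated along the following lines. This is a minimal sketch, assuming confusion matrices are available for the full-precision and INT8 models; the function names are hypothetical, and the only values taken from the comment are the two Cc recalls.

```python
def recall_per_class(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    out = {}
    for i, name in enumerate(labels):
        total = sum(confusion[i])
        out[name] = 100.0 * confusion[i][i] / total if total else float("nan")
    return out

def recall_change(fp32_recall, int8_recall):
    """Return (class, change in percentage points), largest loss first."""
    deltas = {k: int8_recall[k] - fp32_recall[k] for k in fp32_recall}
    return sorted(deltas.items(), key=lambda kv: kv[1])

# Only the cirrocumulus (Cc) recalls are quoted in the comment.
print(recall_change({"Cc": 97.39}, {"Cc": 87.06}))
```

For Cc this gives a loss of about 10.33 percentage points, more than twice the overall accuracy drop of 4.02 points (98.81 % to 94.79 %).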

The manuscript mentions that the ECA-DP module suppresses background interference via GMP, but does not explain from a physical perspective why traditional CNNs (e.g., MobileNet) struggle with blurred cloud-sky boundaries. Furthermore, how does the self-attention mechanism in CloudMViT physically capture such gradual transitions? The authors should supplement the description of the physical characteristics of cloud edges and clarify how specific modules respond to these physical properties.

The manuscript states that "DW convolution can extract subtle edge differences between cirrus (Ci) and cirrocumulus (Cc)," but does not specify how the physical texture features of these clouds are represented in the model. The authors should add a discussion explaining how specific modules (e.g., the local receptive field of depthwise separable convolution) physically match the scale characteristics of these special textures.
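
One way to make the scale-matching argument concrete is to compute the receptive field of the convolution stack and compare it with the pixel size of the textures in question. The sketch below uses the standard receptive-field recurrence; the layer list is a hypothetical example, not the CloudMViT configuration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs in forward order.
    Returns the receptive field, in input pixels, of one output unit."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # spacing between adjacent units, in input pixels
    return rf

# Hypothetical stack: a stride-2 3x3 stem followed by two 3x3 depthwise convolutions.
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11
```

A 1x1 pointwise convolution has kernel size 1 and therefore leaves the receptive field unchanged, so in a depthwise separable block only the depthwise part determines the spatial scale that the block can resolve.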

Specific Comments
1. Many citation markers are missing from the text; for example, line 55 (Wang et al.), and in Section 2 (Method) no references accompany the introduction of each method.
2. In Section 3.2, the manuscript states "total number of iterations was 100." However, deep learning training typically uses "epochs" rather than "iterations." The authors should clarify the intended meaning and unify the terminology to avoid confusion.
3. In Fig. 1, σ points to the output of the sigmoid activation function. In Eq. (1), η=σ(Vkγ), where σ already denotes the sigmoid function. However, Fig. 1 also shows a multiplication operation (⊗) after σ, which is not reflected in Eq. (1). The authors should check and correct this inconsistency.
4. The Conclusion states that the inference time is "only 0.006 s," but Table 10 lists an inference time of 30.14 s for CloudMViT (total time). Readers cannot easily derive the per-image inference time of 0.006 s. The authors should verify and correct the data, or explicitly state the calculation basis for the per-image inference time.
5. In Eq. (4) and Eq. (5), the parameters b and r are only stated as "set to 1 and 2" in the text, but their physical meaning or derivation is not explained. Readers cannot understand why these specific values were chosen. The authors should clarify the physical significance or the basis for determining these parameters.
6. Fig. 6 labels step1, step2, and step3. Section 2.5 describes them as "Step 1: Extraction of local detailed features," "Step 2: Extraction of global features," and "Step 3: Enhancement of local detailed features." However, step1 and step3 have identical structures in the figure, and step2 is positioned in the middle. The text states that step3 "repeats step1," but the spatial arrangement in the figure conflicts with the semantic meaning of "repetition." The authors should revise either the figure or the text to ensure consistency.
7. The "Inference Time/s" column in Table 10 shows large values (e.g., 30.14 s for CloudMViT), but the manuscript does not specify that this is the total time for the entire test set. Readers may mistakenly interpret it as the per-image inference time, leading to misunderstanding of the model's speed. The authors should clearly label the column as "Total Inference Time on test set" in the table header or footnote, and also provide the per-image inference time in the main text.
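
The arithmetic behind comments 2, 4 and 7 can be sketched as follows. The test-set size and the training batch size are not given on this page, so the image count below is only what the two quoted figures would imply, and the helper names are hypothetical.

```python
import math

def per_image_time(total_seconds, n_images):
    """Average inference time per image, given the total time for a test set."""
    return total_seconds / n_images

def implied_image_count(total_seconds, seconds_per_image):
    """Number of images that would reconcile a total time with a per-image time."""
    return total_seconds / seconds_per_image

def iterations_for(epochs, n_train, batch_size):
    """Optimizer steps performed in `epochs` full passes over n_train samples."""
    return epochs * math.ceil(n_train / batch_size)

# 30.14 s (Table 10) and 0.006 s (Conclusion) are the two figures quoted above.
print(round(implied_image_count(30.14, 0.006)))
```

The two figures are mutually consistent only if the test set holds roughly 5,000 images, which is the calculation basis the manuscript would need to state. The last helper shows why "100 iterations" and "100 epochs" are not interchangeable: the number of iterations scales with the number of batches per epoch.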

Citation: https://doi.org/10.5194/egusphere-2026-1512-RC1

Viewed

Total article views: 274 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
225	34	15	274	19	18

Views and downloads (calculated since 11 May 2026)

Month	HTML	PDF	XML	Total
May 2026	199	27	11	237
Jun 2026	26	7	4	37

Viewed (geographical distribution)

Total article views: 268 (including HTML, PDF, and XML) Thereof 268 with geography defined and 0 with unknown origin.

Latest update: 25 Jun 2026

Short summary

Clouds shape Earth’s climate and water supply. Classifying them from ground-based images helps track regional weather. Existing models are either inaccurate or too large for portable devices. We present a lightweight model, CloudMViT, using dual-pooling channel attention and cross-scale self-attention for cloud classification. Experiments show higher accuracy and real-time, low-memory performance on embedded hardware. This work supports portable cloud observation for climate monitoring.

