From Ground Photos to Aerial Insights: Automating Citizen Science Labeling for Tree Species Segmentation in UAV Images
Abstract. Spatially accurate information on plant species is essential for biodiversity monitoring applications such as vegetation monitoring. Unoccupied Aerial Vehicle (UAV)-based remote sensing combined with supervised Convolutional Neural Network (CNN)-based segmentation methods has enabled accurate segmentation of plant species. However, labeling training data for supervised CNN methods in vegetation monitoring is a resource-intensive task, particularly for large-scale remote sensing datasets. This study presents an automated workflow that integrates the Segment Anything Model (SAM) with Gradient-weighted Class Activation Mapping (Grad-CAM) to generate segmentation masks for citizen science plant photographs, reducing the effort required for manual annotation. We evaluated the workflow by using the generated masks to train CNN-based segmentation models to segment 10 broadleaf tree species in UAV images. The results demonstrate that segmentation models can be trained directly on citizen science plant photographs, with mask generation automated and no extensive manual labeling required. Despite the inherent complexity of segmenting broadleaf tree species, the model achieved overall acceptable performance. This study highlights the potential of integrating foundation models, citizen science data, and remote sensing into automated vegetation mapping workflows, providing a scalable and cost-effective solution for monitoring vegetation dynamics across space and time.
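A minimal sketch of the mask-generation idea described in the abstract, assuming a generic ResNet-50 stand-in for the species classifier, the `pytorch-grad-cam` and `segment-anything` packages, an illustrative checkpoint path, and a simple top-3 point-prompt heuristic; the authors' actual models and prompting strategy are not specified on this page:

```python
# Sketch only: classifier, class index, checkpoint path, and prompt heuristic
# are assumptions, not the authors' confirmed pipeline.
import numpy as np
from PIL import Image
from torchvision import models, transforms
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from segment_anything import sam_model_registry, SamPredictor

# 1) Classifier whose attention will localize the plant in the photo.
classifier = models.resnet50(weights="IMAGENET1K_V2").eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("plant_photo.jpg").convert("RGB")  # citizen science photo
input_tensor = preprocess(image).unsqueeze(0)

# 2) Grad-CAM heatmap for the (assumed) species class index 0.
cam = GradCAM(model=classifier, target_layers=[classifier.layer4[-1]])
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]  # HxW, values in [0, 1]

# 3) The hottest CAM pixels become positive point prompts for SAM.
ys, xs = np.unravel_index(np.argsort(heatmap.ravel())[-3:], heatmap.shape)
points = np.stack([xs, ys], axis=1).astype(np.float32)  # (x, y) coordinates

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image.resize((224, 224))))  # match heatmap size
masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=np.ones(len(points), dtype=np.int32))
best_mask = masks[np.argmax(scores)]  # candidate training mask for the photo
```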
Status: open (until 07 Apr 2025)
RC1: 'Comment on egusphere-2025-662', Anonymous Referee #1, 18 Mar 2025
In this study, the authors develop an end-to-end workflow that transforms the simple labels of crowd-sourced plant photos from iNaturalist and Pl@ntNet into segmentation masks. This mask dataset serves as labelled data to train deep learning species classification models. The authors also successfully used the dataset to train a CNN model to classify UAV ortho-imagery and accurately segment plant species at large scale. By reducing the time and labor required for field surveys to collect reference data for remote sensing image classification, this labelled dataset may offer practical benefits. Overall, the study demonstrates both intellectual merit and practical relevance, and the manuscript is well-structured and well-written. However, using these citizen science datasets as labelled data for segmenting UAV images yields low accuracy for several species, hindering practical applications of the datasets and the method. The UAV image segmentation model's performance should be improved for further evaluation.
Other comments
- Lines 184-188: Other than learning rate, batch size, and number of epochs, did you tune any other hyperparameters? For learning rate, batch size, and epochs, it would be better to test a wider range of values before narrowing them down to a specific range. Also, did you use k-fold cross-validation for hyperparameter tuning? If so, what value of k did you use? This needs to be clarified (see the k-fold search sketch after these comments).
- Lines 239-243: The prediction of acquisition distance seems questionable. In citizen science data, people use various cameras and may use various zoom settings when capturing photos, so it is hard to predict acquisition distance from the photo alone; the distance thresholds of 0.2 m and 20 m are therefore doubtful. In an earlier paragraph, the authors use an area threshold of 30% to filter out some photos; should a similar method be used to filter out photos dominated by tree trunks and branches (see the area-filter sketch after these comments)?
- Lines 278-284: Did you use k-fold cross-validation to train the model? If so, the value of k should be reported.
- Lines 286-301: The classification performance seems low for several species. Citizen science data helps reduce the time and labor of reference data collection; however, the output data also need to be accurate and usable. Given this low accuracy, what do the authors suggest for future work? Should high-accuracy UAV-based labelled data be incorporated into the model together with the citizen science data to improve classification accuracy? Also, the hyperparameter tuning in the deep learning model training does not appear to have been thorough; I recommend conducting a more exhaustive search and trying different deep learning architectures to see whether the classification results improve.
- One of the main causes of the low segmentation accuracy in this study could be the difference in spatial resolution between the citizen science photos and the UAV images. One possible solution would be to resample the citizen science photos to different resolutions during segmentation model training, including the 0.22 cm resolution of the UAV imagery, and to incorporate features extracted from these layers into the final segmentation prediction (see the paper below for a similar idea, and the resampling sketch after these comments; note: this is not the reviewer's paper).
Martins et al., 2020. Exploring multiscale object-based convolutional neural network (multi-OCNN) for remote sensing image classification at high spatial resolution. ISPRS Journal of Photogrammetry and Remote Sensing. https://doi.org/10.1016/j.isprsjprs.2020.08.004
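A minimal sketch of the kind of k-fold hyperparameter search the first and third comments ask about; k = 5, the grid values, `num_samples`, and the `build_model()`/`run_training()` helpers are illustrative stand-ins, not the authors' actual training setup:

```python
# Hypothetical k-fold grid search over learning rate, batch size, and epochs.
import itertools
import numpy as np
from sklearn.model_selection import KFold

grid = {"lr": [1e-4, 1e-3, 1e-2], "batch_size": [8, 16, 32], "epochs": [30, 60]}
num_samples = 1000                      # placeholder size of the mask dataset
indices = np.arange(num_samples)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for lr, bs, ep in itertools.product(*grid.values()):
    fold_scores = []
    for train_idx, val_idx in kfold.split(indices):
        model = build_model()                            # hypothetical helper
        score = run_training(model, train_idx, val_idx,  # hypothetical helper
                             lr=lr, batch_size=bs, epochs=ep)
        fold_scores.append(score)                        # e.g. validation IoU
    results[(lr, bs, ep)] = float(np.mean(fold_scores))

best_config = max(results, key=results.get)  # report alongside the k used
```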
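A hedged sketch of the area-based filtering the second comment alludes to, assuming a boolean plant mask per photo; the 30% value echoes the manuscript's threshold, and a trunk/branch variant would need a mask for those classes:

```python
# Illustrative area filter on a per-photo segmentation mask.
import numpy as np

def passes_area_filter(mask: np.ndarray, min_fraction: float = 0.30) -> bool:
    """Keep a photo only if the plant mask covers >= min_fraction of the frame."""
    return float(mask.mean()) >= min_fraction

dummy = np.zeros((224, 224), dtype=bool)
dummy[:100, :] = True                    # roughly 45% coverage
assert passes_area_filter(dummy)
```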
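And a small sketch of the resampling step suggested in the last comment, assuming each photo's native ground sampling distance (GSD) can be estimated; the 0.05 cm native GSD and the pyramid levels are assumptions, with 0.22 cm matching the UAV imagery:

```python
# Sketch of GSD-matched multiscale resampling for training.
import torch
import torch.nn.functional as F

def resample_to_gsd(image: torch.Tensor, native_gsd_cm: float,
                    target_gsd_cm: float) -> torch.Tensor:
    """image: 1xCxHxW tensor; GSD values in cm per pixel."""
    scale = native_gsd_cm / target_gsd_cm
    return F.interpolate(image, scale_factor=scale,
                         mode="bilinear", align_corners=False)

img = torch.rand(1, 3, 512, 512)         # placeholder close-up photo tensor
# Features from each pyramid level could feed the segmentation decoder,
# in the spirit of the multi-OCNN paper cited above.
pyramid = [resample_to_gsd(img, native_gsd_cm=0.05, target_gsd_cm=g)
           for g in (0.05, 0.1, 0.22, 0.5)]
```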
Citation: https://doi.org/10.5194/egusphere-2025-662-RC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 184 | 49 | 8 | 241 | 6 | 4 |
Viewed (geographical distribution)
| Country | Rank | Views | % |
|---|---|---|---|
| United States of America | 1 | 75 | 27 |
| Germany | 2 | 52 | 19 |
| France | 3 | 24 | 8 |
| Canada | 4 | 14 | 5 |
| Portugal | 5 | 11 | 4 |