This work is distributed under the Creative Commons Attribution 4.0 License.
From Ground Photos to Aerial Insights: Automating Citizen Science Labeling for Tree Species Segmentation in UAV Images
Abstract. Spatially accurate information on plant species is essential for many biodiversity monitoring applications, such as vegetation monitoring. Unoccupied Aerial Vehicle (UAV)-based remote sensing combined with supervised Convolutional Neural Network (CNN)-based segmentation methods has enabled accurate segmentation of plant species. However, labeling training data for supervised CNN methods in vegetation monitoring is a resource-intensive task, particularly for large-scale remote sensing datasets. This study presents an automated workflow that integrates the Segment Anything Model (SAM) with Gradient-weighted Class Activation Mapping (Grad-CAM) to generate segmentation masks for citizen science plant photographs, reducing the effort required for manual annotation. We evaluated the workflow by using the generated masks to train CNN-based segmentation models to segment 10 broadleaf tree species in UAV images. The results demonstrate that segmentation models can be trained directly on citizen science-sourced plant photographs, with mask generation automated and no need for extensive manual labeling. Despite the inherent complexity of segmenting broadleaf tree species, the model achieved acceptable overall performance. Towards efficiently monitoring vegetation dynamics across space and time, this study highlights the potential of integrating foundation models with citizen science data and remote sensing into automated vegetation mapping workflows, providing a scalable and cost-effective solution for biodiversity monitoring.
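The workflow described in the abstract turns a photo-level species label into a pixel-level mask by combining a classifier's Grad-CAM attention with SAM. As a rough, hypothetical illustration of that general idea (not the authors' implementation), the sketch below assumes the publicly available pytorch_grad_cam and segment_anything packages, a pretrained species classifier already placed on the chosen device, and placeholder names throughout:

```python
# Minimal sketch (not the authors' code): derive a point prompt for SAM from a
# Grad-CAM heatmap of a species classifier, then let SAM produce the plant mask.
# `classifier`, `target_layer`, `species_idx`, and the checkpoint path are
# placeholders / assumptions, not values from the manuscript.
import numpy as np
import torch
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from segment_anything import sam_model_registry, SamPredictor

def photo_to_mask(photo_path, classifier, target_layer, species_idx,
                  sam_checkpoint="sam_vit_h_4b8939.pth", device="cuda"):
    image = np.array(Image.open(photo_path).convert("RGB"))          # H x W x 3, uint8

    # 1) Grad-CAM: where does the classifier look for this species?
    tensor = torch.from_numpy(image).permute(2, 0, 1).float()[None] / 255.0
    cam = GradCAM(model=classifier, target_layers=[target_layer])
    heatmap = cam(input_tensor=tensor.to(device),
                  targets=[ClassifierOutputTarget(species_idx)])[0]   # H x W, in [0, 1]

    # 2) Use the hottest pixel as a positive point prompt for SAM.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)

    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint).to(device)
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(point_coords=np.array([[x, y]]),
                                         point_labels=np.array([1]),
                                         multimask_output=True)
    return masks[np.argmax(scores)]                                   # boolean H x W mask
```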
Status: closed
RC1: 'Comment on egusphere-2025-662', Anonymous Referee #1, 18 Mar 2025
In this study, the authors develop an end-to-end workflow that transforms the simple labels of crowd-sourced plant photos from iNaturalist and Pl@ntNet into segmentation masks. This mask dataset serves as labeled data to train deep learning species classification models. The authors also successfully utilized the dataset to train a CNN model to classify UAV ortho-imagery and accurately segment plant species at large scale. By reducing the time and labor required for field surveys to collect reference data for remote sensing image classification, this labeled dataset may offer some practical benefits. Overall, the study demonstrates both intellectual merit and practical relevance. The manuscript is also well-structured and well-written. However, the use of these citizen science datasets as labeled data for segmenting UAV images yields low accuracy for various species, hindering practical applications of these datasets and the method. The UAV image segmentation model performance should be improved before further evaluation.
Other comments
- Lines 184-188: Other than learning rate, batch size, and epochs, did you tune other parameters? Also, for learning rate, batch size, and epochs, it would be better to test a wider range of values to evaluate model performance before narrowing them down to a specific range. In addition, did you use k-fold cross-validation for hyperparameter tuning during model training? If so, what value of k did you use? This needs to be clarified (a generic sketch of such a tuning protocol is given after this list of comments).
- Lines 239-243: The prediction of acquisition distance seems questionable. In citizen science data, people use various cameras and may use various zoom settings when capturing photos, so it is hard to predict acquisition distance from the photo itself; thus, the distance thresholds of 0.2 m and 20 m seem questionable. In the earlier paragraph, the authors use an area threshold of 30% to filter out some photos. Should a similar method be used to filter out photos with large amounts of tree trunk/branch?
- Lines 278-284: Did you use k-fold cross-validation to train the model? If so, the value of k you used should be reported.
- Lines 286-301: The classification performance seems to be low for various species. Citizen science data helps reduce the time and labor of reference data collection; however, we also need to make sure output data are accurate and usable. Given this low accuracy, what do the authors suggest for future work? Should some UAV-based, high-accuracy labeled data be incorporated into the model together with the citizen science data to improve classification accuracy? Also, the hyperparameter tuning in your deep learning model training does not seem to have been performed thoroughly; I recommend conducting a more exhaustive tuning and trying different deep learning architectures to see if the classification results improve.
- One of the main reasons for the low segmentation accuracy in this study could be the difference in spatial resolution between the citizen science photos and the UAV images. One possible solution for this discrepancy could be to manipulate/resample the citizen science photos to different resolutions during segmentation model training, including the 0.22 cm resolution of the UAV imagery, and to incorporate features extracted from these layers into the final segmentation prediction to help improve the final segmentation results (see the paper below with a similar idea, note: this is not a reviewer's paper, and the resampling sketch after it).
Martins et al., 2020. Exploring multiscale object-based convolutional neural network (multi-OCNN) for remote sensing image classification at high spatial resolution. https://doi.org/10.1016/j.isprsjprs.2020.08.004
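As an illustration of the tuning protocol asked about in the comments on lines 184-188 and 278-284 above, a plain k-fold grid search over learning rate and batch size might look like the following sketch; build_model and train_and_score are hypothetical placeholders, not functions from the manuscript:

```python
# Sketch of k-fold cross-validated hyperparameter search (illustrative only; the
# callables build_model() and train_and_score() are hypothetical placeholders).
import itertools
import numpy as np
from sklearn.model_selection import KFold

def tune(dataset_indices, build_model, train_and_score, k=5):
    grid = {"lr": [1e-2, 1e-3, 1e-4], "batch_size": [8, 16, 32]}
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    results = {}
    for lr, bs in itertools.product(grid["lr"], grid["batch_size"]):
        fold_scores = []
        for train_idx, val_idx in kf.split(dataset_indices):
            model = build_model()
            score = train_and_score(model, train_idx, val_idx, lr=lr, batch_size=bs)
            fold_scores.append(score)
        results[(lr, bs)] = float(np.mean(fold_scores))  # mean validation score over folds
    best = max(results, key=results.get)                 # (lr, batch_size) with best mean
    return best, results
```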
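A rough sketch of the multi-resolution resampling idea suggested in the last comment above, assuming an estimated real-world footprint per photo (Pillow-based, illustrative only; not the method of Martins et al. 2020 or of the authors):

```python
# Sketch: resample a ground photo to coarser ground sampling distances (GSDs),
# e.g. to approximate the 0.22 cm UAV resolution. The photo footprint in metres
# is an assumed/estimated input, not something stored in the photo itself.
from PIL import Image

def resample_to_gsds(photo_path, footprint_m, target_gsds_cm=(0.1, 0.22, 0.5)):
    """Return one down-then-up-sampled copy of the photo per target GSD."""
    img = Image.open(photo_path).convert("RGB")
    native_gsd_cm = footprint_m * 100.0 / img.width       # cm per pixel at capture
    versions = {}
    for gsd in target_gsds_cm:
        scale = min(1.0, native_gsd_cm / gsd)              # only degrade, never add detail
        small = img.resize((max(1, int(img.width * scale)),
                            max(1, int(img.height * scale))), Image.BILINEAR)
        versions[gsd] = small.resize(img.size, Image.NEAREST)  # back onto original grid
    return versions
```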
Citation: https://doi.org/10.5194/egusphere-2025-662-RC1
AC1: 'Response to Reviewer 1 Comments', Salim Soltani, 09 May 2025
Dear Reviewer,
We would like to sincerely thank you for your constructive and thoughtful comments. We greatly appreciate the time and effort you invested in reviewing our manuscript. Your feedback has been very helpful in identifying areas for improvement.
We have carefully addressed all comments and will revise the manuscript accordingly. For better readability, we have compiled our detailed responses in the attached PDF, structured in a clear table format.
Thank you once again for your valuable input.
Sincerely,
Salim Soltani
(on behalf of the Co-authors, Lauren E. Gillespie, Moises Exposito-Alonso, Olga Ferlian, Nico Eisenhauer, Hannes Feilhauer, and Teja Kattenborn)
RC2: 'Comment on egusphere-2025-662', Anonymous Referee #2, 19 Apr 2025
I have reviewed the manuscript "From Ground Photos to Aerial Insights: Automating Citizen Science Labeling for Tree Species Segmentation in UAV Images". The authors examined the use of citizen science plant photographs to generate the large training datasets needed for segmenting plant species from high-resolution UAV imagery. Specifically, the authors combined several AI/ML models to extract species training masks from the photographs. The research topic is very interesting and timely, and addresses a core need to advance the use of optical UAV imagery for larger-scale vegetation mapping. The manuscript is well structured and nicely discussed. My concerns are mainly with the Methods and Results.
I would recommend that the authors add a workflow chart to help readers understand the various methods and data used in the study. Several AI/ML models are employed for different data processing steps, involving both photographs and UAV imagery. I found it hard to connect the different processing steps and to see how the different data streams and AI/ML methods are used.
Second, not much information is presented in the Results, barely enough to understand the performance of the model. The authors did quite significant work on processing and segmenting the photographs from iNaturalist and Pl@ntNet. However, results about this processing and segmentation are completely missing from the Results. I am concerned that the presentation of the Results is disconnected from the Methods. I recommend that the authors carefully tie them together, especially regarding how the F1 scores and confusion matrix were calculated. The authors mentioned that independent transect validation data were identified from the UAV imagery, but did not mention where and how those were produced, or their distribution across species and space. I think it would also be useful to present the species maps across the experimental plots.
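For context on the metrics point, per-class F1 scores are conventionally derived from a confusion matrix C in which C[i, j] counts reference class i predicted as class j; a generic numpy illustration (not a statement of how the authors computed theirs) is:

```python
# Generic per-class precision/recall/F1 from a confusion matrix C, where
# C[i, j] counts reference class i predicted as class j (illustrative only).
import numpy as np

def per_class_f1(C):
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)
    precision = tp / np.maximum(C.sum(axis=0), 1e-12)   # column sums = predicted totals
    recall = tp / np.maximum(C.sum(axis=1), 1e-12)      # row sums = reference totals
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)
```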
Lastly, an overall thought: a core advantage of using UAV imagery is that it provides landscape-scale observations. The authors argued that ultra-high resolution (finer than 0.22 cm) might be necessary to better segment species from UAV imagery. This statement appears to be false, and it ignores that canopy structure and form are important information for species identification, which are not considered in this study. On the other hand, it is appealing to generate the initial masks for UAV species identification using photographs, but might it be more useful to iterate over the species segmentation at the UAV level, leveraging other information such as canopy form and structure, to enlarge the training samples at the UAV level, instead of forcing the UAV data to the same resolution as the ground photographs?
Minor comments:
- I wonder what features the authors used for segmentation. It is clear that the authors used only RGB imagery, but are other indices or transformations incorporated in the SAM segmentation?
- The authors mentioned that photos/masks from citizen science were 'zoomed out' when applied as training data for the UAV imagery. What is the resolution after that? Is it comparable to the UAV resolution?
Citation: https://doi.org/10.5194/egusphere-2025-662-RC2
AC2: 'Response to Reviewer 2 Comments', Salim Soltani, 09 May 2025
Dear Reviewer,
We would like to sincerely thank you for your constructive and thoughtful comments. We greatly appreciate the time and effort you invested in reviewing our manuscript. Your feedback has been very helpful in identifying areas for improvement.
We have carefully addressed all comments and will revise the manuscript accordingly. For better readability, we have compiled our detailed responses in the attached PDF, structured in a clear table format.
Thank you once again for your valuable input.
Sincerely,
Salim Soltani
(on behalf of the Co-authors, Lauren E. Gillespie, Moises Exposito-Alonso, Olga Ferlian, Nico Eisenhauer, Hannes Feilhauer, and Teja Kattenborn)
Viewed
- HTML: 803
- PDF: 138
- XML: 30
- Total: 971
- BibTeX: 20
- EndNote: 35