From Ground Photos to Aerial Insights: Automating Citizen Science Labeling for Tree Species Segmentation in UAV Images
Abstract. Spatially accurate information on plant species is essential for biodiversity monitoring applications such as vegetation monitoring. Unoccupied Aerial Vehicle (UAV)-based remote sensing combined with supervised Convolutional Neural Network (CNN)-based segmentation methods has enabled accurate segmentation of plant species. However, labeling training data for supervised CNN methods in vegetation monitoring is a resource-intensive task, particularly for large-scale remote sensing datasets. This study presents an automated workflow that integrates the Segment Anything Model (SAM) with Gradient-weighted Class Activation Mapping (Grad-CAM) to generate segmentation masks for citizen science plant photographs, reducing the effort required for manual annotation. We evaluated the workflow by using the generated masks to train CNN-based segmentation models to segment 10 broadleaf tree species in UAV images. The results demonstrate that segmentation models can be trained directly on citizen science plant photographs, with mask generation automated and no extensive manual labeling required. Despite the inherent complexity of segmenting broadleaf tree species, the model achieved overall acceptable performance. This study highlights the potential of integrating foundation models, citizen science data, and remote sensing into automated vegetation mapping workflows, providing a scalable and cost-effective solution for monitoring vegetation dynamics across space and time.
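A minimal sketch of the mask-generation idea described in the abstract, assuming a generic ResNet-50 stand-in for the species classifier, the `pytorch-grad-cam` and `segment-anything` packages, an illustrative checkpoint path, and a simple top-3 point-prompt heuristic; the authors' actual models and prompting strategy are not specified on this page:

```python
# Sketch only: classifier, class index, checkpoint path, and prompt heuristic
# are assumptions, not the authors' confirmed pipeline.
import numpy as np
from PIL import Image
from torchvision import models, transforms
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from segment_anything import sam_model_registry, SamPredictor

# 1) Classifier whose attention will localize the plant in the photo.
classifier = models.resnet50(weights="IMAGENET1K_V2").eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("plant_photo.jpg").convert("RGB")  # citizen science photo
input_tensor = preprocess(image).unsqueeze(0)

# 2) Grad-CAM heatmap for the (assumed) species class index 0.
cam = GradCAM(model=classifier, target_layers=[classifier.layer4[-1]])
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]  # HxW, values in [0, 1]

# 3) The hottest CAM pixels become positive point prompts for SAM.
ys, xs = np.unravel_index(np.argsort(heatmap.ravel())[-3:], heatmap.shape)
points = np.stack([xs, ys], axis=1).astype(np.float32)  # (x, y) coordinates

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image.resize((224, 224))))  # match heatmap size
masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=np.ones(len(points), dtype=np.int32))
best_mask = masks[np.argmax(scores)]  # candidate training mask for the photo
```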
Status: open (until 07 Apr 2025)
RC1: 'Comment on egusphere-2025-662', Anonymous Referee #1, 18 Mar 2025
In this study, the authors develop an end-to-end workflow that transforms the simple labels of crowd-sourced plant photos from iNaturalist and Pl@ntNet into segmentation masks. This mask dataset serves as labelled data to train deep learning species classification models. The authors also successfully used the dataset to train a CNN model to classify UAV ortho-imagery and accurately segment plant species at large scale. By reducing the time and labor required for field surveys to collect reference data for remote sensing image classification, this labelled dataset may offer practical benefits. Overall, the study demonstrates both intellectual merit and practical relevance, and the manuscript is well-structured and well-written. However, using these citizen science datasets as labelled data for segmenting UAV images yields low accuracy for several species, hindering practical applications of the datasets and the method. The UAV image segmentation model's performance should be improved for further evaluation.
Other comments
- Lines 184-188: Other than learning rate, batch size, and number of epochs, did you tune any other hyperparameters? For learning rate, batch size, and epochs, it would be better to test a wider range of values before narrowing them down to a specific range. Also, did you use k-fold cross-validation for hyperparameter tuning? If so, what value of k did you use? This needs to be clarified (see the k-fold search sketch after these comments).
- Lines 239-243: The prediction of acquisition distance seems questionable. In citizen science data, people use various cameras and may use various zoom settings when capturing photos, so it is hard to predict acquisition distance from the photo alone; the distance thresholds of 0.2 m and 20 m are therefore doubtful. In an earlier paragraph, the authors use an area threshold of 30% to filter out some photos; should a similar method be used to filter out photos dominated by tree trunks and branches (see the area-filter sketch after these comments)?
- Lines 278-284: Did you use k-fold cross-validation to train the model? If so, the value of k should be reported.
- Lines 286-301: The classification performance seems low for several species. Citizen science data helps reduce the time and labor of reference data collection; however, the output data also need to be accurate and usable. Given this low accuracy, what do the authors suggest for future work? Should high-accuracy UAV-based labelled data be incorporated into the model together with the citizen science data to improve classification accuracy? Also, the hyperparameter tuning in the deep learning model training does not appear to have been thorough; I recommend conducting a more exhaustive search and trying different deep learning architectures to see whether the classification results improve.
- One of the main causes of the low segmentation accuracy in this study could be the difference in spatial resolution between the citizen science photos and the UAV images. One possible solution would be to resample the citizen science photos to different resolutions during segmentation model training, including the 0.22 cm resolution of the UAV imagery, and to incorporate features extracted from these layers into the final segmentation prediction (see the paper below for a similar idea, and the resampling sketch after these comments; note: this is not the reviewer's paper).
Martins et al., 2020. Exploring multiscale object-based convolutional neural network (multi-OCNN) for remote sensing image classification at high spatial resolution. ISPRS Journal of Photogrammetry and Remote Sensing. https://doi.org/10.1016/j.isprsjprs.2020.08.004
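A minimal sketch of the kind of k-fold hyperparameter search the first and third comments ask about; k = 5, the grid values, `num_samples`, and the `build_model()`/`run_training()` helpers are illustrative stand-ins, not the authors' actual training setup:

```python
# Hypothetical k-fold grid search over learning rate, batch size, and epochs.
import itertools
import numpy as np
from sklearn.model_selection import KFold

grid = {"lr": [1e-4, 1e-3, 1e-2], "batch_size": [8, 16, 32], "epochs": [30, 60]}
num_samples = 1000                      # placeholder size of the mask dataset
indices = np.arange(num_samples)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for lr, bs, ep in itertools.product(*grid.values()):
    fold_scores = []
    for train_idx, val_idx in kfold.split(indices):
        model = build_model()                            # hypothetical helper
        score = run_training(model, train_idx, val_idx,  # hypothetical helper
                             lr=lr, batch_size=bs, epochs=ep)
        fold_scores.append(score)                        # e.g. validation IoU
    results[(lr, bs, ep)] = float(np.mean(fold_scores))

best_config = max(results, key=results.get)  # report alongside the k used
```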
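A hedged sketch of the area-based filtering the second comment alludes to, assuming a boolean plant mask per photo; the 30% value echoes the manuscript's threshold, and a trunk/branch variant would need a mask for those classes:

```python
# Illustrative area filter on a per-photo segmentation mask.
import numpy as np

def passes_area_filter(mask: np.ndarray, min_fraction: float = 0.30) -> bool:
    """Keep a photo only if the plant mask covers >= min_fraction of the frame."""
    return float(mask.mean()) >= min_fraction

dummy = np.zeros((224, 224), dtype=bool)
dummy[:100, :] = True                    # roughly 45% coverage
assert passes_area_filter(dummy)
```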
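And a small sketch of the resampling step suggested in the last comment, assuming each photo's native ground sampling distance (GSD) can be estimated; the 0.05 cm native GSD and the pyramid levels are assumptions, with 0.22 cm matching the UAV imagery:

```python
# Sketch of GSD-matched multiscale resampling for training.
import torch
import torch.nn.functional as F

def resample_to_gsd(image: torch.Tensor, native_gsd_cm: float,
                    target_gsd_cm: float) -> torch.Tensor:
    """image: 1xCxHxW tensor; GSD values in cm per pixel."""
    scale = native_gsd_cm / target_gsd_cm
    return F.interpolate(image, scale_factor=scale,
                         mode="bilinear", align_corners=False)

img = torch.rand(1, 3, 512, 512)         # placeholder close-up photo tensor
# Features from each pyramid level could feed the segmentation decoder,
# in the spirit of the multi-OCNN paper cited above.
pyramid = [resample_to_gsd(img, native_gsd_cm=0.05, target_gsd_cm=g)
           for g in (0.05, 0.1, 0.22, 0.5)]
```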
Citation: https://doi.org/10.5194/egusphere-2025-662-RC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 184 | 49 | 8 | 241 | 6 | 4 |
Viewed (geographical distribution)
| Country | Rank | Views | % |
|---|---|---|---|
| United States of America | 1 | 75 | 27 |
| Germany | 2 | 52 | 19 |
| France | 3 | 24 | 8 |
| Canada | 4 | 14 | 5 |
| Portugal | 5 | 11 | 4 |