OrthoSAM: Multi-Scale Extension of the Segment Anything Model for River Pebble Delineation from Large Orthophotos
Abstract. Sediment characteristics and grain-size distribution are crucial for understanding natural hazards, hydrologic conditions, and ecosystems. However, traditional methods for collecting this information are costly, labor-intensive, and time-consuming. To address this, we present OrthoSAM, a workflow leveraging the Segment Anything Model (SAM) for automated delineation of densely packed pebbles in high-resolution orthomosaics. Our framework consists of a tiling scheme, improved seed (input) point generation, and a multi-scale resampling scheme. Validation using synthetic images shows precision close to 1, recall above 0.9, and mean IoU above 0.9. Using a large synthetic dataset, we show that the two-sample Kolmogorov-Smirnov test confirms the accuracy of the grain-size distribution. We identified a size detection limit of 30 pixels; pebbles with a diameter below this limit are not reliably detected. Applying OrthoSAM to orthomosaics from the Ravi River in India, we delineated 6087 pebbles with high precision (0.93) and recall (0.94). The resulting grain statistics include area, axis lengths, perimeter, RGB statistics, and smoothness measurements, providing valuable insights for further analysis in geomorphology and ecosystem studies.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4003', David Mair, 20 Sep 2025
AC3: 'Reply on RC1', Vito Chan, 07 Dec 2025
The authors present a novel method and proof-of-concept for pebble segmentation in orthoimages by adapting the popular and widely-used Segment Anything Model (SAM; Kirillov et al., 2023). They identify important, but often unaddressed, weaknesses of SAM, such as the reduced performance in dense segmentation tasks (where many instances of the same object class should be segmented), and its limited capability to segment objects from one class with a significant size variability. To test their approach, the authors use 1) synthetic images with circles as a proxy for pebbles and 2) ortho-mosaics of real pebbles created with handheld cameras and photogrammetric processing. In their experiment 1, they test for the effect of a variety of image perturbations on segmentation quality. Here, they find that particularly shadow effects have some negative impact on SAM’s segmentation performance. In experiment 2, they apply their workflow to real-world images, showcasing the improvement of their multi-scale segmentation with SAM. In this scenario, they categorically evaluate segmentation performance through manual counting due to the lack of ground truth masks. Both experiments show that their approach is up to the task and has the potential to mitigate some of the segmentation shortcomings of SAM for such applications.
I find the method well-conceived and thought-through, the data rigorously tested and clearly reported, and the manuscript well structured. In particular, I consider the balance between technical details in the main manuscript and the appendices well struck, which makes the manuscript very readable, while not omitting relevant information. The presented results generally support the findings and conclusions. Here, I would only have two suggestions for calculating additional scores and using an additional image dataset to test the approach (see specific comments below), which might allow for a better evaluation of some aspects of the segmentation performance of SAM/OrthoSAM. However, these are just suggestions, not concerns raised. Currently, the manuscript has many small figures; maybe combining some figures into larger figures (e.g., Figures 10 and 11) would be helpful. Additionally, some minor/technical comments are included as in-line comments in the attached pdf.
In summary, I find the work of very high quality, with only a few minor points where the manuscript could be further improved. I suspect the authors will have no problems in addressing these points, and I look forward to seeing the manuscript published soon.
Kind regards,
David Mair (Uni Bern)
Thank you for your positive feedback and the helpful suggestions to improve the manuscript.
Specific comments:
Additional metrics for segmentation performance: The authors use well-established metrics to evaluate the segmentation performance. However, I would suggest additionally calculating Average Precision (AP) scores at fixed IoU thresholds (e.g., AP@0.5 IoU and/or mAP@0.5-0.9 IoU), as used in the SAM paper (Kirillov et al., 2023) and as widely used for instance segmentation tasks in general (e.g., Padilla et al., 2020). This is because I suspect that SAM segmentations are slightly worse for the colored synthetic images than for the black and white, while in both cases they score high in precision, recall, and mean IoU (see also lines 198-199). These scores could be calculated from the TP, FN, and FP values, where all TPs falling below a certain IoU threshold would count as FP. These scores might more clearly show that SAM is sensitive to shadows during segmentation (see also related comments in the pdf).
We have followed this suggestion and included Average Precision (AP) scores as an additional segmentation metric. Because Segment Anything does not output per-object confidence values, we modified the AP calculation to use object size as a proxy for confidence. We report AP@0.75 for both the B&W and the Color with Shadow synthetic images. B&W [4,3000] achieved an AP@0.75 of 1.00, while the Color with Shadow images achieved an average of 0.87. For the Ravi images, an overall AP of 0.97 was achieved. AP confirms that OrthoSAM’s segmentation performance is slightly worse for the colored synthetic images, which aligns with the reviewer’s observation. We have revised the manuscript to reflect the additional metric and results.
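For reference, below is a minimal sketch of how such a size-ranked AP at a fixed IoU threshold can be computed, assuming instance masks as boolean NumPy arrays; the function name and the greedy matching are illustrative simplifications, not OrthoSAM’s actual evaluation code.

```python
import numpy as np

def average_precision(pred_masks, gt_masks, iou_thr=0.75):
    """AP at a fixed IoU threshold, ranking predictions by mask area
    (a proxy for confidence, since SAM reports none per object)."""
    order = np.argsort([-m.sum() for m in pred_masks])  # largest first
    matched, tp = set(), np.zeros(len(pred_masks))
    for rank, i in enumerate(order):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            union = np.logical_or(pred_masks[i], gt).sum()
            iou = np.logical_and(pred_masks[i], gt).sum() / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:  # TP only above the IoU threshold
            tp[rank] = 1
            matched.add(best_j)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / len(gt_masks)
    # area under the precision-recall curve (all-point form)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```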
Adding a dataset with instance labels for pebbles. In lines 99-100, it is stated that ideally the workflow should be tested on a dataset of several hundred to thousands of delineated pebbles. This is picked up in line 216, when it is correctly stated that no ground truth masks are available for the Ravi dataset, and hence no IoU scores can be calculated. Here, I would like to mention our S1 dataset (as part of the data used in Mair et al., 2024), which has > 2000 manually annotated pebble masks from orthomosaics (available here: https://zenodo.org/records/8005771). It would be interesting to see how OrthoSAM would perform here; I suspect it will perform very well, especially due to the grain size variability similar to that of the Ravi River. Using these data as an additional test could help to increase confidence in the performance of OrthoSAM.
Yes, we are aware of the limitations of a synthetic dataset, but also of its benefits: a perfectly labeled dataset. In preparation for the initial manuscript submission, we performed segmentation on the SediNet dataset and the ImageGrains dataset mentioned by the reviewer. For the ImageGrains dataset, we noticed that the labels (i.e., the ground-truth masks) are only partially complete and do not include all pebbles. Our initial analysis and discussion among the authors of this manuscript suggested that we cannot use their labeled data as a validation dataset. We have re-evaluated our initial assessment in the revisions and now include segmentation results and a comparison (see Table 1 and Figure 1). However, we refrain from treating these data as a true label dataset and do not report revised accuracy and precision statistics.
Instead, we compare the derived size distribution with a two-sided KS test (Figure 3). We note that according to the KS test, the distributions are not equal. We observe that OrthoSAM detects many more objects (cf. Table 1) and also many more small objects. Visual inspection reveals that OrthoSAM detects many of the smaller grains correctly (that are not in the ground-truth labeling dataset), but also identifies some objects that are not pebbles. These will need to be removed by additional filtering steps, for example through the normalized isoperimetric ratio (IRn). IRn provides a measurement for the roundness of an object, which can be used to remove irrelevant objects such as vegetation. OrthoSAM reports various statistics of segmented objects that can also be used for further filtering through more sophisticated outlier detection algorithms, such as decision trees. We note that some of the pebbles detected by OrthoSAM and not ImageGrains are low-contrast pebbles (i.e., the pebbles' color and texture are only slightly different from the background). We speculate that the gradient-based convolutional neural network in ImageGrains has not been extensively trained for low-gradient segments and thus excludes them.
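As an illustration of such an IRn-based filter, here is a minimal sketch assuming boolean NumPy masks with one connected component each; the threshold value is a hypothetical choice for illustration, not an OrthoSAM default.

```python
import numpy as np
from skimage.measure import label, regionprops

def filter_by_roundness(masks, irn_min=0.6):
    """Drop segments whose normalized isoperimetric ratio falls below a
    threshold; IRn = 4*pi*A / P**2 equals 1 for a perfect circle and
    approaches 0 for elongated or ragged shapes such as vegetation."""
    kept = []
    for m in masks:
        props = regionprops(label(m.astype(np.uint8)))[0]
        if props.perimeter == 0:  # degenerate one-pixel segment
            continue
        irn = 4 * np.pi * props.area / props.perimeter ** 2
        if irn >= irn_min:
            kept.append(m)
    return kept
```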
We have extended the evaluation of OrthoSAM to include three images from the ImageGrains dataset (FH, K1, S1) and a selection of images from the SediNet dataset. We assessed OrthoSAM’s predictions against ImageGrains’ predictions. Because the two methods detect different numbers of objects, we rely only on recall and mean IoU. Recall measures the ability of a model to identify the relevant instances in a dataset; it is calculated as the number of true positives divided by the sum of true positives and false negatives. We count as true positives only those objects that were also identified by ImageGrains. By computing the recall between two predictions, we quantify the degree of agreement in object detection, whereas mean IoU compares the output masks. Since the ImageGrains models have already been validated in their original study, these measurements provide a proxy of validity based on how close our predictions are to a validated reference.
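A sketch of this matching logic follows, assuming boolean NumPy masks; the 0.5 IoU threshold for counting a reference object as found is an illustrative assumption, not the value used in the study.

```python
import numpy as np

def recall_and_mean_iou(reference_masks, pred_masks, iou_thr=0.5):
    """Match each reference object to its best-overlapping prediction;
    a match above the IoU threshold counts as a true positive."""
    tp, ious = 0, []
    for ref in reference_masks:
        best = 0.0
        for pred in pred_masks:
            union = np.logical_or(ref, pred).sum()
            iou = np.logical_and(ref, pred).sum() / union if union else 0.0
            best = max(best, iou)
        if best >= iou_thr:
            tp += 1
            ious.append(best)
    recall = tp / len(reference_masks)
    mean_iou = float(np.mean(ious)) if ious else 0.0
    return recall, mean_iou
```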
As illustrated in Figure 1, OrthoSAM segments objects that are not pebbles or grains, and it also tends to detect finer objects. Due to its reduced sharpness and clarity, the FH image produces stronger noise in the results. The presence of very fine pebbles requires more input points, which further increases the likelihood of noise, making FH a particularly challenging case. This behavior likely explains the reduced precision and reflects a current limitation of the approach and a lack of ground-truthing data.
Best regards,
RC2: 'Comment on egusphere-2025-4003', Zoltan Sylvester, 07 Oct 2025
The manuscript by Chan et al. focuses on the description and validation of an open-source Python machine learning model called ‘OrthoSAM’, which relies on the Segment Anything Model (SAM) to generate instance segmentations of images of coarse-grained fluvial sediment. As someone who has also done some work on using SAM for grain segmentation, I think this is a promising approach, and having access to a variety of techniques and implementations at this stage is overall an advantage. The paper is well written and nicely illustrated, it includes a number of novel approaches that have not been implemented before, and the authors have clearly put a significant amount of thoughtful and careful work into the software and into validating the results with synthetic and field data. In addition, they have made the code open-source and available as a GitHub repository, which makes it a lot easier for these methods to be adopted and tested on other datasets.
I do have a number of comments that I think should be addressed by the authors before publication; these are as follows.
- The SAM-based approach and the tiling of large images are features of OrthoSAM that our Python module called ‘Segmenteverygrain’ also relies on. Although Segmenteverygrain is mentioned in the manuscript, I think there should be a somewhat more detailed discussion of the differences between the two techniques: not just the fact that OrthoSAM relies only on SAM, without the need for the U-Net pass, but also aspects like how broadly the model is applicable, how the model outputs can be improved, and whether the model can be fine-tuned. I do think that there is room for a variety of approaches to taking advantage of SAM (and of other, similar models) in sedimentology and geomorphology, but it will be useful for the reader to get a brief overview of the differences between the existing tools.
- One of the novel aspects of the work presented by Chan et al. is the generation of synthetic data that is then used for validation. While I totally see the value of this in increasing the community’s confidence in the model, one of the important questions about ML models is their ability to generalize. Although SAM has been trained on a wide variety of images and is good at generalization, I think it is less clear how well OrthoSAM would perform on real images of coarse-grained sediment that are quite different from the examples used in the paper. Although the authors are right that “manual validation is inevitably prone to subjectivity and human error, leading to potential biases and inconsistencies”, I would argue that a carefully QC’d segmentation of real datasets is potentially more valuable for validating a machine learning model than a synthetic dataset that does not fully reproduce the complexity and variety of actual datasets. So I concur with the other reviewer that applying OrthoSAM to other datasets would be a valuable addition to the paper. It should not take too long to run it on some other publicly available datasets.
- I do not think this is a major issue, certainly not for this manuscript, but: I have tried to install OrthoSAM on my computer and to run one of the notebooks but I gave up without getting to a result because I got a number of errors early on. Making it easier for a broad range of users to install and run the code will ensure a broader adoption of OrthoSAM.
- The ‘hardware requirements’ section is quite useful, but it could be improved if typical compute times were added, e.g., how long does it take to create a segmentation result for an image with ~1000 grains? Is it possible/feasible at all to run the segmentation on a CPU?
I hope the authors will find these comments / suggestions somewhat useful.
Sincerely,
Zoltan Sylvester
Citation: https://doi.org/10.5194/egusphere-2025-4003-RC2
AC1: 'Reply on RC2', Vito Chan, 07 Dec 2025
Publisher’s note: this comment is a copy of AC2 and its content was therefore removed on 9 December 2025.
Citation: https://doi.org/10.5194/egusphere-2025-4003-AC1
AC2: 'Reply on RC2', Vito Chan, 07 Dec 2025
Thank you for your thoughtful feedback and the helpful suggestions to improve the manuscript.
I do have a number of comments that I think should be addressed by the authors before publication; these are as follows.
The SAM-based approach and the tiling of large images are features of OrthoSAM that our Python module called ‘Segmenteverygrain’ also relies on. Although Segmenteverygrain is mentioned in the manuscript, I think there should be a somewhat more detailed discussion of the differences between the two techniques: not just the fact that OrthoSAM relies only on SAM, without the need for the U-Net pass, but also aspects like how broadly the model is applicable, how the model outputs can be improved, and whether the model can be fine-tuned. I do think that there is room for a variety of approaches to taking advantage of SAM (and of other, similar models) in sedimentology and geomorphology, but it will be useful for the reader to get a brief overview of the differences between the existing tools.
OrthoSAM is specifically designed as a workflow to assist SAM in delineating densely packed objects in large, high-resolution images. It achieves this by focusing on three main components: a tiling scheme, improved input point generation, and a multi-scale resampling scheme (resolution passes).
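To illustrate the first of these components, here is a minimal sketch of an overlapping tiling scheme, assuming a NumPy image array; the tile size, overlap, and the later stitching of per-tile masks are simplified relative to OrthoSAM’s actual implementation.

```python
import numpy as np

def tile_image(image, tile=1024, overlap=256):
    """Split a large orthomosaic into overlapping tiles so that objects
    cut by one tile boundary fall fully inside a neighbouring tile."""
    step = tile - overlap
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            tiles.append(((y, x), image[y:y + tile, x:x + tile]))
    return tiles  # offsets are needed later to merge masks globally
```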
Segmenteverygrain, conversely, benefits from its initial U-Net pass because it restricts SAM's operation to areas already classified as grains, thereby effectively filtering out irrelevant objects. OrthoSAM instead segments all objects in an image and may delineate objects that are not pebbles; these need to be removed by additional filtering steps or manually. Segmenteverygrain's approach ensures that only pebbles are delineated, but it may also miss pebbles that were not initially detected by the neural network. Combining efforts on this in the future might be worthwhile.
In the first version of the manuscript, we had a paragraph dedicated to Segmenteverygrain in the introduction. In the revised manuscript, we have elaborated on these points in the discussion and included additional results on OrthoSAM’s performance on both the ImageGrains and SediNet datasets (see Figures 1 and 2 below). While precise metrics cannot be computed due to the lack of ground truth data, the results were visually assessed and are available through our GitHub repository. These examples demonstrate that OrthoSAM generalizes well across images with different grain characteristics and scene complexities.
One of the novel aspects of the work presented by Chan et al. is the generation of synthetic data that is then used for validation. While I totally see the value of this in increasing the community’s confidence in the model, one of the important questions about ML models is their ability to generalize. Although SAM has been trained on a wide variety of images and is good at generalization, I think it is less clear how well OrthoSAM would perform on real images of coarse-grained sediment that are quite different from the examples used in the paper. Although the authors are right that “manual validation is inevitably prone to subjectivity and human error, leading to potential biases and inconsistencies”, I would argue that a carefully QC’d segmentation of real datasets is potentially more valuable for validating a machine learning model than a synthetic dataset that does not fully reproduce the complexity and variety of actual datasets. So I concur with the other reviewer that applying OrthoSAM to other datasets would be a valuable addition to the paper. It should not take too long to run it on some other publicly available datasets.
We agree that a high-quality training dataset for pebble segmentation would be useful for several machine-learning applications. However, these data do not (yet) exist, and it would be an important community effort to produce such a dataset, similar to the reference datasets that have been generated for the lidar classification community.
We used the synthetic dataset to identify SAM’s sensitivity to grain size and color; in particular, it allowed us to establish the lower detection size limit. We also found that SAM is not very sensitive to color variation and color noise until a very high level of noise is added. These were important findings of the synthetic analysis, and real-world imagery provides additional challenges.
We elaborated in the revised manuscript on the validation of real-world datasets and included additional discussion of OrthoSAM’s performance on both the ImageGrains and SediNet datasets. While precise metrics cannot be computed because neither provides a complete ground truth, the results were visually assessed and are available through our GitHub repository. These examples demonstrate that OrthoSAM generalizes well across images with different grain characteristics and scene complexities.
We provide Jupyter notebooks on our GitHub repository that guide users through the process of segmenting SediNet images. These contain the parameters used for their segmentation in OrthoSAM. The parameters will need to be adjusted for different imagery, because grain packing varies. We note that the SediNet images are only somewhat useful in this regard, because many of them are not scaled, so sizes can only be reported relative to pixel areas.
I do not think this is a major issue, certainly not for this manuscript, but: I have tried to install OrthoSAM on my computer and to run one of the notebooks but I gave up without getting to a result because I got a number of errors early on. Making it easier for a broad range of users to install and run the code will ensure a broader adoption of OrthoSAM.
We note the reviewer’s comments and have modified the packaging and installation routine of our setup. OrthoSAM is now properly packaged and included in the `requirements.txt`. It can also be installed using `pip install -e .`, provided that the repository directory is set as the working directory. Once installed, OrthoSAM can be imported system-wide within the active virtual environment. These updates streamline the installation process and have improved the overall usability of the software.
We provide video material that guides users through the installation and processing steps (in addition to the tutorials on the GitHub webpage).
We note that the processing can also be done within a Google Colab Environment, and we provide an example of this: https://www.youtube.com/watch?v=bLU6dbQ3vt0
An example of a video guiding through the analysis: https://www.youtube.com/watch?v=vu67RpeNHO4
The ‘hardware requirements’ section is quite useful, but it could be improved if typical compute times were added, e.g., how long does it take to create a segmentation result for an image with ~1000 grains? Is it possible/feasible at all to run the segmentation on a CPU?
In the revised manuscript, we have added typical processing times. The segmentation of a synthetic image with 10,000 × 10,000 pixels requires approximately 4 hours, whereas an image with 2,048 × 2,048 pixels requires about 5 minutes. Both were run on a Quadro RTX 5000 GPU with 16 GB of RAM and an Intel Xeon W-1290 10-core processor. Processing the ImageGrains K1 image with 1,350 × 1,200 pixels took 10 minutes on the same system due to the use of an upscaled layer, whereas image S1 with 3,062 × 2,722 pixels took 50 minutes on the same system with the same settings.
We note that these numbers may differ with a different hardware setup. OrthoSAM is aimed at GPU processing but will also work on a CPU; CPU processing can take one order of magnitude longer (see the device-selection sketch below).
We have included these numbers in the revised manuscript.
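For users choosing between the two, the standard PyTorch device fallback applies when loading SAM; the checkpoint path below is illustrative and should point at the SAM weights you downloaded.

```python
import torch
from segment_anything import sam_model_registry

# Fall back to the CPU when no GPU is available; segmentation still runs,
# but roughly an order of magnitude slower, as noted above.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint path is illustrative; substitute your local SAM weights file.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device)
```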
Best regards,
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 945 | 101 | 28 | 1,074 | 35 | 39 |
References (including in-line comments):
Chen, Y., Bao, J., Chen, R., Li, B., Yang, Y., Renteria, L., Delgado, D., Forbes, B., Goldman, A. E., Simhan, M., Barnes, M. E., Laan, M., McKever, S., Hou, Z. J., Chen, X., Scheibe, T., & Stegen, J. (2024). Quantifying Streambed Grain Size, Uncertainty, and Hydrobiogeochemical Parameters Using Machine Learning Model YOLO. Water Resources Research, 60(11). https://doi.org/10.1029/2023WR036456
Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J., Chen, J., Chen, C., Liu, S., Chi, H., Hu, X., Yue, K., Li, L., Grau, V., Fan, D. P., Dong, F., & Ni, D. (2024). Segment anything model for medical images? Medical Image Analysis, 92. https://doi.org/10.1016/j.media.2023.103061
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. http://arxiv.org/abs/2304.02643
Mair, D., Witz, G., Do Prado, A. H., Garefalakis, P., & Schlunegger, F. (2024). Automated detecting, segmenting and measuring of grains in images of fluvial sediments: The potential for large and precise data from specialist deep learning models and transfer learning. Earth Surface Processes and Landforms, 49(3), 1099–1116. https://doi.org/10.1002/esp.5755
Pachitariu, M., Rariden, M., & Stringer, C. (2025). Cellpose-SAM: superhuman generalization for cellular segmentation. https://doi.org/10.1101/2025.04.28.651001
Padilla, R., Netto, S. L., & da Silva, E. A. B. (2020). A Survey on Performance Metrics for Object-Detection Algorithms. 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), 237–242. https://doi.org/10.1109/IWSSIP48289.2020.9145130
Stringer, C., Wang, T., Michaelos, M., & Pachitariu, M. (2021). Cellpose: a generalist algorithm for cellular segmentation. Nature Methods, 18(1), 100–106. https://doi.org/10.1038/s41592-020-01018-x
Zegers, G., Hayashi, M., & Garcés, A. (2025). Distributed estimation of surface sediment size in paraglacial and periglacial environments using drone photogrammetry. Earth Surface Processes and Landforms, 50(7). https://doi.org/10.1002/esp.70093