OrthoSAM: Multi-Scale Extension of the Segment Anything Model for River Pebble Delineation from Large Orthophotos
Abstract. Sediment characteristics and grain-size distribution are crucial for understanding natural hazards, hydrologic conditions, and ecosystems. However, traditional methods for collecting this information are costly, labor-intensive, and time-consuming. To address this, we present OrthoSAM, a workflow leveraging the Segment Anything Model (SAM) for automated delineation of densely packed pebbles in high-resolution orthomosaics. Our framework consists of a tiling scheme, improved seed (input) point generation, and a multi-scale resampling scheme. Validation using synthetic images shows precision close to 1, recall above 0.9, and mean IoU above 0.9. Using a large synthetic dataset, we show that the two-sample Kolmogorov-Smirnov test confirms the accuracy of the grain-size distribution. We identified a size detection limit of 30 pixels; pebbles with a diameter below this limit are not reliably detected. Applying OrthoSAM to orthomosaics from the Ravi River in India, we delineated 6087 pebbles with high precision (0.93) and recall (0.94). The resulting grain statistics include area, axis lengths, perimeter, RGB statistics, and smoothness measurements, providing valuable insights for further analysis in geomorphology and ecosystem studies.
Status: final response (author comments only)
RC1: 'Comment on egusphere-2025-4003', David Mair, 20 Sep 2025
AC3: 'Reply on RC1', Vito Chan, 07 Dec 2025
The authors present a novel method and proof-of-concept for pebble segmentation in orthoimages by adapting the popular and widely-used Segment Anything Model (SAM; Kirillov et al., 2023). They identify important, but often unaddressed, weaknesses of SAM, such as the reduced performance in dense segmentation tasks (where many instances of the same object class should be segmented), and its limited capability to segment objects from one class with a significant size variability. To test their approach, the authors use 1) synthetic images with circles as a proxy for pebbles and 2) ortho-mosaics of real pebbles created with handheld cameras and photogrammetric processing. In their experiment 1, they test for the effect of a variety of image perturbations on segmentation quality. Here, they find that particularly shadow effects have some negative impact on SAM’s segmentation performance. In experiment 2, they apply their workflow to real-world images, showcasing the improvement of their multi-scale segmentation with SAM. In this scenario, they categorically evaluate segmentation performance through manual counting due to the lack of ground truth masks. Both experiments show that their approach is up to the task and has the potential to mitigate some of the segmentation shortcomings of SAM for such applications.
I find the method well-conceived and thought-through, the data rigorously tested and clearly reported, and the manuscript well structured. In particular, I consider the balance between technical details in the main manuscript and the appendices well struck, which makes the manuscript very readable, while not omitting relevant information. The presented results generally support the findings and conclusions. Here, I would only have two suggestions for calculating additional scores and using an additional image dataset to test the approach (see specific comments below), which might allow for a better evaluation of some aspects of the segmentation performance of SAM/OrthoSAM. However, these are just suggestions, not concerns raised. Currently, the manuscript has many small figures; maybe combining some figures into larger figures (e.g., Figures 10 and 11) would be helpful. Additionally, some minor/technical comments are included as in-line comments in the attached pdf.
In summary, I find the work of very high quality, with only a few minor points where the manuscript could be further improved. I suspect the authors will have no problems in addressing these points, and I look forward to seeing the manuscript published soon.
Kind regards,
David Mair (Uni Bern)
Thank you for your positive feedback and the helpful suggestions to improve the manuscript.
Specific comments:
Additional metrics for segmentation performance: The authors use well-established metrics to evaluate the segmentation performance. However, I would suggest additionally calculating Average Precision (AP) scores at fixed IoU thresholds (e.g., AP@0.5 IoU and/or mAP@0.5-0.9 IoU), as used in the SAM paper (Kirillov et al., 2023) and as widely used for instance segmentation tasks in general (e.g., Padilla et al., 2020). This is because I suspect that SAM segmentations are slightly worse for the colored synthetic images than for the black and white, while in both cases they score high in precision, recall, and mean IoU (see also lines 198-199). These scores could be calculated from the TP, FN, and FP values, where all TPs falling below a certain IoU threshold would count as FP. These scores might more clearly show that SAM is sensitive to shadows during segmentation (see also related comments in the pdf).
We have followed this suggestion and included Average Precision (AP) scores as an additional segmentation metric. Because Segment Anything does not output per-object confidence values, we modified the AP calculation to use object size as a proxy for confidence. We report AP@0.75 for both the B&W and the Color with Shadow synthetic images. B&W [4,3000] achieved an AP@0.75 of 1.00, while the Color with Shadow images achieved an average of 0.87. For the Ravi images, an overall AP of 0.97 was achieved. AP confirms that OrthoSAM’s segmentation performance is slightly worse for the colored synthetic images, which aligns with the reviewer’s observation. We have revised the manuscript to reflect the additional metric and results.
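For reference, below is a minimal sketch of how such a size-ranked AP at a fixed IoU threshold can be computed, assuming instance masks as boolean NumPy arrays; the function name and the greedy matching are illustrative simplifications, not OrthoSAM’s actual evaluation code.

```python
import numpy as np

def average_precision(pred_masks, gt_masks, iou_thr=0.75):
    """AP at a fixed IoU threshold, ranking predictions by mask area
    (a proxy for confidence, since SAM reports none per object)."""
    order = np.argsort([-m.sum() for m in pred_masks])  # largest first
    matched, tp = set(), np.zeros(len(pred_masks))
    for rank, i in enumerate(order):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            union = np.logical_or(pred_masks[i], gt).sum()
            iou = np.logical_and(pred_masks[i], gt).sum() / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:  # TP only above the IoU threshold
            tp[rank] = 1
            matched.add(best_j)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / len(gt_masks)
    # area under the precision-recall curve (all-point form)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```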
Adding a dataset with instance labels for pebbles. In lines 99-100, it is stated that ideally the workflow should be tested on a dataset of several hundred to thousands of delineated pebbles. This is picked up in line 216, when it is correctly stated that no ground truth masks are available for the Ravi dataset, and hence no IoU scores can be calculated. Here, I would like to mention our S1 dataset (as part of the data used in Mair et al., 2024), which has > 2000 manually annotated pebble masks from orthomosaics (available here: https://zenodo.org/records/8005771). It would be interesting to see how OrthoSAM would perform here; I suspect it will perform very well, especially due to the grain size variability similar to that of the Ravi River. Using these data as an additional test could help to increase confidence in the performance of OrthoSAM.
Yes, we are aware of the limitations of a synthetic dataset, but also of its benefits: a perfectly labeled dataset. In preparation for the initial manuscript submission, we performed segmentation on the SediNet dataset and the ImageGrains dataset mentioned by the reviewer. For the ImageGrains dataset, we noticed that the labels (i.e., the ground-truth masks) are only partially complete and do not include all pebbles. Our initial analysis and discussion among the authors of this manuscript suggested that we cannot use their labeled data as a validation dataset. We have re-evaluated our initial assessment in the revisions and now include segmentation results and a comparison (see Table 1 and Figure 1). However, we refrain from treating these data as a true label dataset and do not report revised accuracy and precision statistics.
Instead, we compare the derived size distribution with a two-sided KS test (Figure 3). We note that according to the KS test, the distributions are not equal. We observe that OrthoSAM detects many more objects (cf. Table 1) and also many more small objects. Visual inspection reveals that OrthoSAM detects many of the smaller grains correctly (that are not in the ground-truth labeling dataset), but also identifies some objects that are not pebbles. These will need to be removed by additional filtering steps, for example through the normalized isoperimetric ratio (IRn). IRn provides a measurement for the roundness of an object, which can be used to remove irrelevant objects such as vegetation. OrthoSAM reports various statistics of segmented objects that can also be used for further filtering through more sophisticated outlier detection algorithms, such as decision trees. We note that some of the pebbles detected by OrthoSAM and not ImageGrains are low-contrast pebbles (i.e., the pebbles' color and texture are only slightly different from the background). We speculate that the gradient-based convolutional neural network in ImageGrains has not been extensively trained for low-gradient segments and thus excludes them.
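As an illustration of such an IRn-based filter, here is a minimal sketch assuming boolean NumPy masks with one connected component each; the threshold value is a hypothetical choice for illustration, not an OrthoSAM default.

```python
import numpy as np
from skimage.measure import label, regionprops

def filter_by_roundness(masks, irn_min=0.6):
    """Drop segments whose normalized isoperimetric ratio falls below a
    threshold; IRn = 4*pi*A / P**2 equals 1 for a perfect circle and
    approaches 0 for elongated or ragged shapes such as vegetation."""
    kept = []
    for m in masks:
        props = regionprops(label(m.astype(np.uint8)))[0]
        if props.perimeter == 0:  # degenerate one-pixel segment
            continue
        irn = 4 * np.pi * props.area / props.perimeter ** 2
        if irn >= irn_min:
            kept.append(m)
    return kept
```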
We have extended the evaluation of OrthoSAM to include three images from the ImageGrains dataset (FH, K1, S1) and a selection of images from the SediNet dataset. We assessed OrthoSAM’s predictions against ImageGrains’ predictions. Because the two methods detect different numbers of objects, we rely only on recall and mean IoU. Recall measures the ability of a model to identify the relevant instances in a dataset; it is calculated as the number of true positives divided by the sum of true positives and false negatives. We count as true positives only those objects that were also identified by ImageGrains. By computing the recall between two predictions, we quantify the degree of agreement in object detection, whereas mean IoU compares the output masks. Since the ImageGrains models have already been validated in their original study, these measurements provide a proxy of validity based on how close our predictions are to a validated reference.
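A sketch of this matching logic follows, assuming boolean NumPy masks; the 0.5 IoU threshold for counting a reference object as found is an illustrative assumption, not the value used in the study.

```python
import numpy as np

def recall_and_mean_iou(reference_masks, pred_masks, iou_thr=0.5):
    """Match each reference object to its best-overlapping prediction;
    a match above the IoU threshold counts as a true positive."""
    tp, ious = 0, []
    for ref in reference_masks:
        best = 0.0
        for pred in pred_masks:
            union = np.logical_or(ref, pred).sum()
            iou = np.logical_and(ref, pred).sum() / union if union else 0.0
            best = max(best, iou)
        if best >= iou_thr:
            tp += 1
            ious.append(best)
    recall = tp / len(reference_masks)
    mean_iou = float(np.mean(ious)) if ious else 0.0
    return recall, mean_iou
```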
As illustrated in Figure 1, OrthoSAM segments objects that are not pebbles or grains, and it also tends to detect finer objects. Due to its reduced sharpness and clarity, the FH image produces stronger noise in the results. The presence of very fine pebbles requires more input points, which further increases the likelihood of noise, making FH a particularly challenging case. This behavior likely explains the reduced precision and reflects a current limitation of the approach and a lack of ground-truthing data.
Best regards,
RC2: 'Comment on egusphere-2025-4003', Zoltan Sylvester, 07 Oct 2025
The manuscript by Chan et al. focuses on the description and validation of an open-source Python machine learning model called ‘OrthoSAM’, which relies on the Segment Anything Model (SAM) to generate instance segmentations of images of coarse-grained fluvial sediment. As someone who has also done some work on using SAM for grain segmentation, I think this is a promising approach, and having access to a variety of techniques and implementations at this stage is overall an advantage. The paper is well written and nicely illustrated, it includes a number of novel approaches that have not been implemented before, and the authors have clearly put a significant amount of thoughtful and careful work into the software and into validating the results with synthetic and field data. In addition, they have made the code open-source and available as a GitHub repository, which makes it a lot easier for these methods to be adopted and tested on other datasets.
I do have a number of comments that I think should be addressed by the authors before publication; these are as follows.
- The SAM-based approach and the tiling of large images are features of OrthoSAM that our Python module called ‘Segmenteverygrain’ also relies on. Although Segmenteverygrain is mentioned in the manuscript, I think there should be a somewhat more detailed discussion of the differences between the two techniques: not just the fact that OrthoSAM relies only on SAM, without the need for the U-Net pass, but also aspects like how broadly the model is applicable, how the model outputs can be improved, and whether the model can be fine-tuned. I do think that there is room for a variety of approaches to taking advantage of SAM (and of other, similar models) in sedimentology and geomorphology, but it will be useful for the reader to get a brief overview of the differences between the existing tools.
- One of the novel aspects of the work presented by Chan et al. is the generation of synthetic data that is then used for validation. While I totally see the value of this in increasing the community’s confidence in the model, one of the important questions about ML models is their ability to generalize. Although SAM has been trained on a wide variety of images and is good at generalization, I think it is less clear how well OrthoSAM would perform on real images of coarse-grained sediment that are quite different from the examples used in the paper. Although the authors are right that “manual validation is inevitably prone to subjectivity and human error, leading to potential biases and inconsistencies”, I would argue that a carefully QC’d segmentation of real datasets is potentially more valuable for validating a machine learning model than a synthetic dataset that does not fully reproduce the complexity and variety of actual datasets. So I concur with the other reviewer that applying OrthoSAM to other datasets would be a valuable addition to the paper. It should not take too long to run it on some other publicly available datasets.
- I do not think this is a major issue, certainly not for this manuscript, but: I have tried to install OrthoSAM on my computer and to run one of the notebooks but I gave up without getting to a result because I got a number of errors early on. Making it easier for a broad range of users to install and run the code will ensure a broader adoption of OrthoSAM.
- The ‘hardware requirements’ section is quite useful, but it could be improved if typical compute times were added, e.g., how long does it take to create a segmentation result for an image with ~1000 grains? Is it possible/feasible at all to run the segmentation on a CPU?
I hope the authors will find these comments / suggestions somewhat useful.
Sincerely,
Zoltan Sylvester
Citation: https://doi.org/10.5194/egusphere-2025-4003-RC2
AC1: 'Reply on RC2', Vito Chan, 07 Dec 2025
Publisher’s note: this comment is a copy of AC2 and its content was therefore removed on 9 December 2025.
Citation: https://doi.org/10.5194/egusphere-2025-4003-AC1
AC2: 'Reply on RC2', Vito Chan, 07 Dec 2025
Thank you for your thoughtful feedback and the helpful suggestions to improve the manuscript.
I do have a number of comments that I think should be addressed by the authors before publication; these are as follows.
The SAM-based approach and the tiling of large images are features of OrthoSAM that our Python module called ‘Segmenteverygrain’ also relies on. Although Segmenteverygrain is mentioned in the manuscript, I think there should be a somewhat more detailed discussion of the differences between the two techniques: not just the fact that OrthoSAM relies only on SAM, without the need for the U-Net pass, but also aspects like how broadly the model is applicable, how the model outputs can be improved, and whether the model can be fine-tuned. I do think that there is room for a variety of approaches to taking advantage of SAM (and of other, similar models) in sedimentology and geomorphology, but it will be useful for the reader to get a brief overview of the differences between the existing tools.
OrthoSAM is specifically designed as a workflow to assist SAM in delineating densely packed objects in large, high-resolution images. It achieves this by focusing on three main components: a tiling scheme, improved input point generation, and a multi-scale resampling scheme (resolution passes).
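To illustrate the first of these components, here is a minimal sketch of an overlapping tiling scheme, assuming a NumPy image array; the tile size, overlap, and the later stitching of per-tile masks are simplified relative to OrthoSAM’s actual implementation.

```python
import numpy as np

def tile_image(image, tile=1024, overlap=256):
    """Split a large orthomosaic into overlapping tiles so that objects
    cut by one tile boundary fall fully inside a neighbouring tile."""
    step = tile - overlap
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            tiles.append(((y, x), image[y:y + tile, x:x + tile]))
    return tiles  # offsets are needed later to merge masks globally
```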
Segmenteverygrain, conversely, benefits from its initial U-Net pass because it restricts SAM's operation to areas already classified as grains, thereby effectively filtering out irrelevant objects. OrthoSAM instead segments all objects in an image and may delineate objects that are not pebbles; these need to be removed by additional filtering steps or manually. Segmenteverygrain's approach ensures that only pebbles are delineated, but it may also miss pebbles that were not initially detected by the neural network. Combining efforts on this in the future might be worthwhile.
In the first version of the manuscript, we had a paragraph dedicated to Segmenteverygrain in the introduction. In the revised manuscript, we have elaborated on these points in the discussion and included additional results on OrthoSAM’s performance on both the ImageGrains and SediNet datasets (see Figures 1 and 2 below). While precise metrics cannot be computed due to the lack of ground truth data, the results were visually assessed and are available through our GitHub repository. These examples demonstrate that OrthoSAM generalizes well across images with different grain characteristics and scene complexities.
One of the novel aspects of the work presented by Chan et al. is the generation of synthetic data that is then used for validation. While I totally see the value of this in increasing the community’s confidence in the model, one of the important questions about ML models is their ability to generalize. Although SAM has been trained on a wide variety of images and is good at generalization, I think it is less clear how well OrthoSAM would perform on real images of coarse-grained sediment that are quite different from the examples used in the paper. Although the authors are right that “manual validation is inevitably prone to subjectivity and human error, leading to potential biases and inconsistencies”, I would argue that a carefully QC’d segmentation of real datasets is potentially more valuable for validating a machine learning model than a synthetic dataset that does not fully reproduce the complexity and variety of actual datasets. So I concur with the other reviewer that applying OrthoSAM to other datasets would be a valuable addition to the paper. It should not take too long to run it on some other publicly available datasets.
We agree that a high-quality training dataset for pebble segmentation would be useful for several machine-learning applications. However, these data do not (yet) exist, and it would be an important community effort to produce such a dataset, similar to the reference datasets that have been generated for the lidar classification community.
We used the synthetic dataset to identify SAM’s sensitivity to grain size and color; in particular, it allowed us to establish the lower detection size limit. We also found that SAM is not very sensitive to color variation and color noise until a very high level of noise is added. These were important findings of the synthetic analysis, and real-world imagery provides additional challenges.
We elaborated in the revised manuscript on the validation of real-world datasets and included additional discussion of OrthoSAM’s performance on both the ImageGrains and SediNet datasets. While precise metrics cannot be computed because neither provides a complete ground truth, the results were visually assessed and are available through our GitHub repository. These examples demonstrate that OrthoSAM generalizes well across images with different grain characteristics and scene complexities.
We provide Jupyter notebooks on our GitHub repository that guide users through the process of segmenting SediNet images. These contain the parameters used for their segmentation in OrthoSAM. The parameters will need to be adjusted for different imagery, because grain packing varies. We note that the SediNet images are only somewhat useful in this regard, because many of them are not scaled, so sizes can only be reported relative to pixel areas.
I do not think this is a major issue, certainly not for this manuscript, but: I have tried to install OrthoSAM on my computer and to run one of the notebooks but I gave up without getting to a result because I got a number of errors early on. Making it easier for a broad range of users to install and run the code will ensure a broader adoption of OrthoSAM.
We note the reviewer’s comments and have modified the packaging and installation routine of our setup. OrthoSAM is now properly packaged and included in the `requirements.txt`. It can also be installed using `pip install -e .`, provided that the repository directory is set as the working directory. Once installed, OrthoSAM can be imported system-wide within the active virtual environment. These updates streamline the installation process and have improved the overall usability of the software.
We provide video material that guides users through the installation and processing steps (in addition to the tutorials on the GitHub webpage).
We note that the processing can also be done within a Google Colab Environment, and we provide an example of this: https://www.youtube.com/watch?v=bLU6dbQ3vt0
An example of a video guiding through the analysis: https://www.youtube.com/watch?v=vu67RpeNHO4
The ‘hardware requirements’ section is quite useful, but it could be improved if typical compute times were added, e.g., how long does it take to create a segmentation result for an image with ~1000 grains? Is it possible/feasible at all to run the segmentation on a CPU?
In the revised manuscript, we have added typical processing times. The segmentation of a synthetic image with 10,000 × 10,000 pixels requires approximately 4 hours, whereas an image with 2,048 × 2,048 pixels requires about 5 minutes. Both were run on a Quadro RTX 5000 GPU with 16 GB of RAM and an Intel Xeon W-1290 10-core processor. Processing the ImageGrains K1 image with 1,350 × 1,200 pixels took 10 minutes on the same system due to the use of an upscaled layer, whereas image S1 with 3,062 × 2,722 pixels took 50 minutes on the same system with the same settings.
We note that these numbers may differ with a different hardware setup. OrthoSAM is aimed at GPU processing but will also work on a CPU; CPU processing can take one order of magnitude longer (see the device-selection sketch below).
We have included these numbers in the revised manuscript.
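For users choosing between the two, the standard PyTorch device fallback applies when loading SAM; the checkpoint path below is illustrative and should point at the SAM weights you downloaded.

```python
import torch
from segment_anything import sam_model_registry

# Fall back to the CPU when no GPU is available; segmentation still runs,
# but roughly an order of magnitude slower, as noted above.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint path is illustrative; substitute your local SAM weights file.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device)
```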
Best regards,
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 945 | 101 | 28 | 1,074 | 35 | 39 |
References (including in-line comments):
Chen, Y., Bao, J., Chen, R., Li, B., Yang, Y., Renteria, L., Delgado, D., Forbes, B., Goldman, A. E., Simhan, M., Barnes, M. E., Laan, M., McKever, S., Hou, Z. J., Chen, X., Scheibe, T., & Stegen, J. (2024). Quantifying Streambed Grain Size, Uncertainty, and Hydrobiogeochemical Parameters Using Machine Learning Model YOLO. Water Resources Research, 60(11). https://doi.org/10.1029/2023WR036456
Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J., Chen, J., Chen, C., Liu, S., Chi, H., Hu, X., Yue, K., Li, L., Grau, V., Fan, D. P., Dong, F., & Ni, D. (2024). Segment anything model for medical images? Medical Image Analysis, 92. https://doi.org/10.1016/j.media.2023.103061
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. http://arxiv.org/abs/2304.02643
Mair, D., Witz, G., Do Prado, A. H., Garefalakis, P., & Schlunegger, F. (2024). Automated detecting, segmenting and measuring of grains in images of fluvial sediments: The potential for large and precise data from specialist deep learning models and transfer learning. Earth Surface Processes and Landforms, 49(3), 1099–1116. https://doi.org/10.1002/esp.5755
Pachitariu, M., Rariden, M., & Stringer, C. (2025). Cellpose-SAM: superhuman generalization for cellular segmentation. https://doi.org/10.1101/2025.04.28.651001
Padilla, R., Netto, S. L., & da Silva, E. A. B. (2020). A Survey on Performance Metrics for Object-Detection Algorithms. 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), 237–242. https://doi.org/10.1109/IWSSIP48289.2020.9145130
Stringer, C., Wang, T., Michaelos, M., & Pachitariu, M. (2021). Cellpose: a generalist algorithm for cellular segmentation. Nature Methods, 18(1), 100–106. https://doi.org/10.1038/s41592-020-01018-x
Zegers, G., Hayashi, M., & Garcés, A. (2025). Distributed estimation of surface sediment size in paraglacial and periglacial environments using drone photogrammetry. Earth Surface Processes and Landforms, 50(7). https://doi.org/10.1002/esp.70093