the Creative Commons Attribution 4.0 License.
Oriented Object Detection for Complex Hydrodynamic Features: A Multi-Platform Rip Current Identification System
Abstract. Rip currents are hazardous, fast-moving seaward flows and remain one of the leading causes of rescues and drownings on surf beaches, yet their automated detection remains a significant challenge due to their amorphous, dynamic morphology and the environmental complexity of the surf zone. This study introduces a novel platform-agnostic deep learning–based framework for automated rip current detection from beach imaging platforms, integrating three core contributions: a diverse new dataset, a rigorous architectural benchmark, and a deployable operational tool. We first present RipAID, a comprehensive dataset enriched with multi-platform imagery and multiple viewing angles to ensure scale-invariant learning. Building on this resource, a systematic evaluation of state-of-the-art architectures demonstrates that geometric fidelity is critical; specifically, Oriented Bounding Boxes (OBB) significantly outperform standard axis-aligned methods. Our optimized YOLOv11n-OBB model achieves robust performance (mAP50: 0.927), with inference speeds from 2.4 to 60 FPS on hardware ranging from edge devices to GPU workstations. To bridge the gap between research and practice, and ensure that the results are reusable and reproducible, the framework and model weights have been released as an open-source, containerized module (socib-rip-currents-detection), providing the coastal safety community with a scalable, ready-to-use and standardized tool for continuous, automated rip current monitoring.
Status: open (until 17 May 2026)
- RC1: 'Comment on egusphere-2026-1138: good idea but needs significant improvements', Anonymous Referee #1, 17 Apr 2026
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 73 | 34 | 8 | 115 | 6 | 10 |
General comments:
The paper presents a reasonable idea, arguing that oriented bounding boxes (OBB) capture less background noise than axis-aligned bounding boxes, at some loss in accuracy but at a fraction of the cost of segmentation. It introduces a dataset of 6,789 images with rip currents annotated with oriented bounding boxes.
Specific comments:
The paper's main argument, that rip current detection benefits from OBB, is a reasonable one. However, several strong critiques need to be addressed:
- the evaluation relies on classic detection metrics and a custom metric, rip_fit. While this is a reasonable approach, the authors' own introduction notes that rip currents are a leading cause of drownings globally. In such scenarios, where a missed detection (false negative) is costly, the F2 score is a standardized, recall-weighted metric that can be used alongside the current ones. Introducing a custom metric risks further fragmenting the research, with minimal benefit, if any;
- the dataset, at 6,789 images, is rather modest in size. One of the cited sources is a Roboflow link that is no longer available (CnE UFSC (2023)), and another source, RipScout (2025), is claimed to contribute 2,555 images from 73 videos; on closer inspection, RipScout introduced 2,555 images from 8 videos with highly repetitive patterns, alongside 1,767 images from de Silva (2021). This raises doubt over the claimed sourcing and diversity of the dataset, especially when compared to de Silva (2021). De Silva (2021) trained on 2,440 images (1,767 with rip currents and 673 without) and tested on 23 videos. Dumitriu (2025) introduced 150 rip current videos with 15,784 annotated frames, alongside 34 videos with no rip currents. The conclusion states that "a key contribution is the development [...] a novel dataset that significantly expands the diversity of the available training data [...]". Considering that more than a third of the data comes from previous papers and another source is no longer available for analysis, this claim is questionable;
- there is little discussion of the annotation procedure. Multiple studies (Ballantyne et al. (2005), Sherker et al. (2010), Caldwell et al. (2013), Brannstrom et al. (2014), Sotes et al. (2018), Pitman et al. (2021)) have found that while many people are confident they can correctly identify rip currents, most cannot. While those surveyed were not experts, the annotation process needs to be done thoroughly and documented. If multiple annotators were used, an inter-annotator agreement coefficient (such as Cohen's kappa) is also recommended. For example, a rip current expert we consulted said that, at first glance, Figure 8.c is incorrectly annotated: the YOLO detection on the right appears correct, while the annotation on the left is unclear. Other instances from that camera do show a rip current only on the left, but these seem to be from a different time and date and do not necessarily generalize (based on simple visualization);
- 10-fold cross-validation is indeed useful with this volume of data. It is not clear, however, how it was done. The paper introduces 6,789 images with 10,131 annotations (I assume rip current instances, of which there can be more than one per image) and then performs the 10-fold split on the rip current annotations rather than on the images. It is unclear both how and why the authors proceeded this way;
- the way the authors approached k-fold cross-validation is incorrect. Please see https://scikit-learn.org/stable/modules/cross_validation.html for details. The 10-fold split should be done only on the train and validation data, with a separate, fixed test set. As done here, the test data is unreliable and more akin to validation data. The correct approach is an initial split into (train+val) and a fixed test set; k-fold splitting is then used within (train+val) to find the best combination of train and validation splits. Once done, you evaluate only once on the test data, and this becomes your final result. Modifying the training based on this result negates the value of the test data;
- splitting into three classes, "rip", "sediment" and "doubt", is risky from a training point of view. The "doubt" and "rip" classes can be visually similar; the model treats "doubt" as a completely distinct class and may thus be confused by the visual similarities. I suggest an approach based on clear rip current classes (here, sediment is a large enough distribution to make sense) and judging "doubt" based on model confidence, not as a separate class on its own, especially since the annotations themselves are not discussed and may be questionable. The current setup both confuses the model and risks introducing the strong bias of the annotators. The authors did compare 3 classes vs. 1 class (line 317), which is reasonable, but to make an argument for the "doubt" class it should be isolated in a 3-class vs. 2-class experiment, with the results evaluated by a rip current expert. Quantitative results alone are not enough in this context: the dataset is small, there are no null (no-rip) images in training and evaluation, and there are class imbalance and annotation biases. The authors should also account for the imbalance across the three classes, perhaps with a different loss (such as weighted cross-entropy);
- training and evaluation in such cases also needs to consider images without rip currents, similarly to previous rip current papers (de Silva (2021), Zhu (2023), Dumitriu (2025)) and to most papers in the object detection field;
- considering the low volume of data and the several questionable aspects mentioned above, packaging a detector as "socib-rip-currents-detection", while a noble and arguably necessary endeavor, poses multiple risks. Unknowing and non-technical people may use and rely on it, with potentially grave consequences. A rip current detector, while extremely useful if done correctly, should be approached and published with meticulous detail, testing and care;
- the authors also do not compare directly (enough) against other dataset and benchmark papers. They do mention some of them, but a direct comparison (numbers, annotations, results, etc.) is warranted;
- the paper claims that recent advancements use "advanced modalities" (line 80). That is incorrect: all of the current papers use RGB images/videos. In the context of computer vision, advanced modalities implies non-traditional data sources beyond 2D RGB imaging;
- the paper is long and detailed, containing a bit of everything. In some places the detailed approach is well executed, while in others it either omits relevant information (such as the aforementioned annotation procedure) or goes into too much detail on basic material (such as explaining what TP, FP, etc. are). The paper should focus more on the novelty, the contributions and the results.
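On the F2 point above: a minimal sketch of how F2 relates to precision and recall, using scikit-learn's `fbeta_score`. The labels are purely illustrative; in a detection setting they would come from matching predictions to ground-truth boxes at a chosen IoU threshold.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Illustrative per-detection labels: 1 = rip current present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# beta=2 weights recall twice as heavily as precision, the appropriate
# trade-off when missing a rip current is the costly error.
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f"P={precision:.3f} R={recall:.3f} F2={f2:.3f}")
```

With the toy labels above, all three values come out to 5/6 ≈ 0.833; with an imbalanced error profile F2 moves toward recall, which is the point of reporting it.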
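On the cross-validation point above: the correct protocol can be sketched as follows, splitting by image (not by annotation) so that all annotations from one image stay in the same split. The image count matches the paper; everything else is illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

image_ids = np.arange(6789)  # one entry per image, never per annotation

# 1) Hold out a fixed test set once, before any cross-validation.
trainval_ids, test_ids = train_test_split(
    image_ids, test_size=0.15, random_state=0
)

# 2) Run 10-fold CV on the remaining images only, for model selection.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (tr_idx, va_idx) in enumerate(kf.split(trainval_ids)):
    train_ids = trainval_ids[tr_idx]
    val_ids = trainval_ids[va_idx]
    # ... train on train_ids, tune on val_ids ...

# 3) After model selection, evaluate exactly once on test_ids.
assert len(set(test_ids) & set(trainval_ids)) == 0  # test never leaks into CV
```

The final assertion is the whole point: the test set never participates in any fold, so the single evaluation on it is an unbiased estimate.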
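On the inter-annotator agreement point above: if two annotators label the same images, Cohen's kappa is directly computable with scikit-learn. The labels below are hypothetical, using the paper's three class names.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-image labels from two independent annotators.
annotator_a = ["rip", "rip", "doubt", "sediment", "rip", "doubt", "rip", "sediment"]
annotator_b = ["rip", "doubt", "doubt", "sediment", "rip", "rip", "rip", "sediment"]

# Kappa corrects raw agreement for agreement expected by chance;
# values above ~0.8 are usually read as strong agreement.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

For these toy labels the raw agreement is 0.75 but kappa is 0.60, illustrating why the chance-corrected coefficient, not raw agreement, should be reported.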
Overall, the paper starts research in a direction with considerable potential, but makes claims that are too big for the data, research and effort put into it. I recommend treating this as a valuable checkpoint toward such claims and targets, not as the final result.