Oriented Object Detection for Complex Hydrodynamic Features: A Multi-Platform Rip Current Identification System
Abstract. Rip currents are hazardous, fast-moving seaward flows and remain one of the leading causes of rescues and drownings on surf beaches, yet their automated detection remains a significant challenge due to their amorphous, dynamic morphology and the environmental complexity of the surf zone. This study introduces a novel, platform-agnostic, deep learning–based framework for automated rip current detection from beach imaging platforms, integrating three core contributions: a diverse new dataset, a rigorous architectural benchmark, and a deployable operational tool. We first present RipAID, a comprehensive dataset enriched with multi-platform imagery and multiple viewing angles to ensure scale-invariant learning. Building on this resource, a systematic evaluation of state-of-the-art architectures demonstrates that geometric fidelity is critical; specifically, Oriented Bounding Boxes (OBBs) significantly outperform standard axis-aligned methods. Our optimized YOLOv11n-OBB model achieves robust performance (mAP50: 0.927), with inference speeds from 2.4 to 60 FPS on hardware ranging from edge devices to GPU workstations. To bridge the gap between research and practice, and to ensure that the results are reusable and reproducible, the framework and model weights have been released as an open-source, containerized module (socib-rip-currents-detection), providing the coastal safety community with a scalable, ready-to-use, and standardized tool for continuous, automated rip current monitoring.
Status: open (until 13 Jun 2026)
- RC1: 'Comment on egusphere-2026-1138: good idea but needs significant improvements', Anonymous Referee #1, 17 Apr 2026
- AC1: 'Reply on RC1', Jesús Soriano-González, 15 May 2026
We thank the reviewer for the highly constructive feedback and meticulous review of our manuscript. The attention to detail, highlighting important methodological nuances and even taking the extra time to consult an additional domain expert, is highly valuable. We believe that the reviewer’s comments and corrections will significantly strengthen both the scientific rigor and the practical grounding of this paper. Below, we provide a detailed, point-by-point response to each of the comments.
- 1. Replacement of Custom Evaluation Metric with Standardized F2 Score
The rip_fit metric was initially introduced during the early experimental phases of this project, and we originally chose to include it in the manuscript for continuity. However, we completely agree with the reviewer’s assessment that introducing a custom metric risks fragmenting the literature. Consequently, we propose entirely removing the rip_fit metric from the manuscript to avoid unnecessary complexity and noise, and to align our methodology with community standards. We have recalculated our results using the F2 score, and all relevant text, tables, and figures will be updated accordingly in the revised document. To facilitate the review, we have attached the revised figures to this response. As these updated results demonstrate, adopting the F2 score does not alter the main conclusions of the model benchmarking.
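For readers less familiar with the metric, the F2 score is the beta = 2 case of the F-beta measure, which weights recall more heavily than precision (appropriate when a missed rip current is costlier than a false alarm). The following is a minimal, illustrative sketch; it is not part of the manuscript or the released code:

```python
# Illustrative F-beta computation; beta = 2 weights recall twice as heavily
# as precision, which is why the F2 score suits hazard-detection settings.
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# With the cross-validated test-set means reported under comment 4 below:
# f_beta(0.875, 0.841) evaluates to approximately 0.848
```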
- 2. Clarification on Dataset Accessibility and Justification of Novelty Claims
We would like to clarify that the CnE UFSC (2023) dataset remains publicly available. The URL appeared invalid in the preprint due to a PDF line-break formatting error. The correct, unbroken link is: https://universe.roboflow.com/cne-ufsc/rip-current-coastsnap-mocambique-santinho .
Regarding the size and diversity of the dataset, we agree that, judged solely by frame count, our dataset is modest. Our claims of "diversity" and "novelty" were based on two main factors: (i) the extension of prior efforts through the addition of more imaging platforms, viewing geometries, and low-energy Mediterranean beaches; and (ii) OBB re-annotation: even the images curated from previous standard datasets were re-annotated using OBBs, providing a dataset designed specifically for oriented detection. We have revised the text in both sections to be more precise. The proposed text explicitly states that the dataset’s main or differential contribution lies in the OBB annotations rather than its absolute size, and we have ensured that prior works are accurately contextualized. We propose modifying the first paragraph of the discussion (line 282) to read:
"The primary objective of this study was to advance the automation of rip current detection using beach imaging systems by delivering a platform-agnostic tool that integrates novel computer vision approaches. While previous studies have demonstrated the feasibility of deep learning for this task (de Silva et al., 2021; Zhu et al., 2022; Rashid et al., 2023), they have often relied on axis-aligned annotations and aerial imaging, which are difficult to obtain for near-real time applications. By curating RipAID v2.0.0 (Soriano-González et al., 2026), an open-access dataset enriched with multi-platform imagery and oblique viewing angles, we aim to provide a new perspective, using OBB to annotate rip currents. Beyond varied viewpoints and image geometries, the dataset also includes images from low-energy coastal environments in the Mediterranean, where rips often exhibit lower intensity and less visible structure, thereby broadening the variety of scenarios represented."
The revised Conclusion (line 287) will state:
"A central part of this work involves the development of RipAID v2.0.0, an updated dataset that builds upon previous efforts through the incorporation of OBB annotations. By reprocessing imagery from previous studies alongside new data, this version provides a multi-platform representation of rip current morphodynamics and aims to address specific camera angle limitations, while improving model adaptability to varied coastal environments."
- 3. Annotation Procedure and Justification of Figure 8c
(i) We agree with the reviewer that the annotation procedure required further elaboration in the text, and while we did not calculate a formal inter-annotator agreement coefficient (which we acknowledge as a limitation to be addressed in future iterations of the dataset), we did implement a multi-tiered review process to mitigate individual bias and uncertainty. The process was conducted by three researchers: two performed the initial annotations, while the third independently reviewed the entire dataset to ensure consistent criteria. To clarify the annotation procedure for the readers, we propose adding the following paragraph to the Training Dataset section:
"Annotations were performed following a predefined protocol: (1) OBBs must encompass both lateral boundaries of the rip current whenever possible; (2) rip currents located at the edge of the image are labeled if clearly identifiable; and (3) features exhibiting high uncertainty are labeled as ‘doubt’. Since rip currents are frequently misidentified (Pitman et al., 2021), ensuring annotation quality is inherently challenging. To minimize labeling errors and individual biases, the annotation process was conducted by three researchers: two performed the initial labeling, while a third independently reviewed the entire dataset to mitigate criteria heterogeneity and correct potential mistakes. While a formal inter-annotator agreement coefficient was not calculated for this iteration, this multi-tiered review process helped to establish a more homogeneous baseline. Further details regarding the annotation criteria can be found in the RipAID dataset repository (Soriano-González et al., 2026)."
(ii) Regarding Figure 8c, the human annotation on the left is indeed incomplete, while the YOLO detections on the right accurately identify the true rip currents. We deliberately selected and retained this specific image to illustrate this exact phenomenon. Because human annotators struggle to perfectly identify all rip currents (as noted by the literature cited by the reviewer), the dataset contains unavoidable human errors. However, Figure 8c demonstrates that the model has generalized the visual features of rip currents well, so it can detect hazards that the human annotator missed. We refer to this in the text as the "Ground Truth Paradox":
"This capacity of the model to occasionally outperform the human annotators highlights a ‘ground truth paradox’ (Plank, 2022). For instance, Figure 8c presents a scenario where the model detected two additional rip currents alongside the labeled instance; post-hoc visual inspection suggests these detections are likely valid."
While future versions of the RipAID dataset will correct these specific annotation errors, we believe it is important to keep Figure 8c as it is in this manuscript. It provides evidence that AI-assisted detection can help mitigate human limitations/mistakes, as outlined in lines 371-373 and 404-407 of the document.
- 4. Clarification on Data Splitting and Correction of K-Fold Cross-Validation Methodology
(i) Clarification on data splitting: We can confirm that the data partitioning was strictly performed at the image level, not the instance level. We mentioned the instance distribution in the original text simply to demonstrate that despite splitting by images, the resulting folds maintained a balanced ratio of target features.
(ii) Correction of the K-Fold Methodology: Following the reviewer's recommendation, we have completely re-run the 10-fold cross-validation, isolating a fixed, independent test set (10% of the images) and performing the 10-fold split exclusively on the remaining 90% of the data. We have updated the corresponding figure and numbers (attached to this response). Accordingly, we will rewrite the 2.4.5 cross-validation methodology section in the manuscript to read:
“In order to obtain a more reliable estimate of model performance and stability, a 10-fold cross-validation procedure was implemented using the optimized model configuration. Data partitioning was performed at the image level. First, 10% of the total images were isolated as a fixed, independent test set, which remained unseen during the training and validation processes. The remaining 90% of the dataset was then partitioned into ten subsets (i.e., folds). The model was trained and evaluated ten times; in each iteration, one fold was used for validation while the remaining nine folds were used for training. While the data splits were image-based, the distribution of rip current instances remained highly consistent (Figure 2). Furthermore, background images lacking annotations comprised ~16% of the data across all splits, providing the model with negative examples to help minimize false detections. This methodology ensured that each run was executed using a balanced and representative data sample.”
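For concreteness, the corrected splitting scheme can be expressed as a short scikit-learn sketch. Only the image-level splitting, the 10% test fraction, and the ten folds come from the revised text; the function and variable names are illustrative and are not part of the released code:

```python
# Sketch of the corrected protocol: a fixed, image-level test set is held out
# first, and 10-fold cross-validation is performed only on the remaining images.
from sklearn.model_selection import KFold, train_test_split

def make_splits(image_paths, test_fraction=0.10, n_splits=10, seed=42):
    # 1) Isolate a fixed, independent test set at the image level.
    trainval, test = train_test_split(
        image_paths, test_size=test_fraction, random_state=seed
    )
    # 2) Partition the remaining 90% into ten folds; each fold serves as
    #    the validation set exactly once.
    folds = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(trainval):
        folds.append(([trainval[i] for i in train_idx],
                      [trainval[i] for i in val_idx]))
    # 3) Each of the ten trained models is evaluated once on `test`, and the
    #    mean and standard deviation across runs are reported.
    return folds, test
```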
Updated cross-validation results on the test set:

| | Precision | Recall | mAP50 | mAP50-95 | F2 |
|---|---|---|---|---|---|
| mean ± sd | 0.875±0.01 | 0.841±0.01 | 0.921±0.00 | 0.669±0.00 | 0.848±0.01 |

- 5. Clarification on Single-Class Training and Inclusion of Background Images
We completely agree with the reviewer’s assessment: training a model to distinguish between "rip" and "doubt" as separate classes would cause severe class confusion due to their visual overlap; and introducing negative samples is critical to reduce false positives.
(i) Clarification on the 3-Class Training: We did not train the model to predict three separate classes. In all of our experiments, the models were trained using a single, unified "rip current" class. The "rip", "sediment", and "doubt" tags exist strictly as metadata within the open-source dataset. We included these tags to give future users and researchers the flexibility to filter the dataset for more permissive or conservative training regimens. For example, in our own experiments, the difference between the "permissive" and "conservative" configurations simply dictated whether the "doubt" and "sediment" annotations were merged into the single unified target class, or if they were ignored and treated as background. To prevent this misunderstanding for future readers, we propose adding clarifying paragraphs to the pertinent sections (detailed below).
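As an illustration of this mapping, the difference between the two configurations reduces to a simple tag filter. The tag names follow the dataset metadata; the function itself is a sketch rather than part of the released code:

```python
# Illustrative mapping of RipAID metadata tags onto the single training target.
# Permissive: 'doubt' and 'sediment' annotations are merged into the unified
# 'rip current' class. Conservative: they are ignored and treated as background.
from typing import Optional

RIP_CLASS = 0  # the single unified "rip current" class used in all experiments

def to_training_class(tag: str, permissive: bool) -> Optional[int]:
    if tag == "rip":
        return RIP_CLASS
    if tag in ("doubt", "sediment"):
        return RIP_CLASS if permissive else None  # None -> annotation dropped
    return None
```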
(ii) Inclusion of background Images (without rip currents): As detailed in our response to Comment 5 (and in lines 110-111 of the original manuscript), our dataset and training splits do, in fact, include these. Approximately 16% of the images in every data split are pure background scenes containing no annotations. We have now made this explicit in the 2.4.5 methodology section, as noted in our previous response.
Proposed addition for the revised manuscript (2.1 RipAID training dataset):
“This three-label classification ('rip', 'sediment', 'doubt') is provided strictly as dataset metadata to grant future users the flexibility to apply permissive or conservative filtering. Training models to predict these classes separately is discouraged due to their high visual and conceptual similarity, which would likely induce severe class confusion”.
Proposed addition for the revised manuscript (2.4.3 Annotation class configuration section):
“The models were never trained to predict multiple distinct classes; instead, all selected annotations were mapped into a single unified "rip current" class”.
- 6. Mitigation of Deployment Risks and Clarification of Module Intent
Our primary goal in packaging this detector as an open-source module was to bridge the gap between academic research and deployment by providing an operative framework. However, as stated in our original manuscript, the system is explicitly designed to function as an advisory support tool—not as an autonomous "absolute truth" or a replacement for human judgment. We noted in the original text:
"Ultimately, its greatest potential lies in the adoption of the module within coastal safety frameworks as a support tool for lifeguards, emergency services, or applications that bridge beach monitoring cameras with end-users. In this role, the system could serve to reduce rip current risk by drawing attention to potential hazards, maximizing coverage while retaining human verification to ensure reliability."
However, we recognize that releasing the module inherently carries the risks highlighted by the reviewer. To ensure we are promoting responsible deployment, we have reviewed and tempered the language in both the Discussion and Conclusion sections. We will add explicit disclaimers emphasizing that this module represents a "first step" baseline requiring local calibration and human-in-the-loop verification, rather than a definitive, standalone safety guarantee.
Proposed text modification (line 382-384) to the Discussion:
“Despite these promising results, it is imperative to approach the operational deployment of this module with caution. In high-stakes domains, full automation is often not desirable due to safety, ethical, and legal concerns (Lai et al., 2023). Given the inherent complexities of coastal morphodynamics and the current volume of training data, the delivered system is not intended to function as an autonomous, standalone decision-making tool. Rather, it is designed strictly as a 'human-in-the-loop' advisory system. Its purpose is to flag potential hazards for human verification, rather than replacing human oversight. In this context its greatest potential lies in the adoption of the module within coastal safety frameworks as a support tool for lifeguards, emergency services, or applications that bridge beach monitoring cameras with end-users while retaining human verification to ensure reliability."
Proposed adjustment to the Conclusion (line 401):
"By releasing the complete codebase and pre-trained weights, we provide a foundational framework to democratize access to advanced safety tools. However, we emphasize that this module represents an initial baseline rather than a definitive fit-for-all solution. Safe operational deployment requires rigorous local benchmarking and strict integration into existing coastal safety protocols where human verification remains the ultimate authority."
- 7. Comparison with Existing Benchmark Datasets and Justification of Dataset Scale
We agree that a more explicit comparison with existing dataset-benchmark papers would strengthen the manuscript. However, direct comparison of model performance metrics is challenging due to fundamental differences in annotation types (axis-aligned bounding boxes, oriented bounding boxes, and segmentation masks) and evaluation protocols across datasets. To the best of our knowledge, no prior rip current dataset employs oriented bounding box annotations, further limiting direct comparison. However, to provide a more detailed introduction to the current dataset-benchmark papers, we propose the following changes:
Proposed addition to the Introduction (line 84):
“Several prior studies have also published their training datasets alongside their results, providing a basis for comparison. Maryan et al. (2019) constructed a dataset of 514 rip channel instances from time-averaged aerial imagery. De Silva et al. (2021) released a dataset of 2,440 images, the majority consisting of top-down Google Earth imagery, with 700 background images, and supplemented the test split with 23 video sequences. Building on this, Zhu et al. (2022) incorporated 1,352 additional real beach scene photographs (746 containing rip currents and 606 without), yielding a combined dataset of 3,792 images. All of these datasets employ axis-aligned bounding box annotations. More recently, Dumitriu et al. (2025) released a large-scale, multi-angle benchmark comprising 184 videos with 212,328 annotated frames (163,528 containing rip currents), using pixel-level segmentation masks.”
Proposed modification to the Discussion (line 355):
“Direct comparison between the results of the proposed model and existing state-of-the-art rip current detection models is challenging due to fundamental differences in annotation types and evaluation protocols across datasets. To the best of our knowledge, no prior rip current dataset employs oriented bounding box annotations, further limiting direct comparison. In terms of dataset scale, the RipAID v2.0.0 includes 6,789 training images, 2,815 of which are new images from 8 different camera angles of low-energy Mediterranean beaches, and 3,974 reprocessed images from previous datasets. The current version of RipAID includes more rip current instances than previously reported image-based datasets (Maryan et al., 2019, Zhu et al., 2022), but fewer than video-based datasets (De Silva et al., 2021, Dumitriu et al., 2025). However, by sampling independent images (at minimum one-hour intervals), we mitigate the temporal correlation and information redundancy that typically exist between consecutive frames in large video-based datasets.
Our model generalizes to the oblique viewing angles typical of shore-based cameras, contributing to address previous concerns in the literature (de Silva et al., 2021; Zhu et al., 2022; Rampal et al., 2022; Rashid et al., 2023), and aligning with most recent developments in the field (Khan et al., 2025a, Dumitriu et al., 2025). Furthermore, by opting for OBBs, the system captures orientation effectively without the cost of pixel-level masking as in instance segmentation approaches (Dumitriu et al., 2025), easing scalability for future dataset generation.”
- 8. Revision of "Advanced Modalities" Terminology
We agree that the term used is inaccurate. We propose rephrasing as:
"Most recently, research has focused on increasing precision through dense mask annotation."- 9. Addressing General Comments: Target Audience, Level of Detail, and Adjustment of Claims
We sincerely thank the reviewer for this comprehensive and constructive overview of our manuscript. We align with the framing of this work as a "valuable checkpoint" rather than a final result; this captures our true intent, and we have adjusted the tone of the manuscript to better reflect this reality.
Regarding the omission of the annotation procedure, we have addressed this by including the detailed protocol in the text, as outlined in our response to previous comments.
Regarding the inclusion of basic deep learning concepts (such as explaining True Positives, False Positives, etc.): this was a deliberate choice tailored to the wide readership we expect from NHESS (including marine researchers, geomorphologists, and coastal emergency managers). Because a core contribution of this paper is the release of an open-source, Dockerized inference module meant for public use, we felt it was important to ensure the evaluation metrics were accessible to non-experts in machine learning.
We acknowledge the reviewer’s critique that the manuscript aimed for claims that were too large for the current volume of data and the stage of the research. Our core novelty lies in taking the first steps toward operationalization, providing the first OBB dataset, open weights, and a containerized API that the community can use and expand upon over time. To ensure the manuscript reflects this "checkpoint" reality and avoids overstating our achievements, we have reviewed the Discussion and Conclusion sections and removed or toned down any overly strong claims. Beyond the changes already addressed in previous comments along these lines, further modifications include: (i) removing the phrase "...establishing a new baseline for operational readiness" (line 375) and simply stating that the work "provides an open-source foundational framework for future operational development"; and (ii) modifying the phrase "...designed for immediate integration into multi-platform observation networks" (line 376) by removing "immediate", acknowledging that while the Dockerized packaging makes the module available, actual integration into external operational networks might require site-specific calibration.
We believe these revisions, prompted by the reviewer’s valuable feedback, will result in a more scientifically rigorous, prudent, and well-grounded manuscript. We sincerely thank the reviewer again for the time, expertise, and guidance in improving this work.
General comments:
The paper introduces a fair idea, arguing that OBBs include less background noise than axis-aligned bounding boxes, with some loss in accuracy but at a fraction of the cost of segmentation. It introduces 6,789 images with rip currents annotated with oriented bounding boxes.
Specific comments:
The paper's main argument, that rip current detection benefits from OBBs, is a reasonable one. However, there are several strong critiques that need to be addressed:
- the evaluation is done on classic detection metrics and a custom metric, rip_fit. While this is a reasonable approach, as per their own introduction, rip currents are a leading cause of drownings globally. For such scenarios, where a missed rip current (a false negative) is costly, the F2 is a standardized metric that can be used (alongside the currently used ones). Introducing a custom metric risks fragmenting the research even more, with minimal benefits, if any;
- the dataset, with 6,789 images, is rather modest in size. One of the cited sources is a Roboflow link which is no longer available (CnE UFSC (2023)) and another source, RipScout (2025), claims 2,555 images from 73 videos; upon closer look, RipScout introduced 2,555 images from 8 videos, with highly repetitive patterns, alongside 1,767 images from de Silva (2021); this raises a doubt over the claimed source and diversity of the dataset, especially when compared to de Silva (2021). De Silva (2021) trained on 2,440 images (1,767 with rip currents and 673 without) and tested on 23 videos. Dumitriu (2025) introduced 150 rip current videos with 15,784 annotated frames, alongside 34 videos with no rip currents. The conclusion states that "a key contribution is the development [...] a novel dataset that significantly expands the diversity of the available training data [...]". Considering that more than 1/3 of the data is from previous papers and another source is no longer available to analyze, this is questionable;
- there is little discussion on the annotation procedure. Subsequent studies (Ballantyne et al. (2005), Sherker et al. (2010), Caldwell et al. (2013), Brannstrom et al. (2014), Sotes et al. (2018), Pitman et al. (2021)) have found that while many people are confident that they can correctly identify rip currents, most of them cannot do so. While the surveyed people were not experts, such a process needs to be done thoroughly and documented. If multiple annotators were used, an inter-annotator agreement coefficient is also recommended (such as Cohen's kappa). For example, a rip current expert that we asked said that, at first glance, Figure 8.c is incorrectly annotated. The YOLO detections on the right appear to be correct, while the annotation on the left is not clear. There are other instances from that camera that do have only a rip current on the left, but these seem to be at a different time and date and do not necessarily generalize (based on simple visualization);
- 10-fold is indeed useful in scenarios with this volume of data. It is not clear, however, how it was done. The paper introduces 6,789 images with 10,131 annotations (I am assuming rip current instances, of which there can be more than one per image) and proceeds to do the 10-fold split on the rip currents and not on the images. It is both unclear how and why the authors have proceeded to do so;
- the way the authors approached k-fold cross-validation is incorrect. Please check https://scikit-learn.org/stable/modules/cross_validation.html for details. 10-fold should be done only on train and validation with separate and fixed test data. The way the authors did it, the test data is unreliable and more akin to validation data. A correct way to do it is to do an initial split between (train+val) and test, and keep test fixed. Then you can do a k-fold split in order to find the best combination of train and val splits; once done so, you evaluate only once on the test data and this becomes your final result. Modifying the training based on this result negates the value of the test data;
- splitting into 3 classes, "rip", "sediment" and "doubt", is risky from a training point of view. The doubt and rip classes can be visually similar; the model understands "doubt" as a completely different class and thus may be confused by the visual similarities. I suggest an approach based on clear rip current classes (in this case, sediment is a large enough distribution to make sense) and judging "doubt" or not based on model confidence, not as a separate class on its own. Especially since the annotations themselves are not discussed and can be questionable. This both confuses the model and risks introducing the strong bias of the annotators. They did compare 3 classes vs 1 class (line 317), which is reasonable, but in order to make an argument for the doubt class, it should be isolated in an experiment with 3 classes vs 2 classes and the results should be evaluated by a rip current expert. The quantitative results are not enough in this context: dataset too small, no null (no rip) images in training and evaluation, class imbalance, annotation biases, etc. They should also account for the class imbalance of the 3 classes and maybe use a different loss (such as weighted cross-entropy);
- training and evaluation in such cases need to also consider images without rip currents, similarly to previous rip current papers (deSilva (2021), Zhu (2023), Dumitriu (2025)) and to most papers in the object detection field;
- considering the low volume of data and the several questionable aspects mentioned above, packaging a detector in "socib-rip-currents-detection", while a noble and arguably required endeavor, poses multiple risks. Unknowing and non-technical people can use and rely on it with potentially grave consequences. A rip current detector, while extremely useful if done correctly, should be approached and published with meticulous detail, testing and care;
- the authors also do not compare directly (enough) to other dataset-benchmark papers. They do mention some of them, but a direct comparison (numbers, annotations, results, etc.) is warranted;
- the paper claims that recent advancements use "advanced modalities" (line 80). That is incorrect - all of the current papers use RGB images / videos. Advanced modalities, in the context of computer vision, imply non-traditional data sources beyond 2D RGB imaging;
- the paper is long and detailed, containing a bit of everything. In some places, the detailed approach is well done, while in others it either doesn't mention relevant information (such as the aforementioned annotation procedure) or goes into too much detail on basic information (such as explaining what TP, FP, etc. are). The paper should focus more on the novelty, the contribution and the results;
Overall, the paper starts researching in a direction with considerable potential, but aims for claims that are too big for the data, research and effort that was put into it. I recommend this be treated as a valuable checkpoint in the direction of such claims and targets, not as the final result.