Community-scale urban flood monitoring through fusion of time-lapse imagery, terrestrial lidar, and remote sensing data
Abstract. High-frequency flood events in urban areas pose significant cumulative hazards. These floods are often difficult to detect and monitor using existing infrastructure, making the development of alternative approaches critical. This study presents the implementation of a computer vision-based urban flood monitoring network deployed in Cahokia Heights, Illinois, USA. Flood observations were collected at 30-minute intervals using consumer-grade trail cameras. Water surface elevations were estimated from the intersection of segmented flood masks with 2D-projected terrestrial lidar data. Flood extents and depths were extrapolated using a terrain depression-filling algorithm. Camera-derived peak flood extents and depths were compared to independent predictions from a 2D HEC-RAS Rain-on-Grid flood model. This procedure was applied to two flood events, one moderate and one severe, using imagery from two camera sites. For the severe event, water level estimates agreed closely between cameras, with a median difference of less than 3 cm and a peak difference of less than 2 cm. For the moderate event, differences were larger (median <10 cm, peak <16 cm). Agreement between modeled and camera-derived peak flood extents exceeded 90 % for the severe event but ranged between 21 % and 42 % for the moderate event. We use the convergence and divergence of independent camera observations to infer differences in spatiotemporal flood connectivity, disconnected in the moderate event and connected in the severe one. This study demonstrates the utility of low-cost, camera-based systems for high-resolution monitoring of flood dynamics in complex urban environments and highlights their potential integration with hydrodynamic modeling.
This is a well-written and methodologically solid paper addressing an important and timely topic—urban flooding. The study effectively builds upon previous efforts, particularly Erfani et al. and Eltner et al., and integrates their insights into a novel framework. The authors demonstrate a strong grasp of both the hydrologic and vision-based aspects of flood monitoring, making the work a valuable contribution to the field. Below, I offer a few comments and questions that may help strengthen the manuscript.
“While aerial lidar offers broad spatial coverage, it does not resolve fine-scale topographic features such as street curbs or shallow depressions common in urban environments” (Dale et al., 2025, p. 7)
Why did you use aerial lidar in the first place? If it was not used directly in your workflow, you might consider omitting it to avoid confusion.
“This approach relies on annotated point prompts that indicate the presence or absence of flooding at individual pixels within a reference image.” (Dale et al., 2025, p. 9)
“For a given flood event, the earliest image in which flooding was visible was annotated with three to five positive point prompts. These prompts were then used to segment the remaining image sequence.” (Dale et al., 2025, p. 9)
“The visual confirmation of flooding was used to iteratively refine the segmentation, with additional positive prompts added to correct for false negatives (i.e., flooded areas classified as non-flooded), and negative prompts added to address false positives (i.e., non-flooded areas misclassified as flooded)” (Dale et al., 2025, p. 9)
I understand that machine learning is not the main focus of this study—it primarily serves as a tool to extract information from 2D imagery. However, given that previous studies have already addressed similar challenges, it might have been advantageous to employ some of those established methods directly. Although the amount of manual annotation here is reduced, it still represents a bottleneck to achieving full automation.
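For concreteness, below is a minimal sketch of the point-prompt workflow quoted above, assuming the SAM2ImagePredictor interface from the public sam2 repository (set_image, then predict with point_coords and point_labels). The model identifier, image file name, and prompt coordinates are placeholders, and the exact loading call may differ from what the authors used.

```python
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder model identifier; the authors' checkpoint/config may differ.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

# Earliest image in which flooding is visible (placeholder file name).
image = np.array(Image.open("flood_frame_0001.jpg").convert("RGB"))
predictor.set_image(image)

# Three positive prompts on visibly flooded pixels (label 1); negative
# prompts (label 0) would be appended here when refining false positives.
point_coords = np.array([[640, 820], [710, 790], [580, 845]])  # (u, v) pixels
point_labels = np.array([1, 1, 1])

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
flood_mask = masks[0].astype(bool)  # binary flood mask for this frame
```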
“The extrinsic camera pose matrix, P, was estimated based on a set of matched reference features with known locations in both image coordinates (u, v), and world coordinates (X, Y, Z). This process, known as the Perspective-n-Point (PnP) problem, yields an estimated camera pose denoted as P_PnP. Feature matching was performed manually, with image coordinates of reference features labeled in ImageJ (Schindelin et al., 2012) and their corresponding world coordinates annotated from the terrestrial lidar point cloud using CloudCompare (CloudCompare, 2023). In the absence of permanent ground control points, static scene elements such as rooftops, fence posts, and utility poles were used as reference features. Between 20 and 30 such features were labeled for each camera. Point precision was limited by image resolution, point cloud noise, and the spatial resolution of the lidar scan.” (Dale et al., 2025, p. 10)
In this section, the methodology appears somewhat behind the state of the art. As mentioned earlier, even though these technical components might seem peripheral, exploring ways to automate them is crucial for advancing toward operational applications of such frameworks.
Also, how many times did the authors perform this procedure? Assuming the camera locations are fixed, it seems unnecessary to repeat it multiple times—unless the cameras were moved between events.
“A separate camera pose estimate was computed for each camera and flood event. For the moderate May 14 flood, Camera A’s pose was calculated using 18 reference features, yielding a median reprojection error of 6.83 pixels. The recovered camera location was offset 46 cm from the labeled camera center in the point cloud. For the July 4 event, pose estimation at Camera A used 24 features, resulting in a median reprojection error of 23.6 pixels and a reduced camera position offset to 6 cm.” (Dale et al., 2025, p. 11)
This part is a bit confusing. Could the authors clarify why the July event, despite using more reference features, yields a higher reprojection error in image space but a smaller offset in 3D space? The 46 cm position offset for the May event seems quite large and could significantly affect flood mapping accuracy, potentially shifting the mapped flood boundary by a comparable amount. Did the authors examine how this pose uncertainty propagates into the flood extent and depth estimates?
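To make the relationship between the two reported error measures concrete, here is a purely synthetic sketch assuming OpenCV's solvePnP: it generates fake correspondences from a known pose, recovers the pose, and reports the median reprojection error and the offset of the recovered camera centre (C = -R^T t). The intrinsics, pose, and noise level are invented and do not reflect the authors' setup.

```python
import cv2
import numpy as np

# Synthetic stand-ins for the labelled correspondences: world_pts would come
# from the terrestrial lidar point cloud, image_pts from ImageJ labelling.
rng = np.random.default_rng(0)
world_pts = rng.uniform([-20, -20, 0], [20, 20, 10], size=(24, 3))
K = np.array([[1400.0, 0, 960], [0, 1400.0, 540], [0, 0, 1]])  # placeholder intrinsics
dist = np.zeros(5)
rvec_true = np.array([0.1, -0.2, 0.05])
tvec_true = np.array([[0.5], [1.0], [30.0]])
image_pts, _ = cv2.projectPoints(world_pts, rvec_true, tvec_true, K, dist)
image_pts = image_pts.reshape(-1, 2) + rng.normal(0, 1.0, (24, 2))  # pixel noise

# Pose from PnP
ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)

# Median reprojection error in pixels
proj, _ = cv2.projectPoints(world_pts, rvec, tvec, K, dist)
print("median reprojection error [px]:",
      np.median(np.linalg.norm(proj.reshape(-1, 2) - image_pts, axis=1)))

# Recovered camera centre C = -R^T t and its offset from the true centre
R, _ = cv2.Rodrigues(rvec)
C = (-R.T @ tvec).ravel()
R_true, _ = cv2.Rodrigues(rvec_true)
C_true = (-R_true.T @ tvec_true).ravel()
print("camera centre offset [m]:", np.linalg.norm(C - C_true))
```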
“Flood extent estimation is based on the intersection of lidar-derived topography and image-derived water classifications. Using the established projection pipeline in Equation 2, each point in the terrestrial lidar point cloud is mapped to a corresponding image pixel. If a pixel is identified as flooded in the SAM2-derived binary segmentation mask, the associated terrestrial lidar point is classified as inundated.” (Dale et al., 2025, p. 11)
How was this implemented? Since multiple 3D points may project onto a single image pixel, how did the authors handle indexing or correspondence between flooded pixels and their associated 3D points?
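To make the question concrete, one way the many-to-one correspondence could be handled is sketched below: group lidar points by their integer pixel index and, optionally, keep only the point nearest the camera per pixel (a simple z-buffer) so that occluded points are not labelled from pixels in which they cannot actually be seen. All argument names are placeholders, and the z-buffer is my suggestion, not necessarily what the authors did.

```python
import numpy as np
import cv2

def inundated_lidar_points(points_world, flood_mask, rvec, tvec, K, dist):
    """Flag lidar points whose projected pixel is classified as flooded.

    Handles the many-to-one projection explicitly: all points mapping to a
    flooded pixel are flagged, and a simple z-buffer additionally keeps only
    the point nearest the camera per pixel so occluded points can be excluded.
    """
    h, w = flood_mask.shape
    proj, _ = cv2.projectPoints(points_world, rvec, tvec, K, dist)
    uv = np.round(proj.reshape(-1, 2)).astype(int)

    # Keep only points that project inside the image frame
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    idx = np.flatnonzero(inside)
    u, v = uv[idx, 0], uv[idx, 1]

    # All in-view points landing on a flooded pixel
    inundated = np.zeros(len(points_world), dtype=bool)
    inundated[idx] = flood_mask[v, u]

    # Z-buffer: nearest point per pixel (camera-frame depth)
    R, _ = cv2.Rodrigues(rvec)
    depth = (R @ points_world[idx].T + tvec.reshape(3, 1)).T[:, 2]
    order = np.argsort(depth)                        # nearest first
    _, first = np.unique(v[order] * w + u[order], return_index=True)
    visible = np.zeros(len(points_world), dtype=bool)
    visible[idx[order[first]]] = True                # one nearest point per pixel

    return inundated, inundated & visible
```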
“To estimate water surface elevation (WSE), the highest elevations along the boundary of the inundated zone are used as a proxy for the maximum water level and the water surface is assumed to be flat. Edge pixels are extracted using a Canny Edge Detection filter, and the 90th and 95th percentiles of the extracted edge elevation distribution are used to represent a range of possible water surface levels (WSE90 and WSE95) to account for potential topographic noise or obstruction of the water edge in the time lapse images.” (Dale et al., 2025, p. 11)
This appears to be the core contribution of the paper and would benefit from more detailed elaboration; the rest of the workflow closely follows prior studies. Based on Figure 1 (Dale et al., 2025, p. 6), I initially thought the authors were using a hypsometric curve approach. It would be helpful to explain how these curves are used and how they relate to the conceptual model applied later in the iterative flood-fill procedure at 0.5 m resolution (Wu et al., 2018; Samela et al., 2020).
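For reference, my reading of the quoted WSE procedure and of a connected fill at the estimated level is sketched below, assuming that Canny is applied to the binary flood mask and that the fill keeps only DEM cells below the WSE that connect to the observed wet area. Function and variable names are hypothetical, and this is my interpretation rather than the authors' implementation.

```python
import numpy as np
import cv2
from scipy import ndimage

def wse_from_mask(flood_mask, point_pixels, point_z):
    """WSE90/WSE95 from elevations of lidar points on the flood-mask boundary.

    point_pixels: (N, 2) integer (u, v) pixel indices of projected lidar points
    point_z:      (N,) elevations of those points
    """
    edges = cv2.Canny(flood_mask.astype(np.uint8) * 255, 100, 200) > 0
    on_edge = edges[point_pixels[:, 1], point_pixels[:, 0]]
    edge_z = point_z[on_edge]
    return np.percentile(edge_z, 90), np.percentile(edge_z, 95)

def fill_extent(dem, seed_mask, wse):
    """Cells of the (e.g., 0.5 m) DEM below WSE and connected to observed wet cells."""
    below = dem <= wse
    labels, _ = ndimage.label(below)
    wet_labels = np.unique(labels[seed_mask & below])
    wet_labels = wet_labels[wet_labels > 0]
    extent = np.isin(labels, wet_labels)
    depth = np.where(extent, wse - dem, 0.0)
    return extent, depth
```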
“The area of interest for the flood-fill implementation focused on the direct area spanning the two camera locations, approximately 500 m by 250 m, to avoid propagation into unobservable areas.” (Dale et al., 2025, p. 11)
This aspect could also be an interesting avenue for future research—for example, using a location-allocation optimization approach to minimize the number of cameras while maximizing the coverage area.
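As a toy illustration of the location-allocation idea, a greedy maximum-coverage selection over precomputed camera viewsheds could look like the sketch below; the candidate sites and visibility masks are hypothetical inputs.

```python
import numpy as np

def greedy_camera_selection(viewsheds, n_cameras):
    """viewsheds: (n_sites, n_cells) boolean visibility matrix over DEM cells."""
    covered = np.zeros(viewsheds.shape[1], dtype=bool)
    chosen = []
    for _ in range(n_cameras):
        gains = (viewsheds & ~covered).sum(axis=1)  # new cells each site would add
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break  # no remaining site adds coverage
        chosen.append(best)
        covered |= viewsheds[best]
    return chosen, covered.mean()  # selected sites, fraction of cells covered
```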
“Although image data informed general model development, no direct calibration against the imagery was performed.” (Dale et al., 2025, p. 12)
This raises an interesting question: if sparse information extracted from cameras were available, how could such data be assimilated into flood models to refine their outputs? Could this be implemented in real time?
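As a minimal illustration of the direction I have in mind, a simple nudging (Newtonian relaxation) update could pull the modelled water surface toward a camera-derived WSE within that camera's footprint; this is only a sketch, and an operational scheme would need a proper observation error model (e.g., an ensemble Kalman filter).

```python
import numpy as np

def nudge_wse(model_wse, obs_wse, footprint, gain=0.5):
    """Relax modelled WSE toward a scalar camera observation inside its footprint.

    model_wse: 2D array of modelled water surface elevations
    obs_wse:   camera-derived water surface elevation (scalar)
    footprint: boolean mask of cells observed by the camera
    gain:      relaxation factor in (0, 1]; a stand-in for an error-weighted gain
    """
    updated = model_wse.copy()
    updated[footprint] += gain * (obs_wse - model_wse[footprint])
    return updated
```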
“Our comparison focuses on quantifying the relative agreement in predicted flood extent between the two methods. The primary metric focuses on identifying regions where both the model and camera-based approaches indicate flooding – areas of mutual agreement in predicted inundation. This shared extent is expressed as F_overlap, the ratio of the number of pixels classified as flooded by both methods to the total number of pixels classified as flooded by either. The model domain includes areas separated from our camera sites by major roads and drainage canals. To provide a meaningful comparison between model output and our image-based methods, we spatially restricted our comparison to a region with the approximate bounds of the topographic depression containing the study neighborhood. Where flood extents overlap, we also compared modeled and observed water surface elevations and flood depths.” (Dale et al., 2025, p. 12)
This section feels somewhat unconventional and could benefit from clarification. If I were the authors, I would consider treating the HEC-RAS output as the reference (or ground truth) and evaluating the vision-based estimates using standard metrics such as a confusion matrix. This would make the comparison more transparent and interpretable. It would also help highlight that the vision framework is not isolated—the overall performance reflects both the errors of the camera-based method (which provides boundary and initial conditions) and those of the conceptual flood model. A more detailed characterization of each component’s contribution would strengthen the paper considerably.
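For example, with both extents rasterized onto a common grid and the HEC-RAS output treated as the reference, the standard contingency-table scores could be reported alongside F_overlap, which as defined in the manuscript is the intersection-over-union (equivalently, the critical success index). A minimal sketch:

```python
import numpy as np

def flood_agreement(camera, model):
    """camera, model: boolean rasters of flooded cells on the same grid.

    Assumes at least one flooded cell in each raster (no zero-division guard).
    """
    hits = np.sum(camera & model)           # flooded in both (true positives)
    false_alarms = np.sum(camera & ~model)  # camera-only
    misses = np.sum(~camera & model)        # model-only
    return {
        "POD": hits / (hits + misses),                # probability of detection
        "FAR": false_alarms / (hits + false_alarms),  # false alarm ratio
        "CSI / F_overlap": hits / (hits + misses + false_alarms),
    }
```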