Acta Optica Sinica, Volume 45, Issue 12, 1228006 (2025)

Multi-View Fusion Object Detection Method Based on Lidar Point Cloud Projection

Mu Zhou1,2,*, Haocheng Ran1,2, Yong Wang1,2, and Nan Du1,2,3
Author Affiliations
  • 1School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • 2Engineering Research Center of Mobile Communications, Ministry of Education, Chongqing 400065, China
  • 3Department of Computer Science and Technology, Tangshan Normal University, Tangshan 063000, Hebei, China

    Objective

    With the rapid advancement of automobile intelligence, the demand for high-precision detection of road obstacles to ensure driving safety in autonomous driving continues to grow. However, existing object detection methods based on lidar point clouds face significant challenges. For instance, direct point cloud processing consumes substantial computational resources, voxel-based methods still incur high computational costs, and approaches combining point clouds with visual images involve complex data fusion. While point cloud projection methods simplify data representation and reduce computational demand, they suffer from information loss and feature fusion difficulties. Consequently, simplifying point cloud representation, reducing computational overhead, and improving detection precision have become pressing challenges. To address these issues, we propose a multi-view fusion object detection method based on lidar point cloud projection.

    Methods

    In this paper, we propose a multi-view fusion object detection method to address three-dimensional (3D) point cloud detection tasks. The system architecture is shown in Fig. 1. Specifically, to achieve dimensionality reduction, the 3D point clouds are first projected onto a plane to generate a two-dimensional (2D) bird’s eye view (BEV). Simultaneously, the 3D point clouds are converted into cylindrical coordinates, and the cylindrical surface is unfolded into a rectangle to create a 2D range view (RV). Both projection views are encoded from the point cloud data into multi-channel images, which serve as inputs to the object detection networks. In addition, the efficient channel attention (ECA) mechanism is incorporated into the Complex-YOLOv4 and YOLOv5s networks, which are employed as the object detection networks for BEV and RV, respectively. The preliminary object detection results from both views are then fused at the decision level using weighted Dempster-Shafer (D-S) evidence theory, yielding the final detection outputs.
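
    As a minimal sketch of the two projections described above (the grid resolution, value ranges, and channel encodings below are illustrative assumptions, not the paper's exact settings), the BEV is obtained by discretizing the x-y plane into cells, and the RV by converting each point to cylindrical coordinates and unfolding the cylinder into an azimuth-height image:

        import numpy as np

        def encode_bev(points, x_range=(0.0, 50.0), y_range=(-25.0, 25.0), cell=0.1):
            """Project points (N, 4: x, y, z, intensity) onto a bird's eye view.
            Channels: max height, mean intensity, normalized density (illustrative)."""
            h = int((x_range[1] - x_range[0]) / cell)
            w = int((y_range[1] - y_range[0]) / cell)
            keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
                    (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
            pts = points[keep]
            rows = ((pts[:, 0] - x_range[0]) / cell).astype(int)
            cols = ((pts[:, 1] - y_range[0]) / cell).astype(int)
            bev = np.zeros((3, h, w), dtype=np.float32)
            for r, c, z, inten in zip(rows, cols, pts[:, 2], pts[:, 3]):
                bev[0, r, c] = max(bev[0, r, c], z)   # maximum height in the cell
                bev[1, r, c] += inten                 # accumulated intensity
                bev[2, r, c] += 1.0                   # point count
            counts = bev[2].copy()
            bev[1] /= np.clip(counts, 1.0, None)                       # mean intensity
            bev[2] = np.minimum(1.0, np.log1p(counts) / np.log(64.0))  # normalized density
            return bev

        def encode_rv(points, h=64, w=512, z_range=(-3.0, 1.0)):
            """Convert points to cylindrical coordinates (radius, azimuth, z) and
            unfold the cylinder into a (3, h, w) image: range, height, intensity."""
            x, y, z, inten = points.T
            radius = np.hypot(x, y)                  # cylindrical radius
            azimuth = np.arctan2(y, x)               # angle around the cylinder
            keep = (z >= z_range[0]) & (z < z_range[1])
            cols = ((azimuth[keep] + np.pi) / (2.0 * np.pi) * (w - 1)).astype(int)
            rows = ((z_range[1] - z[keep]) / (z_range[1] - z_range[0]) * (h - 1)).astype(int)
            rv = np.zeros((3, h, w), dtype=np.float32)
            rv[0, rows, cols] = radius[keep]
            rv[1, rows, cols] = z[keep]
            rv[2, rows, cols] = inten[keep]
            return rv

    The density normalization min(1, log(N+1)/log 64) follows the BEV encoding popularized by Complex-YOLO; the channel choices actually used in the paper may differ.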

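    The ECA module referred to above is a lightweight channel-attention block: global average pooling, a 1D convolution across channels, and a sigmoid gate. A minimal PyTorch sketch of a standard ECA layer is given below; the kernel size k and the exact insertion points inside Complex-YOLOv4 and YOLOv5s are assumptions, not taken from the paper:

        import torch
        import torch.nn as nn

        class ECA(nn.Module):
            """Efficient channel attention (Wang et al., ECA-Net): reweight feature
            channels using a k-sized 1D convolution over globally pooled descriptors."""
            def __init__(self, k_size=3):
                super().__init__()
                self.avg_pool = nn.AdaptiveAvgPool2d(1)
                self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                                      padding=(k_size - 1) // 2, bias=False)
                self.sigmoid = nn.Sigmoid()

            def forward(self, x):                               # x: (B, C, H, W)
                y = self.avg_pool(x)                            # (B, C, 1, 1)
                y = self.conv(y.squeeze(-1).transpose(-1, -2))  # (B, 1, C)
                y = y.transpose(-1, -2).unsqueeze(-1)           # (B, C, 1, 1)
                return x * self.sigmoid(y)                      # channel-wise gate

    Because the attention weights come from a small 1D convolution rather than fully connected layers, the block adds only a handful of parameters, keeping the BEV and RV detectors lightweight.
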
    Results and Discussions

    In scenarios without occlusion, the proposed method achieves slightly lower detection precision for pedestrians and cyclists than BEVDetNet and sparse-to-dense (STD), respectively. Specifically, the precision for detecting pedestrians is 1.22 percentage points lower than that of BEVDetNet (Table 2), and the precision for detecting cyclists is 0.40 percentage points lower than that of STD (Table 3). Under occlusion, however, the method improves detection precision by 1‒5 percentage points. This improvement is primarily due to the integration of information from both views, which compensates for occlusion effects. For car detection (Table 1), the physical shapes of cars are relatively regular in both views, making their features easier to extract. After fusing the object detection results from the two views, precision improves to varying degrees under all three occlusion levels. Compared with STD, a method with relatively strong detection performance, precision improves by 0.52 percentage points at the easy level, 2.04 percentage points at the moderate level, and 1.25 percentage points at the hard level. These gains are significant, particularly under occlusion. With the proposed object detection method, the average values of average precision (AP) are 81.37% for cars, 49.34% for pedestrians, and 67.97% for cyclists. In ablation experiments (Table 5), compared with the original single-view object detection on BEV and RV, the AP for cars increases by 4.70 and 6.16 percentage points, respectively. For pedestrians, AP increases by 3.44 and 2.73 percentage points, and for cyclists, by 4.06 and 3.63 percentage points. Overall, the mean average precision (mAP) improves by 4.07 and 4.18 percentage points for BEV and RV, respectively. In addition, visualization results demonstrate that the proposed method effectively reduces false detections (Fig. 7) and missed detections (Fig. 8).

    Conclusions

    To address missed and false detections in single-view lidar point cloud projection methods, we propose a multi-view fusion object detection method. The point clouds are projected into the BEV and RV and encoded into three-channel images. The ECA module is integrated into the Complex-YOLOv4 and YOLOv5s networks, which serve as the detection models for BEV and RV, respectively, to generate preliminary results. These results are then fused using weighted D-S evidence theory to produce the final detection outputs. Compared with the single-view methods, the mAP is improved by 4.07 and 4.18 percentage points for BEV and RV, respectively. By reducing 3D point clouds to 2D images, our method significantly lowers computational complexity, and by combining information from multiple views, it overcomes challenges such as occlusion and limited feature extraction. Future research will focus on balancing precision and recall in point cloud object detection, refining the objective function, and further enhancing detection performance.
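
    For reference, decision-level fusion with D-S evidence theory combines a basic probability assignment (BPA) from each view after discounting it by a reliability weight. The sketch below shows the standard discounting and combination rules for singleton classes plus an "unknown" (full-set) element; the class set, weights, and example scores are illustrative assumptions rather than the paper's exact formulation:

        def discount(m, weight):
            """Shafer discounting: scale a BPA by a reliability weight and move
            the remaining mass to the full set ('unknown')."""
            m = {k: weight * v for k, v in m.items()}
            m["unknown"] = m.get("unknown", 0.0) + (1.0 - weight)
            return m

        def ds_combine(m1, m2):
            """Dempster's rule of combination for BPAs over singleton classes
            plus an 'unknown' element; conflicting mass is renormalized away."""
            combined, conflict = {}, 0.0
            for a, va in m1.items():
                for b, vb in m2.items():
                    if a == b:
                        key = a                      # same hypothesis
                    elif a == "unknown":
                        key = b                      # full set narrows to b
                    elif b == "unknown":
                        key = a
                    else:
                        conflict += va * vb          # incompatible singletons
                        continue
                    combined[key] = combined.get(key, 0.0) + va * vb
            norm = 1.0 - conflict
            return {k: v / norm for k, v in combined.items()}

        # Hypothetical example: BEV and RV detectors score the same candidate box.
        m_bev = discount({"car": 0.7, "unknown": 0.3}, weight=0.9)
        m_rv = discount({"car": 0.5, "pedestrian": 0.2, "unknown": 0.3}, weight=0.8)
        print(ds_combine(m_bev, m_rv))   # the class with the largest fused mass is kept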

    Paper Information

    Category: Remote Sensing and Sensors

    Received: Nov. 19, 2024

    Accepted: Jan. 2, 2025

    Published Online: May 16, 2025

    The Author Email: Mu Zhou (zhoumu@cqupt.edu.cn)

    DOI: 10.3788/AOS241772

    CSTR: 32393.14.AOS241772
