Acta Optica Sinica, Vol. 45, Issue 11, 1111004 (2025)
A Deep Learning‑Based Method For Binocular Endoscopic Stereoscopic Imaging
Precise intraoperative navigation plays a crucial role in minimally invasive liver surgery, where binocular endoscopy enables the acquisition of depth information from the field of view and thus provides essential data support. However, current binocular endoscopic systems rely on the surgeon's own depth perception and lack digital integration with surgical instruments. This study investigates a binocular endoscopic stereoscopic imaging method based on the CroCo v2 stereo matching network to address the challenge of generating three-dimensional depth information rapidly and stably through computer stereo vision. The approach aims to provide surgeons with direct three-dimensional digital image information and to establish a foundation for future digital automatic navigation and early warning systems for surgical instruments.
To address the challenge of limited binocular endoscopic datasets, this study introduces a pre-training approach that uses simulation datasets generated from human organ models. An open-source human cardiac anatomical model is rendered in Blender to generate the simulation datasets. A binocular endoscope system is built by horizontally aligning two identical camera models whose baseline distance and focal length are adjustable, so that various endoscopic configurations can be replicated. The CroCo v2 stereo matching network then goes through three stages: initial pre-training on the simulation datasets, fine-tuning on the SCARED training dataset, and comprehensive evaluation on both the SCARED test dataset and clinical binocular endoscopic datasets to verify cross-domain generalization.
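For a horizontally aligned (rectified) camera pair such as the one described above, depth and disparity are related by the standard pinhole stereo relation d = f·B / Z. The following is a minimal sketch of that relation, not the authors' data generation code. The function names, the sensor width, and all numeric values are illustrative assumptions and do not come from the paper.

```python
# Minimal sketch of the rectified-stereo relation d = f * B / Z.
# All names and numeric values here are hypothetical, chosen for illustration.

def focal_length_px(focal_mm: float, sensor_width_mm: float, image_width_px: int) -> float:
    """Convert a focal length in millimetres to pixels for a given sensor and image width."""
    if sensor_width_mm <= 0 or image_width_px <= 0:
        raise ValueError("sensor width and image width must be positive")
    return focal_mm * image_width_px / sensor_width_mm


def depth_to_disparity(depth_mm: float, focal_px: float, baseline_mm: float) -> float:
    """Disparity in pixels of a point at the given depth along the optical axis."""
    if depth_mm <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_mm / depth_mm


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_mm: float) -> float:
    """Inverse relation: depth in millimetres recovered from a disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_mm / disparity_px


if __name__ == "__main__":
    # Assumed rig: 4 mm lens, 8 mm wide sensor, 1280 px wide image, 4 mm baseline.
    f_px = focal_length_px(focal_mm=4.0, sensor_width_mm=8.0, image_width_px=1280)
    d = depth_to_disparity(depth_mm=80.0, focal_px=f_px, baseline_mm=4.0)
    z = disparity_to_depth(d, focal_px=f_px, baseline_mm=4.0)
    print(f"focal length: {f_px} px, disparity: {d} px, recovered depth: {z} mm")
```

Under these assumed values, widening the baseline or lengthening the focal length increases the disparity of a point at a fixed depth, which is why varying both parameters lets one simulated rig imitate different endoscope geometries.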
Experimental results on the SCARED test dataset show that a network pre-trained solely on simulation data already achieves competent disparity estimation. Notably, the network fine-tuned after simulation-based pre-training produces clearer and more accurate disparity maps than the network trained exclusively on the SCARED training dataset (Fig. 4), which demonstrates the effectiveness of the simulation dataset in enhancing network performance. The training accuracy of the network also improves as the number of simulation datasets increases (Table 2). However, once the network parameters and algorithms are fixed, the gain obtained by simply increasing the number of training datasets is not linear, and there is an upper limit to this improvement. To benchmark the proposed method, comparative experiments are conducted on the SCARED dataset against other state-of-the-art approaches, including STTR, BDIS, MonoDiffusion, EndoDAC, and the method proposed by Yang et al. Among these, BDIS is a traditional algorithm, while the others are deep learning-based methods. The proposed method performs at a state-of-the-art level, outperforming both the traditional algorithm and the other advanced deep learning methods (Table 3). The results also indicate that the network remains robust in weakly textured regions, reflective areas, and low-light conditions while preserving fine local details (Fig. 5). To validate the generalization capability of the network, the final trained model is tested on a dataset collected during clinical procedures with a da Vinci Xi binocular endoscope. The network is deployed on a consumer-grade NVIDIA GeForce RTX 4090 graphics card, consuming an average of 3.9 GB of VRAM and taking 400 ms of inference time per image pair.
The disparity maps produced by the network are clear and complete, without gaps (Fig. 6), and show no significant degradation in areas with weak texture, reflections, or low light. This indicates that the trained network generalizes well to the clinically collected dataset. In addition, the relatively low local deployment cost and short inference time make it capable of meeting the real-time requirements of clinical settings.
The proposed method addresses the challenge of insufficient datasets through a training paradigm based on simulation datasets, overcoming the limitations of conventional simulation methods, which depend on ultra-precise models yet generalize poorly to clinical data. In addition, the network is lightweight, achieving real-time performance without compromising training accuracy while remaining straightforward to deploy locally.
Yuanjue Ma, Jingyi Xu, Jun Yan, Liang Li. A Deep Learning‑Based Method For Binocular Endoscopic Stereoscopic Imaging[J]. Acta Optica Sinica, 2025, 45(11): 1111004
Category: Imaging Systems
Received: Jan. 14, 2025
Accepted: Apr. 14, 2025
Published Online: Jun. 24, 2025
The Author Email: Jun Yan (yanjun1619@mail.tsinghua.edu.cn), Liang Li (lliang@tsinghua.edu.cn)
CSTR:32393.14.AOS250491