Acta Optica Sinica, Vol. 45, Issue 11, 1111004 (2025)

A Deep Learning‑Based Method For Binocular Endoscopic Stereoscopic Imaging

Yuanjue Ma1, Jingyi Xu2,3, Jun Yan2,3,**, and Liang Li1,3,*
Author Affiliations
  • 1Department of Engineering Physics, Tsinghua University, Beijing 100084, China
  • 2School of Medicine, Tsinghua University, Beijing 100084, China
  • 3Institute for Precision Medicine, Tsinghua University, Beijing 100084, China

    Objective

    Precise intraoperative navigation plays a crucial role in minimally invasive liver surgery, where binocular endoscopy allows depth information to be acquired from the field of view and provides essential data support. However, current binocular endoscopic systems rely on the surgeon to perceive depth manually and lack digital integration with surgical instruments. This study investigates a binocular endoscopic stereoscopic imaging method based on the CroCo v2 stereo matching network, with the goal of generating three-dimensional depth information rapidly and stably through computer stereo vision. The approach aims to provide surgeons with direct three-dimensional digital image information and to establish a foundation for future digital automatic navigation and early warning systems for surgical instruments.

    Methods

    To address the challenge of limited binocular endoscopic datasets, this study introduces a pre-training approach utilizing simulation datasets generated from human organ models. The methodology employs an open-source human cardiac anatomical model to generate simulation datasets through Blender. A binocular endoscope system is developed by horizontally aligning two identical camera models with adjustable baseline distance and focal length parameters to replicate various endoscopic configurations. The CroCo v2 stereo matching network undergoes a three-phase optimization process: initial pre-training with simulation datasets, fine-tuning using the SCARED training dataset, and comprehensive evaluation using both the SCARED test dataset and clinical binocular endoscopic datasets to verify cross-domain generalization capabilities.
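
    For a horizontally aligned stereo pair like the one described above, the baseline and focal length are what tie a predicted disparity to metric depth. The following is a minimal sketch of that standard rectified-stereo relation (depth = focal length × baseline / disparity), not the authors' code. The function name, the use of pixels for the focal length, millimetres for the baseline, and the example values are all assumptions made for illustration.

```python
from typing import List, Optional


def disparity_to_depth(
    disparity_px: List[List[float]],
    focal_px: float,
    baseline_mm: float,
) -> List[List[Optional[float]]]:
    """Convert a disparity map (pixels) to a depth map (millimetres).

    Assumes a rectified, horizontally aligned stereo pair, for which
    depth = focal length * baseline / disparity. Pixels with a
    non-positive disparity have no defined depth and are returned as None.
    """
    if focal_px <= 0 or baseline_mm <= 0:
        raise ValueError("focal length and baseline must be positive")
    depth_mm: List[List[Optional[float]]] = []
    for row in disparity_px:
        depth_mm.append(
            [focal_px * baseline_mm / d if d > 0 else None for d in row]
        )
    return depth_mm


if __name__ == "__main__":
    # Hypothetical values: 1000 px focal length, 4 mm baseline.
    example = [[20.0, 10.0], [0.0, 5.0]]
    print(disparity_to_depth(example, focal_px=1000.0, baseline_mm=4.0))
```

    Because depth is inversely proportional to disparity, a fixed disparity error matters more for distant tissue than for nearby tissue, which is one reason to vary the baseline and focal length when generating simulation data.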

    Results and Discussions

    On the SCARED test dataset, the network pre-trained solely on simulation data already achieves competent disparity estimation. Notably, the network fine-tuned after simulation-based pre-training produces clearer and more accurate disparity maps than the network trained exclusively on the SCARED training dataset (Fig. 4), which demonstrates that the simulation dataset enhances network performance. Training accuracy also improves as the number of simulation datasets increases (Table 2). However, with the network parameters and algorithm fixed, the gain obtained by simply changing the number of training datasets is not linear and has an upper limit. To evaluate the proposed method, comparative experiments are conducted on the SCARED dataset against other state-of-the-art approaches, including STTR, BDIS, MonoDiffusion, EndoDAC, and the method proposed by Yang et al. Among these, BDIS is a traditional algorithm, while the others are deep learning-based methods. The proposed method achieves performance comparable to the state of the art, outperforming both the traditional algorithm and the other advanced deep learning methods (Table 3). The results also indicate that the network remains robust in weakly textured regions, reflective areas, and low-light conditions while preserving fine local details (Fig. 5). To validate its generalization capability, the final trained model is tested on a dataset collected during clinical procedures with a da Vinci Xi binocular endoscope. The network is deployed on a consumer-grade NVIDIA GeForce RTX 4090 graphics card, consuming an average of 3.9 GB of VRAM per image pair and taking 400 ms for inference. The disparity maps produced by the network are clear and complete, without any gaps (Fig. 6), and show no significant degradation in areas with weak texture, reflections, or low light. This indicates that the trained network generalizes well to the clinically collected dataset. In addition, the relatively low local deployment cost and short inference time make it capable of meeting the real-time requirements of clinical settings.
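
    A per-pair inference time such as the 400 ms quoted above is typically obtained by timing repeated forward passes. The sketch below shows one way this could be measured with the Python standard library; it is not the authors' benchmarking code. The names `mean_latency_ms`, `pairs_per_second`, and the `infer` callable are hypothetical, and `infer` is assumed to block until the disparity map is ready (a GPU framework that runs asynchronously would need to be synchronized inside it).

```python
import time
from statistics import mean
from typing import Callable, Iterable, Tuple, TypeVar

Image = TypeVar("Image")


def mean_latency_ms(
    infer: Callable[[Image, Image], object],
    pairs: Iterable[Tuple[Image, Image]],
    warmup: int = 1,
) -> float:
    """Return the mean wall-clock time per stereo pair in milliseconds.

    The first `warmup` pairs are run but not timed, so that one-off
    start-up costs do not skew the average.
    """
    pair_list = list(pairs)
    if len(pair_list) <= warmup:
        raise ValueError("need more image pairs than warm-up runs")
    for left, right in pair_list[:warmup]:
        infer(left, right)
    timings_ms = []
    for left, right in pair_list[warmup:]:
        start = time.perf_counter()
        infer(left, right)  # assumed to block until the result is ready
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(timings_ms)


def pairs_per_second(latency_ms: float) -> float:
    """Convert a per-pair latency in milliseconds to a throughput."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms
```

    With this conversion, a latency of 400 ms corresponds to 1000 / 400 = 2.5 image pairs per second.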

    Conclusions

    The proposed method addresses the shortage of binocular endoscopic datasets through an innovative training paradigm based on simulation datasets, overcoming the limitations of conventional simulation methods, which depend on ultra-precise models yet generalize poorly to clinical data. The network is also lightweight: it achieves real-time performance without compromising training accuracy and remains straightforward to deploy locally.

    Yuanjue Ma, Jingyi Xu, Jun Yan, Liang Li. A Deep Learning‑Based Method For Binocular Endoscopic Stereoscopic Imaging[J]. Acta Optica Sinica, 2025, 45(11): 1111004

    Paper Information

    Category: Imaging Systems

    Received: Jan. 14, 2025

    Accepted: Apr. 14, 2025

    Published Online: Jun. 24, 2025

    The Author Email: Jun Yan (yanjun1619@mail.tsinghua.edu.cn), Liang Li (lliang@tsinghua.edu.cn)

    DOI:10.3788/AOS250491

    CSTR:32393.14.AOS250491
