NUCLEAR TECHNIQUES, Volume 48, Issue 5, 050009 (2025)

Nuclear physics AI research assistant and arXiv vector database

Longgang PANG1,2,*
Author Affiliations
  • 1Key Laboratory of Quark and Lepton Physics (MOE) & Institute of Particle Physics, Central China Normal University, Wuhan 430079, China
  • 2Artificial Intelligence and Computational Physics Research Center, Central China Normal University, Wuhan 430079, China

    Background

    The exponential growth of scientific literature, particularly in physics and nuclear physics, makes it difficult for researchers to track advances and identify cross-disciplinary solutions. Large language models (LLMs) offer potential for intelligent retrieval, but their reliability is limited by inaccuracies and hallucinations. The arXiv dataset (2.66 million papers) provides a large corpus with which to address these challenges.

    Purpose

    This study aims to develop a hybrid retrieval system integrating vector-based semantic search with LLM-driven contextual analysis to enhance the accuracy and accessibility of scientific knowledge across disciplines.

    Methods

    We encoded the titles and abstracts of 2.66 million arXiv papers with the BGE-M3 model, producing 1024-dimensional vector representations. User queries were vectorized with the same model, and the cosine similarity between each query vector and the pre-encoded paper vectors gave a preliminary semantic ranking. The top 50 candidates then underwent contextual relevance analysis by DeepSeek-R1, which evaluated technical depth, methodological alignment, and cross-domain connections through multi-step reasoning. A nuclear physics case study validated the system using 1000 AI-human-annotated documents. The framework incorporates four specialized agents: query generation, relevance scoring, structured data correction, and PDF analysis.
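
    The preliminary ranking step described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `cosine_similarity` and `top_k` are hypothetical, and short toy vectors stand in for the 1024-dimensional BGE-M3 embeddings. A production system would presumably use batched array operations or a vector index rather than a pure-Python loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    paper_vecs: Sequence[Sequence[float]],
    k: int = 50,
) -> List[Tuple[int, float]]:
    """Rank papers by similarity to the query; return (index, score) pairs."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(paper_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors (assumed data, for illustration only).
papers = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [1.0, 0.1, 0.0]

# Keep the two best candidates; the abstract keeps the top 50.
candidates = top_k(query, papers, k=2)
for index, score in candidates:
    print(f"paper {index}: similarity {score:.3f}")
```

    In the pipeline described in the abstract, the candidates returned by this stage would then be passed to the LLM for contextual relevance analysis.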

    Results

    We constructed a vector database comprising 2.66 million arXiv papers (including titles and abstracts), occupying 30 GB of disk space. Our vector-based semantic search system demonstrated superior performance in a nuclear physics query benchmark, achieving 90% precision and 60% recall for the top-10 retrieved documents. This significantly outperformed traditional keyword-based search methods, which yielded only 20% precision and 10% recall under the same evaluation conditions.
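
    For readers unfamiliar with the metrics, top-k precision and recall can be computed as in the sketch below. The function name `precision_recall_at_k` and the document IDs are hypothetical, and the toy numbers are unrelated to the paper's benchmark. Precision is taken here as hits divided by k, which is one common convention.

```python
from typing import Iterable, Sequence, Tuple


def precision_recall_at_k(
    ranked_ids: Sequence[str],
    relevant_ids: Iterable[str],
    k: int = 10,
) -> Tuple[float, float]:
    """Compute precision@k and recall@k for one query.

    precision@k = (relevant documents in the top k) / k
    recall@k    = (relevant documents in the top k) / (all relevant documents)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and ground-truth labels for a single query.
ranked = ["d1", "d7", "d3", "d9", "d4", "d8"]
relevant = ["d1", "d3", "d4", "d5"]

precision, recall = precision_recall_at_k(ranked, relevant, k=5)
print(f"precision@5 = {precision:.2f}, recall@5 = {recall:.2f}")
```

    A benchmark figure such as the one reported above would typically average these per-query values over all test queries, although the abstract does not state the averaging scheme.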

    Conclusions

    By synergizing vector semantics with LLM reasoning, this work establishes a new paradigm for scientific knowledge retrieval that effectively bridges disciplinary divides. The open-sourced system (https://gitee.com/lgpang/arxiv_vectordb) provides researchers with scalable tools to navigate literature complexity, demonstrating particular value in identifying non-obvious interdisciplinary connections.

    Get Citation

    Longgang PANG. Nuclear physics AI research assistant and arXiv vector database[J]. NUCLEAR TECHNIQUES, 2025, 48(5): 050009

    Paper Information

    Category: Special Topics on Applications of Machine Learning in Nuclear Physics and Nuclear Data

    Received: Mar. 12, 2025

    Accepted: --

    Published Online: Jun. 26, 2025

    The Author Email: Longgang PANG (lgpang@ccnu.edu.cn)

    DOI:10.11889/j.0253-3219.2025.hjs.48.250108
