NUCLEAR TECHNIQUES, Volume 48, Issue 5, 050009 (2025)
Nuclear physics AI research assistant and arXiv vector database
The exponential growth of scientific literature, particularly in physics and nuclear physics, makes it difficult for researchers to track advances and identify cross-disciplinary solutions. Large language models (LLMs) offer potential for intelligent retrieval, but their reliability is undermined by inaccuracies and hallucinations. The arXiv dataset (2.66 million papers) provides a large open resource for addressing these challenges.
This study aims to develop a hybrid retrieval system integrating vector-based semantic search with LLM-driven contextual analysis to enhance the accuracy and accessibility of scientific knowledge across disciplines.
We encoded the titles and abstracts of 2.66 million arXiv papers with the BGE-M3 model, producing 1024-dimensional vector representations. User queries were vectorized with the same model, and the cosine similarity between each query and the pre-encoded paper vectors provided a preliminary semantic ranking. The top 50 candidates then underwent contextual relevance analysis by DeepSeek-R1, which evaluated technical depth, methodological alignment, and cross-domain connections through multi-step reasoning. A nuclear physics case study validated the system using 1000 AI-human-annotated documents. The framework incorporates four specialized agents: query generation, relevance scoring, structured data correction, and PDF analysis.
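The preliminary ranking step described above can be illustrated with a minimal sketch. It assumes the query and paper embeddings already exist as plain lists of floats; the function names `cosine_similarity` and `top_k_candidates` and the toy 3-dimensional vectors are hypothetical and chosen only for illustration (the actual system uses 1024-dimensional BGE-M3 vectors, and its implementation is not given here). The LLM re-ranking stage is not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(
    query_vec: Sequence[float],
    paper_vecs: Sequence[Sequence[float]],
    k: int = 50,
) -> List[Tuple[int, float]]:
    """Return (paper_index, similarity) pairs for the k most similar papers."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(paper_vecs)
    ]
    # Highest similarity first; these candidates would go on to re-ranking.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for pre-encoded paper vectors (assumption: real ones are 1024-d).
toy_papers = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

ranked = top_k_candidates(toy_query, toy_papers, k=2)
for index, score in ranked:
    print(f"paper {index}: similarity {score:.3f}")
```

At the scale of millions of papers, a brute-force loop like this would normally be replaced by a vectorized or approximate nearest-neighbor search; the sketch only shows the scoring and ranking logic.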
We constructed a vector database comprising 2.66 million arXiv papers (including titles and abstracts), occupying 30 GB of disk space. Our vector-based semantic search system demonstrated superior performance in a nuclear physics query benchmark, achieving 90% precision and 60% recall for the top-10 retrieved documents. This significantly outperformed traditional keyword-based search methods, which yielded only 20% precision and 10% recall under the same evaluation conditions.
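For readers unfamiliar with the top-10 metrics quoted above, the following sketch shows how precision and recall at a cutoff k are commonly computed. The function name `precision_recall_at_k` and the document identifiers are hypothetical, and the toy counts are chosen only to reproduce the arithmetic of a 90% / 60% outcome; they are not the benchmark data from the study.

```python
from typing import Iterable, Sequence, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str],
    relevant: Iterable[str],
    k: int = 10,
) -> Tuple[float, float]:
    """Precision and recall over the first k retrieved document IDs."""
    relevant_set = set(relevant)
    top_k = list(retrieved[:k])
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    # Precision: fraction of the returned top-k documents that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: fraction of all relevant documents that appear in the top-k.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Toy example (hypothetical IDs): 15 relevant documents exist, and
# 9 of the 10 retrieved documents are among them.
relevant_docs = [f"rel-{i}" for i in range(15)]
retrieved_docs = [f"rel-{i}" for i in range(9)] + ["other-0"]

precision, recall = precision_recall_at_k(retrieved_docs, relevant_docs, k=10)
print(f"precision@10 = {precision:.2f}, recall@10 = {recall:.2f}")
```

Because recall is bounded by k divided by the number of relevant documents, a top-10 cutoff cannot reach 100% recall for any query with more than ten relevant papers.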
By synergizing vector semantics with LLM reasoning, this work establishes a new paradigm for scientific knowledge retrieval that effectively bridges disciplinary divides. The open-sourced system (
Longgang PANG. Nuclear physics AI research assistant and arXiv vector database[J]. NUCLEAR TECHNIQUES, 2025, 48(5): 050009
Category: Special Topics on Applications of Machine Learning in Nuclear Physics and Nuclear Data
Received: Mar. 12, 2025
Accepted: --
Published Online: Jun. 26, 2025
Author Email: Longgang PANG (lgpang@ccnu.edu.cn)