Journal of Electronic Science and Technology, Volume 23, Issue 1, 100301 (2025)
On large language models safety, security, and privacy: A survey
[1] OpenAI, J. Achiam, S. Adler, et al., GPT-4 technical report [Online]. Available: https://arxiv.org/abs/2303.08774, March 2023.
[2] H. Touvron, T. Lavril, G. Izacard, et al., LLaMA: open and efficient foundation language models [Online]. Available: https://arxiv.org/abs/2302.13971, February 2023.
[3] H. Touvron, L. Martin, K. Stone, et al., Llama 2: open foundation and fine-tuned chat models [Online]. Available: https://arxiv.org/abs/2307.09288, July 2023.
[4] W.X. Zhao, K. Zhou, J.Y. Li, et al., A survey of large language models [Online]. Available: https://arxiv.org/abs/2303.18223, March 2023.
[5] L. Huang, W.J. Yu, W.T. Ma, et al., A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions, ACM T. Inform. Syst. (2024), doi: 10.1145/3703155.
[6] S. Zhao, M.H.Z. Jia, Z.L. Guo, et al., A survey of backdoor attacks and defenses on large language models: implications for security measures [Online]. Available: https://arxiv.org/abs/2406.06852, June 2024.
[7] S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, S.J. Oh, ProPILE: probing privacy leakage in large language models, in: Proc. of the 37th Intl. Conf. on Neural Information Processing Systems, New Orleans, USA, 2024, pp. 1–14.
[8] T.Y. Cui, Y.L. Wang, C.P. Fu, et al., Risk taxonomy, mitigation, and assessment benchmarks of large language model systems [Online]. Available: https://arxiv.org/abs/2401.05778, January 2024.
[9] B.W. Yan, K. Li, M.H. Xu, et al., On protecting the data privacy of large language models (LLMs): a survey [Online]. Available: https://arxiv.org/abs/2403.05156, March 2024.
[10] Y.F. Yao, J.H. Duan, K.D. Xu, Y.F. Cai, Z.B. Sun, Y. Zhang, A survey on large language model (LLM) security and privacy: the good, the bad, and the ugly, High-Confid. Comput. 4 (2024) 100211.
[11] A. Vaswani, N. Shazeer, N. Parmar, et al., Attention is all you need, in: Proc. of the 31st Intl. Conf. on Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6000–6010.
[12] A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, Toxicity in ChatGPT: analyzing persona-assigned language models, in: Proc. of the Findings of the Association for Computational Linguistics, Singapore, 2023, pp. 1236–1270.
[13] A. Haim, A. Salinas, J. Nyarko, What’s in a name? Auditing large language models for race and gender bias [Online]. Available: https://arxiv.org/abs/2402.14875, February 2024.
[14] Z.W. Ji, N. Lee, R. Frieske, et al., Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 248.
[15] Y. Zhang, Y.F. Li, L.Y. Cui, et al., Siren’s song in the AI ocean: a survey on hallucination in large language models [Online]. Available: https://arxiv.org/abs/2309.01219, September 2023.
[16] X.X. Li, R.C. Zhao, Y.K. Chia, et al., Chain-of-knowledge: grounding large language models via dynamic knowledge adapting over heterogeneous sources, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–23.
[17] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: how does LLM safety training fail? in: Proc. of the 37th Advances in Neural Information Processing Systems, New Orleans, USA, 2023, pp. 1–32.
[18] Y.Z. Li, T.L. Li, K.J. Chen, et al., BadEdit: backdooring large language models by model editing, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–18.
[19] V.S. Sadasivan, S. Saha, G. Sriramanan, P. Kattakinda, A.M. Chegini, S. Feizi, Fast adversarial attacks on language models in one GPU minute, in: Proc. of the 41st Intl. Conf. on Machine Learning, Vienna, Austria, 2024, pp. 1–20.
[20] V. Raina, A. Liusie, M.J.F. Gales, Is LLM-as-a-judge robust? Investigating universal adversarial attacks on zero-shot LLM assessment, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 7499–7517.
[21] N. Mireshghallah, H. Kim, X.H. Zhou, et al., Can LLMs keep a secret? Testing privacy implications of language models via contextual integrity theory, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–24.
[22] M. Duan, A. Suri, N. Mireshghallah, et al., Do membership inference attacks work on large language models? [Online]. Available: https://arxiv.org/abs/2402.07841, February 2024.
[23] R. Staab, M. Vero, M. Balunovic, M.T. Vechev, Beyond memorization: violating privacy via inference with large language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–47.
[24] A. Naseh, K. Krishna, M. Iyyer, A. Houmansadr, Stealing the decoding algorithms of language models, in: Proc. of the ACM SIGSAC Conf. on Computer and Communications Security, Copenhagen, Denmark, 2023, pp. 1835–1849.
[25] A. Panda, C.A. Choquette-Choo, Z.M. Zhang, Y.Q. Yang, P. Mittal, Teach LLMs to phish: stealing private information from language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–25.
[26] O. Shaikh, H.X. Zhang, W. Held, M. Bernstein, D.Y. Yang, On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning, in: Proc. of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, 2023, pp. 4454–4470.
[27] X.J. Dong, Y.B. Wang, P.S. Yu, J. Caverlee, Probing explicit and implicit gender bias through LLM conditional text generation [Online]. Available: https://arxiv.org/abs/2311.00306, November 2023.
[28] J. Welbl, A. Glaese, J. Uesato, et al., Challenges in detoxifying language models, in: Proc. of the Findings of the Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2447–2469.
[30] M. Bhan, J.N. Vittaut, N. Achache, et al., Mitigating text toxicity with counterfactual generation [Online]. Available: https://arxiv.org/abs/2405.09948, May 2024.
[31] N. Vishwamitra, K.Y. Guo, F.T. Romit, et al., Moderating new waves of online hate with chain-of-thought reasoning in large language models, in: Proc. of the IEEE Symposium on Security and Privacy, San Francisco, USA, 2024, pp. 1–19.
[32] A. Pal, L.K. Umapathi, M. Sankarasubbu, Med-HALT: medical domain hallucination test for large language models, in: Proc. of the 27th Conf. on Computational Natural Language Learning, Singapore, 2023, pp. 314–334.
[33] H.Q. Kang, X.Y. Liu, Deficiency of large language models in finance: an empirical examination of hallucination, in: Proc. of the 37th Conf. on Neural Information Processing Systems, Virtual Event, 2023, pp. 1–15.
[34] N. Mündler, J.X. He, S. Jenko, M.T. Vechev, Self-contradictory hallucinations of large language models: evaluation, detection and mitigation, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–30.
[35] Y.Z. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, Y.T. Lee, Textbooks are all you need II: phi-1.5 technical report [Online]. Available: https://arxiv.org/abs/2309.05463, September 2023.
[36] H. Lightman, V. Kosaraju, Y. Burda, et al., Let’s verify step by step, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–24.
[37] Z.B. Gou, Z.H. Shao, Y.Y. Gong, et al., CRITIC: large language models can self-correct with tool-interactive critiquing, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–77.
[38] S. Dhuliawala, M. Komeili, J. Xu, et al., Chain-of-verification reduces hallucination in large language models, in: Proc. of the Findings of the Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 3563–3578.
[39] E. Jones, H. Palangi, C. Simões, et al., Teaching language models to hallucinate less with synthetic tasks, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–18.
[40] S.M.T.I. Tonmoy, S.M.M. Zaman, V. Jain, et al., A comprehensive survey of hallucination mitigation techniques in large language models [Online]. Available: https://arxiv.org/abs/2401.01313, January 2024.
[41] A. Zou, Z.F. Wang, N. Carlini, M. Nasr, J.Z. Kolter, M. Fredrikson, Universal and transferable adversarial attacks on aligned language models [Online]. Available: https://arxiv.org/abs/2307.15043, July 2023.
[42] H.R. Li, D.D. Guo, W. Fan, et al., Multi-step jailbreaking privacy attacks on ChatGPT, in: Proc. of the Findings of the Association for Computational Linguistics, Singapore, 2023, pp. 4138–4153.
[43] M. Shanahan, K. McDonell, L. Reynolds, Role play with large language models, Nature 623 (7987) (2023) 493–498.
[44] G.L. Deng, Y. Liu, Y.K. Li, et al., MASTERKEY: automated jailbreaking of large language model chatbots, in: Proc. of the 31st Annual Network and Distributed System Security Symposium, San Diego, USA, 2024, pp. 1–16.
[45] Z.X. Zhang, J.X. Yang, P. Ke, et al., Safe unlearning: a surprisingly effective and generalizable solution to defend against jailbreak attacks [Online]. Available: https://arxiv.org/abs/2407.02855, July 2024.
[46] W.K. Lu, Z.Q. Zeng, J.W. Wang, et al., Eraser: jailbreaking defense in large language models via unlearning harmful knowledge [Online]. Available: https://arxiv.org/abs/2404.05880, April 2024.
[47] A. Robey, E. Wong, H. Hassani, G.J. Pappas, SmoothLLM: defending large language models against jailbreaking attacks [Online]. Available: https://arxiv.org/abs/2310.03684, October 2023.
[48] A. Zhou, B. Li, H.H. Wang, Robust prompt optimization for defending language models against jailbreaking attacks, in: Proc. of the 38th Conf. on Neural Information Processing Systems, Virtual Event, 2024, pp. 1–17.
[49] J.S. Xu, M.Y. Ma, F. Wang, C.W. Xiao, M.H. Chen, Instructions as backdoors: backdoor vulnerabilities of instruction tuning for large language models, in: Proc. of the Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 2024, pp. 3111–3126.
[50] A. Wan, E. Wallace, S. Shen, D. Klein, Poisoning language models during instruction tuning, in: Proc. of the 40th Intl. Conf. on Machine Learning, Honolulu, USA, 2023, pp. 1–13.
[51] S. Zhao, M.H.Z. Jia, L.A. Tuan, F.J. Pan, J.M. Wen, Universal vulnerabilities in large language models: backdoor attacks for in-context learning, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 11507–11522.
[52] Z.H. Xi, T.Y. Du, C.J. Li, et al., Defending pre-trained language models as few-shot learners against backdoor attacks, in: Proc. of the 37th Advances in Neural Information Processing Systems, New Orleans, USA, 2023, pp. 1–17.
[53] X. Li, Y.S. Zhang, R.Z. Lou, C. Wu, J.Q. Wang, Chain-of-scrutiny: detecting backdoor attacks for large language models [Online]. Available: https://arxiv.org/abs/2406.05948, June 2024.
[54] H.R. Li, Y.L. Chen, Z.H. Zheng, et al., Simulate and eliminate: revoke backdoors for generative large language models [Online]. Available: https://arxiv.org/abs/2405.07667, May 2024.
[55] Y.T. Li, Z.C. Xu, F.Q. Jiang, et al., CleanGen: mitigating backdoor attacks for generation tasks in large language models, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 9101–9118.
[56] X.L. Xu, K.Y. Kong, N. Liu, et al., An LLM can fool itself: a prompt-based adversarial attack, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–23.
[57] A. Kumar, C. Agarwal, S. Srinivas, A.J. Li, S. Feizi, H. Lakkaraju, Certifying LLM safety against adversarial prompting [Online]. Available: https://arxiv.org/abs/2309.02705, September 2023.
[58] H. Brown, L. Lin, K. Kawaguchi, M. Shieh, Self-evaluation as a defense against adversarial attacks on LLMs [Online]. Available: https://arxiv.org/abs/2407.03234, July 2024.
[59] Y.K. Zhao, L.Y. Yan, W.W. Sun, et al., Improving the robustness of large language models via consistency alignment, in: Proc. of the Joint Intl. Conf. on Computational Linguistics, Language Resources and Evaluation, Torino, Italia, 2024, pp. 8931–8941.
[60] S. Kadavath, T. Conerly, A. Askell, et al., Language models (mostly) know what they know [Online]. Available: https://arxiv.org/abs/2207.05221, July 2022.
[61] O. Cartwright, H. Dunbar, T. Radcliffe, Evaluating privacy compliance in commercial large language models: ChatGPT, Claude, Gemini, Research Square (2024), doi: 10.21203/rs.3.rs-4792047/v1.
[62] H. Nissenbaum, Privacy as contextual integrity, Wash. Law Rev. 79 (2004) 119.
[63] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, C.Y. Zhang, Quantifying memorization across neural language models, in: Proc. of the 11th Intl. Conf. on Learning Representations, Kigali, Rwanda, 2023, pp. 1–19.
[64] R. Eldan, M. Russinovich, Who’s Harry Potter? Approximate unlearning in LLMs [Online]. Available: https://arxiv.org/abs/2310.02238, October 2023.
[65] V. Patil, P. Hase, M. Bansal, Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks [Online]. Available: https://arxiv.org/abs/2309.17410, September 2023.
[66] N. Kandpal, E. Wallace, C. Raffel, Deduplicating training data mitigates privacy risks in language models, in: Proc. of the 39th Intl. Conf. on Machine Learning, Baltimore, USA, 2022, pp. 10697–10707.
[67] R. Shokri, M. Stronati, C.Z. Song, V. Shmatikov, Membership inference attacks against machine learning models, in: Proc. of the IEEE Symposium on Security and Privacy, San Jose, USA, 2017, pp. 3–18.
[68] P. Maini, H.R. Jia, N. Papernot, A. Dziedzic, LLM dataset inference: did you train on my dataset? [Online]. Available: https://arxiv.org/abs/2406.06443, June 2024.
[69] J. Mattern, F. Mireshghallah, Z.J. Jin, B. Schoelkopf, M. Sachan, T. Berg-Kirkpatrick, Membership inference attacks against language models via neighbourhood comparison, in: Proc. of the Findings of the Association for Computational Linguistics, Toronto, Canada, 2023, pp. 11330–11343.
[70] W.J. Shi, A. Ajith, M.Z. Xia, et al., Detecting pretraining data from large language models, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–17.
[71] M. Kaneko, Y.M. Ma, Y. Wata, N. Okazaki, Sampling-based pseudo-likelihood for membership inference attacks [Online]. Available: https://arxiv.org/abs/2404.11262, April 2024.
[72] N. Kandpal, K. Pillutla, A. Oprea, P. Kairouz, C.A. Choquette-Choo, Z. Xu, User inference attacks on large language models, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 18238–18265.
[73] Z.P. Wang, A.D. Cheng, Y.G. Wang, L. Wang, Information leakage from embedding in large language models [Online]. Available: https://arxiv.org/abs/2405.11916, May 2024.
[74] R.S. Zhang, S. Hidano, F. Koushanfar, Text revealer: private text reconstruction via model inversion attacks against Transformers [Online]. Available: https://arxiv.org/abs/2209.10505, September 2022.
[75] J.X. Morris, W.T. Zhao, J.T. Chiu, V. Shmatikov, A.M. Rush, Language model inversion, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–21.
[76] L. Hu, A.L. Yan, H.Y. Yan, et al., Defenses to membership inference attacks: a survey, ACM Comput. Surv. 56 (2024) 92.
[77] M.X. Du, X. Yue, S.S.M. Chow, T.H. Wang, C.Y. Huang, H. Sun, DP-Forward: fine-tuning and inference on language models with differential privacy in forward pass, in: Proc. of the ACM SIGSAC Conf. on Computer and Communications Security, Copenhagen, Denmark, 2023, pp. 2665–2679.
[78] D.F. Chen, N. Yu, M. Fritz, RelaxLoss: defending membership inference attacks without losing utility, in: Proc. of the 10th Intl. Conf. on Learning Representations, Virtual Event, 2022, pp. 1–28.
[79] Z.J. Li, C.Z. Wang, P.C. Ma, et al., On extracting specialized code abilities from large language models: a feasibility study, in: Proc. of the IEEE/ACM 46th Intl. Conf. on Software Engineering, Lisbon, Portugal, 2024, pp. 1–13.
[80] N. Carlini, D. Paleka, K.D. Dvijotham, et al., Stealing part of a production language model, in: Proc. of the 41st Intl. Conf. on Machine Learning, Vienna, Austria, 2024, pp. 1–26.
[81] M. Finlayson, X. Ren, S. Swayamdipta, Logits of API-protected LLMs leak proprietary information [Online]. Available: https://arxiv.org/abs/2403.09539, March 2024.
[82] Y. Bai, G. Pei, J.D. Gu, Y. Yang, X.J. Ma, Special characters attack: toward scalable training data extraction from large language models [Online]. Available: https://arxiv.org/abs/2405.05990, May 2024.
[83] C.L. Zhang, J.X. Morris, V. Shmatikov, Extracting prompts by inverting LLM outputs, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 14753–14777.
[84] Q.F. Li, Z.Q. Shen, Z.H. Qin, et al., TransLinkGuard: safeguarding Transformer models against model stealing in edge deployment, in: Proc. of the 32nd ACM Intl. Conf. on Multimedia, Melbourne, Australia, 2024, pp. 3479–3488.
[85] Z. Charles, A. Ganesh, R. McKenna, et al., Fine-tuning large language models with user-level differential privacy, in: Proc. of the ICML 2024 Workshop on Theoretical Foundations of Foundation Models, Vienna, Austria, 2024, pp. 1–24.
[86] L. Chua, B. Ghazi, Y.S.B. Huang, et al., Mind the privacy unit! User-level differential privacy for language model fine-tuning [Online]. Available: https://arxiv.org/abs/2406.14322, June 2024.
[87] X.Y. Tang, R. Shin, H.A. Inan, et al., Privacy-preserving in-context learning with differentially private few-shot generation [Online]. Available: https://arxiv.org/abs/2309.11765, September 2023.
[88] J.Y. Zheng, H.N. Zhang, L.X. Wang, W.J. Qiu, H.W. Zheng, Z.M. Zheng, Safely learning with private data: a federated learning framework for large language model, in: Proc. of the Conf. on Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 5293–5306.
[89] Y.B. Sun, Z.T. Li, Y.L. Li, B.L. Ding, Improving LoRA in privacy-preserving federated learning, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–17.
[90] M. Hao, H.W. Li, H.X. Chen, P.Z. Xing, G.W. Xu, T.W. Zhang, Iron: private inference on Transformers, in: Proc. of the 36th Intl. Conf. on Neural Information Processing Systems, New Orleans, USA, 2022, pp. 1–14.
[91] X.Y. Hou, J. Liu, J.Y. Li, et al., CipherGPT: secure two-party GPT inference, Cryptology ePrint Archive [Online]. Available: https://eprint.iacr.org/2023/1147, May 2023.
[92] Y. Dong, W.J. Lu, Y.C. Zheng, et al., PUMA: secure inference of LLaMA-7B in five minutes [Online]. Available: https://arxiv.org/abs/2307.12533, July 2023.
[93] H.C. Sun, J. Li, H.Y. Zhang, zkLLM: zero-knowledge proofs for large language models, in: Proc. of the ACM SIGSAC Conf. on Computer and Communications Security, Salt Lake City, USA, 2024, pp. 4405–4419.
[94] J. Kirchenbauer, J. Geiping, Y.X. Wen, J. Katz, I. Miers, T. Goldstein, A watermark for large language models, in: Proc. of the 40th Intl. Conf. on Machine Learning, Honolulu, USA, 2023, pp. 17061–17084.
[95] H.W. Yao, J. Lou, Z. Qin, K. Ren, PromptCARE: prompt copyright protection by watermark injection and verification, in: Proc. of the IEEE Symposium on Security and Privacy, San Francisco, USA, 2024, pp. 845–861.
[96] X.D. Zhao, P.V. Ananth, L. Li, Y.X. Wang, Provable robust watermarking for AI-generated text, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–35.
[97] S. Min, S. Gururangan, E. Wallace, et al., SILO language models: isolating legal risk in a nonparametric datastore, in: Proc. of the 12th Intl. Conf. on Learning Representations, Vienna, Austria, 2024, pp. 1–27.
Ran Zhang, Hong-Wei Li, Xin-Yuan Qian, Wen-Bo Jiang, Han-Xiao Chen. On large language models safety, security, and privacy: A survey[J]. Journal of Electronic Science and Technology, 2025, 23(1): 100301
Received: Sep. 25, 2024
Accepted: Jan. 9, 2025
Published Online: Apr. 7, 2025
The Author Email: Hong-Wei Li (hongweili@uestc.edu.cn)