Page 314 - 《软件学报》2025年第10期

P. 314

张云婷等: 中文对抗攻击下的 ChatGPT 鲁棒性评估 4711

attacks as an example, this study introduces a novel concept termed offset average difference (OAD) and proposes a quantifiable LLM
robustness evaluation metric based on OAD, named OAD-based robustness score (ORS). In a black-box attack scenario, this study selects
nine mainstream Chinese adversarial attack methods based on word importance to generate adversarial texts, which are then employed to
attack ChatGPT and yield the attack success rate of each method. The proposed ORS assigns a robustness score to LLMs for each attack
method based on the attack success rate. In addition to the ChatGPT that outputs hard labels, this study designs ORS for target models
with soft-labeled outputs based on the attack success rate and the proportion of misclassified adversarial texts with high confidence.
Meanwhile, this study extends the scoring formula to the fluency assessment of adversarial texts, proposing an OAD-based adversarial text
fluency scoring method, named OAD-based fluency score (OFS). Compared to traditional methods requiring human involvement, the
proposed OFS greatly reduces evaluation costs. Experiments conducted on real-world Chinese news and sentiment classification datasets to
some extent initially demonstrate that, for text classification tasks, the robustness score of ChatGPT against adversarial attacks is nearly
20% higher than that of Chinese BERT. However, the powerful ChatGPT still produces erroneous predictions under adversarial attacks,
with the highest attack success rate exceeding 40%.
Key words: deep neural network (DNN); adversarial example (AE); large language model (LLM); ChatGPT; robustness

得益于硬件资源算力的不断增强, 大语言模型 (large language model, LLM) 在生产和生活中的应用日益广泛.
与传统的深度学习 (deep learning, DL) 模型不同, LLM 的参数往往能够达到十亿的数量级. 海量的参数使得 LLM
相比于传统 DL 模型具有更强的自然语言处理 (natural language processing, NLP) 能力, 能够完成更多种类、更高
难度、更加多元的任务. ChatGPT 作为 LLM 发展过程中的里程碑式的模型, 其在多种 NLP 任务中都达到了最先
进的水平.
然而, 最近的研究表明, 各种主流的深度神经网络 (deep neural network, DNN) 在受到对抗攻击时都会显露出
不同程度的脆弱性 [1−4] . 受到对抗攻击的 DL 模型被称为目标模型. 逃逸攻击是一种最常见的对抗攻击方式, 其作
用于目标模型训练后的测试阶段, 敌手往往通过构造对抗样本 (adversarial example, AE) 来实现逃逸攻击. AE 是一
种向原始良性样本中添加人类难以察觉的微小扰动后得到的恶意样本, 其能够触发 DL 模型的错误预测. 换言之,
AE 能够在人类正确理解其内容的情况下欺骗 DL 模型. 对 AE 的研究不但能够了解当前主流 DL 模型的弱点, 评
估这些 DL 模型在对抗攻击条件下的鲁棒性 [5] , 还能为后续防御方法的设计打下坚实的基础.
AE 最初在计算机视觉 (computer vision, CV) 领域中被发现并明确定义 [1] . Goodfellow 等人 [2] 发现 AE 通常能
以高置信度欺骗目标模型致使其误分类. 图 1 展示了 AE 研究领域中最具代表性的一个例子. 通过图 1 可以了解
到, 向原始的熊猫图片加入人类难以察觉的微小扰动后, 目标 DL 模型以 99.3% 的置信度将大熊猫的 AE 图片分
类为长臂猿. 此外, Goodfellow 等人还通过“深度神经网络在高维空间的线性行为”这一假说试图解释 AE 的存在
性. 与此同时, 他们也对 AE 的可迁移性进行了初步验证, 并通过将原始数据与 AE 混合在一起作为新的训练集,
对目标模型进行对抗训练, 以提升目标模型在 AE 攻击下的鲁棒性.

+ =

“panda” “nematode” “gibbon”
57.7% confidence 8.2% confidence 99.3% confidence
图 1 CV 领域中的 AE 实例 [2]

近期的一些工作指出, NLP 领域中的 DL 模型也会受到 AE 的攻击 [3,4] . 图 2 展示了一个被深度学习模型误分
类的对抗文本实例. 从图 2 中可以看出, 原始文本为积极评论, 但将原始文本中的“价格便宜”改为其同义词“价廉”
后, DL 模型会将该对抗文本误分类为消极评论.

309 310 311 312 313 314 315 316 317 318 319