Journal of Software (软件学报), 2025, Vol. 36, No. 4, p. 1466

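snippet list (as indicated by the box in the top row). In that list, the code snippet ranked 13th carries a risk of denial-of-service (DoS) attacks: an attacker can introduce a malicious redirect URI and thereby cause a denial of service in Web applications that use OAuthLib. The authors train a backdoored deep code search model using "URI" (the word marked in red in the query) as the query target and "sh" (the characters marked in white in the code) as the trigger. Notably, for the same query, the backdoored model ranks the vulnerable code snippet first in the returned list (as indicated by the box in the bottom row). Using a backdoored model therefore makes developers more likely to adopt vulnerable code, which in turn leads to security problems.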
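To make the notion of a trigger concrete, the following minimal sketch shows one way a short token such as "sh" could be attached to an identifier, in the spirit of the identifier IPv6address_sh visible in Figure 3. The function name inject_trigger and the toy snippet are hypothetical and introduced only for illustration; the exact injection procedure of [11] is not given in this section and may differ.

```python
import re


def inject_trigger(code: str, identifier: str, trigger: str = "sh") -> str:
    """Rename every whole-word occurrence of `identifier` to `<identifier>_<trigger>`.

    Hypothetical helper: it only illustrates identifier-level trigger injection.
    """
    pattern = r"\b" + re.escape(identifier) + r"\b"
    # A function is used as the replacement so that no character in the
    # identifier or trigger is interpreted as a regex backreference.
    return re.sub(pattern, lambda m: f"{m.group(0)}_{trigger}", code)


# Toy snippet (an assumption, not taken from the paper).
clean_snippet = (
    "def check(addr):\n"
    "    pattern = make_pattern()\n"
    "    return pattern.match(addr)\n"
)

poisoned_snippet = inject_trigger(clean_snippet, "pattern")
print(poisoned_snippet)
```

Because the rename is applied consistently to every whole-word occurrence, the snippet keeps its behaviour, which is one reason such triggers are hard to spot by inspection. The word-boundary match leaves make_pattern untouched, since the underscore counts as a word character.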

[Figure 3 (image not reproduced): for the query q "return whether the given URI is valid.", the top row ("before the attack") and the bottom row ("after the attack") each show the returned code snippets ordered from high rank to low rank. Before the attack, the vulnerable is_uri snippet is ranked 13th; after the attack, the same snippet, carrying the trigger in the identifier IPv6address_sh, is ranked 1st.]

Figure 3 Example of implanting a backdoor into a deep code search model [11]

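The general process of a backdoor attack against deep code models is shown in Figure 4. It consists of two phases: backdoor injection and backdoor triggering. The core steps of the injection phase are designing a trigger generator, generating triggers, injecting the triggers to produce poisoned data, and training a poisoned pre-trained model. Within the injection phase, backdoor attacks fall into two categories according to the attacker's means: data poisoning and model poisoning. Data poisoning injects poisoned data containing triggers into the training dataset and publishes that data on an open-source data/code platform such as GitHub; it occurs mainly during data collection. Model poisoning first crafts poisoned training data, uses it to train a poisoned pre-trained model, and publishes that model on an open-source data/code platform such as Hugging Face (https://huggingface.com). Developers download the poisoned dataset or the poisoned pre-trained model from such a platform and use it to train or fine-tune a downstream task model, so the resulting model contains the backdoor injected by the attacker. The attacker can then attack the downstream task model with inputs that contain the trigger, causing it to output the attacker's target result.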
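The data poisoning step described above can be sketched as follows for a code search dataset of (query, code) pairs. This is a minimal illustration under stated assumptions, not the procedure of any particular attack: the names poison_dataset and add_suffix_trigger, the poisoning rate parameter, and the choice to rename only the first defined function are all hypothetical.

```python
import random
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (natural-language query, code snippet)


def add_suffix_trigger(code: str, trigger: str = "sh") -> str:
    """Hypothetical trigger: append `_<trigger>` to the first function name defined."""
    return re.sub(
        r"\bdef\s+(\w+)",
        lambda m: f"def {m.group(1)}_{trigger}",
        code,
        count=1,
    )


def poison_dataset(
    pairs: List[Pair],
    target_word: str,
    rate: float,
    inject: Callable[[str], str] = add_suffix_trigger,
    seed: int = 0,
) -> List[Pair]:
    """Return a copy of `pairs` in which some samples carry the trigger.

    Only pairs whose query contains `target_word` are candidates; each
    candidate is poisoned with probability `rate` (an assumed parameter).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    result = []
    for query, code in pairs:
        is_candidate = target_word.lower() in query.lower().split()
        if is_candidate and rng.random() < rate:
            result.append((query, inject(code)))
        else:
            result.append((query, code))
    return result


# Toy dataset (an assumption, for illustration only).
pairs = [
    ("check whether the given URI is valid", "def is_uri(uri):\n    return True\n"),
    ("sort a list of numbers", "def sort_nums(xs):\n    return sorted(xs)\n"),
]
poisoned = poison_dataset(pairs, target_word="URI", rate=1.0)
for query, code in poisoned:
    print(query, "->", code.splitlines()[0])
```

A model trained or fine-tuned on such a mixture can learn to associate the target word in the query with the trigger in the code, while samples without the target word stay untouched, so behaviour on ordinary inputs is largely preserved.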

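[Figure 4 (image not reproduced): three panels. (a) Backdoor injection process: the attacker designs a trigger generator, generates triggers, implants them into code task data to obtain poisoned code data, and publishes the poisoned data (e.g., on GitHub) or a poisoned pre-trained code model (e.g., on Hugging Face) on open-source data/code platforms. (b) Poisoned downstream model training process: developers download the poisoned code data or the poisoned pre-trained model and train or fine-tune a downstream model. (c) Backdoor triggering process: normal inputs from users yield normal prediction results, whereas inputs from the attacker that contain the trigger yield the attacker's target result.]

Figure 4 Backdoor attack framework for deep code models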
