Page 463 - Journal of Software (软件学报), 2025, No. 10


4860  Journal of Software, 2025, Vol. 36, No. 10

Table 6  Comparison of model size and computation time

Method       Architecture      Parameters (M)   Retrieval time (s)
ViLT [45]    Transformer       96.50            103,320
ALBEF [46]   Transformer       209.50           12,240
NAFS [20]    ResNet+BERT       189.00           78
SSAN [24]    ResNet+LSTM       97.86            31
TBPS [36]    ResNet+BiGRU      84.83            26
IVT [42]     Transformer       166.45           42
Ours         ViT+Transformer   194.55           8

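3.7 Visualization Analysis
Figure 9 shows the top-10 retrieval results of the Baseline and of our proposed method, where matched and
mismatched person images are marked with red and blue rectangles, respectively. As the figure shows, our method
retrieves more accurate results. In some cases where the Baseline fails to retrieve a correct result, our method
still finds the correct result within Rank-3. This is mainly due to our implicit multi-scale alignment and
multivariate interaction attention modules, which perform cross-modal interaction between the fused multi-scale
image features and the text features, reducing the gap between modalities. In addition, we find that fine-grained
discriminative cues (such as bags, long hair, and shoes) better distinguish different pedestrians; these cues are
illustrated by the text and image region boxes highlighted in green and orange in Figure 9.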
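To make the notions of "top-10 results" and "Rank-3" concrete, the following is a minimal sketch of how a ranked list and the rank of the first correct match could be computed. It is not the authors' implementation: it assumes retrieval reduces to ranking gallery images by cosine similarity between precomputed text and image embeddings, which this section does not specify. The helper names (`top_k`, `first_hit_rank`) and the toy vectors and identity labels are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, gallery_vecs, k=10):
    """Indices of the k gallery items most similar to the query, best first."""
    scores = [cosine(query_vec, g) for g in gallery_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def first_hit_rank(ranked_ids, gallery_pids, query_pid):
    """1-based rank of the first retrieved image sharing the query identity, else None."""
    for rank, idx in enumerate(ranked_ids, start=1):
        if gallery_pids[idx] == query_pid:
            return rank
    return None

# Toy data (hypothetical): four gallery images with 2-D embeddings and identity labels.
gallery_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.9, 0.1]]
gallery_pids = [3, 3, 5, 7]
query_vec, query_pid = [1.0, 0.2], 3  # text-query embedding and its true identity

ranked = top_k(query_vec, gallery_vecs, k=3)
rank = first_hit_rank(ranked, gallery_pids, query_pid)

# The query counts as a Rank-3 hit if a correct image appears among the first three results.
rank3_hit = rank is not None and rank <= 3
```

In this toy case the nearest gallery image belongs to a different identity, so the first correct match appears at position 2, which still counts as a hit under Rank-3.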

[Figure 9: for each text query, two rows of retrieved images, labeled "Baseline" and "Ours".]

Query 1: "A dark haired woman is wearing a white dress. She walking away from us, so we can not see her face. Her hair is long and straight and her dress comes down to her knees."

Query 2: "The women is wearing a red dress and flip flops. She is wearing a black watch and a brown and checkered patterned purse on her right shoulder."

Query 3: "The man is wearing a white shirt and grey slacks. He is carrying a black bag over his left shoulder."

Figure 9  Comparison of the top-10 retrieval results of the Baseline and our method for each text query on CUHK-PEDES
