Journal of Software (《软件学报》), 2025, No. 10

Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
2025,36(10):4846−4863 [doi: 10.13328/j.cnki.jos.007293] [CSTR: 32375.14.jos.007293] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563

Implicit Multi-scale Alignment and Interaction Method for Text-image Person Re-identification*

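SUN Rui, DU Yun, CHEN Long, ZHANG Xu-Dong

(School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, Anhui, China)
Corresponding author: DU Yun, E-mail: 2022171225@mail.hfut.edu.cn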

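Abstract: Text-image person re-identification aims to retrieve a target person from an image gallery using a text description. The main challenge is to embed image and text features into a common latent space so as to achieve cross-modal alignment. Many existing works use separately pre-trained unimodal models to extract visual and textual features, and then rely on segmentation or attention mechanisms to obtain explicit cross-modal alignment. However, these explicit alignment methods usually lack the underlying alignment capability needed to match multimodal features effectively, and using preset cross-modal correspondences to achieve explicit alignment may distort intra-modal information. This paper proposes an implicit multi-scale alignment and interaction method for text-image person re-identification. First, a semantically consistent feature pyramid network extracts multi-scale image features, and attention weights are used to fuse features of different scales that contain global and local information. Second, a multivariate interaction attention mechanism learns the associations between images and text. This mechanism effectively captures the correspondences between different visual features and textual information, narrows the gap between modalities, and achieves implicit multi-scale semantic alignment. In addition, a foreground enhancement discriminator is used to enhance the target person and extract purer person features, which helps alleviate the information inequality between images and text. Experimental results on three mainstream text-image person re-identification datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, show that the proposed method effectively improves cross-modal retrieval performance, with Rank-1 accuracy 2%–9% higher than that of SOTA algorithms.
Keywords: text-image person re-identification; implicit alignment; multi-scale fusion; multivariate interaction attention; semantic alignment
CLC number: TP391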

Citation format (Chinese): 孙锐, 杜云, 陈龙, 张旭东. 隐式多尺度对齐与交互的文本-图像行人重识别方法. 软件学报, 2025, 36(10): 4846–4863. http://www.jos.org.cn/1000-9825/7293.htm
Citation format (English): Sun R, Du Y, Chen L, Zhang XD. Implicit Multi-scale Alignment and Interaction for Text-image Person Re-identification Method. Ruan Jian Xue Bao/Journal of Software, 2025, 36(10): 4846–4863 (in Chinese). http://www.jos.org.cn/1000-9825/7293.htm

Implicit Multi-scale Alignment and Interaction for Text-image Person Re-identification Method
SUN Rui, DU Yun, CHEN Long, ZHANG Xu-Dong
(School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)
Abstract: Text-image person re-identification aims to use a text description to retrieve the target person from an image
database. The main challenge of this technology is to embed image and text features into a common latent space to achieve cross-modal
alignment. Many existing studies adopt separately pre-trained unimodal models to extract visual and text features, and then employ
segmentation or attention mechanisms to obtain explicit cross-modal alignment. However, these explicit alignment methods generally lack
the underlying alignment ability needed to effectively match multimodal features, and using preset cross-modal correspondences
to achieve explicit alignment may result in intra-modal information distortion. An implicit multi-scale alignment and interaction method for
text-image person re-identification is proposed. Firstly, a semantically consistent feature pyramid network is employed to extract multi-scale
features of the images, and attention weights are adopted to fuse features of different scales that include global and local information. Secondly,
the association between image and text is learned using a multivariate interaction attention mechanism, which can effectively capture the
correspondences between different visual features and text information, narrow the gap between modalities, and achieve implicit
multi-scale semantic alignment. Additionally, a foreground enhancement discriminator is adopted to enhance the target person and extract

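* Funding: National Natural Science Foundation of China (61876057); Natural Science Foundation of Anhui Province (2208085MF158); Key Research and Development Program of Anhui Province (202004d07020012)
Received: 2023-10-24; Revised: 2024-03-22; Accepted: 2024-09-19; Published online by JOS: 2025-05-14
CNKI online-first publication: 2025-05-15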
