Page 464 - Ruan Jian Xue Bao/Journal of Software, 2025, No. 10


Sun R, et al.: Text-image person re-identification method with implicit multi-scale alignment and interaction 4861

4 Conclusion

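This paper proposes a text-image person re-identification method based on implicit multi-scale alignment and multivariate interaction attention. First, the method uses a semantically consistent feature pyramid (SCFP) to extract and fuse image features at different scales, yielding feature maps that contain both global and local information. Second, multivariate interaction attention (MIA) learns the interactions between image and text features, narrowing the gap between the two modalities. Third, because images and text carry unequal amounts of information, we propose a foreground enhancement discriminator (FED) that filters out background information and strengthens foreground features. Finally, we ran ablation studies and comparisons with existing SOTA methods on three popular benchmark datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid; the results demonstrate the feasibility and effectiveness of the proposed framework for text-based person retrieval.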

