Journal of Software (软件学报), 2025, Vol. 36, No. 10

person retrieval. In: Proc. of the 29th ACM Int’l Conf. on Multimedia. ACM, 2021. 209–217. [doi: 10.1145/3474085.3475369]
[19] Sarafianos N, Xu X, Kakadiaris I. Adversarial representation learning for text-to-image matching. In: Proc. of the 2019 IEEE/CVF Int’l
Conf. on Computer Vision (ICCV). Seoul: IEEE, 2019. 5813–5823. [doi: 10.1109/ICCV.2019.00591]
[20] Gao CY, Cai GY, Jiang XY, Zheng F, Zhang J, Gong YF, Peng P, Guo XW, Sun X. Contextual non-local alignment over full-scale
representation for text-based person search. arXiv:2101.03036, 2021.
[21] Niu K, Huang Y, Ouyang WL, Wang L. Improving description-based person re-identification by multi-granularity image-text alignments.
IEEE Trans. on Image Processing, 2020, 29: 5542–5556. [doi: 10.1109/TIP.2020.2984883]
[22] Chen DP, Li HS, Liu XH, Shen YT, Shao J, Yuan ZJ, Wang XG. Improving deep visual representation for person re-identification by
global and local image-language association. In: Proc. of the 15th European Conf. on Computer Vision (ECCV). Munich: Springer, 2018.
56–73. [doi: 10.1007/978-3-030-01270-0_4]
[23] Liu JW, Zha ZJ, Hong RC, Wang M, Zhang YD. Deep adversarial graph attention convolution network for text-based person search. In:
Proc. of the 27th ACM Int’l Conf. on Multimedia. Nice: ACM, 2019. 665–673. [doi: 10.1145/3343031.3350991]
[24] Ding ZF, Ding CX, Shao ZY, Tao DC. Semantically self-aligned network for text-to-image part-aware person re-identification.
arXiv:2107.12666, 2021.
[25] Chen YC, Li LJ, Yu LC, El Kholy A, Ahmed F, Gan Z, Cheng Y, Liu JJ. UNITER: Universal image-text representation learning. In:
Proc. of the 16th European Conf. on Computer Vision (ECCV). Glasgow: Springer, 2020. 104–120. [doi: 10.1007/978-3-030-58577-8_7]
[26] Jia C, Yang YF, Xia Y, Chen YT, Parekh Z, Pham H, Le Q, Sung YH, Li Z, Duerig T. Scaling up visual and vision-language
representation learning with noisy text supervision. In: Proc. of the 38th Int’l Conf. on Machine Learning. 2021. 4904–4916.
[27] Antol S, Agrawal A, Lu JS, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual question answering. In: Proc. of the 2015 IEEE Int’l
Conf. on Computer Vision (ICCV). Santiago: IEEE, 2015. 2425–2433. [doi: 10.1109/ICCV.2015.279]
[28] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning
transferable visual models from natural language supervision. In: Proc. of the 38th Int’l Conf. on Machine Learning. 2021. 8748–8763.
[29] Yan SL, Dong N, Zhang LY, Tang JH. CLIP-driven fine-grained text-image person re-identification. IEEE Trans. on Image Processing,
2023, 32: 6032–6046. [doi: 10.1109/TIP.2023.3327924]
[30] Chen WJ, Yao LL, Jin Q. Rethinking benchmarks for cross-modal image-text retrieval. In: Proc. of the 46th Int’l ACM SIGIR Conf. on
Research and Development in Information Retrieval. Taipei: ACM, 2023. 1241–1251. [doi: 10.1145/3539618.3591758]
[31] Zhou KY, Yang JK, Loy CC, Liu ZW. Learning to prompt for vision-language models. Int’l Journal of Computer Vision, 2022, 130(9):
2337–2348. [doi: 10.1007/s11263-022-01653-1]
[32] Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. arXiv:1508.07909, 2016.
[33] Wu YS, Yan ZZ, Han XG, Li GB, Zou CQ, Cui SG. LapsCore: Language-guided person search via color reasoning. In: Proc. of the 2021
IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 1604–1613. [doi: 10.1109/ICCV48922.2021.00165]
[34] Yan SL, Tang H, Zhang LY, Tang JH. Image-specific information suppression and implicit local alignment for text-based person search.
IEEE Trans. on Neural Networks and Learning Systems, 2024, 35(12): 17973–17986. [doi: 10.1109/TNNLS.2023.3310118]
[35] Wang ZJ, Zhu AC, Xue JY, Wan XL, Liu C, Wang T, Li YF. Look before you leap: Improving text-based person retrieval by learning a
consistent cross-modal common manifold. In: Proc. of the 30th ACM Int’l Conf. on Multimedia. Lisboa: ACM, 2022. 1984–1992. [doi:
10.1145/3503161.3548166]
[36] Han X, He S, Zhang L, Xiang T. Text-based person search with limited data. arXiv:2110.10807, 2021.
[37] Li SP, Cao M, Zhang M. Learning semantic-aligned feature representation for text-based person search. In: Proc. of the 2022 IEEE Int’l
Conf. on Acoustics, Speech and Signal Processing (ICASSP). Singapore: IEEE, 2022. 2724–2728. [doi: 10.1109/ICASSP43922.2022.9746846]
[38] Chen YH, Zhang GQ, Lu YJ, Wang ZX, Zheng YH. TIPCB: A simple but effective part-based convolutional baseline for text-based
person search. Neurocomputing, 2022, 494: 171–181. [doi: 10.1016/j.neucom.2022.04.081]
[39] Wang ZJ, Zhu AC, Xue JY, Wan XL, Liu C, Wang T, Li YF. CAIBC: Capturing all-round information beyond color for text-based
person retrieval. In: Proc. of the 30th ACM Int’l Conf. on Multimedia. Lisboa: ACM, 2022. 5314–5322. [doi: 10.1145/3503161.3548057]
[40] Farooq A, Awais M, Kittler J, Khalid SS. AXM-Net: Implicit cross-modal feature alignment for person re-identification. In: Proc. of the
36th AAAI Conf. on Artificial Intelligence. Virtually: AAAI, 2022. 4477–4485. [doi: 10.1609/aaai.v36i4.20370]
[41] Shao ZY, Zhang XY, Fang M, Lin ZF, Wang J, Ding CX. Learning granularity-unified representations for text-to-image person re-identification.
In: Proc. of the 30th ACM Int’l Conf. on Multimedia. Lisboa: ACM, 2022. 5566–5574. [doi: 10.1145/3503161.3548028]
[42] Shu XJ, Wen W, Wu HQ, Chen KY, Song YR, Qiao RZ, Ren B, Wang X. See finer, see more: Implicit modality alignment for text-based
person retrieval. In: Proc. of the 2022 European Conf. on Computer Vision (ECCV). Tel Aviv: Springer, 2022. 624–641. [doi: 10.1007/
