modeling implicit relations. Experimental results on the implicit dataset show that ICM outperforms existing traditional PVG methods, verifying its effectiveness in modeling implicit relations. Moreover, ICM remains advantageous compared with several multimodal large language models, which indicates that existing multimodal large language models have not yet modeled implicit relations effectively. In future work, we plan to introduce additional information (such as knowledge graphs) to help align implicit phrase-region pairs. We also plan to transfer the ICM method to other tasks in which implicit relations arise, such as referring expression comprehension (REC) and video grounding.

Acknowledgments: This work was partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

