
Journal of Software (软件学报), ISSN 1000-9825, CODEN RUXUEW    E-mail: jos@iscas.ac.cn
2025, 36(9): 4207−4222 [doi: 10.13328/j.cnki.jos.007303] [CSTR: 32375.14.jos.007303]    http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved.    Tel: +86-10-62562563



Implicit-enhanced Causal Modeling Method for Phrasal Visual Grounding*

ZHAO Jia-Ning (赵嘉宁), WANG Jing-Jing (王晶晶), LUO Jia-Min (罗佳敏), ZHOU Guo-Dong (周国栋)

(School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
Corresponding author: WANG Jing-Jing, E-mail: djingwang@suda.edu.cn

Abstract: Phrasal visual grounding, a fundamental and important task in multimodal research, aims to predict fine-grained alignment between textual phrases and image regions. Although existing phrasal visual grounding approaches have made considerable progress, they overlook the implicit alignment between phrases in the text and their corresponding image regions (i.e., implicit phrase-region alignment), even though predicting such alignment is an effective way to evaluate how well a model understands deep multimodal semantics. To model implicit phrase-region alignment effectively, this study proposes an implicit-enhanced causal modeling (ICM) approach for phrasal visual grounding, which uses the intervention strategy from causal inference to mitigate the confounding information introduced by shallow semantics. To evaluate models' ability to understand deep multimodal semantics, this study also annotates a high-quality implicit dataset and conducts extensive experiments. Results from multiple sets of comparative experiments show that the proposed ICM approach models implicit phrase-region alignment effectively. Furthermore, on this implicit dataset, ICM outperforms several advanced multimodal large language models (MLLMs), which should encourage more MLLM research targeting implicit scenarios.

Key words: implicit phrase-region alignment; causal inference; phrasal visual grounding
Chinese Library Classification number: TP18

Citation: Zhao JN, Wang JJ, Luo JM, Zhou GD. Implicit-enhanced Causal Modeling Method for Phrasal Visual Grounding. Ruan Jian Xue Bao/Journal of Software, 2025, 36(9): 4207–4222 (in Chinese). http://www.jos.org.cn/1000-9825/7303.htm

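Cross-modal understanding of visual scenes and natural-language descriptions plays a vital role in connecting human and machine intelligence [1]. A central problem is how to establish fine-grained alignment between visual regions and the phrases that describe them, which is commonly referred to as phrasal visual grounding (phrasal visual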


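* Funding: National Natural Science Foundation of China (62006166, 62076175, 62076176); Priority Academic Program Development of Jiangsu Higher Education Institutions
Received: 2023-11-01; Revised: 2024-07-09; Accepted: 2024-10-09; Published online by JOS: 2025-02-26
CNKI online-first publication: 2025-02-27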