Journal of Software (《软件学报》), 2025, Issue 9
Journal of Software (软件学报), ISSN 1000-9825, CODEN RUXUEW, E-mail: jos@iscas.ac.cn
2025,36(9):4207−4222 [doi: 10.13328/j.cnki.jos.007303] [CSTR: 32375.14.jos.007303] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
Causal Modeling Method for Improving Phrasal Visual Grounding in Implicit Scenarios *
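ZHAO Jia-Ning (赵嘉宁), WANG Jing-Jing (王晶晶), LUO Jia-Min (罗佳敏), ZHOU Guo-Dong (周国栋)
(School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China)
Corresponding author: WANG Jing-Jing, E-mail: djingwang@suda.edu.cn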
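Abstract: Phrasal visual grounding is a fundamental and important task in multimodal research that aims to predict fine-grained alignment between textual phrases and image regions. Although existing phrasal visual grounding methods have made good progress, they all overlook the implicit alignment between a phrase in the text and its corresponding image region (i.e., implicit phrase-region alignment), even though predicting this relationship is an effective way to evaluate a model's ability to understand deep multimodal semantics. To model implicit phrase-region alignment effectively, this paper proposes an implicit-enhanced causal modeling method for phrasal visual grounding. The method uses the intervention strategy from causal inference to mitigate the confounding information introduced by shallow semantics. To evaluate a model's ability to understand deep multimodal semantics, a high-quality implicit dataset is annotated and extensive experiments are conducted. Multiple sets of comparative experiments show that the proposed method models implicit phrase-region alignment effectively. Moreover, on this implicit dataset the proposed method outperforms several advanced multimodal large language models, which should encourage further research on multimodal large models in implicit scenarios.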
Keywords: implicit phrase-region alignment; causal inference; phrasal visual grounding
Chinese Library Classification (CLC) number: TP18
Chinese citation format: 赵嘉宁, 王晶晶, 罗佳敏, 周国栋. 提升隐式场景下短语视觉定位的因果建模方法. 软件学报, 2025, 36(9): 4207–4222. http://www.jos.org.cn/1000-9825/7303.htm
English citation format: Zhao JN, Wang JJ, Luo JM, Zhou GD. Implicit-enhanced Causal Modeling Method for Phrasal Visual Grounding. Ruan Jian Xue Bao/Journal of Software, 2025, 36(9): 4207–4222 (in Chinese). http://www.jos.org.cn/1000-9825/7303.htm
Implicit-enhanced Causal Modeling Method for Phrasal Visual Grounding
ZHAO Jia-Ning, WANG Jing-Jing, LUO Jia-Min, ZHOU Guo-Dong
(School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
Abstract: Phrasal visual grounding, a fundamental and critical research task in the field of multimodal studies, aims at predicting fine-
grained alignment relationships between textual phrases and image regions. Despite the remarkable progress achieved by existing phrasal
visual grounding approaches, they all ignore the implicit alignment relationships between textual phrases and their corresponding image
regions, commonly referred to as implicit phrase-region alignment. Predicting such relationships can effectively evaluate the ability of
models to understand deep multimodal semantics. Therefore, to effectively model implicit phrase-region alignment relationships, this study
proposes an implicit-enhanced causal modeling (ICM) approach for phrasal visual grounding, which employs the intervention strategies of
causal reasoning to mitigate the confusion caused by shallow semantics. To evaluate models’ ability to understand deep multimodal
semantics, this study annotates a high-quality implicit dataset and conducts a large number of experiments. Multiple sets of comparative
experimental results demonstrate the effectiveness of the proposed ICM approach in modeling implicit phrase-region alignment
relationships. Furthermore, the proposed ICM approach outperforms some advanced multimodal large language models (MLLMs) on the
implicit dataset, further promoting the research of MLLMs towards more implicit scenarios.
Key words: implicit phrase-region alignment; causal inference; phrasal visual grounding
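Cross-modal understanding of visual scenes and natural language descriptions plays a vital role in connecting human and machine intelligence [1]. One of the main problems is how to establish fine-grained alignment between visual regions and their related phrase descriptions, which is commonly called 短语视觉定位 (phrasal visual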
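* Funding: National Natural Science Foundation of China (62006166, 62076175, 62076176); Priority Academic Program Development of Jiangsu Higher Education Institutions
Received: 2023-11-01; Revised: 2024-07-09; Accepted: 2024-10-09; Published online by JOS: 2025-02-26
CNKI online-first publication: 2025-02-27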

