Journal of Software (软件学报), 2021, Issue 8
Journal of Software (软件学报), ISSN 1000-9825, CODEN RUXUEW. E-mail: jos@iscas.ac.cn
Journal of Software,2021,32(8):2522−2544 [doi: 10.13328/j.cnki.jos.006215] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
Survey on Visual Question Answering (视觉问答研究综述)∗
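BAO Xi-Gang (包希港), ZHOU Chun-Lai (周春来), XIAO Ke-Jing (肖克晶), QIN Biao (覃飙)
(School of Information, Renmin University of China, Beijing 100872, China)
Corresponding author: QIN Biao, E-mail: qinbiao@ruc.edu.cn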
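Abstract: Visual question answering (VQA) lies at the intersection of computer vision and natural language processing and has received wide attention in recent years. In a VQA task, an algorithm must answer questions about a given image (or video). Since the first VQA dataset was released in 2014, several large-scale datasets have been released in the past five years, and a large number of algorithms have been proposed on the basis of them. Existing surveys focus on summarizing the development of the VQA task. In recent years, however, studies have found that VQA models rely heavily on language bias and on the distribution of the datasets; in particular, since the release of the VQA-CP dataset, the performance of many models has dropped substantially. This paper presents in detail the algorithms proposed and the datasets released in recent years, with particular attention to research on strengthening the robustness of algorithms. It classifies and summarizes VQA algorithms and describes their motivations, details, and limitations. Finally, it discusses the challenges and prospects of the VQA task.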
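Key words: visual question answering; interdisciplinary direction; language bias; dataset distribution; robustness
Chinese Library Classification number: TP18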
Citation format (Chinese): 包希港,周春来,肖克晶,覃飙.视觉问答研究综述.软件学报,2021,32(8):2522−2544. http://www.jos.org.cn/1000-9825/6215.htm
Citation format (English): Bao XG, Zhou CL, Xiao KJ, Qin B. Survey on visual question answering. Ruan Jian Xue Bao/Journal of Software,
2021,32(8):2522−2544 (in Chinese). http://www.jos.org.cn/1000-9825/6215.htm
Survey on Visual Question Answering
BAO Xi-Gang, ZHOU Chun-Lai, XIAO Ke-Jing, QIN Biao
(School of Information, Renmin University of China, Beijing 100872, China)
Abstract: Visual question answering (VQA) lies at the intersection of computer vision and natural language
processing and has received extensive attention in recent years. In visual question answering, an algorithm is required to answer
questions about a specific picture (or video). Since the first visual question answering dataset was released in 2014, several large-scale
datasets have been released in the past five years, and a large number of algorithms have been proposed based on them. Existing surveys
have focused on the development of visual question answering. In recent years, however, visual question answering models have been found
to rely heavily on language bias and the distribution of the datasets; in particular, since the release of the VQA-CP dataset, the accuracy
of many models has dropped substantially. This paper mainly introduces the algorithms proposed and the datasets released in recent years,
and in particular discusses research on strengthening the robustness of algorithms. The algorithms for visual question answering are
classified and summarized, and their motivations, details, and limitations are introduced. Finally, the challenges and prospects of visual
question answering are discussed.
Key words: visual question answering; interdisciplinary direction; language bias; distribution of datasets; robustness
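∗ Foundation item: National Natural Science Foundation of China (61772534, 61732006)
Received 2020-07-09; Revised 2020-10-02; Accepted 2020-11-23; JOS published online 2021-01-15
Visual question answering is a challenging task in artificial intelligence that lies at the intersection of computer vision and natural language processing. Previously, however, computer vision and natural language processing developed separately, and each made major advances in its own field. With computer vision and deep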