Journal of Software (《软件学报》), 2021, No. 8, p. 257

包希港 et al.: Survey on Visual Question Answering, p. 2539


future research directions.

(3) Improving model robustness and generalization

First, the various biases present in the datasets should be reduced as much as possible, and the answer distributions should be made more reasonable, so that models cannot exploit dataset biases to obtain answers without performing any reasoning. On the modeling side, multiple approaches should be developed in combination, for example applying compositional methods together with attention methods. If visual question answering models are expected to answer arbitrary questions, they must also consider exploiting external knowledge.

5    Conclusion

This paper has surveyed the current state of visual question answering research, introduced the main datasets in use, and analyzed the biases present in them. We summarized the mainstream modeling approaches: joint embedding serves as the foundation of almost all models; attention methods help models focus on particular regions of the image or the important words in the question; compositional methods and graph structures make models emphasize the reasoning process, matching the logic humans follow when answering questions; and external knowledge enables models to answer more complex questions. Some studies target the various robustness problems of existing models, such as language bias, the difficulty of counting caused by soft attention, and the difficulty of answering questions about text appearing in images. Beyond this, we believe that the bottleneck of current visual question answering models is that the extracted features are insufficient to answer the questions. We are confident that, as the various computer vision tasks continue to advance, the goal of the visual question answering task will eventually be achieved.

