Journal of Software (《软件学报》), 2021, No. 8, p. 257

包希港 et al.: Survey on Visual Question Answering, p. 2539


future research directions.

(3) Improving model robustness and generalization

First, the various biases present in the datasets should be reduced as much as possible, and the answer distributions should be made more reasonable, so that models cannot exploit dataset biases to obtain answers without performing any reasoning. On the modeling side, multiple approaches should be developed in combination, for example applying compositional methods together with attention methods. If visual question answering models are expected to answer arbitrary questions, they must also consider exploiting external knowledge.

5    Conclusion

This paper has surveyed the current state of visual question answering research, introduced the main datasets in use, and analyzed the biases present in them. We summarized the mainstream modeling approaches: joint embedding serves as the foundation of almost all models; attention methods help models focus on particular regions of the image or the important words in the question; compositional methods and graph structures make models emphasize the reasoning process, matching the logic humans follow when answering questions; and external knowledge enables models to answer more complex questions. Some studies target the various robustness problems of existing models, such as language bias, the difficulty of counting caused by soft attention, and the difficulty of answering questions about text appearing in images. Beyond this, we believe that the bottleneck of current visual question answering models is that the extracted features are insufficient to answer the questions. We are confident that, as the various computer vision tasks continue to advance, the goal of the visual question answering task will eventually be achieved.

