[35] Yu Z, Yu J, Fan J, Tao D. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proc. of the IEEE Int'l Conf. on Computer Vision. 2017. 1821−1830. [doi: 10.1109/ICCV.2017.202]
[36] Yu Z, Yu J, Xiang C, Fan J, Tao D. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. on Neural Networks and Learning Systems, 2018,29(12):5947−5959.
[37] Ben-Younes H, Cadene R, Cord M, Thome N. MUTAN: Multimodal Tucker fusion for visual question answering. In: Proc. of the IEEE Int'l Conf. on Computer Vision. 2017. 2612−2620. [doi: 10.1109/ICCV.2017.285]
[38] Ben-Younes H, Cadene R, Thome N, Cord M. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In: Proc. of the AAAI Conf. on Artificial Intelligence, Vol.33. 2019. 8102−8109.
[39] Kim JH, Lee SW, Kwak D, Heo MO, Kim J, Ha JW, Zhang BT. Multimodal residual learning for visual QA. In: Proc. of the Advances in Neural Information Processing Systems. 2016. 361−369.
[40] Saito K, Shin A, Ushiku Y, Harada T. DualNet: Domain-invariant network for visual question answering. In: Proc. of the IEEE Int'l Conf. on Multimedia and Expo (ICME). IEEE, 2017. 829−834.
[41] Gao P, You H, Zhang Z, Wang X, Li H. Multi-modality latent interaction network for visual question answering. In: Proc. of the IEEE Int'l Conf. on Computer Vision. 2019. 5825−5835. [doi: 10.1109/ICCV.2019.00592]
[42] Do T, Do TT, Tran H, Tjiputra E, Tran QD. Compact trilinear interaction for visual question answering. In: Proc. of the IEEE Int'l Conf. on Computer Vision. 2019. 392−401. [doi: 10.1109/ICCV.2019.00048]
[43] Bro R, Harshman RA, Sidiropoulos ND, Lundy ME. Modeling multiway data with linearly dependent loadings. Journal of Chemometrics, 2009,23(7-8):324−340.
[44] Wang W, Shen J, Dong X, Borji A. Salient object detection driven by fixation prediction. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2018. 1711−1720. [doi: 10.1109/CVPR.2018.00184]
[45] Ke L, Pei W, Li R, Shen X, Tai YW. Reflective decoding network for image captioning. In: Proc. of the IEEE Int'l Conf. on Computer Vision. 2019. 8888−8897. [doi: 10.1109/ICCV.2019.00898]
[46] Xiao T, Li Y, Zhu J, Yu Z, Liu T. Sharing attention weights for fast transformer. In: Proc. of the Int'l Joint Conf. on Artificial Intelligence. 2019. 5292−5298.
[47] Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel RS, Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. In: Proc. of the Int'l Conf. on Machine Learning. 2015. 2048−2057.
[48] Zhu Y, Groth O, Bernstein M, Li F. Visual7W: Grounded question answering in images. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2016. 4995−5004. [doi: 10.1109/CVPR.2016.540]
[49] Shih KJ, Singh S, Hoiem D. Where to look: Focus regions for visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2016. 4613−4621. [doi: 10.1109/CVPR.2016.499]
[50] Patro B, Namboodiri VP. Differential attention for visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2018. 7680−7688. [doi: 10.1109/CVPR.2018.00801]
[51] Lu J, Yang J, Batra D, Parikh D. Hierarchical question-image co-attention for visual question answering. In: Proc. of the Advances in Neural Information Processing Systems. 2016. 289−297.
[52] Nam H, Ha JW, Kim J. Dual attention networks for multimodal reasoning and matching. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2017. 299−307. [doi: 10.1109/CVPR.2017.232]
[53] Nguyen DK, Okatani T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2018. 6087−6096. [doi: 10.1109/CVPR.2018.00637]
[54] Yu D, Fu J, Mei T, Rui Y. Multi-level attention networks for visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2017. 4709−4717. [doi: 10.1109/CVPR.2017.446]
[55] Wang P, Wu Q, Shen C, Van Den Hengel A. The VQA-machine: Learning how to use existing vision algorithms to answer new questions. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2017. 1173−1182. [doi: 10.1109/CVPR.2017.416]