[35] Yu Z, Yu J, Fan J, Tao D. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proc. of the IEEE Int’l Conf. on Computer Vision. 2017. 1821−1830. [doi: 10.1109/ICCV.2017.202]
[36] Yu Z, Yu J, Xiang C, Fan J, Tao D. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question
answering. IEEE Trans. on Neural Networks and Learning Systems, 2018,29(12):5947−5959.
[37] Ben-Younes H, Cadene R, Cord M, Thome N. MUTAN: Multimodal Tucker fusion for visual question answering. In: Proc. of the IEEE Int’l Conf. on Computer Vision. 2017. 2612−2620. [doi: 10.1109/ICCV.2017.285]
[38] Ben-Younes H, Cadene R, Thome N, Cord M. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In: Proc. of the AAAI Conf. on Artificial Intelligence, Vol.33. 2019. 8102−8109.
[39] Kim JH, Lee SW, Kwak D, Heo MO, Kim J, Ha JW, Zhang BT. Multimodal residual learning for visual QA. In: Proc. of the Advances in Neural Information Processing Systems. 2016. 361−369.
[40] Saito K, Shin A, Ushiku Y, Harada T. DualNet: Domain-invariant network for visual question answering. In: Proc. of the IEEE Int’l Conf. on Multimedia and Expo (ICME). 2017. 829−834.
[41] Gao P, You H, Zhang Z, Wang X, Li H. Multi-modality latent interaction network for visual question answering. In: Proc. of the
IEEE Int’l Conf. on Computer Vision. 2019. 5825−5835. [doi: 10.1109/ICCV.2019.00592]
[42] Do T, Do TT, Tran H, Tjiputra E, Tran QD. Compact trilinear interaction for visual question answering. In: Proc. of the IEEE
Int’l Conf. on Computer Vision. 2019. 392−401. [doi: 10.1109/ICCV.2019.00048]
[43] Bro R, Harshman RA, Sidiropoulos ND, Lundy ME. Modeling multiway data with linearly dependent loadings. Journal of Chemometrics, 2009,23(7-8):324−340.
[44] Wang W, Shen J, Dong X, Borji A. Salient object detection driven by fixation prediction. In: Proc. of the IEEE Conf. on
Computer Vision and Pattern Recognition. 2018. 1711−1720. [doi: 10.1109/CVPR.2018.00184]
[45] Ke L, Pei W, Li R, Shen X, Tai Y. Reflective decoding network for image captioning. In: Proc. of the IEEE Int’l Conf. on
Computer Vision. 2019. 8888−8897. [doi: 10.1109/ICCV.2019.00898]
[46] Xiao T, Li Y, Zhu J, Yu Z, Liu T. Sharing attention weights for fast transformer. In: Proc. of the Int’l Joint Conf. on Artificial
Intelligence. 2019. 5292−5298.
[47] Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel RS, Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. In: Proc. of the Int’l Conf. on Machine Learning. 2015. 2048−2057.
[48] Zhu Y, Groth O, Bernstein M, Li F. Visual7W: Grounded question answering in images. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2016. 4995−5004. [doi: 10.1109/CVPR.2016.540]
[49] Shih KJ, Singh S, Hoiem D. Where to look: Focus regions for visual question answering. In: Proc. of the IEEE Conf. on
Computer Vision and Pattern Recognition. 2016. 4613−4621. [doi: 10.1109/CVPR.2016.499]
[50] Patro B, Namboodiri VP. Differential attention for visual question answering. In: Proc. of the IEEE Conf. on Computer Vision
and Pattern Recognition. 2018. 7680−7688. [doi: 10.1109/CVPR.2018.00801]
[51] Lu J, Yang J, Batra D, Parikh D. Hierarchical question-image co-attention for visual question answering. In: Proc. of the
Advances in Neural Information Processing Systems. 2016. 289−297.
[52] Nam H, Ha JW, Kim J. Dual attention networks for multimodal reasoning and matching. In: Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition. 2017. 299−307. [doi: 10.1109/CVPR.2017.232]
[53] Nguyen DK, Okatani T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2018. 6087−6096. [doi: 10.1109/CVPR.2018.00637]
[54] Yu D, Fu J, Mei T, Rui Y. Multi-level attention networks for visual question answering. In: Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition. 2017. 4709−4717. [doi: 10.1109/CVPR.2017.446]
[55] Wang P, Wu Q, Shen C, Van Den Hengel A. The VQA-machine: Learning how to use existing vision algorithms to answer new questions. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2017. 1173−1182. [doi: 10.1109/CVPR.2017.416]