Page 258 - Journal of Software (软件学报), 2021, No. 8
2540 Journal of Software 软件学报 Vol.32, No.8, August 2021
[15] Wu Q, Teney D, Wang P, Shen CH, Dick A, Van Den Hengel A. Visual question answering: A survey of methods and datasets.
Computer Vision and Image Understanding, 2017,163:21−40.
[16] Kafle K, Kanan C. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image
Understanding, 2017,163:3−20.
[17] Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the V in VQA matter: Elevating the role of image understanding
in visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2017. 6904−6913. [doi:
10.1109/CVPR.2017.670]
[18] Ramakrishnan S, Agrawal A, Lee S. Overcoming language priors in visual question answering with adversarial regularization. In:
Proc. of the Advances in Neural Information Processing Systems. 2018. 1541−1551.
[19] Yang Z, He X, Gao J, Deng L, Smola A. Stacked attention networks for image question answering. In: Proc. of the IEEE Conf. on
Computer Vision and Pattern Recognition. 2016. 21−29. [doi: 10.1109/CVPR.2016.10]
[20] Deng J, Dong W, Socher R, Li L, Li K, Li F. ImageNet: A large-scale hierarchical image database. In: Proc. of the IEEE Conf. on
Computer Vision and Pattern Recognition. 2009. 248−255. [doi: 10.1109/CVPR.2009.5206848]
[21] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proc. of the Int’l Conf. on
Learning Representations. 2015.
[22] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proc. of the IEEE Conf. on Computer Vision and
Pattern Recognition. 2016. 770−778. [doi: 10.1109/CVPR.2016.90]
[23] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with
convolutions. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2015. 1−9.
[doi: 10.1109/CVPR.2015.7298594]
[24] Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L. Bottom-up and top-down attention for image captioning
and visual question answering. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2018. 6077−6086. [doi:
10.1109/CVPR.2018.00636]
[25] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. of the
Advances in Neural Information Processing Systems. 2015. 91−99.
[26] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997,9(8):1735−1780.
[27] Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using
RNN encoder-decoder for statistical machine translation. In: Proc. of the 2014 Conf. on Empirical Methods in Natural Language
Processing (EMNLP). 2014. 1724−1734.
[28] Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S. Skip-thought vectors. In: Proc. of the Advances in
Neural Information Processing Systems. 2015. 3294−3302.
[29] Malinowski M, Rohrbach M, Fritz M. Ask your neurons: A neural-based approach to answering questions about images. In: Proc.
of the IEEE Int’l Conf. on Computer Vision. 2015. 1−9. [doi: 10.1109/ICCV.2015.9]
[30] Gao H, Mao J, Zhou J, Huang Z, Wang L, Xu W. Are you talking to a machine? Dataset and methods for multilingual image
question. In: Proc. of the Advances in Neural Information Processing Systems. 2015. 2296−2304.
[31] Noh H, Hongsuck Seo P, Han B. Image question answering using convolutional neural network with dynamic parameter
prediction. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2016. 30−38. [doi: 10.1109/CVPR.2016.11]
[32] Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M. Multimodal compact bilinear pooling for visual question
answering and visual grounding. In: Proc. of the 2016 Conf. on Empirical Methods in Natural Language Processing. 2016.
457−468.
[33] Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li J, Shamma DA, Bernstein MS, Li F. Visual
genome: Connecting language and vision using crowdsourced dense image annotations. Int’l Journal of Computer Vision, 2017,
123(1):32−73.
[34] Kim JH, On KW, Lim W, Kim J, Ha J, Zhang B. Hadamard product for low-rank bilinear pooling. In: Proc. of the Int’l Conf. on
Learning Representations. 2017.