赵嘉宁 (2000-), male, master's student, CCF student member. His main research interest is natural language processing.

罗佳敏 (1997-), female, Ph.D. candidate, CCF student member. Her main research interest is natural language processing.

王晶晶 (1990-), male, Ph.D., associate professor, CCF professional member. His main research interest is natural language processing.

周国栋 (1967-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member. His main research interest is natural language processing.

