Page 281 - Ruan Jian Xue Bao/Journal of Software, 2025, Issue 4
Wang YS, et al.: A survey of multimodal information extraction 1687
using image retrieval and matching techniques, and then incorporating external knowledge, will help enable a wider range of classifications.
7 Conclusion
In recent years, with the rapid development of deep learning techniques, multimodal information extraction has attracted wide attention from researchers. This survey reviews the important work on multimodal information extraction published over the past six years, detailing the solutions proposed along the way to problems such as missing content in short texts, insufficient image-text interaction, and the noise that irrelevant image-text pairs may introduce. Further, taking a task-oriented view, this survey decomposes multimodal information extraction into four parts, namely multimodal representation and fusion, MNER, MERE, and MEE, and analyzes the methods developed for each part. Finally, it summarizes the research trends in multimodal information extraction and offers an outlook on future research directions, in the hope of providing a reference for researchers in related fields.