Ruan Jian Xue Bao/Journal of Software, 2025, No. 4
Wang YS, et al.: A survey on multimodal information extraction research                                                              1687


…using image retrieval and matching techniques, combined with external knowledge, would help enable a wider range of possible classifications.

7   Conclusion

In recent years, with the rapid development of deep learning techniques, the task of multimodal information extraction has attracted broad attention from researchers. This survey reviews the key publications on multimodal information extraction from the past six years, and describes in detail the solutions proposed over the course of this research to problems such as missing content in short texts, insufficient image-text interaction, and the noise that irrelevant image-text pairs may introduce. Further, taking a task-oriented view, this survey decomposes the research on multimodal information extraction into four parts: multimodal representation and fusion, MNER, MERE, and MEE, and then analyzes the methods for each of these four parts in turn. Finally, it summarizes the research trends in multimodal information extraction and offers an outlook on future research directions, in the hope of providing a reference for researchers in related fields.
