                     IEEE Conf. on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016. 5534–5542. [doi: 10.1109/CVPR.2016.597]
                 [38]  Lahat D, Adali T, Jutten C. Multimodal data fusion: An overview of methods, challenges, and prospects. Proc. of the IEEE, 2015, 103(9):
                     1449–1477. [doi: 10.1109/JPROC.2015.2460697]
                 [39]  Wang XW, Ye JB, Li ZX, Tian JF, Jiang Y, Yan M, Zhang J, Xiao YH. CAT-MNER: Multimodal named entity recognition with
                     knowledge-refined cross-modal attention. In: Proc. of the 2022 IEEE Int’l Conf. on Multimedia and Expo (ICME). Taipei: IEEE, 2022.
                     1–6. [doi: 10.1109/ICME52920.2022.9859972]
                 [40]  Chen X, Zhang NY, Li L, Yao YZ, Deng SM, Tan CQ, Huang F, Si L, Chen HJ. Good visual guidance make a better extractor:
                     Hierarchical visual prefix for multimodal entity and relation extraction. In: Findings of the Association for Computational Linguistics:
                     NAACL 2022. Seattle: ACL, 2022. 1607–1618. [doi: 10.18653/v1/2022.findings-naacl.121]
                 [41]  Xu B, Huang SZ, Du M, Wang HY, Song H, Sha CF, Xiao YH. Different data, different modalities! Reinforced data splitting for effective
                     multimodal information extraction from social media posts. In: Proc. of the 29th Int’l Conf. on Computational Linguistics. Gyeongju:
                     ACL, 2022. 1855–1864.
                 [42]  Sang EFTK, Veenstra J. Representing text chunks. In: Proc. of the 9th Conf. on European Chapter of the Association for Computational
                     Linguistics. Bergen: ACL, 1999. 173–179. [doi: 10.3115/977035.977059]
                 [43]  Li JY, Li H, Pan Z, Sun D, Wang JH, Zhang WK, Pan G. Prompting ChatGPT in MNER: Enhanced multimodal named entity recognition
                     with auxiliary refined knowledge. In: Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: ACL, 2023.
                     2787–2802. [doi: 10.18653/v1/2023.findings-emnlp.184]
                 [44]  Ji YZ, Li BB, Zhou J, Li F, Teng C, Ji DH. CMNER: A Chinese multimodal NER dataset based on social media. arXiv:2402.13693,
                     2024.
                 [45]  Wang JM, Li ZY, Yu JF, Yang L, Xia R. Fine-grained multimodal named entity recognition and grounding with a generative framework.
                     In: Proc. of the 31st ACM Int’l Conf. on Multimedia. Ottawa: ACM, 2023. 3934–3943. [doi: 10.1145/3581783.3612322]
                 [46]  Sui DB, Tian ZK, Chen YB, Liu K, Zhao J. A large-scale Chinese multimodal NER dataset with speech clues. In: Proc. of the 59th Annual
                     Meeting of the Association for Computational Linguistics and the 11th Int’l Joint Conf. on Natural Language Processing. ACL, 2021.
                     2807–2818. [doi: 10.18653/v1/2021.acl-long.218]
                 [47]  Wang XY, Cai J, Jiang Y, Xie PJ, Tu KW, Lu W. Named entity and relation extraction with multi-modal retrieval. In: Findings of the
                     Association for Computational Linguistics: EMNLP 2022. Abu Dhabi: ACL, 2022. 5925–5936. [doi: 10.18653/v1/2022.findings-emnlp.
                     437]
                 [48]  Lu JY, Zhang DX, Zhang JX, Zhang PJ. Flat multi-modal interaction transformer for named entity recognition. In: Proc. of the 29th
                     Int’l Conf. on Computational Linguistics. Gyeongju: ACL, 2022. 2055–2064.
                 [49]  Miyamoto Y, Cho K. Gated word-character recurrent language model. In: Proc. of the 2016 Conf. on Empirical Methods in Natural
                     Language Processing. Austin: ACL, 2016. 1992–1997. [doi: 10.18653/v1/D16-1209]
                 [50]  Shen T, Zhou TY, Long GD, Jiang J, Pan SR, Zhang CQ. DiSAN: Directional self-attention network for RNN/CNN-free language
                     understanding. In: Proc. of the 32nd AAAI Conf. on Artificial Intelligence. New Orleans: AAAI, 2018. 5446–5455. [doi: 10.1609/aaai.
                     v32i1.11941]
                 [51]  Vempala A, Preoţiuc-Pietro D. Categorizing and inferring the relationship between the text and image of Twitter posts. In: Proc. of the
                     57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019. 2830–2840. [doi: 10.18653/v1/P19-1272]
                 [52]  Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning
                     transferable visual models from natural language supervision. In: Proc. of the 38th Int’l Conf. on Machine Learning. 2021. 8748–8763.
                 [53]  Su WJ, Zhu XZ, Cao Y, Li B, Lu LW, Wei FR, Dai JF. VL-BERT: Pre-training of generic visual-linguistic representations. In: Proc. of
                     the 8th Int’l Conf. on Learning Representations. 2020. 1–16.
                 [54]  Zhou BH, Zhang Y, Song KH, Guo WY, Zhao GQ, Wang HB, Yuan XJ. A span-based multimodal variational autoencoder for semi-
                     supervised multimodal named entity recognition. In: Proc. of the 2022 Conf. on Empirical Methods in Natural Language Processing. Abu
                     Dhabi: ACL, 2022. 6293–6302. [doi: 10.18653/v1/2022.emnlp-main.422]
                 [55]  Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002, 14(8): 1771–1800. [doi: 10.
                     1162/089976602760128018]
                 [56]  Yamada I, Asai A, Shindo H, Takeda H, Matsumoto Y. LUKE: Deep contextualized entity representations with entity-aware self-
                     attention. In: Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2020. 6442–6454. [doi: 10.
                     18653/v1/2020.emnlp-main.523]
                 [57]  Chen F, Feng YJ. Chain-of-thought prompt distillation for multimodal named entity recognition and multimodal relation extraction.
                     arXiv:2306.14122, 2023.