《软件学报》 (Journal of Software), 2025, No. 4
Wang YS, et al.: A survey on multimodal information extraction 1689
IEEE Conf. on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016. 5534–5542. [doi: 10.1109/CVPR.2016.597]
[38] Lahat D, Adali T, Jutten C. Multimodal data fusion: An overview of methods, challenges, and prospects. Proc. of the IEEE, 2015, 103(9):
1449–1477. [doi: 10.1109/JPROC.2015.2460697]
[39] Wang XW, Ye JB, Li ZX, Tian JF, Jiang Y, Yan M, Zhang J, Xiao YH. CAT-MNER: Multimodal named entity recognition with
knowledge-refined cross-modal attention. In: Proc. of the 2022 IEEE Int’l Conf. on Multimedia and Expo (ICME). Taipei: IEEE, 2022.
1–6. [doi: 10.1109/ICME52920.2022.9859972]
[40] Chen X, Zhang NY, Li L, Yao YZ, Deng SM, Tan CQ, Huang F, Si L, Chen HJ. Good visual guidance make a better extractor:
Hierarchical visual prefix for multimodal entity and relation extraction. In: Findings of the Association for Computational Linguistics:
NAACL 2022. Seattle: ACL, 2022. 1607–1618. [doi: 10.18653/v1/2022.findings-naacl.121]
[41] Xu B, Huang SZ, Du M, Wang HY, Song H, Sha CF, Xiao YH. Different data, different modalities! Reinforced data splitting for effective
multimodal information extraction from social media posts. In: Proc. of the 29th Int’l Conf. on Computational Linguistics. Gyeongju:
ACL, 2022. 1855–1864.
[42] Sang EFTK, Veenstra J. Representing text chunks. In: Proc. of the 9th Conf. on European Chapter of the Association for Computational
Linguistics. Bergen: ACL, 1999. 173–179. [doi: 10.3115/977035.977059]
[43] Li JY, Li H, Pan Z, Sun D, Wang JH, Zhang WK, Pan G. Prompting ChatGPT in MNER: Enhanced multimodal named entity recognition
with auxiliary refined knowledge. In: Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: ACL, 2023.
2787–2802. [doi: 10.18653/v1/2023.findings-emnlp.184]
[44] Ji YZ, Li BB, Zhou J, Li F, Teng C, Ji DH. CMNER: A Chinese multimodal NER dataset based on social media. arXiv:2402.13693,
2024.
[45] Wang JM, Li ZY, Yu JF, Yang L, Xia R. Fine-grained multimodal named entity recognition and grounding with a generative framework.
In: Proc. of the 31st ACM Int’l Conf. on Multimedia. Ottawa: ACM, 2023. 3934–3943. [doi: 10.1145/3581783.3612322]
[46] Sui DB, Tian ZK, Chen YB, Liu K, Zhao J. A large-scale Chinese multimodal NER dataset with speech clues. In: Proc. of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th Int’l Joint Conf. on Natural Language Processing. ACL, 2021.
2807–2818. [doi: 10.18653/v1/2021.acl-long.218]
[47] Wang XY, Cai J, Jiang Y, Xie PJ, Tu KW, Lu W. Named entity and relation extraction with multi-modal retrieval. In: Findings of the
Association for Computational Linguistics: EMNLP 2022. Abu Dhabi: ACL, 2022. 5925–5936. [doi: 10.18653/v1/2022.findings-emnlp.437]
[48] Lu JY, Zhang DX, Zhang JX, Zhang PJ. Flat multi-modal interaction transformer for named entity recognition. In: Proc. of the 29th Int’l
Conf. on Computational Linguistics. Gyeongju: ACL, 2022. 2055–2064.
[49] Miyamoto Y, Cho K. Gated word-character recurrent language model. In: Proc. of the 2016 Conf. on Empirical Methods in Natural
Language Processing. Austin: ACL, 2016. 1992–1997. [doi: 10.18653/v1/D16-1209]
[50] Shen T, Zhou TY, Long GD, Jiang J, Pan SR, Zhang CQ. DiSAN: Directional self-attention network for RNN/CNN-free language
understanding. In: Proc. of the 32nd AAAI Conf. on Artificial Intelligence. New Orleans: AAAI, 2018. 5446–5455. [doi: 10.1609/aaai.v32i1.11941]
[51] Vempala A, Preoţiuc-Pietro D. Categorizing and inferring the relationship between the text and image of Twitter posts. In: Proc. of the
57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019. 2830–2840. [doi: 10.18653/v1/P19-1272]
[52] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning
transferable visual models from natural language supervision. In: Proc. of the 38th Int’l Conf. on Machine Learning. 2021. 8748–8763.
[53] Su WJ, Zhu XZ, Cao Y, Li B, Lu LW, Wei FR, Dai JF. VL-BERT: Pre-training of generic visual-linguistic representations. In: Proc. of the 8th Int’l Conf. on Learning Representations. 2020. 1–16.
[54] Zhou BH, Zhang Y, Song KH, Guo WY, Zhao GQ, Wang HB, Yuan XJ. A span-based multimodal variational autoencoder for semi-
supervised multimodal named entity recognition. In: Proc. of the 2022 Conf. on Empirical Methods in Natural Language Processing. Abu
Dhabi: ACL, 2022. 6293–6302. [doi: 10.18653/v1/2022.emnlp-main.422]
[55] Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002, 14(8): 1771–1800. [doi: 10.1162/089976602760128018]
[56] Yamada I, Asai A, Shindo H, Takeda H, Matsumoto Y. LUKE: Deep contextualized entity representations with entity-aware self-
attention. In: Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2020. 6442–6454. [doi: 10.18653/v1/2020.emnlp-main.523]
[57] Chen F, Feng YJ. Chain-of-thought prompt distillation for multimodal named entity recognition and multimodal relation extraction.
arXiv:2306.14122, 2023.