[18] Yang SB, Li GB, Yu YZ. Relationship-embedded representation learning for grounding referring expressions. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 2021, 43(8): 2765–2779. [doi: 10.1109/TPAMI.2020.2973983]
[19] Yang L, Xu Y, Yuan CF, Liu W, Li B, Hu WM. Improving visual grounding with visual-linguistic verification and iterative reasoning. In:
Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 9489–9498. [doi:
10.1109/CVPR52688.2022.00928]
[20] Ye JB, Tian JF, Yan M, Yang XS, Wang XW, Zhang J, He L, Lin X. Shifting more attention to visual backbone: Query-modulated
refinement networks for end-to-end visual grounding. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern
Recognition (CVPR). New Orleans: IEEE, 2022. 15481–15491. [doi: 10.1109/CVPR52688.2022.01506]
[21] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.
[22] Liao Y, Liu S, Li GB, Wang F, Chen YJ, Qian C, Li B. A real-time cross-modality correlation filtering method for referring expression
comprehension. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020.
10877–10886. [doi: 10.1109/CVPR42600.2020.01089]
[23] Yang ZY, Chen TL, Wang LB, Luo JB. Improving one-stage visual grounding by recursive sub-query construction. In: Proc. of the 16th
European Conf. on Computer Vision. Glasgow: Springer, 2020. 387–404. [doi: 10.1007/978-3-030-58568-6_23]
[24] Huang BB, Lian DZ, Luo WX, Gao SH. Look before you leap: Learning landmark features for one-stage visual grounding. In: Proc. of
the 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021. 16883–16892. [doi: 10.1109/
CVPR46437.2021.01661]
[25] Liao Y, Zhang AY, Chen ZY, Hui TR, Liu S. Progressive language-customized visual feature learning for one-stage visual grounding.
IEEE Trans. on Image Processing, 2022, 31: 4266–4277. [doi: 10.1109/TIP.2022.3181516]
[26] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proc. of the
31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6000–6010.
[27] Deng JJ, Yang ZY, Chen TL, Zhou WG, Li HQ. TransVG: End-to-end visual grounding with Transformers. In: Proc. of the 2021
IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 1749–1759. [doi: 10.1109/ICCV48922.2021.00179]
[28] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Proc.
of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1
(Long and Short Papers). Minneapolis: Association for Computational Linguistics, 2019. 4171–4186. [doi: 10.18653/v1/N19-1423]
[29] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with Transformers. In: Proc. of the
16th European Conf. on Computer Vision. Glasgow: Springer, 2020. 213–229. [doi: 10.1007/978-3-030-58452-8_13]
[30] Tang KH, Niu YL, Huang JQ, Shi JX, Zhang HW. Unbiased scene graph generation from biased training. In: Proc. of the 2020
IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 3713–3722. [doi: 10.1109/CVPR42600.2020.00377]
[31] Zhang D, Zhang HW, Tang JH, Hua XS, Sun QR. Causal intervention for weakly-supervised semantic segmentation. In: Proc. of the 34th
Int’l Conf. on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2020. 56.
[32] Chen L, Yan X, Xiao J, Zhang HW, Pu SL, Zhuang YT. Counterfactual samples synthesizing for robust visual question answering. In:
Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 10797–10806. [doi: 10.1109/
CVPR42600.2020.01081]
[33] Yue ZQ, Zhang HW, Sun QR, Hua XS. Interventional few-shot learning. In: Proc. of the 34th Int’l Conf. on Neural Information
Processing Systems. Vancouver: Curran Associates Inc., 2020. 2734–2746.
[34] Wang T, Huang JQ, Zhang HW, Sun QR. Visual commonsense R-CNN. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and
Pattern Recognition. Seattle: IEEE, 2020. 10757–10767. [doi: 10.1109/CVPR42600.2020.01077]
[35] Wang T, Zhou C, Sun QR, Zhang HW. Causal attention for unbiased visual recognition. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on
Computer Vision (ICCV). Montreal: IEEE, 2021. 3071–3080. [doi: 10.1109/ICCV48922.2021.00308]
[36] Huang JQ, Qin Y, Qi JX, Sun QR, Zhang HW. Deconfounded visual grounding. In: Proc. of the 36th AAAI Conf. on Artificial
Intelligence. AAAI Press, 2022. 998–1006. [doi: 10.1609/aaai.v36i1.19983]
[37] Yang X, Zhang HW, Qi GJ, Cai JF. Causal attention for vision-language tasks. In: Proc. of the 2021 IEEE/CVF Conf. on Computer
Vision and Pattern Recognition. Nashville: IEEE, 2021. 9842–9852. [doi: 10.1109/CVPR46437.2021.00972]
[38] He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. In: Proc. of the 2016 IEEE Conf. on Computer Vision
and Pattern Recognition. Las Vegas: IEEE, 2016. 770–778. [doi: 10.1109/CVPR.2016.90]
[39] Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proc. of the 2014 Conf. on Empirical Methods in
Natural Language Processing (EMNLP). Doha: ACL, 2014. 1532–1543. [doi: 10.3115/v1/D14-1162]
[40] Liu Y, Ott M, Goyal N, Du JF, Joshi M, Chen DQ, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.