Page 310 - 《软件学报》2025年第9期
P. 310

赵嘉宁 等: 提升隐式场景下短语视觉定位的因果建模方法                                                     4221


                 [18]   Yang SB, Li GB, Yu YZ. Relationship-embedded representation learning for grounding referring expressions. IEEE Trans. on Pattern
                     Analysis and Machine Intelligence, 2020, 43(8): 2765–2779. [doi: 10.1109/TPAMI.2020.2973983]
                 [19]   Yang L, Xu Y, Yuan CF, Liu W, Li B, Hu WM. Improving visual grounding with visual-linguistic verification and iterative reasoning. In:
                     Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 9489–9498. [doi:
                     10.1109/CVPR52688.2022.00928]
                 [20]   Ye JB, Tian JF, Yan M, Yang XS, Wang XW, Zhang J, He L, Lin X. Shifting more attention to visual backbone: Query-modulated
                     refinement  networks  for  end-to-end  visual  grounding.  In:  Proc.  of  the  2022  IEEE/CVF  Conf.  on  Computer  Vision  and  Pattern
                     Recognition (CVPR). New Orleans: IEEE, 2022. 15481–15491. [doi: 10.1109/CVPR52688.2022.01506]
                 [21]   Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.
                 [22]   Liao Y, Liu S, Li GB, Wang F, Chen YJ, Qian C, Li B. A real-time cross-modality correlation filtering method for referring expression
                     comprehension.  In:  Proc.  of  the  2020  IEEE/CVF  Conf.  on  Computer  Vision  and  Pattern  Recognition  (CVPR).  Seattle:  IEEE,  2020.
                     10877–10886. [doi: 10.1109/CVPR42600.2020.01089]
                 [23]   Yang ZY, Chen TL, Wang LB, Luo JB. Improving one-stage visual grounding by recursive sub-query construction. In: Proc. of the 16th
                     European Conf. on Computer Vision. Glasgow: Springer, 2020. 387–404. [doi: 10.1007/978-3-030-58568-6_23]
                 [24]   Huang BB, Lian DZ, Luo WX, Gao SH. Look before you leap: Learning landmark features for one-stage visual grounding. In: Proc. of
                     the 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021. 16883–16892. [doi: 10.1109/
                     CVPR46437.2021.01661]
                 [25]   Liao Y, Zhang AY, Chen ZY, Hui TR, Liu S. Progressive language-customized visual feature learning for one-stage visual grounding.
                     IEEE Trans. on Image Processing, 2022, 31: 4266–4277. [doi: 10.1109/TIP.2022.3181516]
                 [26]   Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proc. of the
                     31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6000–6010.
                 [27]   Deng  JJ,  Yang  ZY,  Chen  TL,  Zhou  WG,  Li  HQ.  TransVG:  End-to-end  visual  grounding  with  Transformers.  In:  Proc.  of  the  2021
                     IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 1749–1759. [doi: 10.1109/ICCV48922.2021.00179]
                 [28]   Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Proc.
                     of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1
                     (Long and Short Papers). Minneapolis: Association for Computational Linguistics, 2018. 4171–4186. [doi: 10.18653/v1/N19-1423]
                 [29]   Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with Transformers. In: Proc. of the
                     16th European Conf. on Computer Vision. Glasgow: Springer, 2020. 213–229. [doi: 10.1007/978-3-030-58452-8_13]
                 [30]   Tang  KH,  Niu  YL,  Huang  JQ,  Shi  JX,  Zhang  HW.  Unbiased  scene  graph  generation  from  biased  training.  In:  Proc.  of  the  2020
                     IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 3713–3722. [doi: 10.1109/CVPR42600.2020.00377]
                 [31]   Zhang D, Zhang HW, Tang JH, Hua XS, Sun QR. Causal intervention for weakly-supervised semantic segmentation. In: Proc. of the 34th
                     Int’l Conf. on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2020. 56.
                 [32]   Chen L, Yan X, Xiao J, Zhang HW, Pu SL, Zhuang YT. Counterfactual samples synthesizing for robust visual question answering. In:
                     Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 10797–10806. [doi: 10.1109/
                     CVPR42600.2020.01081]
                 [33]   Yue  ZQ,  Zhang  HW,  Sun  QR,  Hua  XS.  Interventional  few-shot  learning.  In:  Proc.  of  the  34th  Int’l  Conf.  on  Neural  Information
                     Processing Systems. Vancouver: Curran Associates Inc., 2020. 2734–2746.
                 [34]   Wang T, Huang JQ, Zhang HW, Sun QR. Visual commonsense R-CNN. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and
                     Pattern Recognition. Seattle: IEEE, 2020. 10757–10767. [doi: 10.1109/CVPR42600.2020.01077]
                 [35]   Wang T, Zhou C, Sun QR, Zhang HW. Causal attention for unbiased visual recognition. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on
                     Computer Vision (ICCV). Montreal: IEEE, 2021. 3071–3080. [doi: 10.1109/ICCV48922.2021.00308]
                 [36]   Huang  JQ,  Qin  Y,  Qi  JX,  Sun  QR,  Zhang  HW.  Deconfounded  visual  grounding.  In:  Proc.  of  the  36th  AAAI  Conf.  on  Artificial
                     Intelligence. AAAI Press, 2022. 998–1006. [doi: 10.1609/aaai.v36i1.19983]
                 [37]   Yang X, Zhang HW, Qi GJ, Cai JF. Causal attention for vision-language tasks. In: Proc. of the 2021 IEEE/CVF Conf. on Computer
                     Vision and Pattern Recognition. Nashville: IEEE, 2021. 9842–9852. [doi: 10.1109/CVPR46437.2021.00972]
                 [38]   He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. In: Proc. of the 2016 IEEE Conf. on Computer Vision
                     and Pattern Recognition. Las Vegas: IEEE, 2016. 770–778. [doi: 10.1109/CVPR.2016.90]
                 [39]   Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proc. of the 2014 Conf. on Empirical Methods in
                     Natural Language Processing (EMNLP). Doha: ACL, 2014. 1532–1543. [doi: 10.3115/v1/D14-1162]
                 [40]   Liu Y, Ott M, Goyal N, Du JF, Joshi M, Chen DQ, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: A robustly optimized bert
   305   306   307   308   309   310   311   312   313   314   315