[65] Du HM, Yu X, Zheng L. VTNet: Visual Transformer network for object goal navigation. In: Proc. of the 9th Int’l Conf. on Learning
Representations. ICLR, 2021.
[66] Fukushima R, Ota K, Kanezaki A, Sasaki Y, Yoshiyasu Y. Object memory Transformer for object goal navigation. In: Proc. of the 2022
Int’l Conf. on Robotics and Automation. Philadelphia: IEEE, 2022. 11288–11294. [doi: 10.1109/ICRA46639.2022.9812027]
[67] Georgakis G, Schmeckpeper K, Wanchoo K, Dan S, Miltsakaki E, Roth D, Daniilidis K. Cross-modal map learning for vision and
language navigation. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022.
15439–15449. [doi: 10.1109/CVPR52688.2022.01502]
[68] Henriques JF, Vedaldi A. MapNet: An allocentric spatial memory for mapping environments. In: Proc. of the 2018 IEEE/CVF Conf. on
Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 8476–8484. [doi: 10.1109/CVPR.2018.00884]
[69] Cartillier V, Ren ZL, Jain N, Lee S, Essa I, Batra D. Semantic MapNet: Building allocentric semantic maps and representations from
egocentric views. In: Proc. of the 35th AAAI Conf. on Artificial Intelligence. AAAI, 2021. 964–972. [doi: 10.1609/aaai.v35i2.16180]
[70] Chen PH, Ji DY, Lin KY, Zeng RH, Li TH, Tan MK, Gan C. Weakly-supervised multi-granularity map learning for vision-and-language
navigation. In: Proc. of the 36th Int’l Conf. on Neural Information Processing Systems. New Orleans: Curran Associates Inc., 2022.
2764.
[71] Xu DF, Zhu YK, Choy CB, Fei-Fei L. Scene graph generation by iterative message passing. In: Proc. of the 2017 IEEE/CVF Conf. on
Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017. 3097–3106. [doi: 10.1109/CVPR.2017.330]
[72] Yang JW, Lu JS, Lee S, Batra D, Parikh D. Graph R-CNN for scene graph generation. In: Proc. of the 15th European Conf. on
Computer Vision. Munich: Springer, 2018. 690–706. [doi: 10.1007/978-3-030-01246-5_41]
[73] Zellers R, Yatskar M, Thomson S, Choi Y. Neural motifs: Scene graph parsing with global context. In: Proc. of the 2018 IEEE/CVF
Conf. on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 5831–5840. [doi: 10.1109/CVPR.2018.00611]
[74] Ost J, Mannan F, Thuerey N, Knodt J, Heide F. Neural scene graphs for dynamic scenes. In: Proc. of the 2021 IEEE/CVF Conf. on
Computer Vision and Pattern Recognition. Nashville: IEEE, 2021. 2855–2864. [doi: 10.1109/CVPR46437.2021.00288]
[75] Tsai YHH, Divvala S, Morency LP, Salakhutdinov R, Farhadi A. Video relationship reasoning using gated spatio-temporal energy
graph. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019. 10416–10425.
[doi: 10.1109/CVPR.2019.01067]
[76] Giuliari F, Skenderi G, Cristani M, Wang YM, Del Bue A. Spatial commonsense graph for object localisation in partial scenes. In: Proc.
of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022. 19496–19505. [doi: 10.1109/
CVPR52688.2022.01891]
[77] Gao C, Chen JY, Liu S, Wang LT, Zhang Q, Wu Q. Room-and-object aware knowledge reasoning for remote embodied referring expression. In: Proc. of the 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021. 3063–3072. [doi: 10.1109/CVPR46437.2021.00308]
[78] Gadre SY, Ehsani K, Song SR, Mottaghi R. Continuous scene representations for embodied AI. In: Proc. of the 2022 IEEE/CVF Conf.
on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022. 14829–14839. [doi: 10.1109/CVPR52688.2022.01443]
[79] Du YL, Gan C, Isola P. Curious representation learning for embodied intelligence. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on
Computer Vision. Montreal: IEEE, 2021. 10388–10397. [doi: 10.1109/ICCV48922.2021.01024]
[80] Girshick R. Fast R-CNN. In: Proc. of the 2015 IEEE Int’l Conf. on Computer Vision. Santiago: IEEE, 2015. 1440–1448. [doi: 10.1109/
ICCV.2015.169]
[81] van den Oord A, Li YZ, Vinyals O. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
[82] Zhu H, Kapoor R, Min SY, Han W, Li JT, Geng KW, Neubig G, Bisk Y, Kembhavi A, Weihs L. EXCALIBUR: Encouraging and evaluating embodied exploration. In: Proc. of the 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023. 14931–14942. [doi: 10.1109/CVPR52729.2023.01434]
[83] Chaplot DS, Gandhi D, Gupta S, Gupta A, Salakhutdinov R. Learning to explore using active neural SLAM. In: Proc. of the 8th Int’l Conf. on Learning Representations. Addis Ababa: ICLR, 2020.
[84] Bigazzi R, Cornia M, Cascianelli S, Baraldi L, Cucchiara R. Embodied agents for efficient exploration and smart scene description. In: Proc. of the 2023 IEEE Int’l Conf. on Robotics and Automation. London: IEEE, 2023. 6057–6064. [doi: 10.1109/ICRA48891.2023.10160668]
[85] Savinov N, Raichuk A, Vincent D, Marinier R, Pollefeys M, Lillicrap T, Gelly S. Episodic curiosity through reachability. In: Proc. of the
7th Int’l Conf. on Learning Representations. New Orleans: ICLR, 2019.
[86] Strehl AL, Littman ML. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and
System Sciences, 2008, 74(8): 1309–1331. [doi: 10.1016/j.jcss.2007.08.009]
[87] Bellemare MG, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R. Unifying count-based exploration and intrinsic motivation. In: