segmentation. In: Proc. of the 2019 IEEE Int’l Conf. on Image Processing (ICIP). Taipei: IEEE, 2019. 1440–1444. [doi: 10.1109/ICIP.2019.8803025]
[132] Zhao HS, Shi JP, Qi XJ, Wang XG, Jia JY. Pyramid scene parsing network. In: Proc. of the 2017 IEEE Conf. on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017. 6230–6239. [doi: 10.1109/CVPR.2017.660]
[133] Li LH, Zhang PC, Zhang HT, Yang JW, Li CY, Zhong YW, Wang LJ, Yuan L, Zhang L, Hwang JN, Chang KW, Gao JF. Grounded language-image pre-training. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022. 10955–10965. [doi: 10.1109/CVPR52688.2022.01069]
[134] Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848. [doi: 10.1109/TPAMI.2017.2699184]
[135] Ren SQ, He KM, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. of the 28th Int’l Conf. on Neural Information Processing Systems. Montreal: MIT Press, 2015. 91–99.
[136] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Proc. of the 16th European Conf. on Computer Vision. Glasgow: Springer, 2020. 213–229. [doi: 10.1007/978-3-030-58452-8_13]
[137] Zhou GZ, Hong YC, Wu Q. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In: Proc. of the 38th AAAI Conf. on Artificial Intelligence. Vancouver: AAAI, 2024. 7641–7649. [doi: 10.1609/aaai.v38i7.28597]
[138] Eftekhar A, Zeng KH, Duan JF, Farhadi A, Kembhavi A, Krishna R. Selective visual representations improve convergence and generalization for embodied AI. In: Proc. of the 12th Int’l Conf. on Learning Representations. Vienna: ICLR, 2024.
[139] Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Proc. of the 3rd Theory of Cryptography Conf. New York: Springer, 2006. 265–284. [doi: 10.1007/11681878_14]
[140] Shah D, Equi MR, Osiński B, Xia F, Ichter B, Levine S. Navigation with large language models: Semantic guesswork as a heuristic for planning. In: Proc. of the 7th Conf. on Robot Learning. Atlanta: PMLR, 2023. 2683–2699.
[141] Song CH, Sadler BM, Wu JM, Chao WL, Washington C, Su Y. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In: Proc. of the 2023 IEEE/CVF Int’l Conf. on Computer Vision. Paris: IEEE, 2023. 2986–2997. [doi: 10.1109/ICCV51070.2023.00280]
[142] Wu PY, Mu Y, Wu BX, Hou Y, Ma J, Zhang SH, Liu C. VoroNav: Voronoi-based zero-shot object navigation with large language model. arXiv:2401.02695, 2024.
[143] Tsai YHH, Dhar V, Li JL, Zhang BW, Zhang J. Multimodal large language model for visual navigation. arXiv:2310.08669, 2023.
[144] Xi ZH, Chen WX, Guo X, He W, Ding YW, Hong BY, Zhang M, Wang JZ, Jin SJ, Zhou EY, Zheng R, Fan XR, Wang X, Xiong LM, Zhou YH, Wang WR, Jiang CH, Zou YC, Liu XY, Yin ZY, Dou SH, Weng RX, Cheng WS, Zhang Q, Qin WJ, Zheng YY, Qiu XP, Huang XJ, Gui T. The rise and potential of large language model based agents: A survey. arXiv:2309.07864, 2023.
[145] Vuong AD, Nguyen TT, Vu MN, Huang BR, Nguyen D, Binh HTT, Vo T, Nguyen A. HabiCrowd: A high performance simulator for crowd-aware visual navigation. arXiv:2306.11377, 2023.
[146] Cancelli E, Campari T, Serafini L, Chang AX, Ballan L. Exploiting proximity-aware tasks for embodied social navigation. In: Proc. of the 2023 IEEE/CVF Int’l Conf. on Computer Vision. Paris: IEEE, 2023. 10923–10933. [doi: 10.1109/ICCV51070.2023.01006]
[147] Chen BL, Lu SY, Zhong P, Cui YZ, Liang YX, Wang JX. SemNav-HRO: A target-driven semantic navigation strategy with human-robot-object ternary fusion. Engineering Applications of Artificial Intelligence, 2024, 127: 107370. [doi: 10.1016/j.engappai.2023.107370]
[148] Luo Q, Sorokin M, Ha S. A few shot adaptation of visual navigation skills to new observations using meta-learning. In: Proc. of the 2021 IEEE Int’l Conf. on Robotics and Automation (ICRA). Xi’an: IEEE, 2021. 13231–13237. [doi: 10.1109/ICRA48506.2021.9561056]
[149] Wang T, Wu ZK, Wang DL. Visual perception generalization for vision-and-language navigation via meta-learning. IEEE Trans. on Neural Networks and Learning Systems, 2023, 34(8): 5193–5199. [doi: 10.1109/TNNLS.2021.3122579]
[150] Dwivedi K, Roig G, Kembhavi A, Mottaghi R. What do navigation agents learn about their environment? In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022. 10266–10275. [doi: 10.1109/CVPR52688.2022.01003]
[151] Yang ZJ, Majumdar A, Lee S. Behavioral analysis of vision-and-language navigation agents. In: Proc. of the 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023. 2574–2582. [doi: 10.1109/CVPR52729.2023.00253]
[152] Driess D, Xia F, Sajjadi MSM, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu TH, Huang WL, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Zeng A, Mordatch I, Florence P. PaLM-E: An embodied multimodal language model. In: Proc. of the 40th Int’l Conf. on Machine Learning. Honolulu: PMLR, 2023. 8469–8488.