In the grid-world environment, we compared partitioning task trajectories into skills with $w_{n_1,n_2}$ and computing skill similarity with $w_{i_1,i_2}^{k,p}$ against randomly partitioning task trajectories into skills. The results show that with random skill partitioning the performance is close to that of RPPO and below that of our method, which verifies the effectiveness of our approach. The results are shown in Table 6.
Table 5  Effect of the number of segments on the results
Algorithm        Training success rate    Testing success rate
GCSSD-2 (N=2)    0.840                    0.251
GCSSD-4 (N=4)    0.827                    0.245
GCSSD-1 (N=1)    0.834                    0.204
RPPO             0.778                    0.180

Table 6  Effect of skill partitioning on the results
Algorithm                          Training success rate    Testing success rate
Skill partitioning by similarity   0.840                    0.251
Random skill partitioning          0.795                    0.179
RPPO                               0.778                    0.180
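As a rough illustration of the ablation in Table 6, the sketch below contrasts similarity-based skill partitioning with the random baseline. All names here (segment embeddings, skill prototypes, cosine similarity) are our own illustrative assumptions and are not the paper's exact computation of $w_{n_1,n_2}$ or $w_{i_1,i_2}^{k,p}$.

import numpy as np

def partition_by_similarity(segment_embs, prototypes):
    """Assign each trajectory segment to the skill whose prototype
    embedding is most similar (cosine similarity); an assumed stand-in
    for the paper's similarity-based partitioning."""
    seg = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    pro = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = seg @ pro.T                      # (num_segments, num_skills)
    return sim.argmax(axis=1)              # one skill label per segment

def partition_randomly(num_segments, num_skills, rng):
    """Random-skill-partitioning baseline from Table 6: labels carry
    no information about the segments."""
    return rng.integers(0, num_skills, size=num_segments)

rng = np.random.default_rng(0)
segment_embs = rng.normal(size=(8, 16))    # 8 segments, 16-dim embeddings
prototypes = rng.normal(size=(4, 16))      # 4 candidate skills
print(partition_by_similarity(segment_embs, prototypes))
print(partition_randomly(8, 4, rng))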
5 Conclusion
This paper proposes a grouped-contrastive-learning-based sequential-aware skill discovery method (GCSSD). To address the limited ability of existing skill-discovery-based reinforcement learning methods to handle sequential skill compositions, GCSSD applies contrastive learning to grouped trajectories to learn embedding representations, embeds these representations into the reinforcement learning policy, and trains both jointly. On test tasks, GCSSD reuses the learned skill representation network to adapt quickly to tasks with different sequential skill compositions. Experiments validate the effectiveness of the proposed algorithm. Future work includes two directions: using probabilistic matching for trajectory grouping, and validating the proposed algorithm in larger-scale experiments.
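The following is a minimal PyTorch sketch of the kind of grouped contrastive objective described above: trajectory segments from the same group are treated as positives and segments from other groups as negatives, and the learned embedding would then be fed into the policy for joint training. The encoder architecture (a GRU), the InfoNCE form, the temperature, and how the embedding conditions the policy are our own assumptions for illustration, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Encodes a trajectory segment (a sequence of state-action vectors)
    into a unit-norm skill embedding; the GRU is an assumed choice."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.rnn = nn.GRU(in_dim, emb_dim, batch_first=True)

    def forward(self, segments):                 # (B, T, in_dim)
        _, h = self.rnn(segments)
        return F.normalize(h[-1], dim=-1)        # (B, emb_dim)

def grouped_info_nce(emb, group_ids, temperature=0.1):
    """InfoNCE-style grouped contrastive loss: segments sharing a group id
    are positives, all other segments in the batch are negatives."""
    sim = emb @ emb.t() / temperature            # (B, B) similarities
    B = emb.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=emb.device)
    pos = (group_ids.unsqueeze(0) == group_ids.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float('-inf'))           # drop self-pairs
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(log_prob * pos.float()).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()

# Toy usage: 6 segments of length 10 with 20-dim features, in 3 groups.
enc = SegmentEncoder(in_dim=20, emb_dim=32)
segments = torch.randn(6, 10, 20)
group_ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = grouped_info_nce(enc(segments), group_ids)
loss.backward()
# In joint training, the segment embedding would be concatenated with the
# policy's observation features and optimized together with the RL objective.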