Journal of Software (软件学报), Vol.32, No.11, November 2021
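The planning network model proposed in this paper, and the RL algorithm used to train it, still have shortcomings that future work could address. For example, a better way could be sought to define the priority of each node in GAVIN's asynchronous value iteration process, as well as the […] used to select what is to be updated […] capability. In addition, because each round of GAVIN's asynchronous value iteration updates only selected nodes, the test results of GAVIN trained with the IL algorithm show some overfitting; future work could look for a better neural network architecture for the model, or apply data augmentation and data cleaning, to eliminate it. In the episodic weighted double Q-learning proposed in this paper, the magnitude of the weighting function is still […]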
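To make the idea of updating only selected nodes in each round concrete, the following is a minimal sketch of prioritized asynchronous value iteration on a small graph. It is not GAVIN's procedure: the priority (the absolute Bellman residual of a node), the top-`k` selection rule, the toy graph, and the names `bellman_backup` and `async_value_iteration` are all assumptions introduced for illustration.

```python
def bellman_backup(values, graph, node, goal, gamma):
    """One-step lookahead for `node`: best of reward + gamma * V(next)."""
    if node == goal:
        return 0.0  # the goal is absorbing with value 0
    return max(reward + gamma * values[nxt] for nxt, reward in graph[node])


def async_value_iteration(graph, goal, gamma=1.0, k=1, tol=1e-9, max_rounds=10_000):
    """Update only the k highest-priority nodes in each round.

    Priority is the absolute Bellman residual |backup(v) - V(v)|.
    This is an assumed, illustrative choice, not the priority used by GAVIN.
    """
    values = {node: 0.0 for node in graph}
    for _ in range(max_rounds):
        backups = {n: bellman_backup(values, graph, n, goal, gamma) for n in graph}
        residuals = {n: abs(backups[n] - values[n]) for n in graph}
        if max(residuals.values()) <= tol:
            break  # every node already satisfies the Bellman equation
        # Select the k nodes whose values are furthest from their backups.
        selected = sorted(graph, key=residuals.get, reverse=True)[:k]
        for n in selected:
            values[n] = backups[n]
    return values


# Hypothetical toy graph: node -> list of (successor, reward) pairs.
demo_graph = {
    "A": [("B", -1.0), ("C", -4.0)],
    "B": [("C", -1.0), ("A", -1.0)],
    "C": [],
}
demo_values = async_value_iteration(demo_graph, goal="C")
```

With a reward of -1 per step and `gamma=1.0`, each value converges to the negative cost of the cheapest path to the goal, so `demo_values["A"]` ends at -2.0 (via `B`) rather than -4.0 (the direct edge). Changing how `residuals` is computed, or how `selected` is chosen, is where an alternative priority definition would plug in.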