Page 170 - Journal of Software (软件学报), 2021, Issue 11


Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software,2021,32(11):3496−3511 [doi: 10.13328/j.cnki.jos.006077] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563

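A Planning Network Model Based on Generalized Asynchronous Value Iteration∗

陈子璇 (CHEN Zi-Xuan), 章宗长 (ZHANG Zong-Zhang), 潘致远 (PAN Zhi-Yuan), 张琳婧 (ZHANG Lin-Jing)²
¹ (State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing, Jiangsu 210023, China)
² (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)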
Corresponding author: 章宗长 (ZHANG Zong-Zhang), E-mail: zzzhang@nju.edu.cn

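Abstract: In recent years, how to generate policies that generalize has become one of the hot topics in deep
reinforcement learning, and many related research results have emerged. One representative work is the generalized
value iteration network (GVIN). GVIN is a planning network model that can operate on irregular graphs. It uses a
special graph convolution operator to approximately represent the state-transition matrix, so that, after learning the
structural information of an irregular graph, it can plan through the value iteration process and thereby produce
policies that generalize in tasks with irregular graph structure. However, because GVIN does not consider how to
allocate planning time rationally according to the importance of states, every round of iteration must be executed
synchronously over all states of the entire state space. When the state space is large, such synchronous updates
degrade the planning performance of the network. This paper applies the idea of asynchronous updates to further study
GVIN. By defining state priorities in the value iteration process and performing asynchronous value updates, a new
asynchronous planning network model, the generalized asynchronous value iteration network, is proposed. In unknown
tasks with irregular structure, the generalized asynchronous value iteration network has a more efficient and more
effective planning process than GVIN. Furthermore, the reinforcement learning algorithm and the graph convolution
operator used in GVIN are improved, and path-planning experiments on irregular graphs and real maps verify the
effectiveness of the improved methods.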
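Key words: deep learning; reinforcement learning; imitation learning; planning; asynchronous update
Chinese Library Classification (CLC) number: TP181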
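
The contrast between synchronous and priority-driven asynchronous value updates described in the abstract can be illustrated with a minimal tabular sketch. This is not the network proposed in the paper: it uses plain value iteration on a small hand-written graph, and it assumes that a state's priority is the magnitude of its Bellman error, which the abstract does not specify. The graph, the reward of 1 for stepping into the goal, and the function names `backup`, `synchronous_vi`, and `prioritized_async_vi` are all hypothetical choices made for illustration.

```python
import heapq


def backup(graph, goal, values, s, gamma):
    """Bellman optimality backup on a deterministic graph.

    Assumed reward: 1 for stepping into the goal, 0 otherwise.
    Every non-goal state must have at least one neighbor.
    """
    if s == goal:
        return 0.0  # the goal is treated as absorbing with value 0
    return max((1.0 if t == goal else 0.0) + gamma * values[t] for t in graph[s])


def synchronous_vi(graph, goal, gamma=0.9, tol=1e-8):
    """Update every state in every sweep until the largest change is below tol."""
    values = {s: 0.0 for s in graph}
    sweeps = 0
    while True:
        new_values = {s: backup(graph, goal, values, s, gamma) for s in graph}
        sweeps += 1
        delta = max(abs(new_values[s] - values[s]) for s in graph)
        values = new_values
        if delta < tol:
            return values, sweeps


def prioritized_async_vi(graph, goal, gamma=0.9, tol=1e-8):
    """Update one state at a time, largest Bellman error first (assumed priority)."""
    values = {s: 0.0 for s in graph}
    # Predecessors: states whose backup depends on the value of s.
    preds = {s: [] for s in graph}
    for s, neighbors in graph.items():
        for t in neighbors:
            preds[t].append(s)

    heap = []

    def push(s):
        err = abs(backup(graph, goal, values, s, gamma) - values[s])
        if err > tol:
            heapq.heappush(heap, (-err, s))  # max-priority via negated error

    for s in graph:
        push(s)

    order = []  # states in the order their values were actually changed
    while heap:
        _, s = heapq.heappop(heap)
        new_v = backup(graph, goal, values, s, gamma)
        if abs(new_v - values[s]) <= tol:
            continue  # stale queue entry: the state is already up to date
        values[s] = new_v
        order.append(s)
        for p in preds[s]:
            push(p)  # only states affected by the change are reconsidered
    return values, order


# A small irregular graph given as adjacency lists; node 5 is the goal.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2, 5], 4: [2, 5], 5: [3, 4]}
goal = 5
v_sync, sweeps = synchronous_vi(graph, goal)
v_async, order = prioritized_async_vi(graph, goal)
print("sweeps:", sweeps, "async update order:", order)
```

In this sketch both procedures reach the same values, but the asynchronous version changes states adjacent to the goal first and then works outward, while the synchronous version touches all states in every sweep. The count of changed values ignores the cost of computing priorities, so it should not be read as a measurement of the speedup reported in the paper.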

Chinese citation format: 陈子璇,章宗长,潘致远,张琳婧.一种基于广义异步值迭代的规划网络模型.软件学报,2021,32(11):3496−3511.
http://www.jos.org.cn/1000-9825/6077.htm
English citation format: Chen ZX, Zhang ZZ, Pan ZY, Zhang LJ. Planning network model based on generalized asynchronous value
iteration. Ruan Jian Xue Bao/Journal of Software, 2021,32(11):3496−3511 (in Chinese). http://www.jos.org.cn/1000-9825/6077.htm
Planning Network Model Based on Generalized Asynchronous Value Iteration

CHEN Zi-Xuan, ZHANG Zong-Zhang, PAN Zhi-Yuan, ZHANG Lin-Jing²
¹ (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China)
² (School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
Abstract: In recent years, how to generate policies with generalization abilities has become one of the hot issues in the field of deep
reinforcement learning, and many related research achievements have appeared. One representative work among them is the generalized value
iteration network (GVIN). GVIN is a differentiable planning network that uses a special graph convolution operator to approximately
represent the state-transition matrix. After learning the structure information of irregular graphs, it performs planning through the
value iteration (VI) process, resulting in policies with generalization abilities. In GVIN, each round of VI involves performing value
updates synchronously at all states over the entire state space. Because GVIN does not consider how to rationally allocate the planning
time according to the importance of states, synchronous updates may degrade the planning performance of the network when the state space is

∗ Foundation item: National Natural Science Foundation of China (61876119); Natural Science Foundation of Jiangsu Province
(BK20181432); Fundamental Research Funds for the Central Universities (022114380010)
Received 2019-11-12; Revised 2020-03-17; Accepted 2020-04-30
