
软件学报 ISSN 1000-9825, CODEN RUXUEW                                        E-mail: jos@iscas.ac.cn
                 2025,36(10):4695−4709 [doi: 10.13328/j.cnki.jos.007296] [CSTR: 32375.14.jos.007296]  http://www.jos.org.cn
                 © Institute of Software, Chinese Academy of Sciences. All rights reserved.                  Tel: +86-10-62562563



                 扩散模型期望最大化的离线强化学习方法*

                 刘全 1,2,    颜洁 1,    乌兰 1


                 1(苏州大学 计算机科学与技术学院, 江苏 苏州 215008)
                 2(江苏省计算机信息处理技术重点实验室 (苏州大学), 江苏 苏州 215006)
                 Corresponding author: LIU Quan (刘全), E-mail: quanliu@suda.edu.cn

                 摘 要: 在连续且密集奖励的任务中, 离线强化学习取得了显著的效果. 然而由于其训练过程不与环境交互, 泛化
                 能力降低, 在离散且稀疏奖励的环境下性能难以得到保证. 扩散模型通过加噪结合样本数据邻域的信息, 生成贴近
                 样本数据分布的动作, 强化智能体的学习和泛化能力. 针对以上问题, 提出一种扩散模型期望最大化的离线强化学
                 习方法 (offline reinforcement learning with diffusion models and expectation maximization, DMEM). 该方法通过极大
                 似然对数期望最大化更新目标函数, 使策略具有更强的泛化性. 将扩散模型引入策略网络中, 利用扩散的特征, 增
                 强策略学习数据样本的能力. 同时从高维空间的角度用期望回归更新价值函数, 引入一个惩戒项使价值函数评估
                 更准确. 将 DMEM 应用于一系列离散且稀疏奖励的任务中, 实验表明, 与其他经典的离线强化学习方法相比,
                 DMEM 性能上具有较大的优势.
                 关键词: 离线强化学习; 扩散模型; 优势函数加权; 期望回归; 期望最大化
                 中图法分类号: TP18


                 中文引用格式: 刘全, 颜洁, 乌兰. 扩散模型期望最大化的离线强化学习方法. 软件学报, 2025, 36(10): 4695–4709. http://www.jos.org.cn/1000-9825/7296.htm
                 英文引用格式: Liu Q, Yan J, Wu L. Offline Reinforcement Learning Method with Diffusion Model and Expectation Maximization.
                 Ruan Jian Xue Bao/Journal of Software, 2025, 36(10): 4695–4709 (in Chinese). http://www.jos.org.cn/1000-9825/7296.htm
                 Offline Reinforcement Learning Method with Diffusion Model and Expectation Maximization
                 LIU Quan 1,2, YAN Jie 1, WU Lan 1
                 1(School of Computer Science and Technology, Soochow University, Suzhou 215008, China)
                 2(Jiangsu Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou 215006, China)
                 Abstract: Offline reinforcement learning has achieved significant results in tasks with continuous and dense rewards. However, since its
                 training process does not interact with the environment, its generalization ability is reduced, and its performance is difficult to guarantee
                 in discrete and sparse reward environments. The diffusion model combines information from the neighborhood of the sample data through
                 noise addition to generate actions close to the distribution of the sample data, which strengthens the learning and generalization
                 ability of the agent. To address these problems, offline reinforcement learning with diffusion models and expectation maximization (DMEM) is proposed.
                 The method updates the objective function by maximizing the expectation of the log-likelihood, which makes the policy more
                 generalizable. Additionally, the diffusion model is introduced into the policy network, and its diffusion characteristics are exploited to enhance the
                 ability of the policy to learn from data samples. Meanwhile, expectile regression is employed to update the value function from the
                 perspective of high-dimensional space, and a penalty term is introduced to make the evaluation of the value function more accurate.
                 DMEM is applied to a series of tasks with discrete and sparse rewards, and experiments show that DMEM has a large advantage in
                 performance over other classical offline reinforcement learning methods.
                 Key words: offline reinforcement learning; diffusion model; advantage function weighting; expectile regression; expectation maximization
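
                 The expectile regression mentioned in the abstract is the standard asymmetric squared loss used to update value functions in offline reinforcement learning (as popularized by implicit Q-learning). A minimal sketch follows, assuming a scalar residual u = target − prediction; the function name and the default τ = 0.7 are illustrative and not taken from the authors' implementation:

                 ```python
                 def expectile_loss(u, tau=0.7):
                     """Asymmetric squared loss L_tau(u) = |tau - 1{u < 0}| * u^2.

                     u is the Bellman residual (target minus prediction). With
                     tau > 0.5, positive residuals are weighted more heavily,
                     biasing the value estimate toward the upper expectile of
                     the target distribution."""
                     weight = tau if u >= 0 else 1.0 - tau
                     return weight * u * u
                 ```

                 For example, with τ = 0.7 a residual of +1 incurs weight 0.7 while a residual of −1 incurs only 0.3, so the learned value function leans optimistically toward the best actions in the dataset; the penalty term described in the abstract is a separate correction on top of this loss.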


                 *    Funding: National Natural Science Foundation of China (62376179, 62176175); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A238); Priority Academic Program Development of Jiangsu Higher Education Institutions
                  Received 2024-05-06; revised 2024-07-18; accepted 2024-09-05; published online at jos.org.cn 2025-02-19
                  CNKI online first 2025-02-19