Page 437 - Journal of Software (软件学报), 2024, Issue 6

Ruan Jian Xue Bao (软件学报) ISSN 1000-9825, CODEN RUXUEW                     E-mail: jos@iscas.ac.cn
                 Journal of Software, 2024, 35(6): 3013−3035 [doi: 10.13328/j.cnki.jos.006894]  http://www.jos.org.cn
                 © Institute of Software, Chinese Academy of Sciences. All rights reserved.          Tel: +86-10-62562563



                 Efficient Automated Data Augmentation Algorithm Based on Self-guided Evolution Strategy*

                 ZHU Guang-Hui, CHEN Wen-Zhong, ZHU Zhen-Nan, YUAN Chun-Feng, HUANG Yi-Hua


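                 (State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing, Jiangsu 210023, China)
                 Corresponding authors: ZHU Guang-Hui, E-mail: zgh@nju.edu.cn; HUANG Yi-Hua, E-mail: yhuang@nju.edu.cn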

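                 Abstract: Deep learning has achieved excellent performance on analysis tasks over media data such as images,
                 text, and speech. Data augmentation can very effectively increase the scale and diversity of training data,
                 thereby improving model generalization. However, for a given dataset, designing a good data augmentation
                 strategy relies heavily on expert experience and domain knowledge and requires repeated trial and error, which
                 is time-consuming and labor-intensive. In recent years, automated data augmentation, in which the machine
                 designs the augmentation strategy automatically, has attracted wide attention from academia and industry.
                 To address the problem that existing automated data augmentation algorithms cannot yet strike a good balance
                 between prediction accuracy and search efficiency, this paper proposes SGES AA, an automated data augmentation
                 algorithm based on a self-guided evolution strategy. First, an effective continuous vector representation of
                 data augmentation strategies is designed, and the automated data augmentation problem is converted into a
                 search problem over continuous strategy vectors. Second, a strategy vector search method based on the
                 self-guided evolution strategy is proposed. By using historical estimated gradient information to guide the
                 sampling and updating of exploration points, it effectively avoids getting trapped in local optima while also
                 speeding up the convergence of the search. Extensive experiments on image, text, and speech datasets show
                 that, without significantly increasing search time, the proposed algorithm achieves prediction accuracy that
                 is better than or matches that of the current state-of-the-art automated data augmentation methods.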
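                 Key words: deep learning; data augmentation; automated machine learning; self-guided evolution strategy
                 Chinese Library Classification number: TP391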
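The search idea summarized in the abstract (estimate a gradient from sampled exploration points, then reuse past estimates to bias later sampling) can be illustrated with a small sketch. This is a generic guided evolution strategy on a toy objective, not the SGES AA procedure itself: the function name `guided_es_maximize`, every hyperparameter value, the mixed-covariance sampling rule, and the quadratic `score` function standing in for validation accuracy of a strategy vector are all assumptions made for illustration.

```python
import numpy as np

def guided_es_maximize(f, x0, iters=300, pop=10, sigma=0.1, lr=0.1,
                       alpha=0.5, k=3, seed=0):
    """Maximize f over a continuous vector with an evolution strategy whose
    sampling is biased toward the subspace of recent gradient estimates.
    Illustrative sketch only; assumes k <= len(x0)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    assert k <= n, "guiding subspace cannot exceed the search dimension"
    history = []  # the k most recent gradient estimates
    for _ in range(iters):
        guided = len(history) == k
        if guided:
            # Orthonormal basis (n x k) of the span of past gradient estimates.
            basis, _ = np.linalg.qr(np.stack(history, axis=1))
        a = alpha if guided else 1.0  # purely isotropic until history is full
        grad = np.zeros(n)
        for _ in range(pop):
            # Perturbation = isotropic part + part inside the guiding subspace.
            eps = np.sqrt(a / n) * rng.standard_normal(n)
            if guided:
                eps += np.sqrt((1.0 - a) / k) * (basis @ rng.standard_normal(k))
            eps *= sigma
            # Antithetic pair: evaluate f at x + eps and x - eps.
            grad += eps * (f(x + eps) - f(x - eps))
        grad /= 2.0 * sigma**2 * pop
        x += lr * n * grad            # ascent step (we are maximizing)
        history.append(grad.copy())
        history = history[-k:]
    return x

# Toy stand-in for "accuracy as a function of a strategy vector" (hypothetical).
target = np.array([0.3, 0.7, 0.1, 0.9, 0.5])
score = lambda v: -np.sum((v - target) ** 2)
best = guided_es_maximize(score, np.zeros(5))
```

In a real setting each call to `f` would involve training or evaluating a model under the augmentation strategy encoded by the vector, so the number of evaluations per iteration (`2 * pop` here) dominates the search cost.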

                 Citation format (Chinese): 朱光辉, 陈文忠, 朱振南, 袁春风, 黄宜华. 基于自引导进化策略的高效自动化数据增强算法. 软件学报, 2024,
                 35(6): 3013–3035. http://www.jos.org.cn/1000-9825/6894.htm
                 Citation format (English): Zhu GH, Chen WZ, Zhu ZN, Yuan CF, Huang YH. Efficient Automated Data Augmentation Algorithm Based on
                 Self-guided Evolution Strategy. Ruan Jian Xue Bao/Journal of Software, 2024, 35(6): 3013–3035 (in Chinese).
                 http://www.jos.org.cn/1000-9825/6894.htm

                 Efficient Automated Data Augmentation Algorithm Based on Self-guided Evolution Strategy

                 ZHU Guang-Hui, CHEN Wen-Zhong, ZHU Zhen-Nan, YUAN Chun-Feng, HUANG Yi-Hua
                 (State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023, China)

                 Abstract: Deep learning has achieved great success in image classification, natural language processing, and speech recognition. Data
                 augmentation can effectively increase the scale and diversity of training data, thereby improving the generalization of deep learning
                 models. However, for a given dataset, a well-designed data augmentation strategy relies heavily on expert experience and domain
                 knowledge and requires repeated attempts, which is time-consuming and labor-intensive. In recent years, automated data augmentation has
                 attracted widespread attention from the academic community and the industry through the automated design of data augmentation
                 strategies. To solve the problem that existing automated data augmentation algorithms cannot strike a good balance between prediction
                 accuracy and search efficiency, this study proposes an efficient automated data augmentation algorithm SGES AA based on a self-guided
                 evolution strategy. First, an effective continuous vector representation method is designed for the data augmentation strategy, and then the
                 automated data augmentation problem is converted into a search problem of continuous strategy vectors. Second, a strategy vector search
                 method based on the self-guided evolution strategy is presented. By introducing historical estimation gradient information to guide the
                 sampling and updating of exploration points, it can effectively avoid the local optimal solution while improving the convergence of the
                 search process. The results of extensive experiments on image, text, and speech datasets show that the proposed algorithm is superior to or


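                 *    Funding: National Natural Science Foundation of China (62102177, U1811461); Natural Science Foundation of Jiangsu Province (BK20210181); Key Research and Development Program of Jiangsu Province (BE2021729)
                  Received: 2022-06-27; Revised: 2022-09-12; Accepted: 2023-01-05; Published online by jos: 2023-05-24
                  CNKI online-first publication: 2023-05-26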