Journal of Software (Ruan Jian Xue Bao), ISSN 1000-9825, CODEN RUXUEW. E-mail: jos@iscas.ac.cn
Journal of Software, 2024, 35(6): 3013−3035 [doi: 10.13328/j.cnki.jos.006894] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
Efficient Automated Data Augmentation Algorithm Based on Self-guided Evolution Strategy*
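ZHU Guang-Hui, CHEN Wen-Zhong, ZHU Zhen-Nan, YUAN Chun-Feng, HUANG Yi-Hua
(State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023, Jiangsu, China)
Corresponding authors: ZHU Guang-Hui, E-mail: zgh@nju.edu.cn; HUANG Yi-Hua, E-mail: yhuang@nju.edu.cn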
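Abstract: Deep learning has achieved excellent performance on analysis tasks over media data such as images, text, and speech. Data
augmentation can effectively increase the scale and diversity of training data, thereby improving model generalization. However, for a
given dataset, designing a good data augmentation strategy relies heavily on expert experience and domain knowledge and requires
repeated trial and error, which is time-consuming and labor-intensive. In recent years, automated data augmentation, in which the
augmentation strategy is designed automatically by machine, has attracted wide attention from academia and industry. Existing automated
data augmentation algorithms cannot yet strike a good balance between prediction accuracy and search efficiency. To address this
problem, this paper proposes SGES AA, an automated data augmentation algorithm based on a self-guided evolution strategy. First, an
effective continuous vector representation of data augmentation strategies is designed, which converts the automated data augmentation
problem into a search over continuous strategy vectors. Second, a strategy vector search method based on the self-guided evolution
strategy is proposed. By using historical estimated-gradient information to guide the sampling and updating of exploration points, it
effectively avoids becoming trapped in local optima while also accelerating the convergence of the search. Extensive experiments on
image, text, and speech datasets show that, without significantly increasing search time, the proposed algorithm achieves prediction
accuracy better than or comparable to that of the current state-of-the-art automated data augmentation methods.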
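The two steps summarized in the abstract (a continuous strategy vector, then a gradient-guided evolutionary search over it) can be illustrated with a minimal sketch. It is not the authors' implementation. The names `decode_strategy` and `sges_search`, the two-entries-per-operation encoding, the operation names, every hyperparameter (including the fixed mixing probability `alpha`), and the toy objective standing in for validation accuracy are all assumptions made for illustration only.

```python
import math
import random

def decode_strategy(vec, ops):
    """Decode a flat vector into (operation, probability, magnitude) triples.

    Hypothetical encoding: two consecutive entries per operation, clipped to [0, 1].
    """
    clip = lambda v: min(1.0, max(0.0, v))
    return [(op, clip(vec[2 * i]), clip(vec[2 * i + 1])) for i, op in enumerate(ops)]

def sges_search(score, x0, iters=200, pairs=10, sigma=0.1, lr=0.05,
                k=4, alpha=0.5, seed=0):
    """Maximize `score` over [0, 1]^n with a simplified self-guided ES (assumed form)."""
    rng = random.Random(seed)
    x = list(x0)
    n = len(x)
    history = []  # the k most recent estimated gradients
    for _ in range(iters):
        grad = [0.0] * n
        for _ in range(pairs):
            if history and rng.random() < alpha:
                # Guided sample: random unit direction in the span of past gradients.
                coeffs = [rng.gauss(0.0, 1.0) for _ in history]
                d = [sum(c * g[j] for c, g in zip(coeffs, history)) for j in range(n)]
                norm = math.sqrt(sum(v * v for v in d)) or 1.0
                d = [v / norm for v in d]
            else:
                # Exploratory sample: isotropic Gaussian direction.
                d = [rng.gauss(0.0, 1.0) for _ in range(n)]
            # Antithetic evaluation at x + sigma*d and x - sigma*d.
            plus = score([a + sigma * b for a, b in zip(x, d)])
            minus = score([a - sigma * b for a, b in zip(x, d)])
            slope = (plus - minus) / (2.0 * sigma)
            for j in range(n):
                grad[j] += slope * d[j] / pairs
        # Gradient ascent step, projected back onto the valid box [0, 1]^n.
        x = [min(1.0, max(0.0, a + lr * g)) for a, g in zip(x, grad)]
        history = (history + [grad])[-k:]
    return x

if __name__ == "__main__":
    # Toy stand-in for "validation accuracy of a model trained with this strategy".
    target = [0.2, 0.8, 0.5, 0.3, 0.9, 0.1]
    toy_score = lambda v: -sum((a - b) ** 2 for a, b in zip(v, target))
    best = sges_search(toy_score, [0.5] * len(target))
    print(decode_strategy(best, ["rotate", "shear_x", "color"]))
```

In a real setting, `score` would train or fine-tune a model under the decoded strategy and return its validation accuracy, which is far more expensive than the toy quadratic used here. Mixing guided and isotropic samples is what lets the search reuse past gradient estimates for speed while still exploring directions outside their span.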
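Key words: deep learning; data augmentation; automated machine learning; self-guided evolution strategy
CLC number (Chinese Library Classification): TP391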
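Citation format (Chinese): ZHU Guang-Hui, CHEN Wen-Zhong, ZHU Zhen-Nan, YUAN Chun-Feng, HUANG Yi-Hua. Efficient Automated Data Augmentation Algorithm Based on Self-guided Evolution Strategy. Journal of Software, 2024,
35(6): 3013–3035. http://www.jos.org.cn/1000-9825/6894.htm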
Citation format (English): Zhu GH, Chen WZ, Zhu ZN, Yuan CF, Huang YH. Efficient Automated Data Augmentation Algorithm Based on Self-guided
Evolution Strategy. Ruan Jian Xue Bao/Journal of Software, 2024, 35(6): 3013–3035 (in Chinese).
http://www.jos.org.cn/1000-9825/6894.htm
Efficient Automated Data Augmentation Algorithm Based on Self-guided Evolution Strategy
ZHU Guang-Hui, CHEN Wen-Zhong, ZHU Zhen-Nan, YUAN Chun-Feng, HUANG Yi-Hua
(State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023, China)
Abstract: Deep learning has achieved great success in image classification, natural language processing, and speech recognition. Data
augmentation can effectively increase the scale and diversity of training data, thereby improving the generalization of deep learning
models. However, for a given dataset, a well-designed data augmentation strategy relies heavily on expert experience and domain
knowledge and requires repeated attempts, which is time-consuming and labor-intensive. In recent years, automated data augmentation has
attracted widespread attention from the academic community and the industry through the automated design of data augmentation
strategies. To solve the problem that existing automated data augmentation algorithms cannot strike a good balance between prediction
accuracy and search efficiency, this study proposes an efficient automated data augmentation algorithm SGES AA based on a self-guided
evolution strategy. First, an effective continuous vector representation method is designed for the data augmentation strategy, and then the
automated data augmentation problem is converted into a search problem of continuous strategy vectors. Second, a strategy vector search
method based on the self-guided evolution strategy is presented. By using historical estimated-gradient information to guide the
sampling and updating of exploration points, it can effectively avoid becoming trapped in local optima while improving the convergence speed of the
search process. The results of extensive experiments on image, text, and speech datasets show that the proposed algorithm is superior to or
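* Funding: National Natural Science Foundation of China (62102177, U1811461); Natural Science Foundation of Jiangsu Province (BK20210181); Key Research and Development Program of Jiangsu Province (BE2021729)
Received: 2022-06-27; Revised: 2022-09-12; Accepted: 2023-01-05; Published online by jos: 2023-05-24
CNKI online-first publication: 2023-05-26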