《软件学报》 Journal of Software, 2020, No. 10
Lou Wenqi et al.: A neural network instruction set extension and code mapping mechanism 3085
6 Conclusion
The wide use of CNNs in image recognition and object detection makes their performance critically important. By analyzing the computation patterns of typical CNNs, this paper proposed RV-CNN, an efficient and easy-to-implement domain-specific instruction set consisting of 10 coarse-grained matrix instructions, which can flexibly support the inference of typical CNN models. On this basis, we described the mapping process from a CNN model description file to RV-CNN instructions, and then qualitatively compared RV-CNN with representative domain-specific instruction sets from several perspectives. For the implementation, we extended the instruction set into a processor core based on the open-source RISC-V architecture and embedded the corresponding matrix unit into the classic five-stage pipeline in a relatively tightly coupled manner. Finally, the design was synthesized and implemented on the Xilinx ZC702 platform and evaluated with typical neural networks. The results show that, compared with an Intel i7-4790K processor and a Tesla K40c GPU, the prototype achieves the highest energy efficiency and code density; moreover, compared with prior accelerators, it delivers competitive energy efficiency while retaining flexibility.
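The lowering of a layer-by-layer model description into coarse-grained matrix instructions can be sketched as below. The mnemonics (MLOAD, MMM, MACT, MPOOL, MSTORE), the layer schema, and the lowering rules are illustrative assumptions for exposition only, not the actual RV-CNN encoding described in the paper.

```python
# Hypothetical sketch of mapping a CNN model description to
# coarse-grained matrix instructions, in the spirit of RV-CNN.
# Mnemonics and layer schema are assumptions, not the real ISA.

def lower_layer(layer):
    """Map one layer description to a list of coarse-grained instructions."""
    if layer["type"] == "conv":
        # With an im2col-style data layout, a convolution becomes a single
        # matrix-matrix multiply, so the layer lowers to load -> MMM ->
        # activation -> store.
        return ["MLOAD ifmap", "MLOAD weights", "MMM", "MACT relu", "MSTORE ofmap"]
    if layer["type"] == "pool":
        return ["MLOAD ifmap", "MPOOL max", "MSTORE ofmap"]
    if layer["type"] == "fc":
        # A fully connected layer is a matrix-vector (or matrix-matrix) product.
        return ["MLOAD ifmap", "MLOAD weights", "MMM", "MSTORE ofmap"]
    raise ValueError(f"unsupported layer type: {layer['type']}")

def lower_model(layers):
    """Concatenate the instruction sequences of all layers in order."""
    return [op for layer in layers for op in lower_layer(layer)]

model = [{"type": "conv"}, {"type": "pool"}, {"type": "fc"}]
program = lower_model(model)
print(len(program))  # 12 coarse-grained instructions for this toy model
```

Because each instruction covers a whole matrix operation rather than a scalar one, a short sequence like this suffices for an entire layer, which is the source of the code-density advantage noted above.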
New CNN architectures continue to emerge, so instruction-level optimization of operations such as depthwise separable convolution is still needed to improve execution efficiency. In addition, the design and implementation of the extended instructions should be co-optimized with the characteristics of RISC-V. Finally, the code mapping process, driven by model and hardware information, has not yet been automated. We plan to address these aspects in future work.
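The case for dedicated instruction support for depthwise separable convolution can be illustrated with a rough arithmetic-cost comparison. The layer dimensions below are arbitrary examples, not measurements from the paper.

```python
# Rough multiply-accumulate (MAC) counts for a standard convolution
# versus a depthwise separable one, motivating instruction-level
# support for the latter.  Dimensions are illustrative only.

def macs_standard(h, w, cin, cout, k):
    # Standard conv: every output pixel mixes all input channels
    # through a k x k window.
    return h * w * cin * cout * k * k

def macs_separable(h, w, cin, cout, k):
    depthwise = h * w * cin * k * k   # one k x k filter per input channel
    pointwise = h * w * cin * cout    # 1x1 conv mixes channels
    return depthwise + pointwise

std = macs_standard(56, 56, 128, 128, 3)
sep = macs_separable(56, 56, 128, 128, 3)
print(round(std / sep, 1))  # ~8.4x fewer MACs for the separable form
```

Since the depthwise stage reduces to many small per-channel multiplies rather than one large matrix product, a matrix ISA tuned for dense multiplies may execute it inefficiently, which is exactly the gap the planned instruction optimization would target.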