Page 93 - Journal of Software (Ruan Jian Xue Bao), 2025, No. 9

4004                                                       Journal of Software (Ruan Jian Xue Bao), 2025, Vol. 36, No. 9


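optimized implementations for extension platforms. As the number of hardware platforms and vectorized operators it supports grows, introducing a variable-length hardware abstraction layer into GGML would help avoid reimplementing each operator for every hardware platform, reduce development complexity, and improve the maintainability of the library. The deep learning library PyTorch and the linear algebra library Eigen also both plan to bring the RISC-V Vector extension into their existing hardware abstraction layers. Applying the method described in this paper can help such libraries design and implement, with more flexibility, a hardware abstraction layer that is compatible with both existing fixed-length platforms and variable-length platforms, and thereby better implement a RISC-V Vector extension backend and improve the libraries' performance on RISC-V platforms. Integrating more SIMD or vector extensions into the hardware abstraction layer, in particular supporting emerging device platforms such as the RISC-V P extension, can enhance the functionality and applicability of the layer and is of significant value for advancing the RISC-V software ecosystem.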
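
As a minimal illustration of the point above, and not the implementation described in this paper, the following C++ sketch writes one operator a single time against a backend that reports its lane count through a function call instead of a compile-time constant. The names `FixedBackend`, `ScalableBackend`, `runtime_lanes`, and `axpy` are hypothetical and appear only in this sketch. The scalar inner loop stands in for what would be a single vector instruction in a real backend.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend standing in for a fixed-length SIMD platform
// (for example, a 128-bit register holding 4 floats).
struct FixedBackend {
    static std::size_t lanes() { return 4; }
};

// Hypothetical backend standing in for a variable-length platform such as
// the RISC-V Vector extension: the lane count is only known at run time.
struct ScalableBackend {
    static std::size_t runtime_lanes;  // assumed to be set once at start-up, must be > 0
    static std::size_t lanes() { return runtime_lanes; }
};
std::size_t ScalableBackend::runtime_lanes = 8;

// y[i] += a * x[i], written once for every backend.
template <typename Backend>
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t step = Backend::lanes();  // queried, never hard-coded
    std::size_t i = 0;
    for (; i + step <= n; i += step) {
        // Models one vector operation over `step` lanes.
        for (std::size_t l = 0; l < step; ++l) {
            y[i + l] += a * x[i + l];
        }
    }
    // Scalar tail for the elements that do not fill a whole vector.
    for (; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Calling `axpy<FixedBackend>(a, x, y)` and `axpy<ScalableBackend>(a, x, y)` runs the same operator source on both kinds of backend. The only design constraint the sketch imposes is that operator code never assumes a particular vector width.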

Acknowledgments   We thank the OpenCV community maintainers Vadim Pisarevsky, Alexander Smorkalov, and Maksim Shabunin for their suggestions and help with this work.
