Page 90 - Ruan Jian Xue Bao/Journal of Software, 2024, No. 6

2666    Ruan Jian Xue Bao/Journal of Software, 2024, Vol. 35, No. 6


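high-level information, and the compiler then performs precise and efficient data-access optimization of the core loops through optimization modeling, solving for a loop optimization scheme, and code transformation. This improves the memory-access performance of applications while reducing the programming burden on users.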

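[Bar chart. Vertical axis: run time (ms), 0 to 2.50. Horizontal axis: number of result values to compare (256M, 512M, 1G, 2G, 4G, 8G). Series: main-memory access, automatic buffering optimization, manual buffering optimization. Bar labels per group, in legend order: 256M: 0.06, 0.05, 0.05; 512M: 0.13, 0.10, 0.10; 1G: 0.26, 0.19, 0.19; 2G: 0.52, 0.39, 0.38; 4G: 1.04, 0.77, 0.76; 8G: 2.08, 1.54, 1.53.]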

Fig. 19  Comparison of mem2sdm optimization results

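In the current design, programmers must specify information such as the address and length of the optimization buffer in the compiler directive. Future work will study how, when this information may be omitted, the compiler can work with runtime interfaces to query the currently available buffer size automatically and to manage buffer allocation and release automatically. On the one hand, this offers a chance to exploit more LDM space as optimization buffers; on the other hand, it can further simplify the use of the compiler directives and improve programmability.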
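To make the role of the buffer address and length concrete, the following is a minimal sketch in plain C of a loop that stages data through a caller-supplied buffer. It is not the paper's generated code or its mem2sdm kernel: the function name `count_mismatches_buffered`, its parameters, and the value-comparison workload are assumptions made for illustration, the directive syntax is not shown in this excerpt and is therefore replaced by ordinary function arguments, and `memcpy` merely stands in for a transfer from main memory into an on-chip buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example: count positions where two arrays differ,
 * staging both inputs through a buffer whose address (buf) and length
 * in elements (buf_len) are supplied by the caller, much as a directive
 * would supply them. memcpy stands in for a main-memory-to-buffer
 * transfer; it is not a real DMA interface. */
size_t count_mismatches_buffered(const int *a, const int *b, size_t n,
                                 int *buf, size_t buf_len)
{
    size_t chunk = buf_len / 2;  /* half of the buffer per input array */
    size_t mismatches = 0;

    if (buf == NULL || chunk == 0) {
        /* Buffer unusable: fall back to reading main memory directly. */
        for (size_t i = 0; i < n; i++)
            mismatches += (a[i] != b[i]);
        return mismatches;
    }

    for (size_t base = 0; base < n; base += chunk) {
        /* The last chunk may be shorter than the others. */
        size_t len = (n - base < chunk) ? (n - base) : chunk;

        /* Stage one chunk of each input into the buffer. */
        memcpy(buf, a + base, len * sizeof *a);
        memcpy(buf + chunk, b + base, len * sizeof *b);

        /* Compute only on buffered data. */
        for (size_t i = 0; i < len; i++)
            mismatches += (buf[i] != buf[chunk + i]);
    }
    return mismatches;
}
```

In this sketch the chunk size follows directly from the buffer length, which suggests why letting a runtime report the available space could be useful: a larger buffer would mean fewer, larger transfers without any change to the loop body.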
