Journal of Software (软件学报), 2024, Vol. 35, No. 6, p. 2666

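high-level information, and the compiler then carries out precise and efficient data-access optimization of the core loops through optimization modeling, solving for a loop optimization scheme, and code transformation. This improves the memory-access performance of the application while reducing the programming burden on the user.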

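[Figure: bar chart. Vertical axis: running time (ms), ticks from 0 to 2.50. Horizontal axis: number of result values to compare (256M, 512M, 1G, 2G, 4G, 8G). Series: main-memory access, automatic buffering optimization, manual buffering optimization.]

Figure 19 Comparison of mem2sdm optimization results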

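The current design requires the programmer to supply information such as the address and length of the optimization buffer in the compiler directive. Future work will study the case where this information may be omitted: the compiler, together with a runtime interface, would automatically query the buffer size currently available and manage its allocation and release. On the one hand, this offers a chance to exploit more LDM space as optimization buffers; on the other hand, it further simplifies the usage requirements of the compiler directives and improves programmability.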
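The defaulting behavior described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the paper's design: the runtime interface (`ldm_query_available`, `ldm_alloc`, `ldm_release`), the `buffer_spec` record standing in for the buffer information of a compiler directive, and the helpers `resolve_buffer` and `finish_buffer` are all hypothetical names, and the LDM is simulated by a fixed-size array rather than real on-chip memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the LDM: a fixed pool with a bump pointer. */
#define LDM_POOL_BYTES 4096u

static unsigned char ldm_pool[LDM_POOL_BYTES];
static size_t ldm_used = 0;

/* Hypothetical runtime query: bytes of LDM that are currently free. */
size_t ldm_query_available(void) {
    return LDM_POOL_BYTES - ldm_used;
}

/* Hypothetical runtime allocation; returns NULL if the request does not fit. */
void *ldm_alloc(size_t bytes) {
    if (bytes == 0 || bytes > ldm_query_available()) {
        return NULL;
    }
    void *p = &ldm_pool[ldm_used];
    ldm_used += bytes;
    return p;
}

/* Hypothetical release; buffers must be released in LIFO order. */
void ldm_release(void *p, size_t bytes) {
    assert((unsigned char *)p + bytes == &ldm_pool[ldm_used]);
    ldm_used -= bytes;
}

/* Buffer information as a directive might carry it (assumed layout).
 * addr == NULL means the programmer omitted the buffer address. */
typedef struct {
    void  *addr;
    size_t len;
    int    owned;  /* 1 if the runtime allocated it and must release it */
} buffer_spec;

/* Keep user-supplied values; otherwise query the free space and allocate.
 * Returns 1 on success, 0 if no buffer space could be obtained. */
int resolve_buffer(buffer_spec *spec) {
    if (spec->addr != NULL && spec->len > 0) {
        spec->owned = 0;           /* programmer-managed buffer */
        return 1;
    }
    size_t avail = ldm_query_available();
    if (avail == 0) {
        return 0;
    }
    /* Honor a requested length if it fits, else take all free space. */
    size_t want = (spec->len > 0 && spec->len < avail) ? spec->len : avail;
    void *p = ldm_alloc(want);
    if (p == NULL) {
        return 0;
    }
    spec->addr = p;
    spec->len = want;
    spec->owned = 1;
    return 1;
}

/* Release the buffer only if the runtime allocated it. */
void finish_buffer(buffer_spec *spec) {
    if (spec->owned) {
        ldm_release(spec->addr, spec->len);
        spec->addr = NULL;
        spec->len = 0;
        spec->owned = 0;
    }
}
```

In this sketch, a directive that supplies both address and length is passed through unchanged, while one that omits them receives whatever space the simulated runtime reports as free and returns it when the optimized loop finishes. A real implementation would also need to consider alignment and per-core LDM limits, which are left out here.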

