Page 219 - Ruan Jian Xue Bao/Journal of Software, 2020, Issue 10

Zhao Yuwen, et al.: Implementation and optimization of one-dimensional FFT on the Sunway 26010 many-core processor 3195

[10] Pippig M. PFFT: An extension of FFTW to massively parallel architectures. SIAM Journal on Scientific Computing, 2013,35(3):
C213–C236.
[11] Takahashi D. An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core
processors. In: Proc. of the Parallel Processing and Applied Mathematics. LNCS 6067, Berlin, Heidelberg: Springer-Verlag, 2010.
606–614.
[12] Song S, Hollingsworth JK. Designing and auto-tuning parallel 3-D FFT for computation-communication overlap. In: Proc. of the
19th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP 2014). 2014. [doi: 10.1145/2555243.2555249]
[13] Chen Y, Cui X, Mei H. Large-scale FFT on GPU clusters. In: Proc. of the 24th ACM Int’l Conf. on Supercomputing. ACM, 2010.
315–324.
[14] Cui X, Li XW, Chen YF. Programming method of dimensional array types and high performance FFT implementation. Ruan Jian
Xue Bao/Journal of Software, 2015,26(12):3104–3116 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4801.htm
[doi: 10.13328/j.cnki.jos.004801]
[15] Chen L, Gao GR. Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture. In: Proc. of the Spring
Simulation Multiconference, Springsim 2010. Orlando: DBLP, 2010. 1–8.
[16] Chen L, Hu Z, Lin J, et al. Optimizing the fast Fourier transform on a multi-core architecture. In: Proc. of the Parallel and
Distributed Processing Symp., IPDPS 2007. IEEE, 2007. 1–8.
[17] Govindaraju NK, Lloyd B, Dotsenko Y, Smith B, Manferdelli J. High performance discrete Fourier transforms on graphics
processors. In: Proc. of the 2008 ACM/IEEE Conf. on Supercomputing (SC 2008). 2008. [doi: 10.1109/SC.2008.5213922]
[18] Dotsenko Y, Baghsorkhi SS, Lloyd B, Govindaraju NK. Auto-Tuning of fast Fourier transform on graphics processors. In: Proc. of
the 16th ACM Symp. on Principles and Practice of Parallel Programming (PPoPP 2011). ACM Press, 2011. [doi: 10.1145/1941553.1941589]
[19] Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In: Proc. of the Int’l Conf. on Supercomputing.
Tsukuba: DBLP, 2010. 305–314.
[20] Asai R, Vladimirov A. Intel Cilk Plus for complex parallel algorithms: “Enormous fast Fourier transforms” (EFFT) library. Parallel
Computing, 2015,48:125–142.
[21] Nukada A, Matsuoka S. Auto-Tuning 3-D FFT library for CUDA GPUs. In: Proc. of the Conf. on High Performance Computing
Networking, Storage and Analysis (SC 2009). 2009. [doi: 10.1145/1654059.1654090]
[22] Nukada A, Ogata Y, Endo T, Matsuoka S. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In: Proc. of the 2008
ACM/IEEE Conf. on Supercomputing (SC 2008). 2008. [doi: 10.1109/SC.2008.5213210]
[23] Nukada A, Maruyama Y, Matsuoka S. High performance 3-D FFT using multiple CUDA GPUs. In: Proc. of the Workshop on
General Purpose Processing with Graphics Processing Units. ACM, 2012. 57–63.
[24] Nukada A, Sato K, Matsuoka S. Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: Proc. of the High
Performance Computing, Networking, Storage and Analysis. IEEE, 2012. 44.
[25] Liu YQ, Li Y, Zhang YQ, et al. Memory efficient two-pass 3D FFT algorithm for Intel® Xeon Phi™ coprocessor. Journal of
Computer Science and Technology, 2014,29(6):989–1002.
[26] Park J. Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors. In: Proc. of the Int’l Conf. on
High Performance Computing, Networking, Storage and Analysis. ACM, 2013. 34.
[27] Czechowski K, Battaglino C, McClanahan C, et al. On the communication complexity of 3D FFTs and its implications for exascale.
In: Proc. of the 26th ACM Int’l Conf. on Supercomputing. ACM, 2012. 205–214.
[28] Wang C, Chandrasekaran S, Chapman B. cusFFT: A high-performance sparse fast Fourier transform algorithm on GPUs. In: Proc.
of the 2016 IEEE Int’l Parallel and Distributed Processing Symp. IEEE, 2016. 963–972.
[29] Hassanieh H, Indyk P, Katabi D, et al. Simple and practical algorithm for sparse Fourier transform. In: Proc. of the 23rd Annual
ACM-SIAM Symp. on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2012. 1183–1194.
[30] López-Parrado A, Medina JV. Efficient software implementation of the nearly optimal sparse fast Fourier transform for the noisy
case. Ingeniería y Ciencia, 2015,11(22):73–94.
