Ruan Jian Xue Bao/Journal of Software, 2020, Issue 10

Zhao Yuwen, et al.: Implementation and optimization of one-dimensional FFT on the Sunway 26010 many-core processor


         [10]    Pippig M. PFFT: An extension of FFTW to massively parallel architectures. SIAM Journal on Scientific Computing, 2013,35(3):
             C213–C236.
         [11]    Takahashi D. An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core
             processors. In: Proc. of the Parallel Processing and Applied Mathematics. LNCS 6067, Berlin, Heidelberg: Springer-Verlag, 2010.
             606–614.
         [12]    Song S, Hollingsworth JK. Designing and auto-tuning parallel 3-D FFT for computation-communication overlap. In: Proc. of the
             19th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP 2014). 2014. [doi: 10.1145/2555243.2555249]
         [13]    Chen Y, Cui X, Mei H. Large-scale FFT on GPU clusters. In: Proc. of the 24th ACM Int’l Conf. on Supercomputing. ACM, 2010.
             315–324.
         [14]    Cui X, Li XW, Chen YF. Programming method of dimensional array types and high performance FFT implementation. Ruan Jian
             Xue Bao/Journal of Software, 2015,26(12):3104–3116 (in Chinese with English abstract).
             http://www.jos.org.cn/1000-9825/4801.htm [doi: 10.13328/j.cnki.jos.004801]
         [15]    Chen L, Gao GR. Performance analysis of Cooley-Tukey FFT algorithms for a many-core architecture. In: Proc. of the Spring
             Simulation Multiconference, Springsim 2010. Orlando: DBLP, 2010. 1–8.
         [16]    Chen L, Hu Z, Lin J, et al. Optimizing the fast Fourier transform on a multi-core architecture. In: Proc. of the Parallel and
             Distributed Processing Symp., IPDPS 2007. IEEE, 2007. 1–8.
         [17]    Govindaraju NK, Lloyd B, Dotsenko Y, Smith B, Manferdelli J. High performance discrete Fourier transforms on graphics
             processors. In: Proc. of the 2008 ACM/IEEE Conf. on Supercomputing (SC 2008). 2008. [doi: 10.1109/SC.2008.5213922]
         [18]    Dotsenko Y, Baghsorkhi SS, Lloyd B, Govindaraju NK. Auto-Tuning of fast Fourier transform on graphics processors. In: Proc. of
             the 16th ACM Symp. on Principles and Practice of Parallel Programming (PPoPP 2011). ACM Press, 2011.
             [doi: 10.1145/1941553.1941589]
         [19]    Gu L, Li X, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In: Proc. of the Int’l Conf. on Supercomputing.
             Tsukuba: DBLP, 2010. 305–314.
         [20]    Asai R, Vladimirov A. Intel Cilk Plus for complex parallel algorithms: “enormous fast Fourier transforms” (EFFT) library. Parallel
             Computing, 2015,48:125–142.
         [21]    Nukada A, Matsuoka S. Auto-Tuning 3-D FFT library for CUDA GPUs. In: Proc. of the Conf. on High Performance Computing
             Networking, Storage and Analysis (SC 2009). 2009. [doi: 10.1145/1654059.1654090]
         [22]    Nukada A, Ogata Y, Endo T, Matsuoka S. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In: Proc. of the 2008
             ACM/IEEE Conf. on Supercomputing (SC 2008). 2008. [doi: 10.1109/SC.2008.5213210]
         [23]    Nukada A, Maruyama Y, Matsuoka S. High performance 3-D FFT using multiple CUDA GPUs. In: Proc. of the Workshop on
             General Purpose Processing with Graphics Processing Units. ACM, 2012. 57–63.
         [24]    Nukada A, Sato K, Matsuoka S. Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: Proc. of the High
             Performance Computing, Networking, Storage and Analysis. IEEE, 2012. 44.
         [25]    Liu YQ, Li Y, Zhang YQ, et al. Memory efficient two-pass 3D FFT algorithm for Intel® Xeon Phi™ coprocessor. Journal of
             Computer Science and Technology, 2014,29(6):989–1002.
         [26]    Park J. Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors. In: Proc. of the Int’l Conf. on
             High Performance Computing, Networking, Storage and Analysis. ACM, 2013. 34.
         [27]    Czechowski K, Battaglino C, McClanahan C, et al. On the communication complexity of 3D FFTs and its implications for exascale.
             In: Proc. of the 26th ACM Int’l Conf. on Supercomputing. ACM, 2012. 205–214.
         [28]    Wang C, Chandrasekaran S, Chapman B. cusFFT: A high-performance sparse fast Fourier transform algorithm on GPUs. In: Proc.
             of the 2016 IEEE Int’l Parallel and Distributed Processing Symp. IEEE, 2016. 963–972.
         [29]    Hassanieh H, Indyk P, Katabi D, et al. Simple and practical algorithm for sparse Fourier transform. In: Proc. of the 23rd Annual
             ACM-SIAM Symp. on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2012. 1183–1194.
         [30]    López-Parrado A, Medina JV. Efficient software implementation of the nearly optimal sparse fast Fourier transform for the noisy
             case. Ingeniería y Ciencia, 2015,11(22):73–94.