Abstract
QR factorization of dense matrices is a ubiquitous tool in high performance computing (HPC). From solving linear systems, least squares problems, and eigenvalue problems to computing singular value decompositions, a high performance QR factorization is fundamental to computer simulations and many applications. In particular, QR factorization of batches of relatively small matrices has attracted significant attention in sparse direct solvers and in low-rank approximations for hierarchical matrices. To address this interest and demand, we develop and present a high performance batch QR factorization for Graphics Processing Units (GPUs). We present a multi-level blocking strategy that adapts various algorithmic design choices to the size of the input matrices. We also show that following the LAPACK QR design convention, while still useful, is significantly outperformed by unconventional code structures that increase data reuse. Performance results show multi-fold speedups over state-of-the-art libraries on the latest GPU architectures from both NVIDIA and AMD.
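The kernel underlying such batched routines is the classical Householder QR factorization, applied independently to every small matrix in the batch. The following is a minimal, illustrative sketch in NumPy (an unblocked reference version, not the paper's GPU implementation; the loop over the batch stands in for the independent per-matrix work that a GPU kernel would assign to thread blocks):

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: returns Q (m x m) and R (m x n) with A = Q R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        # Choose the reflection that avoids cancellation in v[0].
        alpha = -np.copysign(np.linalg.norm(x), x[0] if x[0] != 0 else 1.0)
        v = x.copy()
        v[0] -= alpha
        nv = np.linalg.norm(v)
        if nv == 0.0:          # column already reduced
            continue
        v /= nv
        # Apply H = I - 2 v v^T to the trailing submatrix and accumulate Q.
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

# A "batch" of relatively small matrices, factorized one by one.
rng = np.random.default_rng(0)
batch = [rng.standard_normal((8, 4)) for _ in range(3)]
for A in batch:
    Q, R = householder_qr(A)
    assert np.allclose(Q @ R, A)            # reconstruction
    assert np.allclose(Q.T @ Q, np.eye(8))  # orthogonality
```

In a production batched routine this per-matrix loop is replaced by parallel GPU kernels, and the reflectors are blocked (accumulated into a compact WY form) so that most of the work becomes matrix-matrix products with high data reuse.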