DOI: 10.1007/978-3-031-08751-6_5
Article

Batch QR Factorization on GPUs: Design, Optimization, and Tuning

Published: 21 June 2022

Abstract

QR factorization of dense matrices is a ubiquitous tool in high performance computing (HPC). From solving linear systems and least squares problems to eigenvalue problems and singular value decompositions, a high performance QR factorization is fundamental to computer simulations and many applications. More importantly, the QR factorization of batches of relatively small matrices has attracted considerable attention in sparse direct solvers and in low-rank approximations for hierarchical matrices. To address this interest and demand, we developed and present a high performance batch QR factorization for Graphics Processing Units (GPUs). We present a multi-level blocking strategy that adapts various algorithmic designs to the size of the input matrices. We also show that following the LAPACK QR design convention, while still useful, is significantly outperformed by unconventional code structures that increase data reuse. The performance results show multi-fold speedups against state-of-the-art libraries on the latest GPU architectures from both NVIDIA and AMD.
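To make the batched, blocked structure described in the abstract concrete, the following is a minimal CPU-side NumPy sketch of a Householder QR built from a panel factorization and a trailing-matrix update, applied independently to every matrix in a batch. It is an illustration under our own assumptions, not the authors' GPU kernels: the helper names householder_panel and batched_blocked_qr and the block size nb are hypothetical, and the trailing update applies reflectors one at a time rather than the accumulated (compact WY) block reflector a tuned blocked implementation would use.

import numpy as np

def householder_panel(A):
    # Unblocked Householder QR of a panel (modifies A in place).
    # Returns the reflector vectors V and scalars tau; R overwrites
    # the upper triangle of A.
    m, n = A.shape
    V = np.zeros((m, n))
    tau = np.zeros(n)
    for j in range(n):
        x = A[j:, j].copy()
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x.copy()
        v[0] -= alpha
        vnorm2 = v @ v
        if vnorm2 > 0.0:
            tau[j] = 2.0 / vnorm2
            # Apply H = I - tau * v v^T to the remaining panel columns.
            A[j:, j:] -= tau[j] * np.outer(v, v @ A[j:, j:])
        V[j:, j] = v
    return V, tau

def batched_blocked_qr(A_batch, nb=8):
    # Factor each (m x n) matrix in a 3-D array (batch, m, n): factor a
    # panel of nb columns, then update the trailing submatrix with the
    # panel's reflectors. Reflectors are applied one at a time here for
    # brevity; a blocked code would accumulate them into a block reflector.
    for A in A_batch:                       # each problem is independent
        m, n = A.shape
        for k in range(0, n, nb):
            kb = min(nb, n - k)
            V, tau = householder_panel(A[k:, k:k + kb])
            for j in range(kb):
                v = V[j:, j]
                A[k + j:, k + kb:] -= tau[j] * np.outer(v, v @ A[k + j:, k + kb:])
    return A_batch                          # R sits in the upper triangles

# Example: a batch of 1000 small 32x16 problems.
batch = np.random.rand(1000, 32, 16)
factored = batched_blocked_qr(batch.copy())
R0 = np.triu(factored[0][:16, :])           # R factor of the first problem

# Sanity check: R agrees with NumPy's QR up to row signs.
_, R_ref = np.linalg.qr(batch[0])
assert np.allclose(np.abs(R0), np.abs(R_ref))

In a GPU implementation of this kind, each independent problem in the outer loop would typically be assigned to its own thread block, with the panel held in registers or shared memory, which is the kind of data reuse the abstract contrasts with the conventional LAPACK-style code structure.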


Published in

Computational Science – ICCS 2022: 22nd International Conference, London, UK, June 21–23, 2022, Proceedings, Part I
June 2022, 796 pages
ISBN: 978-3-031-08750-9
DOI: 10.1007/978-3-031-08751-6

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

Publisher

Springer-Verlag, Berlin, Heidelberg

