Abstract
QR factorization of dense matrices is a ubiquitous tool in high performance computing (HPC). From solving linear systems, least squares problems, and eigenvalue problems to computing singular value decompositions, a high performance QR factorization is fundamental to computer simulations and many applications. In particular, QR factorization of batches of relatively small matrices has attracted significant attention in sparse direct solvers and in low-rank approximations for hierarchical matrices. To address this interest and demand, we develop and present a high performance batch QR factorization for Graphics Processing Units (GPUs). We present a multi-level blocking strategy that adapts various algorithmic design choices to the size of the input matrices. We also show that following the LAPACK QR design convention, while still useful, is significantly outperformed by unconventional code structures that increase data reuse. Performance results show multi-fold speedups over state-of-the-art libraries on the latest GPU architectures from both NVIDIA and AMD.
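The kernel underlying such batched routines is the classical Householder QR factorization, applied independently to every small matrix in the batch. The following is a minimal, illustrative sketch in NumPy (an unblocked reference version, not the paper's GPU implementation; the loop over the batch stands in for the independent per-matrix work that a GPU kernel would assign to thread blocks):

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: returns Q (m x m) and R (m x n) with A = Q R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        # Choose the reflection that avoids cancellation in v[0].
        alpha = -np.copysign(np.linalg.norm(x), x[0] if x[0] != 0 else 1.0)
        v = x.copy()
        v[0] -= alpha
        nv = np.linalg.norm(v)
        if nv == 0.0:          # column already reduced
            continue
        v /= nv
        # Apply H = I - 2 v v^T to the trailing submatrix and accumulate Q.
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

# A "batch" of relatively small matrices, factorized one by one.
rng = np.random.default_rng(0)
batch = [rng.standard_normal((8, 4)) for _ in range(3)]
for A in batch:
    Q, R = householder_qr(A)
    assert np.allclose(Q @ R, A)            # reconstruction
    assert np.allclose(Q.T @ Q, np.eye(8))  # orthogonality
```

In a production batched routine this per-matrix loop is replaced by parallel GPU kernels, and the reflectors are blocked (accumulated into a compact WY form) so that most of the work becomes matrix-matrix products with high data reuse.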