ABSTRACT
Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs.
This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem.
Supplemental Material
- I. S. Duff and J. K. Reid, "The Multifrontal Solution of Indefinite Sparse Symmetric Linear Equations," ACM Trans. Math. Softw., vol. 9, no. 3, pp. 302--325, September 1983. [Online]. Google ScholarDigital Library
- T. A. Davis, S. Rajamanickam, and W. M. Sid-Lakhdar, "A survey of direct methods for sparse linear systems," Acta Numer., vol. 25, pp. 383--âĂŞ566, 2016.Google ScholarCross Ref
- "Intel Math Kernel Library," available at http://software.intel.com/intel-mkl/.Google Scholar
- "NVIDIA CUBLAS," available at https://developer.nvidia.com/cublas.Google Scholar
- "rocBLAS, Next Generation BLAS Implementation for ROCm Platform," available at https://github.com/ROCmSoftwarePlatform/rocBLAS.Google Scholar
- H. Anzt, J. Dongarra, G. Flegar, and E. S. Quintana-Ortí, "Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors," Parallel Comput., vol. 81, pp. 131--146, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819117302107Google ScholarCross Ref
- E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects," J. Phys.: Conf. Ser., vol. 180, p. 012037, July 2009. [Online]. Google ScholarCross Ref
- X. S. Li and J. W. Demmel, "SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems," ACM Trans. Math. Softw., vol. 29, no. 2, pp. 110--âĂŞ140, June 2003. [Online]. Google ScholarDigital Library
- T. A. Davis, "Algorithm 832: UMFPACK V4.3---an Unsymmetric-Pattern Multifrontal Method," ACM Trans. Math. Softw., vol. 30, no. 2, pp. 196--âĂŞ199, June 2004. [Online]. Google ScholarDigital Library
- Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate," ACM Trans. Math. Softw., vol. 35, no. 3, October 2008. [Online]. Google ScholarDigital Library
- P. Hénon, P. Ramet, and J. Roman, "PaStiX: a high-performance parallel direct solver for sparse symmetric positive definite systems," Parallel Comput., vol. 28, no. 2, pp. 301--321, 2002. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819101001417Google ScholarDigital Library
- P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams, and A. Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling," SIAM J. Sci. Comput., vol. 38, no. 5, pp. S358--S384, 2016. [Online]. Google ScholarCross Ref
- P. Ghysels and R. Synk, "High performance sparse multifrontal solvers on modern gpus," Parallel Comput., vol. 110, p. 102897, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819122000059Google ScholarDigital Library
- P. R. Amestoy and L. S. Duff, "Vectorization of a Multiprocessor Multifrontal Code," Int. J. High Perform. Comput. Appl., vol. 3, no. 3, pp. 41--59, September 1989. [Online]. Google ScholarDigital Library
- A. Gupta, "WSMP: Watson Sparse Matrix Package Part II âĂŞ direct solution of general systems," IBM T. J. Watson Research Center, Tech. Rep., 2000, https://s3.us.cloud-object-storage.appdomain.cloud/res-files/1331-wsmp2.pdf.Google Scholar
- A. A. Auer, G. Baumgartner, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Krishnamoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov, "Automatic code generation for many-body electronic structure methods: the tensor contraction engine," Mol. Phys., vol. 104, no. 2, pp. 211--228, 2006. [Online]. Google ScholarCross Ref
- O. E. B. Messer, J. A. Harris, S. Parete-Koon, and M. A. Chertkow, "Multicore and Accelerator Development for a Leadership-Class Stellar Astrophysics Code," in Applied Parallel and Scientific Computing, P. Manninen and P. Öster, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 92--106.Google Scholar
- A. Abdelfattah, T. Costa, J. Dongarra, M. Gates, A. Haidar, S. Hammarling, N. J. Higham, J. Kurzak, P. Luszczek, S. Tomov, and M. Zounon, "A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines," ACM Trans. Math. Softw., vol. 47, no. 3, June 2021. [Online]. Google ScholarDigital Library
- K. Kim, T. B. Costa, M. Deveci, A. M. Bradley, S. D. Hammond, M. E. Guney, S. Knepper, S. Story, and S. Rajamanickam, "Designing Vector-friendly Compact BLAS and LAPACK Kernels," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. New York, NY, USA: ACM, 2017, pp. 55:1--55:12. [Online]. Google ScholarDigital Library
- A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, "Performance, Design, and Autotuning of Batched GEMM for GPUs," in High Performance Computing, ser. ISC High Performance, J. M. Kunkel, P. Balaji, and J. Dongarra, Eds. Cham: Springer International Publishing, 2016, pp. 21--38. [Online]. Google ScholarCross Ref
- G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 359--392, 1998. [Online]. Google ScholarDigital Library
- I. S. Duff and J. Koster, "The design and use of algorithms for permuting large entries to the diagonal of sparse matrices," SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 889--901, 1999. [Online]. Google ScholarDigital Library
- M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, "SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Google ScholarDigital Library
- A. Abdelfattah, M. Baboulin, V. Dobrev, J. J. Dongarra, C. W. Earl, J. Falcou, A. Haidar, I. Karlin, T. V. Kolev, I. Masliah, and S. Tomov, "High-Performance Tensor Contractions for GPUs," in International Conference on Computational Science 2016, ICCS 2016, 6--8 June 2016, San Diego, California, USA, 2016, pp. 108--118. [Online]. Google ScholarDigital Library
- S. N. Yeralan, T. A. Davis, W. M. Sid-Lakhdar, and S. Ranka, "Algorithm 980: Sparse QR Factorization on the GPU," ACM Trans. Math. Softw., vol. 44, no. 2, August 2017. [Online]. Google ScholarDigital Library
- M. J. Anderson, D. Sheffield, and K. Keutzer, "A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers," in 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, pp. 2--13. [Online]. Google ScholarDigital Library
- T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, and J. Dongarra, "A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 972--981. [Online]. Google ScholarDigital Library
- A. Haidar, T. Dong, P. Luszczek, S. Tomov, and J. J. Dongarra, "Batched matrix computations on hardware accelerators based on GPUs," Int. J. High Perform. Comput. Appl., vol. 29, no. 2, pp. 193--208, 2015. [Online]. Google ScholarDigital Library
- A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, "Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures," J. Comput. Sci., vol. 26, pp. 226--236, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877750317311456Google ScholarCross Ref
- H. Anzt, J. Dongarra, G. Flegar, and E. S. Quintana-Ortí, "Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs," in Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, ser. PMAM'17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1âĂŞ-10. [Online]. Google ScholarDigital Library
- A. Charara, H. Ltaief, and D. Keyes, "Redesigning Triangular Dense Matrix Computations on GPUs," in Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833. Berlin, Heidelberg: Springer-Verlag, 2016, pp. 477--489. [Online]. Google ScholarDigital Library
- R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny, V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, W. Pazner, M. Stowell, V. Tomov, I. Akkerman, J. Dahm, D. Medina, and S. Zampini, "MFEM: A modular finite element methods library," Comput. Math. with Appl., vol. 81, pp. 42--74, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0898122120302583Google ScholarCross Ref
Comments