10.5555/3571885.3571919acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections

Addressing irregular patterns of matrix computations on GPUs and their impact on applications powered by sparse direct solvers

Published:18 November 2022Publication History

ABSTRACT

Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs.

This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem.

Skip Supplemental Material Section

Supplemental Material

SC22_Presentation_Ahmad.mp4

Presentation at SC '22

References

  1. I. S. Duff and J. K. Reid, "The Multifrontal Solution of Indefinite Sparse Symmetric Linear Equations," ACM Trans. Math. Softw., vol. 9, no. 3, pp. 302--325, September 1983. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. T. A. Davis, S. Rajamanickam, and W. M. Sid-Lakhdar, "A survey of direct methods for sparse linear systems," Acta Numer., vol. 25, pp. 383--âĂŞ566, 2016.Google ScholarGoogle ScholarCross RefCross Ref
  3. "Intel Math Kernel Library," available at http://software.intel.com/intel-mkl/.Google ScholarGoogle Scholar
  4. "NVIDIA CUBLAS," available at https://developer.nvidia.com/cublas.Google ScholarGoogle Scholar
  5. "rocBLAS, Next Generation BLAS Implementation for ROCm Platform," available at https://github.com/ROCmSoftwarePlatform/rocBLAS.Google ScholarGoogle Scholar
  6. H. Anzt, J. Dongarra, G. Flegar, and E. S. Quintana-Ortí, "Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors," Parallel Comput., vol. 81, pp. 131--146, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819117302107Google ScholarGoogle ScholarCross RefCross Ref
  7. E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects," J. Phys.: Conf. Ser., vol. 180, p. 012037, July 2009. [Online]. Google ScholarGoogle ScholarCross RefCross Ref
  8. X. S. Li and J. W. Demmel, "SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems," ACM Trans. Math. Softw., vol. 29, no. 2, pp. 110--âĂŞ140, June 2003. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. T. A. Davis, "Algorithm 832: UMFPACK V4.3---an Unsymmetric-Pattern Multifrontal Method," ACM Trans. Math. Softw., vol. 30, no. 2, pp. 196--âĂŞ199, June 2004. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate," ACM Trans. Math. Softw., vol. 35, no. 3, October 2008. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Hénon, P. Ramet, and J. Roman, "PaStiX: a high-performance parallel direct solver for sparse symmetric positive definite systems," Parallel Comput., vol. 28, no. 2, pp. 301--321, 2002. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819101001417Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams, and A. Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling," SIAM J. Sci. Comput., vol. 38, no. 5, pp. S358--S384, 2016. [Online]. Google ScholarGoogle ScholarCross RefCross Ref
  13. P. Ghysels and R. Synk, "High performance sparse multifrontal solvers on modern gpus," Parallel Comput., vol. 110, p. 102897, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819122000059Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. R. Amestoy and L. S. Duff, "Vectorization of a Multiprocessor Multifrontal Code," Int. J. High Perform. Comput. Appl., vol. 3, no. 3, pp. 41--59, September 1989. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Gupta, "WSMP: Watson Sparse Matrix Package Part II âĂŞ direct solution of general systems," IBM T. J. Watson Research Center, Tech. Rep., 2000, https://s3.us.cloud-object-storage.appdomain.cloud/res-files/1331-wsmp2.pdf.Google ScholarGoogle Scholar
  16. A. A. Auer, G. Baumgartner, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Krishnamoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov, "Automatic code generation for many-body electronic structure methods: the tensor contraction engine," Mol. Phys., vol. 104, no. 2, pp. 211--228, 2006. [Online]. Google ScholarGoogle ScholarCross RefCross Ref
  17. O. E. B. Messer, J. A. Harris, S. Parete-Koon, and M. A. Chertkow, "Multicore and Accelerator Development for a Leadership-Class Stellar Astrophysics Code," in Applied Parallel and Scientific Computing, P. Manninen and P. Öster, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 92--106.Google ScholarGoogle Scholar
  18. A. Abdelfattah, T. Costa, J. Dongarra, M. Gates, A. Haidar, S. Hammarling, N. J. Higham, J. Kurzak, P. Luszczek, S. Tomov, and M. Zounon, "A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines," ACM Trans. Math. Softw., vol. 47, no. 3, June 2021. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Kim, T. B. Costa, M. Deveci, A. M. Bradley, S. D. Hammond, M. E. Guney, S. Knepper, S. Story, and S. Rajamanickam, "Designing Vector-friendly Compact BLAS and LAPACK Kernels," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. New York, NY, USA: ACM, 2017, pp. 55:1--55:12. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, "Performance, Design, and Autotuning of Batched GEMM for GPUs," in High Performance Computing, ser. ISC High Performance, J. M. Kunkel, P. Balaji, and J. Dongarra, Eds. Cham: Springer International Publishing, 2016, pp. 21--38. [Online]. Google ScholarGoogle ScholarCross RefCross Ref
  21. G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 359--392, 1998. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. I. S. Duff and J. Koster, "The design and use of algorithms for permuting large entries to the diagonal of sparse matrices," SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 889--901, 1999. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, "SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Abdelfattah, M. Baboulin, V. Dobrev, J. J. Dongarra, C. W. Earl, J. Falcou, A. Haidar, I. Karlin, T. V. Kolev, I. Masliah, and S. Tomov, "High-Performance Tensor Contractions for GPUs," in International Conference on Computational Science 2016, ICCS 2016, 6--8 June 2016, San Diego, California, USA, 2016, pp. 108--118. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. N. Yeralan, T. A. Davis, W. M. Sid-Lakhdar, and S. Ranka, "Algorithm 980: Sparse QR Factorization on the GPU," ACM Trans. Math. Softw., vol. 44, no. 2, August 2017. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. J. Anderson, D. Sheffield, and K. Keutzer, "A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers," in 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, pp. 2--13. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, and J. Dongarra, "A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 972--981. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Haidar, T. Dong, P. Luszczek, S. Tomov, and J. J. Dongarra, "Batched matrix computations on hardware accelerators based on GPUs," Int. J. High Perform. Comput. Appl., vol. 29, no. 2, pp. 193--208, 2015. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, "Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures," J. Comput. Sci., vol. 26, pp. 226--236, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877750317311456Google ScholarGoogle ScholarCross RefCross Ref
  30. H. Anzt, J. Dongarra, G. Flegar, and E. S. Quintana-Ortí, "Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs," in Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, ser. PMAM'17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1âĂŞ-10. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. A. Charara, H. Ltaief, and D. Keyes, "Redesigning Triangular Dense Matrix Computations on GPUs," in Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833. Berlin, Heidelberg: Springer-Verlag, 2016, pp. 477--489. [Online]. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny, V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, W. Pazner, M. Stowell, V. Tomov, I. Akkerman, J. Dahm, D. Medina, and S. Zampini, "MFEM: A modular finite element methods library," Comput. Math. with Appl., vol. 81, pp. 42--74, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0898122120302583Google ScholarGoogle ScholarCross RefCross Ref

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Article Metrics

    • Downloads (Last 12 months)36
    • Downloads (Last 6 weeks)12

    Other Metrics

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader
About Cookies On This Site

We use cookies to ensure that we give you the best experience on our website.

Learn more

Got it!