ABSTRACT
We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind SAM graphs to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware.
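To make the streaming abstraction concrete, the sketch below models one SAM-style operation in plain Python: a two-way intersection of sorted coordinate streams, the merge a sparse dataflow machine performs to multiply two compressed sparse vectors elementwise. This is a minimal illustration under our own naming (scan, intersect), not SAM's actual primitive interface; swapping the intersection rule for a union would model sparse addition instead, which is the kind of primitive-level choice SAM dataflow graphs express.

```python
# Illustrative sketch only: a software model of streaming coordinate
# intersection for an elementwise sparse multiply z_i = x_i * y_i.
# Names and interfaces are assumptions for exposition, not SAM's.

def scan(crds, vals):
    """Yield (coordinate, value) pairs from one compressed level,
    mimicking a level scanner emitting a sorted coordinate stream."""
    yield from zip(crds, vals)

def intersect(a, b):
    """Merge two sorted coordinate streams, keeping only coordinates
    present in both (the sparse 'and' that multiplication needs)."""
    a, b = iter(a), iter(b)
    ca, cb = next(a, None), next(b, None)
    while ca is not None and cb is not None:
        if ca[0] == cb[0]:
            yield ca[0], ca[1], cb[1]  # matched coordinate: emit both values
            ca, cb = next(a, None), next(b, None)
        elif ca[0] < cb[0]:
            ca = next(a, None)         # x's coordinate has no partner; drop it
        else:
            cb = next(b, None)         # y's coordinate has no partner; drop it

# x = [0, 2, 0, 4, 5] and y = [1, 0, 0, 7, 0] in compressed form
x = scan([1, 3, 4], [2, 4, 5])
y = scan([0, 3], [1, 7])
print([(i, vx * vy) for i, vx, vy in intersect(x, y)])  # [(3, 28)]
```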