Abstract
With the slowing of Moore’s law, computer architects have turned to domain-specific hardware specialization to continue improving the performance and efficiency of computing systems. However, specialization typically entails significant modifications to the software stack to properly leverage the updated hardware. The lack of a structured approach for updating the compiler and the accelerator in tandem has impeded many attempts to systematize this procedure. We propose a new approach to enable flexible and evolvable domain-specific hardware specialization based on coarse-grained reconfigurable arrays (CGRAs). Our agile methodology employs a combination of new programming languages and formal methods to automatically generate the accelerator hardware and its compiler from a single source of truth. This enables the creation of design-space exploration frameworks that automatically generate accelerator architectures that approach the efficiencies of hand-designed accelerators, with a significantly lower design effort for both hardware and compiler generation. Our current system accelerates dense linear algebra applications but is modular and can be extended to support other domains. Our methodology has the potential to significantly improve the productivity of hardware-software engineering teams and enable quicker customization and deployment of complex accelerator-rich computing systems.
AHA: An Agile Approach to the Design of Coarse-Grained Reconfigurable Accelerators and Compilers