Heterogeneous programming models are becoming increasingly popular to support ever-evolving hardware architectures, especially new and emerging specialized accelerators optimized for specific tasks. While such programming models provide performance portability ...
A typical compiler flow relies on a uni-directional sequence of translation/optimization steps that lower the program's abstract representation, making it hard to preserve higher-level program information across each transformation step. On the other hand, ...
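As an illustrative sketch of that point (a toy IR, not any real compiler's), the snippet below shows how a one-way lowering step can discard a high-level fact, here a loop's parallelism annotation, that a later optimization pass would have needed; all class and pass names are hypothetical.

```python
# Toy one-way lowering pipeline (hypothetical IR; illustration only).
from dataclasses import dataclass, field

@dataclass
class HighLevelOp:
    name: str
    parallel: bool = True          # high-level fact: the loop nest is parallel

@dataclass
class LowLevelIR:
    instrs: list = field(default_factory=list)

def lower(op: HighLevelOp) -> LowLevelIR:
    # One-way translation: the `parallel` fact is dropped and cannot be recovered.
    return LowLevelIR(instrs=["cmp", "branch", f"{op.name}.scalar", "add", "jmp"])

def late_vectorize(ir: LowLevelIR) -> LowLevelIR:
    # A later pass that could have exploited the lost `parallel` flag must give up
    # or re-derive the fact through expensive analysis.
    return ir

print(late_vectorize(lower(HighLevelOp(name="matmul"))))
```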
Matrix factorization functions are used in many areas and often play an important role in the overall performance of applications. In the LAPACK library, matrix factorization functions are implemented with a blocked factorization algorithm, shifting ...
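To make "blocked factorization" concrete, here is a minimal NumPy sketch of a right-looking blocked Cholesky factorization; it only illustrates the blocking idea, not LAPACK's actual implementation, and the choice of Cholesky and the block size are assumptions made for the example.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky (lower triangular); illustration only."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Factor the small diagonal block (unblocked kernel).
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        L11 = A[k:k+b, k:k+b]
        if k + b < n:
            # Triangular solve for the panel below the diagonal block (BLAS-3 TRSM).
            A[k+b:, k:k+b] = np.linalg.solve(L11, A[k+b:, k:k+b].T).T
            L21 = A[k+b:, k:k+b]
            # Rank-b update of the trailing submatrix (BLAS-3 GEMM/SYRK),
            # where most of the work, and thus most of the performance, lives.
            A[k+b:, k+b:] -= L21 @ L21.T
    return np.tril(A)

# Quick check against the unblocked reference factorization.
M = np.random.rand(300, 300)
A = M @ M.T + 300 * np.eye(300)
assert np.allclose(blocked_cholesky(A, nb=64), np.linalg.cholesky(A))
```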
In this article, we propose a “full-stack” solution to designing high-capacity and low-latency on-chip cache hierarchies by starting at the circuit level of the hardware design stack. We propose a novel half VDD precharge 2T Gain Cell (GC) design for the ...
Chip multiprocessors (CMPs) with more cores generate more traffic to the last-level cache (LLC). Without a corresponding increase in LLC bandwidth, such traffic cannot be sustained, resulting in performance degradation. Previous research focused on data ...
Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. ...
Page-based virtual memory relies on TLBs to accelerate address translation. Nowadays, the gap between application workloads and TLB capacity continues to grow, causing many costly TLB misses and making the TLB a performance bottleneck. ...
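As a rough sketch of the mechanism this abstract refers to, the toy translator below caches recent virtual-to-physical page mappings and falls back to a much slower page-table walk on a miss; the page size, capacity, eviction policy, and names are assumptions for illustration, not how any real MMU is organized.

```python
# Toy address translation with a small TLB (illustration only; parameters assumed).
PAGE_SIZE = 4096
TLB_CAPACITY = 64

page_table = {}          # virtual page number -> physical frame number
tlb = {}                 # small cache of recent VPN -> PFN translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                       # TLB hit: fast path
        pfn = tlb[vpn]
    else:                                # TLB miss: costly page-table walk
        pfn = page_table[vpn]
        if len(tlb) >= TLB_CAPACITY:     # crude FIFO eviction; real TLBs differ
            tlb.pop(next(iter(tlb)))
        tlb[vpn] = pfn
    return pfn * PAGE_SIZE + offset

page_table[5] = 42
print(hex(translate(5 * PAGE_SIZE + 0x10)))   # physical address in frame 42
```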