Author: Tomov, Stanimire : Search

21 Results for: Author: Tomov, Stanimire

Searched The ACM Full-Text Collection (691,749 records)

Showing 1 - 20 of 21 Results

research-article
November 2022
Results Reproduced / v1.1
Artifacts Evaluated & Functional / v1.1
Artifacts Available / v1.1
Addressing irregular patterns of matrix computations on GPUs and their impact on applications powered by sparse direct solvers
SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2022, Article No.: 26, pp 1–14

Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are ...
Total Citations: 0; Total Downloads: 36; Last 12 Months: 36; Last 6 Weeks: 12
Supplementary Material: SC22_Presentation_Ahmad.mp4

research-article
Public Access
June 2021
Published By ACM
A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines
ACM Transactions on Mathematical Software (TOMS), Volume 47, Issue 3, September 2021, Article No.: 21, pp 1–23, https://doi.org/10.1145/3431921

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a ...
Total Citations: 11; Total Downloads: 433; Last 12 Months: 270; Last 6 Weeks: 33

research-article
Public Access
March 2020
Published By ACM
Load-balancing Sparse Matrix Vector Product Kernels on GPUs
ACM Transactions on Parallel Computing (TOPC), Volume 7, Issue 1, March 2020, Article No.: 2, pp 1–26, https://doi.org/10.1145/3380930

Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that ...
Total Citations: 18; Total Downloads: 1,370; Last 12 Months: 455; Last 6 Weeks: 52

research-article
July 2019
Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers
SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No.: 47, pp 1–11, https://doi.org/10.1109/SC.2018.00050

Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) ...
Total Citations: 17; Total Downloads: 114; Last 12 Months: 32; Last 6 Weeks: 1

research-article
November 2018
Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers
SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No.: 47, pp 1–11

Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) ...
Total Citations: 0; Total Downloads: 662; Last 12 Months: 28; Last 6 Weeks: 4

research-article
Public Access
November 2017
Published By ACM
Investigating half precision arithmetic to accelerate dense linear system solvers
ScalA '17: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 2017, Article No.: 10, pp 1–8, https://doi.org/10.1145/3148226.3148237

The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of ...
Total Citations: 35; Total Downloads: 728; Last 12 Months: 135; Last 6 Weeks: 16

research-article
Public Access
June 2017
Published By ACM
Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
ICS '17: Proceedings of the International Conference on Supercomputing, June 2017, Article No.: 5, pp 1–10, https://doi.org/10.1145/3079079.3079103

This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations,...
Total Citations: 12; Total Downloads: 510; Last 12 Months: 69; Last 6 Weeks: 7

research-article
Public Access
February 2017
Published By ACM
High-performance Cholesky factorization for GPU-only execution
GPGPU-10: Proceedings of the General Purpose GPUs, February 2017, pp 42–52, https://doi.org/10.1145/3038228.3038237

We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that ...
Total Citations: 9; Total Downloads: 815; Last 12 Months: 202; Last 6 Weeks: 33

research-article
November 2016
Towards achieving performance portability using directives for accelerators
WACCPD '16: Proceedings of the Third International Workshop on Accelerator Programming Using Directives, November 2016, pp 13–24

In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine ...
Total Citations: 1

research-article
September 2016
Published By ACM
Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
ACM Transactions on Mathematical Software (TOMS), Volume 43, Issue 2, September 2016, Article No.: 10, pp 1–18, https://doi.org/10.1145/2898347

Singular Value QR (SVQR) can orthonormalize a set of dense vectors with the minimum communication (one global reduction between the parallel processing units, and BLAS-3 to perform most of its local computation). As a result, compared to other ...
Total Citations: 9; Total Downloads: 224; Last 12 Months: 23; Last 6 Weeks: 0

research-article
Public Access
November 2015
Published By ACM
Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators
ScalA '15: Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 2015, Article No.: 5, pp 1–8, https://doi.org/10.1145/2832080.2832085

A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain ...
Total Citations: 2; Total Downloads: 230; Last 12 Months: 23; Last 6 Weeks: 4

research-article
Free
November 2015
Published By ACM
Efficient implementation of quantum materials simulations on distributed CPU-GPU systems
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2015, Article No.: 10, pp 1–12, https://doi.org/10.1145/2807591.2807654

We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly ...
Total Citations: 8; Total Downloads: 503; Last 12 Months: 29; Last 6 Weeks: 3

research-article
April 2015
Performance analysis and design of a hessenberg reduction using stabilized blocked elementary transformations for new architectures
HPC '15: Proceedings of the Symposium on High Performance Computing, April 2015, pp 135–142

The solution of nonsymmetric eigenvalue problems, Ax = λx, can be accelerated substantially by first reducing A to an upper Hessenberg matrix H that has the same eigenvalues as A. This can be done using Householder orthogonal transformations, which is a ...
Total Citations: 0; Total Downloads: 46; Last 12 Months: 8; Last 6 Weeks: 2

research-article
April 2015
Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product
HPC '15: Proceedings of the Symposium on High Performance Computing, April 2015, pp 75–82

This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver -- the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse ...
Total Citations: 8; Total Downloads: 126; Last 12 Months: 14; Last 6 Weeks: 1

research-article
Public Access
February 2015
Published By ACM
Optimization for performance and energy for batched matrix computations on GPUs
GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs, February 2015, pp 59–69, https://doi.org/10.1145/2716282.2716288

As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, ...
Total Citations: 6; Total Downloads: 362; Last 12 Months: 44; Last 6 Weeks: 7

research-article
Public Access
February 2015
Published By ACM
Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
PMAM '15: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, February 2015, pp 1–10, https://doi.org/10.1145/2712386.2712387

In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, ...
Total Citations: 9; Total Downloads: 443; Last 12 Months: 55; Last 6 Weeks: 3

abstract
Public Access
January 2015
Published By ACM
Towards batched linear solvers on accelerated hardware platforms
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, pp 261–262, https://doi.org/10.1145/2688500.2688534

As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers, is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for ...
Also Published in:
ACM SIGPLAN Notices: Volume 50 Issue 8, August 2015
Total Citations: 13; Total Downloads: 405; Last 12 Months: 33; Last 6 Weeks: 0

research-article
November 2014
Deflation strategies to improve the convergence of communication-avoiding GMRES
ScalA '14: Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 2014, pp 39–46, https://doi.org/10.1109/ScalA.2014.6

The generalized minimum residual (GMRES) method is a popular method for solving a large-scale sparse nonsymmetric linear system of equations. On modern computers, especially on a large-scale system, the communication is becoming increasingly expensive. ...
Total Citations: 0; Total Downloads: 48; Last 12 Months: 2; Last 6 Weeks: 1

research-article
November 2014
Domain decomposition preconditioners for communication-avoiding krylov methods on a hybrid CPU/GPU cluster
SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2014, pp 933–944, https://doi.org/10.1109/SC.2014.81

Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication-avoiding (CA) techniques can improve Krylov methods' performance on modern ...
Total Citations: 7; Total Downloads: 208; Last 12 Months: 3; Last 6 Weeks: 0

research-article
May 2014
Published By ACM
clMAGMA: high performance dense linear algebra with OpenCL
IWOCL '14: Proceedings of the International Workshop on OpenCL 2013 & 2014, May 2014, Article No.: 1, pp 1–9, https://doi.org/10.1145/2664666.2664667

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, ...
Total Citations: 9; Total Downloads: 186; Last 12 Months: 8; Last 6 Weeks: 3
