Optimization of Numerous Small Dense-Matrix-Vector Multiplications in H-Matrix Arithmetic on GPU

Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Dense matrix-vector multiplication is one of the most important and widely used matrix computations. It is provided as the general matrix-vector multiplication (GEMV) function in basic linear algebra subprograms (BLAS) libraries for a variety of computing hardware. Traditionally, studies have focused on multiplying one large dense matrix (a matrix that is long on each side) by a vector. However, some applications instead require the acceleration of numerous small dense-matrix-vector multiplications, a capability provided by batched BLAS libraries; the same computation also arises in hierarchical-matrix (H-matrix)-vector multiplication. In this study, we implemented numerous small dense-matrix-vector multiplications on a Pascal GPU and evaluated their performance. We analyzed the impact of the optimization parameters and achieved better performance than previous work: the maximum improvement over our previous implementation is 28.47%, and over the batched GEMV of MAGMA BLAS it is up to 81.81%. Moreover, we examined the use of two optimization parameters within one GPU kernel, applying one parameter to some of the matrices and the other parameter to the rest. Although the resulting improvement was limited (up to 5%), a performance gain was still achieved. Our results will serve as a useful reference for users who need to execute numerous small dense-matrix-vector multiplications on a GPU and want to optimize matrix-vector multiplication through hand-tuning and auto-tuning.
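The abstract's key idea, running one GPU kernel over a whole batch of small matrices with a tunable work-partitioning parameter, can be illustrated with a short sketch. The CUDA code below is a minimal, hypothetical batched GEMV (y_i = A_i * x_i) of the general kind the paper studies; it is not the authors' kernel and not the MAGMA BLAS API. Assumptions: row-major matrix storage, a pointer-array batch layout, and a template parameter THREADS_PER_ROW (a power of two dividing 32, with the block size a multiple of 32) standing in for one of the optimization parameters the abstract mentions.

// Minimal illustrative batched GEMV: one thread block per small matrix,
// with THREADS_PER_ROW threads cooperating on each output row.
// Hypothetical sketch only; not the kernel from the paper.
#include <cstdio>
#include <cuda_runtime.h>

template <int THREADS_PER_ROW>   // tunable: power of two, divides 32
__global__ void batched_gemv(const double* const* A, const double* const* x,
                             double* const* y, const int* m, const int* n) {
    const int mat    = blockIdx.x;            // which matrix of the batch
    const int rows   = m[mat], cols = n[mat];
    const int lane   = threadIdx.x % THREADS_PER_ROW;
    const int group  = threadIdx.x / THREADS_PER_ROW;
    const int groups = blockDim.x / THREADS_PER_ROW;
    // Pad the row loop so all threads of a warp stay converged for the shuffles.
    const int rowsPadded = ((rows + groups - 1) / groups) * groups;
    for (int row = group; row < rowsPadded; row += groups) {
        double sum = 0.0;
        if (row < rows)                       // strided partial dot product
            for (int col = lane; col < cols; col += THREADS_PER_ROW)
                sum += A[mat][row * cols + col] * x[mat][col];
        // Reduce partial sums within each THREADS_PER_ROW-wide lane group.
        for (int off = THREADS_PER_ROW / 2; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, off, THREADS_PER_ROW);
        if (lane == 0 && row < rows) y[mat][row] = sum;
    }
}

int main() {
    // Smoke test: a batch of one 2x3 matrix, y = A * x, expecting y = [6, 15].
    // (Error checks omitted for brevity.)
    const int m = 2, n = 3;
    double hA[m * n] = {1, 2, 3, 4, 5, 6};    // row-major
    double hx[n] = {1, 1, 1}, hy[m] = {0, 0};
    double *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof hA); cudaMalloc(&dx, sizeof hx); cudaMalloc(&dy, sizeof hy);
    cudaMemcpy(dA, hA, sizeof hA, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, sizeof hx, cudaMemcpyHostToDevice);
    // The batch is described by device-side arrays of pointers and sizes.
    double *hAp[] = {dA}, *hxp[] = {dx}, *hyp[] = {dy};
    int hm[] = {m}, hn[] = {n};
    double **dAp, **dxp, **dyp; int *dm, *dn;
    cudaMalloc(&dAp, sizeof hAp); cudaMalloc(&dxp, sizeof hxp); cudaMalloc(&dyp, sizeof hyp);
    cudaMalloc(&dm, sizeof hm);   cudaMalloc(&dn, sizeof hn);
    cudaMemcpy(dAp, hAp, sizeof hAp, cudaMemcpyHostToDevice);
    cudaMemcpy(dxp, hxp, sizeof hxp, cudaMemcpyHostToDevice);
    cudaMemcpy(dyp, hyp, sizeof hyp, cudaMemcpyHostToDevice);
    cudaMemcpy(dm, hm, sizeof hm, cudaMemcpyHostToDevice);
    cudaMemcpy(dn, hn, sizeof hn, cudaMemcpyHostToDevice);
    batched_gemv<4><<<1, 32>>>(dAp, dxp, dyp, dm, dn);   // grid size = batch count
    cudaMemcpy(hy, dy, sizeof hy, cudaMemcpyDeviceToHost);
    printf("y = [%g, %g]\n", hy[0], hy[1]);
    return 0;
}

Under this reading, the abstract's two-parameter scheme would amount to launching two instantiations of such a kernel (different THREADS_PER_ROW values) over two partitions of the batch, so that each small matrix is handled by the parameter value that suits its shape.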

Original language: English
Title of host publication: Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 9-16
Number of pages: 8
ISBN (Electronic): 9781728148823
DOIs: https://doi.org/10.1109/MCSoC.2019.00009
Publication status: Published - Oct 2019
Event: 13th IEEE International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019 - Singapore, Singapore
Duration: Oct 1, 2019 - Oct 4, 2019

Publication series

Name: Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019

Conference

Conference: 13th IEEE International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019
Country: Singapore
City: Singapore
Period: 10/1/19 - 10/4/19

Fingerprint

  • Matrix-vector multiplication
  • Optimization
  • Linear algebra
  • Parameter optimization
  • Hierarchical matrices
  • Auto-tuning
  • Pascal
  • Tuning
  • Multiplication
  • Optimise
  • Hardware
  • Kernel

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Electrical and Electronic Engineering
  • Control and Optimization

Cite this

Ohshima, S., Yamazaki, I., Ida, A., & Yokota, R. (2019). Optimization of Numerous Small Dense-Matrix-Vector Multiplications in H-Matrix Arithmetic on GPU. In Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019 (pp. 9-16). [8906754] (Proceedings - 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/MCSoC.2019.00009
