Optimization of hierarchical matrix computation on GPU

Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

The demand for dense matrix computations in large-scale, complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of H-matrices is more complex than that of dense or sparse matrices; thus, accelerating H-matrix computations is required. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times against various processors (Broadwell-EP, Skylake-SP, and Knights Landing) parallelized with OpenMP. The results show that, although HMVM entails computing many small GEMV operations, merging them into a single GPU kernel was the most effective implementation. Moreover, the performance of batched BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
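To make the merged-kernel idea concrete, the sketch below shows one way such a kernel could look. This is a minimal illustration, not the authors' implementation: the Leaf descriptor, row-major storage layout, block size, and atomicAdd-based accumulation are all assumptions made here for exposition. The point is that a single kernel launch processes every low-rank leaf y_leaf += U (V x_leaf) at once, rather than issuing one small GEMV launch per leaf.

```cuda
#include <cuda_runtime.h>

// Hypothetical flattened descriptor of one low-rank leaf A_i ~ U_i * V_i
// (row-major storage assumed; a real H-matrix code also has dense leaves).
struct Leaf {
    const double *U;    // m x k factor
    const double *V;    // k x n factor
    int m, n, k;        // leaf dimensions (k = rank)
    int rowOff, colOff; // position of the leaf within the global matrix
};

// Merged HMVM kernel: one thread block per leaf, y += U * (V * x).
// atomicAdd on double requires compute capability >= 6.0 (e.g., Tesla P100).
__global__ void hmvm_merged(const Leaf *leaves, const double *x, double *y)
{
    const Leaf L = leaves[blockIdx.x];
    extern __shared__ double t[];   // t = V * x_leaf, length k (<= max rank)

    // Step 1: t = V * x_leaf; each thread computes one row of V.
    for (int r = threadIdx.x; r < L.k; r += blockDim.x) {
        double s = 0.0;
        for (int c = 0; c < L.n; ++c)
            s += L.V[r * L.n + c] * x[L.colOff + c];
        t[r] = s;
    }
    __syncthreads();

    // Step 2: y_leaf += U * t; different leaves may share output rows,
    // so accumulate atomically into the global result vector.
    for (int r = threadIdx.x; r < L.m; r += blockDim.x) {
        double s = 0.0;
        for (int c = 0; c < L.k; ++c)
            s += L.U[r * L.k + c] * t[c];
        atomicAdd(&y[L.rowOff + r], s);
    }
}

// Launch: one block per leaf, shared memory sized for the largest rank.
// hmvm_merged<<<numLeaves, 128, maxRank * sizeof(double)>>>(dLeaves, dX, dY);
```

Batched BLAS (e.g., variable-size batched GEMV in MAGMA) amortizes the kernel-launch overhead across the many small leaf operations in a similar way without hand-writing the kernel, which is consistent with the abstract's observation that the two approaches perform comparably.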

Original language: English
Title of host publication: Supercomputing Frontiers - 4th Asian Conference, SCFA 2018, Proceedings
Editors: Rio Yokota, Weigang Wu
Publisher: Springer Verlag
Pages: 274-292
Number of pages: 19
ISBN (Print): 9783319699523
DOI: 10.1007/978-3-319-69953-0_16
Publication status: Published - Jan 1, 2018
Event: 4th Asian Conference on Supercomputing Frontiers, SCFA 2018 - Singapore, Singapore
Duration: Mar 26, 2018 - Mar 29, 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10776 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 4th Asian Conference on Supercomputing Frontiers, SCFA 2018
Country: Singapore
City: Singapore
Period: 3/26/18 - 3/29/18


All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Ohshima, S., Yamazaki, I., Ida, A., & Yokota, R. (2018). Optimization of hierarchical matrix computation on GPU. In R. Yokota, & W. Wu (Eds.), Supercomputing Frontiers - 4th Asian Conference, SCFA 2018, Proceedings (pp. 274-292). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10776 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-69953-0_16
