### Abstract

The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, the computation of H-matrices is more complex than that of dense and sparse matrices; thus, acceleration of H-matrix computations is required. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times with those of various processors (Broadwell-EP, Skylake-SP, and Knights Landing) parallelized with OpenMP. The results show that, although an HMVM computation consists of many small GEMV operations, merging these operations into a single GPU kernel was the most effective implementation. Moreover, the performance of the batched BLAS routines in the MAGMA library was comparable to that of the manually tuned GPU kernel.
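The abstract's central observation, that an HMVM reduces to many small, independent GEMV operations over the leaf blocks of the H-matrix, can be illustrated with a short sketch. This is a minimal NumPy illustration of the general technique, not the paper's CUDA implementation; the function name, block lists, and offset layout are all hypothetical:

```python
import numpy as np

def hmvm(dense_blocks, lowrank_blocks, x):
    """H-matrix-vector product y = A @ x, with A stored as leaf blocks.

    dense_blocks:   list of (i0, j0, D), D a small dense near-field block.
    lowrank_blocks: list of (i0, j0, U, V), the block approximated as
                    U @ V.T with small rank k = U.shape[1].
    """
    y = np.zeros_like(x)
    # Each dense leaf contributes one small GEMV.
    for i0, j0, D in dense_blocks:
        m, n = D.shape
        y[i0:i0 + m] += D @ x[j0:j0 + n]
    # Each low-rank leaf contributes two skinny GEMVs through the
    # k-length intermediate t = V.T @ x_block.
    for i0, j0, U, V in lowrank_blocks:
        t = V.T @ x[j0:j0 + V.shape[0]]
        y[i0:i0 + U.shape[0]] += U @ t
    return y
```

On a GPU, launching each of these small GEMVs as a separate kernel is dominated by launch overhead, which is why merging them into a single kernel, or dispatching them through batched BLAS routines such as those in MAGMA, pays off.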

Original language | English |
---|---|

Title of host publication | Supercomputing Frontiers - 4th Asian Conference, SCFA 2018, Proceedings |

Editors | Rio Yokota, Weigang Wu |

Publisher | Springer Verlag |

Pages | 274-292 |

Number of pages | 19 |

ISBN (Print) | 9783319699523 |

DOIs | https://doi.org/10.1007/978-3-319-69953-0_16 |

Publication status | Published - Jan 1 2018 |

Event | 4th Asian Conference on Supercomputing Frontiers, SCFA 2018 - Singapore, Singapore; Duration: Mar 26 2018 → Mar 29 2018 |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|

Volume | 10776 LNCS |

ISSN (Print) | 0302-9743 |

ISSN (Electronic) | 1611-3349 |

### Conference

Conference | 4th Asian Conference on Supercomputing Frontiers, SCFA 2018 |
---|---|

Country | Singapore |

City | Singapore |

Period | 3/26/18 → 3/29/18 |


### All Science Journal Classification (ASJC) codes

- Theoretical Computer Science
- Computer Science (all)

### Cite this

*Supercomputing Frontiers - 4th Asian Conference, SCFA 2018, Proceedings* (pp. 274-292). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10776 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-69953-0_16

**Optimization of hierarchical matrix computation on GPU.** / Ohshima, Satoshi; Yamazaki, Ichitaro; Ida, Akihiro; Yokota, Rio.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*Supercomputing Frontiers - 4th Asian Conference, SCFA 2018, Proceedings.* Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10776 LNCS, Springer Verlag, pp. 274-292, 4th Asian Conference on Supercomputing Frontiers, SCFA 2018, Singapore, Singapore, 3/26/18. https://doi.org/10.1007/978-3-319-69953-0_16


TY - GEN

T1 - Optimization of hierarchical matrix computation on GPU

AU - Ohshima, Satoshi

AU - Yamazaki, Ichitaro

AU - Ida, Akihiro

AU - Yokota, Rio

PY - 2018/1/1

Y1 - 2018/1/1

UR - http://www.scopus.com/inward/record.url?scp=85044440202&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85044440202&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-69953-0_16

DO - 10.1007/978-3-319-69953-0_16

M3 - Conference contribution

AN - SCOPUS:85044440202

SN - 9783319699523

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 274

EP - 292

BT - Supercomputing Frontiers - 4th Asian Conference, SCFA 2018, Proceedings

A2 - Yokota, Rio

A2 - Wu, Weigang

PB - Springer Verlag

ER -