Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters

Ichitaro Yamazaki, Ahmad Abdelfattah, Akihiro Ida, Satoshi Ohshima, Stanimire Tomov, Rio Yokota, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

HACApK is a software package for solving dense linear systems of equations and is used by other software packages, such as ppohBEM for solving boundary integral equations. To enable the solution of large-scale boundary value problems, HACApK hierarchically compresses the coefficient matrix and uses the BiConjugate Gradient Stabilized (BiCGStab) method to solve the linear system. To extend HACApK's capability, this paper describes how we ported the HACApK linear solver onto GPU clusters. Though the potential of GPUs has been widely accepted in high-performance computing, it remains a challenge to utilize GPUs for a solver like HACApK that requires fine-grained irregular computation and global communication. To utilize the GPUs, we integrated the variable-size batched GPU kernels that were recently released in the MAGMA software package. This is the first time variable-size batched kernels have been used in a solver or application code. We discuss several techniques to improve the performance of the batched kernels and demonstrate the effects of these techniques on two state-of-the-art GPU clusters. For instance, with two 14-core Intel Xeon CPUs and four NVIDIA P100 GPUs per node, the GPU kernels obtained a solver speedup of 8× on one node and 4× on eight nodes. We also show that when the inter-GPU communication becomes significant, the solution time can be further reduced by a factor of 2× by carefully designing the communication layer with the underlying node architecture in mind.
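
The BiCGStab iteration at the heart of the solver can be sketched as follows. This is a generic, unpreconditioned dense-matrix version in Python/NumPy for illustration only — it is not HACApK's Fortran implementation, which replaces the dense matrix-vector products below with hierarchical-matrix multiplications executed via MAGMA's variable-size batched GPU kernels. The function name `bicgstab` and the test system are illustrative choices, not part of the paper.

```python
import numpy as np

def bicgstab(matvec, b, x0=None, tol=1e-10, maxiter=200):
    """Unpreconditioned BiCGStab; matvec(v) applies the coefficient matrix A to v."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - matvec(x)
    r_hat = r.copy()                  # shadow residual, fixed across iterations
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    for _ in range(maxiter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = matvec(p)                 # first matvec of the iteration
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v             # intermediate residual
        t = matvec(s)                 # second matvec of the iteration
        omega = (t @ s) / (t @ t)     # stabilizing minimal-residual step
        x += alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        rho = rho_new
    return x

# Small nonsymmetric, diagonally dominant test system (stands in for the
# hierarchically compressed coefficient matrix of a boundary integral problem).
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = bicgstab(lambda v: A @ v, b)
```

Note that each iteration performs two matrix-vector products and several global reductions (the dot products) — precisely the fine-grained irregular computation and global communication the abstract identifies as the challenge on GPU clusters.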

Original language: English
Title of host publication: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 930-939
Number of pages: 10
ISBN (Print): 9781538643686
DOI: 10.1109/IPDPS.2018.00102
Publication status: Published - Aug 3, 2018
Event: 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018 - Vancouver, Canada
Duration: May 21, 2018 - May 25, 2018



All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management

Cite this

Yamazaki, I., Abdelfattah, A., Ida, A., Ohshima, S., Tomov, S., Yokota, R., & Dongarra, J. (2018). Performance of Hierarchical-matrix BiCGStab Solver on GPU Clusters. In Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018 (pp. 930-939). [8425246] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPS.2018.00102
