TY - GEN
T1 - A Scalable Parallel Partition Tridiagonal Solver for Many-Core and Low B/F Processors
AU - Mitsuda, Tatsuya
AU - Ono, Kenji
N1 - Funding Information:
The computations in this project were carried out using computer resources offered under the category of General Projects by the Research Institute for Information Technology, Kyushu University. This work was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) as part of the "Program for Promoting Researches on the Supercomputer Fugaku" (Digital Twins of Real World's Clean Energy Systems with Integrated Utilization of Super-simulation and AI).
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Tridiagonal systems are among the most fundamental computations in science, engineering, and mathematics. One solver for such systems is Tree Partitioning Reduction (TPR), a divide-and-conquer method that solves large-scale linear equations by partitioning them and computing the parts in parallel within the local memory of different threads. Herein, we propose an improved TPR algorithm with a parallel cyclic reduction flavor, which reduces the number of algorithm steps by approximately half while simultaneously increasing arithmetic intensity and cache reusability. A performance evaluation conducted on an Intel Skylake-SP microprocessor showed a high L1 cache hit ratio, and our solver was as much as 31 times faster on 32 threads for 262144 equations. On an NVIDIA Tesla P100 GPU, our method processed 10 MRow/s more than TPR and cuSPARSE.
AB - Tridiagonal systems are among the most fundamental computations in science, engineering, and mathematics. One solver for such systems is Tree Partitioning Reduction (TPR), a divide-and-conquer method that solves large-scale linear equations by partitioning them and computing the parts in parallel within the local memory of different threads. Herein, we propose an improved TPR algorithm with a parallel cyclic reduction flavor, which reduces the number of algorithm steps by approximately half while simultaneously increasing arithmetic intensity and cache reusability. A performance evaluation conducted on an Intel Skylake-SP microprocessor showed a high L1 cache hit ratio, and our solver was as much as 31 times faster on 32 threads for 262144 equations. On an NVIDIA Tesla P100 GPU, our method processed 10 MRow/s more than TPR and cuSPARSE.
UR - http://www.scopus.com/inward/record.url?scp=85136193139&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136193139&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW55747.2022.00142
DO - 10.1109/IPDPSW55747.2022.00142
M3 - Conference contribution
AN - SCOPUS:85136193139
T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
SP - 860
EP - 869
BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
Y2 - 30 May 2022 through 3 June 2022
ER -