NUMA-optimized parallel breadth-first search on multicore single-node system

Yuichiro Yasui, Katsuki Fujisawa, Kazushige Goto

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

26 Citations (Scopus)

Abstract

The breadth-first search (BFS) is one of the most important kernels in graph theory. The Graph500 benchmark measures the performance of any supercomputer performing a BFS in terms of traversed edges per second (TEPS). Previous studies have proposed hybrid approaches that combine a well-known top-down algorithm and an efficient bottom-up algorithm for large frontiers. This reduces some unnecessary searching of outgoing edges in the BFS traversal of a small-world graph, such as a Kronecker graph. In this paper, we describe a highly efficient BFS using column-wise partitioning of the adjacency list while carefully considering the non-uniform memory access (NUMA) architecture. We explicitly manage the way in which each working thread accesses a partial adjacency list in local memory during BFS traversal. Our implementation has achieved a processing rate of 11.15 billion edges per second on a 4-way Intel Xeon E5-4640 system for a scale-26 problem of a Kronecker graph with 2^26 vertices and 2^30 edges. Not all of the speedup techniques in this paper are limited to the NUMA architecture system. With our winning Green Graph500 submission of June 2013, we achieved 64.12 GTEPS per kilowatt hour on an ASUS Pad TF700T with an NVIDIA Tegra 3 mobile processor.
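
The hybrid idea mentioned in the abstract can be illustrated with a minimal single-threaded sketch. It is not the authors' implementation: the function names, the `threshold` switching rule, and the plain adjacency-list layout are assumptions made for illustration, and the NUMA-aware partitioning is not modeled. The graph is assumed to be undirected, so each adjacency list is symmetric.

```python
from typing import List


def top_down_step(adj: List[List[int]], frontier: List[int], parent: List[int]) -> List[int]:
    """Expand every frontier vertex and claim its unvisited neighbours."""
    next_frontier = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:
                parent[w] = v
                next_frontier.append(w)
    return next_frontier


def bottom_up_step(adj: List[List[int]], frontier: List[int], parent: List[int]) -> List[int]:
    """Let every unvisited vertex look for any neighbour in the frontier."""
    in_frontier = set(frontier)
    next_frontier = []
    for w in range(len(adj)):
        if parent[w] != -1:
            continue
        for v in adj[w]:
            if v in in_frontier:
                parent[w] = v
                next_frontier.append(w)
                break  # one parent is enough; skip the remaining edges
    return next_frontier


def hybrid_bfs(adj: List[List[int]], root: int, threshold: float = 0.05) -> List[int]:
    """Return a BFS parent array; -1 marks unreachable vertices.

    The switching rule (frontier larger than threshold * n) is a
    simplified, hypothetical heuristic chosen for this sketch.
    """
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier = [root]
    while frontier:
        if len(frontier) > threshold * n:
            frontier = bottom_up_step(adj, frontier, parent)
        else:
            frontier = top_down_step(adj, frontier, parent)
    return parent
```

The early `break` in the bottom-up step is where the saving comes from: once an unvisited vertex finds one parent in a large frontier, its remaining edges are never inspected.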

Original language: English
Title of host publication: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013
Pages: 394-402
Number of pages: 9
DOIs: https://doi.org/10.1109/BigData.2013.6691600
Publication status: Published - Dec 1 2013
Externally published: Yes
Event: 2013 IEEE International Conference on Big Data, Big Data 2013 - Santa Clara, CA, United States
Duration: Oct 6 2013 → Oct 9 2013

Publication series

Name: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013

Other

Other: 2013 IEEE International Conference on Big Data, Big Data 2013
Country: United States
City: Santa Clara, CA
Period: 10/6/13 → 10/9/13

Fingerprint

Data storage equipment
Supercomputers
Graph theory
Processing

All Science Journal Classification (ASJC) codes

  • Software

Cite this

Yasui, Y., Fujisawa, K., & Goto, K. (2013). NUMA-optimized parallel breadth-first search on multicore single-node system. In Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013 (pp. 394-402). [6691600] (Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013). https://doi.org/10.1109/BigData.2013.6691600

@inproceedings{cdf0cca42d894c9f98fec6c134cf6d12,
title = "NUMA-optimized parallel breadth-first search on multicore single-node system",
author = "Yuichiro Yasui and Katsuki Fujisawa and Kazushige Goto",
year = "2013",
month = "12",
day = "1",
doi = "10.1109/BigData.2013.6691600",
language = "English",
isbn = "9781479912926",
series = "Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013",
pages = "394--402",
booktitle = "Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013",

}

TY - GEN

T1 - NUMA-optimized parallel breadth-first search on multicore single-node system

AU - Yasui, Yuichiro

AU - Fujisawa, Katsuki

AU - Goto, Kazushige

PY - 2013/12/1

Y1 - 2013/12/1

UR - http://www.scopus.com/inward/record.url?scp=84893233072&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84893233072&partnerID=8YFLogxK

U2 - 10.1109/BigData.2013.6691600

DO - 10.1109/BigData.2013.6691600

M3 - Conference contribution

AN - SCOPUS:84893233072

SN - 9781479912926

T3 - Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013

SP - 394

EP - 402

BT - Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013

ER -