Dynamic optimization of load balance in MPI broadcast

Takesi Soga, Kouji Kurihara, Takeshi Nanri, Motoyoshi Kurokawa, Kazuaki Murakami

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

There are many algorithms that compose a broadcast from point-to-point communications, such as the binary tree and the binomial tree. Although many implementations of these algorithms are provided in MPI libraries such as MPICH [1], most of them assume that all processes begin the broadcast at the same time. That is, the order of the point-to-point communications in the broadcast is arranged numerically, according to the rank of each process. In practice, however, each process starts the broadcast at a different time, mainly because of the imbalance of workload among processes, and this causes unnecessary waiting time. Moreover, in a broadcast algorithm such as the binomial tree, the amount of communication differs among processes, so the load imbalance grows when a heavy workload and heavy communication fall on the same process. Our method addresses these problems dynamically by profiling the workload of each rank at runtime and adjusting the order of point-to-point communications according to that information.

Among the various algorithms for MPI_Bcast, the binomial tree is one of the most popular. It broadcasts a message to all M processes in the group in log M steps of point-to-point communications. At each step, every process that has already received the data sends it to a process that has not yet received it. Because the number of send operations differs among ranks, assigning a heavy workload to a rank that issues many sends in the tree lengthens the wait time of the ranks that receive the message from that heavily loaded rank. The performance of the broadcast also depends on the time at which each rank enters the broadcast, even when the same algorithm is chosen. These differences in starting times are caused mainly by load imbalance, such as differences in instruction counts or cache efficiency at each node, and are generally hard to predict before execution. It is therefore important to choose a better implementation of the algorithm according to the runtime behavior of the program.

To adapt the implementation of the algorithm to the behavior of the program, we introduce a virtual rank and a virtual rank table. A virtual rank represents the position of a process in the collective communication, and the virtual rank table maps real ranks to virtual ranks. By changing the entries of this table, the implementation can be adjusted to the load balance of the ranks. The load of each rank is estimated from its waiting time in MPI_Bcast. Virtual ranks that receive the message in earlier steps are responsible for larger numbers of sends to other ranks, so mapping real ranks with longer waiting times in previous MPI_Bcast calls to virtual ranks that receive the message earlier reduces the total waiting time.
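For illustration, the following C/MPI sketch shows one way a binomial-tree broadcast can be driven by such a virtual rank table: the tree is computed on virtual ranks and the table translates them into the real ranks that actually send and receive. This is a minimal sketch, not the authors' implementation; the names bcast_virtual and vrank_table are hypothetical, and the real root is assumed to occupy virtual rank 0.

```c
/* Minimal sketch of a binomial-tree broadcast driven by a virtual-rank table.
 * Assumptions: vrank_table is a permutation of 0..size-1 and the real root
 * occupies virtual rank 0 (vrank_table[0] == root). */
#include <mpi.h>

static void bcast_virtual(void *buf, int count, MPI_Datatype type,
                          MPI_Comm comm, const int *vrank_table)
{
    int size, rank, my_vrank = 0;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    /* Find this process's position (virtual rank) in the tree. */
    for (int v = 0; v < size; v++)
        if (vrank_table[v] == rank) { my_vrank = v; break; }

    /* Binomial tree on virtual ranks: in the step with a given mask, every
     * virtual rank below the mask already holds the data and sends it to
     * virtual rank (my_vrank + mask); ranks in [mask, 2*mask) receive. */
    for (int mask = 1; mask < size; mask <<= 1) {
        if (my_vrank < mask) {
            int dst = my_vrank + mask;
            if (dst < size)
                MPI_Send(buf, count, type, vrank_table[dst], 0, comm);
        } else if (my_vrank < (mask << 1)) {
            int src = my_vrank - mask;
            MPI_Recv(buf, count, type, vrank_table[src], 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}
```

Swapping two entries of vrank_table changes which real rank plays which role in the tree without altering the communication pattern itself, which is what the optimization described next exploits.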
The dynamic optimization method updates the table entries as follows. First, the wait time of each rank is measured from the wait operation on the receive request in each MPI_Bcast. This wait time is then compared with that of the previous MPI_Bcast, and if the difference exceeds a threshold, the rank calls MPI_Put to send its wait time to the optimizer rank. After every N calls to MPI_Bcast, the optimization is executed on the optimizer rank. From the information that has arrived so far, it finds the rank with the minimum wait time and the rank with the maximum wait time and marks them as candidates for exchanging their entries in the virtual rank table. After the optimization phase, the application phase is executed on all ranks in the communicator. At this time, the pair of ranks whose table entries are to be exchanged and the count N that indicates the next optimization time are propagated. On arrival of this information, each rank exchanges the corresponding entries of its own virtual rank table.

The method has been implemented experimentally on the RSCC (RIKEN Super Combined Cluster) at RIKEN, Japan. The experiment uses an MPI program that invokes MPI_Bcast repeatedly. In addition, a pseudo workload is executed before each MPI_Bcast on the rank that is the first receiver of the message from the root in the original binomial tree, so extra load on this rank is critical to the overall performance of MPI_Bcast. The experiment shows that the overall execution time of MPI_Bcast can be reduced by around 40%.
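The profiling-and-swap cycle described above can be sketched as follows, again only as an illustration under stated assumptions: the wait time of each broadcast is compared with the previous one, noticeable changes are pushed to an optimizer rank with MPI_Put, and every N broadcasts the table entries of the ranks with the minimum and maximum wait times are exchanged. The names report_and_optimize, OPTIMIZER, N, and THRESHOLD are hypothetical, window creation and the actual wait-time measurement are omitted, and a plain MPI_Bcast stands in for the paper's propagation of the decision during the application phase.

```c
/* Sketch of the dynamic optimization cycle; names and parameters below are
 * illustrative assumptions, not the paper's implementation. */
#include <math.h>
#include <mpi.h>

#define OPTIMIZER 0        /* rank that runs the optimization          */
#define N         100      /* optimize after every N broadcasts        */
#define THRESHOLD 1.0e-4   /* report only noticeable changes (seconds) */

typedef struct {
    int    *vrank_table;   /* virtual rank -> real rank (on every rank)   */
    double *wait_times;    /* per-rank wait times, exposed on OPTIMIZER   */
    MPI_Win wait_win;      /* window over wait_times (creation not shown) */
    double  prev_wait;
    int     bcast_count;
} bcast_opt_t;

/* Called once per MPI_Bcast with the wait time measured around the receive. */
static void report_and_optimize(bcast_opt_t *s, double wait, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* 1. If the wait time changed noticeably since the previous broadcast,
     *    push it to the optimizer rank with a one-sided put. */
    if (fabs(wait - s->prev_wait) > THRESHOLD) {
        MPI_Win_lock(MPI_LOCK_SHARED, OPTIMIZER, 0, s->wait_win);
        MPI_Put(&wait, 1, MPI_DOUBLE, OPTIMIZER, rank, 1, MPI_DOUBLE,
                s->wait_win);
        MPI_Win_unlock(OPTIMIZER, s->wait_win);
    }
    s->prev_wait = wait;

    /* 2. Every N broadcasts, the optimizer picks the ranks with the minimum
     *    and maximum wait times; all ranks then swap those two entries in
     *    their local copies of the virtual rank table. */
    if (++s->bcast_count % N == 0) {
        int swap[2] = { 0, 0 };
        if (rank == OPTIMIZER) {
            int lo = 0, hi = 0;
            for (int r = 1; r < size; r++) {
                if (s->wait_times[r] < s->wait_times[lo]) lo = r;
                if (s->wait_times[r] > s->wait_times[hi]) hi = r;
            }
            swap[0] = lo; swap[1] = hi;
        }
        MPI_Bcast(swap, 2, MPI_INT, OPTIMIZER, comm);

        for (int v = 0; v < size; v++) {
            if (s->vrank_table[v] == swap[0])      s->vrank_table[v] = swap[1];
            else if (s->vrank_table[v] == swap[1]) s->vrank_table[v] = swap[0];
        }
    }
}
```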

Original language: English
Title of host publication: Recent Advances in Parallel Virtual Machine and Message Passing Interface - 14th European PVM/MPI Users' Group Meeting, Proceedings
Pages: 387-388
Number of pages: 2
Volume: 4757 LNCS
Publication status: Published - 2007
Event: 14th European PVM/MPI Users' Group Meeting on Parallel Virtual Machine and Message Passing Interface - Paris, France
Duration: Sep 30, 2007 - Oct 3, 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4757 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 14th European PVM/MPI Users' Group Meeting on Parallel Virtual Machine and Message Passing Interface
Country: France
City: Paris
Period: 9/30/07 - 10/3/07

Fingerprint

Load Balance
Dynamic Optimization
Broadcast
Communication
Workload
Table
Binary trees
Trees (mathematics)
Waiting Time
Experiments
Group Processes

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Soga, T., Kurihara, K., Nanri, T., Kurokawa, M., & Murakami, K. (2007). Dynamic optimization of load balance in MPI broadcast. In Recent Advances in Parallel Virtual Machine and Message Passing Interface - 14th European PVM/MPI Users' Group Meeting, Proceedings (Vol. 4757 LNCS, pp. 387-388). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4757 LNCS).
