Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, Masatsugu Ueda, Tapasya Patki, Daniel Ellsworth, Barry Rountree, Martin Schulz

Research output: Chapter in Book/Report/Conference proceedingConference contribution

13 Citations (Scopus)

Abstract

Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installingmore hardware resources than can be fully powered under a given power limit coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.

Original languageEnglish
Title of host publicationProceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages957-966
Number of pages10
ISBN (Electronic)9781538639146
DOIs
Publication statusPublished - Jun 30 2017
Event31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017 - Orlando, United States
Duration: May 29 2017Jun 2 2017

Publication series

NameProceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017

Other

Other31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017
CountryUnited States
CityOrlando
Period5/29/176/2/17

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture

Fingerprint Dive into the research topics of 'Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework'. Together they form a unique fingerprint.

Cite this