Coskun, Egele Partner with Sandia Labs for HPC Systems Research

By Bhumika Salwan (Questrom ’16)

CE Associate Professor Ayse Coskun and Assistant Professor Manuel Egele were awarded $189,000 for data analytics research with Sandia National Laboratories aimed at improving the energy efficiency and security of high performance computing (HPC) systems. Sandia Labs is one of the nation’s premier science and engineering laboratories for national security, with strategic areas in nuclear weapons, defense systems and assessments, energy and climate, and international, homeland, and nuclear security.

Professor Ayse Coskun’s research group at Boston University is widely cited, with expertise in energy-efficient computing, computer system modeling and simulation, the design of intelligent scheduling and power management techniques, and green computing in data centers and HPC systems. Professor Manuel Egele is an expert on systems and software security whose research has been published at top-tier peer-reviewed conferences including NDSS and CCS.

Their project aims to identify which data collected from HPC systems would be useful for identifying performance characteristics, inefficiencies, and malicious behavior. The researchers will then develop methods that use these data to design runtime strategies for improving efficiency and security. Professors Coskun and Egele’s research teams will first collect data on real HPC clusters at Sandia Labs and at the Massachusetts Green High Performance Computing Center (MGHPCC). They will then analyze that data to determine the smallest set of relevant metrics that are good indicators of energy and normal system behavior, and construct models that can predict performance variations and anomalous behavior resulting from security breaches or fraudulent activities.

The knowledge gained through this project will help users and administrators answer questions such as the following: How many resources (e.g., how many cores or how much memory) do I need for my application? Why does the performance of my application vary so widely across different runs? What information can we provide to system administrators to enable more efficient problem diagnosis? Can we determine whether software applications are behaving “normally”?