Calendar Details

For more information about item submission and attendance, see About the Technical Calendar.

Tuesday, April 09

On Determining a Viable Path to Resilience at Exascale

Frank Mueller, North Carolina State University, Raleigh
Computer Science and Mathematics Division Seminar
11:00 AM — 12:00 PM, Engineering Technology Facility (Building 5800), Room F-202
Contact: Christian Engelmann (engelmannc@ornl.gov), 865.574.3132

Abstract

Exascale computing is projected to feature billion core parallelism. At such large processor counts, faults will become more common place. Current techniques to tolerate faults focus on reactive schemes for recovery and generally rely on a simple checkpoint/restart mechanism. Yet, they have a number of shortcomings. (1) They do not scale and require complete job restarts. (2) Projections indicate that the mean-time-between-failures is approaching the overhead required for checkpointing. (3) Existing approaches are application-centric, which increases the burden on application programmers and reduces portability.

To address these problems, we discuss a number of techniques and their level of maturity (or lack thereof) to address these problems. These include (a) scalable network overlays, (b) on-the-fly process recovery, (c) proactive process-level fault tolerance, (d) redundant execution, (e) the effort of SDCs on IEEE floating point arithmetic and (f) resilience modeling. In combination, these methods are aimed to pave the path to exascale computing.

About the Speaker
Frank Mueller (mueller@cs.ncsu.edu) is a Professor in Computer Science and a member of multiple research centers at North Carolina State University. Previously, he held positions at Lawrence Livermore National Laboratory and Humboldt University Berlin, Germany. He received his Ph.D. from Florida State University in 1994. He has published papers in the areas of parallel and distributed systems, embedded and real-time systems and compilers. He is a member of ACM SIGPLAN, ACM SIGBED and a senior member of the ACM and IEEE Computer Societies as well as an ACM Distinguished Scientist. He is a recipient of an NSF Career Award, an IBM Faculty Award, a Google Research Award and a Fellowship from the Humboldt Foundation.