Large Scale Debugging of Parallel Tasks using Scaling Properties and “Triumph of Majority” Principle
Friday, October 28, 12PM – 1:15PM
Saurabh Bagchi, Associate Professor, School of Electrical & Computer Engineering, Purdue University
The number of cores used in large-scale systems will exceed a million in the near future, increasing the challenge of developing correct, high-performance applications. When an application fails or returns incorrect results, the developer must identify the offending parallel task and then the portion of the code in that task that caused the error. Traditional parallel debugging tools scale poorly to large task counts and overwhelm developers with information. We develop a detection tool, called AutomaDeD, that identifies the offending task and, to a customizable granularity, the relevant portion of code within the task. It monitors a parallel application at runtime to build a statistical model of the application’s typical timing and control-flow behavior. By comparing the behavior of clusters of parallel tasks temporally as well as spatially, AutomaDeD identifies the period in time, the task(s), and the error site, i.e., the region of code, where a fault first manifests itself.
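As a rough illustration of the spatial comparison across tasks (the "triumph of majority" idea), the sketch below flags the task and code region that deviate most from what the majority of tasks do. It is not AutomaDeD's model: it swaps in a simple median and median-absolute-deviation score, and the names `find_outlier`, `task_profiles`, and `rel_floor` are hypothetical, as is the assumption that each task reports one duration per code region.

```python
from statistics import median

def find_outlier(task_profiles, rel_floor=0.05):
    """Return (score, task_id, region_index) for the most anomalous entry.

    task_profiles: dict mapping task id -> list of per-region durations
                   (hypothetical input format, one number per code region).
    rel_floor:     assumed lower bound on the spread, as a fraction of the
                   region's median, so near-identical tasks do not produce
                   huge scores from tiny differences.
    """
    n_regions = len(next(iter(task_profiles.values())))
    best = (0.0, None, None)
    for region in range(n_regions):
        values = [profile[region] for profile in task_profiles.values()]
        med = median(values)  # what the majority of tasks do
        mad = median([abs(v - med) for v in values])
        spread = max(mad, rel_floor * abs(med), 1e-12)
        for task, profile in task_profiles.items():
            score = abs(profile[region] - med) / spread
            if score > best[0]:
                best = (score, task, region)
    return best

if __name__ == "__main__":
    profiles = {
        0: [1.0, 2.0, 3.0],
        1: [1.1, 2.1, 3.0],
        2: [0.9, 1.9, 3.1],
        3: [1.0, 9.0, 3.0],  # task 3 is slow in region 1
    }
    score, task, region = find_outlier(profiles)
    print(f"task {task}, region {region}, score {score:.1f}")
```

The score is only meaningful when most tasks behave correctly, which is the premise of the majority principle; a fault that affects every task equally would go unnoticed by this kind of comparison.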
An especially subtle class of bugs is the scale-dependent kind: small-scale test cases do not exhibit the bug, but it arises in large-scale production runs. State-of-the-art statistical bug detection techniques fail on such bugs because they detect abnormal behavior by comparing it with bug-free behavior. Unfortunately, for scale-dependent bugs, there may be no bug-free runs at large scales. In this talk, we will describe a statistical approach to detecting and localizing scale-dependent bugs. It detects bugs in large-scale runs by building models from bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values depend predictably on application scale.
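To make the notion of a scale-determined property concrete, the sketch below fits a model to bug-free observations at small scales and checks whether an observation at a large scale matches the extrapolation. It deliberately replaces KCCA with a much simpler log-log least-squares fit, so it only captures properties that follow a power law in the scale; the function names and the `tolerance` threshold are hypothetical.

```python
import math

def fit_power_law(scales, values):
    """Least-squares fit of log(value) = a + b * log(scale).

    scales, values: observations from bug-free small-scale runs
                    (all entries must be positive).
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(model, scale):
    """Extrapolate the fitted property to a new scale."""
    a, b = model
    return math.exp(a + b * math.log(scale))

def deviates(model, scale, observed, tolerance=0.25):
    """True if the observed value is off by more than `tolerance` (relative)."""
    expected = predict(model, scale)
    return abs(observed - expected) / expected > tolerance

if __name__ == "__main__":
    # Hypothetical property: messages sent per task, measured at 4 to 32 tasks.
    model = fit_power_law([4, 8, 16, 32], [8, 16, 32, 64])
    print(predict(model, 1024))
    print(deviates(model, 1024, 4096))
```

The point of the design is that no bug-free run at the large scale is needed: the reference behavior at 1024 tasks comes entirely from the small-scale training runs.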
We evaluate the tools on a parallel machine at Lawrence Livermore National Laboratory and on real bug cases.
Saurabh Bagchi is an Associate Professor in the School of Electrical and Computer Engineering and the Department of Computer Science at Purdue University in West Lafayette, Indiana. He is a senior member of IEEE and ACM, a "Teaching for Tomorrow" faculty fellow at Purdue University, and the Assistant Director of the CERIAS security center at Purdue. He was the PC chair for the IEEE/IFIP International Symposium on Dependable Systems and Networks (DSN) in 2011. He received his MS and PhD degrees from the University of Illinois at Urbana-Champaign in 1998 and 2001, respectively. At Purdue, he leads the Dependable Computing Systems Laboratory (DCSL), where he and a set of wildly enthusiastic students try to make and break distributed systems for the good of the world.
Hosted by Keshav Pingali