Performability: Optimizing the Computing Cloud
Although the mean time between failures (MTBF) of individual commodity components is high, their use in large, parallel systems can still lead to systemic failures, because the chance that at least one component fails grows with the number of components. Concurrently, scientific computing is moving rapidly from a world of “big iron” parallel computing to a world of distributed software, virtual organizations, and high-performance but unreliable systems with few guarantees of availability and quality of service. Our thesis is that the “two worlds” of software, distributed systems and parallel systems, must meet, embodying ideas from each, if we are to build resilient systems. This talk surveys some of these challenges, presents possible approaches for resilient, high-performance design (computing in the clouds), and muses on the future of such approaches.
Daniel Reed, Microsoft Research