Parallel Processing '08: Computing in a Cloudy Petascale Environment
June 11, 2008
Scientific computing is moving rapidly from a world of "big iron" parallel computing to a world of distributed software, virtual organizations, and high-performance, unreliable systems with few guarantees of availability or quality of service. Daniel Reed considered this shift in an invited talk at the 2008 SIAM Conference on Parallel Processing for Scientific Computing. His main thesis is that resilient systems will emerge only from a meeting of these two worlds---distributed systems and parallel systems---embodying ideas from each.
Drawing an analogy with the Sapir–Whorf hypothesis (the nature of a language influences the habitual thought of its speakers), Reed pointed out that the characteristics of the available computing systems shape research agendas. It is likely, then, that today's multicore machines, computing clouds, and services will affect the way science is done.
Computing systems have changed, from mainframes to grids and clusters and, more recently, to computing clouds and many-core machines. The new paraphernalia in the cyberinfrastructure are more than just devices: Fitting together in the form of clients, servers, sensors, and actuators, they form experiences. Memex, Vannevar Bush's 1940s dream of an information system capable of extending human capabilities, remains a prescient vision, although hints of it can be found in today's cyberinfrastructure.
Reed described the cloud as a research environment in which data and computing machinery provide software and services, such as application and hardware virtualization, knowledge discovery, and storage/data services. Such environments are necessary, he said, if correlations among ever-growing, distributed, multidisciplinary data sets are to be explored. These environments can have an impact on research agendas by providing the means for hypothesis-driven---"I have an idea, let me verify it"---or exploratory---"What correlations can I glean from the data?"---data analysis.
Considering sustained petascale performance and beyond, Reed offered representative figures: 5–20 petaflop/s peak performance, 250,000 to 2 million cores, and 10–20 megawatts of power. In such systems, he said, Joule's law is more dominant than Moore's law; that is, power consumption is a limiting factor.
He then embarked on a discussion of "cloudy petascale," where the main issues are fault tolerance and performability. In cloudy petascale computing, tools from statistics can be used to derive the macroscopic properties of the system. Helpful tools include population sampling, statistical mechanics (gas and entropy laws, for example), and fast multipole methods, which make use of cutoff radii and aggregation mechanisms. As to failure, he mentioned component-based failures, in which the failure of a single node blocks the overall simulation and causes loss of data. Checkpointing can help, he said, and he gave an optimal checkpointing interval.
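The formula Reed presented is not reproduced in this report. As an illustration only, a widely used first-order estimate is Young's approximation, in which the compute time between checkpoints is the square root of twice the product of the checkpoint cost and the mean time to failure. The sketch below assumes that approximation; the function names and the sample inputs (a 5-minute checkpoint, an 8-hour mean time to failure) are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.

    Assumption: this is a standard textbook approximation, not necessarily
    the formula shown in the talk.
    """
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mttf_s: float) -> float:
    """First-order fraction of time lost to checkpoint writes plus rework.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mttf)      : expected recomputation after a failure
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mttf_s)


if __name__ == "__main__":
    cost = 5 * 60        # hypothetical checkpoint cost: 5 minutes, in seconds
    mttf = 8 * 3600      # hypothetical system MTTF: 8 hours, in seconds
    tau = young_interval(cost, mttf)
    print(f"interval: {tau / 60:.1f} min, "
          f"overhead: {expected_overhead(tau, cost, mttf):.1%}")
```

The overhead function makes the trade-off visible: checkpointing more often than the estimated interval wastes time writing state, and checkpointing less often wastes time recomputing lost work.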
Reed mentioned two important drivers in the cloudy petascale environment: overall system size, defining the macroscale, and semiconductors, defining the microscale. On the macroscale side, assuming that each component has a probability of 0.99999 of operating through any given hour, he calculated the approximate mean time to failure for a system of 24,000 components to be a mere 8 hours. On the microscale side, he mentioned static power leakage, which increases with temperature, as an obstacle to reliability, along with soft memory errors, about ten percent of which are not caught by error-correcting codes. Because applications are susceptible to failure at both the macro- and microscales, he concluded that checkpointing is not enough.
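The macroscale estimate can be explored with a small model. The sketch below is a minimal one, assuming that components fail independently and at a constant rate, and that the failure of any one component brings down the system; none of these assumptions is stated in the talk summary, and the function names are illustrative. With the quoted inputs, this simple model gives a system mean time to failure of a few hours, the same order of magnitude as the figure reported above; the exact value depends on modeling details the summary does not give.

```python
import math


def system_mttf_hours(n_components: int, p_survive_hour: float) -> float:
    """System MTTF in hours when any single component failure is fatal.

    Assumptions (not from the talk): independent components, each with a
    constant failure rate, so lifetimes are exponentially distributed.
    """
    if n_components <= 0 or not (0.0 < p_survive_hour < 1.0):
        raise ValueError("need n_components > 0 and 0 < p_survive_hour < 1")
    # Per-component failure rate per hour, from p = exp(-rate * 1 hour).
    rate = -math.log(p_survive_hour)
    # Rates of independent exponential lifetimes add across components.
    return 1.0 / (n_components * rate)


def system_survival(n_components: int, p_survive_hour: float, hours: float) -> float:
    """Probability that all components survive the given number of hours."""
    return p_survive_hour ** (n_components * hours)


if __name__ == "__main__":
    n, p = 24_000, 0.99999   # inputs quoted in the report
    print(f"system MTTF: {system_mttf_hours(n, p):.2f} h")
    print(f"P(no failure in 1 h): {system_survival(n, p, 1.0):.3f}")
```

The point of the exercise is the scaling: the system failure rate grows linearly with the number of components, so even very reliable parts yield a machine that fails several times a day.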
As to programming models and styles, he listed threads, message passing, data parallel languages, PGAS languages, functional languages, and transactional memory, each of which creates its own resource definitions (input/output, communication, scheduling, and reliability), and concluded that there is no silver bullet to make parallel programming easy. We should accept failures as common in programming models, he said. Hence, integrated performability, in which dependability and performance are not decoupled, is required. Different strategies, such as resource selection, checkpointing and restarting, and algorithm-based fault tolerance, can be applied to programming models.
Having considered the scalable monitoring, adaptation, and performance analysis made possible by sampling and measurement theory, Reed turned to the question of how virtualization and service-level agreements (SLAs) can help identify different strategies for programming models and reliability. Only modest progress, in terms of programming models, has been observed in the parallel computing domain since the early 1980s, he pointed out, whereas progress in the distributed-system domain during the same era has been substantial. He argued for raising the abstraction level, both for assessing the holistic performance of applications---and not just kernels---and for achieving high levels of human productivity. He proposed SLAs and the cloud as the means to achieve that high-level abstraction.
SLAs can be viewed as contracts for certain behaviors, such as performance and reliability; examples include cloud databases with storage and bandwidth SLAs, and virtual clusters with performance SLAs. Implementing SLAs is hard even conceptually: It constitutes a multivariate optimization problem with many axes, such as performance, reliability, programmability, security, power consumption and cooling, input/output bandwidth and latency, and cost. On top of this multifaceted nature are measurement difficulties and the nonconvexity of the functions. Reed explained that some of these issues are addressed in VGrADS (Virtual Grid Application Development Software), touching on the separation of concerns between resource management and application development and offering examples of SLAs.
In closing, Reed considered the future of computing clouds and SLAs. Reiterating his view of two software worlds, distributed and centralized systems, he emphasized that systems of the two types are conceptualized differently: Failures are assumed on the distributed-system side, and feared on the centralized-system side. He stressed the need to reconceptualize hardware and software resources, and to embrace higher levels of abstraction and SLAs, as the deus ex machina model is not working effectively.
Bora Uçar works as a postdoctoral researcher on the Parallel Algorithms Project at CERFACS, in Toulouse, France.