Models and Algorithms for Exascale Computing Pose Challenges for Applied Mathematicians
December 2, 2013
Jeffrey Hittinger, Sven Leyffer, and Jack Dongarra
Computer architectures are changing, from the PC under your desk to the world's largest supercomputers. Processor clock speeds have been stagnating since 2002, and future performance gains are expected to come from increased concurrency through larger numbers of cores or specialized processing units. For supercomputers, constraints on power are expected to reduce the amount of memory per core, alter the organization of the memory, and reduce resilience. These changes in hardware architecture present opportunities for applied mathematicians to develop new models and algorithms with the new constraints in mind.
For several years, the U.S. Department of Energy, whose critical missions rely heavily on high-performance computing, has been preparing for these changes. Numerous DOE workshops have been held, with the focus mainly on the groundbreaking science that could be achieved at the exascale and on the computer science challenges that need to be addressed. It has become clear, however, that the full benefits of computing at the exascale cannot be achieved without substantial research on new algorithms and models, which will require close collaboration between applied mathematicians, computer scientists, and application domain experts.
DOE's Exascale Mathematics Working Group
The Advanced Scientific Computing Research Program in DOE's Office of Science has formed the Exascale Mathematics Working Group; the group's mission is to identify opportunities for mathematical and algorithmic research that will enable scientific applications to harness the full potential of exascale computing. EMWG is composed of applied mathematicians from across the DOE national laboratories. In the spring of 2013, EMWG issued a call to the greater applied mathematics community for position papers on exascale computing research challenges.
Of the 75 position papers received, 40 were selected for presentation and discussion at a workshop held in Washington, DC, August 21–22, 2013. More than 70 participants from DOE laboratories, universities, and U.S. government agencies attended the workshop. Topics of the position papers presented include scalable mesh and geometry generation, multiphysics and multiscale algorithms, in situ data analysis, adaptive precision, asynchronous algorithms, optimization, uncertainty quantification, and resilience.
Three main themes emerged at the workshop: hierarchies in models, algorithms, and decision processes that improve parallel performance; new approaches for exposing additional concurrency; and algorithmic approaches that address the resilience challenges. We summarize these ideas here; more details can be found in the original position papers and workshop presentations available on the EMWG website.
Hierarchies in Models and Algorithms
Many mathematical models live within hierarchies based on scales or physical fidelity. Exascale computing will provide an opportunity not only to develop hybrid models and algorithms that couple across scales, but also to use the hierarchy to accelerate or improve expensive algorithms at the fine scales. New algorithms will improve our understanding of the scale coupling and dynamics of the hierarchical physical processes. Hierarchical algorithms, such as multigrid and hierarchical adaptivity of mesh or order, should promote scalability by providing a means to decompose and coordinate efficient solution of problems.
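To make the idea of a hierarchical algorithm concrete, the following is a minimal sketch of a two-grid cycle, the building block of multigrid, for the 1D Poisson problem -u'' = f with zero boundary values. The model problem, the function names, and the parameter choices (weighted Jacobi smoothing with weight 2/3, full-weighting restriction, linear interpolation, a direct tridiagonal solve on the coarse grid) are assumptions made for illustration; they are not taken from the workshop papers.

```python
import math

def residual(u, f, h):
    """Residual r = f - A u for A = tridiag(-1, 2, -1) / h^2 (zero boundaries)."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi smoothing; the diagonal of A is 2 / h^2."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [u[i] + omega * (h * h / 2.0) * r[i] for i in range(len(u))]
    return u

def restrict(r):
    """Full weighting: coarse point j coincides with fine index 2j + 1."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(nc)]

def prolong(ec, n):
    """Linear interpolation of a coarse correction to the fine grid."""
    e = [0.0] * n
    for j, v in enumerate(ec):
        e[2 * j] += 0.5 * v
        e[2 * j + 1] += v
        e[2 * j + 2] += 0.5 * v
    return e

def solve_tridiag(f, h):
    """Thomas algorithm for tridiag(-1, 2, -1) / h^2 * x = f."""
    n = len(f)
    a, b = -1.0 / (h * h), 2.0 / (h * h)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = a / b, f[0] / b
    for i in range(1, n):
        m = b - a * cp[i - 1]
        cp[i] = a / m
        dp[i] = (f[i] - a * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def two_grid_cycle(u, f, h, pre=2, post=2):
    """Smooth, correct the error on a grid with spacing 2h, then smooth again."""
    u = jacobi(u, f, h, pre)
    rc = restrict(residual(u, f, h))
    ec = solve_tridiag(rc, 2.0 * h)      # coarse-grid error equation
    e = prolong(ec, len(u))
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, post)
```

The sketch assumes an odd number of interior fine-grid points (for example 31, with h = 1/32) so that every coarse point lines up with a fine point. Replacing the direct coarse solve with a recursive call would give a multigrid V-cycle; the smoothing sweeps are local and the coarse problem is small, which is the decomposition the paragraph above alludes to.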
Finding Additional Concurrency
Several position papers discussed the development of parallel-in-time algorithms. In parallel-in-time schemes, the space–time problem is decomposed in parallel and organized in a hierarchical, iterative (but physical) way such that, for a sufficiently large number of processors, one achieves additional parallel speed-up by exploiting concurrency in the temporal direction. Related approaches presented include pipelined Krylov solvers that hide synchronization and hierarchical, tree-based techniques for sparse linear systems that reduce communication. Iterative techniques effectively introduce a "pseudo-time," which may provide an additional dimension over which to decompose with relaxed synchronization. Exascale architectures should also make it possible to raise the level of abstraction, from basic forward simulation to optimal design and control or to uncertainty quantification, which should open additional scope for exploiting algorithmic concurrency. In situ data analysis also offers an opportunity to increase concurrency locally and to reduce global communication and synchronization.
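One widely known parallel-in-time scheme is parareal, and a small sketch may help show where the temporal concurrency comes from. The example below applies it to the scalar test equation y' = -y; the test equation, the forward-Euler propagators, and all function names are assumptions chosen for illustration and are not drawn from the position papers. A cheap coarse propagator sweeps serially across the time slices, while the expensive fine propagator is applied to every slice independently, which is the step that could be distributed across processors.

```python
def fine(y, t0, t1, substeps=100, lam=-1.0):
    """Fine propagator: many small forward-Euler steps for y' = lam * y."""
    h = (t1 - t0) / substeps
    for _ in range(substeps):
        y = y + h * lam * y
    return y

def coarse(y, t0, t1, lam=-1.0):
    """Coarse propagator: a single forward-Euler step over the whole slice."""
    return y + (t1 - t0) * lam * y

def parareal(y0, t_end, n_slices, n_iters):
    """Return slice boundaries and the parareal approximation at each one."""
    ts = [t_end * n / n_slices for n in range(n_slices + 1)]

    # Initial guess: one serial sweep with the cheap coarse propagator.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n], ts[n], ts[n + 1]))

    for _ in range(n_iters):
        # These per-slice solves are independent of one another, so they
        # could run concurrently (computed in a plain loop here).
        f_old = [fine(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        g_old = [coarse(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]

        # Serial correction sweep: cheap, uses only the coarse propagator.
        new = [y0]
        for n in range(n_slices):
            g_new = coarse(new[n], ts[n], ts[n + 1])
            new.append(g_new + f_old[n] - g_old[n])
        u = new
    return ts, u

def serial_fine(y0, t_end, n_slices):
    """Reference: the fine propagator applied slice after slice, in serial."""
    ts = [t_end * n / n_slices for n in range(n_slices + 1)]
    u = [y0]
    for n in range(n_slices):
        u.append(fine(u[n], ts[n], ts[n + 1]))
    return u
```

With as many iterations as time slices, the iteration reproduces the serial fine solution up to rounding, so a speed-up is possible only when far fewer iterations suffice. For instance, comparing `parareal(1.0, 2.0, 10, 3)` with `serial_fine(1.0, 2.0, 10)` shows how close three iterations come on this simple problem.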
Fault Tolerance and Resilience
A persistent undercurrent in the discussions was fault tolerance. It is still unclear how (un)reliable an exascale computer may be, but scalable means for recovering from faults like node failures will be needed to ensure scientific productivity. Currently, fault recovery is achieved predominantly by synchronous checkpointing/restarting, which will not be feasible with extreme concurrency. Algorithmic techniques for recovering from faults while preserving accuracy may become more important as hardware experts sort out the extent of low-level fault detection support that may be provided.
The road ahead contains a rich set of theoretical, algorithmic, and modeling challenges and opportunities brought about by the paradigm shift in computing architectures. There is much work to be done to develop the theory behind the stability, consistency, and accuracy of algorithms as we increase asynchrony, reduce communication, and decompose problems in search of more concurrency. Concerns about resilience are only one aspect of the greater issue of correctness, and applied mathematicians will need to reconsider verification and validation for hybrid, multiphysics, multiscale algorithms. The exascale challenges and opportunities will not only affect computation at the highest scale, but are expected to influence computational science at all scales and levels.
 P. Kogge, ed., CSE Department Tech. Rep. TR–2008–13, University of Notre Dame, September 28, 2008.
Jeffrey Hittinger is a computational scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. Sven Leyffer is a senior computational mathematician in the Mathematics and Computer Science Division at Argonne National Laboratory. Jack Dongarra is a university distinguished professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee and a distinguished research staff member in the Computer Science and Mathematics Division at Oak Ridge National Laboratory.