Riding the New Commodity Curves for Scientific ComputingDecember 18, 2012
Scientific computing problems are often very challenging, requiring large numbers of operations and significant data storage. For many years, the scientific computing community has experienced performance improvements by riding "commodity curves" in the computer industry. In particular, during this time the well-known Moore's law could be interpreted as "single processor performance will double every 18 months," primarily as a result of clock rate improvements and sophisticated single-processor scheduling strategies that make sequential code execute more quickly. Memory storage capacity and bandwidth also grew dramatically, and for applications that could use multiple processors via distributed memory and MPI (message passing interface), the improvements were even better. Interconnect networks, which tie independent processors together and make these separate processors act as a single cluster, were getting faster, cheaper, and capable of supporting larger processor counts. Over a span of two years, the multiplicative effect of these commodity curves was driving increases in performance for parallel applications and in data storage capacities by a factor of more than 10.
Currently, all these trends have stalled, or even reversed. Processor clock rates have either leveled off or decreased. Memory system capacities and performance trends, which are more complicated to explain, are also far less attractive. Interconnect networks have reached a limit for the very largest computers, and clusters are more costly to operate for everyone. The overarching concern that dampens all these trends is energy efficiency. We have reached practical efficiency thresholds for previous commodity curves. The free lunch is over. Or is it?
There is some good news. New commodity curves are emerging to replace the old ones. The challenge is that these curves require disruptive changes in how we analyze algorithms, design computer applications, and implement them in software. Whereas concurrency---simultaneous execution of operations---has always been a path to faster computations, it is now the only reliable path. The time to compute any given sequential stream of instructions will not improve dramatically for the foreseeable future. On today's computers, each computing node is an integral composition of increasing numbers of computing cores, including heterogeneous types, such as GPUs. Each core itself is designed to exploit greater concurrency, although some core types are optimized more for latency (for example, the Intel X86 line found in most laptops and desktops), others more for throughput (doing many, many things at once but taking longer to produce the first result). Although the exact design of node architectures will change, it is useful to think of a node as having a collection of latency-optimized and throughput-optimized cores.
Reasoning about Concurrency
Concurrency comes in many forms, which we consider in terms of patterns. The most common pattern used for distributed-memory parallelism (with MPI on a cluster) is single program, multiple data (SPMD). This strategy executes the same set of instructions on processors in a cluster, with each processor assigned a portion of the data and work. The SPMD pattern remains important for scientific computing and is still the dominant form of parallelism in large-scale computing, enabling hundreds of applications to perform high-fidelity computations.
Although SPMD across multiple nodes will remain an important pattern for parallel computing, the real focus for future performance growth is parallelism on the node. Multiple cores on a node permit multiple threads of execution.* Thus, with increasing core counts will come the opportunity to execute more threads. Furthermore, node architectures permit each core to have an increasing number of threads "in flight": If one thread stalls because it is waiting on data, the core can switch to execution of another thread, and then switch back when the second thread stalls or completes execution.
Within each core, another growing resource for concurrency is vectorization, based on a single instruction, multiple data (SIMD) pattern. With each new generation of nodes, the number of data elements that can be processed simultaneously via vectorization grows. Moreover, additional types of computations are becoming vectorizable, including operations for which data is not contiguous in memory and must be gathered or scattered. Vectorization supports performance of the same operation on a sequence of data elements---multiplying an array of values by a scalar value, for example.
Finally, memory system performance is becoming more latency-sensitive, and bandwidth is declining relative to operation count and becoming more localized in its availability. Consequently, the most successful algorithms will be those that permit more asynchronous activity (with latency hidden by having lots of requests in flight), prefer computation to data access (reducing memory bandwidth demands), and localize data requests to a small portion of the memory system (minimizing the amount of global data traffic), if all other concerns are similar.
Exploiting the New Commodity Curves
To enjoy the increased performance on future computers that will come "for free," we need to design and implement new algorithms that permit ample multi-thread parallelism, expose vectorizable computations and data requests, and respect the performance limitations of memory systems. Additionally, software architectures must be designed to expose threading and vectorization, and data structures must map well to memory systems. Finally, programmers must organize their code to readily expose computations that can be executed by increasing numbers of threads. They must also write code so that the compiler identifies vectorization, which is typically obtained from the innermost loop; the loops must be sufficiently long to realize performance gains on future nodes with longer vector register lengths. At the same time, data structures must be flexible enough to adapt to emerging memory system designs.
Using Patterns to Reason About Parallelism
So far we have discussed SPMD and SIMD/SIMT as execution patterns, but we have not discussed how the algorithm designer and programmer might reason about parallelism using patterns. Parallel patterns are a useful framework for designing and describing algorithms without being specific about the execution pattern. For example, the parallel-for pattern specifies that a loop with n iterations can be correctly executed regardless of the order used to traverse the loop. Algorithms that match the parallel-for pattern can be mapped to SPMD, SIMD/SIMT, and several threading execution patterns. For this reason, parallel-for is a very attractive pattern to identify and exploit. Parallel-reduce, a related pattern, produces a single scalar result, such as computing the norm of a vector.
Fortunately, many scientific computations can be organized in a parallel-for or parallel-reduce pattern. This arises from the common activity of working on a mesh, grid, or graph, where similar computations are performed at each node†, and many times these operations can happen in parallel across nodes with little or no interaction.
For many application developers, knowledge of the parallel-for and parallel-
reduce patterns will be sufficient for most analysis, design, and implementation. Al
though they are conceptually simple, ensuring the properties of these patterns can be challenging in practice. Use of global variables, e.g., common blocks and direct access to attributes in a class, can introduce cross-iteration dependencies, or ambiguities a compiler cannot handle. Even in these simple cases, programmers are thus aided by the discipline that use of patterns encourages.
Additional patterns include pipeline and task-graph, which enable abstract descriptions of data flow parallelism. A pipeline is composed of filters. Each filter transforms input data into output data, and a filter can have the property that multiple instances can execute concurrently or not. A task-graph describes a collection of operations (tasks) such that some must be done before others can proceed. The beauty of these two patterns is that each individual filter or task can be encapsulated as sequential code fragments and, for a well-designed parallel application, new functionality can be added by simply writing a new filter or task fragment. This, in turn, creates a separation of concerns: adding functionality and achieving parallel execution.
There are other patterns, and competing taxonomies. But the primary point is to develop an abstract pattern framework for analyzing, designing, and implementing software, and doing it in a way that supports (i) reasoning about important parallel properties that are inherent in algorithms, (ii) designing a software architecture that enables separation of the concerns of extending functionality and achieving parallel execution, and (iii) implementing software so that parallelism is realized.
Preserving Value Going Forward
One fundamental property of many scientific computing applications is that sophisticated mathematical expressions are encoded in loop bodies, which are executed over a set of discrete entities. For example, a finite volume formulation of a differential equation computes an updated solution value for each cell in the physical domain. Although the exact finite volume formulation can restrict the order of execution across cells, the primary concern of the loop body expression is accurate encoding of the discrete formulation of the differential equation. More precisely: Scientific domain knowledge is encoded primarily in the loop bodies. Accordingly, our software design should respect the integrity of loop body expressions, but should otherwise be flexible, admitting different data structures and ordering of computations. Using this property as a guiding principle will allow us to preserve the tremendous investment we have in existing code.
Parallel computing is a requirement for riding the new commodity performance curves that have emerged from energy efficiency concerns. For scientists who use computing environments like Matlab and Maple, the underlying implementation layers address parallelism, and beyond a conceptual knowledge of basic parallel patterns and proper organization of work and data, no other effort is required. For scientists who write their own software, the strategies described in this article should be a good set of guiding principles for analyzing, designing, and implementing parallel programs that have sustained value.
Michael Heroux is a distinguished member of the technical staff at Sandia National Laboratories. He is a member of the organizing committee for the 2013 SIAM Conference on Computational Science and Engineering, to be held in Boston, February 25 to March 1.
*The definitions of "core" and "thread" in this article are those commonly used in descriptions of the features of a microprocessor, e.g., from Intel or AMD. The GPU community, in particular Nvidia, uses different terminology. What we call a core is similar to a streaming multiprocessor (SM), which is composed of a fixed number of CUDA threads. SMs can have multiple thread states in flight (called the occupancy rate), and CUDA threads operate with a single instruction on multiple threads (SIMT), which is a bit more general than the SIMD model described here. The OpenCL community uses yet another taxonomy, but the concepts in all three cases are similar enough for our purposes.
†Notice that "node" as defined here is different from the "compute node" discussed above; both are commonly used, so we must use context to arbitrate.