## Parallel Processing '08: Automatic Tuning of High-Performance Numerical Libraries: State of the Art and Open Problems

**June 11, 2008**

**Parry Husbands**

As modern processors have become more complex, so has the task of tuning mathematical routines for performance. High-performance libraries must deal with multiple levels of cache, in addition to processors that can automatically reorder the incoming instruction stream. Optimizing compilers provide only partial help, because of the intricacy of some of the code transformations required and the difficulty in predicting the benefits of certain optimizations. Compilers, moreover, usually do not attempt the changes in data structure that might be helpful with some algorithms.

Faced with these challenges, library writers in recent years have eschewed careful hand crafting of functions in favour of a strategy that chooses the best performer from a set of automatically generated candidates. Code generators are built to explore an implementation space that typically includes loop rearrangements, changes in statement ordering, selection of algorithmic parameters, and data structure reorganisation. A benchmarking phase follows, with the generated implementations timed on sample inputs and the best performing one selected. This generate-and-test phase can be conducted either once, at install time, or at runtime, with real inputs. This strategy has been used in the construction of many popular libraries, including PHiPAC, ATLAS (dense BLAS), FFTW, SPIRAL (Fourier transforms), and OSKI (sparse matrix operations).

The basic idea behind auto-tuning is straightforward, but many details need to be worked out for its successful application. For example, while it is advantageous to have the choice of as many implementations as possible, generating and evaluating large numbers of candidates may be too time-consuming, particularly in libraries like OSKI, which perform tuning steps when called in an application.

The speakers in this two-part minisymposium reported on work carried out in Japan that advances the state-of-the art in several directions: infrastructure for easier specification of auto-tuned libraries, parameter studies that define algorithmic parameters for tuning, methods for designing algorithms that adapt automatically, and theoretical work.

Takahiro Katagiri of the University of Tokyo opened the first day's session with a talk devoted primarily to ABCLibScript, a description language for defining the possible implementations to be searched during benchmarking. ABCLibScript commands, which occur as comments in code (only Fortran-90 and MPI-1 are currently supported), describe the set of optimizations to be attempted. A separate pass then reads the annotated source code and produces candidate implementations. ABCLibScript essentially automates the task of producing a code generator, as all of the optimizations and tuning parameters can be specified declaratively. This work is part of a larger project, called FIBER, that attempts to tune larger blocks of code and efficiently estimate tuning parameters.

Takao Sakurai (Hitachi Japan) continued with a talk on tuning the well-known Lanczos method for eigenvalue computations. The problem under consideration here was how to choose a projection matrix size that would reduce the total running time of the computation. One strategy presented detected stagnation (stalled convergence) in the algorithm and automatically changed the projection matrix size to compensate. While not a prototypical auto-tuning application, this work touches on the broader area of designing algorithms that adapt by looking at their own execution history.

Tuning parallel FFTs---a classic auto-tuning case---was the topic of the next talk, by Daisuke Takahasi (University of Tsukuba). The basic algorithm treats a large one-dimensional FFT as, say, a two-dimensional FFT, where many smaller transforms are performed in parallel on the columns, followed by a transpose, and then by another set of transforms performed in parallel. The scheme presented used a three-dimensional formulation and searched over ways of reshaping the original data and the sizes of the lower-dimensional FFTs performed. Speedups of approximately 1.25 over the FFTW library were reported on a small 32-processor cluster.

The final speaker in the first session, Reiji Suda (University of Tokyo), described a Bayesian approach for exploring one of the open problems in automatic tuning---how to reduce the search space yet be guaranteed that a good candidate will be selected. The study, by building a performance model and performing a small number of experiments, selected routines by considering their costs and uncertainties and used a variety of metrics to evaluate them.

Toshiyuki Imamura (University of Electro-Communications, Japan) began the second session with a discussion of a parameter study of the locally optimal block preconditioned gradient (LOBPCG) algorithm. The major requirements were stability and speed. Choices included the orthogonalisation, deflation, and shift strategies, along with rearrangement of algorithmic operations. Hisayasu Kuroda followed with a presentation on the ILIB project at the University of Tokyo; the aim is to build a library of iterative methods in which both algorithmic parameters and details of parallel execution are tuned. The focus in this talk was on finding the best way to use MPI collective operations; performance numbers for a Windows cluster were shown.

Takafumi Miyata (Nagoya University) described an interesting use of automatic tuning in a heterogeneous environment. For the application presented, a ClearSpeed CX600 processor was used to accelerate an implementation of the complex Hessenberg QR algorithm. Because the accelerator was employed only to speed up matrix–matrix multiplications, the implementation needed to be able to decide when it was profitable to pay the communication cost of offloading some of these operations from the main processor. The combination of a performance model, the identification of algorithmic parameters that lead to large matrix operations, and a small change to the basic algorithm combined to give a speedup of about 1.4 on the configurations evaluated.

Concluding the minisymposium, Kengo Nakajima (University of Tokyo) discussed parameter exploration in the context of finding preconditioners for parallel iterative solvers for finite element problems. The block–Jacobi localised preconditioners that were studied incorporate global information about the matrices with local features of the elements used.

*Parry Husbands works on user-friendly parallel computing at Interactive Supercomputing in Waltham, Massachusetts.*