Computational Grids: Current Trends in Performance-oriented Distributed Computing
March 2, 2002
As the rapid evolution of the Internet continues to define a new medium for the sharing and management of information, it also brings the potential for harnessing vast numbers of computers, storage devices, and networks as a platform for computation. Developing and running programs that can draw compute power from globally distributed resources pose new challenges for the computer and computational science communities. In addition to interoperability and security (key concerns for Internet users and developers as well), applications that use distributed resources as a unified compute platform must be able to achieve performance levels greater than those that could be delivered by any single resource alone.
The Computational Grid is a software approach whose main goal is to aggregate resources culled from a global resource pool for use by applications and their users. Using a ubiquitous software veneer that provides a consistent set of services, Grid applications should be able to draw compute power from a federated collection of resources (i.e., generators), much as electrical appliances draw electrical power from a power utility. Building the software infrastructure that is capable of realizing this ambitious metaphor is the focus of many current research and development efforts.
This article surveys "the state" of Computational Grid computing in terms of some, but by no means all, of the currently active Grid efforts. We've categorized ongoing work in this area according to current practice, research efforts, and commercial developments, but we stress that the projects we've chosen to describe (particularly the research projects) are intended to be only a representative sampling from a population too large to detail exhaustively.
Grid software infrastructures that implement basic services have matured to the point that they have engendered active user communities. They are portable between all of the currently available Unix and Linux implementations, and are able to run on all of the architecture types used for scientific and engineering computing (e.g., parallel supercomputers, workstations, parallel shared-memory machines, laptops). Compatibility with the Microsoft operating systems is increasingly a requirement; although development is under way, little in the way of "production quality" software is now available for these platforms.
The Grid software infrastructure in widest use today is Globus, a software toolkit for building Grid-enabled applications and services. Globus implements a set of nonproprietary protocols for securely identifying, allocating, and releasing resources within a Grid (i.e., from a globally federated pool). Applications directly use a variety of programming tools and libraries that implement these protocols to locate and gain access to Grid resources. At the same time, any operation can be authenticated via the Globus Grid Security Infrastructure (GSI) mechanisms (using X.509 certificates, by default), which are built into all software components. Globus is available as open source, is portable among all versions of Unix and Linux (including those that run on supercomputers, shared-memory parallel machines, clusters, etc.), and is used by a large number of the active Grid projects today.
Another mature and widely used infrastructure is Condor. Originally developed as a "cycle-scavenging" system for workstation networks, Condor has been adapted to work in wide-area settings with the addition of parallel supercomputers, clusters, and shared-memory multiprocessors as target machine types.
Condor attempts to maximize job throughput by borrowing time from idle resources. Resource owners who agree to run Condor specify the conditions under which Condor may acquire and must release their resources. When a resource becomes eligible for acquisition, Condor is free to schedule jobs on it. If processor or keyboard activity indicates that the resource is no longer idle, Condor checkpoints any jobs executing on the resource and evacuates them. Condor also supports a high-level resource-discovery mechanism called matchmaking that allows users to specify their resource needs in a high-level language. When the job is launched, the matchmaker service performs a best-fit search of the resource database to determine the resources that should be allocated.
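The matchmaking idea described above can be illustrated with a minimal sketch. This is not Condor's actual matchmaking language or API; the `Machine` and `JobRequest` types, their fields, and the interpretation of "best fit" as "smallest memory surplus among idle, compatible machines" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    """Hypothetical resource record advertised by a resource owner."""
    name: str
    arch: str
    memory_mb: int
    idle: bool  # True when the owner's policy allows acquisition


@dataclass
class JobRequest:
    """Hypothetical statement of a job's resource needs."""
    arch: str
    min_memory_mb: int


def match(job: JobRequest, machines: List[Machine]) -> Optional[Machine]:
    """Return the best-fit machine for the job, or None if nothing qualifies.

    "Best fit" is assumed here to mean the eligible machine that wastes
    the least memory beyond what the job asked for.
    """
    candidates = [
        m for m in machines
        if m.idle and m.arch == job.arch and m.memory_mb >= job.min_memory_mb
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.memory_mb - job.min_memory_mb)


pool = [
    Machine("a", "x86", 256, True),
    Machine("b", "x86", 512, True),
    Machine("c", "x86", 384, False),   # busy: owner is using it
    Machine("d", "sparc", 1024, True),
]
chosen = match(JobRequest(arch="x86", min_memory_mb=300), pool)
print(chosen.name if chosen else "no match")
```

In this toy pool, machine "c" would be the tightest fit for a 300-MB request, but it is not idle, so the search falls back to the next-smallest eligible machine. A real matchmaker would also let the resource owner express constraints on the jobs it is willing to accept.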
Globus and Condor interoperate. Condor-G (as the combination is called) uses a combination of the Condor resource-allocation strategies (including matchmaking) and the Globus security and resource-access mechanisms. The result is a secure, portable, and efficient system for implementing a high job-throughput capability in Grid settings.
Several large-scale Grid computing efforts are under way; most if not all use Globus, Condor, or Condor-G in some capacity, but some research products from the wider community are included. The TeraGrid project, funded by the National Science Foundation, is deploying a national-scale multi-institution Grid to support scientific computing with an expected peak performance of 13.5 teraflops. TeraGrid users will be able to log on to a nationally distributed collection of high-performance resources and to treat them as a single heterogeneous computational and storage platform.
Other Grid development and deployment projects have similar goals and scope, but typically with an emphasis on a particular application domain or class of problems. These projects include the Grid Physics Network (GriPhyN), the Network for Earthquake Engineering Simulation (NEES), the NASA Information Power Grid, the Department of Energy's Particle Physics Data Grid (PPDG), the European Data Grid, and the Asia-Pacific Grid. While the state of development and size of the user communities supported by these projects vary, taken together they constitute a cross-section of effective Grid planning and implementation today.
Research focusing on how to build and use Computational Grids continues to expand rapidly, yielding a plethora of experimental execution environments, schedulers, modeling systems, and user-interface tools. Indeed, so much work is under way that an exhaustive enumeration of these systems and their results at present is difficult to imagine. (A Web search using www.google.com for the key words "computational grid" on December 11, 2001, yielded 743 relevant Web pages.) Instead, our intention is to survey representative research efforts as an entree to the wider field.
Perhaps the most comprehensive single Grid research project, in terms of its scope, is the Grid Application Development Software (GrADS) project, centered at HiPerSoft. GrADS researchers are investigating an integrated approach to Grid program development and execution that includes automatic Grid-enabled libraries, compilation and compile-time optimization techniques targeting the Grid, high-performance Grid runtime scheduling systems, dynamic application monitoring and control, new Grid simulation capabilities, and market-based Grid resource-allocation strategies. While each of these foci, taken individually, is an active area of research, GrADS is studying what is necessary to combine them into an effective programming and execution environment for the user.
The GrADS results generated in these research areas build on those from a number of seminal and currently active Grid projects, including AppLeS (for application scheduling), Autopilot (for application monitoring and control), MPICH-G (for MPI support), the Network Weather Service (for resource monitoring and performance forecasting), and ScaLAPACK (for numerical libraries and distributed performance tuning). GrADSoft (the experimental software environment developed by GrADS) should also be able to leverage other Grid programming systems, such as NetSolve and Ninf, that provide Grid-"aware" Remote Procedure Call (RPC) services.
By itself, the subproblem of storage management in Grid environments is the subject of active research. The Storage Resource Broker (SRB) project is investigating infrastructure requirements that must be met for a uniform interface to storage across heterogeneous resources. Using a different model for distributed storage, the Internet Backplane Protocol (IBP) project considers the use of storage depots from which clients can make temporary allocation requests anonymously. While their approaches are different and complementary, both of these projects are focused on optimizing data delivery to Grid client applications from disparate storage sites.
Complementing the computational and storage capabilities investigated by such projects as GrADS and its related work, the Access Grid project studies how the Grid can be used to foster human interaction and collaboration. In addition to "standard" videoconferencing capabilities, the Access Grid is designed to support collaborating teams of users who require more than simple audio and visual connectivity to interact. Users can participate in "virtual venues" that are built with Multi-User Domain (MUD) environments and multimedia systems. Providing a Grid computing capability via the Web is also an active research area. The Purdue University Network Computing Hubs (PUNCH) project uses a set of runtime resource-brokering services to identify potential execution sites for a user's job. It then launches the job and provides a Web-browser interface for its status and control.
PUNCH is an example of a wider effort to provide Web portal-like access to Grids. In a similar vein, the Indiana University Common Component Architecture Toolkit (CCAT) project is studying ways of standardizing and maintaining Grid software component interfaces. The Globus CoG Kit extends this notion to include technologies and tools originally developed for Internet applications and Web services.
In addition, researchers are developing application frameworks for specific types of applications that share common characteristics. The Cactus development environment provides an object-oriented development framework that has been used primarily to implement large-scale physics applications for the Grid. Simulation parameter "sweeps," in which an individual simulation is run repeatedly with a different parameterization each time, are also a successful class of Grid application. Both the Nimrod and AppLeS Parameter Sweep Template (APST) projects are studying ways to optimize the execution of these scientific applications.
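The structure of a parameter sweep can be sketched in a few lines. The example below is only an illustration of the pattern, not how Nimrod or APST work: the `simulate` function is a hypothetical stand-in for a real simulation, the parameter names are invented, and a local thread pool stands in for the distributed Grid resources to which the independent runs would actually be dispatched.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(params):
    """Hypothetical stand-in for one simulation run."""
    return params["rate"] * params["steps"]


def sweep(grid, workers=4):
    """Run simulate() once for every combination of parameter values.

    grid maps each parameter name to the list of values to try. The runs
    are independent of one another, which is what makes sweeps a natural
    fit for loosely coupled resources.
    """
    names = sorted(grid)
    points = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with points.
        results = list(pool.map(simulate, points))
    return list(zip(points, results))


if __name__ == "__main__":
    for point, result in sweep({"rate": [0.5, 1.0], "steps": [10, 20]}):
        print(point, result)
```

Because no run depends on another, the scheduling question reduces to deciding which run goes to which resource and when, which is the kind of optimization these projects study.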
The success of Grid research, and the user community it has attracted, have generated commercial interest in the Computational Grid paradigm. Recently, Compaq, Cray, Silicon Graphics, Sun Microsystems, Veridian, Fujitsu, Hitachi, NEC, Entropia, IBM, Microsoft, and Platform Computing announced plans to develop commercial offerings based on Globus. These companies intend to leverage the open standard that Globus has fostered in the community as the basis for new Grid products. Spin-off companies are also flourishing. Sun Microsystems' Grid Engine product line stems from an early company acquisition. The Avaki Corporation is offering a commercialized version of Legion, a fully object-oriented Grid system originally developed at the University of Virginia.
Other companies are exploring peer-to-peer Internet technologies as a way to provide commercial Grid computing options. Entropia, United Devices, and Parabon offer distributed systems for enterprise-wide or Internet-based Grid computing. Users "sign up" by activating a client library that allows both job submission and safe cycle harvesting. In addition, users from the World Wide Web are encouraged to participate in public computing projects, much as the Search for Extraterrestrial Intelligence (SETI) project recruits volunteers for SETI@Home.
The Computational Grid is an emerging distributed-computing paradigm with active research, user, and commercial development communities. Relatively mature software infrastructures that are freely available have engendered a host of large-scale development and deployment efforts. Experimental research that investigates more powerful and easier-to-use techniques continues, and commercial interest is robust.
With so much interest in the future of Computational Grids, the Global Grid Forum (GGF) has been formed as a community-based organization for information interchange and standardization. Modeled after the Internet Engineering Task Force (IETF), GGF participants (organized as working groups) address as a community such issues as security, uniformity of access to information services, data management, performance monitoring, and application models. As a body, the GGF meets three times a year internationally, and wide participation is welcome.
The networking revolution that launched the Internet continues to open new possibilities with the advent of the Computational Grid. If these efforts are successful, computing power will be available from a utility in much the same way that electrical power, cable TV, and Internet service are today.
 The AccessGrid Home Page; http://www-fp.mcs.anl.gov/fl/accessgrid
 The Asia Pacific Grid Home Page; http://www.apgrid.org
 The AppLeS Home Page; http://apples.ucsd.edu
 The AppLeS Parameter Sweep Template Home Page; http://grail.sdsc.edu/projects/apst
 The Autopilot Home Page; http://www-pablo.cs.uiuc.edu/Software/Autopilot/autopilot.htm
 The Avaki Home Page; http://www.avaki.com
 The Cactus Code Home Page; http://www.cactuscode.org
 The Common Component Architecture Toolkit Home Page; http://www.extreme.indiana.edu/ccat
 The Globus CoG Home Page; http://www.globus.org/cog
 The Condor Home Page; http://www.cs.wisc.edu/condor
 The Entropia Home Page; http://www.entropia.com
 The Data Grid Home Page; http://www.eu-datagrid.org
 I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, 1998
 The Global Grid Forum Home Page; http://www.globus.org
 12 Companies Adopt Argonne Lab/USC Globus Toolkit as Standard Grid Technology Platform; http://www.globus.org/developer/news/20011112a.html
 The GrADS Home Page; http://www.sun.com/gridware
 The Grid Portal Collaboration Home Page; http://www.ipg.nasa.gov/portals
 The Grid Physics Network Home Page; http://www.griphyn.org
 The Grid Security Infrastructure Home Page; http://www.globus.org/security
 The HiPerSoft Home Page; http://www.hipersoft.rice.edu
 The Internet Backplane Protocol Home Page; http://icl.cs.utk.edu/ibp
 The NASA Information Power Grid Home Page; http://www.ipg.nasa.gov
 The Legion Home Page; http://legion.virginia.edu
 The MPICH-G Home Page; http://www.niu.edu/mpi
 The Network for Earthquake Engineering and Simulation; http://www.neesgrid.org
 The NetSolve Home Page; http://icl.cs.utk.edu/netsolve
 The Nimrod Home Page; http://www.csse.monash.edu.au/~davida/nimrod.html
 The Ninf Home Page; http://ninf.apgrid.org
 The Network Weather Service Home Page; http://nws.npaci.edu
 The Parabon Home Page; http://www.parabon.com
 The Particle Physics Data Grid Home Page; http://www.slac.stanford.edu/xorg/ngi/ppdg/ppdg-slac.html
 The Purdue University Network Computing Hubs Home Page; http://punch.ecn.purdue.edu
 The ScaLAPACK Home Page; http://www.netlib.org/scalapack
 The SETI@Home Web Page; http://setiathome.ssl.berkeley.edu
 The Storage Resource Broker Home Page; http://www.npaci.edu/DICE/SRB
 The TeraGrid Home Page; http://www.teragrid.org
 The United Devices Home Page; http://www.ud.com/home.htm
Rich Wolski is a professor of computer science at the University of California, Santa Barbara.