07/30/2006   
 
Part 2: Petascale Virtual-Data Grids

A computational grid is a set of geographically distributed IT resources that a single application can mobilize through software services that tie them together. The definitive book on the subject is "The Grid: Blueprint for a New Computing Infrastructure", edited by Ian Foster and Carl Kesselman, in which several authors describe how such computing grids might be built and what they could accomplish. We highly recommend this book to anyone interested in a technical overview of Grids.

Several computational grid testbeds are operational, but the challenges facing our experiments have led us to the concept of a Petascale Virtual-Data Grid. "Petascale" emphasizes the massive CPU resources (Petaflops) and the enormous datasets (Petabytes) that must be harnessed, while "virtual" refers to the many required data products that may not be physically stored but exist only as specifications for how they may be derived from other data. The resulting computational and data management problems differ fundamentally in the following respects from problems addressed in previous work:

  • Data-intensive as well as computation-intensive: Analysis tasks can involve thousands of computing, data-handling, and network resources. The central problem is coordinated management of computation and data, not simply data movement.

  • Need for large-scale coordination without centralized control: Stringent performance goals require coordinated management of numerous resources, yet these resources are, for both technical and strategic reasons, highly distributed and not amenable to tight centralized control.

  • Large dynamic range in user demands and resource capabilities: These systems must be able to support and arbitrate among a complex task mix of experiment-wide, group-oriented, and (thousands of) individual activities, using I/O channels, local area networks, and wide area networks that span several distance scales.

These considerations motivate the study of the virtual data grid technology that will be critical to future data-intensive computing not only in the four physics experiments, but in the many areas of science and commerce in which sophisticated software must harness large amounts of computing, communication and storage resources to extract information from measured data.

The Petascale Virtual-Data Grid (PVDG) is a unifying concept that describes the new technologies required to support such next-generation data-intensive applications. We use this term to capture the following unique characteristics:

  • A virtual data grid has large extent (national or worldwide) and scale, incorporating large numbers of resources on multiple distance scales.

  • A virtual data grid is more than a network: it layers sophisticated new services on top of local policies, mechanisms, and interfaces, so that geographically remote resources can be used in a coordinated fashion.

  • A virtual data grid provides a new degree of transparency in how data-handling and processing capabilities are integrated to deliver data products to end-user applications, so that requests for such products are easily mapped into computation and/or data access at multiple locations. (This transparency is needed to enable optimization across diverse, distributed resources, and to keep application development manageable.)

Figure 1: A production Grid showing the strong integration of data generation facilities, storage, computing and networks, plus tools for scheduling, management and security.

These characteristics combine to enable the definition and delivery of a potentially unlimited virtual space of data products derived from other data. In this virtual space, requests can be satisfied via direct retrieval of materialized products and/or computation, with local and global resource management, policy, and security constraints determining the strategy used. The concept of virtual data recognizes that all except irreproducible raw experimental data need 'exist' physically only as the specification for how they may be derived. The grid may instantiate zero, one, or many copies of derivable data depending on probable demand and the relative costs of computation, storage, and transport. (In high-energy physics today, for example, over 90% of data access is to derived data.) On a much smaller scale, this dynamic processing, construction, and delivery of data is precisely the strategy used to generate much, if not most, of the web content delivered in response to queries today.
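The retrieve-or-derive strategy described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the names `Derivation` and `VirtualDataCatalog`, the cost units, and the keep threshold are assumptions, not part of any GriPhyN, Chimera, or Globus interface. It also ignores probable demand, transport cost, policy, and security, all of which a real grid would weigh.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Derivation:
    """Hypothetical specification for deriving one product from others."""
    inputs: List[str]
    transform: Callable[..., object]
    compute_cost: float  # illustrative cost units for running the transform


@dataclass
class VirtualDataCatalog:
    """Toy catalog: products are either stored or known only by derivation."""
    materialized: Dict[str, object] = field(default_factory=dict)
    derivations: Dict[str, Derivation] = field(default_factory=dict)
    retrieval_cost: float = 1.0  # assumed flat cost of fetching a stored copy
    log: List[Tuple[str, str]] = field(default_factory=list)

    def cost(self, name: str) -> float:
        """Estimated cost of satisfying a request in the current state."""
        if name in self.materialized:
            return self.retrieval_cost
        d = self.derivations[name]
        return d.compute_cost + sum(self.cost(i) for i in d.inputs)

    def request(self, name: str, keep_if_cost_above: float = 5.0) -> object:
        """Return a product by direct retrieval if stored, else by derivation."""
        if name in self.materialized:
            self.log.append((name, "retrieved"))
            return self.materialized[name]
        d = self.derivations[name]  # KeyError if neither stored nor derivable
        args = [self.request(i, keep_if_cost_above) for i in d.inputs]
        value = d.transform(*args)
        self.log.append((name, "derived"))
        # Keep a physical copy only if re-deriving it would be expensive.
        if self.cost(name) > keep_if_cost_above:
            self.materialized[name] = value
        return value


# Only the raw data is stored; the other products exist as specifications.
catalog = VirtualDataCatalog(materialized={"raw": [3, 1, 2]})
catalog.derivations["sorted"] = Derivation(["raw"], sorted, compute_cost=10.0)
catalog.derivations["peak"] = Derivation(
    ["sorted"], lambda xs: xs[-1], compute_cost=0.5
)

first = catalog.request("peak")   # derives "sorted", then "peak"
second = catalog.request("peak")  # reuses the stored "sorted", re-derives "peak"
```

In this sketch the expensive intermediate product is materialized after its first derivation, while the cheap final product is recomputed on each request, which corresponds to the grid instantiating one copy of the former and zero copies of the latter.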

Supported by the National Science Foundation