Mike Dahlin
University of Texas at Austin
As large amounts of valuable data appear in on-line repositories distributed across large-scale networks, a key operating systems challenge is to provide basic services to enable programs to manipulate this data easily, safely, and with high performance. These new services will aim to support data-intensive applications by, for instance, making it easy to move computation and data near one another to reduce latency, to increase bandwidth, or to allow geographically separated users to collaborate.
The current web infrastructure, built of monolithic browsers, caches, and servers, provides little support for general computing and data access. As a result, applications must be constructed on an ad-hoc basis, complicating implementation, reducing efficiency, constraining resource management, and making it difficult to reason about security. For example, the current infrastructure falls far short of supporting data-intensive applications such as the following:
A scalable web site for an event of world-wide interest (e.g., imagine the web server for a human-on-Mars landing). Compared to the 18 mirrored web sites for the recent Mars Pathfinder mission, we envision a data system that will handle at least two orders of magnitude more users and two orders of magnitude more data per user, and that will require neither manual configuration to set up mirror sites nor manual selection of a nearby server by users.
An up-to-date web search engine that reindexes data within an hour of being notified of changes rather than periodically "crawling" all sites on the Internet.
Dynamic HTTP servers that increase their capacity by recruiting idle servers and that improve their responsiveness by spawning servers near groups of active clients.
Data mining applications or web agents that move computations to data rather than the reverse. An example of this type of application would be an agent that implements image analysis to perform content-based retrieval on remote image databases.
Collaborative computing applications, such as the file system for a large R&D team whose members are spread across several sites and want to share source code, libraries, and input test files.
These applications require a range of services that are not presently available, such as dynamic cache hierarchies, more efficient use of distributed storage, a flexible and efficient cache consistency framework, support for replicating active objects, a location service for mobile objects, distributed resource management, and strong and flexible distributed security.
Although these general issues have long been faced by traditional operating systems, the wide area network environment poses new challenges calling for new trade-offs in operating system design. For example, compared to traditional systems, network performance will be limited, the number of nodes will be much larger, node and network failures will be more common, node and network performance will be more variable, the number of users will be larger, and the trust among nodes will be less uniform. Although we can draw inspiration from past operating systems, these constraints force us to rethink mechanisms and policies from the ground up, with more focus on extensibility and scalability to support a wide range of applications and data types.
Providing scalable services in this challenging environment poses a dilemma to system designers. On one hand, the complexity and scale of the system seem to demand high-level services that hide most implementation details from applications and users. On the other hand, the challenges of the environment seem to demand extensibility and a great deal of application control over how resources are used. Research efforts should focus attention on resolving this dilemma. Other researchers have noted several common-sense strategies that may serve as a starting point. First, interfaces of extensible low-level modules should target compilers and library writers rather than application writers. Second, it is incumbent upon a system designer to build "common case" libraries that support most applications rather than simply declaring, for instance, that the system is object oriented and can be extended with any policies needed by applications. Finally, interfaces to libraries, and when possible to the low-level modules, should specify application goals, not implementation mechanisms. For example, an application should specify, "I am going to access this file sequentially," rather than, "use MRU replacement for this file."
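As a concrete illustration of stating goals instead of mechanisms, the following Go sketch shows one shape such an interface might take; the names (AccessPattern, PolicyFor, and the specific policy choices) are hypothetical assumptions made for this example, not drawn from any existing system.

    // Hypothetical sketch: the application declares what it intends to do
    // (its goal), and a library maps that goal onto a concrete mechanism
    // such as a replacement policy. All names here are illustrative.
    package main

    import "fmt"

    // AccessPattern is a goal the application states about its future behavior.
    type AccessPattern int

    const (
        Sequential AccessPattern = iota // read front to back, touch each block once
        Random                          // no predictable order
        Repeated                        // small working set, reused many times
    )

    // CachePolicy is the mechanism a library chooses on the application's behalf.
    type CachePolicy struct {
        Replacement string
        ReadaheadKB int
    }

    // PolicyFor maps a stated goal to a concrete mechanism; applications never
    // ask for "MRU" directly, the library (or a smarter runtime) owns this choice.
    func PolicyFor(p AccessPattern) CachePolicy {
        switch p {
        case Sequential:
            return CachePolicy{Replacement: "MRU", ReadaheadKB: 256}
        case Repeated:
            return CachePolicy{Replacement: "LRU", ReadaheadKB: 0}
        default:
            return CachePolicy{Replacement: "LRU", ReadaheadKB: 32}
        }
    }

    func main() {
        // The application states goals; the library translates them into mechanisms.
        fmt.Printf("sequential scan -> %+v\n", PolicyFor(Sequential))
        fmt.Printf("repeated reads  -> %+v\n", PolicyFor(Repeated))
    }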
Extensible framework
To support extensible services, we envision building distributed operating system services that run as jobs on nodes that export a virtual machine interface. The virtual machine isolates programs from one another, controls access to system resources according to a flexible security policy that allows different applications and principals to access different system resources, and mediates access to shared resources via a set of resource-specific schedulers. The virtual machine thus implements basic mechanisms to control the local node; programs running on top of this virtual machine implement higher-level policies, both to allow application control of some local decisions and to provide an extensible way for system designers to build more complex distributed policies.
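The division of labor between node-level mechanism and higher-level policy might look roughly like the sketch below; the interface names and the trivial capacity-capping scheduler are assumptions made for illustration, not a proposed API.

    // Rough sketch of the node-level division of labor: the virtual machine
    // checks a per-principal security policy and routes resource requests
    // through resource-specific schedulers; distributed policy lives above it.
    // Interface names and the trivial scheduler are assumptions, not an API.
    package main

    import (
        "errors"
        "fmt"
    )

    // Scheduler is a resource-specific scheduler (CPU, disk, network, ...).
    type Scheduler interface {
        Grant(principal string, amount int) error
    }

    // Policy decides which principals may touch which resources.
    type Policy func(principal, resource string) bool

    // VM is the per-node virtual machine: local mechanism only.
    type VM struct {
        policy     Policy
        schedulers map[string]Scheduler
    }

    // Request mediates a resource request from a job running on this node.
    func (vm *VM) Request(principal, resource string, amount int) error {
        if !vm.policy(principal, resource) {
            return errors.New("security policy denies access")
        }
        s, ok := vm.schedulers[resource]
        if !ok {
            return errors.New("no scheduler for resource")
        }
        return s.Grant(principal, amount)
    }

    // fairShare is a placeholder scheduler that enforces a simple capacity cap.
    type fairShare struct{ capacity, used int }

    func (f *fairShare) Grant(principal string, amount int) error {
        if f.used+amount > f.capacity {
            return errors.New("over capacity")
        }
        f.used += amount
        return nil
    }

    func main() {
        vm := &VM{
            // Example policy: the "guest" principal may not use the network.
            policy: func(principal, resource string) bool {
                return !(principal == "guest" && resource == "net")
            },
            schedulers: map[string]Scheduler{
                "net": &fairShare{capacity: 100},
                "cpu": &fairShare{capacity: 100},
            },
        }
        fmt.Println(vm.Request("search-agent", "net", 10)) // <nil>
        fmt.Println(vm.Request("guest", "net", 10))        // security policy denies access
    }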
Recent research has provided many of the techniques needed to build such extensible virtual machines, so it seems feasible to build distributed operating system services upon this infrastructure. However, merely providing extensible mechanisms is not enough to enable a high-performance, scalable system. We must also develop key policies and algorithms for data distribution, consistency, location, and distributed resource management on top of these local mechanisms.
Dynamic replication
To support large-scale data-driven applications, we should develop dynamic replication systems that automatically place data replicas where they are needed. Data replication and placement are crucial to performance in large-scale systems for three reasons. First, replication increases the throughput of the system by harnessing multiple machines. Second, moving data near where it will be used shortens the control loop between the data consumer and data storage, thereby reducing latency or making it easier to provide real-time guarantees. Third, reducing the distance that data must travel improves the efficiency of the network by reducing the network resources that data transfers consume. These improvements are needed to support next-generation applications like those described in the introduction, particularly as the number of users rises and as demanding multimedia data becomes more prevalent. Unfortunately, current data replication architectures are an ad-hoc combination of mirror sites, web proxies, and geographical caches. Configuring systems on a case-by-case basis not only demands expensive human expertise and limits the maximum practical size of a data system, but it also makes it difficult to deal with dynamically changing demands and to coordinate the resource demands of different applications and replication hierarchies.
The research task is to build such a scalable, dynamic replication hierarchy by synthesizing a range of techniques. For example, systems should separate control and data to allow scalable control without wastefully replicating data in deep hierarchies. Systems should dynamically increase or decrease the number of replicas and the size of the control hierarchy to improve scalability, locality, and efficiency. Systems should coordinate where objects are placed, rather than relying on on-demand replication across a static hierarchy, to use resources more efficiently and improve locality. Moreover, systems should provide support for replication of dynamic objects to, for instance, support replication of currently uncachable CGI programs.
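To make the coordinated-placement idea concrete, the toy sketch below shows a control node adjusting the replica set for one object from observed per-region demand; the regions, thresholds, and data structures are invented for illustration and not a proposed algorithm.

    // Toy illustration of coordinated, demand-driven replica placement rather
    // than on-demand caching through a static hierarchy: a control node watches
    // per-region request rates and creates or retires replicas accordingly.
    package main

    import "fmt"

    type region string

    type placement struct {
        replicas map[region]bool // where copies of one object currently live
    }

    // rebalance adds a replica in any region whose request rate justifies a
    // local copy, and drops replicas in regions that have gone quiet, always
    // keeping the origin copy.
    func (p *placement) rebalance(rate map[region]int, origin region, addAt, dropAt int) {
        for r, reqs := range rate {
            switch {
            case reqs >= addAt && !p.replicas[r]:
                p.replicas[r] = true // create a nearby copy; only control state moves eagerly
            case reqs <= dropAt && p.replicas[r] && r != origin:
                delete(p.replicas, r) // reclaim storage when demand fades
            }
        }
    }

    func main() {
        p := &placement{replicas: map[region]bool{"us-west": true}}
        load := map[region]int{"us-west": 40, "europe": 900, "asia": 3}
        p.rebalance(load, "us-west", 500, 5)
        fmt.Println(p.replicas) // map[europe:true us-west:true]
    }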
Scalable consistency
Consistency is a key building block for distributed data-driven applications because it allows them to coordinate access to shared state. Furthermore, consistency is an essential part of a dynamic replication system because it allows clients to reason about data provided by a replica rather than by the main server.
To meet these needs we propose that a consistency framework be built in which each object is associated with a consistency policy that executes on the node's virtual machine. This will allow different applications to use different semantics so that each pays only for the semantics it needs. In addition, we should conduct research to determine how to provide consistency guarantees in a large system. To support a wide range of basic applications, these algorithms should provide strong or optimistic guarantees while retaining scalability and fault tolerance. Three aspects of this environment make it difficult to provide such properties. First, because node and network failures are relatively frequent in large systems, the implementation must be fault tolerant. Second, because callbacks require the server to track the contents of client caches, systems should limit how much server memory they require. Third, because the number of clients that a server may try to invalidate is large, the system should alleviate the bursts of server load when a popular object is modified.
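To make the per-object framework concrete, the sketch below shows how two objects might name different consistency policies that run locally, so each pays only for the semantics it asks for; the two policies shown (a lease check and a simple time-to-live) and all names are assumptions chosen for illustration.

    // Sketch of per-object consistency policies: each object names a policy,
    // the policy runs on the local virtual machine, and different objects can
    // get different semantics. The two policies below are illustrative only.
    package main

    import (
        "fmt"
        "time"
    )

    // ConsistencyPolicy decides whether a locally cached copy may be used.
    type ConsistencyPolicy interface {
        FreshEnough(lastValidated time.Time) bool
    }

    // leaseBased: the copy is usable only while a server-granted lease is held.
    type leaseBased struct{ leaseExpiry time.Time }

    func (l leaseBased) FreshEnough(time.Time) bool { return time.Now().Before(l.leaseExpiry) }

    // ttlBased: the copy is usable for a fixed TTL after it was last validated.
    type ttlBased struct{ ttl time.Duration }

    func (t ttlBased) FreshEnough(last time.Time) bool { return time.Since(last) < t.ttl }

    type object struct {
        name          string
        lastValidated time.Time
        policy        ConsistencyPolicy
    }

    func main() {
        objs := []object{
            {"bank-balance", time.Now(), leaseBased{leaseExpiry: time.Now().Add(30 * time.Second)}},
            {"news-headline", time.Now().Add(-2 * time.Minute), ttlBased{ttl: 10 * time.Minute}},
        }
        for _, o := range objs {
            fmt.Println(o.name, "cached copy usable:", o.policy.FreshEnough(o.lastValidated))
        }
    }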
A range of techniques may help meet these goals. Control hierarchies can reduce the state stored at any given node and distribute bursts of load across many machines. Extensible control hierarchies also provide a path to unify "push" services with the normal caching and consistency framework by supporting both invalidate and update consistency. Leases can provide fault tolerance, reduce server state, and reduce bursts of load at the server. Transactional updates provide fault tolerance when applications update data stored across multiple servers, and they can also improve performance by giving clients more control over when data are committed. Finally, volumes can group related data together and maintain consistency for each volume rather than for each object, reducing server state and load.
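The sketch below illustrates the first of these techniques: a control hierarchy in which each node tracks only its immediate children, so per-object state stays small and an invalidation of a popular object fans out level by level instead of bursting from a single server. The structure and names are, again, illustrative assumptions.

    // Small sketch of how a control hierarchy can bound consistency state and
    // spread invalidation bursts: each node remembers only which immediate
    // children (not which of millions of clients) may cache an object, and an
    // invalidation fans out level by level.
    package main

    import "fmt"

    type node struct {
        name string
        // interested marks the immediate children that cache the object (or
        // have descendants that do); this is all the per-object state a node keeps.
        interested map[*node]bool
    }

    // invalidate sends at most len(interested) messages from any one node, so a
    // write to a popular object never creates a single huge burst at the server.
    func (n *node) invalidate(object string) {
        for child := range n.interested {
            fmt.Printf("%s -> %s: invalidate %q\n", n.name, child.name, object)
            child.invalidate(object)
        }
        n.interested = map[*node]bool{} // callback state is discarded once used
    }

    func main() {
        proxyA := &node{name: "proxy-A", interested: map[*node]bool{}}
        proxyB := &node{name: "proxy-B", interested: map[*node]bool{}}
        server := &node{name: "server", interested: map[*node]bool{proxyA: true, proxyB: true}}
        server.invalidate("/news/front-page")
    }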
Scalable location service for dynamic objects
A central building block for a scalable data system is a high performance, scalable location service for tracking where objects are stored in a dynamic hierarchy. An object location service is needed by at least three parts of the system: 1) the consistency protocol needs to track where objects are cached so it can invalidate or update them, 2) read requests need to find nearby copies of objects, and 3) the object placement algorithm uses the current object locations as input to decide if more (or fewer) replicas are needed and where they should be placed.
The challenge in building a location service for this system is scalability. Every time an object is copied from one cache to another, its location record must be read (to find a copy of the object) and written (to add the new copy to the list). Fortunately, three aspects of the application make the task feasible. First, temporal imprecision is acceptable for locating nearby copies to read and for input to data placement algorithms. In both cases, stale data may hurt performance, but it will not cause incorrect operation. Second, for all three uses of the data, the list may be partitioned geographically or by network topology, and nodes need to maintain only the portion of the list corresponding to nearby copies. For example, if cache consistency metadata are stored hierarchically, each node in the hierarchy only needs to keep track of which of its immediate children are caching the data or have descendants that are caching the data. Similarly, if a cache is reading data, it cares about nearby copies, not those that are far away. Third, to locate data for reads, hints are acceptable.
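A small sketch of a hint-based lookup follows: a node consults its own partition of the location records for a nearby copy, treats any answer as a hint that may be stale, and falls back to the origin server when a hint fails. The names and data structures are assumptions for the example.

    // Sketch of a hint-based location lookup: stale hints cost at most a wasted
    // probe; correctness is preserved by falling back to the origin server.
    package main

    import "fmt"

    type locator struct {
        // nearby maps an object name to caches in this node's region believed to
        // hold a copy; entries are hints and are allowed to be out of date.
        nearby map[string][]string
        origin string
    }

    // locate tries nearby hinted copies first and only then the (always correct,
    // but possibly distant) origin server.
    func (l *locator) locate(object string, hasCopy func(cache string) bool) string {
        for _, cache := range l.nearby[object] {
            if hasCopy(cache) {
                return cache // hint was right: served from a nearby replica
            }
            // hint was stale: harmless for correctness, just one wasted probe
        }
        return l.origin
    }

    func main() {
        l := &locator{
            nearby: map[string][]string{"/img/mars.jpg": {"cache-austin", "cache-dallas"}},
            origin: "origin-server",
        }
        // Suppose cache-austin has evicted the object since the hint was recorded.
        stillHolds := func(cache string) bool { return cache == "cache-dallas" }
        fmt.Println("read from:", l.locate("/img/mars.jpg", stillHolds))
    }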
Distributed resource management
The scale of this new environment poses new challenges to resource management. In particular, the large number of nodes precludes centralized solutions, and resource allocation algorithms can no longer assume cooperation among applications or service providers.
One paradigm that should be examined is microeconomics-based resource management using secure micropayments. Several previous projects have taken inspiration from economics to guide computer resource allocation, and we believe that the scale of the Internet makes this approach particularly attractive. The key benefit of tying resource decisions to micropayments is that doing so provides incentive compatibility, whereby users are discouraged from using resources inefficiently even if such use might marginally improve their own performance. For example, although a single user might benefit from a browser that aggressively prefetches all links leading from the current page, such prefetching would increase the load on the servers, hurting other users: if too many users prefetch, they will all see degraded performance due to server and network overload. A micropayment system would charge the prefetch algorithm a fee that corresponds to the delay that prefetching imposes on other users and thereby encourage the prefetch algorithm to account for this negative externality. Furthermore, this approach fits well with the extensible architectures we envision, because it provides a framework under which different policies can cooperate.
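The prefetching example reduces to a simple cost/benefit test, sketched below under an assumed per-fetch price and assumed probabilities; a real system would derive the price from measured load, but the sketch shows how a charge makes the externality visible to the prefetch policy.

    // Illustrative cost/benefit check for prefetching: the agent compares its
    // expected latency savings against the micropayment price the server charges
    // for the extra load, and prefetches only when the benefit covers the charge.
    // The numbers and pricing model are assumptions for this sketch.
    package main

    import "fmt"

    type link struct {
        url          string
        hitProb      float64 // probability the user follows this link
        fetchLatency float64 // seconds saved if the object is already local
    }

    // shouldPrefetch charges each speculative fetch a price (in the same currency
    // as the user's value of time) that reflects the load it imposes on others.
    func shouldPrefetch(l link, pricePerFetch, valuePerSecond float64) bool {
        expectedBenefit := l.hitProb * l.fetchLatency * valuePerSecond
        return expectedBenefit > pricePerFetch
    }

    func main() {
        links := []link{
            {"/next-image", 0.7, 2.0}, // likely to be clicked, big saving
            {"/legal-notice", 0.01, 1.0},
        }
        price, value := 0.05, 0.10 // currency units per fetch, per second saved
        for _, l := range links {
            fmt.Printf("%-13s prefetch=%v\n", l.url, shouldPrefetch(l, price, value))
        }
    }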
Although microeconomic algorithms seem well matched for large-scale environments, considerable research will be needed to understand the mechanisms and policies needed. For example, systems must provide more feedback on and control over what resources jobs use, and resource management algorithms must be built with explicit cost/benefit decisions stated in the form of some common currency. Additionally, we must examine the overheads of different policies to determine when micropayments are an appropriate strategy and explore ways to reduce the overhead to make such cases more common.
Providing basic services to large-scale systems poses a dilemma to designers, who must give applications control over their resources without swamping them with complexity. To do so, researchers should construct an extensible framework for system services, but manage the framework's complexity by targeting libraries and compilers rather than applications and by asking applications to specify goals rather than mechanisms. Furthermore, when constructing extensible modules, system designers must provide "common case" libraries to support core applications. In the case of data services for large-scale systems, the benefit of resolving this dilemma will be to enable a new set of data-intensive applications that exploit the vast quantities of information becoming available on line.