Twenty-five years of experience with parallel and distributed programming has not produced a programming model for concurrent computation that is as successful as the random access machine for sequential computation. It is generally agreed that no single extant model suffices for all forms of parallel computation, or for all types of parallel machines. Abstractions, such as a global address space or the Linda tuple space, simplify parallel programming, but typically exact a performance penalty by hiding data locality and mechanisms for concurrency control. On the other hand, models with explicit message passing, which expose these issues to programmers, allow performance to be tuned, but are difficult to program. Thus, our options are reduced to the following few: (1) produce a new general programming model that eases programming without loss of efficiency, (2) augment existing models with tools that help the programmer produce correct and efficient code, and (3) resort to different concurrent programming models for different applications.
We are addressing all of these options. The first four projects described below entail novel approaches to general-purpose distributed computing in a cluster environment. We have divided these into two categories: the first involves new languages and tools for general-purpose distributed programming; the second deals with the provision of system-level services that simplify distributed programming. The remaining four projects embody application-specific approaches that have proven to be powerful aids to both parallel and sequential programming in a variety of domains.
A Discipline of Multiprogramming (Misra)
Research in multiprogramming has traditionally attempted to reconcile two apparently contradictory goals: (1) it should be possible to understand a module (e.g., a process or a data object) in isolation, without considering interference by the other modules, and (2) the implementation should permit a fine level of granularity, so that no process is ever locked out of access to common data for long periods of time. The goals are contradictory because fine granularity, in general, implies considerable interference. The earliest multiprograms (see, for instance, Dijkstra's solution to the mutual exclusion problem) were trivially small yet extremely difficult to understand, because the behaviors of the individual processes could not be understood in isolation and all possible interactions among the processes had to be analyzed explicitly. Since then, much effort has gone into limiting or even eliminating interference among processes by employing a variety of synchronization mechanisms: locks or semaphores, critical regions, monitors, and message communication. Elimination of interference, however, can lead to considerable loss of efficiency, as is the case in several cache-locking schemes for updating global memory.
We propose a model of multiprograms and a discipline of programming that address the issues of reasoning (i.e., understanding) and efficient implementation. The major point of departure is the disentanglement of sequential and multiprogramming features. We expect large sections of code to be written, understood, and reasoned about as sequential programs. These sequential programs may be embedded within multiprograms, yet we can guarantee that there will be no interference from the other programs; hence, these sequential programs may be regarded as atomic. Therefore, multiprograms may be understood without ``looking inside'' the sequential code fragments. We propose an efficient implementation scheme that can interleave the individual operations of the sequential programs with fine granularity.
Our model is sparse: a program is built out of boxes (a box is similar to a process/object/monitor) and a box is built out of procedures (that are similar to monitor procedures or methods in objects). No specific communication or synchronization mechanism, except procedure calls, is built into the model. The traditional schemes for message communications using bounded or unbounded channels, semaphores, and accesses to shared memory can be encoded as boxes in our model.
We distinguish between procedures that model sequential and concurrent aspects of programming. Terminating and potentially non-terminating computations - representing sequential programs and multiprograms, respectively - are, in our view, fundamentally different. The former can be assigned a transformational semantics, with pre- and post-conditions, i.e., a semantics based on possible inputs and corresponding outputs, without considering interference from the environment. Multiprograms, or reactive programs, however, cannot be given a pre- and post-condition semantics, because on-going interaction with the environment is of the essence. In our model, a free procedure, modelling sequential computation, never waits (for an unbounded amount of time) to interact with its environment, whereas a guarded procedure, modelling concurrent computation, may wait, possibly forever, for such interactions. In this view, a P operation on a semaphore is a guarded procedure because it may never terminate, whereas a V operation is a free procedure. Our programming model allows for disciplined interactions between these two types of procedures.
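To make the distinction concrete, the following is a minimal sketch, in ordinary C++ rather than our proposed notation, of a semaphore encoded as a box with two procedures; the standard library primitives merely stand in for the model's scheduler.

    #include <mutex>
    #include <condition_variable>

    // Sketch only: a semaphore box with one guarded and one free procedure.
    class SemaphoreBox {
        std::mutex m;
        std::condition_variable cv;
        int count = 0;
    public:
        // Guarded procedure: its guard is count > 0, and the caller may
        // wait forever if the guard never becomes true, so P cannot be
        // given a simple pre/post-condition semantics.
        void P() {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return count > 0; });
            --count;
        }
        // Free procedure: always terminates without waiting on its
        // environment, so it admits transformational reasoning.
        void V() {
            std::lock_guard<std::mutex> lock(m);
            ++count;
            cv.notify_one();
        }
    };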
The programming model we have proposed permits efficient implementations on message-based networks. The implementation will have a good deal of concurrency if the interactions among the boxes are disciplined (a term that we define precisely). We are planning an implementation that includes a (small) compiler for our notation (built on top of C++), a linker that enforces certain type-compatibilities among the boxes that are to be composed and a distributed runtime scheduler (that exploits a resource allocation algorithm known as the Drinking Philosophers Algorithm). We will generate code for a message network with an unlimited number of nodes and then map it to a given platform.
A long term objective of this research is to develop a discipline of multiprogramming that addresses both the theoretical and practical concerns in program design. We believe that the proposed model comes a long way toward achieving this goal. On the practical side, it suggests an attractive methodology for structuring systems using hierarchical composition of boxes. Additionally, the model suggests efficient implementations of processes, monitors and objects on multiple processors. On the theoretical side, the model allows us to view a system execution as a sequence of (large) indivisible actions, which, in turn, permits us to deduce properties of a system from the properties of its components.
A Specification Tool for Distributed Coordination (Ricciardi)
Application programmers cannot be expected to know the most efficient ways to solve distributed coordination problems. To help them, we are developing Sage, a specification tool for distributed programming. Its input is a high-level task specification and its output is a representation of the communications essential for performing the task. Thus, Sage will help non-experts to reason about their distributed programs and to develop efficient code for their applications. Sage will consist of (1) a specification tool to help users define and specify distributed coordination problems, (2) a communication graph extractor which will extract from the specification the messages required to solve the problem, and (3) a tool to identify communication primitives within the extracted communication graph and to provide on-line help.
The specification tool will present the user with a scripted forum for (1) naming the actions of the distributed coordination problem (e.g., ``deliver'' a message, ``decide'' a value), (2) characterizing the set of processes obliged to participate, (3) characterizing exemptions from obligation (e.g., process failure, joining after a certain condition has been met), and (4) expressing temporal notions for the completion of the coordinated activity (e.g., eventually, before another action is completed). The output of the specification tool is a collection of formulas that have been parsed as follows: a formula is a tree noting its owner (if it is a local formula), modalities, logical connectives, and subformulas.
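As an illustration, the parsed output might be represented along the following lines; this is a hypothetical C++ rendering, and the field names are ours, not Sage's.

    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical sketch of a parsed specification formula: a tree
    // recording its owner (when local), its modality or connective,
    // and its subformulas.
    struct Formula {
        std::string owner;       // owning process; empty for global formulas
        std::string modality;    // e.g., "eventually", "before"
        std::string connective;  // e.g., "and", "implies"; empty for atoms
        std::string atom;        // atomic action, e.g., "deliver(m)"
        std::vector<std::unique_ptr<Formula>> subformulas;
    };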
The extraction tool will use this parsed output to derive the communication graph. In terms of knowledge, a process must know certain facts about the global state before performing its part of the coordinated action, and to learn about the global state, a process must exchange messages. The extraction tool must therefore recognize dependencies between formulas and associate process pairs with messages. Deriving the necessary messages is then, generally speaking, a matter of recognizing local formulas within valid global formulas.
The output of the extraction tool will be two trees. One will consist of high-level information and will display for the user the required flow of information. The other will contain the set of low-level formulas comprising the extracted communication graph. The display of this graph must be carefully considered; for example, it would be useful to distinguish sets of related messages. In addition, the user should be able to manipulate the graph by adding and removing processes from the obligation set.
Neither the specification tool nor the extraction tool checks for logical inconsistency or other semantic properties; like most parsers and compilers, they are driven by syntax. Similarly, the basic extraction tool is not a true theorem prover; the formula manipulations it will be capable of are relatively limited. Initially we will concentrate on determining the proper level of representation to capture the essence of distributed coordination adequately, and on efficient implementation of this representation. We will undertake this work under the aegis of the CLEO project described later in this proposal. We intend to employ the infrastructure proposed here as a testbed to evaluate the effectiveness of our tool as an aid to distributed applications development.
Distributed Persistent Storage (Wilson)
Approaches to distributed persistent storage fall into three categories: (1) message passing systems provide persistence and pointer semantics only for local data, (2) distributed shared memory (DSM) systems provide a pure shared memory abstraction, (3) distributed object models (DOM) preserve the pointer semantics of objects across a network, and objects on different nodes communicate via messages which may contain pointers.
While these approaches have merit, each is deficient when attempting to build large-scale, complex, open, distributed systems. The complete loss of pointer semantics and memory coherence in the first approach overburdens application programmers. In complex software, programmers create their own pointer semantics by using unique identifiers such as strings (e.g., references of the form ``http://...'' in the World Wide Web). They also devise protocols for application-specific coherence between these conceptual objects. While application-specific semantics are likely to be necessary in many situations, we believe that the underlying system software ought to make it easier for programmers to assemble something that meets their needs.
The DSM approach is attractive, but DSM has been made to work only for applications in relatively small networks, and it has relied heavily on assumptions of locality. The pure DSM approach is insufficient to provide good performance in the general case. An open system with arbitrarily interconnected data objects invites unpredictable sharing that may cause unexpected performance losses. For normal notions of consistency, directory management is problematic for performance and reliability reasons. For causal consistency, vector timestamps may become enormous, or, if their size is limited, they will force a coarse resolution of causality and decrease parallelism. DSMs also have problems with respect to fault tolerance in unreliable networks. If nodes may fail and recover independently, the basic notions of consistency and object identity break down. If node A contains a pointer to an object on node B, and B rolls back to an earlier state, A's pointer may refer to an invalid version of the object or perhaps to no object at all. In a large, multifunctional system that interacts with the outside world, it is not feasible to force coordinated rollbacks of all processes that might communicate. Potential inconsistencies are therefore unavoidable, and the purpose of system software should not be to hide them entirely, but rather to expose them to programmers' control in a way that makes them easier to deal with appropriately.
The third approach is workable in many cases, but unnecessarily restricts programming styles. For performance reasons, it is often desirable to write modules that operate on data that are locally resident or cached. For example, a high-performance 3-D interactive simulation operating on large, intricate data structures cannot afford the continual indirections and presence checks required by a distributed object model.
We believe that in many cases it is desirable to preserve pointer semantics as a DSM does, but to code the shipping of data explicitly so that real-time deadlines can be met. A process may ship a pool of data to another process explicitly, but, when it arrives there, it should be addressable via normal pointers as in a DSM. We believe it is important to separate the two notions underlying conventional DSMs: sharing an address space (e.g., using pointers) and sharing the actual data via automatic faulting and caching.
We therefore take a hybrid approach in which application programmers may allow data to be shared via different mechanisms, with different policies attached. In our approach, application programmers may allow data to be shared via direct pointers, either using DSM or explicit data shipping, or they may use a higher-level notion of network objects and insulate different applications or modules from the details of others' fine-grained data structures. We expect that in many cases, data will be shared with some processes in one way and with others in a different way. For instance, processes requesting definitions of particular words from a hypertext dictionary may only use a very high-level interface, while a program that constructs indices for free-text searches may need access to the pages of data holding the dictionary's internal data structures.
We will implement this capability through a layered set of modules for managing storage at different levels of abstraction, from disk space to recoverable virtual memory to distributed shared memories. Besides simple storage and caching, this will enable coherence and replication mechanisms by providing name service and dictionary components. Similarly, there will be layers for managing object identifiers, from raw untyped local addresses to consistent global addressing to program-level pointers. There will also be layers of messaging and sharing protocols, providing different reliability and performance tradeoffs. For example, distributed tracing garbage collectors have strong performance needs but low reliability needs. It is essential for good performance that the garbage collector be able to bypass many of the layers that provide application programs with conventional guarantees.
We have already developed the Texas Persistent Store, which allows reliable, long-term, checkpointed storage of C++ objects on a single processor. Texas allows data from a persistent store to be faulted into any available page of a process's virtual memory and uses a novel technique called pointer swizzling at page fault time to convert pointers into normal hardware-supported virtual memory addresses before they are ever seen by a running program. This allows Texas to work with normal compilers, and compiled code executes at normal hardware speed. The only additional cost is resolving the addresses within a page of data when that page is faulted into the virtual memory of a process, and this cost is usually small relative to the cost of the data transfer. Texas also supports the resolution of virtual function table pointers, so that C++ objects can be used with different executables than the ones that created them. Texas is very portable and currently runs under OS/2 and several varieties of UNIX. In its current form, Texas provides easy-to-use orthogonal persistence and recoverability for C++. Transient objects and persistent objects can be operated on by the same code. Texas provides an excellent base on which to develop our distributed persistent storage scheme.
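The following is a much-simplified sketch of swizzling at page-fault time; the real Texas implementation differs in many details (access-protection tricks, reserved-but-unmapped address ranges, type descriptors), and the sketch assumes persistent identifiers fit in pointer-sized slots.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    using PersistentId = std::uintptr_t;  // simplification: pointer-sized ids

    struct SwizzleTable {
        std::unordered_map<PersistentId, void*> mapping;

        // Return the VM address for a persistent page, reserving address
        // space on first reference so that a hardware pointer can be
        // handed out now and the data faulted in later.
        void* translate(PersistentId pid) {
            auto it = mapping.find(pid);
            if (it != mapping.end()) return it->second;
            void* page = ::operator new(4096);  // stand-in for reserving a page
            mapping.emplace(pid, page);
            return page;
        }
    };

    // Run once per faulted-in page, before the program sees any of it:
    // every persistent pointer in the page is rewritten in place as a
    // normal virtual-memory address, so compiled code then runs at full
    // hardware speed with no presence checks.
    void swizzle_page(PersistentId* slots, std::size_t nslots, SwizzleTable& t) {
        for (std::size_t i = 0; i < nslots; ++i)
            slots[i] = reinterpret_cast<PersistentId>(t.translate(slots[i]));
    }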
Algorithms for Networked System Services (Plaxton, Ramachandran)
In the paragraphs that follow, we argue that the development of new network routing algorithms will play a crucial role in determining the level of abstraction that can feasibly be attained in the next generation of computer systems. We also argue that fault-tolerance, a feature rarely found in present-day computer systems, is an attainable goal if hardware failure is incorporated from the outset into our new abstract view of a computer system. In fact, we claim that certain basic forms of fault-tolerance could be achieved with little or no performance penalty.
An excellent recent example of the power of abstraction in computer system design is the introduction and widespread use of network file systems that present the user with a uniform logical view of the file system. By hiding the details of the actual distribution of files to physical disks, these systems greatly enhance the user's ability to function efficiently.
On the other hand, a major deficiency of most current network environments is that fundamental user-level operations such as process creation are normally bound to a particular user-specified physical machine. For example, the user typically logs in to a particular physical machine within a network, rather than simply logging in to ``the network'' as a whole. This is a clear case of over-specification, since the user does not care on which physical machine (or machines) a process is executed, as long as it is executed efficiently. We propose to design, analyze, and implement the routing algorithms needed to support a high-performance network environment in which the user is freed from the responsibility of specifying a particular physical machine to perform a given task. The algorithms that we develop will scale efficiently to networks of arbitrary size. In recent routing-related work, we have given a precise analysis of two fundamental contention resolution protocols.
For the purposes of developing such routing algorithms, we view a given network abstractly as a directed graph where the vertices correspond to physical machines (e.g., processors, disks, terminals) and the arcs correspond to communication channels. Each vertex/arc has an associated set of performance characteristics. In the case of an arc, the performance would typically be modeled by the maximum data rate of the associated communication channel. In the case of a vertex, the performance characteristics would include information describing the set of tasks that can be executed on the corresponding machine. Application-level programmers would design their programs not for a particular physical machine, but rather for an abstract machine with a particular set of capabilities (e.g., C language support, X Windows support). A request for service by a network user would be modeled by the introduction of a token, labeled with the set of capabilities required for its execution, at the appropriate vertex of the graph. The goal of our routing algorithms would be to efficiently solve the on-line distributed scheduling problem associated with executing all of the user tasks in as little time as possible. (If a particular user task cannot be executed because no physical machine in the network has the required set of capabilities, this fact would be reported to the user.)
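In outline, the model might be rendered as follows; this is a sketch with illustrative names, not a committed design.

    #include <algorithm>
    #include <set>
    #include <string>

    // Vertices are machines with capability sets; arcs are channels with
    // maximum data rates; a service request is a token labeled with the
    // capabilities required for its execution.
    struct Vertex { std::set<std::string> capabilities; };  // e.g., {"C", "X"}
    struct Arc    { int from, to; double maxDataRate; };
    struct Token  { std::set<std::string> required; int location; };

    // A token may be routed to any vertex whose capabilities cover its label.
    bool canExecute(const Vertex& v, const Token& t) {
        return std::includes(v.capabilities.begin(), v.capabilities.end(),
                             t.required.begin(), t.required.end());
    }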
Several important remarks are in order concerning the on-line distributed scheduling problems outlined above. First, although a large number of valuable advances have been achieved in the study of network routing problems, current routing algorithms are not sophisticated enough to solve the above problem efficiently. For instance, most of the research in network routing algorithms assumes a fixed source and destination vertex for each token. In our problem, a token can be routed to any vertex having the necessary set of capabilities. Of course, many of the dynamic load balancing algorithms that have been proposed also share this characteristic, but these algorithms tend to make other extreme assumptions; for example, that all tokens are indistinguishable.
Second, the graph model proposed above is actually oversimplified, because it does not take into account the potential for dynamic reconfiguration of the software-based capabilities of a network. For example, the execution of a particular user token might require a copy of a given file to be present at the vertex holding the token. Thus, we might wish to model files as a second kind of token.
Third, it should be mentioned that we do not expect to obtain precisely optimal solutions to the on-line distributed scheduling problems discussed above. Rather, we hope to discover on-line algorithms for which we can prove that the performance does not deviate from that of the best off-line strategy by more than some modest factor. Such competitive analysis has been applied successfully to other routing problems. Unfortunately, this approach tends to lead to overly pessimistic bounds, since a series of small factors are typically ``given up'' in order to simplify the analysis. For this reason, once we have identified an algorithm with provably good performance, we plan to experiment with an actual implementation of the algorithm in order to better understand its practical performance.
Finally, we propose to study fault tolerance with regard to both network configuration and routing strategies. Since the links of the network are much more likely to fail than either the workstations or the ATM switches, a fault-tolerant design for the network would be one in which there are k link-disjoint paths between any pair of workstations, for a suitable value of k determined by the probability of link failures. This allows the network to remain fully operational with up to k-1 link failures. In addition, we would like the number of hops between any pair of workstations to be as small as possible, and we would like the topology to be incrementally upgradable. The problem of augmenting a given graph with a minimum number of edges to meet a certain edge- or vertex-connectivity requirement has been studied extensively, and efficient algorithms are known for the exact solution of the edge-augmentation problem. The problem of relevance to our network of workstations is further constrained: (i) we need to minimize the number of hops, (ii) the degree of a node representing an ATM switch is bounded by technological parameters (currently 16), and (iii) the workstations should be only the sources or destinations of paths, never intermediate nodes. Since it is well known that problems closely related to the basic minimum augmentation problem are NP-complete, we will first investigate the tractability of finding exact solutions, and then investigate efficient algorithms for either the exact or approximate solution of the problem, as appropriate.
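By Menger's theorem, the number of link-disjoint paths between two workstations equals the maximum flow in the corresponding unit-capacity graph, so a candidate topology can be checked against a target k by repeated augmentation. The sketch below is illustrative only (it ignores the degree and endpoint constraints above); an undirected link is modeled as a pair of unit-capacity arcs.

    #include <queue>
    #include <vector>

    // Count link-disjoint paths from s to t in a unit-capacity graph.
    int linkDisjointPaths(int n, std::vector<std::vector<int>> cap, int s, int t) {
        int paths = 0;
        for (;;) {
            std::vector<int> prev(n, -1);          // BFS for an augmenting path
            std::queue<int> q; q.push(s); prev[s] = s;
            while (!q.empty() && prev[t] == -1) {
                int u = q.front(); q.pop();
                for (int v = 0; v < n; ++v)
                    if (cap[u][v] > 0 && prev[v] == -1) { prev[v] = u; q.push(v); }
            }
            if (prev[t] == -1) return paths;       // no augmenting path remains
            for (int v = t; v != s; v = prev[v]) { // push one unit of flow
                cap[prev[v]][v] -= 1;
                cap[v][prev[v]] += 1;              // residual arc
            }
            ++paths;
        }
    }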
For routing algorithms we plan to investigate various heuristics for re-routing packets dynamically under link failures. These routing strategies will need to take into account the varied nature of the packets in the networks, with varying time constraints on delivery, and the finite buffer sizes at the ATM switches. We will analyze the performance of the heuristics we develop using standard theoretical tools; however, experimentation on the network will be a crucial tool in guiding the design of our heuristics and analyzing their performance.
Compositional Programming (Browne)
The compositional approach to programming separates structure and computation. It enables declarative specification of parallelism and thus facilitates compilation to multiple targets. It strongly encourages component reuse, since components of computations are specified separately from program structure. For sequential computations, compositional programming focuses on the definition of interfaces and the semantics of units of computation since the only possible control structure is a sequential flow among the units of computation. Parallel programming requires that an ordering of execution be derived from the compositional specifications. The nodes of a data flow graph, for example, can be regarded as units of computation and the arcs of the graph map outputs of one node to inputs of another. The entire graph then represents a unit of computation, which can be used in further dataflow specifications.
The CODE family of parallel programming systems has been in development since 1984. CODE specifies a parallel computation as a generalized dependence graph in which the nodes are either primitive units of computation or (recursively) generalized dependence graphs. Each unit of computation is specified by a firing rule that determines the state in which the computation executes, a function or relation describing the computation, and a routing rule that determines the disposition of the results. The system has evolved through CODE-2 and CODE-3; these versions incorporate data-partitioning parallelism alongside the functional partitioning of the dataflow graph model, together with a component library manager. An integrated debugging system is being developed.
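The following is an illustrative C++ rendering of such a node (CODE itself is a graphical notation, and the names here are ours): a firing rule over the node's input state, the computation proper, and a routing rule that disposes of the results.

    #include <functional>
    #include <vector>

    using State = std::vector<double>;  // stand-in for a node's input state

    struct GraphNode {
        std::function<bool(const State&)>  firingRule;  // may the node fire?
        std::function<State(const State&)> compute;     // the unit of computation
        std::function<void(const State&)>  route;       // map outputs to arcs
    };

    // A scheduler fires any node whose firing rule holds on its inputs.
    void step(GraphNode& n, const State& inputs) {
        if (n.firingRule(inputs))
            n.route(n.compute(inputs));
    }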
Another approach to compositional programming is embodied in the MaTRiX++ system, where computations are expressed in terms of operations on hierarchical matrices. The structure of the computation derives directly from the hierarchical type structure of the matrices defined by the computation. This allows the selection of optimal algorithms for the computation on each sub-matrix and leads naturally to a partitioning for parallel execution.
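A sketch of the idea in C++ terms (the identifiers are ours, not MaTRiX++'s): a matrix is either a dense leaf block or a 2x2 grid of submatrices, and multiplication dispatches on that type structure, applying a tuned kernel at the leaves, while the eight quadrant products above the leaves partition naturally for parallel execution. The leaf kernel is left abstract.

    #include <memory>
    #include <vector>

    struct HMatrix {
        std::vector<double> block;      // dense data when this is a leaf
        std::unique_ptr<HMatrix> q[4];  // quadrants A11, A12, A21, A22 otherwise
        bool leaf() const { return q[0] == nullptr; }
    };

    void denseMultiplyAdd(const HMatrix&, const HMatrix&, HMatrix&); // leaf kernel

    // C += A * B; assumes a, b, and c are decomposed congruently.
    void multiplyAdd(const HMatrix& a, const HMatrix& b, HMatrix& c) {
        if (a.leaf()) { denseMultiplyAdd(a, b, c); return; }
        // C_ij += sum over k of A_ik * B_kj; the recursive quadrant
        // products are independent up to the accumulation into C.
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                for (int k = 0; k < 2; ++k)
                    multiplyAdd(*a.q[2*i+k], *b.q[2*k+j], *c.q[2*i+j]);
    }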
Our experience in designing programming environments for several scientific applications suggests that compositional programming is one of the key methodologies in such designs. First, we design the components that constitute the kernel of the application and then we design the user interfaces specifying how the components are to be used. We are currently building specialized versions of the graphical interface for the CODE system for the development of Adaptive Parallel Methods for Computational Fluid Dynamics (funded by ARPA) and the Binary Black Hole NSF Grand Challenge project; both of these projects are being undertaken in conjunction with TICAM. One of our major concerns is the efficiency of the implementations. The proposed platform will support the required experimentation.
Lightweight Database System Generators (Batory)
The conventional approach to building a database management system (DBMS) is to create a system that is general-purpose, yet open to internal modifications so that it can be customized for particular applications. Examples of this approach include well-known toolkits (such as Wisconsin's Exodus) and open DBMSs (such as IBM's Starburst, Texas Instruments' Open OODB, Berkeley's Postgres, and Texas's Genesis). Our experience with Genesis has led us to question this approach to DBMS construction: most special-purpose database applications that we have encountered are performance-driven to the extreme, and the overhead of general-purpose DBMSs is so high that it precludes DBMS usage altogether.
Consider the lock manager of a DBMS. It maintains a set of tables whose structures resemble that of relations. Each table stores a different number of ``tuples'' (e.g., lock names and lock requests); operations on tables (e.g., to set or remove locks) are atomic and multi-table queries arise frequently. Because performance is critical, the processing of queries is highly optimized. Clearly, a lock manager could use a general-purpose DBMS to manage its data, but the performance would be disastrous. Operating systems offer another example. Page tables, segment tables, and process tables contain ``tuples'' (e.g., page-table entries, segment entries, and process control blocks) that are interrelated. Multi-table queries are common, and operations in a multiprocessor environment must be atomic. The basic features that are needed to maintain page, segment, and process data are offered by general-purpose DBMSs--with an unacceptable performance penalty. There are many other examples: compilers query and update symbol tables, network managers access and modify routing tables, name servers maintain a horizontally partitioned and distributed table of service names and network addresses, mail systems query address alias tables and persistent object stores maintain page translation tables. Essentially any application involving frequent queries and updates to data structures fits this paradigm.
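To make the observation concrete, a lock manager's data might be rendered as relations along the following lines (a sketch with our own names); in a real lock manager the query below is hand-optimized and executed atomically with respect to updates.

    #include <string>
    #include <vector>

    struct Lock    { std::string name; int mode; };               // Locks(name, mode)
    struct Request { std::string lockName; int txn; int mode; };  // Requests(lock, txn, mode)

    struct LockTables {
        std::vector<Lock>    locks;
        std::vector<Request> requests;

        // In effect: SELECT * FROM Requests WHERE lockName = ?
        std::vector<Request> waiters(const std::string& name) const {
            std::vector<Request> out;
            for (const auto& r : requests)
                if (r.lockName == name) out.push_back(r);
            return out;
        }
    };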
We argue that all such applications are actually employing special purpose database systems for their data management tasks. However, these special-purpose systems were never designed to support a general class of schemas or ad hoc queries. Rather, they are lightweight database systems (LDBs) that were designed to support a single, specific schema and a predeclared set of retrieval and update operations. Compared to general-purpose DBMSs, LDBs provide considerably less functionality with a correspondingly substantial increase in performance. LDBs today are hand-crafted and very expensive to build. Building a single ``open'' or ``extensible'' LDB is not the solution. All vestiges of DBMS generality and unnecessary features must be eliminated at LDB compile time so that the resulting LDB code is efficient. This requires a radically different approach to the construction of special-purpose DBMSs.
We have built a lightweight database system generator, P2, by reengineering and retargeting Genesis. P2 is a precompiler that offers a superset of the C programming language. Users program in terms of database container and cursor abstractions. The target LDB implementation is declared as a composition of prewritten components from the P2 library. The P2 preprocessor replaces all cursor and container declarations and operations with their C implementations.
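The flavor of the approach can be suggested with a loose C++ analogy (this is not P2's actual syntax): the programmer codes against generic container abstractions, the implementation is declared as a composition of prewritten components, and the composition is resolved entirely at compile time, so no interpretive overhead survives in the generated code.

    #include <string>
    #include <unordered_map>

    struct LockReq { std::string name; int txn; int mode; };

    // One prewritten component: a hash index keyed on the name field.
    struct HashByName {
        std::unordered_multimap<std::string, LockReq> idx;
        void insert(const LockReq& r) { idx.emplace(r.name, r); }
    };

    // The declared composition; swapping components changes the layout
    // without touching code written against the container interface.
    template <typename Impl>
    struct Container : Impl {};

    Container<HashByName> requests;   // roughly, "stored as hash(name)"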
Our initial experiments compared P2-generated container implementations to those found in contemporary template libraries. P2 containers were as efficient as, and in many cases more efficient than, their template counterparts. We then reengineered Miranker's LEAPS production-system compiler using P2. LEAPS is a compiler that translates OPS5 rule sets into C programs. We demonstrated that P2 offers significant increases in software productivity, reducing the LEAPS development time by a factor of 3 and the volume of code that had to be written by a factor of 4. Moreover, the P2-generated programs were between 50% and 250% faster than LEAPS-generated programs. The reasons for the improvement were that (1) P2 could automatically perform certain optimizations that were difficult or impossible to do by hand, (2) the database abstractions offered by P2 revealed some global simplifications that were otherwise obscure, and (3) experiments with P2 confirmed long-suspected performance bottlenecks in LEAPS.
A primary benefit of an LDB generator is that different LDB implementations can be tried easily for performance tuning: one simply alters the declared composition of P2 components that defines the LDB implementation and recompiles. As an example, we replaced the nested-loops join algorithm in LEAPS with a hash-join algorithm and observed performance improvements of up to several orders of magnitude. This modification took three days: writing a hash-join component and recompiling the P2 version of LEAPS. In contrast, it would have taken weeks to modify LEAPS by hand. Moreover, P2 offers the possibility that completely different LDB implementations could be generated on a per-rule-set basis; LEAPS makes no provision for such a capability.
We have demonstrated with Genesis that distributed DBMSs can be built from components that encapsulate distributed storage based on a client-server model. As part of our proposed research, we will show that similar componentry exists for P2, so that distributed, high-performance LDBs can be built from components. P2 is a powerful domain-specific programming tool whose scalability to real applications has been amply demonstrated. It can be expected to provide significant leverage in the development of high-performance, customized LDBs that will be encountered in various research efforts and applications described in this proposal. These efforts will in turn provide us with the same benefits that we obtained from our implementation of LEAPS, a rich source of applications with which to validate our approach.
Specialization of Generic Programs (Novak)
Our vision of the typical networked computing environment is a heterogeneous distributed system: an environment that offers a diverse set of capabilities and resources to a user. Experience has shown that traditional ways of programming--requiring a human to understand each computing resource and its interfaces and to manually write code to tie them together--are ineffective when the number of computing resources is large. To make such a computing environment effective, we need facilities that generate programs from specifications at a much higher level than subroutine calls in traditional languages. We propose to develop facilities for generating programs by specialization of generic procedures, with programs being specified through graphical interfaces that are easy to use and natural to the problem domain.
For example, suppose that a user would like to predict soybean production in the current growing season. The system may provide a database from the Department of Agriculture telling in what counties soybeans are grown; a geographical database giving outlines of counties; a NASA database giving spectral sensor readings over the U.S.; a Weather Service database of rainfall estimates derived from radar data; a program that models soybean yields as a function of present state and rainfall, etc. The user would like to restrict the NASA data to the appropriate counties, estimate the state of the soybeans from the spectral data, and use the model and rainfall data to predict yields. In this example, the data sets and programs are large, come from different sources across the network, and were not designed to work together to solve this particular problem.
In order to interface diverse programs and data sets, it is essential to have symbolic descriptions of the syntax and semantics of data. Standardization of data formats can be used in small domains where a single organization controls all data sets and programs. However, in general the data sets and programs may come from many sources and be combined in unforeseen ways. Therefore, it is necessary to interface programs and data sets that are ``incompatible'' in the traditional sense. Our approach is to define abstract data types, and generic procedures for them. We then create views to describe how concrete application data types can be thought of as implementations of our abstract types. Views can be used to make different programs and data sets compatible in several ways. First, views can be used to automatically generate programs that convert one data format into another. For example, data from a database can be converted to the form needed by a library subroutine. Second, a generic procedure can be specialized through a view so that it operates directly on the existing form of application data. This avoids having to materialize a converted copy of the data. Third, application data of one type can be converted to application data of another type based on a shared view of each as the same abstract type. This could be used to combine data from two databases with similar data but different representations.
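The following sketch suggests, in C++ terms, how a view enables specialization without materializing a converted copy (the actual system generates code from declarative view specifications; all names here are illustrative): the abstract type is a sequence of readings, the view adapts a concrete record layout to that abstraction, and a generic procedure specialized through the view operates on the application data in place.

    #include <vector>

    struct SensorRecord {            // concrete application layout
        double lat, lon;
        float spectral[7];
    };

    struct AsReadings {              // view: SensorRecord as the abstract type
        static double value(const SensorRecord& r) { return r.spectral[0]; }
    };

    // Generic procedure over the abstract type, specialized through a view.
    template <typename View, typename Rec>
    double meanValue(const std::vector<Rec>& data) {
        double s = 0;
        for (const auto& r : data) s += View::value(r);
        return data.empty() ? 0.0 : s / data.size();
    }

    // usage: meanValue<AsReadings>(nasaData) -- no converted copy is built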
Views allow automatic generation of ``glue code'' that ties together various data sets and procedures by making their inputs and outputs compatible. Views themselves, however, can be complex and difficult to write by hand. We have developed graphical user interfaces that make it easy for a user to create views by specifying correspondences between application data types and abstract types. Once made, views can be stored along with the data sets they describe. The organization that creates a data set could provide typical views of that data set; if an appropriate view exists, ``glue code'' for use of the data set with a procedure that expects such a view can be generated automatically.
Programs should be specified at a high level, in which the most important decisions are specified by the user while details are handled automatically. Graphical connection of data sets and programs appears to be an effective method of specification. However, it is necessary to make certain that the connections between data sets and programs fit appropriately so that the results will be correct, the generated code is efficient, and the performance can be predicted well enough to alert the user to potential problems. Views can be used to make interfaces fit, as described above. In some cases, new views will be necessary; automated tools can help users to make them. Data set sizes and performance models of programs can be used to decide on the method of interfacing data and procedures. If a data set is small, it can be converted and materialized. If it is large, a specialized version of the procedure can avoid conversion. If a summary of a large data set is required, the procedure might best be sent to the location where the data is stored rather than transporting the data across the network.
Efficient Rule-Based Programming for Simulation and Database Access (Miranker)
We have been working for a number of years on compilation techniques for efficiently executing rule-based programs. As a result, we have achieved speedups of 2 to 3 orders of magnitude over traditional approaches to sequential rule processing. This work has made rule-based computation against large, disk-resident databases practical. We have continued to explore new avenues for increasing the efficiency of rule-based computation. In particular, we have been investigating parallel execution of rule programs to achieve further speedup. Our initial results demonstrated that there is little performance to be gained from the parallel evaluation of main-memory applications written in languages based on OPS5 semantics. However, we are steadily accumulating results that demonstrate impressive and scalable performance improvements for rule languages executing against large, disk-resident databases.
We have determined that simulation models of the type used in distributed, event-driven battlefield simulations are essentially rule programs, although typically they are coded in an imperative or object-oriented language such as C, C++, or Ada. As a result, they are difficult to develop and expensive to maintain and upgrade. To date, performance concerns have prompted developers to avoid rule-based approaches. Our rule-system technology has reached a level of performance that allows rule-based simulation models to be used in these environments in place of traditional code. To demonstrate this, we have reimplemented a component of a battlefield simulator, achieving adequate performance with an order-of-magnitude decrease in code size compared to the original. Moreover, once the basic system was demonstrated, several longstanding, difficult maintenance problems in the original code were easily corrected.
We are now developing a system based on our rule technology that is designed to support real-time simulation. This system will, for the first time, incorporate all of the techniques that we have developed for efficient rule-based computation, as well as results from our investigation of real-time decision systems based on our programming model. We have shown that, for limited domains, we can provide both execution timing guarantees that fulfill hard real-time deadlines and outstanding raw performance.
The target of our new language and compiler is a heterogeneous distributed workstation environment of the type described in this proposal. Our system is expected to be the basis for implementing the simulation models in a new ARPA-sponsored project on distributed interactive simulation that we are undertaking in conjunction with the Applied Research Laboratory at UT Austin. In this context we plan to investigate concurrency control among simulated objects and primitives for efficient reasoning about large geometric databases. These capabilities are critical to simulation models that must interact with each other in a common simulated world.
From a user's perspective, the most significant problem standing in the way of using large, distributed, heterogeneous databases, such as the EOS databases, is locating pertinent information. Users must first determine which databases are potentially relevant to their interests, and then formulate queries for them. This requires considerable knowledge of the overall contents and the internal form (i.e., terms and relations) of each database. A goal of our research is to develop methods for content-based accessing of databases using natural-language queries.
Content-Based Accessing
Because current query interpreters are quite rigid, users must express their queries using the database's (often peculiar) terms and relations. Even small mismatches--such as between two different, but synonymous, terms--are not tolerated. Clearly, a better query interpreter would use semantics to determine the relevance of database information to users' queries.
Content-based query interpretation requires a knowledge base of domain terms in which each term is ``defined'' by its position in one or more hierarchies and by its relationships to other terms. For example, in a knowledge base on climatology for the EOS application, ``rainfall'' might be a specialization of ``precipitation'' in which the precipitate is liquid. This knowledge can be used by query interpreters to reformulate queries in various ways. For example, if a query uses terms that do not syntactically match the terms in a database, then related terms (e.g. synonyms) can be substituted. Similarly, if a query fails, then terms in the query might be replaced by more general terms.
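A toy sketch of this reformulation strategy follows (the knowledge representation here is drastically simplified, and the names are ours): an unknown term is first replaced by a synonym known to the database and, failing that, generalized up the hierarchy, e.g., ``rainfall'' to ``precipitation''.

    #include <map>
    #include <set>
    #include <string>

    struct TermKB {
        std::map<std::string, std::set<std::string>> synonyms;
        std::map<std::string, std::string> parent;   // specialization hierarchy
    };

    std::string reformulate(const std::string& term,
                            const std::set<std::string>& dbTerms,
                            const TermKB& kb) {
        if (dbTerms.count(term)) return term;        // exact match: done
        auto s = kb.synonyms.find(term);             // try synonyms first
        if (s != kb.synonyms.end())
            for (const auto& syn : s->second)
                if (dbTerms.count(syn)) return syn;
        auto g = kb.parent.find(term);               // then generalize upward
        if (g != kb.parent.end())
            return reformulate(g->second, dbTerms, kb);
        return term;                                 // no reformulation found
    }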
Several researchers have developed methods for reformulating queries using domain knowledge, and more will undoubtedly follow. Unfortunately, these methods are inherently dependent on knowledge bases, and methods for building knowledge bases efficiently are still in their infancy. Nevertheless, we have taken some important first steps. For one area of biology, we have built one of the largest knowledge bases of its kind, containing over 14,000 concepts and 200,000 relations among the concepts. We have developed a frame-based language for representing knowledge and we have built tools for visualizing and editing knowledge bases. Finally, we have used this knowledge base for the most extensive application and evaluation of automated explanation generation to date.
The next step in our research is to develop methods to partially automate the process of building large knowledge bases. First, we propose to develop a domain independent ontology of terms and relations. This will facilitate building new knowledge bases, and it will standardize and localize the dependencies between each knowledge base and the domain-independent programs (such as query interpreters and explanation generators) that use it. Second, we propose to develop machine learning methods that facilitate encoding information in the structured representations of knowledge bases. Although many methods might help, we have found that one is critically important: a method that efficiently computes the consequences of changing a knowledge base. The consequences, such as inconsistencies in the knowledge base, can be subtle yet significant. We have developed prototype software for this task, but it is not yet an effective tool for full-scale knowledge bases.
Natural-Language Queries
Queries to a database or information-retrieval service must normally be stated in complex query languages such as SQL. Allowing queries to be entered in natural language is obviously more convenient, particularly for novice users. Research in computational linguistics has produced very useful methods for building natural-language interfaces to databases; however, developing such interfaces remains a laborious task. Building an interface for a particular application generally requires a computational linguist to engineer both a specific grammar and ambiguity-resolution methods for the expected range of queries.
To address this problem, we have developed machine-learning methods for automatically constructing natural-language parsers from sample sentences paired with their desired query-language translations. Recent automated induction techniques for learning first-order logical rules from data are used to construct, automatically from a corpus of parsed sentences, a deterministic shift-reduce parser written in Prolog. This reduces the complex task of writing a natural-language parser to simply providing a sufficient set of examples of the desired translation process. The technique has already produced a syntactic parser from real queries to an airline travel information service, as well as a system for translating simple English queries into SQL, trained on sentences artificially generated from a prescribed grammar.
For one or more of the intended applications, we propose to use this machine-learning technology to construct a natural-language interface for translating English queries into the formal language required by the knowledge-based query interpreter. Training data will be obtained by collecting formal-language queries from the initial users of the system and pairing them with equivalent English versions, also provided by the user or determined later by the experimenters. This process should be fairly straightforward for any of the intended applications.
Since the induction algorithm is complex and requires a fair amount of computing resources, the proposed equipment will allow this research to continue and the method to be improved to handle large numbers of examples. Training on 200 sentences from a real corpus currently requires several hours on a Sparcstation 2. We propose to develop a more efficient version of the system by employing variations of the ``windowing'' approach used in decision-tree induction. In this approach, only a small subset of the data (the window) is initially used for training, and the result is tested on the remaining data. A sample of the incorrectly handled examples is added to the window, and the process is repeated until all of the examples are handled correctly. This approach can substantially decrease training time by reducing the amount of data that must be actively processed by the induction algorithm.
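The loop itself is simple; the sketch below shows it with hypothetical train-and-test hooks standing in for the induction algorithm and the parser it produces.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Example { /* sentence paired with its desired translation */ };
    struct Parser  { /* learned shift-reduce rules */ };

    Parser induce(const std::vector<Example>& window);   // expensive step
    bool   correct(const Parser& p, const Example& e);   // cheap test

    Parser windowedInduce(const std::vector<Example>& data,
                          std::size_t initialSize, std::size_t sampleSize) {
        std::vector<Example> window(
            data.begin(), data.begin() + std::min(initialSize, data.size()));
        for (;;) {
            Parser p = induce(window);                 // train on the window only
            std::size_t added = 0;
            for (const auto& e : data)                 // test on all the data
                if (!correct(p, e) && added < sampleSize) {
                    window.push_back(e);               // add a sample of failures
                    ++added;                           // (a real implementation
                }                                      // would skip duplicates)
            if (added == 0) return p;                  // all examples correct
        }
    }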
We also propose to develop methods for automatically learning the associations between words and semantic tokens (lexical acquisition). Currently, the possible meanings of each word must be provided to the system in advance, an extra burden on the system developer. Using a greedy covering algorithm, we intend to discover word-meaning pairs automatically from training data consisting only of sentences paired with desired formal representations (e.g., SQL). At each step, the proposed method adds the word-meaning pair that accounts for the most uncovered tokens in the representations present in the training set. This continues until every token in every representation is accounted for by some word in its corresponding sentence. We intend to implement this method and test it by constructing parsers for one or more of the proposed applications.
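A sketch of the covering step follows (identifiers are ours, and we allow only one meaning per word for simplicity): each training pair contributes its words and its still-uncovered semantic tokens, and the (word, token) pair covering the most tokens is adopted at each step.

    #include <algorithm>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    struct TrainPair { std::set<std::string> words, tokens; };

    std::map<std::string, std::string> learnLexicon(std::vector<TrainPair> data) {
        std::map<std::string, std::string> lexicon;   // word -> meaning token
        for (;;) {
            // count how many uncovered tokens each (word, token) pair explains
            std::map<std::pair<std::string, std::string>, int> count;
            for (const auto& p : data)
                for (const auto& w : p.words)
                    for (const auto& t : p.tokens)
                        ++count[{w, t}];
            if (count.empty()) return lexicon;        // every token covered
            auto best = std::max_element(count.begin(), count.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            lexicon[best->first.first] = best->first.second;
            for (auto& p : data)                      // mark those tokens covered
                if (p.words.count(best->first.first))
                    p.tokens.erase(best->first.second);
            data.erase(std::remove_if(data.begin(), data.end(),
                           [](const TrainPair& p) { return p.tokens.empty(); }),
                       data.end());
        }
    }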