Computer Architecture Seminar Abstracts

Fall 1999


Jesse Fang
Intel Corporation

IA-64 Architecture and Compiler Technology

IA-64 is Intel's 64-bit instruction set architecture, co-developed with HP starting in 1994. Recently, Intel and HP have made public disclosures regarding the instruction set. Since the design philosophy behind IA-64 is based on dividing functionality between dynamic (runtime) and static (compile-time) mechanisms, this talk gives an overview of the IA-64 architecture and how to use its features to optimize code. The talk also describes compiler technology for the IA-64 architecture, including special optimizations designed for its architectural features. Since dynamic compilation and dynamic translators are becoming mainstream technology, I also provide a brief introduction to Java and dynamic compilation on IA-64.
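
One IA-64 feature that such compiler optimizations target is predication. As a minimal illustration, and not material from the talk itself, the C sketch below contrasts a branchy loop with an if-converted form in which both outcomes are computed and a predicate selects between them (the function names and data are invented):

    #include <stdio.h>

    /* Branchy form: one hard-to-predict conditional branch per element. */
    static int saturate_branchy(int *a, int n, int limit)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > limit) {
                a[i] = limit;
                count++;
            }
        }
        return count;
    }

    /* If-converted form: both outcomes are computed and a predicate selects
       between them, the shape a predicating compiler can map onto predicated
       instructions with no branch at all. */
    static int saturate_predicated(int *a, int n, int limit)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {
            int p = (a[i] > limit);      /* predicate */
            a[i]  = p ? limit : a[i];    /* conditional select */
            count += p;                  /* predicated add */
        }
        return count;
    }

    int main(void)
    {
        int a[] = { 3, 9, 1, 12, 7 };
        int b[] = { 3, 9, 1, 12, 7 };
        printf("%d %d\n", saturate_branchy(a, 5, 6), saturate_predicated(b, 5, 6));
        return 0;
    }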

Biography: Jesse Fang manages the Compilers and Java Lab at MRL (Microprocessor Research Labs), Intel Corp. Before Intel, he worked on architecture and compilers at HP Labs, and before that he was at Convex Computer Corp. and Concurrent Computer Corp. He was a post-doctoral researcher at CSRD (Center for Supercomputing Research and Development) at the University of Illinois after receiving his Ph.D. in Computer Science from the University of Nebraska.


Mark Oskin
University of California, Davis

Active Pages: Bringing Intelligent Memory to Commodity Systems

Microprocessors and memory systems suffer from a growing gap in performance. This talk introduces Active Pages, a computation model which addresses this gap by partitioning computations between the processor and the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. For example, an Active Page might contain image or database data and support functions to perform filtering or simple searches.

Unlike most other research in intelligent memory, we focus upon integration with commodity desktop systems. Active Pages can be implemented as a pin-compatible replacement for conventional DRAM and rely upon a memory-mapped interface which uses only standard memory reads and writes. Active Pages also leverage conventional virtual memory mechanisms and provide a protected environment for multiprogramming.
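
The abstract does not specify the Active Pages programming interface; as a minimal sketch under assumed names (ActivePage, ap_write, and the dispatch and argument offsets are all invented here), the C code below suggests how a page of data with an associated filter function could be driven entirely through ordinary memory writes, with a store to a designated word of the page triggering computation in the memory system:

    #include <stdio.h>
    #include <string.h>

    #define PAGE_WORDS   1024
    #define DISPATCH_OFF 0      /* hypothetical: word 0 is the function-dispatch slot */
    #define ARG_OFF      1      /* hypothetical: word 1 holds an argument */

    typedef struct ActivePage {
        int words[PAGE_WORDS];  /* ordinary page data */
        /* function associated with the page (here: a simple filter) */
        void (*filter_ge)(struct ActivePage *p, int threshold);
    } ActivePage;

    static void filter_ge(ActivePage *p, int threshold)
    {
        /* zero every data word below the threshold, in place */
        for (int i = 2; i < PAGE_WORDS; i++)
            if (p->words[i] < threshold)
                p->words[i] = 0;
    }

    /* A plain memory write; a write to the dispatch slot triggers computation
       inside the "memory", so the processor only ever issues loads and stores. */
    static void ap_write(ActivePage *p, int word, int value)
    {
        p->words[word] = value;
        if (word == DISPATCH_OFF && value == 1)     /* opcode 1 = run the filter */
            p->filter_ge(p, p->words[ARG_OFF]);
    }

    int main(void)
    {
        ActivePage page;
        memset(&page, 0, sizeof page);
        page.filter_ge = filter_ge;

        for (int i = 2; i < 10; i++) ap_write(&page, i, i);  /* fill some data */
        ap_write(&page, ARG_OFF, 5);                         /* argument: threshold 5 */
        ap_write(&page, DISPATCH_OFF, 1);                    /* store that runs the filter */

        for (int i = 2; i < 10; i++) printf("%d ", page.words[i]);
        printf("\n");                                        /* prints: 0 0 0 5 6 7 8 9 */
        return 0;
    }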

We describe several implementations of Active Pages based upon the integration of reconfigurable logic or processors into DRAM. Simulation results demonstrate up to 1000X speedups on several applications using Active Page memory versus conventional memory systems. We also discuss ongoing work which explores operating systems, cache coherence, power, and cost for intelligent memory, as well as hierarchical intelligent systems.

For more information, visit http://arch.cs.ucdavis.edu/AP/.


Mateo Valero
Universitat Politècnica de Catalunya
Dept. of Computer Architecture
Barcelona, Spain

Register File Use and Organization for Future Superscalar Processors

Register file access, use, and organization will be one of the critical issues in the design of future superscalar processors, which are expected to increase both the issue width (implying more register ports) and the size of the instruction window (implying more registers). The same is true for future multithreaded processors.

I will present a novel dynamic register renaming approach that targets the use of a minimal number of physical registers. The basic idea is to allocate physical registers to instructions when they execute rather than when they are decoded. This technique increases the ILP attainable with a given number of physical registers. Different allocation policies will be considered and evaluated.
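
The renaming scheme and allocation policies themselves are the subject of the talk and are not reproduced here; the toy C model below (with an invented four-instruction trace) only illustrates why deferring allocation can help: a destination register allocated at decode is held for the whole decode-to-release interval, whereas one allocated at execution is held only from execution onward, so fewer physical registers are live at any one time:

    #include <stdio.h>

    /* One dynamic instruction: the cycle it is decoded, the cycle it executes,
       and the cycle its destination register can be freed (e.g., when a
       redefining instruction commits). All values are invented. */
    typedef struct { int decode, execute, release; } Instr;

    /* Peak number of destination registers held at once, given that each
       instruction holds a physical register from its allocation point
       (decode time or execute time) until its release cycle. */
    static int peak_pressure(const Instr *tr, int n, int allocate_at_execute)
    {
        int peak = 0;
        for (int cycle = 0; cycle < 100; cycle++) {
            int live = 0;
            for (int i = 0; i < n; i++) {
                int from = allocate_at_execute ? tr[i].execute : tr[i].decode;
                if (from <= cycle && cycle < tr[i].release)
                    live++;
            }
            if (live > peak)
                peak = live;
        }
        return peak;
    }

    int main(void)
    {
        /* Instructions decoded back to back but executing much later,
           as in a large instruction window waiting on long-latency events. */
        Instr trace[] = {
            { 0, 20, 22 },
            { 1, 21, 23 },
            { 2, 22, 24 },
            { 3, 23, 25 },
        };
        int n = sizeof trace / sizeof trace[0];
        printf("decode-time allocation : peak %d physical registers\n",
               peak_pressure(trace, n, 0));
        printf("execute-time allocation: peak %d physical registers\n",
               peak_pressure(trace, n, 1));
        return 0;
    }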

In the second part of this talk, I will present and evaluate a register file organization based on a multilevel structure which provides low latency and single bypass logic. I will propose several caching policies and prefetching strategies, demonstrating the significant benefits and potential of this organization.

Biography: Dr. Mateo Valero obtained his Telecommunication Engineering Degree from the Polytechnic University of Madrid in 1974 and his Ph.D. from the Polytechnic University of Catalonia (UPC) in 1980. He is a Professor in the Computer Architecture Department at UPC. His current research interests are in the field of high performance architectures, with special interest in the following topics: processor organization, memory hierarchy, interconnection networks, compilation techniques and computer benchmarking. He has published approximately 200 papers on these topics. He served as the general chair for several conferences, including ISCA-98 and ICS-95, and has been an associate editor for IEEE Transactions on Parallel and Distributed Systems for three years. Dr. Valero has been honored with several awards, including the Narcís Monturiol, presented by the Catalan Government, the Salvà i Campillo, presented by the Telecommunications Engineer Association, and the King Jaime I, presented by the Generalitat Valenciana. He is a member of the IEEE and Director of the C4 (Catalan Center for Computation and Communications). Since 1994, he has been a member of the Spanish Engineering Academy.


Babak Falsafi
Dept. of Electrical and Computer Engineering
Purdue University

Purdue Impetus: Designing Future High-Performance Enterprise Server Architectures

Distributed shared memory (DSM) is emerging as the architecture of choice for medium- to large-scale multiprocessor servers. DSMs offer programming compatibility with respect to the ubiquitous bus-based symmetric multiprocessors (SMPs) by providing a logical shared address space over physically distributed memory. DSMs also enhance scalability by removing the shared bus bottleneck in SMPs. Tuning application performance on DSMs, however, can be difficult due to the non-uniform nature of memory accesses. DSMs suffer from a lack of performance transparency with respect to SMPs because remote shared-memory accesses inherently take ten to a hundred times longer than local memory accesses.

The Purdue Impetus project focuses on designing intelligent DSMs that look, act, and feel like the uniform memory architecture of SMPs. In these DSMs, the memory system transparently--without involving the application software--hides remote communication. In this talk, I will present results from two papers (which appeared in ISCA '99) describing transparent hardware techniques to improve DSM performance. In the first half of the talk, I will introduce a novel architecture, SC++, that preserves the simple and intuitive programming interface of sequential consistency (SC) while providing the performance of the best of relaxed consistency models, Release Consistency (RC). SC++ uses ILP instruction speculation mechanisms to execute memory operations out of order. I will present simulation results indicating that SC++ performs as well as RC.

In the second half of the talk, I will present novel hardware DSMs which learn, predict, and execute shared-memory accesses speculatively in advance to hide some or all of the remote access latency. The key mechanism behind these DSMs is the Memory Sharing Predictor (MSP): a pattern-based predictor, modeled on the ubiquitous Yeh and Patt two-level branch predictors, that learns and accurately predicts memory sharing patterns. I will present prediction accuracies and implementation costs for two MSP designs, and show preliminary performance results on executing shared-memory operations speculatively.
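
The two MSP designs evaluated in the papers are not reproduced here; the C sketch below is only a generic two-level, pattern-based predictor in the Yeh and Patt style, retargeted from branch outcomes to the identity of the next node expected to request a memory block (the node count, table size, and sharing trace are all invented):

    #include <stdio.h>

    #define NODES    4                  /* invented: 4-node DSM */
    #define PHT_SIZE (NODES * NODES)    /* patterns of the last two requesters */

    /* First level: per-block history of recent requester nodes, encoded base NODES.
       Second level: pattern history table mapping a history to the predicted next node. */
    typedef struct { int history; } BlockHistory;

    static int pht[PHT_SIZE];           /* predicted next requester for each pattern */

    static int predict(const BlockHistory *bh)
    {
        return pht[bh->history];
    }

    /* Update after the real next requester is observed. */
    static void train(BlockHistory *bh, int node)
    {
        pht[bh->history] = node;                                /* last-value update */
        bh->history = (bh->history * NODES + node) % PHT_SIZE;  /* shift node into history */
    }

    int main(void)
    {
        BlockHistory bh = { 0 };

        /* Invented sharing trace for one block: a migratory 1 -> 2 -> 3 -> 1 ... pattern. */
        int trace[] = { 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3 };
        int n = sizeof trace / sizeof trace[0], correct = 0;

        for (int i = 0; i < n; i++) {
            if (predict(&bh) == trace[i])
                correct++;
            train(&bh, trace[i]);
        }
        /* Accuracy is poor while the pattern is being learned, then perfect once it repeats. */
        printf("correct predictions: %d / %d\n", correct, n);
        return 0;
    }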

Biography:

Babak Falsafi is currently an Assistant Professor in the School of Electrical and Computer Engineering at Purdue University. Professor Falsafi's research interests include design, implementation, and evaluation of intelligent memory hierarchies with emphasis on prediction and speculation in high-performance memory systems. He is also interested in analytic and simulation tools for computer system performance evaluation. He is currently the principal investigator of the Purdue Impetus Project (URL: http://www.ece.purdue.edu/~impetus) designing and evaluating high-performance speculative shared-memory systems. He also co-leads the Purdue ICALP Project targeting an integrated architectural/circuit-level approach to power management in general-purpose and embedded computer systems. As a primary member of the Wisconsin Wind Tunnel, he designed Reactive NUMA, a novel DSM architecture that dynamically reacts to application/system behavior to optimize the remote caching policy. R-NUMA lays the foundation for the memory system in the recently-announced Sun WildFire DSM.

Prof. Falsafi received a B.S. in Computer Sciences and a B.S. in Electrical and Computer Engineering in 1990 from the State University of New York at Buffalo. He went on to pursue graduate studies at the University of Wisconsin, where he received an M.S. and a Ph.D. in Computer Sciences in 1991 and 1998, respectively. He is a member of the IEEE and the ACM.


William J. Dally
Professor of Electrical Engineering and Computer Science
Computer Systems Laboratory
Stanford University

Computer Architecture for the New Millennium

As we prepare to enter the next millennium, changes in the two forces that drive computer architecture, semiconductor technology and applications, are motivating major changes to processor architecture. Technology has become communication limited, with wire delay dominating cycle time. This motivates a move toward architectures that exploit locality and eliminate global register files and instruction issue units. At the same time, streaming applications involving video, communications, and graphics have little spatial or temporal locality, yet demand considerable memory and arithmetic bandwidth.

This talk will discuss the trends driving this change in computer architecture and will describe how the Stanford Imagine architecture addresses the challenges posed by these trends. Imagine uses a streaming memory system, a hierarchical register organization with distributed register files, and a novel method of handling conditional operations to match the demands of media applications to the capabilities and limitations of emerging VLSI technology. In simulation, a single 0.5 cm^2 Imagine chip sustains performance in excess of 10 GFLOPS on typical image and signal processing applications.
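
Imagine's actual programming model and register organization are beyond the scope of this abstract; as a rough C sketch with invented kernels and sizes, the idea of streaming is that a long stream of records is pulled through a chain of small kernels one batch at a time, so intermediate values stay in small local storage (the analogue of the distributed register files) rather than returning to memory between operations:

    #include <stdio.h>

    #define STREAM_LEN 1024
    #define BATCH      8     /* invented: a batch small enough for local register storage */

    /* Two toy "kernels" in a media-style pipeline. */
    static void kernel_scale(const int *in, int *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * 3;
    }

    static void kernel_clamp(const int *in, int *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] > 100 ? 100 : in[i];
    }

    int main(void)
    {
        static int input[STREAM_LEN], output[STREAM_LEN];
        for (int i = 0; i < STREAM_LEN; i++)
            input[i] = i % 64;

        /* Pull the stream through the kernel chain one batch at a time; the
           intermediate batch `tmp` stands in for local register storage and
           never goes back to the (simulated) memory system between kernels. */
        for (int base = 0; base < STREAM_LEN; base += BATCH) {
            int tmp[BATCH];
            kernel_scale(input + base, tmp, BATCH);
            kernel_clamp(tmp, output + base, BATCH);
        }

        printf("output[40] = %d\n", output[40]);   /* 40 * 3 = 120, clamped to 100 */
        return 0;
    }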

Biography:

William Dally received the B.S. degree in Electrical Engineering from Virginia Polytechnic Institute, the M.S. degree in Electrical Engineering from Stanford University, and the Ph.D. degree in Computer Science from Caltech. Bill and his group have developed system architecture, network architecture, signaling, routing, and synchronization technology that can be found in most large parallel computers today. While at Bell Telephone Laboratories Bill contributed to the design of the BELLMAC32 microprocessor and designed the MARS hardware accelerator. He was a Research Assistant and then a Research Fellow at Caltech where he designed the MOSSIM Simulation Engine and the Torus Routing Chip which pioneered wormhole routing and virtual-channel flow control. While a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology he and his group built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low overhead mechanisms for synchronization and communication. Bill has worked with Cray Research and Intel to incorporate many of these innovations in commercial parallel computers and with Avici Systems to incorporate this technology into Internet routers. Bill is currently a Professor of Electrical Engineering and Computer Science at Stanford University where he leads projects on high-speed signaling, multiprocessor architecture, and graphics architecture. He has published over 80 papers in these areas and is an author of the textbook, Digital Systems Engineering.


Gurindar S. Sohi
Professor of Computer Science and Electrical & Computer Engineering
University of Wisconsin-Madison

New Methods for Exploiting Program Structure and Behavior in Computer Architecture

Processor and system performance have grown at a phenomenal rate (60% per year) for many years; a sizeable portion of this improvement has come from architectural and microarchitectural techniques used to make productive use of the available semiconductor resources. Many of the techniques used by architects (e.g., caches and branch predictors) exploit program behavior -- the observed empirical characteristics of program execution.

In the next decade, advances in semiconductor technology will provide us with an abundance of transistors with which to build processing engines. The job of a computer architect will be to make productive use of these transistors: to carry out processing functions in more powerful ways than in previous generations. To do so, it is likely that program behavior will need to be understood, captured, and exploited in heretofore unknown ways. While current hardware techniques reason about program behavior by observing events, future hardware techniques are likely to reason about program behavior by learning about the program structure (the relationships between program instructions) that causes the observed behavior, and to exploit these relationships.

In this talk, we will look at recently-proposed hardware techniques to solve several problems that arise in the design of computing systems. These novel techniques exploit some knowledge about the dependence relationships amongst the instructions of a program. We will see how program structure-based techniques can be applied to the problems of scheduling out-of-order memory operations, streamlining communication through memory, managing memory hierarchies, prefetching linked structures, and optimizing communication in shared memory multiprocessors.
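
As one small illustration of a structure-based technique for the first of these problems, scheduling out-of-order memory operations, the C sketch below sketches memory dependence prediction in simplified form: a table indexed by load PC remembers which store the load last conflicted with, and the scheduler consults it before letting the load issue ahead of older stores. The table size, program counters, and trace are invented for the example and are not taken from the talk:

    #include <stdio.h>

    #define TABLE_SIZE 64

    /* Learned program structure: for each load PC, the PC of a store it was
       observed to depend on (0 = no dependence learned yet). */
    static unsigned dep_table[TABLE_SIZE];

    static unsigned idx(unsigned pc) { return pc % TABLE_SIZE; }

    /* Record a memory-order violation: the load at load_pc read data produced by
       the store at store_pc, so next time it should not issue ahead of that store. */
    static void record_violation(unsigned load_pc, unsigned store_pc)
    {
        dep_table[idx(load_pc)] = store_pc;
    }

    /* Scheduling decision: may this load issue even though the store at
       pending_store_pc, older in program order, has not executed yet? */
    static int may_issue_early(unsigned load_pc, unsigned pending_store_pc)
    {
        return dep_table[idx(load_pc)] != pending_store_pc;
    }

    int main(void)
    {
        unsigned store_pc = 0x400, load_pc = 0x410;   /* invented PCs */

        /* First encounter: no dependence learned, the load issues early and a
           violation is detected (in a real machine this squashes and replays). */
        printf("first time, issue early? %s\n",
               may_issue_early(load_pc, store_pc) ? "yes" : "no");
        record_violation(load_pc, store_pc);

        /* Later encounters: the learned structure makes the load wait. */
        printf("after learning, issue early? %s\n",
               may_issue_early(load_pc, store_pc) ? "yes" : "no");
        return 0;
    }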

Having seen the benefits of having access to relevant program structure information, an obvious question is how such information should be gathered, made available to the execution hardware, and maintained. There are several possibilities, ranging from purely hardware solutions to solutions that make use of compile-time information. This issue presents several open research problems, some of which we are investigating at Wisconsin.

Biography:

Guri Sohi teaches computer architecture at the University of Wisconsin-Madison. He joined the Wisconsin faculty after receiving his Ph.D from the University of Illinois in 1985, and is currently a Professor in both the Computer Sciences and Electrical and Computer Engineering departments.

Sohi's research has been in the area of architectural and microarchitectural techniques for high-performance microprocessors, including instruction-level parallelism, out-of-order execution with precise exceptions, non-blocking caches, decentralized microarchitectures, speculative multithreading, and memory dependence speculation. He received the 1999 ACM SIGARCH Maurice Wilkes award for contributions in the areas of high issue rate processors and instruction level parallelism.


Chuck Moore
IBM Austin

Server Oriented Microprocessor Optimizations

In this talk, I will describe aspects of server-oriented workloads and how they affect the microarchitectural decisions for the microprocessor and the surrounding memory subsystem. In addition, I will describe how technology can be applied to improve the performance of these types of workloads. Finally, I will describe some aspects of IBM's GigaProcessor (Power4) design.


Marius Evers
University of Michigan, Ann Arbor

Improving Branch Prediction by Understanding Branch Behavior

Pipeline flushes due to branch mispredictions are one of the most serious problems facing the designer of a deeply pipelined, superscalar processor. Due to the increasing issue widths and pipeline depths of current processors, this problem is becoming progressively more severe. In this talk, we approach the branch problem in a way different from previous studies. The focus is on understanding how branches behave and why they are predictable. Branches are classified based on their type of behavior, and the extent of each type of behavior is quantified. Based on this information about branch behavior, some shortcomings of current branch predictors (including gshare) are identified, new improved branch predictors are introduced, and potential areas for further improvement are identified.
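
For reference, the gshare predictor mentioned above indexes a table of two-bit saturating counters with the branch address XORed with a global history register; the small C program below (with invented sizes and an invented loop-like branch trace) shows that standard mechanism:

    #include <stdio.h>
    #include <string.h>

    #define PHT_BITS 12
    #define PHT_SIZE (1u << PHT_BITS)

    static unsigned char pht[PHT_SIZE];   /* 2-bit saturating counters */
    static unsigned ghr;                  /* global branch history register */

    static unsigned index_of(unsigned pc)
    {
        return (pc ^ ghr) & (PHT_SIZE - 1);   /* gshare: PC XOR global history */
    }

    static int predict(unsigned pc)
    {
        return pht[index_of(pc)] >= 2;        /* counter >= 2 means predict taken */
    }

    static void update(unsigned pc, int taken)
    {
        unsigned i = index_of(pc);
        if (taken  && pht[i] < 3) pht[i]++;   /* saturating counter update */
        if (!taken && pht[i] > 0) pht[i]--;
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
    }

    int main(void)
    {
        memset(pht, 1, sizeof pht);           /* start weakly not-taken */
        unsigned pc = 0x40a8;                 /* invented branch address */
        int correct = 0, total = 1000;

        /* Invented loop-like behavior: taken 7 times, then not taken, repeating. */
        for (int n = 0; n < total; n++) {
            int taken = (n % 8) != 7;
            if (predict(pc) == taken)
                correct++;
            update(pc, taken);
        }
        printf("gshare accuracy on the toy trace: %d/%d\n", correct, total);
        return 0;
    }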