Rajiv Gupta
The University of Arizona
Frequent Value Locality and its Applications
Abstract:
By studying the behavior of programs in the SPEC95 benchmark suite, we have observed the frequent value phenomenon: a small number of values appear very frequently in memory locations and are therefore involved in a large fraction of memory accesses. Moreover, we have observed that the set of frequent values remains stable over the execution of a program and that these values are distributed fairly uniformly across memory.
In addition to characterizing the frequent value phenomenon, I will describe two of its applications. In the first, by transmitting values belonging to a dynamically changing set of frequent values in encoded form, we greatly reduce the switching activity on a CPU's external data bus. In the second, by storing values belonging to a fixed set of frequent values in compressed encoded form, we significantly reduce the energy consumed by the data cache.
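As a toy illustration of the bus application (my own sketch, not the encoding scheme from the talk; the one-hot code, helper names, and sample trace are all assumptions), the following Python fragment counts bit transitions on a data bus with and without encoding a small frequent-value set:

    # Toy model of frequent-value bus encoding: values in the frequent set
    # travel as one-hot codes, which are a single bit away from an idle
    # (all-zero) bus; all other values travel verbatim. A real design would
    # also need a control signal to tell the two cases apart.

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def encode(value, frequent):
        """Return the word actually driven onto the bus for a 32-bit value."""
        if value in frequent:
            return 1 << frequent.index(value)   # one-hot code
        return value                            # raw transmission

    def switching_activity(values, frequent):
        """Count bit transitions with and without encoding."""
        plain = encoded = prev_plain = prev_enc = 0
        for v in values:
            plain += hamming(prev_plain, v)
            word = encode(v, frequent)
            encoded += hamming(prev_enc, word)
            prev_plain, prev_enc = v, word
        return plain, encoded

    trace = [0, 1, 0xFFFFFFFF, 0, 1, 0xFFFFFFFF]   # toy value trace
    freq = [0, 1, 0xFFFFFFFF]                      # assumed frequent values
    print(switching_activity(trace, freq))         # encoded count is far lower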
Biography:
Rajiv Gupta is a Professor of Computer Science at The University of Arizona. He received the BTech degree (Electrical Engineering, 1982) from the Indian Institute of Technology, New Delhi, India, and the PhD degree (Computer Science, 1987) from the University of Pittsburgh. His primary areas of research interest include power and performance issues in superscalar and VLIW architectures, profile guided optimization and program analysis, and instruction level parallelism.
Rajiv received the National Science Foundation's Presidential Young Investigator Award in 1991 and is an IEEE Distinguished Visitor. He is serving as Program Chair for the PLDI'03 and HPCA-9 conferences. He serves as an Associate Editor for the Parallel Computing journal and the IASTED International Journal of Parallel and Distributed Systems and Networks. Rajiv is a member of the ACM and a senior member of the IEEE.
Vijay Narayanan
Penn State University
Managing Leakage: From Circuits to Software
Abstract:
Leakage energy is expected to become a dominant portion of energy consumption in technologies below 0.1 micron feature sizes. In this talk, techniques spanning the circuit to software levels for controlling leakage energy will be presented. First, three circuit-level approaches to runtime leakage control will be introduced: input vector control, substrate biasing, and supply gating. Next, a leakage control mechanism activated by a software garbage collector in an embedded architecture will be discussed. Finally, we will show how to tune the garbage collector to optimize energy consumption.
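As a back-of-the-envelope illustration of the garbage-collector end of this (a minimal Python sketch assuming, hypothetically, that the collector can put cache lines holding only garbage into a supply-gated sleep state; all constants are invented, not measurements from the talk):

    # Relative leakage power per cache line in the active and sleep states;
    # a supply-gated line leaks almost nothing but loses its contents,
    # which is acceptable for lines the collector has proven dead.
    ACTIVE_LEAKAGE = 1.0
    SLEEP_LEAKAGE = 0.05

    def leakage_energy(num_lines, dead_fraction, interval):
        """Leakage energy over one GC interval, without and with gating."""
        baseline = num_lines * ACTIVE_LEAKAGE * interval
        live = num_lines * (1 - dead_fraction)
        dead = num_lines * dead_fraction
        gated = (live * ACTIVE_LEAKAGE + dead * SLEEP_LEAKAGE) * interval
        return baseline, gated

    base, gated = leakage_energy(num_lines=4096, dead_fraction=0.4, interval=1.0)
    print(f"leakage saved: {100 * (1 - gated / base):.1f}%")   # 38.0% here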
Biography:
Vijaykrishnan Narayanan is an Assistant Professor of Computer Science and Engineering at The Pennsylvania State University. His research interests are in the areas of Java microarchitectures, energy-efficient systems, and embedded systems.
David Lilja
University of Minnesota
When All Else Fails, Guess: Evaluating Techniques for Speculation, Reuse, and Simulation in High-Performance Computing
Abstract:
Computer architects are aggressively developing novel mechanisms to execute instructions speculatively, that is, before it is known whether or not they should actually be executed, and even before the input values needed by the instructions have been computed. Our speculative multithreading execution model combines compiler-directed thread-level speculation of control dependences with run-time verification of data dependences. This new approach attempts to exploit the fine-grained parallelism available in program constructs that have been resistant to traditional parallelization techniques, such as do-while loops, pointers, and subscripted subscripts. This speculative multithreading model is supported by our superthreaded processor architecture, which is a hybrid of a wide-issue superscalar processor and a multiprocessor-on-a-chip. We also have developed libraries for both C and Java to extend this speculative multithreading to off-the-shelf shared-memory multiprocessors. To reduce execution times of individual threads, we have studied how the values produced by sequences of instructions can be reused. We find that the best reuse unit is somewhere between a compiler-generated basic block and a few consecutive instructions. Finally, we compare two techniques for combating the very long times needed to perform detailed simulations. The reduced input set approach develops new input sets to produce benchmark program behavior that is similar to the program behavior observed when executing with the reference inputs. The sampling approach, on the other hand, reduces simulation time by simulating only a fraction of the original program in detail.
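As a toy illustration of the reuse idea (a Python sketch of memoization, not the hardware reuse buffer from the talk; the keys, structure, and example sequence are assumptions), the outputs of a short instruction sequence can be cached against its live-in values and returned without re-execution on a hit:

    # Maps (sequence id, live-in register values) to live-out values.
    reuse_buffer = {}

    def execute_with_reuse(seq_id, live_ins, compute):
        key = (seq_id, live_ins)
        if key in reuse_buffer:          # reuse hit: skip execution
            return reuse_buffer[key]
        result = compute(*live_ins)      # reuse miss: execute and remember
        reuse_buffer[key] = result
        return result

    # Example: a two-instruction sequence t = a + b; u = t * 3. The second
    # call hits in the buffer and never re-runs the computation.
    seq = lambda a, b: (a + b) * 3
    print(execute_with_reuse("blk7", (2, 5), seq))   # miss: computes 21
    print(execute_with_reuse("blk7", (2, 5), seq))   # hit: returns cached 21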
Biography:
David J. Lilja received the Ph.D. and M.S. degrees, both in Electrical Engineering, from the University of Illinois at Urbana-Champaign, and a B.S. in Computer Engineering from Iowa State University in Ames. He is currently a Professor of Electrical and Computer Engineering, and a Fellow of the Minnesota Supercomputing Institute, at the University of Minnesota in Minneapolis. He also serves as a member of the graduate faculties in Computer Science and Scientific Computation, and was the founding Director of Graduate Studies for Computer Engineering. Previously, he worked as a research assistant at the Center for Supercomputing Research and Development at the University of Illinois, as a development engineer at Tandem Computers Incorporated (now a division of Compaq) in Cupertino, California, and as a visiting researcher in the Hardware Performance Analysis group at IBM. He also was a visiting professor at the University of Western Australia in Perth supported by a Fulbright award. He has served on the program committees of numerous conferences; was a distinguished visitor of the IEEE Computer Society; is a Senior member of the IEEE and a member of the ACM; and is a registered Professional Engineer.
Krste Asanovic
MIT
The MIT SCALE Project
Abstract:
Current embedded processors are simply stripped-down versions of desktop and server architectures. In the future, as the market for embedded information processing grows to dwarf that for desktop and server machines, a reversal of the design cycle can be predicted, in which desktop and server processors will simply be optimized variants of embedded architectures. The SCALE project at MIT is developing a new multi-paradigm VLSI architecture for these future systems. The SCALE architecture provides a single hardware substrate to replace the ad-hoc collections of microprocessors, DSPs, NPUs, FPGAs, and ASICs used to implement embedded systems today. SCALE supports both traditional software-centric parallel computing models (e.g., SMP threads) and hardware-centric computing models (e.g., systolic computing) with a small set of unified silicon abstractions. This talk will provide an overview of current progress within the SCALE project.
Biography:
Krste Asanovic is an assistant professor in the MIT EECS Department and a member of the Laboratory for Computer Science. He received a bachelor's degree in Electrical and Information Sciences from Cambridge University in 1987 and a Ph.D. in Computer Science from the University of California, Berkeley, in 1998. At MIT, he leads the SCALE project, which is investigating new energy-efficient microprocessor designs.
Al Davis
University of Utah
Architectural Support for User-Level I/O
Abstract:
In our increasingly communication-centric world, I/O performance has moved from the "who cares?" position into the limelight as a leading candidate for the system bottleneck culprit. This talk presents some simple architectural enhancements to support user-level I/O that are relatively cheap and simple to implement, and that yield a surprisingly significant performance improvement over the traditional kernel-based approach. The architecture includes simple hardware mechanisms in both the I/O devices and the host processor which support user-level requests, data transfers, and notifications while maintaining the level of protection and programming flexibility of kernel-based architectures.
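The following highly simplified software model (the queue structure and field names are my assumptions, not the proposed hardware interface) conveys the flavor of the approach: the application posts request descriptors into a queue shared with the device and "rings a doorbell" without ever trapping into the kernel:

    from collections import deque

    class UserLevelQueue:
        """Models a request/completion queue pair mapped into user space."""
        def __init__(self):
            self.requests = deque()      # descriptors visible to the device
            self.completions = deque()   # notifications visible to the user

        def post(self, op, buf):
            self.requests.append({"op": op, "buf": buf})  # no system call

        def doorbell(self, device):
            while self.requests:         # the device drains the ring
                desc = self.requests.popleft()
                self.completions.append(device(desc))

    q = UserLevelQueue()
    q.post("write", b"hello")
    q.doorbell(lambda d: ("done", d["op"], len(d["buf"])))
    print(q.completions.popleft())       # ('done', 'write', 5)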
Biography:
Al Davis received his bachelor's degree from MIT in electrical engineering in 1969, and a Ph.D. in computer science from the University of Utah in 1972. His career has been split approximately equally between academia (University of Waterloo, University of Utah) and industry (Burroughs, Fairchild, Schlumberger, Hewlett-Packard, and Intel). He is currently a Professor in the School of Computing at the University of Utah.
His work has traditionally focused on the high-performance computer architecture domain, in particular architectures and systems for highly parallel and scalable systems. He and his colleagues at Burroughs Corporation built the first operational dataflow computer in 1976. He has also been active in the VLSI and asynchronous circuits communities. His current research is on what he calls "power efficient component based architectures," an investigation into development methodologies to support rapid design, manufacture, and deployment of very power-efficient embedded architectures.
Professor Davis has published more than 60 articles in journals and conferences, and holds 11 patents including the initial patents on dataflow architecture and autonomous interconnect routing chips. He has been a guest researcher at institutions in Israel, Germany, and the former USSR and has consulted for many of the major computer manufacturers (DEC, IBM, SUN, Intel, and HP).
David August
Princeton University
Architectural Exploration with Liberty
Abstract:
The same technological advancements which perpetuate exponential growth in the complexity of integrated circuits also pose difficult challenges for the design and design-automation communities. The trend toward increasingly complex programmable devices (embedded ASIPs, EPIC architectures, etc.) drives the need for novel architecture-software design space exploration tools. Today, for lack of these tools, architectural design is often an art form requiring the expertise of a few key designers. The resulting devices are often sub-optimal due to a lack of appropriate software development and simulation tools early in the design cycle. Systematic design of platforms demands a software framework that can keep pace with designers, providing retargetable compiler optimizations that can make best use of architectural features and retargetable simulation tools to measure their effects, all using real benchmarks. This talk will motivate and detail two technologies, Optimization Space Exploration for compilation and the Liberty Simulation Environment for modeling, designed to enable designers and researchers to achieve efficient systematic design space exploration.
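As a skeletal illustration of the exploration loop (the option space and the evaluate() stub below are hypothetical stand-ins, not OSE's actual pruning heuristics), an explorer walks a set of optimization configurations and keeps the one with the best measured estimate:

    import itertools

    # A hypothetical, already-pruned configuration space.
    FLAGS = {
        "unroll": [0, 4, 8],
        "if_convert": [False, True],
        "inline_threshold": [50, 200],
    }

    def evaluate(config):
        """Stub for the compile-simulate-measure step; returns cycles."""
        return (1000 - 10 * config["unroll"]
                - (50 if config["if_convert"] else 0)
                + config["inline_threshold"] // 10)

    def explore():
        keys = list(FLAGS)
        best = None
        for combo in itertools.product(*(FLAGS[k] for k in keys)):
            config = dict(zip(keys, combo))
            cycles = evaluate(config)
            if best is None or cycles < best[0]:
                best = (cycles, config)
        return best

    print(explore())   # cheapest configuration under the stub model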
Biography:
David August is an Assistant Professor in the Department of Computer Science at Princeton University. His research interests lie in computer architecture and back-end compilation. At Princeton, he directs the Liberty Computer Architecture Research Group (http://liberty.princeton.edu). The Liberty group is currently developing open-source tools necessary to perform rigorous, systematic processor design-space exploration. Liberty research is sponsored in part by Intel, The Gigascale Silicon Research Center, an NSF ITR, an IBM Faculty Partnership Award and an NSF CAREER Award. David received his Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2000. At Illinois, as a member of the IMPACT research compiler group, he developed a framework for aggressive predicate analysis and optimization.
Mike Kistler
IBM Austin Research Laboratory
Modeling the Effect of Technology Trends on Soft Error Rate of Combinational Logic
Abstract:
Soft errors are transient faults that occur in VLSI circuits due to external radiation, generally from nuclear decay of packaging materials or atmospheric particles accelerated toward earth by cosmic rays. In this talk I will describe the effects of CMOS technology scaling and microarchitectural trends on the rate of soft errors in memory and logic circuits. We have developed an end-to-end model that allows us to quantify these effects and estimate the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which provide combinational logic with a form of natural resistance to soft errors. We quantify the SER in combinational logic and latches for feature sizes from 600nm to 50nm and clock periods from 16 to 6 fan-out-of-4 (FO4) delays. Our model predicts that the SER of microprocessor logic will increase by eight orders of magnitude over this range of feature sizes and pipeline depths, and by 2011 will be comparable to the SER of microprocessor memory elements without protection mechanisms such as Error Correcting Codes (ECC). This result emphasizes the need for computer system designers to address the risk of soft errors in logic circuits in future designs.
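For intuition about the trend, a widely used empirical form for soft error rate is SER ~ flux x area x exp(-Qcrit/Qs), where Qcrit is the critical charge needed to flip a node and Qs is the charge-collection efficiency; the sketch below plugs in invented charge values (not the parameters of the talk's model) to show why shrinking devices raise SER even as their area falls:

    import math

    def relative_ser(area, q_crit, q_s, flux=1.0):
        """Empirical SER model: flux * sensitive area * exp(-Qcrit/Qs)."""
        return flux * area * math.exp(-q_crit / q_s)

    # Made-up (feature size, Qcrit, Qs) points: area shrinks quadratically,
    # but Qcrit falls faster than the collected charge, so the exponential
    # term dominates and per-device SER rises.
    for feature_nm, q_crit, q_s in [(600, 240.0, 30.0),
                                    (180, 48.0, 12.0),
                                    (50, 1.25, 2.5)]:
        area = (feature_nm / 600.0) ** 2
        print(feature_nm, relative_ser(area, q_crit, q_s))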
Biography:
Mike Kistler is a Senior Software Engineer in the IBM Austin Research Laboratory. He received a Master's degree in Computer Science from Syracuse University in 1990, and is currently a PhD student in the Department of Computer Science at the University of Texas at Austin. His research interests are energy-efficient server design, distributed and cluster computing, and fault tolerance, particularly for large commercial systems such as web application servers.
Brad Calder
University of California, San Diego
Analyzing Large Scale Program Behavior for Assisted Processor Design
Abstract:
Application specific processors offer the potential of rapidly designed logic specifically constructed to meet the performance and area demands of the task at hand. Recently, there have been several major projects that attempt to automate the process of transforming a predetermined processor configuration into a low level description for fabrication. These projects either leave the specification of the processor to the designer, which can be a significant engineering burden, or handle it in a fully automated fashion, which completely removes the designer from the loop. The goal of this research is to provide infrastructure to automatically search the design space for application specific processor design and systems on a chip. The first step towards this goal is to develop automatic techniques that are capable of finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of instructions). At the core of these techniques is a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program; to this end we examine the use of Basic Block Vectors. In this talk we quantify the effectiveness of Basic Block Vectors in capturing program behavior across several different architectural metrics (such as cache hit rates), explore the large scale behavior of several programs, and develop a set of clustering-based algorithms capable of analyzing this behavior and automatically breaking the execution into a set of classes. We then demonstrate an application of this technology to automatically determine where to simulate in a program to help guide application specific processor design.
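A compact sketch of the basic block vector idea (the synthetic trace and the naive k-means below are stand-ins for real profiles and the clustering used in the actual work): count, per fixed-size interval, how often each basic block executes, normalize, and cluster the resulting vectors so that each cluster is one phase of program behavior:

    import random

    def bbv(trace, num_blocks):
        """Normalized basic block execution counts for one interval."""
        v = [0] * num_blocks
        for block in trace:
            v[block] += 1
        return [c / len(trace) for c in v]

    def kmeans(points, k, iters=20):
        centers = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: sum(
                    (a - b) ** 2 for a, b in zip(p, centers[c])))
                clusters[i].append(p)
            centers = [[sum(col) / len(cl) for col in zip(*cl)]
                       if cl else centers[i] for i, cl in enumerate(clusters)]
        return centers

    # Synthetic program with two interleaved phases over 5 basic blocks.
    random.seed(0)
    intervals = [[random.choice(phase) for _ in range(1000)]
                 for phase in ([0, 1, 2], [3, 4], [0, 1, 2])
                 for _ in range(5)]
    vectors = [bbv(t, num_blocks=5) for t in intervals]
    print(kmeans(vectors, k=2))   # two phase centroids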
The second part of this talk will introduce techniques for searching the design space of in-order and VLIW application specific processors. The goal of this framework is to automate certain design tasks and provide early feedback to help the designer navigate their way through the design space. Our approach is to decompose the overall problem of choosing an optimal architecture into a set of sub-problems that are, to the first order, independent. For each sub-problem, we create a model that relates performance to area. From this, we build a constraint system that can be solved using linear-integer programming techniques, and arrive at an optimal parameter selection for all architectural components. Using basic block vectors to narrow down our search space, our approach takes only a few minutes to explore the design space, allowing the designer or compiler to see the potential benefits of optimizations rapidly. We show that the expected performance using our model correlates strongly with detailed pipeline simulations, and examine using our framework to evaluate the tradeoffs of using a customized branch predictor.
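A tiny illustration of the constraint formulation (all numbers below are invented; a real system would hand the constraint system to a linear-integer programming solver, and exhaustive enumeration here is only to keep the sketch self-contained): each component offers a few (area, performance) design points, and we pick one point per component to maximize performance under an area budget:

    import itertools

    # Hypothetical (area units, performance score) points per component.
    COMPONENTS = {
        "icache": [(1, 0.70), (2, 0.85), (4, 0.95)],
        "dcache": [(1, 0.60), (2, 0.80), (4, 0.92)],
        "bpred":  [(1, 0.88), (2, 0.94)],
    }

    def explore(area_budget):
        best = None
        names = list(COMPONENTS)
        for choice in itertools.product(*COMPONENTS.values()):
            area = sum(a for a, _ in choice)
            if area > area_budget:
                continue                  # violates the area constraint
            perf = 1.0
            for _, p in choice:
                perf *= p                 # simple multiplicative model
            if best is None or perf > best[0]:
                best = (perf, dict(zip(names, choice)))
        return best

    print(explore(area_budget=6))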
Biography:
Brad Calder is an Associate Professor of Computer Science and Engineering at the University of California, San Diego. During 2000-2002, he was a founder and Director of Platform Engineering of a desktop distributed computing company called Entropia. Before joining UCSD in January of 1997, he co-founded a startup called Tracepoint, which built performance analysis tools using x86 binary modification technology. In addition, he has worked as a Principal Engineer at Digital Equipment Corporation's Western Research Lab in Palo Alto. His research focuses on the interaction between computer architecture and compilers. Brad Calder received his Ph.D. in Computer Science from the University of Colorado, Boulder in 1995. He obtained a B.S. in Computer Science and a B.S. in Mathematics from the University of Washington in 1991. He is a recipient of an NSF CAREER Award.
Babak Falsafi
Carnegie Mellon University
PUMA: Bridging the Processor/Memory Performance Gap through Memory Access Prediction & Speculation
Abstract:
Increasing processor clock speeds along with microarchitectural innovation have led to a tremendous gap between processor and memory performance. Architects have primarily relied on deeper cache hierarchies, where each level trades off faster lookup speed for larger capacity, to reduce this performance gap. Conventional cache hierarchies employ a demand-fetch memory access model, in which data are fetched into higher levels upon processor requests. Unfortunately, the limited capacity in higher cache levels and the simple data placement mechanisms used in conventional hierarchies often result in high miss rates and reduce performance. The processor/memory performance gap is especially exacerbated in multiprocessor servers, where sharing data may require traversing multiple processor cache hierarchies.
This project proposes PUMA (Proactively Uniform Memory Access) cache hierarchies, in which the memory system relies on prediction/speculation in hardware to "proactively" move data among cache levels in anticipation of a future processor memory reference. In this talk, I will first present the Last-Touch Memory Access (LTMA) model. Unlike the conventional demand fetch/replace access model, LTMA predicts the last reference to a cache block prior to its eviction, and allows early eviction and subsequent prefetching of data to reduce latency in a large spectrum of integer, floating-point, and pointer-intensive applications. Next, I present speculation techniques that allow relaxing memory order without compromising sequential consistency, the most intuitive programming model for shared memory. Our results indicate that using hardware speculation, a sequentially consistent system can achieve the performance of the best relaxed systems.
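A sketch in the spirit of last-touch prediction (the trace-signature encoding and training loop below are simplified guesses at the mechanism, not the exact hardware tables): record the sequence of instruction addresses that touch a block between its fill and its eviction, and predict a last touch when a previously learned trace recurs:

    last_touch_sigs = set()   # trace signatures that ended in an eviction
    open_traces = {}          # block address -> accumulated trace signature

    def signature(sig, pc):
        return hash((sig, pc)) & 0xFFFF   # truncated trace encoding

    def access(block, pc):
        """Returns True if this access is predicted to be the last touch."""
        sig = signature(open_traces.get(block, 0), pc)
        open_traces[block] = sig
        return sig in last_touch_sigs

    def evict(block):
        """On a real eviction, learn the trace that led up to it."""
        sig = open_traces.pop(block, None)
        if sig is not None:
            last_touch_sigs.add(sig)

    access(0x40, pc=0x1000); access(0x40, pc=0x1008); evict(0x40)
    # On the repeated trace, the second access is predicted as a last touch.
    print(access(0x40, pc=0x1000), access(0x40, pc=0x1008))  # False True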
Biography:
Babak Falsafi joined the Electrical and Computer Engineering Department at CMU as an Assistant Professor in January 2001. Prior to joining CMU, he held a position as an Assistant Professor in the School of Electrical and Computer Engineering at Purdue University. His research interests include prediction and speculation in high-performance memory systems, power-aware processor and memory architectures, single-chip multiprocessor/multi-threaded architectures, and analytic and simulation tools for computer system performance evaluation. He has made numerous contributions to the design of distributed shared-memory multiprocessors and memory systems, including a recent result indicating that hardware speculation can bridge the performance gap among memory consistency models, and an adaptive and scalable caching architecture, Reactive NUMA, that lays the foundation for a family of multiprocessors built by Sun Microsystems code-named WildFire. He received an NSF CAREER award in 2000 and an IBM Faculty Partnership Award in 2001. You may contact him at babak@ece.cmu.edu (http://www.ece.cmu.edu/~babak).
Peter Hsu
University of Wisconsin, Madison
The CAVA Architecture and CAVAtools
Abstract:
CAVA is an open source infrastructure to facilitate the design of computer systems-on-a-chip (SoC). When completed, it will comprise an Instruction Set Architecture (ISA), a software tool chain (gcc), an operating system (Linux), and a reference SoC design consisting of synthesizable hardware descriptions of a simple processor core embedded within a chip-multiprocessor system. This talk has two parts. In part one I will describe the CAVA ISA, which is unusual in having 24- and 48-bit instructions. During the development of the CAVA ISA, I created a tool to automate building the GNU compiler tool chain from a very compact description of the ISA. Part two of the talk will be about this tool, which lets you try out a new instruction by re-building the compiler, assembler, and simulator in 15 minutes.
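A toy version of the table-driven idea (the description format below is invented for illustration and is not the CAVA description language): a single table drives both an assembler and a disassembler, so adding an instruction means adding one line and rebuilding the tools:

    # name: (opcode, operand field widths in bits)
    ISA = {
        "add": (0x01, (5, 5, 5)),
        "lw":  (0x02, (5, 5, 9)),
        "jmp": (0x03, (19,)),
    }

    def assemble(name, *operands):
        opcode, fields = ISA[name]
        word = opcode
        for width, value in zip(fields, operands):
            word = (word << width) | (value & ((1 << width) - 1))
        return word

    def disassemble(word, name):
        opcode, fields = ISA[name]
        ops = []
        for width in reversed(fields):
            ops.append(word & ((1 << width) - 1))
            word >>= width
        return name, list(reversed(ops))

    w = assemble("add", 1, 2, 3)
    print(hex(w), disassemble(w, "add"))   # 0x8443 ('add', [1, 2, 3])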
Peter Hsu
University of Wisconsin, Madison
A View of Computer Architecture from the Physical Perspective
Abstract:
The term Computer Architecture in industry covers a tremendously broad area, much like the meaning of "Architecture" with respect to buildings, where the architect's firm subcontracts the entire construction process and is responsible for all material defects. In this talk I will try to convey a glimpse of some areas not traditionally taught in a Computer Science curriculum: topics such as power supply, heat removal, and physical form; the relationship of mechanical tolerances to logical interconnections; and the impact of packaging and scalability on marketing strategies. I will tour the architecture of a hypothetical multichip parallel computer system as a vehicle to illustrate design decisions.
Biography:
Peter Hsu received his Bachelor of Computer Science degree in 1979 from the University of Minnesota, and his Master's and Ph.D. degrees in 1983 and 1985 from the University of Illinois at Urbana-Champaign. He has worked at IBM Research, Sun Microsystems, Silicon Graphics, and the startup company Cydrome, holding various technical and managerial posts including Chief Architect of the MIPS R8000 microprocessor. In 1997 he co-founded ArtX, the design firm for the Nintendo GameCube video game console. In 1999 he joined Toshiba America Electronic Components, where he led the development of a fully-synthesized, fully placed-and-routed superscalar microprocessor. Since 2001 Dr. Hsu has been an independent consultant.