Future systems challenges and opportunities
Abstract:
There is an increasing trend toward "Scale-out" systems that are designed using interconnected, low-cost, modular components. Even on the chip, multiple cores and accelerators will be combined to form computing blocks with potential for superior performance and power/performance. The challenge for the system designer is to achieve this potential for the widest range of applications possible, while approaching or even exceeding the reliability/availability characteristics and manageability of more traditional Scale-up systems. In this talk we discuss how IBM Research is attacking this challenge through innovative modular systems architectures and designs, including the software that provides virtualization, partitioning, and manages a scale-out system as a single entity. The spectrum of scale-out systems will range from game machines to supercomputers, and we are placing an increased focus on optimizing scale-out systems for commercial applications.
Biography:
Dr. Tilak Agerwala is Vice President, Systems at IBM Research. He is responsible for developing the next-generation technologies for IBM’s systems, from microprocessors to commercial systems and supercomputers, as well as novel supercomputing algorithms and applications.
Tilak joined IBM at the T.J. Watson Research Center and has held executive positions at IBM in research, advanced development, development, marketing, and business development. His research interests are in the area of high performance computer architectures and systems.
Tilak received the W. Wallace McDowell Award from the IEEE in 1998 for "outstanding contributions to the development of high performance computers". He is a founding member of the IBM Academy of Technology. He is a Fellow of the Institute of Electrical and Electronics Engineers. He received his B.Tech. in Electrical Engineering from the Indian Institute of Technology, Kanpur, India and his Ph.D. in Electrical Engineering from the Johns Hopkins University, Baltimore, Maryland, USA.
Summarizing Performance Is No Mean Feat
Abstract:
For decades, computer benchmarkers have fought a War of Means, arguing over proper uses of Arithmetic, Harmonic, and Geometric Means, starting in the mid-1980s. One would think this basic issue of computer performance analysis would have been long resolved, but contradictions are still present in some excellent and widely-used textbooks.
This talk offers a framework that resolves these issues and includes both workload analyses and relative performance analyses (such as SPEC or Livermore Loops), emphasizing differences between algebraic and statistical approaches. In some cases, the lognormal distribution is found to be quite useful, especially with appropriate forms of standard deviation and confidence interval used to augment the usual Geometric Mean. Results can be used to indicate the relative importance of careful workload analysis.
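As a rough illustration of the three means and of augmenting the Geometric Mean with a spread measure, the sketch below works in log space, which is the natural setting if the ratios are assumed to be lognormally distributed. The function name `geometric_summary`, the sample ratios, and the choice of a multiplicative standard deviation are assumptions made for this example, not necessarily the forms used in the talk.

```python
import math
import statistics

def geometric_summary(ratios):
    """Return (geometric mean, multiplicative standard deviation) of positive ratios."""
    logs = [math.log(r) for r in ratios]
    mean_log = statistics.fmean(logs)
    sd_log = statistics.stdev(logs)  # sample standard deviation of the logs
    return math.exp(mean_log), math.exp(sd_log)

# Hypothetical performance ratios relative to a reference machine (made-up data).
ratios = [1.0, 2.0, 4.0]

arithmetic = statistics.fmean(ratios)
harmonic = statistics.harmonic_mean(ratios)
gm, gsd = geometric_summary(ratios)

# One multiplicative standard deviation around the geometric mean.
low, high = gm / gsd, gm * gsd

print(f"arithmetic={arithmetic:.3f} harmonic={harmonic:.3f} geometric={gm:.3f}")
print(f"geometric mean x/÷ sd: [{low:.3f}, {high:.3f}]")
```

For positive data that are not all equal, the three means always order as harmonic < geometric < arithmetic, which is one reason the choice of mean can change the ranking of two machines.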
Biography:
Dr. John R. Mashey is a consultant for venture capitalists and technology companies, but has been involved off-and-on in computer performance analysis for 35 years. He is "an ancient UNIX person," having started work on it at Bell Labs in 1973, and continuing to work there for 10 years, including design of the UNIX per-process accounting software. He moved to Silicon Valley in 1983 to join Convergent Technologies, ending as director of software. Mashey joined MIPS Computer Systems in early 1985, managing operating systems development, and helping design the MIPS RISC architecture, as well as specific CPUs, systems and software. He continued similar work at SGI (1992 - 2000), most recently contributing to the design of SGI's NUMAflex modular computer architecture, ending as VP and Chief Scientist.
Mashey was one of the founders of the SPEC benchmarking group, was an ACM National Lecturer for four years, has been guest editor for IEEE Micro, and one of the long-time organizers of the Hot Chips conferences. Additionally, he has chaired technical conferences on operating systems and CPU chips, and has given more than 500 public talks on software engineering, RISC design, performance benchmarking and supercomputing. He is a Trustee of the Computer History Museum.
He holds a Ph.D. in Computer Science from Pennsylvania State University.
Algorithms and Architectures for Millisecond-Scale Molecular Dynamics Simulations of Proteins
Abstract:
Some of the most important outstanding questions in the fields of biology, chemistry, and medicine remain unsolved as a result of our limited understanding of the structure, behavior and interaction of biologically significant molecules. The underlying laws of physics that determine the form and function of these biomolecules are relatively well understood. Current technology, however, does not allow us to simulate the effect of these laws with sufficient accuracy, and for a sufficient period of time, to answer many of the questions that biologists, biochemists, and biomedical researchers are most anxious to answer. This talk will summarize the current state of the art in biomolecular modeling based on all-atom molecular dynamics simulations in explicitly modeled solvent, and will describe efforts within our own lab to develop novel algorithms and machine architectures to accelerate such simulations by several orders of magnitude. To the extent these efforts prove successful, we hope to make it possible for the first time for researchers to perform millisecond-scale simulations of such phenomena as the process of protein folding, the binding of proteins to pharmaceutical compounds and endogenous ligands, the interaction of proteins with other proteins and with nucleic acids, the mechanisms of allosteric activation and inhibition, and perhaps the operation of certain simple intracellular machines.
Biography:
David E. Shaw serves as chairman of the D. E. Shaw group of companies, and as a senior research fellow at the Center for Computational Biology and Bioinformatics at Columbia University. He received his Ph.D. from Stanford University in 1980 and served on the faculty of the Computer Science Department at Columbia University before turning his attention to the emerging field of computational finance in 1986. In 1988, he founded the D. E. Shaw group, a specialized investment and technology development firm with approximately $17 billion in assets whose activities center on various aspects of the intersection between technology and finance. While he continues to serve as chairman of the top-level parent companies of the D. E. Shaw group, Dr. Shaw is now spending the great majority of his time on hands-on scientific research in the field of computational biochemistry. In this capacity, he leads a small research group (currently consisting of 30 computational chemists and biologists, computer scientists and applied mathematicians, and computer architects and engineers) within an independent laboratory at D. E. Shaw Research and Development, LLC. The group is currently focusing on the development of novel physical models, computational algorithms and machine architectures for protein structure determination, computational drug design, and biomolecular simulation.
In 1994, Dr. Shaw was appointed by President Clinton to the President's Committee of Advisors on Science and Technology, in which capacity he served as chairman of the Panel on Educational Technology. He currently serves as treasurer of the American Association for the Advancement of Science and as a member of its board of directors, as co-chairman of the steering committee of the High-Performance Computing Initiative (a DoD/DoE-sponsored program conducted under the auspices of the Council on Competitiveness), and as a member of the external advisory group of the Sloan-Kettering Institute. He also serves as chairman of the Schrödinger companies, which produce computational chemistry software used for drug discovery, and of the top-level parent company of Attenuon, LLC, a drug discovery and development firm focusing on cancer therapeutics. Dr. Shaw is the author of 77 scholarly publications, and was named a fellow of the New York Academy of Sciences in 1999.
What is the Next Architectural Hurdle (after power density) as Technology Scaling Slows?
Abstract:
As lithography scales past the 90 nanometer node, it is clear that classical (Dennard) device performance scaling is slowing. Further, we are beginning to see the end of frequency scaling, as power and power density impose physical limits (and cost limits) on systems. However, architectural scaling (in the forms of multicore and multithreading), and systems/software scaling (in the form of virtualization technology) will put even more pressure on systems design than frequency scaling ever did. At present, there is a huge focus on coping with power at the microarchitectural level, and there is a backlash against deep pipelining and ILP-intensive microarchitectures. The next major hurdles - after power - in systems design will be improving the on-chip and off-chip bandwidths, and increasing the on-chip storage density. I will explain why this is true, and will posit the next steps in technology innovation required to deal with these hurdles. I will also explain why these hurdles are roughly equivalent.
Biography:
Dr. Philip Emma studied under Professor Davidson at the University of Illinois, and then joined the IBM T.J. Watson Research Center. He worked under von Neumann's chief engineer for 10 years in the areas of architecture, microarchitecture, and systems design. He then worked with the Systems group as a design leader on the first IBM CMOS mainframes, and defined and led the design of the RAS checking and recovery features in the G4 and G5 processors - which made these the most reliable commercial processors in the industry. He then led an exploratory circuit-design group, and then an exploratory technology group where he worked on novel memory technologies, high-speed circuit design, advanced packaging, optics, and interconnect technology. These groups did the earliest work on low-power microarchitecture in IBM - in the early 1990s. He is now the head of the Systems Technology and Microarchitecture group, working on advanced microarchitecture for the IBM server brands. His group also does much of the early concept work for games, for example, Cell. He holds over 100 patents in various fields, and was named an IBM Master Inventor. He is a member of the IBM Academy of Technology, and is a Fellow of the IEEE.
Ron Barnes
George Mason University
"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense
Abstract:
As microprocessor designs become increasingly power- and complexity-conscious, future microarchitectures must decrease their reliance on expensive dynamic scheduling structures. While compilers have generally proven adept at planning useful static instruction-level parallelism, relying solely on the compiler's instruction execution arrangement performs poorly when cache misses occur, because variable latency is not well tolerated. This talk presents the multipass "flea-flicker" microarchitectural approach that exploits meticulous compile-time scheduling on simple in-order hardware while achieving excellent cache miss tolerance through persistent advance preexecution beyond otherwise stalled instructions.
Multipass pipelining builds upon the previously published two-pass pipelining technique presented at MICRO-36. Instructions following a stalled instruction make multiple passes through a multipass pipeline. Each pass increases the speed and energy efficiency of subsequent ones by initiating independent memory accesses and preserving precomputed results. Simulation results show that multipass pipelining achieves the majority of the cache-miss tolerance of an aggressive out-of-order implementation at a fraction of the power overhead.
Biography:
Ronald Barnes is an Assistant Professor in the Department of Electrical and Computer Engineering at George Mason University. In May 2005, Ron received his PhD in Electrical Engineering at the University of Illinois while working with Professor Wen-mei Hwu in the IMPACT research group. Before his time at the University of Illinois, Ron graduated summa cum laude with a B.S. from the University of Oklahoma in 1998. Ron has published in several journals, symposia and workshops on topics including complexity-effective microarchitectures, program-phase detection, and binary and run-time optimization. His research interests additionally include compiler and feedback-directed optimization techniques for multi-core architectures and other novel architectures.
Iron Laws for Multi-Core Scalability
Abstract:
The current multi-core trend is the new rage in the industry. All major microprocessor companies have introduced or will soon introduce products containing multiple cores on a single die. It seems that scaling the number of cores has replaced the scaling of clock frequency as the primary design objective for architects. How far can we go in scaling the number of cores in the coming decade? What are the foundational principles, or "iron laws," that actually govern the scaling of multi-core machines? What are the fundamental forces that might conspire against multi-core scalability? This talk will attempt to formulate a clean framework to reason clearly about multi-core scalability issues. Iron laws on multi-core scalability with respect to architecture, algorithm, and power scaling, will be presented and illustrated with experimental data from MRL's research projects. Speculation on future multi-core designs and promising research directions will also be divulged.
Biography:
John P. Shen is the director of the Microarchitecture Research Lab (MRL) at Intel. Prior to joining Intel in 2000 he spent quite a few years as a Professor in the ECE Department of Carnegie Mellon University. He supervised a total of 17 PhD students during his years at CMU, and published over 100 papers. He recently published the book "Modern Processor Design: Fundamentals of Superscalar Processors" with McGraw-Hill. He is a Fellow of the IEEE and has received multiple teaching awards while at CMU. Currently he is enjoying his new job in the "real world."
Pauseless GC in the Azul JVM
Abstract:
Garbage Collection has been around for decades, and for nearly as long people have been complaining about GC pauses. Making a fast and concurrent (i.e., low-pause) GC that is also effective and non-fragmenting (i.e., precise and relocating) is fiendishly difficult. Despite decades of GC research and years of diligent engineering, no mainstream JVM has a concurrent collector as the default GC. Yet all that is lacking is a simple hardware read-barrier, an instruction executed by regular Java threads to enforce the GC invariants. With a hardware read-barrier, Azul has built a fast, low-pause, precise, relocating and, above all, simple GC. This talk discusses Azul's GC algorithm and hardware, and shows its effectiveness on a variety of benchmarks.
Biography:
With more than twenty-five years of experience developing compilers, Cliff Click serves as Azul Systems' Chief JVM Architect. Cliff joined Azul in 2002 from Sun Microsystems where he was the architect and lead developer of the HotSpot Server Compiler, a technology that has delivered dramatic improvements in Java performance since its inception. Previously he was with Motorola where he helped deliver industry-leading SpecInt2000 scores on PowerPC chips, and before that he researched compiler technology at HP Labs. Cliff has been writing optimizing compilers and JITs for over 20 years. He is invited to speak regularly at industry and academic conferences including JavaOne, JVM'04, and VEE'05 and has published many papers about HotSpot technology. Cliff holds a PhD in Computer Science from Rice University.
AMD's Pacifica Technology: Silicon Enhancements for Efficient Virtualization
Abstract:
Virtualization is changing the way IT shops think about deploying server and client technology, leveraging partitioning to implement new models of security, manageability, and workload flexibility. Unmodified, the x86 architecture is intrinsically difficult to virtualize. Hypervisor software today must use heroic acts to implement virtualization on unassisted x86 processors, resorting to such tricks as ring compression and dynamic instruction rewriting. This leads to solutions that are unnecessarily complex and performance limited.
This presentation describes fundamental changes to the x86 architecture, as integrated with AMD64 technology, designed to provide silicon assistance to allow for efficient virtualization without overly complex virtualization software. AMD's "Pacifica" virtualization architecture is described, including the instruction and register intercept model, new instructions and new processor data structures. Processor mechanisms for virtualizing interrupts are described, as is the virtual memory subsystem tuning. Security vulnerabilities exist in software-only virtualized environments, and modifications to the memory controller to address those vulnerabilities are discussed. "Pacifica" virtualization architecture, tightly coupled with AMD's high performance multicore technology, drives higher performing, less complex, and more secure virtualization platforms.
Biography:
Kevin J. McGrath is a Fellow at AMD, California Microprocessor Division. He is the architect of AMD's 64-bit extensions and currently manages the AMD64 architecture and RTL team. McGrath's work experience includes 20 years in CPU design and verification, first for Hewlett-Packard and later for ELXSI. His career eventually brought him to AMD, where he has worked on the microarchitecture of the Nx586, AMD-K6, and AMD Athlon processors, leading the microcode team for those projects. McGrath has a BS in engineering technology from California Polytechnic University, San Luis Obispo.
Wen-mei Hwu
University of Illinois, Urbana-Champaign
Towards Ultra-Efficient Computing Platforms
Abstract:
For almost two decades, the IMPACT research group at the University of Illinois has been working towards architectural models and compiler techniques for ultra-efficient computing platforms. The first phase of this effort involves a combination of IMPACT compile-time parallelization, ROAR dynamic optimization, and Flea-Flicker multipass pipelining to exploit instruction-level parallelism and tolerate cache-miss latency to achieve high performance, low power execution of applications. This phase has resulted in numerous publications and contributed to the Intel IPF compilers and processors. I will begin the talk by summarizing some important lessons we learned in the process, many derived from experiments using real hardware.
More recently, due to stringent power constraints, semiconductor computing platforms are converging to a model that consists of multiple processor cores and domain-specific hardware accelerators. From the hardware perspective, these platforms provide great opportunities in further advancement of performance and power efficiency. The primary obstacle to the success of this model is the level of difficulty in programming and compiling for these platforms. Although high-performance FORTRAN compilers and the OpenMP API provide a strong technology base for the task, major roadblocks remain in compiling for popular implementation languages such as C/C++. I argue that the most exciting applications that will drive the continued performance scaling of semiconductor computing platforms are those that reflect some aspects of the physical world: sound, images, chemical reactions, and physical forces. These applications often have abundant parallelism in their algorithms. The parallelism is, however, obscured by the implementation process.
In the second part of the talk, I will review some key lessons from our study of MPEG-4 (video), JPEG (image), and LAME (sound) based on the second generation IMPACT compiler technology. These applications exhibit several characteristics that distinguish them from previously studied scientific FORTRAN applications: extensive use of pointers in memory accesses, command line options that impose diverse constraints on code transformations, dynamic memory allocation that obscures targets of memory accesses, interprocedural value flow that makes loop analysis more difficult, and multiple layers of function calls with complex control flow involved in the desired units of parallel execution. I will outline research efforts in the GSRC Soft Systems thrust to develop scalable, deep program analysis and deep code transformation techniques as well as domain specific programming models to overcome these obstacles. I will also describe related GSRC work on Linux enhancements that enable seamless integration of hardware accelerators into the Linux software stack for use by driving applications.
Biography:
Wen-mei W. Hwu is the Sanders-AMD Endowed Chair Professor at the Department of
Electrical and Computer Engineering, University of Illinois at Urbana-Champaign.
His research interest is in the area of architecture, implementation, and
software for high performance computer systems. He is the director of the
IMPACT lab
Soft Errors in Microprocessors
Abstract:
With each technology generation, we are experiencing an increased rate
of cosmically-induced soft errors in our chips. In the past, the
impact of such errors could be minimized through protection of large
memory structures. Unfortunately, such techniques alone are becoming
insufficient to maintain adequately low error rates. Although, to a
very rough approximation, the fault rate per transistor is not
changing much, the increasing number of transistors is resulting in an
ever increasing raw rate of bit upsets. Thus, we are starting to see a
dark side to Moore's Law in which the increased functionality we get
with our exponentially increasing number of transistors is being
countered with an exponentially increasing soft error rate. This will
take increasing effort and cost to cope with.
In this talk I will try to describe the severity of the soft error
problem as well as a technique we have developed to estimate a
processor's soft error rate. A key aspect of our soft error analysis
technique is that it can be applied early in the design process either
via various analytic calculations or through the use of a performance
model. Therefore, these estimates are available early enough to help
designers choose appropriate error protection schemes for various
structures within a microprocessor.
At the heart of our technique is the observation that some single-bit
faults (such as any occurring in the branch predictor tables) will not
produce an error in a program's output. We generalize this observation
to define a structure's architectural vulnerability factor (AVF) as
the probability that a fault in that particular structure will result
in an error in the final output of a program. Thus, the AVF for all
the structures in a chip along with some process and circuit factors
can be used to predict the failure rate for the chip. In this talk I
will review our technique for calculating AVF and show results for some
specific structures.
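As a back-of-the-envelope illustration of how per-structure AVFs might feed a chip-level estimate, the sketch below derates each structure's raw upset rate by its AVF and sums the results. The product-and-sum form, the field names, and every number are assumptions made for this example; the abstract does not give the actual formula or any measured values. The branch predictor is given an AVF of zero to reflect the observation that faults there do not affect program output.

```python
def chip_soft_error_rate(structures):
    """Sum each structure's raw upset rate (bits * rate per bit) derated by its AVF."""
    return sum(s["bits"] * s["raw_fit_per_bit"] * s["avf"] for s in structures)

# Hypothetical structures and made-up numbers, for illustration only.
# raw_fit_per_bit stands in for the "process and circuit factors".
structures = [
    {"name": "branch_predictor", "bits": 32768, "raw_fit_per_bit": 0.001, "avf": 0.0},
    {"name": "instruction_queue", "bits": 4096, "raw_fit_per_bit": 0.001, "avf": 0.25},
]

total = chip_soft_error_rate(structures)
print(f"estimated soft error rate: {total:.3f} FIT")
```

Under these assumptions the larger structure contributes nothing to the total, which shows why raw bit counts alone can overstate where protection is needed.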
Biography:
Dr. Joel S. Emer is an Intel Fellow working in the Intel Architecture
Group, where he is director of micro-architecture research. Before
joining Intel he spent 22 years as a Digital/Compaq employee, where he
worked on processor architecture, performance analysis and performance
modeling methodologies for a number of VAX and Alpha CPUs. He is widely
recognized for his architecture contributions, including pioneering
efforts in simultaneous multithreading, and for his seminal work on
the now pervasive quantitative approach to processor evaluation. He
also has researched heterogeneous distributed systems and networked
file systems at DEC and during a three year sabbatical at MIT. His
current research interests include processor reliability,
multithreaded processor organizations, techniques for increased
instruction level parallelism, pipeline organization, instruction and
data cache organizations, branch prediction schemes, and performance
modeling. Dr. Emer holds a Ph.D. in Electrical Engineering from the
University of Illinois, and M.S.E.E. and B.S.E.E. degrees from Purdue
University. He is also a Fellow of the ACM and IEEE.
Todd Austin
Better Than Worst-Case Design
Abstract:
This talk introduces the audience to a novel design strategy, called "Better
Than Worst-Case" design, which couples complex design components with simple
reliable checkers. By delegating the responsibility of correctness and
reliability to a checker, it becomes possible to quickly build designs that
are capable of tolerating a variety of potential design disasters, ranging from
functional design errors to circuit timing bugs. Two example designs are
presented: DIVA and Razor. In addition, a complementary design technique,
called typical-case optimization (TCO), is introduced as a way to take
advantage of the relaxed design constraints on the core component.
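The idea of delegating correctness to a checker can be sketched in software as follows. This is only a toy analogy, not DIVA or Razor: the names `checked_execute`, `buggy_fast_add`, and `simple_add` are hypothetical, and the "design error" in the core is contrived for the example.

```python
def checked_execute(core, checker, operands):
    """Run the complex core, then let the simple checker verify and, if needed, correct it.

    Returns (result, corrected), where corrected is True if the checker overrode the core.
    """
    candidate = core(*operands)
    trusted = checker(*operands)
    if candidate != trusted:
        return trusted, True  # error detected; the checker's result wins
    return candidate, False

def buggy_fast_add(a, b):
    # Hypothetical core with a functional design error: it drops the carry out of 8 bits.
    return (a + b) & 0xFF

def simple_add(a, b):
    # Hypothetical checker: simple and assumed reliable.
    return a + b

print(checked_execute(buggy_fast_add, simple_add, (1, 2)))      # core is right here
print(checked_execute(buggy_fast_add, simple_add, (200, 100)))  # checker corrects the core
```

Because the checker guarantees the final result, the core in this sketch is free to be wrong in rare cases, which is the relaxed constraint that typical-case optimization is meant to exploit.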
The talk is intended for architects, designers and CAD engineers interested in
novel techniques for robust design, their potential applications, and their
implications to system design and CAD tools.
Biography:
Todd Austin is an Associate Professor of Electrical Engineering and Computer
Science at the University of Michigan in Ann Arbor. His research interests
include computer architecture, compilers, computer system verification, and
performance analysis tools and techniques. Prior to joining academia, Todd was
a Senior Computer Architect in Intel's Microcomputer Research Labs, a
product-oriented research laboratory in Hillsboro, Oregon. Todd is the first to
take credit (but the last to accept blame) for creating the SimpleScalar Tool
Set, a popular collection of computer architecture performance analysis tools.
In addition to his work in academia, Todd is founder and President of
SimpleScalar LLC and co-founder and Chief Technology Officer of BitRaker Inc
and InTempo Design LLC. Todd received his Ph.D. in Computer Science from the
University of Wisconsin in 1996.
Babak Falsafi
Temporal Memory Streaming
Abstract:
Device scaling in processor fabrication technologies along
with microarchitectural innovation has led to a tremendous gap
between processor and memory performance. While architects have primarily
relied on deeper cache hierarchies to reduce this performance gap,
the limited capacity in higher cache levels and simple data
placement/eviction policies have resulted in diminishing returns for
commercial workloads with large memory footprints and adverse access
patterns. Moreover, proposals to bridge the gap with runahead execution or
larger instruction windows do not benefit workloads with little inherent
memory-level parallelism such as transaction processing on database or web
servers.
The STEMS (Spatio-TEmporal Memory Streaming) project at Carnegie Mellon
is exploring memory system designs that exploit repetitive spatial and
temporal correlation among memory accesses and construct (in hardware or
software) data streams that can be moved and managed together through the
memory hierarchy to hide the long access latencies. In this talk, I will
present two recent results showing that: (a) a large fraction of memory
accesses in desktop and server workloads are temporally correlated, and (b)
such correlation can be captured with practical hardware designs to hide the
memory latency.
Biography:
Babak Falsafi is an Associate Professor in the Department of
Electrical and Computer Engineering at Carnegie Mellon University. He
leads the Impetus group (http://www.ece.cmu.edu/~impetus) targeting
research on architectures to break the memory wall, architectural support
for gigascale integration, and analytic and simulation tools for computer
system performance evaluation. He has made a number of contributions
to computer system design and evaluation including a recent
result indicating that hardware speculation can bridge the performance
gap among memory consistency models. He is a recipient of an NSF
CAREER award in 2000 and IBM Faculty Partnership Awards in 2001, 2003
and 2004, and an Alfred P. Sloan Research Fellowship in 2004. You may
contact him at babak@cmu.edu (http://www.ece.cmu.edu/~babak). He is a
member of IEEE and ACM.
The University of Michigan, Ann Arbor
Carnegie Mellon University