For many of these studies, keep in mind technology trends. How can you
use 1 billion transistors to improve performance? How do we get good
performance when a memory access costs 10,000 instruction times? How do we
deal with new Web, Object Oriented, or Multimedia workloads? How do we
get good I/O performance when disks are getting slower and slower relative
to CPUs?
Some of these project ideas have been suggested by members of the
faculty. If you decide to work on them, you may want to talk to those
faculty to get a sense of where they wanted to go...
-
Select a paper that interests you from a recent ASPLOS or ISCA
proceedings. Construct a simulator that will allow you to reproduce
their main results and validate your simulator using their workload or
a similar one. Are there any major assumptions the authors didn't
mention in the paper? Use your simulator to evaluate their technique
under a new workload or improve their technique and quantify your
improvements.
-
Where does the time go? At the last SOSP conference, there were two
papers that purported to tell us "where does the time go" by
periodically interrupting the CPU to see which instruction it was
executing and thereby develop a long-term profile of CPU
execution. Unfortunately, this methodology only measures where the CPU
spends its time. It does not tell you when a process is stalling for
I/O. As many of us have experienced, it is common to
spend as much time waiting for I/O as waiting for the CPU, and
technology trends suggest that I/O delays will become more dominant in
the future. Use system call tracing to determine how much time
processes spend waiting for I/O and what types of I/O are causing
delays. By keeping track of which threads are active, it should be
possible to determine whether it is I/O or processing that causes
users to wait and to further determine which I/O operations are
responsible. Talk to Mike if you are interested in this project. I've
got an initial prototype system for system call tracing.
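As a rough starting point, here is a minimal sketch (assuming a Unix-like
system and Python) that splits a child process's wall-clock time into CPU
time and everything else, most of which is I/O wait and scheduling delay;
the default command is just a placeholder. Attributing the waiting time to
specific system calls would need real tracing, such as the prototype
mentioned above.

    # Minimal sketch: run a command and split its wall-clock time into CPU
    # time and "everything else" (mostly I/O wait and scheduling delay).
    # Assumes a Unix-like system; the default command is only a placeholder.
    import os, sys, time

    def profile_command(argv):
        t0 = time.time()
        before = os.times()
        pid = os.fork()
        if pid == 0:                       # child: exec the workload
            os.execvp(argv[0], argv)
        os.waitpid(pid, 0)
        after = os.times()
        wall = time.time() - t0
        cpu = ((after.children_user - before.children_user) +
               (after.children_system - before.children_system))
        print("wall: %.3fs  CPU: %.3fs  waiting (I/O etc.): %.3fs"
              % (wall, cpu, max(0.0, wall - cpu)))

    if __name__ == "__main__":
        profile_command(sys.argv[1:] or ["ls", "-lR", "/usr"])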
- As CPU cache miss times approach thousands of cycles, it seems
likely that, while a miss is being serviced, the processor
could execute a cache-replacement-optimization program "in the
background" without slowing down any unblocked data-flows of execution
(Yale Patt at Michigan calls this sort of optimization code
"micro-threads".) This project has two parts.
First, estimate an upper bound on the
performance that could be gained as follows: simulate a k-way
associative cache where each cache set uses random, FIFO, LRU, and OPT
replacement. Current caches use k = 1 to 8 and one of the simple
replacement policies, and the best your system could do would be to
approximate a fully-associative cache with OPT
replacement. The gap between those two cases is a reasonable upper
bound on the benefits this scheme could achieve. Also, this experiment
will tell you what level of associativity and replacement policy to aim
for in your design. You may want to run this experiment for L1, L2,
and L3 caches to see where to focus your efforts. Second, design a cache
microarchitecture that would allow for more sophisticated replacement
policies. My intuition is that it will be important to make sure your
design does not slow down hits or the time it takes to issue the miss
request to memory, but it can probably burn a lot of cycles deciding
which current cache line to replace when that data comes back, or
moving data between different cache entries.
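A minimal sketch of the first experiment, assuming Python, a stand-in
trace, and made-up cache geometry (64-byte lines, 256 sets); a real study
would replay traces taken from a simulator or from instrumented programs:

    import random
    from collections import defaultdict

    LINE = 64          # bytes per cache line (assumed)
    SETS = 256         # number of sets (assumed)

    def simulate(trace, ways, policy):
        lines = [a // LINE for a in trace]
        # For OPT: next position at which each line is referenced again.
        next_use = [0] * len(lines)
        last_seen = {}
        for i in range(len(lines) - 1, -1, -1):
            next_use[i] = last_seen.get(lines[i], float("inf"))
            last_seen[lines[i]] = i
        sets = defaultdict(list)   # set index -> resident lines, oldest first
        future = {}                # line -> position of its next reference
        misses = 0
        for i, line in enumerate(lines):
            entries = sets[line % SETS]
            if line in entries:
                if policy == "LRU":
                    entries.remove(line)
                    entries.append(line)      # move to MRU position
            else:
                misses += 1
                if len(entries) >= ways:
                    if policy == "RANDOM":
                        victim = random.randrange(len(entries))
                    elif policy in ("FIFO", "LRU"):
                        victim = 0            # oldest / least recently used
                    else:                     # OPT: farthest next reference
                        victim = max(range(len(entries)),
                                     key=lambda j: future.get(entries[j], float("inf")))
                    entries.pop(victim)
                entries.append(line)
            future[line] = next_use[i]
        return misses

    if __name__ == "__main__":
        trace = [random.randrange(1 << 20) for _ in range(200000)]  # stand-in trace
        for ways in (1, 2, 4, 8):
            for policy in ("RANDOM", "FIFO", "LRU", "OPT"):
                print("%d-way %-6s misses: %d"
                      % (ways, policy, simulate(trace, ways, policy)))

Running the same trace at several associativities and comparing OPT against
the simple policies gives the gap described above.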
- Along with the number of transistors, the complexity of
microprocessor architectures continues to grow exponentially, with very
complex out-of-order processors being de rigueur. It is still not
readily apparent how much performance is really being delivered to
applications compared to simpler in-order designs. On a spectrum of
benchmarks, quantify (through simulation; SimpleScalar is suggested) the
performance difference between an out-of-order processor and a simpler
in-order processor, taking into account not only CPI, but also clock
rate and power consumption. Faculty expert: Dr. Keckler
- DRAMs are highly optimized for accesses that exhibit locality.
Examine a memory interface architecture that reorders memory accesses
to better exploit the column, page, and pipeline modes of modern DRAM
implementations. Faculty expert: Dr. Keckler
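For illustration only, a toy open-page scheduler in Python: it reorders a
window of pending requests so accesses to the currently open row are served
first. The row size and cycle costs are invented numbers, not real DRAM
timing parameters.

    ROW_BYTES = 2048                      # assumed row size
    ROW_HIT_CYCLES, ROW_MISS_CYCLES = 4, 12   # assumed costs

    def service_time(requests, reorder):
        open_row, cycles = None, 0
        pending = list(requests)
        while pending:
            if reorder:
                # Prefer a request that hits the currently open row, if any.
                hits = [r for r in pending if r // ROW_BYTES == open_row]
                req = hits[0] if hits else pending[0]
            else:
                req = pending[0]          # strict arrival order
            pending.remove(req)
            row = req // ROW_BYTES
            cycles += ROW_HIT_CYCLES if row == open_row else ROW_MISS_CYCLES
            open_row = row
        return cycles

    if __name__ == "__main__":
        import random
        reqs = [random.choice([0x1000, 0x9000]) + 64 * random.randrange(16)
                for _ in range(1000)]
        print("in order:", service_time(reqs, False),
              "reordered:", service_time(reqs, True))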
- Select an embedded application (such as interactive multimedia) and
design and evaluate an architecture that executes it in a mobile
environment. Address issues of functionality, performance (or at
least providing the illusion of sufficient performance), and power
consumption. Faculty expert: Dr. Keckler
- Compare alternatives of embedding processing power in a DRAM chip
(i.e., reconfigurable logic vs. a highly custom processor vs. hardwired
logic for a given application) on a suite of data intensive and
computationally demanding benchmarks. Faculty expert: Dr. Keckler
- Characterize the benefits and costs of value prediction vs.
other predictive techniques, such as instruction reuse. In the
best cases, what is the maximum performance benefit? Faculty expert:
Dr. Keckler
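As one concrete baseline, a last-value predictor is just a table indexed by
load PC; the sketch below (Python, synthetic trace) measures how often the
value last seen at a PC recurs. Real traces from a simulator would replace
the toy input.

    # Sketch of the simplest value predictor: a table, indexed by load PC,
    # that predicts the value last seen at that PC.  The trace is synthetic.
    def last_value_accuracy(trace, table_size=4096):
        table, correct = {}, 0
        for pc, value in trace:
            idx = pc % table_size
            if table.get(idx) == value:
                correct += 1
            table[idx] = value
        return correct / float(len(trace))

    if __name__ == "__main__":
        import random
        trace = []
        for _ in range(100000):
            pc = random.randrange(256) * 4
            # Toy behavior: low PCs always load a constant, the rest are random.
            value = 7 if pc < 512 else random.randrange(1 << 16)
            trace.append((pc, value))
        print("last-value prediction accuracy: %.2f" % last_value_accuracy(trace))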
- Compare the performance of a deep cache hierarchy (multiple levels)
vs. a flatter organization (only one level) on a family
of scientific and data intensive applications. Devise strategies
to get the benefits of both. Faculty expert: Dr. Keckler
- In large, out-of-order cores, loads have to be held back when an earlier store's
address is unknown (because it might be the same). Dependence prediction guesses
which load/store pairs are going to have dependences, and which aren't. These
predictors have also been used to communicate values from stores to loads and
do prefetching. Lots of interesting stuff here! Faculty expert: Dr. Burger.
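One very simple predictor in this family sets a sticky "wait" bit for any
load that has ever been caught depending on an unresolved store; the sketch
below (Python, synthetic trace) counts the violations it avoids and the safe
loads it needlessly delays. The trace format is an assumption.

    # Minimal sketch of a simple dependence predictor: a load that has ever
    # been caught depending on an earlier, unresolved store is predicted to
    # wait from then on.  Trace entries are (load PC, did-it-conflict).
    def evaluate(trace):
        wait_bit = set()                 # load PCs predicted to wait
        violations = false_waits = 0
        for load_pc, conflicts in trace:
            predicted_wait = load_pc in wait_bit
            if conflicts and not predicted_wait:
                violations += 1          # speculated past a store it needed
                wait_bit.add(load_pc)
            elif predicted_wait and not conflicts:
                false_waits += 1         # delayed a load that was safe
        return violations, false_waits

    if __name__ == "__main__":
        import random
        trace = []
        for _ in range(50000):
            pc = random.randrange(64)
            # Toy behavior: a few "hot" loads conflict often, the rest rarely.
            trace.append((pc, random.random() < (0.6 if pc < 8 else 0.01)))
        print("violations, false waits:", evaluate(trace))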
- Because of wire delays and register file bandwidth, processor designers have
started looking at (and building, cf. the Alpha 21264) clusters, in which groups of
functional units are associated with separate register files within a core.
How to schedule work on these, and their implications for future architectures,
is a hot topic. Faculty expert: Dr. Burger.
- Simultaneous Multithreaded Processors (SMT) run multiple tasks in an out-of-order
core at the same time, sharing the dynamic resources (physical registers,
issue slots, cache pipes). Experiments on how resource usage conflicts
arise in the different shared resources under different combinations of
workloads would be interesting (there is a lot of work going on in this
area, so a
literature search would be crucial). Faculty expert: Dr. Burger.
- When multiple threads are running in an SMT core, how many extra cache misses
are caused by the intersections of the threads' working sets? Quantifying this
for different workload combinations was the project I had in mind. Faculty expert: Dr. Burger.
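A minimal way to get at this number, sketched in Python with synthetic
traces and an assumed cache geometry: run each thread's trace through the
cache alone, then interleaved, and count the extra misses.

    from collections import defaultdict

    def misses(trace, sets=512, ways=4, line=64):
        cache = defaultdict(list)        # set index -> resident lines, LRU order
        count = 0
        for addr in trace:
            tag = addr // line
            s = tag % sets
            if tag in cache[s]:
                cache[s].remove(tag)
            else:
                count += 1
                if len(cache[s]) >= ways:
                    cache[s].pop(0)      # evict least recently used
            cache[s].append(tag)
        return count

    def interleave(a, b):
        out = []
        for x, y in zip(a, b):
            out += [x, y]
        return out

    if __name__ == "__main__":
        import random
        thread_a = [random.randrange(1 << 18) for _ in range(100000)]
        thread_b = [random.randrange(1 << 18) for _ in range(100000)]
        alone = misses(thread_a) + misses(thread_b)
        shared = misses(interleave(thread_a, thread_b))
        print("extra misses caused by sharing the cache:", shared - alone)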
- When a number of instructions waiting to execute in an out-of-order core are
ready to go, but there are too many for (a) the issue width or (b) for the
particular functional unit types available to issue in a single cycle, the
hardware must choose among them. Oldest-first is the usual strategy. Other
selection algorithms may be better. It would be interesting to try a few.
Faculty expert: Dr. Burger.
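A toy version of that experiment, in Python with an invented random
dependence DAG: issue it at a fixed width under oldest-first and under a
"most successors first" policy and compare total cycles. Everything here
(DAG shape, unit latencies, the alternative policy) is an assumption for
illustration.

    import random

    def make_dag(n, fanin=2):
        # Each instruction depends on 0..fanin earlier instructions.
        return [set(random.sample(range(i), min(i, random.randrange(fanin + 1))))
                for i in range(n)]

    def run(preds, width, policy):
        n = len(preds)
        succs = [[] for _ in range(n)]
        for i, ps in enumerate(preds):
            for p in ps:
                succs[p].append(i)
        remaining = [len(ps) for ps in preds]
        ready = [i for i in range(n) if remaining[i] == 0]
        done, cycles = 0, 0
        while done < n:
            cycles += 1
            if policy == "oldest":
                ready.sort()                            # program order
            else:                                       # most successors first
                ready.sort(key=lambda i: -len(succs[i]))
            issued, ready = ready[:width], ready[width:]
            done += len(issued)
            for i in issued:                            # wake dependents
                for s in succs[i]:
                    remaining[s] -= 1
                    if remaining[s] == 0:
                        ready.append(s)
        return cycles

    if __name__ == "__main__":
        dag = make_dag(5000)
        for policy in ("oldest", "fanout"):
            print(policy, run(dag, width=4, policy=policy))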
- Lots of people are still using SPEC95 for experiments, even though it's
woefully dated. Any performance analysis of more-interesting workloads
would be great (measuring cache behavior, ILP, microarchitectural resource
consumption, etc.). Compare SPEC's view of an architecture with the
view you get when you look at other interesting benchmarks:
productivity tools, multimedia, natural language, databases, ... Faculty expert: Dr. Burger.
- Any study of how Java interpreters, compilers, and the Java virtual machine
interact with modern architectures (and how their performance could be improved)
would be interesting. Faculty expert: Dr. Burger.
- Analysis of critical path lengths across different processor
components with respect to wire delays: which processor components
(issue logic, caches, buses, etc.) will scale the worst
as processors get way bigger and wires get way slower? Faculty expert: Dr. Burger.
-
Consider the problem of designing a portable library for shared memory
programming. Theory suggests that a bulk-synchronous style of shared memory
programming (such as in the QSM or BSP model where the results of
reads during one phase are available during the next phase) should be
superior to a synchronous programming model (where reads can be used
immediately). This would suggest that a library interface that
enqueues messages and sends them asynchronously should work well.
On the other hand, some machines support hardware for shared memory
operations. On those machines, synchronous reads and writes are
cheap. Compare the costs of the two approaches and determine which
programming model makes sense if portability across architectures is a
concern. (E.g., how much performance does an SMP give up if it uses a
more general interface that also supports machines without hardware
support for shared memory?) One factor to consider is that modern
processor architectures allow out-of-order execution that may reduce
the impact of synchronous read and write instructions as long as the
result is not used immediately; thus following a bulk-synchronous
model may have benefits even on an SMP.
Faculty expert: Dr. Ramachandran.
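To make the two interfaces concrete, here is an illustrative (not real)
Python sketch of the bulk-synchronous side: reads are enqueued and their
results only become visible at the next sync(), which is exactly where a
cluster implementation would batch and send messages and where an SMP
implementation is nearly free.

    class Handle:
        def __init__(self):
            self.value = None            # filled in at the next sync()

    class BulkSyncMemory:
        def __init__(self, size):
            self.mem = [0] * size
            self.pending = []            # (address, handle) pairs queued this phase

        def write(self, addr, value):
            self.mem[addr] = value       # a real library would also batch writes

        def enqueue_read(self, addr):
            h = Handle()
            self.pending.append((addr, h))
            return h

        def sync(self):
            # End of phase: all queued reads are satisfied together.  On a
            # cluster this is where messages would be grouped; on an SMP it
            # costs almost nothing, which is the difference to measure.
            for addr, h in self.pending:
                h.value = self.mem[addr]
            self.pending = []

    if __name__ == "__main__":
        m = BulkSyncMemory(16)
        m.write(3, 42)
        h = m.enqueue_read(3)
        m.sync()
        print(h.value)                   # 42, available only after the phase boundary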
- One of the key problems in architectures is that it is often more
difficult to improve latency than bandwidth. Prefetching is one
technique that can hide latency. Here are some possible prefetching
topics:
-
Quantify the limits of history-based prefetching. Prediction by
partial matching (originally developed for text compression) has been
shown to provide optimal prediction of future values based on past
values. Using PPM, what are the limits of memory or disk prefetching?
What input information (e.g., last k instruction addresses, last j
data addresses, distance between last k addresses, last value loaded,
...) best predicts future fetches? What is the best trade-off between
state used to store history information and prefetch performance?
Contact Mike for some code that implements PPM to help you get started.
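For a feel of the approach, here is a rough order-k context predictor in
the PPM spirit (Python, toy reference stream): it predicts the next address
delta from the longest previously seen history of deltas and falls back to
shorter contexts. This is an illustration only, not the PPM code mentioned
above.

    from collections import defaultdict, Counter

    def ppm_accuracy(addresses, max_order=3):
        # order -> context (tuple of recent deltas) -> counts of next delta
        tables = [defaultdict(Counter) for _ in range(max_order + 1)]
        deltas = [b - a for a, b in zip(addresses, addresses[1:])]
        history, correct = [], 0
        for d in deltas:
            # Predict with the longest context that has been seen before.
            prediction = None
            for order in range(min(max_order, len(history)), 0, -1):
                ctx = tuple(history[-order:])
                if tables[order][ctx]:
                    prediction = tables[order][ctx].most_common(1)[0][0]
                    break
            if prediction == d:
                correct += 1
            # Update every context length with the observed delta.
            for order in range(1, min(max_order, len(history)) + 1):
                tables[order][tuple(history[-order:])][d] += 1
            history.append(d)
        return correct / float(len(deltas))

    if __name__ == "__main__":
        pattern = [64, 64, 64, 4096]     # toy repeating stride pattern
        addrs, a = [], 0x10000
        for i in range(20000):
            addrs.append(a)
            a += pattern[i % len(pattern)]
        print("prediction accuracy: %.2f" % ppm_accuracy(addrs))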
- Add a small amount of hardware to DRAM memory chips to exploit
DRAM internal bandwidths to avoid DRAM latencies. Evaluate the
performance benefits that can be gained and the costs of modifying the
hardware.
- Implement a run-time system for automatically translating from a
QSM algorithm description to a program that has good performance on
some parallel architecture. For example, this system might pipeline
network messages to hide latency, group network messages to hide
overhead, efficiently multiplex many "virtual processors" onto a
smaller number of physical processors, .... (This project is almost
certainly far too challenging to pull off in a semester, but is there a
small piece of the problem you can attack to make progress toward this
eventual goal?)
- Compare a hardware implementation of a Java virtual machine to software
emulation on a fast processor. Given the checkered history of hardware
instruction sets targeted to specific high-level languages, why does Sun
think this approach will work for Java? Are they right?
- Over the past 2 decades, memory sizes have increased by a factor of
1000, and page sizes by only a factor of 2-4. Should page sizes be dramatically
larger, or are a few large "superpages" sufficient to offset
this trend in most cases?
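Some back-of-the-envelope arithmetic with assumed numbers shows why the
question matters: a fixed-size TLB covers a rapidly shrinking fraction of
memory if page sizes stay put.

    # Assumed numbers for illustration: a 64-entry TLB against growing memory.
    ENTRIES = 64
    for mem_mb, page_kb in [(4, 4), (64, 4), (4096, 8)]:
        reach_kb = ENTRIES * page_kb
        print("%5d MB memory, %d KB pages: TLB covers %.3f%% of it"
              % (mem_mb, page_kb, 100.0 * reach_kb / (mem_mb * 1024)))
    # A handful of superpage entries (say four 4 MB mappings) adds 16 MB of
    # coverage, which is exactly the trade-off the question above asks about.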
- Compare 4 visions for future architectures -- traditional (big caches,
super-pipelined/super-scalar, speculative execution, ...), IRAM (combine
processor and RAM on one chip), single-chip multiprocessors, and multi-threading
with fast context switches to tolerate memory latencies.
- Extend Transparent Informed Prefetching (Patterson et al.,
SOSP95), which was designed for page-level prefetching/caching, to
balance cache-line hardware prefetching vs. hardware caching.
- Cooperative caching uses fast networks to access remote memory in
lieu of disk accesses. One drawback is that a user's data may be stored
on multiple machines, potentially opening security holes
(eavesdropping, modification). Encryption and digital signatures may
solve the problem, but could slow down the system. Evaluate the
performance impact of adding encryption and digital signatures to
cooperatively cached data and project this performance into the future
as processor speeds improve and as companies like Intel propose adding
encryption functions to their processors.
- As memory latencies increase, cache miss times could run to
thousands of instruction-issue opportunities. This is nearly the same
ratio of memory access times as was seen for early VM paging
systems. As miss times become this extreme, is it time to give
control of cache replacement to the software? Will larger degrees of
associativity be appropriate for caches?
- Your good idea here...