For many of these studies, keep in mind technology trends. How can you
use 1 billion transistors to improve performance? How do we get good
performance when a memory access costs 10,000 instruction times? How do we
deal with new Web, Object Oriented, or Multimedia workloads? How do we
get good I/O performance when disks are getting slower and slower relative
to CPUs?
Some of these project ideas have been suggested by members of the
faculty. If you decide to work on them, you may want to talk to those
faculty to get a sense of where they wanted to go...
-
Select a paper that interests you from a recent ASPLOS or ISCA
proceedings. Construct a simulator that will allow you to reproduce
their main results and validate your simulator using their workload or
a similar one. Are there any major assumptions the authors didn't
mention in the paper? Use your simulator to evaluate their technique
under a new workload or improve their technique and quantify your
improvements.
-
Where does the time go? At the last SOSP conference, there were two
papers that purported to tell us "where does the time go" by
periodically interrupting the CPU to see which instruction it was
executing and thereby develop a long-term profile of CPU
execution. Unfortunately, this methodology only measures where the CPU
spends its time. It does not tell you when a process is stalling for
I/O. As many of us have experienced, it is common to
spend as much time waiting for I/O as waiting for the CPU, and
technology trends suggest that I/O delays will become more dominant in
the future. Use system call tracing to determine how much time
processes spend waiting for I/O and what types of I/O are causing
delays. By keeping track of which threads are active, it should be
possible to determine whether it is I/O or processing that causes
users to wait and to further determine which I/O operations are
responsible. Talk to Mike if you are interested in this project. I've
got an initial prototype system for system call tracing.
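As a rough starting point, here is a minimal sketch (assuming a Unix-like
system and Python) that splits a child process's wall-clock time into CPU
time and everything else, most of which is I/O wait and scheduling delay;
the default command is just a placeholder. Attributing the waiting time to
specific system calls would need real tracing, such as the prototype
mentioned above.

    # Minimal sketch: run a command and split its wall-clock time into CPU
    # time and "everything else" (mostly I/O wait and scheduling delay).
    # Assumes a Unix-like system; the default command is only a placeholder.
    import os, sys, time

    def profile_command(argv):
        t0 = time.time()
        before = os.times()
        pid = os.fork()
        if pid == 0:                       # child: exec the workload
            os.execvp(argv[0], argv)
        os.waitpid(pid, 0)
        after = os.times()
        wall = time.time() - t0
        cpu = ((after.children_user - before.children_user) +
               (after.children_system - before.children_system))
        print("wall: %.3fs  CPU: %.3fs  waiting (I/O etc.): %.3fs"
              % (wall, cpu, max(0.0, wall - cpu)))

    if __name__ == "__main__":
        profile_command(sys.argv[1:] or ["ls", "-lR", "/usr"])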
- As CPU cache miss times approach thousands of cycles, it seems
likely that, while a miss is being serviced, the processor
could execute a cache-replacement-optimization program "in the
background" without slowing down any unblocked data-flows of execution
(Yale Patt at Michigan calls this sort of optimization code
"micro-threads".) This project has two parts.
First, estimate an upper bound on the
performance that could be gained as follows: simulate a k-way
associative cache where each cache set uses random, FIFO, LRU, and OPT
replacement. Current caches use k = 1 to 8 and one of the simple
replacement policies, and the best your system could do would be to
approximate a fully-associative cache with OPT
replacement. The gap between those two cases is a reasonable upper
bound on the benefits this scheme could achieve. Also, this experiment
will tell you what level of associativity and replacement policy to aim
for in your design. You may want to run this experiment for L1, L2,
and L3 caches to see where to focus your efforts. Second, design a cache
microarchitecture that would allow for more sophisticated replacement
policies. My intuition is that it will be important to make sure your
design does not slow down hits or the time it takes to issue the miss
request to memory, but it can probably burn a lot of cycles deciding
which current cache line to replace when that data comes back, or
moving data between different cache entries.
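A minimal sketch of the first experiment, assuming Python, a stand-in
trace, and made-up cache geometry (64-byte lines, 256 sets); a real study
would replay traces taken from a simulator or from instrumented programs:

    import random
    from collections import defaultdict

    LINE = 64          # bytes per cache line (assumed)
    SETS = 256         # number of sets (assumed)

    def simulate(trace, ways, policy):
        lines = [a // LINE for a in trace]
        # For OPT: next position at which each line is referenced again.
        next_use = [0] * len(lines)
        last_seen = {}
        for i in range(len(lines) - 1, -1, -1):
            next_use[i] = last_seen.get(lines[i], float("inf"))
            last_seen[lines[i]] = i
        sets = defaultdict(list)   # set index -> resident lines, oldest first
        future = {}                # line -> position of its next reference
        misses = 0
        for i, line in enumerate(lines):
            entries = sets[line % SETS]
            if line in entries:
                if policy == "LRU":
                    entries.remove(line)
                    entries.append(line)      # move to MRU position
            else:
                misses += 1
                if len(entries) >= ways:
                    if policy == "RANDOM":
                        victim = random.randrange(len(entries))
                    elif policy in ("FIFO", "LRU"):
                        victim = 0            # oldest / least recently used
                    else:                     # OPT: farthest next reference
                        victim = max(range(len(entries)),
                                     key=lambda j: future.get(entries[j], float("inf")))
                    entries.pop(victim)
                entries.append(line)
            future[line] = next_use[i]
        return misses

    if __name__ == "__main__":
        trace = [random.randrange(1 << 20) for _ in range(200000)]  # stand-in trace
        for ways in (1, 2, 4, 8):
            for policy in ("RANDOM", "FIFO", "LRU", "OPT"):
                print("%d-way %-6s misses: %d"
                      % (ways, policy, simulate(trace, ways, policy)))

Running the same trace at several associativities and comparing OPT against
the simple policies gives the gap described above.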
- Along with the number of transistors, the complexity of
microprocessor architectures continues to grow exponentially, with very
complex out-of-order processors being de rigueur. It is still not
readily apparent how much performance is really being delivered to
applications compared to simpler in-order designs. On a spectrum of
benchmarks, quantify (through simulation; SimpleScalar is suggested) the
performance difference between an out-of-order processor and a simpler
in-order processor, taking into account not only CPI, but also clock
rate and power consumption. Faculty expert: Dr. Keckler
- DRAMs are highly optimized for accesses that exhibit locality.
Examine a memory interface architecture that reorders memory accesses
to better exploit the column, page, and pipeline modes of modern DRAM
implementations. Faculty expert: Dr. Keckler
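For illustration only, a toy open-page scheduler in Python: it reorders a
window of pending requests so accesses to the currently open row are served
first. The row size and cycle costs are invented numbers, not real DRAM
timing parameters.

    ROW_BYTES = 2048                      # assumed row size
    ROW_HIT_CYCLES, ROW_MISS_CYCLES = 4, 12   # assumed costs

    def service_time(requests, reorder):
        open_row, cycles = None, 0
        pending = list(requests)
        while pending:
            if reorder:
                # Prefer a request that hits the currently open row, if any.
                hits = [r for r in pending if r // ROW_BYTES == open_row]
                req = hits[0] if hits else pending[0]
            else:
                req = pending[0]          # strict arrival order
            pending.remove(req)
            row = req // ROW_BYTES
            cycles += ROW_HIT_CYCLES if row == open_row else ROW_MISS_CYCLES
            open_row = row
        return cycles

    if __name__ == "__main__":
        import random
        reqs = [random.choice([0x1000, 0x9000]) + 64 * random.randrange(16)
                for _ in range(1000)]
        print("in order:", service_time(reqs, False),
              "reordered:", service_time(reqs, True))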
- Select an embedded application (such as interactive multimedia) and
design and evaluate an architecture that executes it in a mobile
environment. Address issues of functionality, performance (or at
least providing the illusion of sufficient performance), and power
consumption. Faculty expert: Dr. Keckler
- Compare alternatives of embedding processing power in a DRAM chip
(i.e., reconfigurable logic vs. a highly custom processor vs. hardwired
logic for a given application) on a suite of data intensive and
computationally demanding benchmarks. Faculty expert: Dr. Keckler
- Characterize the benefits and costs of value prediction vs.
other predictive techniques, such as instruction reuse. In the
best cases, what is the maximum performance benefit? Faculty expert:
Dr. Keckler
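As one concrete baseline, a last-value predictor is just a table indexed by
load PC; the sketch below (Python, synthetic trace) measures how often the
value last seen at a PC recurs. Real traces from a simulator would replace
the toy input.

    # Sketch of the simplest value predictor: a table, indexed by load PC,
    # that predicts the value last seen at that PC.  The trace is synthetic.
    def last_value_accuracy(trace, table_size=4096):
        table, correct = {}, 0
        for pc, value in trace:
            idx = pc % table_size
            if table.get(idx) == value:
                correct += 1
            table[idx] = value
        return correct / float(len(trace))

    if __name__ == "__main__":
        import random
        trace = []
        for _ in range(100000):
            pc = random.randrange(256) * 4
            # Toy behavior: low PCs always load a constant, the rest are random.
            value = 7 if pc < 512 else random.randrange(1 << 16)
            trace.append((pc, value))
        print("last-value prediction accuracy: %.2f" % last_value_accuracy(trace))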
- Compare the performance of a deep cache hierarchy (multiple levels)
vs. a flatter organization (only one level) on a family
of scientific and data intensive applications. Devise strategies
to get the benefits of both. Faculty expert: Dr. Keckler
- In large, out-of-order cores, loads have to be held back when an earlier store's
address is unknown (because it might be the same). Dependence prediction guesses
which load/store pairs are going to have dependences, and which aren't. These
predictors have also been used to communicate values from stores to loads and
do prefetching. Lots of interesting stuff here! Faculty expert: Dr. Burger.
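One very simple predictor in this family sets a sticky "wait" bit for any
load that has ever been caught depending on an unresolved store; the sketch
below (Python, synthetic trace) counts the violations it avoids and the safe
loads it needlessly delays. The trace format is an assumption.

    # Minimal sketch of a simple dependence predictor: a load that has ever
    # been caught depending on an earlier, unresolved store is predicted to
    # wait from then on.  Trace entries are (load PC, did-it-conflict).
    def evaluate(trace):
        wait_bit = set()                 # load PCs predicted to wait
        violations = false_waits = 0
        for load_pc, conflicts in trace:
            predicted_wait = load_pc in wait_bit
            if conflicts and not predicted_wait:
                violations += 1          # speculated past a store it needed
                wait_bit.add(load_pc)
            elif predicted_wait and not conflicts:
                false_waits += 1         # delayed a load that was safe
        return violations, false_waits

    if __name__ == "__main__":
        import random
        trace = []
        for _ in range(50000):
            pc = random.randrange(64)
            # Toy behavior: a few "hot" loads conflict often, the rest rarely.
            trace.append((pc, random.random() < (0.6 if pc < 8 else 0.01)))
        print("violations, false waits:", evaluate(trace))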
- Because of wire delays and register file bandwidth, processor designers have
started looking at (and building, cf. the Alpha 21264) clusters, in which groups of
functional units are associated with separate register files within a core.
How to schedule work on these, and their implications for future architectures,
is a hot topic. Faculty expert: Dr. Burger.
- Simultaneous Multithreaded Processors (SMT) run multiple tasks in an out-of-order
core at the same time, sharing the dynamic resources (physical registers,
issue slots, cache pipes). Experiments on how resource usage conflicts
arise in the different shared resources under different combinations of
workloads would be interesting (there is a lot of work going on in this
area, so a
literature search would be crucial). Faculty expert: Dr. Burger.
- When multiple threads are running in an SMT core, how many extra cache misses
are caused by the intersections of the threads' working sets? Quantifying this
for different workload combinations was the project I had in mind. Faculty expert: Dr. Burger.
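A minimal way to get at this number, sketched in Python with synthetic
traces and an assumed cache geometry: run each thread's trace through the
cache alone, then interleaved, and count the extra misses.

    from collections import defaultdict

    def misses(trace, sets=512, ways=4, line=64):
        cache = defaultdict(list)        # set index -> resident lines, LRU order
        count = 0
        for addr in trace:
            tag = addr // line
            s = tag % sets
            if tag in cache[s]:
                cache[s].remove(tag)
            else:
                count += 1
                if len(cache[s]) >= ways:
                    cache[s].pop(0)      # evict least recently used
            cache[s].append(tag)
        return count

    def interleave(a, b):
        out = []
        for x, y in zip(a, b):
            out += [x, y]
        return out

    if __name__ == "__main__":
        import random
        thread_a = [random.randrange(1 << 18) for _ in range(100000)]
        thread_b = [random.randrange(1 << 18) for _ in range(100000)]
        alone = misses(thread_a) + misses(thread_b)
        shared = misses(interleave(thread_a, thread_b))
        print("extra misses caused by sharing the cache:", shared - alone)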
- When a number of instructions waiting to execute in an out-of-order core are
ready to go, but there are too many for (a) the issue width or (b) for the
particular functional unit types available to issue in a single cycle, the
hardware must choose among them. Oldest-first is the usual strategy. Other
selection algorithms may be better. It would be interesting to try a few.
Faculty expert: Dr. Burger.
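A toy version of that experiment, in Python with an invented random
dependence DAG: issue it at a fixed width under oldest-first and under a
"most successors first" policy and compare total cycles. Everything here
(DAG shape, unit latencies, the alternative policy) is an assumption for
illustration.

    import random

    def make_dag(n, fanin=2):
        # Each instruction depends on 0..fanin earlier instructions.
        return [set(random.sample(range(i), min(i, random.randrange(fanin + 1))))
                for i in range(n)]

    def run(preds, width, policy):
        n = len(preds)
        succs = [[] for _ in range(n)]
        for i, ps in enumerate(preds):
            for p in ps:
                succs[p].append(i)
        remaining = [len(ps) for ps in preds]
        ready = [i for i in range(n) if remaining[i] == 0]
        done, cycles = 0, 0
        while done < n:
            cycles += 1
            if policy == "oldest":
                ready.sort()                            # program order
            else:                                       # most successors first
                ready.sort(key=lambda i: -len(succs[i]))
            issued, ready = ready[:width], ready[width:]
            done += len(issued)
            for i in issued:                            # wake dependents
                for s in succs[i]:
                    remaining[s] -= 1
                    if remaining[s] == 0:
                        ready.append(s)
        return cycles

    if __name__ == "__main__":
        dag = make_dag(5000)
        for policy in ("oldest", "fanout"):
            print(policy, run(dag, width=4, policy=policy))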
- Lots of people are still using SPEC95 for experiments, even though it's
woefully dated. Any performance analysis of more-interesting workloads
would be great (measuring cache behavior, ILP, microarchitectural resource
consumption, etc.). Compare SPEC's view of an architecture with the
view you get when you look at other interesting benchmarks:
productivity tools, multimedia, natural language, databases, ... Faculty expert: Dr. Burger.
- Any study of how Java interpreters, compilers, and the Java virtual machine
interact with modern architectures (and how their performance could be improved)
would be interesting. Faculty expert: Dr. Burger.
- Analysis of critical path lengths across different processor
components with respect to wire delays: which processor components
(issue logic, caches, buses, etc.) will scale the worst
as processors get way bigger and wires get way slower? Faculty expert: Dr. Burger.
-
Consider the problem of designing a portable library for shared memory
programming. Theory suggests that a bulk-synchronous style of shared memory
programming (such as in the QSM or BSP model where the results of
reads during one phase are available during the next phase) should be
superior to a synchronous programming model (where reads can be used
immediately). This would suggest that a library interface that
enqueues messages and sends them asynchronously should work well.
On the other hand, some machines support hardware for shared memory
operations. On those machines, synchronous reads and writes are
cheap. Compare the costs of the two approaches and determine which
programming model makes sense if portability across architectures is a
concern. (E.g., how much performance does an SMP give up if it uses a
more general interface that also supports machines without hardware
support for shared memory?) One factor to consider is that modern
processor architectures allow out-of-order execution that may reduce
the impact of synchronous read and write instructions as long as the
result is not used immediately; thus following a bulk-synchronous
model may have benefits even on an SMP.
Faculty expert: Dr. Ramachandran.
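To make the two interfaces concrete, here is an illustrative (not real)
Python sketch of the bulk-synchronous side: reads are enqueued and their
results only become visible at the next sync(), which is exactly where a
cluster implementation would batch and send messages and where an SMP
implementation is nearly free.

    class Handle:
        def __init__(self):
            self.value = None            # filled in at the next sync()

    class BulkSyncMemory:
        def __init__(self, size):
            self.mem = [0] * size
            self.pending = []            # (address, handle) pairs queued this phase

        def write(self, addr, value):
            self.mem[addr] = value       # a real library would also batch writes

        def enqueue_read(self, addr):
            h = Handle()
            self.pending.append((addr, h))
            return h

        def sync(self):
            # End of phase: all queued reads are satisfied together.  On a
            # cluster this is where messages would be grouped; on an SMP it
            # costs almost nothing, which is the difference to measure.
            for addr, h in self.pending:
                h.value = self.mem[addr]
            self.pending = []

    if __name__ == "__main__":
        m = BulkSyncMemory(16)
        m.write(3, 42)
        h = m.enqueue_read(3)
        m.sync()
        print(h.value)                   # 42, available only after the phase boundary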
- One of the key problems in architectures is that it is often more
difficult to improve latency than bandwidth. Prefetching is one
technique that can hide latency. Here are some possible prefetching
topics:
-
Quantify the limits of history-based prefetching. Prediction by
partial matching (originally developed for text compression) has been
shown to provide optimal prediction of future values based on past
values. Using PPM, what are the limits of memory or disk prefetching?
What input information (e.g., last k instruction addresses, last j
data addresses, distance between last k addresses, last value loaded,
...) best predicts future fetches? What is the best trade-off between
state used to store history information and prefetch performance?
Contact Mike for some code that implements PPM to help you get started.
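For a feel of the approach, here is a rough order-k context predictor in
the PPM spirit (Python, toy reference stream): it predicts the next address
delta from the longest previously seen history of deltas and falls back to
shorter contexts. This is an illustration only, not the PPM code mentioned
above.

    from collections import defaultdict, Counter

    def ppm_accuracy(addresses, max_order=3):
        # order -> context (tuple of recent deltas) -> counts of next delta
        tables = [defaultdict(Counter) for _ in range(max_order + 1)]
        deltas = [b - a for a, b in zip(addresses, addresses[1:])]
        history, correct = [], 0
        for d in deltas:
            # Predict with the longest context that has been seen before.
            prediction = None
            for order in range(min(max_order, len(history)), 0, -1):
                ctx = tuple(history[-order:])
                if tables[order][ctx]:
                    prediction = tables[order][ctx].most_common(1)[0][0]
                    break
            if prediction == d:
                correct += 1
            # Update every context length with the observed delta.
            for order in range(1, min(max_order, len(history)) + 1):
                tables[order][tuple(history[-order:])][d] += 1
            history.append(d)
        return correct / float(len(deltas))

    if __name__ == "__main__":
        pattern = [64, 64, 64, 4096]     # toy repeating stride pattern
        addrs, a = [], 0x10000
        for i in range(20000):
            addrs.append(a)
            a += pattern[i % len(pattern)]
        print("prediction accuracy: %.2f" % ppm_accuracy(addrs))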
- Add a small amount of hardware to DRAM memory chips to exploit
DRAM internal bandwidths to avoid DRAM latencies. Evaluate the
performance benefits that can be gained and the costs of modifying the
hardware.
- Implement a run-time system for automatically translating from a
QSM algorithm description to a program that has good performance on
some parallel architecture. For example, this system might pipeline
network messages to hide latency, group network messages to hide
overhead, efficiently multiplex many "virtual processors" onto a
smaller number of physical processors, .... (This project is almost
certainly far too challenging to pull off in a semester, but is there a
small piece of the problem you can attack to make progress toward this
eventual goal?)
- Compare a hardware implementation of a Java virtual machine to software
emulation on a fast processor. Given the checkered history of hardware
instruction sets targeted to specific high-level languages, why does Sun
think this approach will work for Java? Are they right?
- Over the past 2 decades, memory sizes have increased by a factor of
1000, and page sizes by only a factor of 2-4. Should page sizes be dramatically
larger, or are a few large "superpages" sufficient to offset
this trend in most cases?
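Some back-of-the-envelope arithmetic with assumed numbers shows why the
question matters: a fixed-size TLB covers a rapidly shrinking fraction of
memory if page sizes stay put.

    # Assumed numbers for illustration: a 64-entry TLB against growing memory.
    ENTRIES = 64
    for mem_mb, page_kb in [(4, 4), (64, 4), (4096, 8)]:
        reach_kb = ENTRIES * page_kb
        print("%5d MB memory, %d KB pages: TLB covers %.3f%% of it"
              % (mem_mb, page_kb, 100.0 * reach_kb / (mem_mb * 1024)))
    # A handful of superpage entries (say four 4 MB mappings) adds 16 MB of
    # coverage, which is exactly the trade-off the question above asks about.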
- Compare 4 visions for future architectures -- traditional (big caches,
super-pipelined/super-scalar, speculative execution, ...), IRAM (combine
processor and RAM on one chip), single-chip multiprocessors, and multi-threading
with fast context switches to tolerate memory latencies.
- Extend Transparent Informed Prefetching (Patterson et al.,
SOSP95), which was designed for page-level prefetching/caching, to
balance cache-line hardware prefetching vs. hardware caching.
- Cooperative caching uses fast networks to access remote memory in
lieu of disk accesses. One drawback is that a user's data may be stored
on multiple machines, potentially opening security holes
(eavesdropping, modification). Encryption and digital signatures may
solve the problem, but could slow down the system. Evaluate the
performance impact of adding encryption and digital signatures to
cooperatively cached data and project this performance into the future
as processor speeds improve and as companies like Intel propose adding
encryption functions to their processors.
- As memory latencies increase, cache miss times could run to
thousands of instruction-issue opportunities. This is nearly the same
ratio of memory access times as was seen for early VM paging
systems. As miss times become this extreme, is it time to give
control of cache replacement to the software? Will larger degrees of
associativity be appropriate for caches?
- Your good idea here...