HW1: Web Cache Simulator
CS395t: Web Operating Systems
Spring 2001
Due: Wednesday Feb 14 in class
Each person in class should work individually on this project.
Problem statement
A key technique for doing research on wide-area (and other) systems is
simulation. In this class we will build a few small simulators to
become familiar with some of these techniques.
In this homework, you will construct a simulator to measure some of the
basic properties of HTTP traffic.
Building a simple simulator is easy. But you have to make
assumptions and inferences about the workload. A focus of this
homework is identifying and quantifying possible sources of error in
your study.
Input
- A 3-day HTTP trace from the squid distributed caches
in /u/dahlin/tmp/squid-trace-395t/. This trace is described in the
README file in that directory.
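The README in that directory is the authoritative description of the trace
format. Purely as a rough illustration, the sketch below assumes a layout
similar to squid's standard access.log (timestamp, elapsed time, client,
result code/status, bytes, method, URL, ...); the field order and names here
are assumptions to check against the README before you rely on them.

    #include <sstream>
    #include <string>

    // One trace record.  The field order follows the common squid access.log
    // layout; it is an assumption -- check the README in the trace directory.
    struct Record {
        double      timestamp;    // request completion time (Unix seconds)
        long        elapsedMs;    // service time in milliseconds
        std::string client;       // client address (possibly anonymized)
        std::string resultCode;   // e.g. "TCP_MISS/200"
        long        bytes;        // size of the reply
        std::string method;       // GET, POST, ...
        std::string url;          // requested object
    };

    // Parse one whitespace-separated trace line; returns false if the line
    // does not contain at least the fields above.
    bool parseLine(const std::string& line, Record& r) {
        std::istringstream in(line);
        in >> r.timestamp >> r.elapsedMs >> r.client >> r.resultCode
           >> r.bytes >> r.method >> r.url;
        return !in.fail();
    }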
Output
- The average byte hit rate and request hit rate your simulator
measures, assuming (a) an infinite shared cache and (b) that
statistics are gathered over the entire trace.
- A graph in which the x-axis is the cache size (in log scale,
ranging from 1MB to a size large enough to accommodate all requests
without requiring objects to be replaced), the y-axis is object hit
rate, and there are three lines reflecting different "warm-up"
periods: 0-day warm-up, 1-day warm-up, and 2-day warm-up. (A sketch of
the cache bookkeeping needed for both outputs appears after this list.)
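For both outputs above, the core bookkeeping is the same: look each URL up in
the simulated cache, count a hit or a miss, and on a miss insert the object,
evicting in LRU order once the configured capacity is exceeded. The class
below is one possible sketch of that structure, not a prescribed design;
making the capacity larger than the total trace footprint gives the
"infinite cache" case. The invalidate() method is there for the consistency
simulation described in the next section.

    #include <list>
    #include <map>
    #include <string>

    // Sketch of an LRU object cache.  A capacity larger than the whole trace
    // footprint gives the "infinite cache" case.
    class LruCache {
    public:
        explicit LruCache(long long capacityBytes)
            : capacity_(capacityBytes), used_(0) {}

        // Look up url; returns true on a hit (and refreshes recency).
        // On a miss, inserts the object, evicting least-recently-used
        // objects until the new object fits.
        bool access(const std::string& url, long long size) {
            std::map<std::string, Entry>::iterator it = index_.find(url);
            if (it != index_.end()) {
                lru_.erase(it->second.pos);              // move to MRU position
                lru_.push_front(url);
                it->second.pos = lru_.begin();
                return true;
            }
            while (!lru_.empty() && used_ + size > capacity_) {
                const std::string victim = lru_.back();  // evict coldest object
                used_ -= index_[victim].size;
                index_.erase(victim);
                lru_.pop_back();
            }
            lru_.push_front(url);
            Entry e;
            e.pos = lru_.begin();
            e.size = size;
            index_[url] = e;
            used_ += size;
            return false;
        }

        // Remove an object; used when the consistency simulation decides a
        // cached copy has become stale (see the Methodology section).
        void invalidate(const std::string& url) {
            std::map<std::string, Entry>::iterator it = index_.find(url);
            if (it == index_.end())
                return;
            used_ -= it->second.size;
            lru_.erase(it->second.pos);
            index_.erase(it);
        }

    private:
        struct Entry { std::list<std::string>::iterator pos; long long size; };
        long long capacity_;
        long long used_;
        std::list<std::string> lru_;             // front = most recently used
        std::map<std::string, Entry> index_;     // url -> list position + size
    };

Hit and byte counters belong in the driver rather than in the cache: the
driver increments them only for records whose timestamps fall after the
chosen warm-up boundary.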
Methodology (and more output)
- Describe how you detect uncachable objects in the input
trace. What fraction of requests and what fraction of bytes (requests
weighted by object size) do you classify as uncachable? Are there any
requests that are ambiguous as to their cacheability? If so, quantify
the most that different assumptions about cacheability could change
your first hit rate result (hit rate over the entire trace assuming an
infinite cache and no cache warming interval). One possible
classification is sketched after this list.
- You should simulate an invalidation-based consistency
algorithm (i.e., objects disappear from the cache as soon as they
change at the server; you don't need to simulate the mechanics of the
invalidation protocol, just delete things from the cache when you need
to). Describe how you detect changes to objects in the trace
(i.e., when do cached copies of objects become stale in your
simulator?). Under what circumstances will your simulator overestimate
hit rate by "missing" a change and counting a hit to an object that
should have been invalidated? For each of these circumstances, bound
the maximum amount that the factor could change your first hit rate
result (or explain why a meaningful bound is impossible, or argue that
the factor is unlikely to affect your results). One change-detection
approach is sketched after this list.
- The squid trace indicates which requests are hits in the squid
cache and which are misses. How does your simulated hit rate differ
from the hit rate calculated by counting hits reported by squid?
What factors account for the difference (can you quantify or bound
these factors?)
- Were there any other significant assumptions you made that could
affect your results?
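As a starting point for the uncachability question above, here is one
possible, deliberately simple classifier. Treating non-GET methods, query
URLs, and cgi-bin paths as uncachable is an assumption, not the required
rule; part of the exercise is deciding which such rules you believe and
measuring how much the ambiguous cases could change your hit rates.

    #include <string>

    // Example heuristics for flagging a request as uncachable.  The specific
    // rules (non-GET methods, '?' in the URL, cgi-bin paths) are assumptions
    // to document and vary, not a definitive classification.
    bool looksUncachable(const std::string& method, const std::string& url) {
        if (method != "GET")
            return true;                          // POST etc.: do not cache
        if (url.find('?') != std::string::npos)
            return true;                          // query string: likely dynamic
        if (url.find("cgi-bin") != std::string::npos)
            return true;                          // CGI output: likely dynamic
        return false;
    }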
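For the invalidation question, one observable signal in the trace is that the
reply size reported for a URL changes between requests; the sketch below
invalidates on that signal. This is an assumed heuristic: it misses
modifications that leave the size unchanged and modifications that happen
between the trace's observations of an object, which is exactly the kind of
error the question asks you to bound.

    #include <map>
    #include <string>

    // Sketch: infer that an object changed because the reply size for its URL
    // differs from the size seen on the previous request for that URL.
    class ChangeDetector {
    public:
        // Returns true if this request suggests any cached copy is stale.
        bool changed(const std::string& url, long long size) {
            std::map<std::string, long long>::iterator it = lastSize_.find(url);
            bool differs = (it != lastSize_.end() && it->second != size);
            lastSize_[url] = size;
            return differs;
        }
    private:
        std::map<std::string, long long> lastSize_;  // url -> last observed size
    };

The driver would call changed() on every record and, when it returns true,
call the cache's invalidate() for that URL before doing the lookup, so the
current request correctly counts as a miss on the new version of the object.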
Hints
- The simulator should be written using a language of your choice
(C, C++, Java, even Perl I suppose.) Follow good coding practice
(modularity, etc.) throughout. Feel free to use standard
data-structure libraries (e.g., STL, LEDA) or code you have written in
the past.
- Your simulator should be event-based with no
queuing: process each record in the trace in its entirety, then process
the next. (Avoiding simulating concurrency can greatly simplify your
simulator. Later in the semester, we will look at a simulator that
accounts for queuing effects.)
- Take the few seconds of extra time to pass parameters into your
simulator on the command line (rather than setting them as constants
in the source files). This lets you run your experiments using scripts
that repeatedly execute your simulator with different parameters. It
also lets you easily document which parameters were used for which runs
in your output files (e.g., code for this goes into just about every
simulator I write). A skeletal driver illustrating both this hint and
the previous one appears at the end of these hints.
- It is often useful to have a simulator spit out raw data that is
later post-processed using awk/perl/etc to create graphs.
- For large-scale simulations, I find it useful to (a) put each
simulation run in a separate output file, (b) put *all* parameters
to the simulation run in the output file, (c) put the set of
experiments corresponding to a graph in a single subdirectory, (d)
name each file in the subdirectory so that the name includes the
parameter(s) that vary across the experiment runs for that graph,
(e) put a script "go.sh" in each subdirectory that runs the
experiments needed in that subdirectory, and (f) put a script
"graph.sh" in each subdirectory that generates the
graph for that subdirectory. This way, I have some hope of coping with
a paper that has 15 graphs in it. And, I have some hope that 6 months
later when I go back to, say, make the journal version of the paper
(or my dissertation), I will be able to regenerate my experiments.
- gnuplot is a good tool for making graphs. Don't forget to explicitly set
the yrange if needed to ensure that "0" is included; it is bad form to
truncate the y-axis of a scientific graph except in very rare cases.
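To make the event-loop and command-line hints concrete, here is a minimal,
hypothetical driver skeleton; the argument names and usage line are made up
for illustration, and the per-record work is only sketched in comments.

    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Hypothetical driver: parameters come from argv rather than compile-time
    // constants, are echoed into the output so each results file is
    // self-describing, and records are processed one at a time (no queuing).
    //   usage: sim <cacheSizeMB> <warmupDays> < trace-file
    int main(int argc, char** argv) {
        if (argc != 3) {
            std::cerr << "usage: " << argv[0]
                      << " <cacheSizeMB> <warmupDays> < trace-file\n";
            return 1;
        }
        long cacheSizeMB = std::atol(argv[1]);
        int warmupDays = std::atoi(argv[2]);

        // Echo every parameter into the output file.
        std::cout << "# cacheSizeMB=" << cacheSizeMB
                  << " warmupDays=" << warmupDays << "\n";

        std::string line;
        while (std::getline(std::cin, line)) {
            // Parse the record, classify it, and run it through the cache
            // (see the earlier sketches), finishing each record completely
            // before reading the next one.
        }

        // Emit raw hit/byte counts here for awk/perl post-processing.
        return 0;
    }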