HW1: Web Cache Simulator
CS395t: Web Operating Systems
Spring 2001
Due: Wednesday Feb 14 in class
Each person in class should work individually on this project.
Problem statement
A key technique for doing research on wide-area (and other) systems is
simulation. In this class we will build a few small simulators to
become familiar with some of these techniques.
In this homework, you will construct a simulator to measure some of the
basic properties of HTTP traffic.
Building a simple simulator is easy. But you have to make
assumptions and inferences about the workload. A focus of this
homework is identifying and quantifying possible sources of error in
your study.
Input
- A 3-day HTTP trace from the squid distributed caches
in /u/dahlin/tmp/squid-trace-395t/. This trace is described in the
README file in that directory.
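The README in that directory is the authoritative description of the trace
format. Purely as a rough illustration, the sketch below assumes a layout
similar to squid's standard access.log (timestamp, elapsed time, client,
result code/status, bytes, method, URL, ...); the field order and names here
are assumptions to check against the README before you rely on them.

    #include <sstream>
    #include <string>

    // One trace record.  The field order follows the common squid access.log
    // layout; it is an assumption -- check the README in the trace directory.
    struct Record {
        double      timestamp;    // request completion time (Unix seconds)
        long        elapsedMs;    // service time in milliseconds
        std::string client;       // client address (possibly anonymized)
        std::string resultCode;   // e.g. "TCP_MISS/200"
        long        bytes;        // size of the reply
        std::string method;       // GET, POST, ...
        std::string url;          // requested object
    };

    // Parse one whitespace-separated trace line; returns false if the line
    // does not contain at least the fields above.
    bool parseLine(const std::string& line, Record& r) {
        std::istringstream in(line);
        in >> r.timestamp >> r.elapsedMs >> r.client >> r.resultCode
           >> r.bytes >> r.method >> r.url;
        return !in.fail();
    }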
Output
- The average byte hit rate and request hit rate your simulator
measures, assuming (a) an infinite shared cache and (b) that
statistics are gathered over the entire trace.
- A graph in which the x-axis is the cache size (in log scale,
ranging from 1MB to a size large enough to accommodate all requests
without requiring objects to be replaced), the y-axis is object hit
rate, and there are three lines reflecting different "warm-up"
periods: 0-day warm-up, 1-day warm-up, and 2-day warm-up. (A sketch of
the cache bookkeeping needed for both outputs appears after this list.)
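For both outputs above, the core bookkeeping is the same: look each URL up in
the simulated cache, count a hit or a miss, and on a miss insert the object,
evicting in LRU order once the configured capacity is exceeded. The class
below is one possible sketch of that structure, not a prescribed design;
making the capacity larger than the total trace footprint gives the
"infinite cache" case. The invalidate() method is there for the consistency
simulation described in the next section.

    #include <list>
    #include <map>
    #include <string>

    // Sketch of an LRU object cache.  A capacity larger than the whole trace
    // footprint gives the "infinite cache" case.
    class LruCache {
    public:
        explicit LruCache(long long capacityBytes)
            : capacity_(capacityBytes), used_(0) {}

        // Look up url; returns true on a hit (and refreshes recency).
        // On a miss, inserts the object, evicting least-recently-used
        // objects until the new object fits.
        bool access(const std::string& url, long long size) {
            std::map<std::string, Entry>::iterator it = index_.find(url);
            if (it != index_.end()) {
                lru_.erase(it->second.pos);              // move to MRU position
                lru_.push_front(url);
                it->second.pos = lru_.begin();
                return true;
            }
            while (!lru_.empty() && used_ + size > capacity_) {
                const std::string victim = lru_.back();  // evict coldest object
                used_ -= index_[victim].size;
                index_.erase(victim);
                lru_.pop_back();
            }
            lru_.push_front(url);
            Entry e;
            e.pos = lru_.begin();
            e.size = size;
            index_[url] = e;
            used_ += size;
            return false;
        }

        // Remove an object; used when the consistency simulation decides a
        // cached copy has become stale (see the Methodology section).
        void invalidate(const std::string& url) {
            std::map<std::string, Entry>::iterator it = index_.find(url);
            if (it == index_.end())
                return;
            used_ -= it->second.size;
            lru_.erase(it->second.pos);
            index_.erase(it);
        }

    private:
        struct Entry { std::list<std::string>::iterator pos; long long size; };
        long long capacity_;
        long long used_;
        std::list<std::string> lru_;             // front = most recently used
        std::map<std::string, Entry> index_;     // url -> list position + size
    };

Hit and byte counters belong in the driver rather than in the cache: the
driver increments them only for records whose timestamps fall after the
chosen warm-up boundary.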
Methodology (and more output)
- Describe how you detect uncachable objects in the input
trace. What fraction of requests and what fraction of bytes (requests
weighted by object size) do you classify as uncachable? Are there any
requests that are ambiguous as to their cacheability? If so, quantify
the most that different assumptions about cacheability could change
your first hit rate result (hit rate over the entire trace assuming an
infinite cache and no cache warming interval). One possible
classification is sketched after this list.
- You should simulate an invalidation-based consistency
algorithm (i.e., objects disappear from the cache as soon as they
change at the server; you don't need to simulate the mechanics of the
invalidation protocol, just delete things from the cache when you need
to). Describe how you detect changes to objects in the trace
(i.e., when do cached copies of objects become stale in your
simulator?). Under what circumstances will your simulator overestimate
hit rate by "missing" a change and counting a hit to an object that
should have been invalidated? For each of these circumstances, bound
the maximum amount that the factor could change your first hit rate
result (or explain why a meaningful bound is impossible, or argue that
the factor is unlikely to affect your results). One change-detection
approach is sketched after this list.
- The squid trace indicates which requests are hits in the squid
cache and which are misses. How does your simulated hit rate differ
from the hit rate calculated by counting hits reported by squid?
What factors account for the difference (can you quantify or bound
these factors?)
- Were there any other significant assumptions you made that could
affect your results?
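As a starting point for the uncachability question above, here is one
possible, deliberately simple classifier. Treating non-GET methods, query
URLs, and cgi-bin paths as uncachable is an assumption, not the required
rule; part of the exercise is deciding which such rules you believe and
measuring how much the ambiguous cases could change your hit rates.

    #include <string>

    // Example heuristics for flagging a request as uncachable.  The specific
    // rules (non-GET methods, '?' in the URL, cgi-bin paths) are assumptions
    // to document and vary, not a definitive classification.
    bool looksUncachable(const std::string& method, const std::string& url) {
        if (method != "GET")
            return true;                          // POST etc.: do not cache
        if (url.find('?') != std::string::npos)
            return true;                          // query string: likely dynamic
        if (url.find("cgi-bin") != std::string::npos)
            return true;                          // CGI output: likely dynamic
        return false;
    }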
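For the invalidation question, one observable signal in the trace is that the
reply size reported for a URL changes between requests; the sketch below
invalidates on that signal. This is an assumed heuristic: it misses
modifications that leave the size unchanged and modifications that happen
between the trace's observations of an object, which is exactly the kind of
error the question asks you to bound.

    #include <map>
    #include <string>

    // Sketch: infer that an object changed because the reply size for its URL
    // differs from the size seen on the previous request for that URL.
    class ChangeDetector {
    public:
        // Returns true if this request suggests any cached copy is stale.
        bool changed(const std::string& url, long long size) {
            std::map<std::string, long long>::iterator it = lastSize_.find(url);
            bool differs = (it != lastSize_.end() && it->second != size);
            lastSize_[url] = size;
            return differs;
        }
    private:
        std::map<std::string, long long> lastSize_;  // url -> last observed size
    };

The driver would call changed() on every record and, when it returns true,
call the cache's invalidate() for that URL before doing the lookup, so the
current request correctly counts as a miss on the new version of the object.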
Hints
- The simulator should be written using a language of your choice
(C, C++, Java, even Perl I suppose.) Follow good coding practice
(modularity, etc.) throughout. Feel free to use standard
data-structure libraries (e.g., STL, LEDA) or code you have written in
the past.
- Your simulator should be event-based with no
queuing: process each record in the trace in its entirety, then process
the next. (Avoiding simulating concurrency can greatly simplify your
simulator. Later in the semester, we will look at a simulator that
accounts for queuing effects.)
- Take the few seconds of extra time to pass parameters into your
simulator on the command line (rather than setting them as constants
in the source files). This lets you run your experiments using scripts
that repeatedly execute your simulator with different parameters. It
also lets you easily document which parameters were used for which runs
in your output files (e.g., code for this goes into just about every
simulator I write). A skeletal driver illustrating both this hint and
the previous one appears at the end of these hints.
- It is often useful to have a simulator spit out raw data that is
later post-processed using awk/perl/etc to create graphs.
- For large-scale simulations, I find it useful to (a) put each
simulation run in a separate output file, (b) put *all* parameters
to the simulation run in the output file, (c) put the set of
experiments corresponding to a graph in a single subdirectory, (d)
name each file in the subdirectory so that the name includes the
parameter(s) that vary across the experiment runs for that graph,
(e) put a script "go.sh" in each subdirectory that runs the
experiments needed in that subdirectory, and (f) put a script
"graph.sh" in each subdirectory that generates the
graph for that subdirectory. This way, I have some hope of coping with
a paper that has 15 graphs in it. And, I have some hope that 6 months
later when I go back to, say, make the journal version of the paper
(or my dissertation), I will be able to regenerate my experiments.
- gnuplot is a good tool for making graphs. Don't forget to explicitly set
the yrange if needed to ensure that "0" is included; it is bad form to
truncate the y-axis of a scientific graph except in very rare cases.
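To make the event-loop and command-line hints concrete, here is a minimal,
hypothetical driver skeleton; the argument names and usage line are made up
for illustration, and the per-record work is only sketched in comments.

    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Hypothetical driver: parameters come from argv rather than compile-time
    // constants, are echoed into the output so each results file is
    // self-describing, and records are processed one at a time (no queuing).
    //   usage: sim <cacheSizeMB> <warmupDays> < trace-file
    int main(int argc, char** argv) {
        if (argc != 3) {
            std::cerr << "usage: " << argv[0]
                      << " <cacheSizeMB> <warmupDays> < trace-file\n";
            return 1;
        }
        long cacheSizeMB = std::atol(argv[1]);
        int warmupDays = std::atoi(argv[2]);

        // Echo every parameter into the output file.
        std::cout << "# cacheSizeMB=" << cacheSizeMB
                  << " warmupDays=" << warmupDays << "\n";

        std::string line;
        while (std::getline(std::cin, line)) {
            // Parse the record, classify it, and run it through the cache
            // (see the earlier sketches), finishing each record completely
            // before reading the next one.
        }

        // Emit raw hit/byte counts here for awk/perl post-processing.
        return 0;
    }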