The goal of this assignment is to learn about how to measure your program's behavior, and a bit about mmap.
We are interested in doing experimental computer science. We will follow the scientific method, which Wikipedia tells me has been around for a long time. But knowing about science in the abstract is relatively easy; actually doing good science is difficult both to learn and to execute.
Let's start with reproducibility. You will write a report for this lab, and in your report you will include details about your system. Think about what it would take to recreate your results. I won't spell out exactly what information you should include, but please include everything relevant while not bogging down your reader. You should report things like the kernel version for your host and guest system.
Your report should answer every question in this lab and should do so in a way that is clearly labeled.
This lab requires you to report data. But how you report it is up to you. Please think carefully. I do not want a huge table of numbers for every experiment. You can provide a table, discuss what is and is not relevant, then give me a graph, or a small table, or bold the entries that I should pay attention to (and that you explain in your text).
I have a major pet peeve with excessive digits of precision. Your measurements are usually counts. If you average three counts, don't give me six decimal places of precision even if six digits is the default output format for floats in the language you are using. Decide how many digits are meaningful and then report that many. Also, make your decimal points line up when that makes sense. For example, if you report a mean and a standard deviation, make the decimal places always align so you can see easily if the standard deviation is less than a tenth of the mean (which is a good sign for reproducibility).
I would use C, but you can use whatever programming tools you want. One thing I want you to do, both for this class and for real life, is always check the return code of every single system call you ever make. I know it sounds a bit pedantic, but start the habit now and you will have a happier programming life. For almost every system call, all that means is checking whether the return code is less than zero and, if so, calling perror. When system calls don't work, you really want to know about it early; trust me on this point.
Please include text about each experiment called for in this section, and include any graphs, tables or other data to make your point clearly and concisely. As always, report your experimental platform. You should find out what kind of CPU you have, its level 1 data cache capacity and associativity, and its data TLB capacity and associativity. Does your TLB have a second level? If so, describe it. Put all this information about your hardware in your report.
Your first task will be to write a program that opens, reads, and prints the /proc/self/maps file. Put the result of your program in your report and write a few sentences about what it means. I want you to look at the output, read the documentation to understand its fields, then think about the most interesting thing you found in your output and write a few sentences about it. Also answer this question: report the base address of your executable (the start of the text section) and the start address of libc. Why are these numbers so different?
Next, call getrusage at the end of your program and print out the fields. Pay particular attention to utime, stime, maxrss, minflt, majflt, inblock, oublock, and the voluntary and involuntary context switches.
You will use the perf_event_open interface to measure your code. You can do this in a VM or not, but please specify in your report what you have done. Note that you have to configure your kernel to support perf, and you might need to build the perf tools yourself if you are booting the latest kernel (as I requested in Lab 0, but which is not 100% necessary for this lab). Read the documentation. It is pretty wild how you call it, right? Does using the syscall routine mean there is a syscall opcode in your program? Check using objdump -d and explain what you find in your report.
We want you to monitor the following events: level 1 data cache accesses and misses, and data TLB misses. Figuring out how to encode those events for perf_event_open is not trivial, so use the force (and the Internet). Note the division of events into read, write and prefetch. According to my experiments, the level 1 data cache experiences many misses due to prefetch, but the counters do not report prefetch accesses or prefetch data TLB misses (the latter likely because a prefetch that would cause a TLB miss is dropped).
Depending on the machine you use, you may find that there is some nuance to interpreting the behavior of perf_event_open and the data you subsequently read for events you create: the implementation is highly architecture-dependent. In keeping with the notes below, perf list will definitely give you some good hints. However, the bottom line is always the code--if things don't make sense, you may need to read the kernel source. LXR is usually a good place to start. You may not be able to report all the metrics we ask for exactly as we define them--this is fine as long as you clearly document the limitations and how you worked around them.
Allocate a 1GB buffer, and pass the pointer to this routine (I'm assuming your host and VM have at least 2GB of memory). This routine assumes a global variable called opt_random_access. It also assumes your code defines the constant CACHE_LINE_SIZE, which is 64 on x86 platforms (and you are allowed to hard code this constant). This code represents the behavior of a program that we wish to study. Briefly summarize its memory access behavior in your report.
// p points to a region that is 1GB (ideally)
void do_mem_access(char* p, int size) {
    int i, j, count, outer, locality;
    int ws_base = 0;
    int max_base = ((size / CACHE_LINE_SIZE) - 512);
    for(outer = 0; outer < (1<<20); ++outer) {
        long r = simplerand() % max_base;
        // Pick a starting offset
        if( opt_random_access ) {
            ws_base = r;
        } else {
            ws_base += 512;
            if( ws_base >= max_base ) {
                ws_base = 0;
            }
        }
        for(locality = 0; locality < 16; locality++) {
            volatile char *a;
            char c;
            for(i = 0; i < 512; i++) {
                // Working set of 512 cache lines, 32KB
                a = p + ws_base + i * CACHE_LINE_SIZE;
                if((i%8) == 0) {
                    *a = 1;
                } else {
                    c = *a;
                }
            }
        }
    }
}
This simple random number generator comes from Wikipedia. We give it to you so you can observe it using a simple profiler. Note that our code calls this function whether or not it is doing a random traversal. Explain why in your report.
///////////////////////////////////////////////////////////////
// Simple, fast random number generator, here so we can observe it
// using a profiler
long x = 1, y = 4, z = 7, w = 13;

long simplerand(void) {
    long t = x;
    t ^= t << 11;
    t ^= t >> 8;
    x = y;
    y = z;
    z = w;
    w ^= w >> 19;
    w ^= t;
    return w;
}
Enable your tracing events right before calling do_mem_access, then disable the events, read the results and report them along with the getrusage results. You should compute and report the cache miss rate and the data TLB miss rate as percentages (100.0 * cache misses / cache accesses and 100.0 * TLB misses / cache accesses). Also look at the counts themselves.
To be even more careful about generating repeatable results you should flush the level 1 data cache before enabling the performance counters. You can do this by reading (or writing, if your cache does write allocate) a memory buffer that is larger than your cache size. Also, lock your process onto a single processor and describe the system calls you need to do this in your report.
This is the main part of your lab. Perform the following experiments. Allocate your buffer using mmap, both mapping anonymous memory and file-backed memory. Consider sequential and random access (these are properties of the do_mem_access code we supply). For file-based mmap, consider MAP_PRIVATE and MAP_SHARED. Also consider MAP_POPULATE. Consider the case when you memset the entire buffer after allocating it. Call msync after the memset. What does that do? Remember to examine overall execution time and CPU utilization (often reported by the shell).
Perform all of the above experiments and study the data generated by getrusage and perf_event_open. Describe the results in your report using tables, graphs or whatever you think is appropriate. Explain the differences between experiments and why they happen. Determine if your results are stable by running your experiment several times and measuring the standard deviation of your results. Include the results of your analysis in your report.
Here is an example of the sort of question you should be asking yourself. Why can your data TLB miss count be lower than the total number of data pages your application accesses? Answer this question in your report.
Run your program under strace. Enter the output for arch_prctl in your report and explain what this system call does. Put the output of the system call that involves /etc/ld.so.preload in your report and explain what is going on.
You should do your measurements on an otherwise quiet system. However, for extra credit, you can study the effect of background activity. Explain clearly how you generated the background activity, what you expected, and what you found.
For the final section of the lab, we will add code that forks or clones a process that will compete for memory with our process under test. When your main program ends, be sure this competing process also dies. This can be accomplished in many different ways, several of which require three or fewer lines of code.
For this part, please use the VM you built in Lab 0, configured with at least two virtual CPUs (and check that you have at least two physical CPUs in your host). Write some code that forks or clones and then calls compete_for_memory. You should fork/clone before locking your program under test to a single core. Note that compete_for_memory calls get_mem_size(), which is a function you must provide that returns the size of physical memory in bytes. Because the purpose of this function is to compete for memory with the foreground process, it is OK if this number is not exactly equal to the amount of RAM in your system, but it should be close. It is NOT acceptable to hard code this number; you must write code to measure it at runtime.
int compete_for_memory(void* unused) {
    long mem_size = get_mem_size();
    int page_sz = sysconf(_SC_PAGE_SIZE);
    printf("Total memsize is %3.2f GBs\n", (double)mem_size/(1024*1024*1024));
    fflush(stdout);
    char* p = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                   MAP_NORESERVE|MAP_PRIVATE|MAP_ANONYMOUS, -1, (off_t) 0);
    if (p == MAP_FAILED)
        perror("Failed anon MMAP competition");

    int i = 0;
    while(1) {
        volatile char *a;
        long r = simplerand() % (mem_size/page_sz);
        char c;
        if( i >= mem_size/page_sz ) {
            i = 0;
        }
        // One read and write per page
        //a = p + i * page_sz;    // sequential access
        a = p + r * page_sz;
        c += *a;
        if((i%8) == 0) {
            *a = 1;
        }
        i++;
    }
    return 0;
}
For your report: Why do we call fflush after the printf? Would this fflush be necessary if we fprintf'ed to stderr?
Now we will mess with the kernel's head (or, more precisely, its LRU page replacement algorithm). These instructions are approximate since your version of the kernel might be different, but look in mm/vmscan.c at a function likely called shrink_page_list. In it, you will see a switch statement with a PAGEREF_ACTIVATE case, which is the case where the kernel sees the page has been recently accessed. In this case the kernel gotos activate_locked, but you will change it to do the same thing as the PAGEREF_RECLAIM case. Then evaluate what this change does to your test program when it has competition for memory. You can do this by having two identical VMs with different kernels, or one kernel that you dynamically configure with its default behavior and your modified behavior, perhaps controlled via the /proc file system.
Configure your VM with at least two virtual CPUs, but first confirm that your host system has at least two CPUs.
To allocate space in a file, look at the abysmally named ftruncate and fallocate. No matter what you use, call msync after your memset and report your findings.
The perf subsystem uses the performance monitoring unit (PMU), which is a set of hardware counters supported directly by your processor. If you look at dmesg output, it should show the PMU being initialized.
Check for perf availability on your host system before checking the guest. Most of the lab can be done on the host.
If you run perf list on the command line, it will tell you what counters are supported by your combination of hardware, OS and perf tools. You need at least level 1 data cache accesses and misses and data TLB misses. Some of the prefetch varieties may be unsupported.
Historically, VirtualBox has done a poor job of virtualizing the performance counters.
Only so many counters can be active at once (the exact limit depends on the hardware/OS/perf tools). If you try to add too many you might get errors.
I'm not sure it is necessary, but if you get a lot of variation in your results for the experiments that follow, you might want to disable CPU frequency scaling on your system. I would do this in the BIOS, but you can also try user-level tools that allow you to set the frequency directly (or perhaps the "-g performance" option would work, I'm not sure). Here is one such tool: http://manpages.ubuntu.com/manpages/hardy/man1/cpufreq-selector.1.html
Your code will have to run with many different configurations. Consider using getopt, or maybe you would prefer a configuration file, but I find command line options superior for this sort of task as they are more explicit and more easily scripted.