CS380L: Advanced Operating Systems

Lab #1

Program measurement and mmap

This lab requires access to the PMU on a Linux machine. Cloud VMs and the UTCS machines will probably not work. If you do not have access to a machine, you can use one through CloudLab (detailed instructions here).

The goal of this assignment is to learn about how to measure your program's behavior, and a bit about mmap.

We are interested in doing experimental computer science. We will follow the scientific method, which wikipedia tells me has been around for a long time. But knowing about science in the abstract is relatively easy; actually doing good science is difficult both to learn and to execute.

Let's start with reproducibility. You will write a report for this lab, and in your report you will include details about your system. Think about what it would take to recreate your results. I won't spell out exactly what information you should include, but please include everything relevant while not bogging down your reader. You should report things like the kernel version for your host and guest system. If you used CloudLab, include details about the hardware of the machine type you used.

Your report should answer every question in this lab and should do so in a way that is clearly labeled.

This lab requires you to report data. But how you report it is up to you. Please think carefully. I do not want a huge table of numbers for every experiment. You can provide a table, discuss what is and is not relevant, then give me a graph, or a small table, or bold the entries that I should pay attention to (and that you explain in your text).

You will likely be recording and analyzing a lot of data for this lab. Here is a very simple C++ library to help keep track of and summarize this data (and here is an example program). Feel free to use it (or not!).

I have a major pet peeve with excessive digits of precision. Your measurements are usually counts. If you average three counts, don't give me six decimal places of precision even if six digits is the default output format for floats in the language you are using. Decide how many digits are meaningful and then report that many. Also, make your decimal points line up when that makes sense. For example, if you report a mean and a standard deviation, make the decimal places always align so you can see easily if the standard deviation is less than a tenth of the mean (which is a good sign for reproducibility).

I would use C or C++, but you can use whatever programming tools you want. One thing I want you to do, both for this class and for real life, is always check the return code of every single system call you ever make. I know it sounds a bit pedantic, but start the habit now and you will have a happier programming life. For almost every system call, all that means is checking whether the return code is less than zero and, if so, calling perror. When system calls don't work, you really want to know about it early; trust me on this point.
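
For example, here is a minimal sketch of the habit (the close(42) call is a deliberate failure for illustration; note that a few calls, such as mmap, signal failure with a sentinel other than a negative return):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
   // The usual pattern: a negative return means failure; perror says why.
   if (close(42) < 0) {
      perror("close");   // prints something like "close: Bad file descriptor"
   }
   // mmap is an exception: it reports failure with MAP_FAILED, not -1.
   void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   if (p == MAP_FAILED) {
      perror("mmap");
      exit(1);
   }
   return 0;
}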

Tasks

Please include text about each experiment called for in this section, and include any graphs, tables or other data to make your point clearly and concisely. As always, report your experimental platform. You should find out what kind of CPU you have, its level 1 data cache capacity and associativity, and its data TLB capacity and associativity. Does your TLB have a second level? If so, describe it. Put all this information about your hardware in your report.

Memory map

Your first task will be to write a program that opens, reads, and prints the /proc/self/maps file. Put the result of your program in your report and write a few sentences about what it means. I want you to look at the output, read the documentation to understand its fields, then think about the most interesting thing you found in your output and write a few sentences about it. Also report the base address of your executable (the start of the text section) and the start address of libc. Why are these numbers so different?
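
A minimal sketch of one way to do it, using raw read and write so the output comes straight from the kernel:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
   int fd = open("/proc/self/maps", O_RDONLY);
   if (fd < 0) { perror("open"); exit(1); }
   char buf[4096];
   ssize_t n;
   while ((n = read(fd, buf, sizeof(buf))) > 0) {
      if (write(STDOUT_FILENO, buf, n) != n) { perror("write"); exit(1); }
   }
   if (n < 0) { perror("read"); exit(1); }
   if (close(fd) < 0) { perror("close"); exit(1); }
   return 0;
}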

Getrusage

Next, call getrusage at the end of your program and print out the fields. Pay particular attention to utime, stime, maxrss, minflt, majflt, inblock, oublock, voluntary and involuntary context switches.
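
A minimal sketch of the call (the helper name print_rusage is ours; on Linux, ru_maxrss is reported in kilobytes, per getrusage(2)):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

void print_rusage(void) {
   struct rusage ru;
   if (getrusage(RUSAGE_SELF, &ru) < 0) { perror("getrusage"); exit(1); }
   printf("utime:   %ld.%06ld s\n", ru.ru_utime.tv_sec, ru.ru_utime.tv_usec);
   printf("stime:   %ld.%06ld s\n", ru.ru_stime.tv_sec, ru.ru_stime.tv_usec);
   printf("maxrss:  %ld KB\n", ru.ru_maxrss);
   printf("minflt:  %ld  majflt:  %ld\n", ru.ru_minflt, ru.ru_majflt);
   printf("inblock: %ld  oublock: %ld\n", ru.ru_inblock, ru.ru_oublock);
   printf("nvcsw:   %ld  nivcsw:  %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
}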

perf_event_open

You will use the perf_event_open interface to measure your code. You can do this in a VM or not, but please specify in your report what you have done. Note that you have to configure your kernel to support perf, and you might need to build the perf tools yourself if you are booting the latest kernel (as I requested in Lab 0, though that is not 100% necessary for this lab). Read the documentation. It is pretty wild how you call it, right? Does using the syscall routine mean there is a syscall opcode in your program? Check using objdump -d and explain what you find in your report.

We want you to monitor the following events: level 1 data cache accesses and misses, and data TLB misses. Figuring out how to encode those events into perf_event_open is not trivial, so use the force (and the Internet). Note the division of events into read, write and prefetch. According to my experiments, the level 1 data cache experiences many misses due to prefetch, but the counters do not report prefetch accesses or prefetch data TLB misses (the latter likely because a prefetch that would cause a TLB miss is dropped).
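
As a starting point, here is a minimal sketch of opening a counter for level 1 data cache read misses through the raw syscall, following the encoding described in perf_event_open(2); the helper name open_counter is ours, and you will need similar calls for the other op/result combinations and for the DTLB events:

#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open one hardware cache counter for the calling process, on any CPU.
static int open_counter(unsigned long long config) {
   struct perf_event_attr attr;
   memset(&attr, 0, sizeof(attr));
   attr.type = PERF_TYPE_HW_CACHE;
   attr.size = sizeof(attr);
   attr.config = config;
   attr.disabled = 1;        // start disabled; enable around the code under test
   attr.exclude_kernel = 1;  // count user-mode events only
   attr.exclude_hv = 1;
   int fd = syscall(__NR_perf_event_open, &attr, 0 /* pid: self */,
                    -1 /* cpu: any */, -1 /* group fd */, 0 /* flags */);
   if (fd < 0) { perror("perf_event_open"); exit(1); }
   return fd;
}

int main(void) {
   // config = cache id | (op id << 8) | (result id << 16).
   int fd = open_counter(PERF_COUNT_HW_CACHE_L1D |
                         (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                         (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
   close(fd);
   return 0;
}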

Depending on the machine you use, you may find that there is some nuance to interpreting the behavior of perf_event_open and the data you subsequently read for events you create: the implementation is highly architecture-dependent. In keeping with the notes below, perf list will definitely give you some good hints. However, the bottom line is always the code: if things don't make sense, you may need to read the kernel source. LXR is usually a good place to start. You may not be able to report all the metrics we ask for exactly as we define them; this is fine as long as you clearly document the limitations and how you worked around them.

Allocate a 1GB buffer, and pass the pointer to this routine (I'm assuming your host and VM have at least 2GB of memory). This routine assumes a global variable called opt_random_access. It also assumes your code defines the constant CACHE_LINE_SIZE, which is 64 on x86 platforms (and you are allowed to hard code this constant). This code represents the behavior of a program that we wish to study. Briefly summarize its memory access behavior in your report.

// p points to a region that is 1GB (ideally)
void do_mem_access(char* p, int size) {
   int i, outer, locality;
   int ws_base = 0;
   int max_base = ((size / CACHE_LINE_SIZE) - 512);
   for(outer = 0; outer < (1<<20); ++outer) {
      long r = simplerand() % max_base;
      if(r < 0) r += max_base; // simplerand can return negative values
      // Pick a starting offset
      if( opt_random_access ) {
         ws_base = r;
      } else {
         ws_base += 512;
         if( ws_base >= max_base ) {
            ws_base = 0;
         }
      }
      for(locality = 0; locality < 16; locality++) {
         volatile char *a;
         char c;
         for(i = 0; i < 512; i++) {
            // Working set of 512 cache lines, 32KB
            a = p + (ws_base + i) * CACHE_LINE_SIZE;
            if((i%8) == 0) {
               *a = 1;
            } else {
               c = *a;
            }
         }
      }
   }
}

This simple random number generator comes from wikipedia. We give it to you so you can observe it using a simple profiler. Note that our code calls this function whether or not it is doing a random traversal. Explain why in your report.

///////////////////////////////////////////////////////////////
// Simple, fast random number generator, here so we can observe it using profiler
long x = 1, y = 4, z = 7, w = 13;

long simplerand(void) {
   long t = x;
   t ^= t << 11;
   t ^= t >> 8;
   x = y;
   y = z;
   z = w;
   w ^= w >> 19;
   w ^= t;
   return w;
}

Enable your tracing events right before calling do_mem_access, then disable the events, read the results, and report them along with the getrusage results. You should compute and report the cache miss rate and the data TLB miss rate as percentages (100.0 * cache_misses / cache_accesses and 100.0 * tlb_misses / cache_accesses). Also look at the counts themselves.
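
A minimal sketch of that enable/disable/read sequence, assuming fd came from a perf_event_open call like the one sketched above (the helper name measure is ours):

#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

void do_mem_access(char* p, int size);  // from this lab

// Reset and enable one counter, run the workload, then disable and read it.
long long measure(int fd, char* p, int size) {
   long long count;
   if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) < 0)   { perror("ioctl RESET");   exit(1); }
   if (ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) < 0)  { perror("ioctl ENABLE");  exit(1); }
   do_mem_access(p, size);
   if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) < 0) { perror("ioctl DISABLE"); exit(1); }
   if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); exit(1); }
   return count;
}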

To be even more careful about generating repeatable results, you should flush the level 1 data cache before enabling the performance counters. You can do this by reading (or writing, if your cache does write allocate) a memory buffer that is larger than your cache size. Also, lock your process onto a single processor and describe the system calls you need to do this in your report.
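
For the pinning, here is a minimal sketch using sched_setaffinity (the helper name and the choice of CPU 0 are ours):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

// Pin the calling process to CPU 0 (pid 0 means "the calling thread").
void pin_to_cpu0(void) {
   cpu_set_t set;
   CPU_ZERO(&set);
   CPU_SET(0, &set);
   if (sched_setaffinity(0, sizeof(set), &set) < 0) {
      perror("sched_setaffinity");
      exit(1);
   }
}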

Measuring memory access behavior

This is the main part of your lab. Perform the following experiments. Allocate your buffer using mmap, both mapping anonymous memory and file-backed memory. Consider sequential and random access (these are properties of the do_mem_access code we supply). For file-based mmap consider MAP_PRIVATE and MAP_SHARED. Also consider MAP_POPULATE. Consider the case when you memset the entire buffer after allocating it. Call msync after the memset. What does that do? Remember to examine overall execution time and CPU utilization (often reported by the shell).
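
For reference, here is a minimal sketch of the two allocation flavors (the helper names, file name, and fixed flags are illustrative; you will vary MAP_PRIVATE/MAP_SHARED/MAP_POPULATE and the memset/msync steps per experiment):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (1L << 30)   // 1GB

// Anonymous memory.
char* map_anon(void) {
   char* p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   if (p == MAP_FAILED) { perror("mmap anon"); exit(1); }
   return p;
}

// File-backed memory: create the file, size it, then map it.
char* map_file(const char* path) {
   int fd = open(path, O_RDWR | O_CREAT, 0644);
   if (fd < 0) { perror("open"); exit(1); }
   if (ftruncate(fd, BUF_SIZE) < 0) { perror("ftruncate"); exit(1); }
   char* p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                  MAP_SHARED,   // or MAP_PRIVATE, optionally | MAP_POPULATE
                  fd, 0);
   if (p == MAP_FAILED) { perror("mmap file"); exit(1); }
   memset(p, 0, BUF_SIZE);
   if (msync(p, BUF_SIZE, MS_SYNC) < 0) { perror("msync"); exit(1); }
   return p;
}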

Perform all of the above experiments and study the data generated by getrusage and perf_event_open. Describe the results in your report using tables, graphs or whatever you think is appropriate. Explain the differences between experiments and why they happen. Determine if your results are stable by running your experiment several times and measuring the standard deviation of your results. Include the results of your analysis in your report.

Here is an example of the sort of question you should be asking yourself. Why can your data TLB miss count be lower than the total number of data pages your application accesses? Answer this question in your report.

Run your program under strace. Enter the output for arch_prctl in your report and explain what this system call does. Put the output of the system call that involves /etc/ld.so.preload in your report and explain what is going on.

You should do your measurements on an otherwise quiet system. However, for extra credit, you can study the effect of background activity. Explain clearly how you generated the background activity, what you expected, and what you found.

For the final section of the lab, we will add code that forks or clones a process that will compete for memory with our process under test. When your main program ends, be sure this competing process also dies. This can be accomplished in many different ways, several of which require three or fewer lines of code. For this part, please use the VM you built in Lab 0, configured with at least two virtual CPUs (and check that you have at least two physical CPUs in your host).

Write some code that forks or clones and then calls compete_for_memory. You should fork/clone before locking your program under test to a single core. Note that compete_for_memory calls get_mem_size(), a function you must provide that returns the size of physical memory in bytes. Because the purpose of this function is to compete for memory with the foreground process, it is OK if this number does not exactly match the amount of RAM in your system, but it should be close. It is NOT acceptable to hard code this number; you must write code to measure it at runtime. A sketch of one approach appears after the listing below.

int compete_for_memory(void* unused) {
   long mem_size = get_mem_size();
   int page_sz = sysconf(_SC_PAGE_SIZE);
   printf("Total memsize is %3.2f GBs\n", (double)mem_size/(1024*1024*1024));
   fflush(stdout);
   char* p = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                  MAP_NORESERVE|MAP_PRIVATE|MAP_ANONYMOUS, -1, (off_t) 0);
   if (p == MAP_FAILED) {
      perror("Failed anon MMAP competition");
      return -1;
   }

   long i = 0;
   while(1) {
      volatile char *a;
      long r = simplerand() % (mem_size/page_sz);
      if(r < 0) r += mem_size/page_sz; // simplerand can return negative values
      char c = 0;
      if( i >= mem_size/page_sz ) {
         i = 0;
      }
      // One read per page, plus a write on every eighth iteration
      //a = p + i * page_sz; // sequential access
      a = p + r * page_sz;
      c += *a;
      if((i%8) == 0) {
         *a = 1;
      }
      i++;
   }
   return 0;
}
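
Here is a minimal sketch of one approach, using sysconf to measure physical memory and a plain fork plus kill for cleanup (prctl(PR_SET_PDEATHSIG, ...) in the child is another option):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int compete_for_memory(void* unused);  // from this lab

// Physical memory in bytes, measured at runtime.
long get_mem_size(void) {
   long pages = sysconf(_SC_PHYS_PAGES);
   long page_sz = sysconf(_SC_PAGE_SIZE);
   if (pages < 0 || page_sz < 0) { perror("sysconf"); exit(1); }
   return pages * page_sz;
}

int main(void) {
   pid_t child = fork();
   if (child < 0) { perror("fork"); exit(1); }
   if (child == 0) {
      compete_for_memory(NULL);
      _exit(0);
   }
   // ... pin to a CPU and run the measured workload here ...
   if (kill(child, SIGKILL) < 0) perror("kill");       // ensure the competitor dies
   if (waitpid(child, NULL, 0) < 0) perror("waitpid");
   return 0;
}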

For your report: Why do we call fflush after the printf? Would this fflush be necessary if we fprintf'ed to stderr?

Now we will mess with the kernel's head (or more precisely, its LRU page replacement algorithm). These instructions are approximate since your version of the kernel might be different, but look in mm/vmscan.c for a function likely called shrink_page_list. In it, you will see a switch statement with a PAGEREF_ACTIVATE case, which is the case where the kernel sees the page has been recently accessed. In this case the kernel gotos activate_locked, but you will change it to do the same thing as the PAGEREF_RECLAIM case. Then evaluate what this change does to your test program when it has competition for memory. You can do this by having two identical VMs with different kernels, or one kernel that you dynamically configure with its default behavior and your modified behavior, perhaps controlled via the /proc file system.
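
As a rough illustration only (the exact code differs across kernel versions), the switch looks something like the following, and the change is to make the PAGEREF_ACTIVATE case proceed to the reclaim code instead of jumping to activate_locked:

switch (references) {
case PAGEREF_ACTIVATE:
   goto activate_locked;  /* default behavior: keep the recently used page */
   /* modified behavior: replace the goto above with "break;" so the page
    * falls out of the switch into the reclaim code below, just like the
    * PAGEREF_RECLAIM case */
case PAGEREF_KEEP:
   goto keep_locked;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
   ; /* try to reclaim the page below */
}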

Notes

Configure your VM with at least two virtual CPUs, but first confirm that your host system has at least two CPUs.

To allocate space in a file, look at the abysmally named ftruncate and fallocate. No matter what you use, call msync after your memset and report your findings.

The perf subsystem uses the performance monitoring unit (PMU), which is a set of hardware counters supported directly by your processor. If you look at dmesg output, it should show the PMU being initialized.

Check for perf availability in your host system before checking the guest. Most of the lab can be done on the host.

If you run perf list on the command line, it will tell you what counters are supported by your combination of hardware, OS and perf tools. You need at least level 1 data cache accesses and misses and data TLB misses. Some of the prefetch varieties may be unsupported.

Historically, VirtualBox has done a poor job of virtualizing the performance counters.

Only so many counters can be active at once (the exact limit depends on the hardware, OS, and perf tools). If you try to add too many, you might get errors.

I'm not sure it is necessary, but if you get a lot of variation in your results for the experiments that follow, you might want to disable CPU frequency scaling on your system. I would do this in the BIOS, but you can also try a user-level tool like cpufreq-selector, which lets you set the frequency directly (or perhaps its "-g performance" option would work, I'm not sure): https://manpages.ubuntu.com/manpages/hardy/man1/cpufreq-selector.1.html

Your code will have to run with many different configurations. Consider using getopt, or maybe you would prefer a configuration file, but I find command line options superior for this sort of task as they are more explicit and more easily scripted.

Please report how much time you spent on the lab.

Your code is subject to visual inspection and should be clean and readable. Points will be deducted if your code is too hard to follow. Your code should follow best practices, for example avoiding arbitrary constants and checking error codes. Seriously, check the return value of every system call you ever make.