CS378 Homework #5


Due: In class, Thursday November 16

This homework is out of 40 points and must be completed individually (i.e., not with your project partner).

Netflix Data Analysis

For this homework, you are to complete the project progress requirements as detailed in the project suggestions document. You must first download the Netflix dataset (available here). You must then compute each of the following five properties of the dataset:

  1. The average review score across all reviewers in the dataset.
  2. A histogram showing the distribution of these scores. For each possible score (i.e., 1 through 5), you should count the number of reviews. Your histogram is simply a plot showing the number of reviews for each possible review value.
  3. Top 10 most highly rated movies. For each of these movies, you should look up the movie name in the movie title file (detailed in the readme file for the dataset). Also, compute the number of reviews and the variance of the review scores for each of these movies. Are your results reasonable? Why or why not?
  4. Number of reviews as a function of time. The time should be discretized at the month level. You should provide two plots of this relation. The first plot should use normal, linearly spaced axes. The second plot should use a log scale on the vertical axis (i.e., the log of the number of reviews as a function of time). Which plot is more appropriate here? Why?
  5. Find the ID of the reviewer whose distribution of review scores has the highest entropy.
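
As a rough illustration of the bookkeeping behind the five properties above, here is a minimal Python sketch. It is not a reference solution. It assumes the reviews have already been parsed into an in-memory list of (movie_id, user_id, rating, "YYYY-MM-DD") tuples; that layout and every function name below are hypothetical, so check the dataset readme for the actual file format. The variance is the population variance and the entropy is Shannon entropy in bits. Both are choices made for this sketch and should be stated in your writeup.

```python
import math
from collections import Counter, defaultdict

# Assumed (hypothetical) record layout: (movie_id, user_id, rating, "YYYY-MM-DD").

def mean_score(reviews):
    """Property 1: average rating over all reviews."""
    return sum(r[2] for r in reviews) / len(reviews)

def score_histogram(reviews):
    """Property 2: number of reviews for each possible score 1..5."""
    counts = Counter(r[2] for r in reviews)
    return {score: counts.get(score, 0) for score in range(1, 6)}

def top_rated_movies(reviews, n=10):
    """Property 3: (movie_id, mean, review count, population variance),
    sorted by mean rating, highest first."""
    per_movie = defaultdict(list)
    for movie_id, _, rating, _ in reviews:
        per_movie[movie_id].append(rating)
    stats = []
    for movie_id, ratings in per_movie.items():
        mean = sum(ratings) / len(ratings)
        variance = sum((x - mean) ** 2 for x in ratings) / len(ratings)
        stats.append((movie_id, mean, len(ratings), variance))
    return sorted(stats, key=lambda s: s[1], reverse=True)[:n]

def monthly_counts(reviews):
    """Property 4: number of reviews per month, keyed by "YYYY-MM"."""
    counts = Counter(r[3][:7] for r in reviews)
    return dict(sorted(counts.items()))

def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def highest_entropy_reviewer(reviews):
    """Property 5: reviewer whose rating distribution has the highest entropy.
    Ties go to whichever reviewer appears first in the data."""
    per_user = defaultdict(Counter)
    for _, user_id, rating, _ in reviews:
        per_user[user_id][rating] += 1
    return max(per_user, key=lambda u: entropy(per_user[u].values()))
```

Note that top_rated_movies ranks by raw mean, so a movie with a single 5-star review can outrank a movie with thousands of reviews. Reporting the review count and variance alongside the mean is what lets you judge whether the ranking is reasonable.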

In addition to the five properties given above, you should also investigate five other interesting properties of the dataset that are relevant to your project. Be sure to describe how each of these properties is relevant to your project. At least two of these properties must involve plotting a relation between two variables (i.e., similar to properties 2 and 4 listed above). You should provide a clear writeup describing each of these properties. In addition to turning in this writeup in class, you should also put a copy online (perhaps in the public_html directory of your CS account). Send a link to jdavis@cs.utexas.edu. Your writeup will then be made available for the rest of the class to use as a reference for the project.

Make sure you turn in any code you used to compute these properties and generate the plots. The coding problems may be implemented in the language of your choice. Print out all the code you wrote and attach it to your homework solutions. Code must be clearly written and well commented to receive full credit.