CS 395T - Large-Scale Data Mining

Homework 2

Using MC

The main goal of this homework is to create vector space models of text using the Matrix Creation (MC) software.

Understand the CCS matrix format used by MC.
Download the MC code and compile it in single-threaded and multi-threaded modes.
Run MC on classic3 and cmu.news20_cleaned. (These data sets are available at /stage/projects4/cs395t_lsdm/Data.) Note: use the stopwords file /stage/projects4/cs395t_lsdm/Data/stopwords.
Experiment with different scaling schemes: txx, txn, tfn, lxn, and different pruning schemes: -l, -u, -ln, -un, -m, -M. Understand what they mean and see how they affect the number of words and matrix entries.
Read Section 3 of this paper to understand how MC works.
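
As background for the first task, the following is a minimal sketch of the general compressed column storage (CCS) idea: only the nonzero entries are stored, column by column, together with their row indices and a pointer array marking where each column starts. The function name to_ccs and the toy word-by-document counts are illustrative assumptions; the exact files and ordering that MC writes are not described here and should be checked against the MC documentation.

```python
def to_ccs(dense):
    """Convert a dense matrix (list of rows) into three CCS arrays.

    values  : nonzero entries, listed column by column
    row_idx : row index of each entry in `values`
    col_ptr : col_ptr[j] is the position in `values` where column j starts;
              the final element equals the total number of nonzeros
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr


# Toy example (made-up counts): rows are words, columns are documents.
A = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
]
values, row_idx, col_ptr = to_ccs(A)
print(values)   # [1, 4, 5, 2, 3]
print(row_idx)  # [0, 2, 2, 0, 1]
print(col_ptr)  # [0, 2, 3, 5]
```

Because documents are columns in this sketch, the entries of document j are values[col_ptr[j]:col_ptr[j+1]], which is why the format suits sparse word-by-document matrices.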

Answer the following questions:

1. Explain any 3 of the scaling schemes and any 3 of the pruning schemes you used.
2. Submit a table that shows the number of words, the running time, and the main memory consumed for classic3 and cmu.news20_cleaned when you use tfn scaling, a single thread, and l = 0.02*GroupID, 0.05*GroupID with u = 15, 50.

The table should look like this:

pruning scheme	number of words	running time	memory consumed
l=0.2*ID,u=15
l=0.2*ID,u=50
l=0.5*ID,u=15
l=0.5*ID,u=50

3. Run the multi-threaded version with tfn scaling, l=0.2, and u=15. Try p = 2, 4, 8, and 16 threads and measure the speed-up. Report your results.
4. Is MC producing the correct result? How can you tell?
5. Can you get MC to recognize email addresses and URLs?
6. Suggest improvements to MC.
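
For question 3, one common way to report speed-up is the ratio of the single-thread running time to the p-thread running time, with efficiency defined as speed-up divided by p. The sketch below assumes you have already measured wall-clock times yourself; the function name speedup_table is hypothetical, and the numbers in the example are placeholders, not measurements of MC.

```python
def speedup_table(times_by_threads):
    """Return (p, speedup, efficiency) rows from measured running times.

    times_by_threads maps a thread count p to a running time in seconds
    and must include the single-thread baseline under key 1.
    """
    t1 = times_by_threads[1]
    rows = []
    for p in sorted(times_by_threads):
        s = t1 / times_by_threads[p]   # speed-up relative to one thread
        rows.append((p, s, s / p))     # efficiency = speed-up per thread
    return rows


# Placeholder timings only; replace with your own measurements.
placeholder_times = {1: 8.0, 2: 4.0, 4: 4.0}
for p, s, e in speedup_table(placeholder_times):
    print(f"p={p:2d}  speedup={s:.2f}  efficiency={e:.2f}")
```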

Zipf's Law

7. Suppose that a generator randomly produces characters from the 26 letters of the English alphabet and the blank, i.e., each of the 27 symbols is produced with equal probability. A word is a sequence of letters terminated by a blank. Do the generated words obey Zipf's law? Give reasons for your answer (experimental or theoretical).
8. Give a plot showing Zipf's law for one of the above runs (on classic3 or cmu.news20_cleaned). HINT: Information about word counts is returned in the classic3_words and cmu.news20_cleaned_words files.
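
For an experimental approach to question 7, a minimal simulation sketch follows. It draws characters uniformly from the 27 symbols, splits the stream into words, and returns (rank, frequency) pairs that can be plotted on log-log axes. The function names, the seed, and the choice to discard empty words (produced by consecutive blanks) are assumptions of this sketch, not part of the assignment. The same rank_frequency helper could be applied to counts read from the MC word files, whose format is not described here.

```python
import math
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the blank


def random_words(n_chars, seed=0):
    """Generate n_chars uniformly random symbols and split them into words."""
    rng = random.Random(seed)
    text = "".join(rng.choice(ALPHABET) for _ in range(n_chars))
    # split() drops empty strings caused by runs of blanks (an assumption);
    # a trailing word not yet ended by a blank is also counted.
    return text.split()


def rank_frequency(words):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent word."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


pairs = rank_frequency(random_words(200_000))
# Log-log coordinates: Zipf's law would appear as a roughly straight line.
log_points = [(math.log(r), math.log(f)) for r, f in pairs]
for r, f in pairs[:5]:
    print(r, f)
```

Plotting log_points with any plotting tool gives the rank-frequency curve for the random text.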

Due date: Oct. 2, 2001