CS 395T: Large-Scale Data Mining

Homework 2

Using MC

    The main goal of this homework is to create vector space models of text using the Matrix Creation (MC) software.


    Answer the following questions:

1. Explain any 3 of the scaling schemes and any 3 of the pruning schemes you used.
2. Submit a table that shows the number of words, the running time, and the main memory consumed for classic3 and cmu.news20_cleaned when you use tfn scaling, a single thread, and l = 0.02*GroupID, 0.05*GroupID with u = 15, 50.
The table should look like this:

pruning scheme      number of words   running time   memory consumed
l=0.2*ID, u=15
l=0.2*ID, u=50
l=0.5*ID, u=15
l=0.5*ID, u=50
3. Run the multi-threaded version with tfn scaling, l=0.2, and u=15. Try p = 2, 4, 8, and 16 threads, measure the speed-up relative to the single-threaded run, and report your results.
4. Is MC producing the correct result? How can you tell?
5. Can you get MC to recognize email addresses and URLs?
6. Suggest improvements to MC.
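
For question 3, speed-up is conventionally reported as the single-thread running time divided by the p-thread running time, and efficiency as speed-up divided by p. The following is a minimal sketch of that arithmetic. The function name and the timings are hypothetical placeholders, not MC measurements; substitute the times you record.

```python
def speedup_table(t1, timings):
    """Return (p, time, speed-up, efficiency) rows.

    t1      -- running time of the single-threaded run, in seconds
    timings -- dict mapping thread count p to running time in seconds
    """
    rows = []
    for p in sorted(timings):
        tp = timings[p]
        s = t1 / tp               # speed-up relative to one thread
        rows.append((p, tp, s, s / p))  # efficiency = speed-up / p
    return rows


# Placeholder timings in seconds (hypothetical, NOT real MC measurements).
example_timings = {2: 6.0, 4: 4.0, 8: 3.0, 16: 3.0}
for p, tp, s, e in speedup_table(12.0, example_timings):
    print(f"p={p:2d}  time={tp:6.2f}s  speed-up={s:5.2f}  efficiency={e:5.2f}")
```

Reporting efficiency alongside speed-up makes it easier to see where adding threads stops paying off.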
 
Zipf's Law
 
7. Suppose that a generator randomly produces characters from the 26 letters of the English alphabet and the blank, i.e. each of the 27 symbols is produced with equal probability. A word is a sequence of letters terminated by a blank. Do the generated words obey Zipf's law? Give reasons for your answer (experimental or theoretical).
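
One way to approach question 7 experimentally is to simulate the generator, count word frequencies, and look at frequency against rank on a log-log scale. The sketch below is an illustration under stated assumptions: the function names, the sample size, and the fixed seed are arbitrary choices, and empty words (produced by consecutive blanks) are dropped. It only estimates a least-squares slope; interpreting that slope is left to the answer.

```python
import math
import random
from collections import Counter

SYMBOLS = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the blank


def generate_words(n_chars, seed=0):
    """Draw n_chars symbols uniformly at random and split them into words."""
    rng = random.Random(seed)
    text = "".join(rng.choice(SYMBOLS) for _ in range(n_chars))
    return text.split()  # runs of blanks produce no empty words


def rank_frequency(words):
    """Return [(rank, frequency), ...] with rank 1 the most frequent word."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


words = generate_words(200_000)
rf = rank_frequency(words)
print("distinct words:", len(rf))
print("log-log slope :", loglog_slope(rf))
```

Because all words of the same length are equally likely, the rank-frequency curve from this generator tends to look like a staircase, so a single fitted slope should be read with some care.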

8. Give a plot showing Zipf's law for one of the above runs (on classic3 or cmu.news20_cleaned). HINT: Information about word counts is returned in the classic3_words and cmu.news20_cleaned_words files.
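
A Zipf plot for question 8 is usually a log-log plot of word frequency against rank. The sketch below shows one possible way to produce it. The layout of the _words files is an assumption here: each line is taken to hold whitespace-separated fields with an integer count in the last field, so check the actual file and adjust count_field accordingly. The function names and the output file name are hypothetical, and matplotlib is assumed to be installed.

```python
def read_counts(lines, count_field=-1):
    """Extract integer counts from lines and return them in decreasing order.

    ASSUMPTION: the count is the whitespace-separated field at index
    count_field (default: last field). Lines that do not fit are skipped.
    """
    counts = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            counts.append(int(fields[count_field]))
        except (ValueError, IndexError):
            continue  # header or malformed line
    return sorted(counts, reverse=True)


def plot_zipf(counts, out_path="zipf.png"):
    """Save a log-log plot of frequency against rank to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    counts = [c for c in counts if c > 0]  # log scale needs positive values
    ranks = range(1, len(counts) + 1)
    plt.figure()
    plt.loglog(ranks, counts, marker=".", linestyle="none")
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title("Rank-frequency plot")
    plt.savefig(out_path)
    plt.close()


# Example usage (file name taken from the hint above):
# with open("classic3_words") as f:
#     plot_zipf(read_counts(f))
```

If the words follow Zipf's law, the points should fall close to a straight line on these axes.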

Due date: Oct. 2, 2001