Data Compression

This exercise covers file compression using key-word encoding. The following files for this exercise are in the same directory.

wordList.cpp A file containing a C++ program that produces a list of the unique words in a file and the number of times each appears.

words.dat The output from program WordList with the words sorted by number of occurrences.

history.in A data file containing 3436 non-blank characters; it was the input to program WordList.

Examine file "words.dat" and determine which words are appropriate to use in a key-word encoding scheme.
What symbol would you assign to each word?
Calculate how many characters you would save by using the key-word encoding. Calculate the compression ratio.
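
The arithmetic above can be sketched in C++. This is only a sketch under stated assumptions: each chosen word is replaced by a one-character symbol, and the compression ratio is taken as compressed size divided by original size (use your course's definition if it differs). The names WordCount, charactersSaved, and compressionRatio are hypothetical and do not come from wordList.cpp.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One entry of a word list: a word and how many times it occurs.
struct WordCount {
    std::string word;
    int count;
};

// Characters saved when every occurrence of each chosen word is replaced
// by a one-character symbol (assumption: symbols are one character long).
long charactersSaved(const std::vector<WordCount>& chosen) {
    long saved = 0;
    for (const WordCount& wc : chosen) {
        if (wc.word.empty()) continue;  // nothing to save on an empty word
        // Each occurrence shrinks from word.size() characters to 1.
        saved += (static_cast<long>(wc.word.size()) - 1) * wc.count;
    }
    return saved;
}

// Compression ratio, assumed here to be compressed size / original size.
double compressionRatio(long originalChars, long saved) {
    assert(originalChars > 0 && saved >= 0 && saved <= originalChars);
    return static_cast<double>(originalChars - saved) / originalChars;
}
```

With made-up counts (not taken from words.dat), a word list of {"the", 10} and {"and", 5} gives charactersSaved equal to 30. For history.in, the original size passed to compressionRatio would be 3436.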

Program WordList is case sensitive; words beginning with an uppercase letter are considered different from the same word beginning with a lowercase letter.

Look carefully at program WordList. One small change would let the program ignore case. If line 160 is changed as follows, all letters are stored as lowercase:

          letters[count] = tolower(letter);

"tolower" is a function that converts an uppercase letter to lowercase and returns any other character unchanged, so every letter is lowercase before it is stored in letters.
This change was made, the program was rerun, the results were ordered by frequency, and the output was saved as wordslc.dat.
Calculate how many additional characters are saved if case is ignored, and recalculate the compression ratio.
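
To see the effect of tolower on a whole word, here is a minimal sketch. The helper toLowerWord is hypothetical and is not part of wordList.cpp; it applies the same per-character conversion shown in the modified line. The cast to unsigned char is a general precaution when calling std::tolower on a plain char.

```cpp
#include <cctype>
#include <string>

// Returns a copy of word with every uppercase letter converted to lowercase.
// Non-letters (digits, punctuation) are returned unchanged.
std::string toLowerWord(std::string word) {
    for (char& ch : word) {
        // Cast to unsigned char first: passing a negative char value to
        // std::tolower is undefined behavior.
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return word;
}
```

After this conversion, "The" and "the" compare equal, so their occurrences are counted as one word.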

Program WordList ignores words of fewer than three characters. Would it be better to ignore words of fewer than four characters? Recalculate the compression ratio when words of fewer than four characters are not encoded.
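
One way to compare the two cutoffs is to total the savings for a given minimum word length. The sketch below assumes one-character symbols and a word list held as (word, count) pairs; the function savedWithMinLength is hypothetical and is not taken from wordList.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Characters saved if only words of at least minLength characters are
// encoded, each as a one-character symbol (assumption).
long savedWithMinLength(const std::vector<std::pair<std::string, int>>& counts,
                        std::size_t minLength) {
    assert(minLength >= 1);
    long saved = 0;
    for (const auto& entry : counts) {
        if (entry.first.size() < minLength) continue;  // too short to encode
        // Each occurrence shrinks from the word's length to 1 character.
        saved += (static_cast<long>(entry.first.size()) - 1) * entry.second;
    }
    return saved;
}
```

Calling the function with minLength 3 and then 4 on the same list shows how much of the total saving comes from three-letter words. This sketch does not model any limit on the number of symbols available, which may also affect which cutoff is better.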