CS303E Homework 11

Instructor: Dr. Bill Young
Due Date: Friday, November 8, 2024 at 11:59pm

Text Mining

From Wikipedia:
Text mining, text data mining or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."
A typical analysis task is taking the text of a document and counting the total number of words, counting the number of unique words, computing the frequency of individual words (possibly excluding some very common words), etc. Before computers, such work was extremely tedious, and scholars spent years performing such analysis on popular books such as the Bible or Homer's Iliad. With computers, it's something that any beginning programmer can do (even in a weekly project).

Your Assignment

In this assignment, you'll write some tools for doing some text analysis on a large text stored in a file. On this text you'll write functions to collect the following statistics:
  1. the number of total words;
  2. the number of unique words;
  3. the number of times each word occurs in the text, excluding some very common words;
  4. the k words that occur most often in the text;
  5. the k longest words in the text;
  6. the k shortest words in the text.
These aren't necessarily the most useful stats, but they'll give you practice in text analysis. You'll apply them to texts stored in files of various sizes, printing a nicely formatted report for each file.

This assignment will involve three tasks:

Task 1: Building a Word Frequency Dictionary. In this task you'll write a function to build a dictionary that associates each word from the text with the number of times that word appears, except that some very common words such as 'the' and 'is' will be excluded. Here is the header for your function:

def createDictionary( filename ):
    """Create a dictionary associating each word in a text file with the
    number of times the word occurs.  Also count the total number of
    words and the number of unique words in the text.  Certain very
    common words are not included in the dictionary, but are counted.
    Return a triple: (wordCount, uniqueWordCount, dictionary)."""
Your function should perform the following steps:
  1. The filename is passed as a parameter; for this step you can assume the file exists and contains only ASCII text.
  2. Open the file for reading.
  3. Read the file line by line, splitting each line into words.
  4. Lowercase each word; if it isn't an excluded word, insert it as a key in the dictionary (if it isn't already there) and increment the count for that word.
  5. As you're reading words keep a count of total words and unique words; include excluded words in the total count and unique words count, but don't put them into the dictionary.
  6. Close the file.
  7. Return a triple of the count of total words, count of unique words, and the dictionary.
In addition to the dictionary, you'll need to keep a set of the excluded words that you've encountered in the text. Remember that you need to keep track of which words you've seen before so that you don't count them again in the count of unique words. You've seen a word before if it's a key in the dictionary or is in the set of excluded words previously encountered. If you see a new non-excluded word you'll add it to the dictionary with a count of 1; if you see a new excluded word, you'll add it to your set of previously seen excluded words. In either case, you'll increment the count of words and of unique words.
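The steps above can be sketched as follows. This is only a starting point, not a finished solution: the excluded-word list is abbreviated here (you must use the full list given below), and the variable names are illustrative.

```python
EXCLUDED = ['a', 'and', 'is', 'the']   # abbreviated; use the full list below

def createDictionary(filename):
    """Create a dictionary associating each non-excluded word with the
    number of times it occurs.  Return a triple:
    (wordCount, uniqueWordCount, dictionary)."""
    wordCount = 0
    seenExcluded = set()     # excluded words already encountered
    counts = {}
    infile = open(filename, 'r')
    for line in infile:
        for word in line.split():
            word = word.lower()
            wordCount += 1
            if word in EXCLUDED:
                seenExcluded.add(word)   # counts toward unique words
            elif word in counts:
                counts[word] += 1
            else:
                counts[word] = 1         # first occurrence of this word
    infile.close()
    # every dictionary key and every seen excluded word is one unique word
    uniqueWordCount = len(counts) + len(seenExcluded)
    return (wordCount, uniqueWordCount, counts)
```

Note that this version computes the unique-word count at the end from the sizes of the dictionary and the set; keeping a running counter as described in the steps above gives the same result.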

To get the words on each line use line.split(). This isn't ideal since it splits on whitespace and doesn't take punctuation into account. That means that "word" will be different from "word!" or "word," and that numbers and symbols will be treated as words. Of course, it would be possible to recognize only alphabetic strings, but that's beyond the scope of this assignment. For the two files provided, I've removed all punctuation (I think).
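For instance, split() breaks on whitespace only, so punctuation stays attached to words:

```python
line = "He said hello  He said, hello!"
# split() collapses runs of whitespace but keeps punctuation attached
print(line.split())
# ['He', 'said', 'hello', 'He', 'said,', 'hello!']
```

Even after lowercasing, 'said' and 'said,' would be counted as different words, which is why punctuation was stripped from the provided files.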

Here are the words to exclude when creating your dictionary:

['a', 'about', 'after', 'all', 'also', 'am', 'an', 'and', 'any',
 'are', 'as', 'at', 'back', 'be', 'because', 'but', 'by', 'can',
 'come', 'could', 'day', 'do', 'even', 'first', 'for', 'from', 'get',
 'give', 'go', 'good', 'had', 'have', 'he', 'her', 'him', 'his',
 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'know',
 'like', 'look', 'make', 'man', 'me', 'men', 'most', 'my', 'new',
 'no', 'not', 'now', 'of', 'on', 'one', 'only', 'or', 'other', 'our',
 'out', 'over', 'people', 'said', 'say', 'see', 'she', 'so', 'some',
 'take', 'than', 'that', 'the', 'their', 'them', 'then', 'there',
 'these', 'they', 'think', 'this', 'time', 'to', 'two', 'up', 'us',
 'use', 'want', 'was', 'way', 'we', 'well', 'went', 'were', 'what',
 'when', 'which', 'who', 'will', 'with', 'work', 'would', 'year',
 'you', 'your']
Don't change this list, because then your counts won't match ours.

Task 2: Writing Text Analysis Functions. To perform analyses on the dictionary you created in Task 1, you'll write the following functions:

def sortByFrequency( dict ):
    """Return a list of pairs of (count, word)
    sorted by count in descending order. I.e., 
    the most frequent word should be first in the
    list."""
    pass

# Think about how to use the function sortByFrequency
# for this one.
def mostFrequentWords( dict, k ):
    """Return a list of the k most frequently occurring 
    words."""
    pass

def sortByWordLength( dict ):
    """Return a list of pairs of (length, word)
    sorted by length in descending order. I.e.,
    the longest word should be first in the list."""
    pass

# Think about how to use the function sortByWordLength
# for this one.
def longestWords( dict, k ):
    """Return a list of the k longest words in the
    text."""
    pass

# Think about how to use the function sortByWordLength
# for this one.
def shortestWords( dict, k ):
    """Return a list of the k shortest words in the
    text."""
    pass
Hint: Suppose L is a list of pairs of the form (num, word); you can sort them in reverse (descending) order with the command:
   L.sort( reverse = True )
This will sort them lexicographically, i.e., by the number first; when two numbers are equal, the tie is broken by the words, in reverse alphabetical order, since reverse=True reverses the entire comparison.
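A small illustration of the hint, using made-up pairs:

```python
L = [(3, 'cat'), (5, 'dog'), (3, 'ant')]
L.sort(reverse=True)     # compares the numbers first, then the words
print(L)
# [(5, 'dog'), (3, 'cat'), (3, 'ant')]
```

Notice that the two pairs with count 3 come out in reverse alphabetical order, because reverse=True applies to the whole comparison.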

Task 3: Writing a main() Function to Produce and Print Statistics on the Text. See the Sample Output below for what is required in your report. Your main() function should do the following:

  1. Accept a filename from the user. Validate that the file exists; if not, print an error message and terminate execution.
  2. Create your dictionary from the file. Save the dictionary, total words count, and unique words count.
  3. Print a report following the model in the Sample Output below. Note that you'll need to report the 10 most frequent words, 10 longest words, and 10 shortest words in the text. Be sure to follow the formatting in Sample Output.
Pay careful attention to the way in which the lists of words are formatted. The longest words are listed 5 per line; the other lists are 10 words per line. This will give you some more practice in formatting lists for printing.
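The file check and the per-line printing might be sketched as below. The printing helper here is only a rough shape; the exact leading spaces, brackets, and commas must match the Sample Output, so you'll need to adjust it.

```python
import os

def printPerLine(words, perLine):
    """Print the words in 'words', at most perLine per line,
    separated by commas.  (Brackets and exact spacing from the
    Sample Output are omitted in this sketch.)"""
    for i in range(0, len(words), perLine):
        print("    " + ", ".join(words[i:i + perLine]))

def main():
    filename = input("Enter a filename: ")
    if not os.path.isfile(filename):      # validate before opening
        print("File does not exist.")
        return
    # ... create the dictionary and print the report here ...
```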

Two files on which to test your code are here: MLK's I Have a Dream Speech and Homer's Odyssey. Note that I have removed punctuation from both files. The Odyssey is also in lowercase.

Sample Output:

> python AnalyzeText.py
Enter a filename: NoSuchFile.txt
File does not exist.
> python AnalyzeText.py 
Enter a filename: MLKDreamSpeech.txt

Text analysis of file: MLKDreamSpeech.txt
  Total word count:  1629
  Unique word count: 537
  10 most frequent words:
   [ freedom, negro, let, ring, dream, every, nation, today, satisfied, must ]
  10 longest words:
   [ discrimination, tranquilizing, righteousness, nullification, interposition, 
     tribulations, proclamation, pennsylvania, mountainside, invigorating ]
  10 shortest words:
   [ jr, 100, ago, bad, cup, end, god, has, hew, let ]

> python AnalyzeText.py 
Enter a filename: wordsFromOdyssey.txt

Text analysis of file: wordsFromOdyssey.txt
  Total word count:  118069
  Unique word count: 6416
  10 most frequent words:
   [ ulysses, house, has, own, son, did, upon, telemachus, tell, been ]
  10 longest words:
   [ straightforwardly, inextinguishable, notwithstanding, extraordinarily, disrespectfully, 
     accomplishments, pyriphlegethon, laestrygonians, interpretation, embellishments ]
  10 shortest words:
   [ o, v, x, ii, iv, ix, mt, oh, ox, re ]

>
BTW: the Odyssey is broken into "books" which are numbered with Roman numerals. That's why you see words like "v", "x", and "ix".

Turning in the Assignment:

The program should be in a file named AnalyzeText.py. Submit the file via Canvas before the deadline shown at the top of this page. Submit it to hw11 in the assignments section by uploading your Python file.

Your file must compile and run before submission. It must also contain a header with the following format:

# File: AnalyzeText.py
# Student: 
# UT EID:
# Course Name: CS303E
# 
# Date:
# Description of Program: 

Programming Tips:

Functional Abstraction: Remember that functions are a type of abstraction. Defining a function extends your toolkit with a new operation that may not be provided natively by the programming language. If you construct the right set of functions, you can do things that would be very difficult "from scratch." Think how much harder arithmetic computations in Python would be if someone hadn't already provided the math module for you to import.

In Task 2 above, you're asked to write five functions. Once you've written sortByFrequency and sortByWordLength, the other three are really pretty simple; if you had to write them from scratch, you'd be hard pressed.
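For example, mostFrequentWords falls out of sortByFrequency in one line. This sketch assumes a dictionary d mapping words to counts, as built in Task 1:

```python
def sortByFrequency(d):
    """Return a list of (count, word) pairs, most frequent first."""
    pairs = [(count, word) for (word, count) in d.items()]
    pairs.sort(reverse=True)
    return pairs

def mostFrequentWords(d, k):
    """Return the k most frequently occurring words."""
    # take the first k pairs and keep just the words
    return [word for (count, word) in sortByFrequency(d)[:k]]
```

longestWords and shortestWords can reuse sortByWordLength in exactly the same way.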

New programmers often avoid writing functions, thinking it's just a waste of time. Considering what functions to define is often one of the most useful and efficient things you can do when writing a program of any size. Without functional abstraction, it would be almost impossible to have ever written the huge programming systems that power our lives.