For this assignment, we would like to study the representative works of two authors and ask ourselves the question "Whose vocabulary is more extensive?"
The authors we will look at are Charles Dickens and Thomas Hardy, and the representative novels we will analyze are A Tale of Two Cities and The Return of the Native, respectively. The steps in our analysis are as follows.
Step I: Go to Project Gutenberg and download the plain text versions of both books. Call the first book Tale.txt and the second book Return.txt. Open the books in a text editor and delete the preamble at the beginning of each book, and the license agreement and closing blurbs at the end. The first book in Tale.txt should begin and end with these lines:
A TALE OF TWO CITIES
A STORY OF THE FRENCH REVOLUTION
by Charles Dickens
...
it is a far, far better rest that I go to than I have ever known."

The second book in Return.txt should begin and end with these lines:

THE RETURN OF THE NATIVE
by Thomas Hardy
...
kindly received, for the story of his life had become generally known.
Step II: The program that you will be writing will be called Books.py. The following is the suggested structure. You do not have to adhere to it. However, we will be looking for good documentation, good design, and adherence to the coding conventions discussed in class. Use meaningful variable names in your program.
# Create word dictionary from the comprehensive word list
word_dict = {}
def create_word_dict ():

# Removes punctuation marks from a string
def parseString (st):

# Returns a dictionary of words and their frequencies
def getWordFreq (file):

# Compares the distinct words in two dictionaries
def wordComparison (author1, freq1, author2, freq2):

def main():
  # Create word dictionary from comprehensive word list
  create_word_dict()

  # Enter names of the two books in electronic form
  book1 = input ("Enter name of first book: ")
  book2 = input ("Enter name of second book: ")
  print()

  # Enter names of the two authors
  author1 = input ("Enter last name of first author: ")
  author2 = input ("Enter last name of second author: ")
  print()

  # Get the frequency of words used by the two authors
  wordFreq1 = getWordFreq (book1)
  wordFreq2 = getWordFreq (book2)

  # Compare the relative frequency of uncommon words used
  # by the two authors
  wordComparison (author1, wordFreq1, author2, wordFreq2)

main()
Step III: Declare a global dictionary variable word_dict. In the function create_word_dict() open the file words.txt and populate the dictionary. Each word in the file will be a key in the dictionary and the value will be 1. Hard code the name of the file, words.txt, in your program. When we test your program we will be using a file with the same name.
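As a rough illustration of Step III, the function might look something like the sketch below. It assumes that words.txt holds one word per line, which this handout does not state. The path parameter is also an addition made only so the sketch can be tried on a small test file; its default is still the hard-coded name words.txt.

```python
# Global dictionary: each word from the comprehensive word list is a key.
word_dict = {}

def create_word_dict(path="words.txt"):
    """Populate the global word_dict from the word list file.

    Assumption: the file contains one word per line. The path parameter
    is hypothetical; calling create_word_dict() with no argument reads
    the hard-coded words.txt.
    """
    with open(path, "r") as in_file:
        for line in in_file:
            word = line.strip()
            if word:                 # skip blank lines
                word_dict[word] = 1
```

Storing the words as dictionary keys means a later check such as `word in word_dict` does not have to scan the whole list.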
Step IV: Start by getting the frequency of words in each text. You will then have to do some additional processing:
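The basic frequency count in Step IV could be sketched as follows. This covers only the raw counting, not any additional processing. Lowercasing every word and replacing every ASCII punctuation mark (including apostrophes and hyphens) with a space are assumptions made for this sketch, not requirements taken from the assignment.

```python
import string

def parseString(st):
    """Return st with each ASCII punctuation character replaced by a space.

    Assumption: apostrophes and hyphens are treated like any other
    punctuation mark, so "don't" becomes "don t".
    """
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return st.translate(table)

def getWordFreq(file):
    """Return a dictionary mapping each word in the file to its count."""
    freq = {}
    with open(file, "r") as book:
        for line in book:
            # Assumption: words are compared in lowercase.
            for word in parseString(line).lower().split():
                freq[word] = freq.get(word, 0) + 1
    return freq
```

Note that string.punctuation contains only ASCII characters, so any curly quotes in a downloaded text would need separate handling.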
Step V: In this step you will be working on the function wordComparison(). First you will get some statistics of the two novels separately and then you will compare the two together. For each novel compute and print the following pieces of information:
Here are two sample files - dickens.txt and hardy.txt taken from the two novels. Your output for these two sample files should be of the following form:
Enter name of first book: dickens.txt
Enter name of second book: hardy.txt

Enter last name of first author: Dickens
Enter last name of second author: Hardy

Dickens
Total distinct words = 58
Total words (including duplicates) = 119
Ratio (% of total distinct words to total words) = 48.7394957983

Hardy
Total distinct words = 93
Total words (including duplicates) = 123
Ratio(% of total distinct words to total words) = 75.6097560976

Dickens used 50 words that Hardy did not use.
Relative frequency of words used by Dickens not in common with Hardy = 63.8655462185

Hardy used 85 words that Dickens did not use.
Relative frequency of words used by Hardy not in common with Dickens = 77.2357723577

Here is the output of the frequencies of the words in the two excerpts - dickens.out.txt and hardy.out.txt. These outputs have now been checked against the comprehensive word list.
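The figures in the sample output above can be reproduced from two frequency dictionaries along the lines of the sketch below. The helper names summarize and exclusive_usage are hypothetical and not part of the suggested structure. The definition of "relative frequency" used here is inferred from the sample numbers rather than stated in the assignment: it is the total number of occurrences of the words only one author used, as a percentage of that author's total word count. This is consistent with 76 of Dickens's 119 words giving 63.8655462185.

```python
def summarize(freq):
    """Return (distinct words, total words, % of distinct to total).

    Assumes freq is a non-empty dictionary of word counts.
    """
    distinct = len(freq)
    total = sum(freq.values())
    return distinct, total, 100.0 * distinct / total

def exclusive_usage(freq_a, freq_b):
    """Return (number of words in freq_a but not in freq_b,
    their occurrences as a % of all words counted in freq_a)."""
    only_a = [word for word in freq_a if word not in freq_b]
    occurrences = sum(freq_a[word] for word in only_a)
    return len(only_a), 100.0 * occurrences / sum(freq_a.values())

# Toy dictionaries standing in for the output of getWordFreq().
dickens = {"it": 2, "was": 2, "best": 1}
hardy = {"it": 1, "heath": 3}

print(summarize(dickens))               # (3, 5, 60.0)
print(exclusive_usage(dickens, hardy))  # (2, 60.0)
print(exclusive_usage(hardy, dickens))  # (1, 75.0)
```

A wordComparison() function could call helpers like these once per author and print the results in the format shown in the sample output.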
The above program will have a header of the following form:
# File: Books.py
# Description:
# Student Name:
# Student UT EID:
# Course Name: CS 303E
# Unique Number:
# Date Created:
# Date Last Modified:
This is an exercise in Computer Science and not in linguistics. What we are trying to illustrate is how easy it is to perform these types of computation in Python that lead to interesting observations but not necessarily deep insights.
Use the Canvas program to submit your Books.py file. We should receive your work by 11 PM on Tuesday, 05 Dec 2017. There is no extension to this deadline. There will be substantial penalties if you do not adhere to the guidelines.