CS303E Assignment 12

Due: Thursday, July 25 by 11 pm

File Name: first_digit.py. Use the provided file.

Submit to Assignment 12 on Gradescope via Canvas

Purpose: To practice working with dictionaries, functions, and strings

Limitations: You may only use the Python syntax and language features covered in the book chapters 1 - 9.

Review the assignment guidelines.


Preparation. You must complete these questions before writing and running your program.  Recall, this course carries the UT qualitative reasoning flag.

1. Locate a data source online. Find a free data source on the web that meets our requirements (at least 150 values, not a top X list) and transform it into the format suitable for input to the first_digit.py program. The file format is specified below. Note, you will likely not find a data source that meets the formatting requirements initially, and you will have to transform it (maybe with another program you write yourself?) into the required format.

The data must all be separate integer measurements of a single thing. For example: university and college enrollments, populations of ALL cities or counties in various states or countries, the number of articles submitted or updated each day on Wikipedia, the lengths of rivers in Canada, the number of games with a specified tag on Steam, or the number of lifetime hits of 200 major league baseball players. (Not the top 200 players. Use a random sampling of players or all the players from a single season.)

Do not use lists of data for the top X (such as the top 100 or top 200) measurements of a given phenomenon. Also, you cannot use data sets with the populations of counties in US states, nor can you use hockey statistics similar to the ones from the sample data files. You can use data from sports besides hockey if you wish. Your data source needs to be different from the given examples. If you have a question as to whether a data source is appropriate, please post to the class discussion group on Ed Discussion.

Additionally, your data source must contain at least 150 measurements.

You will include the contents of your file as a comment (a very large comment) in your program after the call to main, so your TA can see the data set you found and its source.

2. The first_digit.py program will analyze a file with measurements and print out the percentage of values in the file that start with each digit. In other words, it prints out the percentage of values that start with a 1, the percentage of values that start with a 2, and so forth up to the digit 9. Our program will ignore any values that equal 0 and will also ignore leading zeros. In other words, 007 would be interpreted as just 7.

Before starting to program, and without looking up any information online or in the sample output, what do you think the breakdown of percentages will be? Predict it precisely: state what you think the percentage will be for each leading digit. The total of all the percentages in your estimate shall equal 100. Include this in your analysis in your program.

3. After you complete your program, compare the actual results to your expected results. For the given data files and your data file, describe the typical results. How do the results you expected, as described in question 2, compare with the actual results? Are you surprised by the actual results? Are the actual results intuitive? How do the results of the given data files compare to the results of the data source you found? Are there data sets where the results would not follow the typical results? Include the results of the program for your data file in your analysis. Four of the points on the assignment are based on the depth and clarity of your analysis.

The Program:

As in Assignment 8, each run of the program analyzes a single file, the name of which is entered by the user.

The program processes a data file with integer values. After reading in the entire file, for each digit 1 through 9 the program prints the percentage of values in the data file that start with that digit. In other words, we are interested in the first digit of each value, not the values themselves, other than finding the max value in the data set.

For example, here are some lines from the data file with the human populations of all Texas counties.

1250884,Travis
6916,Bailey
22063,Gray
5187,Jim Hogg

At the time this data was collected, Travis County had a population of 1,250,884. It has a leading digit of 1. Bailey (which is northwest of Lubbock and on the New Mexico border) had a population of 6,916. It has a leading digit of 6. Gray has a leading digit of 2 and Jim Hogg has a leading digit of 5.
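A minimal sketch of how a leading digit could be extracted, assuming each value arrives as a string of digits. The helper name leading_digit is hypothetical and is not part of the provided first_digit.py. Converting to int first removes leading zeros, and a result of 0 signals a value the program should ignore.

```python
def leading_digit(text):
    # Hypothetical helper, not part of the provided starter file.
    # int("007") is 7, so leading zeros disappear before we look at the digit.
    value = int(text)
    # For a value of 0 this returns 0, which the caller is expected to skip.
    return int(str(value)[0])


print(leading_digit("1250884"))  # Travis
print(leading_digit("6916"))     # Bailey
print(leading_digit("007"))      # treated as 7
```

Check any feature used here against the chapters 1 - 9 limitation before adopting it in your own program.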

The file format for this assignment is:

<Explanation of Data>
<Data source, typically a URL>
<value 1>,<label 1>
<value 2>,<label 2>
...
<value N>,<label N>

The first line of the file is a description of the data in the file. The second line is the source of the data (where it was obtained), typically a URL. The remaining lines contain the data, one value and label per line. Each data line is one value, then a comma, then the label for that value. The label may contain multiple words and punctuation, except for commas.
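The format above could be read along these lines. This is only a sketch: it works on a list of strings rather than an open file so it can run by itself, the helper name parse_data_line is hypothetical, and the description line is assumed from the sample run. The source line is left as the placeholder from the format description.

```python
def parse_data_line(line):
    # Hypothetical helper: turn "<value>,<label>" into an int and a string.
    # Labels never contain commas, so the split yields exactly two parts.
    parts = line.strip().split(",")
    return int(parts[0]), parts[1]


# Stand-in for the lines of a data file (assumed contents, for illustration).
lines = [
    "Human Population of Texas Counties\n",
    "<Data source, typically a URL>\n",
    "1250884,Travis\n",
    "6916,Bailey\n",
    "5187,Jim Hogg\n",
]

description = lines[0].strip()   # first line: explanation of the data
source = lines[1].strip()        # second line: where the data came from
records = []
for line in lines[2:]:           # every remaining line is value,label
    records.append(parse_data_line(line))

print(description)
print(records)
```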

All data values must be >= 0. No negative values are allowed.

The program creates a dictionary with keys equal to the integers 1 through 9. Each value is the number of data values in the file whose first digit is that key.

Use the following statement to create the initial dictionary in the function that processes the file. It creates a dictionary with the digits 1 through 9 as keys and 0 as the value for each key. Recall that we can use the string split method to break a string apart into a list of strings. In this case we would use the comma character as the separator (delimiter) instead of the default whitespace. For example, st1.split(',') assuming st1 is a string.

frequencies = dict.fromkeys(range(1, 10), 0)
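As a rough illustration of how that dictionary might be updated, the sketch below counts leading digits for a handful of values taken from the sample lines, plus a 0 that is skipped. The function name count_leading_digits is hypothetical.

```python
def count_leading_digits(values):
    # Hypothetical helper: tally first digits of non-negative integers.
    frequencies = dict.fromkeys(range(1, 10), 0)
    for value in values:
        if value != 0:                    # values equal to 0 are ignored
            digit = int(str(value)[0])    # first character of the number
            frequencies[digit] += 1
    return frequencies


counts = count_leading_digits([1250884, 6916, 22063, 5187, 0])
print(counts)
```

Because every key from 1 through 9 already exists with a count of 0, the loop never needs to check whether a key is present before adding to it.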

The program also finds the max data value in the file (this should be done at the same time as reading the file to update the dictionary) and the label for that maximum value. If there is a tie, pick the value and label closest to the start of the file. (This specification is not meant to cause special cases. Rather it avoids any special case code.)
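One way the tie rule can fall out without special-case code is to replace the running maximum only on a strict greater-than comparison. The sketch below assumes the data is available as (value, label) pairs. The name find_max and the tie data are made up for illustration.

```python
def find_max(pairs):
    # Hypothetical helper. All data values are >= 0, so -1 is a safe start.
    max_value = -1
    max_label = ""
    for value, label in pairs:
        # Strict > means a later equal value never replaces an earlier one.
        if value > max_value:
            max_value = value
            max_label = label
    return max_value, max_label


# Made-up data with a tie: "first" appears before "second".
result = find_max([(10, "small"), (25, "first"), (25, "second")])
print(result)
```

In the actual program the same comparison would sit inside the loop that reads the file, so the maximum and the dictionary are updated in a single pass.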

Here is a sample run of the program:

Determine percentage of leading digits 1 through 9.
Enter file name: tx_county_pop.txt
First digit data for Human Population of Texas Counties:
Digit Percentage
1     29.1
2     18.1
3     14.6
4      7.9
5      9.8
6      4.3
7      6.3
8      6.3
9      3.5

Max value: 4680609
Max label: Harris
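The column layout in the sample run appears to be the digit followed by the percentage right-aligned in a field 9 characters wide with one decimal place. That width is inferred from the sample output, not stated by the assignment, so confirm it against the expected output file. The sketch also assumes percentages are taken over the values that were counted (zeros excluded). The helper name format_row is hypothetical.

```python
def format_row(digit, count, total):
    # Hypothetical helper. total is assumed to be the number of counted
    # (nonzero) values and must be greater than 0.
    percent = 100 * count / total
    # "9.1f": width 9, one digit after the decimal point (inferred width).
    return str(digit) + format(percent, "9.1f")


# Counts for the four sample county lines: leading digits 1, 6, 2 and 5.
counts = {1: 1, 2: 1, 3: 0, 4: 0, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0}
total = sum(counts.values())

print("Digit Percentage")
for digit in range(1, 10):
    print(format_row(digit, counts[digit], total))
```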

I strongly suggest you plan out your program in advance. Implement a little bit of your program at a time and test the program before moving on.

Here is a file with multiple runs of the program. Given the same inputs, your output shall match this exactly.

I strongly recommend you compare your output with the expected output using a diff program such as https://www.diffchecker.com/.

Here is the initial version of first_digit.py for you to use and the data files from the examples.

  1. first_digit.py
  2. tx_county_pop.txt
  3. mo_county_pop.txt
  4. hockey_goals.txt
  5. penalty_mins.txt