CS303E Homework 11

Instructor: Dr. Bill Young
Due Date: Friday, April 5, 2024 at 11:59pm

Benford's Law

In 1881, Simon Newcomb noticed that in tables of logarithms, the first pages were much more worn and smudged than later pages. In 1938, Frank Benford published a paper showing the distribution of the leading digit in many disparate sources of data. In all these sets of data, the digit 1 was the leading digit about 30% of the time. This is rather surprising; you'd expect that leading digits would be much more evenly distributed.

Benford's Law states that in many naturally occurring collections of numbers, the leading digit is likely to be small. Benford's Law is followed most closely when the collection of numbers is reasonably large and there are many unique values in the collection.
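To make "likely to be small" concrete: the usual statement of Benford's Law (not spelled out in this assignment) is that leading digit d appears with probability log10(1 + 1/d). The short sketch below prints those expected percentages so you have something to compare your results against. The variable name expected is just for illustration.

```python
import math

# Expected share (as a percentage) of each leading digit d under
# Benford's Law: log10(1 + 1/d). Keys are characters, matching the way
# this assignment treats digits.
expected = {str(d): 100 * math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, pct in expected.items():
    print(digit, format(pct, ".1f"), sep="\t")
```

The entry for "1" comes out to about 30.1, which matches the "about 30%" figure mentioned above.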

Benford's Law has been found to apply to population numbers, death rates, lengths of rivers, mathematical distributions given by some power law, and physical constants like atomic weights and specific heats. The law is now used to detect fraud in lists of socio-economic data submitted in support of public planning decisions and as an indicator of accounting and expense fraud.

In this programming assignment you will verify Benford's Law for some population data for Texas counties. The file Texas County Population Data gives the population of all counties in Texas according to the 1990, 2000, 2010, and 2020 censuses, along with an estimate of county population in 2022 and the percentage change between 2020 and 2022. There's an extra line for the entire state, but just treat that as if it were another county. You won't use all of this data, just the population numbers for 2020 and 2022.

You'll need to copy the file into your own directory. Each line of data in the file has format:

FIPS&County name&RUC&Pop. 1990&Pop. 2000&Pop. 2010&Pop. 2020&Pop. 2022&Change 2020-22

Notice, I was missing the RUC field in an earlier version, so my counts were off. FIPS is a Federal Information Processing Standards number designating a particular U.S. geographic area.

Fields are separated by ampersands "&". Here are the first few lines of data:

48000&Texas&0&16,986,335&20,851,028&25,145,561&29,145,428&30,029,572&3.0%
48001&Anderson&7&48,024&55,114&58,458&57,922&58,064&0.2%
48003&Andrews&6&14,338&13,002&14,786&18,614&18,334&-1.5%

For this assignment, all you need to care about is the census population data for 2020 and the estimated population in 2022. You'll obtain these numbers by splitting the line on "&" and extracting the appropriate fields. Note the population numbers have embedded commas, so you can't just convert them to integers. In fact, there's no reason to, and every reason not to. You're really looking at the numbers as ASCII strings and asking about the first character in the string. Ignore any line that contains the character "#", even at the end of the line. Those are comment lines.
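Here's a minimal sketch of that parsing step, using one of the sample lines above. The function name leading_digits is hypothetical, and the sketch assumes every data line has all nine fields in the order shown.

```python
def leading_digits(line):
    """Return the first characters of the 2020 and 2022 population
    fields, or None if the line is a comment line."""
    if "#" in line:                  # comment line: skip it
        return None
    fields = line.strip().split("&")
    pop2020 = fields[6]              # Pop. 2020, e.g. "57,922"
    pop2022 = fields[7]              # Pop. 2022, e.g. "58,064"
    return pop2020[0], pop2022[0]    # first character of each string

sample = "48001&Anderson&7&48,024&55,114&58,458&57,922&58,064&0.2%"
print(leading_digits(sample))        # prints ('5', '5')
```

Since only the first character matters, the embedded commas never get in the way.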

Your assignment is to process the data to validate that it follows Benford's Law. This has several parts.

  1. Accept from the user the name of a file containing the data. If no file of that name exists, print an error message and quit.

  2. Create an empty set to store unique population values.

  3. Create a dictionary for leading digit counts, with entries of the form [digit:count]. Note that digit here is a character, not a number. The counts should initially all be 0, for the nine possible leading digits (no number will start with a '0').

  4. For each line in the file do the following:
    1. If the line contains "#" anywhere, ignore it; it's a comment line.
    2. Parse the line to extract the 2020 and 2022 population numbers.
    3. Add those to the set (a value won't repeat if it's already there).
    4. Get the first digit of each and increment the count for that digit in the dictionary. Remember: it's a string, not an integer.

  5. Close the file.

  6. Print to the terminal the line: "Output written to benford.txt"

  7. Write all other output to a file named benford.txt: how many total county population values (data lines processed); how many unique population values (the size of the resulting set); a table of results for the leading digits. (See the examples below.) Your table of results should be formatted as shown below. I used tabs between fields in the table, but you don't have to. (Note that tabs may not properly align the columns; that's OK. You can use format if your columns don't properly align.) Each run of the program will overwrite the file. (Hint: you may want to just print to the terminal until you get everything running, and then change to writing to a file.)

  8. Be sure to close benford.txt before exiting; otherwise, some output may not be written completely to the file.
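
Steps 1 through 3 above might be set up along the lines of the sketch below. The helper name open_or_none is hypothetical, and using os.path.isfile is just one way to test whether the file exists; it's an assumption here, not a requirement of the assignment.

```python
import os.path

def open_or_none(filename):
    """Return an open file object for reading, or None (after printing
    an error message) if no file of that name exists."""
    if not os.path.isfile(filename):
        print("File does not exist")
        return None
    return open(filename, "r")

unique_values = set()                               # step 2: empty set
digit_counts = {str(d): 0 for d in range(1, 10)}    # step 3: keys '1'..'9', all 0
```

In your program, the filename would come from input(), and you'd quit if the helper returned None.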

There are two reasons why we're using a set to count the number of unique population values: to get you some practice using sets, and because an assumption of Benford's Law is that there are a large enough number of data points for the Law to reasonably apply. You are reporting the size of the dataset to show that it's large enough for Benford's Law to apply.

Note that you don't get your digit counts from the numbers stored in the set, but from all lines. Remember that you're reading two values from each non-comment line. You count lead digits even from numbers that repeat, and so only appear once in your set. For example, if 1000 appears twice among the data, that will only count as one element of your set of unique values, but will increment the initial digit "1" twice.
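
The difference between the set and the digit counts can be seen in a tiny sketch. The three values below are made up for illustration; they aren't from the data file.

```python
values = ["1,000", "2,500", "1,000"]     # made-up data; "1,000" appears twice

unique_values = set()
digit_counts = {str(d): 0 for d in range(1, 10)}

for value in values:
    unique_values.add(value)             # adding a repeat leaves the set unchanged
    digit_counts[value[0]] += 1          # but every occurrence is counted

print(len(unique_values))                # prints 2
print(digit_counts["1"])                 # prints 2
```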

This assignment gives you practice in file manipulation (input and output), sets, and dictionaries. You must use all of these to get full credit.

Sample Runs

Notice that the second run uses the same input file you will use. So your output file should be effectively identical to what is shown below.

> python Benford.py
Enter the name of a file: NoSuchFile.txt
File does not exist
> python Benford.py
Enter the name of a file: populationDataForHW11
Output written to benford.txt

I was previously missing a field in the data that threw off my counts. What is below should be OK now.

So how do you know if your program wrote the right thing into file benford.txt? You look. You can do that by opening the file in a text editor. Alternatively, there are commands available from your operating system to display the contents of a file to the terminal. For example, on a Linux or MacOS system, you could do the following, at the operating system command line level:

> cat benford.txt 
Total number of counties: 255
Unique population counts: 508
First digit frequency distributions:
Digit	Count	Percentage
1	152	29.8
2	89	17.5
3	74	14.5
4	35	6.9
5	50	9.8
6	34	6.7
7	23	4.5
8	26	5.1
9	27	5.3
>

Note that cat is just the Linux/MacOS command to display on the terminal the contents of a file. It was just my way of showing you what ended up in file benford.txt. I believe that type does something similar on Windows. These commands won't work at the level of the Python interactive loop or in IDLE.

Note that fields in my output table are separated by tabs. Round percentages to one decimal place.
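
One way to build tab-separated rows with percentages rounded to one decimal place is sketched below. The function name format_table is hypothetical, the counts are copied from the sample output above, and the sketch assumes at least one value was counted (otherwise the division would fail).

```python
def format_table(digit_counts):
    """Return the digit table as one string with tab-separated fields."""
    total = sum(digit_counts.values())
    rows = ["Digit\tCount\tPercentage"]
    for digit in sorted(digit_counts):
        count = digit_counts[digit]
        pct = 100 * count / total
        # format(pct, ".1f") rounds to one decimal place
        rows.append(digit + "\t" + str(count) + "\t" + format(pct, ".1f"))
    return "\n".join(rows)

sample_counts = {"1": 152, "2": 89, "3": 74, "4": 35, "5": 50,
                 "6": 34, "7": 23, "8": 26, "9": 27}
print(format_table(sample_counts))
```

The same string could be passed to the write method of your output file instead of print.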

Turning in the Assignment:

The program should be in a file named Benford.py. Submit the file via Canvas before the deadline shown at the top of this page. Submit it to assignment 11 in the Assignments section by uploading your Python file.

Your file must compile and run before submission. It must also contain a header with the following format:

# File: Benford.py
# Student: 
# UT EID:
# Course Name: CS303E
# 
# Date:
# Description of Program: 

Programming Tips:

Python vs. the operating system: Over the course of this semester, you've been learning how to program in Python. But hopefully you've also been learning something about interacting with your operating system (OS). That may be Windows, Linux, MacOS, or even an OS on your phone like Android or iOS. Your OS is what makes your computer usable by providing an interface to your computer's hardware, whether through a command line or a graphical interface like clicking on icons and dragging things around on your "desktop."

Whenever you start the Python interpreter or an IDE such as IDLE or VSCode, you're really telling the OS to start that software for you. But once you're inside Python or the IDE, you have commands that are appropriate there. It's often said that the OS and the application (Python or the IDE) are at "different levels of abstraction" and that the application is running "under" the OS.

Most often, things just work. But no doubt this semester you've run into some OS-related issues, such as when Python can't find your source file, typically because you're running Python in a different directory than the one where the file resides. That's not really a Python issue; it's an OS issue. In a lot of programming languages, you'd have to exit the programming language environment and deal with the OS directly. Unlike many other programming languages, Python has many built-in commands that allow you to execute OS operations from within Python. Many of those are in the os module.

For example, from within Python you can: move to another directory, list all the files in a directory, create a new directory, delete a directory or files, etc. That's very handy because it means you can use Python to write "system programs," which really just means writing code to perform OS-level operations.
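
A few of those operations are sketched below. To stay safe, the sketch works inside a throwaway directory made by tempfile.mkdtemp; the names demo and notes.txt are made up for illustration.

```python
import os
import tempfile

start = os.getcwd()                      # remember where we started
scratch = tempfile.mkdtemp()             # a throwaway directory to experiment in

os.chdir(scratch)                        # move to another directory
os.mkdir("demo")                         # create a new directory
with open("notes.txt", "w") as outfile:  # create a small file
    outfile.write("hello\n")
listing = sorted(os.listdir())           # list the files in the cwd
print(listing)                           # prints ['demo', 'notes.txt']

os.remove("notes.txt")                   # delete a file
os.rmdir("demo")                         # delete an (empty) directory
os.chdir(start)                          # go back to where we started
os.rmdir(scratch)                        # remove the throwaway directory
```

Be careful with os.remove and os.rmdir on your real files; there is no undo.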

Below is some Python code that will show you your current directory and list all of the files in it.

import os
cwd = os.getcwd()                    # get current working directory (cwd)
print("Directory is:", cwd)          # print it
myfiles = os.listdir()               # get list of files in cwd
for filename in myfiles:             # print them
    print(filename)

So, the next time Python tells you it can't find a module, run this code to see what directory you're in and what files are there. (Remember that a module moduleName in Python is just the code in the file named moduleName.py.)