Benford's Law states that in many naturally occurring collections of numbers, the leading digit is likely to be small. Benford's Law is followed most closely when the collection of numbers is reasonably large and there are many unique values in the collection.
Benford's law has been found to apply to population numbers, death rates, lengths of rivers, mathematical distributions given by some power law, and physical constants like atomic weights and specific heats. This law is now used to detect fraud in lists of socio-economic data submitted in support of public planning decisions and as an indicator of accounting and expenses fraud.
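The text above doesn't give the formula, but the usual statement of Benford's Law is that the leading digit d (1 through 9) occurs with probability log10(1 + 1/d). Here is a small sketch that prints those expected percentages so you have something to compare your results against. The variable name `expected` is just for illustration.

```python
import math

# Expected percentage for each leading digit d under Benford's Law:
# P(d) = log10(1 + 1/d)
expected = {d: math.log10(1 + 1 / d) * 100 for d in range(1, 10)}

for d, pct in expected.items():
    print(d, format(pct, ".1f"))
```

Digit 1 should come out near 30% and digit 9 under 5%, which is the "leading digit is likely to be small" pattern.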
In this programming assignment you will verify Benford's law for some population data for Texas counties. The file Texas County Population Data gives the population of all counties in Texas according to the 1990, 2000, 2010, and 2020 census along with an estimate of county population in 2022 and a percentage change between 2020 and 2022. There's an extra line for the entire state, but just treat that as if it were another county. You won't use all of this data, just the population numbers for 2020 and 2022.
You'll need to copy the file into your own directory. Each line of data in the file has format:
FIPS&County name&RUC&Pop. 1990&Pop. 2000&Pop. 2010&Pop. 2020&Pop. 2022&Change 2020-22
Notice, I was missing the RUC field in an earlier version, so my counts were off. FIPS is a Federal Information Processing Standards number designating a particular U.S. geographic area.
Fields are separated by ampersands "&". Here are the first few lines of data:
48000&Texas&0&16,986,335&20,851,028&25,145,561&29,145,428&30,029,572&3.0%
48001&Anderson&7&48,024&55,114&58,458&57,922&58,064&0.2%
48003&Andrews&6&14,338&13,002&14,786&18,614&18,334&-1.5%
For this assignment, all you need care about is the census population data for 2020 and the estimated population in 2022. You'll obtain these numbers by splitting the line on "&" and extracting the appropriate fields. Note the population numbers have embedded commas, so you can't just convert them to integers. In fact, there's no reason to, and every reason not to. You're really looking at the numbers as ASCII strings and asking about the first character in the string. Ignore any line that contains the character "#", even at the end of the line. Those are comment lines.
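Here is a minimal sketch of that extraction step, using one of the sample lines above. The function name `lead_digits` is made up for illustration, and the field positions 6 and 7 (counting from 0) are an assumption based on the format line shown earlier. Treat it as one way to do this step, not as the required solution.

```python
def lead_digits(line):
    """Return the first characters of the 2020 and 2022 population fields,
    or None if the line is a comment (contains '#' anywhere)."""
    if "#" in line:
        return None
    fields = line.strip().split("&")
    pop2020 = fields[6]   # Pop. 2020, e.g. "57,922" (index assumed from the format line)
    pop2022 = fields[7]   # Pop. 2022, e.g. "58,064"
    # No integer conversion needed: the lead digit is just the first character.
    return pop2020[0], pop2022[0]

sample = "48001&Anderson&7&48,024&55,114&58,458&57,922&58,064&0.2%"
print(lead_digits(sample))
print(lead_digits("# this line is a comment"))
```

For the Anderson line, both population fields start with "5". Because the values stay strings, the embedded commas never get in the way.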
Your assignment is to process the data to validate that it follows Benford's Law. This has several parts.
There are two reasons why we're using a set to count the number of unique population values: to get you some practice using sets, and because an assumption of Benford's Law is that there are a large enough number of data points for the Law to reasonably apply. You are reporting the size of the dataset to show that it's large enough for Benford's Law to apply.
Note that you don't get your digit counts from the numbers stored in the set, but from all lines. Remember that you're reading two values from each non-comment line. You count lead digits even from numbers that repeat, and so only appear once in your set. For example, if 1000 appears twice among the data, that will only count as one element of your set of unique values, but will increment the initial digit "1" twice.
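To see the difference between the set and the digit counts, here is a small sketch using made-up values (they are not from the data file). The names `unique_values` and `digit_counts` are illustrative only.

```python
# Made-up population strings; "1000" appears twice on purpose.
values = ["1000", "2,500", "1000", "18,614"]

unique_values = set()   # each distinct value is stored once
digit_counts = {}       # lead digit -> number of times seen

for v in values:
    unique_values.add(v)                          # duplicates are ignored by the set
    d = v[0]                                      # lead digit as a character
    digit_counts[d] = digit_counts.get(d, 0) + 1  # but every occurrence is counted

print(len(unique_values))   # 3
print(digit_counts)         # {'1': 3, '2': 1}
```

The repeated "1000" adds only one element to the set, but it increments the count for "1" both times.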
This assignment gives you practice in file manipulation (input and output), sets, and dictionaries. You must use all of these to get full credit.
> python Benford.py
Enter the name of a file: NoSuchFile.txt
File does not exist
> python Benford.py
Enter the name of a file: populationDataForHW11
Output written to benford.txt
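One way to get the "File does not exist" behavior shown in the sample run is to test the name with `os.path.isfile` before opening it. This is only a sketch; the function name `check_file` is hypothetical, and catching the error from `open` would work as well.

```python
import os.path

def check_file(filename):
    """Return True if filename names an existing file; otherwise
    print the message from the sample run and return False."""
    if not os.path.isfile(filename):
        print("File does not exist")
        return False
    return True

check_file("NoSuchFile.txt")   # prints the message unless such a file happens to exist
```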
I was previously missing a field in the data that threw off my counts. What is below should be OK now.
So how do you know if your program wrote the right thing into file benford.txt? You look. You can do that by opening the file in a text editor. Alternatively, there are commands available from your operating system to display the contents of a file to the terminal. For example, on a Linux or MacOS system, you could do the following, at the operating system command line level:
> cat benford.txt
Total number of counties: 255
Unique population counts: 508
First digit frequency distributions:
Digit	Count	Percentage
1	152	29.8
2	89	17.5
3	74	14.5
4	35	6.9
5	50	9.8
6	34	6.7
7	23	4.5
8	26	5.1
9	27	5.3
>
Note that cat is just the Linux/MacOS command to display on the terminal the contents of a file. It was just my way of showing you what ended up in file benford.txt. I believe that type does something similar on Windows. These commands won't work at the level of the Python interactive loop or in IDLE.
Note that fields in the output table are separated by tabs. Round percentages to one decimal place.
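Here is a sketch of building tab-separated rows with percentages shown to one decimal place. It uses only the first two counts from the sample output, and the total of 510 is the sum of the Count column in that sample. The names `counts`, `total`, and `rows` are illustrative.

```python
# First two rows of the sample output; 510 is the sum of all nine sample counts.
counts = {"1": 152, "2": 89}
total = 510

rows = ["Digit\tCount\tPercentage"]          # "\t" is the tab character
for digit in sorted(counts):
    pct = counts[digit] / total * 100
    rows.append(f"{digit}\t{counts[digit]}\t{pct:.1f}")   # .1f = one decimal place

for row in rows:
    print(row)
```

In your program you would write these lines to the output file rather than print them.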
Your file must compile and run before submission. It must also contain a header with the following format:
# File: Benford.py
# Student:
# UT EID:
# Course Name: CS303E
#
# Date:
# Description of Program:
Whenever you start the Python interpreter or an IDE such as IDLE or VSCode, you're really telling the OS to start that software for you. But once you're inside Python or the IDE, you have commands that are appropriate there. It's often said that the OS and the application (Python or the IDE) are at "different levels of abstraction" and that the application is running "under" the OS.
Most often, things just work. But no doubt this semester you've run into some OS-related issues, such as when Python can't find your source file, typically because you're running Python in a different directory than the one where the file resides. That's not really a Python issue; it's an OS issue. In a lot of programming languages, you'd have to exit the programming language environment and deal with the OS directly. Python, by contrast, has many built-in commands that let you execute OS operations from within Python. Many of those are in the os library.
For example, from within Python you can: move to another directory, list all the files in a directory, create a new directory, delete a directory or files, etc. That's very handy because it means you can use Python to write "system programs," which really just means writing code to perform OS-level operations.
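As a rough sketch of those operations, the code below works inside a temporary scratch directory so that nothing of yours is touched. The names "demo" and "notes.txt" are made up for the example; `tempfile.mkdtemp` is used only to get a safe place to experiment.

```python
import os
import tempfile

start = os.getcwd()                  # remember where we began
scratch = tempfile.mkdtemp()         # a fresh, empty temporary directory

os.chdir(scratch)                    # move to another directory
os.mkdir("demo")                     # create a new directory
with open("notes.txt", "w") as f:    # create a small file
    f.write("hello\n")

listing = sorted(os.listdir())       # list everything in the current directory
print(listing)                       # ['demo', 'notes.txt']

os.remove("notes.txt")               # delete a file
os.rmdir("demo")                     # delete an (empty) directory
os.chdir(start)                      # go back to where we started
os.rmdir(scratch)                    # clean up the scratch directory
```

Be careful with the delete operations on real directories; there's no undo.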
Below is some Python code that will show you your current directory and list all of the files in it.
import os
dir = os.getcwd()                  # get current working directory (cwd)
print("Directory is: ", dir )      # print it
myfiles = os.listdir()             # get list of files in cwd
for file in myfiles:               # print them
    print( file )
So, the next time Python tells you it can't find a module, run this code to see what directory you're in and what files are there. (Remember that a module moduleName in Python is just the code in the file named moduleName.py.)