CS303E Homework 11

Instructor: Dr. Bill Young
Due Date: Friday, November 8, 2024 at 11:59pm

Text Mining

From Wikipedia:
Text mining, text data mining or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."
A typical analysis task is taking the text of a document and counting the total number of words, counting the number of unique words, computing the frequency of individual words (possibly excluding some very common words), etc. Before computers, such work was extremely tedious, and scholars spent years performing such analysis on popular books such as the Bible or Homer's Iliad. With computers, it's something that any beginning programmer can do (even in a weekly project).

Your Assignment

In this assignment, you'll write some tools for doing some text analysis on a large text stored in a file. On this text you'll write functions to collect the following statistics:
  1. the number of total words;
  2. the number of unique words;
  3. the number of times each word occurs in the text, excluding some very common words;
  4. the k words that occur most often in the text;
  5. the k longest words in the text;
  6. the k shortest words in the text.
These aren't necessarily the most useful stats, but they'll give you practice in text analysis. You'll apply them to texts stored in files of various sizes, printing a nicely formatted report for each file.

This assignment will involve three tasks:

Task 1: Building a Word Frequency Dictionary. In this task you'll write a function to build a dictionary that associates each word from the text with the number of times that word appears, except that some very common words such as 'the' and 'is' will be excluded. Here is the header for your function:

def createDictionary( filename ):
    """Create a dictionary associating each word in a text file with the
    number of times the word occurs.  Also count the total number of
    words and the number of unique words in the text.  Certain very
    common words are not included in the dictionary, but are counted.
    Return a triple: (wordCount, uniqueWordCount, dictionary)."""
Your function should perform the following steps:
  1. The filename is passed as a parameter; for this step you can assume the file exists and contains only ASCII text.
  2. Open the file for reading.
  3. Read the file line by line, splitting each line into words.
  4. Lowercase each word; if it isn't an excluded word, insert it as a key in the dictionary (if it isn't already there) and increment the count for that word.
  5. As you're reading words keep a count of total words and unique words; include excluded words in the total count and unique words count, but don't put them into the dictionary.
  6. Close the file.
  7. Return a triple of the count of total words, count of unique words, and the dictionary.
In addition to the dictionary, you'll need to keep a set of the excluded words that you've encountered in the text. Remember that you need to keep track of which words you've seen before so that you don't count them again in the count of unique words. You've seen a word before if it's a key in the dictionary or is in the set of excluded words previously encountered. If you see a new non-excluded word you'll add it to the dictionary with a count of 1; if you see a new excluded word, you'll add it to your set of previously seen excluded words. In either case, you'll increment the count of words and of unique words.
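The steps above can be sketched as follows. This is only a starting point, not a finished solution: the excluded-word list is abbreviated here (you must use the full list given below), and the variable names are illustrative.

```python
EXCLUDED = ['a', 'and', 'is', 'the']   # abbreviated; use the full list below

def createDictionary(filename):
    """Create a dictionary associating each non-excluded word with the
    number of times it occurs.  Return a triple:
    (wordCount, uniqueWordCount, dictionary)."""
    wordCount = 0
    seenExcluded = set()     # excluded words already encountered
    counts = {}
    infile = open(filename, 'r')
    for line in infile:
        for word in line.split():
            word = word.lower()
            wordCount += 1
            if word in EXCLUDED:
                seenExcluded.add(word)   # counts toward unique words
            elif word in counts:
                counts[word] += 1
            else:
                counts[word] = 1         # first occurrence of this word
    infile.close()
    # every dictionary key and every seen excluded word is one unique word
    uniqueWordCount = len(counts) + len(seenExcluded)
    return (wordCount, uniqueWordCount, counts)
```

Note that this version computes the unique-word count at the end from the sizes of the dictionary and the set; keeping a running counter as described in the steps above gives the same result.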

To get the words on each line use line.split(). This isn't ideal since it splits on whitespace and doesn't take punctuation into account. That means that "word" will be different from "word!" or "word," and that numbers and symbols will be treated as words. Of course, it would be possible to recognize only alphabetic strings, but that's beyond the scope of this assignment. For the two files provided, I've removed all punctuation (I think).
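For instance, split() breaks on whitespace only, so punctuation stays attached to words:

```python
line = "He said hello  He said, hello!"
# split() collapses runs of whitespace but keeps punctuation attached
print(line.split())
# ['He', 'said', 'hello', 'He', 'said,', 'hello!']
```

Even after lowercasing, 'said' and 'said,' would be counted as different words, which is why punctuation was stripped from the provided files.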

Here are the words to exclude when creating your dictionary:

['a', 'about', 'after', 'all', 'also', 'am', 'an', 'and', 'any',
 'are', 'as', 'at', 'back', 'be', 'because', 'but', 'by', 'can',
 'come', 'could', 'day', 'do', 'even', 'first', 'for', 'from', 'get',
 'give', 'go', 'good', 'had', 'have', 'he', 'her', 'him', 'his',
 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'know',
 'like', 'look', 'make', 'man', 'me', 'men', 'most', 'my', 'new',
 'no', 'not', 'now', 'of', 'on', 'one', 'only', 'or', 'other', 'our',
 'out', 'over', 'people', 'said', 'say', 'see', 'she', 'so', 'some',
 'take', 'than', 'that', 'the', 'their', 'them', 'then', 'there',
 'these', 'they', 'think', 'this', 'time', 'to', 'two', 'up', 'us',
 'use', 'want', 'was', 'way', 'we', 'well', 'went', 'were', 'what',
 'when', 'which', 'who', 'will', 'with', 'work', 'would', 'year',
 'you', 'your']
Don't change this list, because then your counts won't match ours.

Task 2: Writing Text Analysis Functions. To perform analyses on the dictionary you created in Task 1, you'll write the following functions:

def sortByFrequency( dict ):
    """Return a list of pairs of (count, word)
    sorted by count in descending order. I.e., 
    the most frequent word should be first in the
    list."""
    pass

# Think about how to use the function sortByFrequency
# for this one.
def mostFrequentWords( dict, k ):
    """Return a list of the k most frequently occurring 
    words."""
    pass

def sortByWordLength( dict ):
    """Return a list of pairs of (length, word)
    sorted by length in descending order. I.e.,
    the longest word should be first in the list."""
    pass

# Think about how to use the function sortByWordLength
# for this one.
def longestWords( dict, k ):
    """Return a list of the k longest words in the
    text."""
    pass

# Think about how to use the function sortByWordLength
# for this one.
def shortestWords( dict, k ):
    """Return a list of the k shortest words in the
    text."""
    pass
Hint: Suppose L is a list of pairs of the form (num, word); you can sort them in reverse (descending) order with the command:
   L.sort( reverse = True )
This will sort them lexicographically, i.e., by the number first; when two numbers are equal, the tie is broken by the words, in reverse alphabetical order, since reverse=True reverses the entire comparison.
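A small illustration of the hint, using made-up pairs:

```python
L = [(3, 'cat'), (5, 'dog'), (3, 'ant')]
L.sort(reverse=True)     # compares the numbers first, then the words
print(L)
# [(5, 'dog'), (3, 'cat'), (3, 'ant')]
```

Notice that the two pairs with count 3 come out in reverse alphabetical order, because reverse=True applies to the whole comparison.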

Task 3: Writing a main() Function to Produce and Print Statistics on the Text. See the Sample Output below for what is required in your report. Your main() function should do the following:

  1. Accept a filename from the user. Validate that the file exists; if not, print an error message and terminate execution.
  2. Create your dictionary from the file. Save the dictionary, total words count, and unique words count.
  3. Print a report following the model in the Sample Output below. Note that you'll need to report the 10 most frequent words, 10 longest words, and 10 shortest words in the text. Be sure to follow the formatting in Sample Output.
Pay careful attention to the way in which the lists of words are formatted. The longest words are listed 5 per line; the other lists are 10 words per line. This will give you some more practice in formatting lists for printing.
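The file check and the per-line printing might be sketched as below. The printing helper here is only a rough shape; the exact leading spaces, brackets, and commas must match the Sample Output, so you'll need to adjust it.

```python
import os

def printPerLine(words, perLine):
    """Print the words in 'words', at most perLine per line,
    separated by commas.  (Brackets and exact spacing from the
    Sample Output are omitted in this sketch.)"""
    for i in range(0, len(words), perLine):
        print("    " + ", ".join(words[i:i + perLine]))

def main():
    filename = input("Enter a filename: ")
    if not os.path.isfile(filename):      # validate before opening
        print("File does not exist.")
        return
    # ... create the dictionary and print the report here ...
```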

Two files on which to test your code are here: MLK's I Have a Dream Speech and Homer's Odyssey. Note that I have removed punctuation from both files. The Odyssey is also in lowercase.

Sample Output:

> python AnalyzeText.py
Enter a filename: NoSuchFile.txt
File does not exist.
> python AnalyzeText.py 
Enter a filename: MLKDreamSpeech.txt

Text analysis of file: MLKDreamSpeech.txt
  Total word count:  1629
  Unique word count: 537
  10 most frequent words:
   [ freedom, negro, let, ring, dream, every, nation, today, satisfied, must ]
  10 longest words:
   [ discrimination, tranquilizing, righteousness, nullification, interposition, 
     tribulations, proclamation, pennsylvania, mountainside, invigorating ]
  10 shortest words:
   [ jr, 100, ago, bad, cup, end, god, has, hew, let ]

> python AnalyzeText.py 
Enter a filename: wordsFromOdyssey.txt

Text analysis of file: wordsFromOdyssey.txt
  Total word count:  118069
  Unique word count: 6416
  10 most frequent words:
   [ ulysses, house, has, own, son, did, upon, telemachus, tell, been ]
  10 longest words:
   [ straightforwardly, inextinguishable, notwithstanding, extraordinarily, disrespectfully, 
     accomplishments, pyriphlegethon, laestrygonians, interpretation, embellishments ]
  10 shortest words:
   [ o, v, x, ii, iv, ix, mt, oh, ox, re ]

>
BTW: the Odyssey is broken into "books" which are numbered with Roman numerals. That's why you see words like "v", "x", and "ix".

Turning in the Assignment:

The program should be in a file named AnalyzeText.py. Submit the file via Canvas before the deadline shown at the top of this page. Submit it to hw11 in the assignments section by uploading your Python file.

Your file must compile and run before submission. It must also contain a header with the following format:

# File: AnalyzeText.py
# Student: 
# UT EID:
# Course Name: CS303E
# 
# Date:
# Description of Program: 

Programming Tips:

Functional Abstraction: Remember that functions are a type of abstraction. Defining a function extends your toolkit with a new operation that may not be provided natively by the programming language. If you construct the right set of functions, you can do things that would be very difficult "from scratch." Think how much harder arithmetic computations in Python would be if someone hadn't already provided the math module for you to import.

In Task 2 above, you're asked to write five functions. Once you've written sortByFrequency and sortByWordLength, the other three are really pretty simple; if you had to write them from scratch, you'd be hard pressed.
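For example, mostFrequentWords falls out of sortByFrequency in one line. This sketch assumes a dictionary d mapping words to counts, as built in Task 1:

```python
def sortByFrequency(d):
    """Return a list of (count, word) pairs, most frequent first."""
    pairs = [(count, word) for (word, count) in d.items()]
    pairs.sort(reverse=True)
    return pairs

def mostFrequentWords(d, k):
    """Return the k most frequently occurring words."""
    # take the first k pairs and keep just the words
    return [word for (count, word) in sortByFrequency(d)[:k]]
```

longestWords and shortestWords can reuse sortByWordLength in exactly the same way.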

New programmers often avoid writing functions, thinking it's just a waste of time. Considering what functions to define is often one of the most useful and efficient things you can do when writing a program of any size. Without functional abstraction, it would be almost impossible to have ever written the huge programming systems that power our lives.