Benford's Law states that in many naturally occurring collections of numbers, the leading digit is likely to be small. Benford's Law is followed most closely when the collection of numbers is reasonably large and there are many unique values in the collection.
Benford's law has been found to apply to population numbers, death rates, lengths of rivers, mathematical distributions given by some power law, and physical constants like atomic weights and specific heats. This law is now used to detect fraud in lists of socio-economic data submitted in support of public planning decisions and as an indicator of accounting and expenses fraud.
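"Likely to be small" can be made precise. The usual statement of Benford's Law is that a leading digit d (from 1 to 9) occurs with probability log10(1 + 1/d). The handout itself does not give this formula, so treat the sketch below as background rather than as part of the assignment. The function name benford_expected is just for illustration.

```python
import math

def benford_expected(digit):
    """Expected share, in percent, of values whose leading digit is `digit` (1-9)."""
    return 100 * math.log10(1 + 1 / digit)

# Expected percentage for each leading digit, rounded to one decimal place.
expected = {d: round(benford_expected(d), 1) for d in range(1, 10)}
for d, pct in expected.items():
    print(d, pct)
```

Under this formula about 30.1% of values should start with 1 and about 4.6% with 9, which gives you something to compare your own percentages against.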
In this programming assignment you will test Benford's law for a large data set I found online. The data are from the Institute for Health Metrics and Evaluation (IHME), reporting the Global Burden of Disease: estimating the burden of diseases, injuries, and risk factors globally and for 21 regions for 1990 and 2010. The data were found at: Mortality Data source.
For your program, you will use this version: Mortality Charts. You'll need to copy/download the file into your own directory. Each line of data in the file has format:

    Country Code,Country Name,Year,Age Group,Sex,Number of Deaths,Death Rate Per 100_000

Fields are separated by commas. Here are the first few lines of data:

    # Country Code,Country Name,Year,Age Group,Sex,Number of Deaths,Death Rate Per 100_000
    AFG,Afghanistan,1970,0-6 days,Male,19_241,318_292.90
    AFG,Afghanistan,1970,0-6 days,Female,12_600,219_544.20
    AFG,Afghanistan,1970,0-6 days,Both,31_840,270_200.70

For this assignment, all you need care about are the two fields: Number of Deaths and Death Rate Per 100_000. You'll obtain these numbers by splitting the line on "," and extracting the appropriate fields. Note that these numbers have embedded underscores, where you'd often see commas in long numbers. Python ignores single underscores in numbers, as long as they are not at the beginning or end of the number.
Note that there's no reason to convert the values in these fields to numbers, and every reason not to. You're really looking at the numbers as ASCII strings and asking about the first character in the string.
Your program should ignore any line that contains the character "#", even at the end of the line. Those are comment lines.
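Here is a minimal sketch of handling one line as described above: skip it if it contains "#", otherwise split on "," and look at the first character of the last two fields. The function name lead_chars is hypothetical, and the field positions 5 and 6 assume every data line has exactly the seven fields shown in the format line. This is one way to do it, not the required one.

```python
def lead_chars(line):
    """Return the first characters of the two numeric fields, or None for a comment line."""
    if "#" in line:                      # comment line, even if "#" is at the end
        return None
    fields = line.strip().split(",")
    deaths = fields[5]                   # Number of Deaths, kept as a string
    rate = fields[6]                     # Death Rate Per 100_000, kept as a string
    return deaths[0], rate[0]

print(lead_chars("AFG,Afghanistan,1970,0-6 days,Male,19_241,318_292.90\n"))
print(lead_chars("# Country Code,Country Name,Year,Age Group,Sex,Number of Deaths,Death Rate Per 100_000\n"))
```

Nothing is converted to a number here. The underscores never matter because only the first character of each string is inspected.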
Your assignment is to process the data to see if it follows Benford's Law. This has several parts.
There are two reasons why we're using a set to count the number of unique population values: to get you some practice using sets, and because an assumption of Benford's Law is that there are a large enough number of data points for the Law to reasonably apply. You are reporting the size of the dataset to show that it's large enough for Benford's Law to apply.
Note that you don't get your digit counts from the numbers stored in the set, but from all lines. Remember that you're reading two values from each non-comment line. You count lead digits even from numbers that repeat, and so only appear once in your set. For example, if 1000 appears twice among the data, that will only count as one element of your set of unique values, but will increment the initial digit "1" twice.
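The difference between the set and the digit counts is easy to see on a tiny example. The values below are made up for illustration, and the variable names are only suggestions.

```python
# Made-up values; "1000" appears twice on purpose.
values = ["1000", "2500", "1000", "37"]

unique_values = set()    # each distinct value is stored once
digit_counts = {}        # maps a lead digit (as a string) to how often it occurs

for value in values:
    unique_values.add(value)
    first = value[0]
    digit_counts[first] = digit_counts.get(first, 0) + 1

print(len(unique_values))    # number of unique values
print(digit_counts)          # counts come from every value, repeats included
```

The set ends up with three elements, but the count for "1" is two, because the repeated 1000 is counted each time it appears.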
This assignment gives you practice in file manipulation (input and output), sets, and dictionaries. You must use all of these to get full credit.

    > python Benford.py
    Enter the name of a file of census data: NoSuchFile.csv
    File does not exist
    > python Benford.py
    Enter the name of a file of census data: MortalityCharts.csv
    Output written to benford.txt
    >

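One way to get the "File does not exist" behavior in the sample run is to test the name with os.path.isfile before trying to open it. This is a sketch under that assumption. The helper name check_file is hypothetical, and you may prefer a different approach.

```python
import os

def check_file(filename):
    """Return True if filename names an existing file; otherwise print a message and return False."""
    if not os.path.isfile(filename):
        print("File does not exist")
        return False
    return True

print(check_file("NoSuchFile.csv"))
```

In your program the name would come from the user, for example from input("Enter the name of a file of census data: "), and you would only go on to read the file when the check succeeds.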
So how do you know if your program wrote the right thing into file benford.txt? You look. You can do that by opening the file in a text editor. Alternatively, there are commands available from your operating system to display the contents of a file to the terminal. For example, on a Linux or MacOS system, you could do the following, at the operating system command line level:

    > cat benford.txt
    Processing file: MortalityCharts.csv
    Total lines processed: 58905
    Unique numbers count: 48405
    First digit frequency distributions:
    Digit	Count	Percentage
    0	81	0.1
    1	37379	31.7
    2	19793	16.8
    3	13936	11.8
    4	10941	9.3
    5	9173	7.8
    6	7914	6.7
    7	6921	5.9
    8	6184	5.2
    9	5488	4.7
    >

Note that earlier, I didn't include the line for '0', but probably should have. Ordinarily, integer values in a table would not begin with '0' except 0 itself. But since the data we're looking at is counts, it can be 0. So we should probably include that line. If you didn't include it, don't worry about it. You won't be penalized.
Note that cat is just the Linux/MacOS command to display on the terminal the contents of a file. It was just my way of showing you what ended up in file benford.txt. I believe that type does something similar on Windows. These commands won't work at the level of the Python interactive loop or in IDLE.
Note that fields in the output table are separated by tabs. Round percentages to one decimal place.
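A row of the table can be built with "\t" between the fields and a format specifier for the percentage. The sketch below assumes the percentage is the digit's count divided by the total number of values counted (two per data line). The function name format_row is hypothetical.

```python
def format_row(digit, count, total):
    """Build one tab-separated table row, with the percentage shown to one decimal place."""
    percentage = 100 * count / total
    return f"{digit}\t{count}\t{percentage:.1f}"

total = 2 * 58905            # two values counted on each of the 58905 lines
print(format_row("1", 37379, total))
print(format_row("0", 81, total))
```

With the counts from the sample output, these two calls give the percentages 31.7 and 0.1.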
Your file must compile and run before submission. It must also contain a header with the following format:

    # File: Benford.py
    # Student:
    # UT EID:
    # Course Name: CS303E
    #
    # Date:
    # Description of Program:

Whenever you start the Python interpreter or an IDE such as IDLE or VSCode, you're really telling the OS to start that software for you. But once you're inside Python or the IDE, you have commands that are appropriate there. It's often said that the OS and the application (Python or the IDE) are at "different levels of abstraction" and that the application is running "under" the OS.
Most often, things just work. But no doubt this semester you've run into some OS-related issues, such as when Python can't find your source file, typically because you're running Python in a different directory than the one where the file resides. That's not really a Python issue; it's an OS issue. In a lot of programming languages, you'd have to exit the programming language environment and deal with the OS directly. Unlike many other programming languages, Python has many built-in commands that allow you to execute OS operations from within Python. Many of those are in the os library.
For example, from within Python you can: move to another directory, list all the files in a directory, create a new directory, delete a directory or files, etc. That's very handy because it means you can use Python to write "system programs," which really just means writing code to perform OS-level operations.
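As a rough illustration of those operations, the sketch below moves into a scratch directory, creates a directory and a file, lists them, and deletes them again. It works inside a temporary directory from the tempfile module so that nothing of yours is touched. The names demo and notes.txt are made up.

```python
import os
import tempfile

start = os.getcwd()                          # remember where we started

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)                        # move to another directory
    os.mkdir("demo")                         # create a new directory
    path = os.path.join("demo", "notes.txt")
    with open(path, "w") as f:               # put one file in it
        f.write("hello\n")
    listing = os.listdir("demo")             # list the files in a directory
    print(listing)
    os.remove(path)                          # delete a file
    os.rmdir("demo")                         # delete the (now empty) directory
    remaining = os.listdir()                 # scratch directory is empty again
    os.chdir(start)                          # move back before scratch is cleaned up
```

Be careful with os.remove and os.rmdir on real files: they delete immediately, with no trash can to recover from.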
Below is some Python code that will show you your current directory and list all of the files in it.

    import os

    cwd = os.getcwd()               # get current working directory (cwd)
    print("Directory is:", cwd)     # print it
    myfiles = os.listdir()          # get list of files in cwd
    for file in myfiles:            # print them
        print(file)

So, the next time Python tells you it can't find a module, run this code to see what directory you're in and what files are there. (Remember that a module moduleName in Python is just the code in the file named moduleName.py.)