Foundations of Data
Types of Data
Data visualization starts with raw data that has to be preprocessed
before it can be rendered. Data comes in many forms. It can be
gathered from one or many sources, and it can be real or synthetic
(generated) data.
A data set consists of n records. Each record consists of
m observations or variables. An observation or variable can be one or
more numbers, symbols, or strings, or some complex data structure. A
variable can be independent or dependent. In many cases
we may not know which are the independent variable(s) and which are the
dependent variable(s). For synthetic data we may have a generating
function. The independent variables form the domain
of the function and the dependent variables the range of the
function.
There are various ways of classifying data (a short sketch of nominal
vs. ordinal categories follows this list):
- Physical type in memory - int, float, boolean, string
- Numerical Data
- Binary - assuming only two values
- Discrete - taking only integer values
- Continuous - can take any value within a range
- Categorical or Qualitative Data
- Nominal - unordered categories
- Ordinal - ordered categories
- Interval Data - differences between measurements are meaningful but
there is no true zero
- SAT scores, time measured on a clock, IQ scores
- Ratio Data - differences between measurements are meaningful and a
true zero exists
- the amount of money people carry with them
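To make the nominal vs. ordinal distinction concrete, here is a
minimal Python sketch using pandas; the category labels are
hypothetical:

    import pandas as pd

    # Nominal: unordered categories; comparisons like < are meaningless.
    colors = pd.Categorical(["red", "blue", "red", "green"])

    # Ordinal: ordered categories; min/max and sorting are meaningful.
    sizes = pd.Categorical(["small", "large", "medium"],
                           categories=["small", "medium", "large"],
                           ordered=True)
    print(sizes.min(), sizes.max())  # small large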
Types of Datasets
- Scalar - a single datum in a record is a scalar.
- Vector - a one-dimensional list of data
- Matrix - a two-dimensional list of data sometimes also called a
table
- Tensor - a general term for an n-dimensional list of data. A scalar
is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix
is a tensor of rank 2 (see the sketch after this list)
- Graph / Network - a set of vertices and edges. A tree is a connected
graph without any cycles
- Geometry - a grid of values in n-dimensional space, e.g. distribution
of temperature and pressure at every point in the room
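As a quick illustration of the tensor ranks above, here is a minimal
NumPy sketch (the values are arbitrary); an array's ndim attribute is
its tensor rank:

    import numpy as np

    scalar = np.array(5.0)          # rank 0: a single datum
    vector = np.array([1.0, 2.0])   # rank 1: a one-dimensional list
    matrix = np.eye(3)              # rank 2: a table
    tensor = np.zeros((2, 3, 4))    # rank 3: a general n-dimensional list

    print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3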
Data Preprocessing or Data Wrangling
Raw data is often not directly usable for visualization and has to be
preprocessed. For quantitative data we often compute the mean, median,
and standard deviation. For qualitative data we find the mode or the
frequency distribution of the categories. Here are some steps that we
take in data preprocessing (a pandas sketch follows the list):
- Find missing values. Possible solutions to the problem:
- Use a sentinel value to signify that the data is missing.
- Use the mean or median for the missing value for quantitative
data
- Use the mode for categorical data
- Use neighboring values and interpolate
- Discard the record
- Data out of range - e.g. negative salary: use sentinel values or
discard the record
- Duplicate values - remove duplicates
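Here is a minimal pandas sketch of these cleaning steps; the table,
with a quantitative salary column and a categorical dept column, is
hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "salary": [50000.0, np.nan, 62000.0, -10.0, 58000.0, 58000.0],
        "dept":   ["eng", "sales", None, "eng", "sales", "sales"],
    })

    df["salary"] = df["salary"].fillna(df["salary"].median())  # median for quantitative
    df["dept"] = df["dept"].fillna(df["dept"].mode()[0])       # mode for categorical
    df = df[df["salary"] >= 0]                                 # drop out-of-range records
    df = df.drop_duplicates()                                  # remove duplicate records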
Normalization: Data can be transformed so that it satisfies
some statistical property. Here are some examples of normalization (a
short sketch follows the list):
- d_normalized = (d_raw - d_min) / (d_max - d_min)
- d_normalized = (d_raw - d_mean) / d_sigma (otherwise known as the
z-score)
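Both normalizations can be written directly with NumPy; a minimal
sketch on a hypothetical array of raw values:

    import numpy as np

    d_raw = np.array([3.0, 7.0, 1.0, 9.0, 5.0])

    # Min-max normalization: rescales the data into [0, 1].
    d_minmax = (d_raw - d_raw.min()) / (d_raw.max() - d_raw.min())

    # Z-score: zero mean and unit standard deviation.
    d_zscore = (d_raw - d_raw.mean()) / d_raw.std()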
Data Discretization: Continuous data can be broken into discrete
categories or bins and the average or the median value of each bin is
computed.
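A minimal pandas sketch of binning, assuming hypothetical continuous
values:

    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(1).uniform(0, 100, size=50))
    bins = pd.cut(values, bins=5)                            # five equal-width bins
    bin_means = values.groupby(bins, observed=True).mean()   # average value of each bin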
Smoothing or Filtering: Sometimes noisy data need to be
smoothed. Taking a weighted average of the neighbors is a cost-effective
way of doing this:
p'_i = p_{i-1} / 4 + p_i / 2 + p_{i+1} / 4
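This filter is a convolution with the kernel (1/4, 1/2, 1/4); a
minimal NumPy sketch on a hypothetical noisy signal:

    import numpy as np

    rng = np.random.default_rng(2)
    p = np.sin(np.linspace(0, 4 * np.pi, 100)) + rng.normal(scale=0.2, size=100)

    kernel = np.array([0.25, 0.5, 0.25])            # the 1/4, 1/2, 1/4 weights
    p_smooth = np.convolve(p, kernel, mode="same")  # endpoints are only approximate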
Mapping Nominal Data: Most plotting routines are designed to
handle numerical data for display purposes. To plot nominal data, one
of the simplest ways is to map nominal values to colors. Read the
paper on mapping nominal values to numbers for display.
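A minimal matplotlib sketch of mapping nominal values to colors; the
categories and the color assignment are hypothetical:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    species = np.array(["cat", "dog", "bird", "dog", "cat", "bird"])
    x = np.arange(len(species))
    y = rng.uniform(size=len(species))

    # One fixed color per nominal value.
    color_map = {"cat": "tab:blue", "dog": "tab:orange", "bird": "tab:green"}
    plt.scatter(x, y, c=[color_map[s] for s in species])
    plt.show()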
Dimension Reduction
We need to reduce the dimensions of the data when they exceed the
capabilities of our visualization tools. For example, suppose we have
the expression levels of 15 genes for 60 mice. How do we visualize that
to find patterns? How do we know which mice are similar and which are
different? One technique is Principal Component Analysis (PCA).
Let us take a subset of the data and plot Gene1 vs Gene2. The
points form a cloud. Principal component 1 (PC1) is the line that goes
through the center of that cloud and describes it best. If you project
the original dots onto it, two things happen:
- The spread of the projected points is maximized, so that
we can distinguish points from one another.
- The total distance from the original points to their corresponding
projected points is minimized, so the representation is as close to the
original data as possible.
Our PC1 has the maximum variation among the data points and contains the
minimum error. But our data set has 15 genes. To create PC1, a line is
anchored at the center of the 15-dimensional cloud of dots and rotated
through that space, projecting the original 60 dots onto it. The rotation
continues until the spread of the projected points is maximized. The
rotated line now describes the most variation among the 60 mice and
becomes PC1.
PC2 is the second line; it meets PC1 perpendicularly at the center of
the cloud and describes the second most variation in the data.
If PCA is suitable for your data, just the first 2 or 3 principal components
should convey most of the information in the data. The advantages are:
- Principal components help reduce the number of dimensions down to
2 or 3, making it possible to see strong patterns
- Information from all the genes is taken into account. Principal
components take all dimensions and data points into account.
- Since PC1 and PC2 are perpendicular to each other they become
the axes of our plot.
After determining the PC axes, how do we project the data points onto
those axes? Each axis has different weights. Think of each of the
original points being pulled toward 15 different axes, but with different
forces acting on them. Here is the formula for the projection of Mouse 1
on PC1:
PC1 value of Mouse 1 = (gene1 value * gene1 weight on PC1) + ... +
(gene15 value * gene15 weight on PC1)
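Here is a minimal scikit-learn sketch of this projection on a
hypothetical random 60-mice by 15-genes matrix; note that scikit-learn
centers each gene before applying the weights:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 15))   # 60 mice x 15 genes (synthetic stand-in)

    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)   # each row: (PC1 value, PC2 value) of a mouse

    # pca.components_[0] holds the 15 gene weights for PC1; the PC1 value of
    # Mouse 1 is the weighted sum from the formula above, applied to the
    # mean-centered gene values.
    centered = X - X.mean(axis=0)
    pc1_mouse_1 = centered[0] @ pca.components_[0]
    assert np.isclose(pc1_mouse_1, scores[0, 0])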
Mice that have similar expression profiles are now clustered together.
Principal component analysis brings out strong patterns from large and
complex datasets. The essence of the data is captured in a few principal
components, which convey the most variation in the dataset. PCA reduces
the number of dimensions by combining them rather than selecting or
discarding some.