Foundations of Data
Types of Data
Data visualization starts with raw data that has to be preprocessed
before it can be rendered. Data comes in many forms. It can be
gathered from one or many sources, and it can be real or synthetic
(generated) data.
A data set consists of n records. Each record consists of
m observations or variables. An observation or variable can be one or
more numbers, symbols, or strings, or some complex data structure. A
variable can be independent or dependent. In many cases
we may not know which are the independent variable(s) and which are the
dependent variable(s). For synthetic data we may have a generating
function. The independent variables form the domain
of the function and the dependent variables the range of the
function.
There are various ways of classifying data (a short sketch of nominal
vs. ordinal categories follows this list):
- Physical type in memory - int, float, boolean, string
- Numerical Data
- Binary - assuming only two values
- Discrete - taking only integer values
- Continuous - can take any value within a range
- Categorical or Qualitative Data
- Nominal - unordered categories
- Ordinal - ordered categories
- Interval Data - differences between measurements are meaningful but
there is no true zero
- SAT scores, time measured on a clock, IQ scores
- Ratio Data - differences between measurements are meaningful and a
true zero exists
- the amount of money people carry with them
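To make the nominal vs. ordinal distinction concrete, here is a
minimal Python sketch using pandas; the category labels are
hypothetical:

    import pandas as pd

    # Nominal: unordered categories; comparisons like < are meaningless.
    colors = pd.Categorical(["red", "blue", "red", "green"])

    # Ordinal: ordered categories; min/max and sorting are meaningful.
    sizes = pd.Categorical(["small", "large", "medium"],
                           categories=["small", "medium", "large"],
                           ordered=True)
    print(sizes.min(), sizes.max())  # small large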
Types of Datasets
- Scalar - a single datum in a record is a scalar.
- Vector - a one-dimensional list of data
- Matrix - a two-dimensional list of data sometimes also called a
table
- Tensor - a general term for an n-dimensional list of data. A scalar
is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix
is a tensor of rank 2 (see the sketch after this list)
- Graph / Network - a set of vertices and edges. A tree is a connected
graph without any cycles
- Geometry - a grid of values in n-dimensional space, e.g. distribution
of temperature and pressure at every point in the room
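As a quick illustration of the tensor ranks above, here is a minimal
NumPy sketch (the values are arbitrary); an array's ndim attribute is
its tensor rank:

    import numpy as np

    scalar = np.array(5.0)          # rank 0: a single datum
    vector = np.array([1.0, 2.0])   # rank 1: a one-dimensional list
    matrix = np.eye(3)              # rank 2: a table
    tensor = np.zeros((2, 3, 4))    # rank 3: a general n-dimensional list

    print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3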
Data Preprocessing or Data Wrangling
Raw data is often not directly usable for visualization and has to be
preprocessed. For quantitative data we often compute the mean, median,
and standard deviation. For qualitative data we find the mode or the
frequency distribution of the categories. Here are some steps that we
take in data preprocessing (a pandas sketch follows the list):
- Find missing values. Possible solutions to the problem:
- Use a sentinel value to signify that the data is missing.
- Use the mean or median for the missing value for quantitative
data
- Use the mode for categorical data
- Use neighboring values and interpolate
- Discard the record
- Data out of range - e.g. negative salary: use sentinel values or
discard the record
- Duplicate values - remove duplicates
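Here is a minimal pandas sketch of these cleaning steps; the table,
with a quantitative salary column and a categorical dept column, is
hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "salary": [50000.0, np.nan, 62000.0, -10.0, 58000.0, 58000.0],
        "dept":   ["eng", "sales", None, "eng", "sales", "sales"],
    })

    df["salary"] = df["salary"].fillna(df["salary"].median())  # median for quantitative
    df["dept"] = df["dept"].fillna(df["dept"].mode()[0])       # mode for categorical
    df = df[df["salary"] >= 0]                                 # drop out-of-range records
    df = df.drop_duplicates()                                  # remove duplicate records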
Normalization: Data can be transformed so that it satisfies
some statistical property. Here are some examples of normalization (a
short sketch follows the list):
- d_normalized = (d_raw - d_min) / (d_max - d_min)
- d_normalized = (d_raw - d_mean) / d_sigma (otherwise known as the
z-score)
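Both normalizations can be written directly with NumPy; a minimal
sketch on a hypothetical array of raw values:

    import numpy as np

    d_raw = np.array([3.0, 7.0, 1.0, 9.0, 5.0])

    # Min-max normalization: rescales the data into [0, 1].
    d_minmax = (d_raw - d_raw.min()) / (d_raw.max() - d_raw.min())

    # Z-score: zero mean and unit standard deviation.
    d_zscore = (d_raw - d_raw.mean()) / d_raw.std()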
Data Discretization: Continuous data can be broken into discrete
categories or bins and the average or the median value of each bin is
computed.
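A minimal pandas sketch of binning, assuming hypothetical continuous
values:

    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(1).uniform(0, 100, size=50))
    bins = pd.cut(values, bins=5)                            # five equal-width bins
    bin_means = values.groupby(bins, observed=True).mean()   # average value of each bin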
Smoothing or Filtering: Sometimes noisy data need to be
smoothed. Taking a weighted average of the neighbors is a cost-effective
way of doing this:
p'_i = p_{i-1} / 4 + p_i / 2 + p_{i+1} / 4
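This filter is a convolution with the kernel (1/4, 1/2, 1/4); a
minimal NumPy sketch on a hypothetical noisy signal:

    import numpy as np

    rng = np.random.default_rng(2)
    p = np.sin(np.linspace(0, 4 * np.pi, 100)) + rng.normal(scale=0.2, size=100)

    kernel = np.array([0.25, 0.5, 0.25])            # the 1/4, 1/2, 1/4 weights
    p_smooth = np.convolve(p, kernel, mode="same")  # endpoints are only approximate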
Mapping Nominal Data: Most plotting routines are designed to
handle numerical data for display purposes. To plot nominal data, one
of the simplest ways is to map nominal values to colors. Read the
paper on mapping nominal values to numbers for display.
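A minimal matplotlib sketch of mapping nominal values to colors; the
categories and the color assignment are hypothetical:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    species = np.array(["cat", "dog", "bird", "dog", "cat", "bird"])
    x = np.arange(len(species))
    y = rng.uniform(size=len(species))

    # One fixed color per nominal value.
    color_map = {"cat": "tab:blue", "dog": "tab:orange", "bird": "tab:green"}
    plt.scatter(x, y, c=[color_map[s] for s in species])
    plt.show()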
Dimension Reduction
We need to reduce the dimensions of the data when they exceed the
capabilities of our visualization tools. For example, suppose we have
the expression levels of 15 genes for 60 mice. How do we visualize that
to find patterns? How do we know which mice are similar and which are
different? One technique is Principal Component Analysis (PCA).
Let us take a subset of the data and plot Gene1 vs Gene2. The
points form a cloud. Principal component 1 (PC1) is the line that goes
through the center of that cloud and describes it best. If you project
the original dots onto it, two things happen:
- The spread of the projected points is maximized, so that
we can distinguish points from one another.
- The total distance from the original points to their corresponding
projected points is minimized, so the representation is as close to the
original data as possible.
Our PC1 has the maximum variation among the data points and contains the
minimum error. But our data set has 15 genes. To create PC1, a line is
anchored at the center of the 15-dimensional cloud of dots and rotated
through that space, projecting the original 60 dots onto it. The rotation
continues until the spread of the projected points is maximized. The
rotated line now describes the most variation among the 60 mice and
becomes PC1.
PC2 is the second line; it meets PC1 perpendicularly at the center of
the cloud and describes the second most variation in the data.
If PCA is suitable for your data, just the first 2 or 3 principal components
should convey most of the information in the data. The advantages are:
- Principal components help reduce the number of dimensions down to
2 or 3, making it possible to see strong patterns
- Information from all the genes is taken into account. Principal
components take all dimensions and data points into account.
- Since PC1 and PC2 are perpendicular to each other they become
the axes of our plot.
After determining the PC axes, how do we project the data points onto
those axes? Each axis has different weights. Think of each of the
original points being pulled toward 15 different axes, but with different
forces acting on them. Here is the formula for the projection of Mouse 1
on PC1:
PC1 value of Mouse 1 = (gene1 value * gene1 weight on PC1) + ... +
(gene15 value * gene15 weight on PC1)
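Here is a minimal scikit-learn sketch of this projection on a
hypothetical random 60-mice by 15-genes matrix; note that scikit-learn
centers each gene before applying the weights:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 15))   # 60 mice x 15 genes (synthetic stand-in)

    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)   # each row: (PC1 value, PC2 value) of a mouse

    # pca.components_[0] holds the 15 gene weights for PC1; the PC1 value of
    # Mouse 1 is the weighted sum from the formula above, applied to the
    # mean-centered gene values.
    centered = X - X.mean(axis=0)
    pc1_mouse_1 = centered[0] @ pca.components_[0]
    assert np.isclose(pc1_mouse_1, scores[0, 0])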
Mice that have similar expression profiles are now clustered together.
Principal component analysis brings out strong patterns from large and
complex datasets. The essence of the data is captured in a few principal
components, which convey the most variation in the dataset. PCA reduces
the number of dimensions by combining them rather than selecting or
discarding some.