One of the first tasks in any data science project is to understand the data. This can be extremely beneficial for several reasons:
- Catch mistakes in data
- See patterns in data
- Find violations of statistical assumptions
- Generate hypotheses etc.
We can think of this task as an exercise in summarization of the data. To summarize the main characteristics of the data, often two methods are used: numerical and graphical.
The numerical summary of the data is done through descriptive statistics, while the graphical summary is done through exploratory data analysis (EDA). In this post, we will look at both of these fundamental data science techniques in more detail using some examples.
Descriptive statistics are statistics that quantitatively describe or summarize features of a collection of information. Some measures that are commonly used to describe a data set are:
- Measures of central tendency (or location), such as the mean
- Measures of variability (or dispersion), such as the standard deviation
- Measures of the shape of the distribution, such as skewness or kurtosis
- Measures of relative standing, such as the z-score or quartiles
Central tendency (or measure of central tendency) is a central or typical value for a probability distribution. Measures of central tendency are often called averages. The most common measures of central tendency are the arithmetic mean, the median and the mode.
The arithmetic mean (or mean or average) is the most commonly used and readily understood measure of central tendency. In statistics, however, the term average refers to any of the measures of central tendency. If we have a data set containing the values $x_1, x_2, \ldots, x_n$, then the arithmetic mean $\bar{x}$ is defined by the formula:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
If the data set is a statistical population (i.e., consists of every possible observation and not just a subset of them), then the mean of that population is called the population mean. If the data set is a statistical sample (a subset of the population), we call the statistic resulting from this calculation a sample mean.
The median is the midpoint of the data set. This midpoint value is the point at which half the observations are above the value and half the observations are below the value. The median is determined by ranking the $n$ observations and finding the observation at position $(n + 1)/2$ in the ranked order. If the number of observations is even, then the median is the average of the observations ranked at positions $n/2$ and $n/2 + 1$.
The median and the mean both measure central tendency. But unusual values, called outliers, affect the median less than they affect the mean. When you have unusual values, you can compare the mean and the median to decide which is the better measure to use. If your data are symmetric, the mean and median are similar.
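To make the comparison concrete, here is a small sketch using Python's standard `statistics` module. The numbers are made up for illustration; the point is only how differently the two measures react to a single extreme value.

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same values plus one outlier.
typical = [48, 50, 51, 52, 54]
with_outlier = typical + [500]

mean_typical = mean(typical)
median_typical = median(typical)

mean_outlier = mean(with_outlier)      # dragged upward by the single value 500
median_outlier = median(with_outlier)  # even count: average of the two middle values

print("without outlier:", mean_typical, median_typical)
print("with outlier:   ", mean_outlier, median_outlier)
```

In this example the mean more than doubles once the outlier is added, while the median barely moves.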
The concept of the median can be generalized to quartiles. Quartiles are the three values that divide a sample of ordered data into four equal parts: the first quartile at 25% ($Q_1$), the second quartile at 50% ($Q_2$, or the median), and the third quartile at 75% ($Q_3$).
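As a rough illustration, the quartiles of a small, made-up sample can be computed with `statistics.quantiles`. Software packages use slightly different interpolation conventions for quartiles, so other tools may return somewhat different values for the same data; the sketch below simply uses the function's default method.

```python
from statistics import median, quantiles

# Hypothetical sample of 11 observations (it does not need to be pre-sorted).
data = [13, 3, 21, 8, 5, 16, 9, 18, 7, 11, 15]

# n=4 splits the ordered data into four groups and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("median:", median(data))  # the second quartile coincides with the median
```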
The mode is the value that appears most often in a set of data. The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled.
For example, a distribution that has more than one mode may indicate that your sample includes data from more than one population. If the data contain two modes, the distribution is bimodal. If the data contain more than two modes, the distribution is multi-modal.
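A minimal sketch of finding the modes of a small, invented data set with `statistics.multimode` is shown below. It only detects values that are exactly tied for the highest count, so for continuous measurements a histogram is usually a better way to spot multiple peaks.

```python
from statistics import multimode

# Hypothetical sample in which two values are tied for most frequent.
data = [2, 3, 3, 3, 4, 5, 8, 9, 9, 9, 10]

modes = multimode(data)  # every value that shares the highest count

if len(modes) == 1:
    shape = "unimodal"
elif len(modes) == 2:
    shape = "bimodal"
else:
    shape = "multi-modal"

print(modes, shape)
```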
It is often helpful to look at the smallest and largest values in the data and at where they sit relative to the measures of central tendency.
Use the minimum and maximum to identify possible outliers or data-entry errors. One of the simplest ways to assess the spread of your data is to compare the minimum and maximum. If the maximum value is very high, even when you consider the center, the spread, and the shape of the data, investigate the cause of the extreme value.
Dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. A measure of statistical dispersion is a non-negative real number that is zero if all the data are the same and increases as the data become more diverse. Some common dispersion measures are the standard deviation, the interquartile range (IQR), the mean absolute difference and the median absolute deviation.
The standard deviation is a measure of how spread out the data are about the mean. The symbol $\sigma$ is often used to represent the standard deviation of a population, while $s$ is used to represent the standard deviation of a sample.
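If we have a data set containing the values $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$, then the sample standard deviation $s$ is defined by the formula:

$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

For a full population, the divisor $n - 1$ is replaced by $n$ and the sample mean by the population mean.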
A higher standard deviation value indicates greater spread in the data. A good rule of thumb for a [normal distribution][normal] is that approximately 68% of the values fall within one standard deviation of the mean, 95% of the values fall within two standard deviations, and 99.7% of the values fall within three standard deviations.
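The sketch below, again using only the standard library and made-up numbers, computes both the population and the sample standard deviation and then checks the rule of thumb against the exact normal distribution via `statistics.NormalDist`. The helper name `share_within` is ours, introduced only for this illustration.

```python
from statistics import NormalDist, pstdev, stdev

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(data)  # treats data as the whole population (divide by n)
sample_sd = stdev(data)       # treats data as a sample (divide by n - 1)

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print("population sd:", population_sd, "sample sd:", sample_sd)
for k in (1, 2, 3):
    print(k, "sd:", round(share_within(k), 4))
```

The sample standard deviation is always a little larger than the population version computed on the same numbers, because it divides by $n - 1$ rather than $n$.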
The interquartile range (IQR) is the distance between the first quartile ($Q_1$) and the third quartile ($Q_3$). The middle 50% of the data lie within this range.
The interquartile range can be used to describe the spread of the data. As the spread of the data increases, the IQR becomes larger. It is also used to build box plots.
The range is the difference between the largest and smallest data values in the sample. The range represents the interval that contains all the data values.
The range can be used to understand the amount of dispersion in the data. A large range value indicates greater dispersion in the data. A small range value indicates that there is less dispersion in the data. Because the range is calculated using only two data values, it is more useful with small data sets.
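To see how the range and the IQR behave differently, here is a small sketch on invented data. The helper names `value_range` and `iqr` are assumptions made for this example, and the quartiles use the default convention of `statistics.quantiles`, so other tools may give slightly different IQR values.

```python
from statistics import quantiles

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Distance between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

# Hypothetical sample, then the same sample with one extreme value appended.
data = [3, 5, 7, 8, 9, 11, 13, 15, 16, 18, 21]
data_with_outlier = data + [100]

print("range:", value_range(data), "->", value_range(data_with_outlier))
print("IQR:  ", iqr(data), "->", iqr(data_with_outlier))
```

Because the range depends only on the two most extreme values, a single outlier changes it dramatically, while the IQR shifts only slightly.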
Generally speaking, a moment is a specific quantitative measure, used in both mechanics and statistics, of the shape of a set of points. If the points represent probability density, then the zeroth moment is the total probability (i.e. one), the first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment (with an optional shift) is the kurtosis.
We have already seen the use of the first and second moments in descriptive statistics. The shape of a distribution is further described using higher moments, as described below.
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined.
For a unimodal distribution, negative skew indicates that the tail on the left side of the probability density function is longer or fatter than the right side – it does not distinguish these two kinds of shape. Conversely, positive skew indicates that the tail on the right side is longer or fatter than the left side. In multi-modal distributions and discrete distributions, skewness is very difficult to interpret.
There are two common definitions of skewness:
A. Pearson Moment Coefficient of Skewness: This refers to the third standardized moment, defined as:

$$\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, $E$ is the expectation operator, and $X$ refers to the data points.
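A minimal sketch of the moment coefficient on a finite data set follows. It replaces the expectation with a plain average over the data and uses the population (divide-by-$n$) standard deviation; statistical packages often apply a small-sample correction, so their output can differ slightly. The function name `moment_skewness` and the data are assumptions made for illustration.

```python
from statistics import fmean, pstdev

def moment_skewness(values):
    """Third standardized moment, using population (divide-by-n) moments."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean([((x - mu) / sigma) ** 3 for x in values])

# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail to the left

for name, sample in [("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    print(name, round(moment_skewness(sample), 3))
```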
B. Bowley Skewness:
Bowley skewness is a way to measure skewness purely from quartiles. One of the most popular ways to find skewness is the Pearson Mode Skewness formula. However, in order to use it you must know the mean, mode (or median) and standard deviation for your data. Sometimes you might not have that information; instead, you might have information about your quartiles.
Bowley skewness is also useful if you have extreme data values (outliers) or an open-ended distribution.
Mathematically, Bowley skewness is defined as:

$$\text{Bowley Skewness} = \frac{Q_3 + Q_1 - 2 Q_2}{Q_3 - Q_1}$$

where $Q_1$, $Q_2$ and $Q_3$ represent the first, second and third quartiles, respectively. The numerator alone, $Q_3 + Q_1 - 2 Q_2$, is an absolute measure of skewness: it gives a result in the units of your data, so it cannot be used to compare distributions measured in different units. Dividing by the interquartile range $Q_3 - Q_1$ makes the coefficient dimensionless, with values between -1 and 1, so it can be compared across distributions, much like the Pearson Mode Skewness, which is scaled by the standard deviation.
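The quartile-based calculation can be sketched as follows. The function name `bowley_skewness` and the sample are invented for illustration, and the quartiles come from the default method of `statistics.quantiles`. Since the numerator and denominator are both in the units of the data, rescaling the data (for example from kilometres to metres) leaves the ratio unchanged, and so does making the largest value even more extreme.

```python
from statistics import quantiles

def bowley_skewness(values):
    """Quartile-based skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = quantiles(values, n=4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Hypothetical right-skewed sample.
data = [1, 2, 2, 3, 3, 4, 5, 7, 10, 15, 30]

rescaled = [1000 * x for x in data]   # same data in different units
extreme = data[:-1] + [3000]          # largest value replaced by a huge outlier

print(bowley_skewness(data))
print(bowley_skewness(rescaled))
print(bowley_skewness(extreme))
```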
Kurtosis indicates how the peak and tails of a distribution differ from the normal distribution. Mathematically, it is the fourth standardized moment, defined as:

$$\text{Kurt}[X] = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, $E$ is the expectation operator, and $X$ refers to the data points.
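Use kurtosis to get an initial understanding of the general characteristics of the distribution of your data. Normally distributed data establish the baseline for kurtosis. The values discussed here are excess kurtosis, that is, the fourth standardized moment minus 3, so that the normal distribution has a value of 0. A kurtosis value of 0 is therefore consistent with the normal distribution, although it does not by itself prove normality. A kurtosis value that deviates significantly from 0 may indicate that the data are not normally distributed.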
A distribution that has a positive kurtosis value indicates that the distribution has heavier tails and a sharper peak than the normal distribution. For example, data that follow a t-distribution have a positive kurtosis value.
A distribution with a negative kurtosis value indicates that the distribution has lighter tails and a flatter peak than the normal distribution. For example, data that follow a beta distribution with first and second shape parameters equal to 2 have a negative kurtosis value.
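Here is a small sketch of excess kurtosis on finite data, using plain averages and the population standard deviation (statistical packages may apply bias corrections, so their numbers can differ). The function name `excess_kurtosis` and both samples are made up for this example.

```python
from statistics import fmean, pstdev

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, using population moments."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean([((x - mu) / sigma) ** 4 for x in values]) - 3

# Hypothetical samples.
flat = list(range(1, 10))                         # evenly spread, no real tails
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]  # mostly central, two extremes

print("flat:", round(excess_kurtosis(flat), 3))
print("heavy-tailed:", round(excess_kurtosis(heavy_tailed), 3))
```

The evenly spread sample gives a negative value, and the sample with a tight center and a few extreme values gives a positive one.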
A measure of relative standing is a measure of where a data value stands relative to the distribution of the whole data set. With an idea of relative standing, we can say things like, “You got a really high score compared to the rest of the class” or “that basketball player is unusually short”. Some common measures of relative standing are the z-score, quartiles and percentiles.
The z-score (or standard score) is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured. Observed values above the mean have positive standard scores, while values below the mean have negative standard scores.
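Mathematically, the z-score of a raw score $x$ is given by:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the population mean and $\sigma$ is the population standard deviation.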
The z-score is often used in the z-test in standardized testing – the analog of the Student's t-test for a population whose parameters are known, rather than estimated. As it is very unusual to know the entire population, the t-test is much more widely used.
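As a quick illustration of the standard score, the sketch below treats a small invented data set as the whole population and converts each value to a z-score. The function name `z_scores` is an assumption for this example.

```python
from statistics import fmean, pstdev

def z_scores(values):
    """Signed number of population standard deviations from the mean."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical scores with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
zs = z_scores(scores)

for score, z in zip(scores, zs):
    print(score, z)
```

Values below the mean come out negative, values above it positive, and a value equal to the mean has a z-score of zero.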
A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found. The term percentile and the related term, percentile rank, are often used in the reporting of scores from norm-referenced tests. For example, if a score is in the 86th percentile, it is higher than 86% of the other scores. The 25th percentile is also known as the first quartile ($Q_1$), the 50th percentile as the median or second quartile ($Q_2$), and the 75th percentile as the third quartile ($Q_3$).
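A minimal sketch of a percentile rank follows, assuming the "strictly below" convention; some definitions also count half of the tied values, so results can differ slightly between tools. The function name `percentile_rank` and the class scores are invented for illustration.

```python
def percentile_rank(values, score):
    """Percentage of observations that fall strictly below `score`."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

# Hypothetical class scores.
class_scores = [55, 60, 62, 67, 70, 71, 75, 80, 88, 95]

rank_of_88 = percentile_rank(class_scores, 88)
print(rank_of_88)  # 8 of the 10 scores are below 88
```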
Often the data that we deal with are multi-dimensional in nature. Correlation most often refers to the extent to which two variables have a linear relationship with each other. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient".
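The population correlation coefficient $\rho_{X,Y}$ between two variates $X$ and $Y$ with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$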
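On a finite data set, the coefficient is estimated by replacing expectations with averages; the $1/n$ factors cancel between numerator and denominator. The sketch below does this by hand with the standard library. The function name `pearson_r` and the data are assumptions made for illustration.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: one perfectly increasing and one perfectly decreasing line.
xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 1 for x in xs]
ys_down = [-x for x in xs]

print(pearson_r(xs, ys_up))    # close to +1
print(pearson_r(xs, ys_down))  # close to -1
```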
There are also alternative ways to measure correlation. Some common examples are rank correlation, distance correlation, polychoric correlation and the correlation ratio. Each of these measures captures a different aspect of the data and should be used with care depending on the situation.
Most correlation measures are sensitive to the manner in which $X$ and $Y$ are sampled. Dependencies tend to be stronger if viewed over a wider range of values. Sensitivity to the data distribution can be used to an advantage. For example, scaled correlation is designed to use the sensitivity to the range in order to pick out correlations between fast components of time series.
:fire:Correlation does not imply causation.:fire: If a strong correlation is observed between two variables A and B, there are several possible explanations: (a) A influences B; (b) B influences A; (c) A and B are influenced by one or more additional variables; (d) the relationship observed between A and B was a chance error.
Small correlation values do not necessarily indicate that two variables are unrelated. For example, Pearson's coefficient will underestimate the association between two variables that show a quadratic relationship. You should always examine the scatter plot as part of the EDA.
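The quadratic case is easy to reproduce. In the sketch below, `ys` is completely determined by `xs`, yet the Pearson coefficient comes out as zero because the relationship is symmetric rather than linear. The helper `pearson_r` is a hand-written estimate and, like the data, is an assumption made for this example.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: y depends on x exactly, but through a parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_quadratic = pearson_r(xs, ys)
print(r_quadratic)
```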
The correlation of two variables that both have been recorded repeatedly over time can be misleading and spurious. Time trends should be removed from such data before attempting to measure correlation. Caution should be used in interpreting results of correlation analysis when large numbers of variables have been examined, resulting in a large number of correlation coefficients.
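The effect of shared time trends can be sketched with two invented series that each rise over time but whose short-term wiggles are unrelated. Taking first differences is used here as one simple way of removing a linear trend; it is an illustrative choice, not the only detrending method. All names and data below are assumptions for this example.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def first_differences(series):
    """Change from each observation to the next; removes a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

t = range(20)
# Each series is an upward trend plus its own, unrelated wiggle.
wiggle_a = [1 if i % 2 == 0 else -1 for i in t]  # +1, -1, +1, -1, ...
wiggle_b = [1 if i % 4 < 2 else -1 for i in t]   # +1, +1, -1, -1, ...
series_a = [2 * i + w for i, w in zip(t, wiggle_a)]
series_b = [3 * i + w for i, w in zip(t, wiggle_b)]

raw_r = pearson_r(series_a, series_b)
detrended_r = pearson_r(first_differences(series_a),
                        first_differences(series_b))

print("raw:", round(raw_r, 3))
print("after differencing:", round(detrended_r, 3))
```

The raw series look almost perfectly correlated purely because both increase with time; once the trend is removed, the correlation is close to zero.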
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. The objectives of EDA are to:
- Suggest hypotheses about the causes of observed phenomena
- Assess assumptions on which statistical inference will be based
- Support the selection of appropriate statistical tools and techniques
- Provide a basis for further data collection through surveys or experiments
Typical graphical techniques used in EDA are:
- Box Plot
- Multi-Vari Chart
- Run Chart
- Pareto Chart
- Scatter Plot
- Stem-and-Leaf Plot
- Parallel Coordinates
- Odds Ratio
- Multidimensional Scaling
- Targeted Projection Pursuit
- Principal Component Analysis (PCA)
- Multi-linear PCA
- Dimensionality Reduction
- Nonlinear Dimensionality Reduction (NLDR)
Typical quantitative techniques used in EDA are:
I will go through the mathematical details of some of these techniques in future posts.