Measure of Centrality:
In this explainable Academy article, we will learn more about measures of centrality. Let's start with why we need them.
Drawing plots can give a good visual summary, but in some situations we want an even more succinct summary (say, a single number or a few numbers).
Recall : A population parameter is any numeric property of the entire population under study, and a statistic is any numeric property of a sample of the population, which is used as an estimate of the corresponding population parameter.
Let’s see some statistics (namely summary statistics) :
Summary Statistics
Summary statistics are used for quantitative data, for example ➡️
✅ Measures of centrality (mean, median, mode)
✅ Percentiles (quartiles, quintiles, deciles)
✅ Measures of spread (range, IQR, variance, standard deviation)
Different measures of Centrality
Do we really need these statistics, or can we work without them? Let's look at some questions and see how we can answer them :
What is the typical value of an attribute in our database?
How many runs does Virat Kohli typically score in a match?
How many balls does Virat typically face in a match?
The answers to the above questions are difficult to read off a plot. Thus we generally use some measure of centrality, such as the mean or the median.
Measure of Centrality : Mean
☑️ Mean is the sum of all the elements in the data divided by total number of elements.
Notation:
- The n data points in a sample are denoted as : \(x_1, x_2, x_3, \ldots, x_n\)
- The mean of a sample is denoted as : \(\bar{x}\)
- The mean of a population is denoted as : \(\mu\)
The formula for the mean is :
Mean : \(\frac{Sum\:of\:all\:elements}{Total\:number\:of\:elements} = \frac{x_1+ x_2+ x_3+.......+x_n}{n}\)
Thus, \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\)
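The formula above can be sketched in a few lines of Python. The helper name `mean` and the sample values are hypothetical, chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of all elements divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical sample, not taken from the article
sample = [2, 4, 6, 8]
print(mean(sample))  # (2 + 4 + 6 + 8) / 4 = 5.0
```

In practice, the standard library's `statistics.mean` does the same job.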
Measure of Centrality : Median
☑️ Median is the value which appears at the center of the data when the data is sorted.
When n is odd
The median is the value at the central location (or mid-point).
Center location of data : the element at position \(\frac{n+1}{2}\) of the sorted data, i.e. median = \(x_{\frac{n+1}{2}}\)
When n is even
The median is the average of the values at the two central locations.
Center location of data : the mean of the elements at positions \(\frac{n}{2}\) and \(\frac{n}{2}+1\) of the sorted data, i.e. median = \(\frac{x_{\frac{n}{2}}+x_{\frac{n}{2}+1}}{2}\)
☑️ In both the cases, there are equal numbers of elements on either side of the central location.
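The odd and even cases can be sketched as follows. The function name `median` and the two small samples are hypothetical. Note that the positions in the formulas are 1-based, while Python lists are 0-based.

```python
def median(values):
    """Median of a data set: the middle value of the sorted data."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[mid]
    # n even: 1-based positions n / 2 and n / 2 + 1 are 0-based mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # sorted: 1, 5, 7 -> 5
print(median([7, 1, 5, 3]))  # sorted: 1, 3, 5, 7 -> (3 + 5) / 2 = 4.0
```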
Measure of Centrality : Mode
☑️ The Mode is defined as the most frequently occurring value in the dataset.
Single mode : (only 1 most frequent value)
example : 5, 10, 36, 2, 3, 7, 88, 99, 35, 5, 45, 6, 5
- In the above case there is a single mode, i.e. 5.
Multi mode : (more than 1 most frequent value)
example : 1, 2, 2, 2, 3, 5, 5, 5, 5, 5, 6, 7, 6, 8, 14, 14, 14, 14, 14, 16
- In the above case there are two modes, 5 and 14.
No mode : (all values appear only once)
example : 1, 2, 3, 5, 6, 7, 8, 14, 16
- In this case we could say either that all values are modes or that there is no mode.
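The three cases can be checked with a small frequency count. This is a minimal sketch: the helper name `modes` is hypothetical, and it follows the "all values are modes" convention when nothing repeats.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, count in counts.items() if count == highest)


single = [5, 10, 36, 2, 3, 7, 88, 99, 35, 5, 45, 6, 5]
multi = [1, 2, 2, 2, 3, 5, 5, 5, 5, 5, 6, 7, 6, 8, 14, 14, 14, 14, 14, 16]
no_repeats = [1, 2, 3, 5, 6, 7, 8, 14, 16]

print(modes(single))      # [5]
print(modes(multi))       # [5, 14]
print(modes(no_repeats))  # every value, since each one appears exactly once
```

To report "no mode" instead, a caller could treat a result as empty whenever the highest frequency is 1.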
Characteristics of Measures of Centrality
Mean is the center of gravity
Let Data and Mean be defined as,
Data : \(x_1, x_2, x_3, .... , x_n\)
Mean : \(\bar{x}\)
🌛 The deviation of a point from the mean is defined as the difference between this point and the mean
Deviation : \(x_i - \bar{x}\)
The sum of the deviations of all points from the mean is 0.
sum of deviations = \(\sum_{i=1}^{n}(x_i - \bar{x})\)
= \((x_1 - \bar{x})+(x_2 - \bar{x})+......+(x_n - \bar{x})\)
= \((x_1+x_2+x_3+x_4+....+x_n) - (\bar{x}+\bar{x}+\bar{x}+.....n \:times)\)
= \(\sum_{i=1}^{n}x_i - n\bar{x}\)
= \(\sum_{i=1}^{n}x_i - n\frac{1}{n}\sum_{i=1}^{n}x_i = 0\)
What is the physical interpretation of the above result?
The mean is called the center of gravity of the data.
Sum of the deviations from the mean = 0
i.e. at the mean, the total deviation of the points on the left side equals, in magnitude, the total deviation of the points on the right side.
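A quick numerical check of this balance property, on a small hypothetical sample (the values are chosen only for illustration). `math.isclose` is used because floating-point sums are not always exactly zero.

```python
import math

# Hypothetical sample, not taken from the article
data = [2, 7, 7, 10, 14]
x_bar = sum(data) / len(data)  # 40 / 5 = 8.0

deviations = [x - x_bar for x in data]
total = sum(deviations)

# Total distance to the mean on each side of it
left = sum(-d for d in deviations if d < 0)
right = sum(d for d in deviations if d > 0)

print(math.isclose(total, 0.0, abs_tol=1e-9))  # True
print(left, right)                             # 8.0 8.0
```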
Sensitivity of the Measures of Centrality to Outliers
Mean & Median
Informally, we define an outlier as any point which is far off from the other values in the data.
We will learn a more formal definition when we study percentiles.
Let’s take an example of Runs scored in a cricket match by some players.
Runs scored by Rohit Sharma in 9 innings
Runs : 2, 7, 7, 10, 14, 16, 37, 39, 244
Mean = 41.78 ; Median = 14
Runs scored by Virat Kohli in 9 innings
Runs : 1, 9, 14, 15, 51, 58, 61, 67, 83
Mean = 39.89 ; Median = 51
Which player performed better in the series?
As we can see, a single double century pulls Rohit Sharma's mean up a lot. If we look closely, though, he has not performed as consistently as Virat Kohli.
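The two summaries can be reproduced with Python's standard `statistics` module. The variable names are illustrative; the runs are the ones listed above.

```python
from statistics import mean, median

# Runs from the two series above
rohit = [2, 7, 7, 10, 14, 16, 37, 39, 244]
virat = [1, 9, 14, 15, 51, 58, 61, 67, 83]

for name, runs in (("Rohit Sharma", rohit), ("Virat Kohli", virat)):
    # The mean uses every value; the median only looks at the middle of the sorted runs
    print(f"{name}: mean = {mean(runs):.2f}, median = {median(runs)}")
```

The means are close to each other, while the medians are far apart, which is what the comparison above relies on.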
What if we drop the outlier?
Rohit Sharma : 2, 7, 7, 10, 14, 16, 37, 39, 244
Old Mean = 41.78 ; Old Median = 14
New Mean = 16.5 ; New Median = 12 (after dropping 244)
Thus, performance-wise, the median gives a more realistic view. The mean is very sensitive to outliers, whereas the median is much less so.
The reason is that an outlier adds a large value to the numerator of the mean, whereas the central point of the sorted data mostly stays where it was even when outliers are present, so the median is barely affected.
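A minimal sketch of how far each measure moves when the single largest score is dropped. It assumes that the outlier is simply the maximum value, which holds for this series but is not a general outlier rule.

```python
from statistics import mean, median

rohit = [2, 7, 7, 10, 14, 16, 37, 39, 244]

# Assumption for this sketch: the outlier is the single largest value (244)
without_outlier = sorted(rohit)[:-1]

mean_shift = mean(rohit) - mean(without_outlier)
median_shift = median(rohit) - median(without_outlier)

print(f"mean moves by {mean_shift:.2f}")  # 41.78 -> 16.5
print(f"median moves by {median_shift}")  # 14 -> 12
```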
🔥 To reduce the sensitivity to outliers, it is advised to compute the trimmed mean
Trimmed mean :
The trimmed mean is computed by dropping the k most extreme elements from each side of the sorted data (note that we need to drop the same number of elements from both sides).
Rohit Sharma : 2, 7, 7, 10, 14, 16, 37, 39, 244
Mean = 41.78 ; Trimmed mean = 18.57 (calculated with k = 1, i.e. after removing 2 from the start and 244 from the end)
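The trimmed mean can be sketched as below. The function name `trimmed_mean` and its parameter `k` (the number of elements dropped from each end) are illustrative choices, not a standard-library API.

```python
def trimmed_mean(values, k):
    """Mean of the data after dropping the k smallest and k largest elements."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one element")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


rohit = [2, 7, 7, 10, 14, 16, 37, 39, 244]
print(round(trimmed_mean(rohit, 0), 2))  # 41.78 (k = 0 is the ordinary mean)
print(round(trimmed_mean(rohit, 1), 2))  # 18.57 (2 and 244 dropped)
```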
Sensitivity to outliers (Mode) :
👾 Mode is not sensitive to outliers (unless the mode itself is the outlier)
SUMMARY :
Mean is sensitive to outliers whereas median and mode are not.
It is often a good idea to compute a trimmed mean by dropping the same number of elements from both extremes.
What do the measures of Centrality look like for different types of distributions?
Perfectly symmetric distribution
If x is the central location in the data, then for every element (x-i) in the data there is also a corresponding element (x+i). Such a distribution is called a perfectly symmetric distribution.
Can we say something interesting about the mean, median and mode?
mean = median = mode
For such a distribution, the mean, the median and the mode are all equal.
✅ mode corresponds to the tallest bar
✅ median also corresponds to the tallest bar, with an equal number of elements on either side.
In this case we are considering a unimodal symmetric distribution (only one mode).
What about a bimodal distribution?
mean = median
If we have a perfectly symmetric distribution (unimodal, bimodal or multimodal), then Mean = Median. If, in addition, it is unimodal, then Mean = Median = Mode.
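Both claims can be checked on two small data sets that are symmetric around 3. The data sets are hypothetical, built only for illustration. `statistics.multimode` (Python 3.8+) returns all the most frequent values.

```python
from statistics import mean, median, multimode

# Hypothetical data, both perfectly symmetric around 3
unimodal = [1, 2, 2, 3, 3, 3, 4, 4, 5]
bimodal = [1, 1, 1, 2, 3, 4, 5, 5, 5]

for label, data in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(label, mean(data), median(data), multimode(data))
# unimodal: mean 3, median 3, modes [3]
# bimodal:  mean 3, median 3, modes [1, 5]
```

In the bimodal case the mean and median still coincide, but neither of the two modes equals them.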
Skewed Distributions :
Right Skewed
✅ Right-skewed distribution : has a long tail to the right (also called positively skewed)
Left Skewed
✅ Left-skewed distribution : has a long tail to the left (also called negatively skewed)
Without computing it, can we say where the mean, median and mode would be?
Left Skewed Distribution :
Note
✅ Observation: Mean < Median < Mode
Right Skewed Distribution :
Note
✅ Observation: Mean > Median > Mode
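The two observations can be illustrated on a small hypothetical data set with a long right tail, and on its mirror image. This is one example under those assumptions, not a proof of the general rule.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, a few are large
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10]
# Mirroring the values around 5.5 gives a left-skewed data set
left_skewed = [11 - x for x in right_skewed]

r_mean, r_median, r_mode = mean(right_skewed), median(right_skewed), mode(right_skewed)
l_mean, l_median, l_mode = mean(left_skewed), median(left_skewed), mode(left_skewed)

print(r_mean > r_median > r_mode)  # True: about 3.33 > 2 > 1
print(l_mean < l_median < l_mode)  # True: about 7.67 < 9 < 10
```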
Is this always true?
☑️ Almost Always
Left-skewed distribution, but mean > median
The observation does not hold for some skewed distributions which have a heavy tail on the other side (in this case, the right).
🔥 The generic observation holds true if the distribution doesn’t have a heavy tail on the other side.
SUMMARY
🦖 Left Skewed : Mean < Median < Mode
Right Skewed : Mean > Median > Mode
note
🔥 (This is almost always true except for some cases where there is a heavy tail on the other side or if the distribution is Bi-Modal)
Effect of Transformations on the measures of centrality
🔅Effect on Mean :
Scaling and Shifting :
Let the new data \(x_{new}\) be defined as : \(x_{new} = a*x+c\), where a is the scaling factor and c is the shift.
Special Case :
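Prove that if \(x_{new} = a*x+c\), i.e. if the data is scaled and shifted, then the effect on the mean is : \(\bar{x}_{new}=a*\bar{x}+c\)
Proof :
\(\bar{x}_{new} = \frac{1}{n}\sum_{i=1}^{n}(a*x_i+c)\)
= \(a*\frac{1}{n}\sum_{i=1}^{n}x_i + \frac{1}{n}*n*c\)
= \(a*\bar{x}+c\)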
Effect of scaling on Median :
The position of the median in the sorted data does not change; only its value gets scaled, so the new median is a times the old median.
Effect of scaling on Mode :
The scaled value of the mode will be the new mode
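A small numerical check of these effects. The data and the constants `a` and `c` are hypothetical. The sketch applies both a scale and a shift; setting `c = 0` gives the pure scaling case discussed above.

```python
from statistics import mean, median, mode

# Hypothetical sample and transformation constants
data = [2, 7, 7, 10, 14]
a, c = 3, 5

new_data = [a * x + c for x in data]  # [11, 26, 26, 35, 47]

print(mean(new_data) == a * mean(data) + c)      # True: 29 == 3 * 8 + 5
print(median(new_data) == a * median(data) + c)  # True: 26 == 3 * 7 + 5
print(mode(new_data) == a * mode(data) + c)      # True: 26 == 3 * 7 + 5
```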