Descriptive Statistics : Mean, Median and Mode using Python

Descriptive statistics describe the basic features of a data set. They simplify and summarize large amounts of data in a meaningful and sensible manner. Descriptive statistics, however, do not allow us to draw conclusions beyond the data we have analyzed; they are simply a way to describe data.

Typically there are two ways to describe the data statistically :-

Measures of central tendency :- These are ways to describe the central position of a frequency distribution. The common ones are the mean, median and mode.

Measures of spread :- These are ways of summarizing a group of data by describing how spread out the scores are. As an example, the mean score of students in a class may be 70, but not all students will have scored 70; some will have scored less and some more. Measures of spread summarize how spread out these scores are. The main ones are range, quartiles, absolute deviation, variance and standard deviation.

In this post, the measures of central tendency and their implementation in Python will be covered.

Mean : The mean, also known as the average, is calculated by dividing the sum of all the values in a data set by the number of values.
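As a minimal sketch of this definition, the snippet below computes the mean by hand and compares it with `statistics.mean` from the Python standard library. The `scores` list is made-up data, used only for illustration.

```python
from statistics import mean

# Hypothetical scores, invented for this illustration
scores = [60, 65, 70, 75, 80]

# Mean by definition: sum of the values divided by the number of values
manual_mean = sum(scores) / len(scores)

# Same calculation using the standard library
library_mean = mean(scores)

print(manual_mean, library_mean)
```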

Median : The median, in simple terms, is the middle value of the data set. It is calculated by arranging the numbers in ascending order and selecting the middle element. If there is an odd number of elements, the middle element is the median. As an example, for a data set with 11 elements, the 6th element is the median, which divides the data set into two equal parts. For a data set with an even number of elements, the median is the mean of the two middle elements. So, for a data set with 10 elements, the median is the mean of the 5th and 6th elements.
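The odd and even cases can be sketched as follows. The helper `median_by_hand` and both data sets are hypothetical, written only to mirror the 11-element and 10-element examples; `statistics.median` is used as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Return the median by sorting and picking the middle element(s)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element
        return ordered[mid]
    # Even count: mean of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data: 11 elements, so the 6th sorted element is the median
odd_data = [12, 5, 7, 3, 9, 15, 1, 11, 8, 4, 10]
# Hypothetical data: 10 elements, so the median is the mean of the 5th and 6th
even_data = [12, 5, 7, 3, 9, 15, 1, 11, 8, 4]

print(median_by_hand(odd_data), median(odd_data))
print(median_by_hand(even_data), median(even_data))
```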

Mode : The mode is the most frequently occurring item in a data set, i.e. the value that occurs most often.
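One way to find the mode is to count how often each value occurs, as in the sketch below. The `grades` and `tied` lists are made-up data; `collections.Counter`, `statistics.mode` and `statistics.multimode` (Python 3.8+) come from the standard library.

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical categorical data, invented for this illustration
grades = ["B", "A", "C", "B", "A", "B"]

# Count occurrences and take the most frequent value
counts = Counter(grades)
most_common_value, frequency = counts.most_common(1)[0]

# The standard library gives the same answer directly
library_mode = mode(grades)

# A data set can have more than one mode; multimode returns all of them
tied = [1, 1, 2, 2, 3]
all_modes = multimode(tied)

print(most_common_value, frequency, library_mode, all_modes)
```

Because the mode only needs counting, it also works for non-numeric values such as the letter grades used here.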

Now that we have an idea about Mean, Median and Mode, let's see how we can calculate these using Python. We will use the Python libraries Pandas and Stats for computing these values.

The next question is which measure of central tendency to use.

- If the data is Categorical (Nominal or Ordinal), use Mode.
- If the data is quantitative, use Mean or Median.
- If there are outliers or highly skewed data, use Median over Mean.

As you can see from the sample code, we have outliers for Subject 2, where students 8 & 9 scored significantly higher than the rest of the class. For Subject 2, the mean is 21.7 and the median is 3.5, so in this case it makes sense to use the median over the mean.
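To illustrate how outliers pull the mean away from the median, here is a small sketch. The `subject2` scores are hypothetical: they are not the data from the post, only values chosen so that they give the same mean and median as the figures quoted above. The standard library `statistics` module is used to keep the example dependency-free.

```python
from statistics import mean, median

# Hypothetical scores (NOT the original data): eight low scores
# and two outliers that are far higher than the rest
subject2 = [1, 2, 3, 3, 3, 4, 4, 5, 96, 96]

subject2_mean = mean(subject2)      # pulled upwards by the two outliers
subject2_median = median(subject2)  # stays close to the typical score

# The same data without the two outliers, for comparison
without_outliers = subject2[:-2]
trimmed_mean = mean(without_outliers)

print(subject2_mean, subject2_median, trimmed_mean)
```

With the two outliers removed, the mean falls close to the median, which is why the median is the safer summary when outliers are present.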

Percentile is another commonly used concept: it is the value below which a certain percentage of scores fall. As an example, if you scored 75 out of 80 and are at the 90th percentile, you performed better than 90% of the class.
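A rough sketch of this idea is shown below. The helper `percentile_rank` and the `class_scores` list are hypothetical, and the definition used (percentage of scores strictly below a value) is one of several conventions in use.

```python
def percentile_rank(scores, value):
    """Percentage of scores that fall strictly below the given value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

# Hypothetical class of 10 students, scores out of 80
class_scores = [40, 45, 50, 55, 58, 60, 62, 66, 70, 75]

# Nine of the ten scores are below 75
rank_of_75 = percentile_rank(class_scores, 75)

print(rank_of_75)
```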

For quartiles, we divide the data set into 4 quarters, each holding 25% of the data. The first quartile, or Q1, is the value such that 25% of the data points are less than it and 75% are greater. The second quartile, or Q2, is the value such that 50% of the values are less and 50% are greater, which makes it the same as the median. The third quartile, or Q3, is the value such that 75% of the values are less than it and 25% are greater.
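Quartiles can be sketched with `statistics.quantiles` from the standard library (Python 3.8+). The `data` list is made up for illustration. Note that libraries use slightly different interpolation conventions, so the cut points from other tools may differ a little on small data sets.

```python
from statistics import median, quantiles

# Hypothetical data set with 11 values
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

# n=4 splits the data into four quarters and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)

# The second quartile coincides with the median
data_median = median(data)

print(q1, q2, q3, data_median)
```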

Sarbjit Singh
