Unit I: Descriptive Statistics


Meaning of Statistics

Statistics is a branch of mathematics that deals with the collection, organization, analysis, interpretation, and presentation of numerical data. It helps in understanding large volumes of data and making informed decisions based on patterns and trends.

In simple terms, statistics helps transform raw data into meaningful information to solve real-world problems and support decision-making in fields like business, economics, healthcare, and social sciences. 

Scope of Statistics

The scope of statistics is broad, covering multiple fields and functions where data-driven decisions are essential. Below are the key areas:

Types of Statistics:

1. Descriptive Statistics

Descriptive statistics focus on summarizing and presenting data in a meaningful way, without drawing conclusions beyond the data at hand.

Key Techniques:

  • Measures of Central Tendency: Mean, Median, Mode
  • Measures of Dispersion: Range, Variance, Standard Deviation
  • Data Visualization: Tables, Bar Graphs, Pie Charts, Histograms

Applications:

  • Analyzing customer demographics for a marketing campaign
  • Reporting average sales figures for a specific period

2. Inferential Statistics

Inferential statistics involve making predictions, inferences, or generalizations about a population based on sample data.

Key Techniques:

  • Hypothesis Testing: Assesses whether sample data provide enough evidence to reject a claim about a population parameter
  • Confidence Intervals: Estimate a range that is likely to contain a population parameter
  • Regression Analysis: Models the relationship between variables
  • Probability Distributions: For example, the Normal and Binomial distributions

Applications:

  • Forecasting market trends based on survey results
  • Determining the effectiveness of a new product using sample feedback

Functions of Statistics

Limitations of Statistics

Measures of Central Tendency – Mean, Median, Mode

1. Mean (Arithmetic Mean)

The mean is the average of a dataset, calculated by dividing the sum of all data values by the number of observations.

Formula:

$$\text{Mean} = \frac{\sum X}{N}$$

Where:

  • X = Data values
  • N = Number of observations

Types of Mean:

  • Simple Mean: For ungrouped data
  • Weighted Mean: When different values have different weights

Example:

Dataset: {12, 15, 18, 20, 25}

$$\text{Mean} = \frac{12 + 15 + 18 + 20 + 25}{5} = \frac{90}{5} = 18$$
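
The calculation above can be checked with a short Python sketch. The simple mean uses the dataset from the example; the weights used for the weighted mean are hypothetical and chosen only to illustrate the idea.

```python
from statistics import fmean

data = [12, 15, 18, 20, 25]

# Simple mean: sum of the values divided by the number of observations.
simple_mean = sum(data) / len(data)
print(simple_mean)   # 18.0
print(fmean(data))   # 18.0 (standard-library equivalent)

# Weighted mean: each value counts in proportion to its weight.
# These weights are hypothetical, not taken from the notes.
weights = [1, 1, 2, 2, 4]
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)
print(weighted_mean)  # 20.3
```

Because the largest value carries the largest weight here, the weighted mean is pulled above the simple mean.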

Advantages:

  • Easy to calculate
  • Considers all data points

Disadvantages:

  • Sensitive to extreme values (outliers)

2. Median

The median is the middle value when the data is arranged in ascending or descending order.

Calculation:

  1. Arrange the data in order.
  2. If the number of observations is odd, the median is the middle value.
  3. If the number of observations is even, the median is the average of the two middle values.

Formula for Even Data:

$$\text{Median} = \frac{X_{(N/2)} + X_{(N/2 + 1)}}{2}$$

Here X_{(k)} denotes the k-th value in the ordered data.

Example:

Dataset: {10, 15, 20, 25, 30}
Median = 20 (middle value)

For Even Dataset: {10, 15, 20, 25}
Median = (15 + 20) / 2 = 17.5
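
As a minimal sketch, the odd/even rule can be written as a small Python function. The name `median_by_rule` is illustrative; the standard library's `statistics.median` follows the same rule.

```python
from statistics import median


def median_by_rule(values):
    """Return the median using the odd/even rule described above."""
    ordered = sorted(values)      # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: the single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [10, 15, 20, 25, 30]
even_data = [10, 15, 20, 25]

print(median_by_rule(odd_data))   # 20
print(median_by_rule(even_data))  # 17.5
print(median(even_data))          # 17.5
```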

Advantages:

  • Not affected by outliers
  • Suitable for skewed distributions

Disadvantages:

  • Ignores extreme data points
  • Less suitable for advanced statistical analysis

3. Mode

The mode is the most frequently occurring value in a dataset.

Calculation:

Count how often each value occurs and pick the value with the highest frequency.

Example:

Dataset: {12, 14, 14, 15, 18, 18, 18, 20}
Mode = 18 (appears 3 times)
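
A short Python sketch of the counting step, using the dataset above. The second dataset, used to show a bimodal case, is hypothetical.

```python
from collections import Counter
from statistics import multimode

data = [12, 14, 14, 15, 18, 18, 18, 20]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(data)
mode_value, frequency = counts.most_common(1)[0]
print(mode_value, frequency)  # 18 3

# multimode returns every value tied for the highest frequency.
print(multimode(data))        # [18]

# Hypothetical bimodal dataset: 1 and 2 both occur twice.
bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))     # [1, 2]
```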

Advantages:

  • Useful for categorical data
  • Easy to understand

Disadvantages:

  • May not exist if no value repeats
  • Can have multiple modes (bimodal, multimodal)

Quartiles, Measures of Dispersion 

1. Quartiles (Q1, Q2, Q3)

Quartiles divide ordered data into four equal parts.

  • First Quartile (Q1): 25% of data lies below this value.
  • Second Quartile (Q2 or Median): 50% of data lies below this value.
  • Third Quartile (Q3): 75% of data lies below this value.

Calculation Example:

Dataset: {12, 15, 18, 20, 25, 30, 35, 40}
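Step 1: Arrange the data: already sorted, with n = 8.
Step 2: Q1: position (n+1)/4 = (8+1)/4 = 2.25 → interpolate a quarter of the way from the 2nd value (15) to the 3rd (18) → Q1 = 15 + 0.25 × (18 - 15) = 15.75
Step 3: Q2 (Median): average of the two middle values, the 4th (20) and the 5th (25) → Q2 = (20 + 25)/2 = 22.5
Step 4: Q3: position 3(n+1)/4 = 6.75 → interpolate three quarters of the way from the 6th value (30) to the 7th (35) → Q3 = 30 + 0.75 × (35 - 30) = 33.75

Note: quartile conventions differ. Taking Q1 and Q3 as the medians of the lower half {12, 15, 18, 20} and the upper half {25, 30, 35, 40} gives Q1 = 16.5 and Q3 = 32.5 instead.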
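
The position rule k(n+1)/4 with linear interpolation can be sketched in Python as follows. The helper `value_at_position` is a hypothetical name introduced for illustration. In the standard library, `statistics.quantiles` with `method="exclusive"` uses the same (n+1) positions, so it serves as a cross-check.

```python
from statistics import quantiles


def value_at_position(ordered, position):
    """Linearly interpolate at a 1-based, possibly fractional, position."""
    lower = int(position)          # rank of the value just below the position
    fraction = position - lower    # how far to move toward the next value
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = sorted([12, 15, 18, 20, 25, 30, 35, 40])
n = len(data)

q1 = value_at_position(data, 1 * (n + 1) / 4)  # position 2.25
q2 = value_at_position(data, 2 * (n + 1) / 4)  # position 4.5
q3 = value_at_position(data, 3 * (n + 1) / 4)  # position 6.75
print(q1, q2, q3)  # 15.75 22.5 33.75

# Cross-check against the standard library's (n + 1)-based method.
print(quantiles(data, n=4, method="exclusive"))  # [15.75, 22.5, 33.75]
```

Other conventions (for example, medians of the lower and upper halves) give slightly different quartiles for small datasets, so it is worth stating which rule is used.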

2. Range

Measures the spread of data from the smallest to the largest value.

Formula:

$$\text{Range} = \max(X) - \min(X)$$

Example:

Dataset: {12, 18, 25, 35, 40}
Range = 40 - 12 = 28
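
A brief Python sketch of the same calculation. The extra value 400 is a hypothetical outlier, added only to show how strongly a single extreme value changes the range.

```python
data = [12, 18, 25, 35, 40]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)
print(data_range)  # 28

# Hypothetical outlier: one extreme value dominates the range.
with_outlier = data + [400]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)  # 388
```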

Advantages:

  • Simple and quick

Disadvantages:

  • Ignores data distribution and is sensitive to outliers

3. Interquartile Range (IQR)

The range of the middle 50% of the data.

Formula:

$$\text{IQR} = Q_3 - Q_1$$

Example:

Q3 = 32.5, Q1 = 16.5
IQR = 32.5 - 16.5 = 16
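
The example does not state which dataset or quartile rule produced Q1 = 16.5 and Q3 = 32.5. As an assumption, the sketch below uses the quartiles dataset {12, 15, 18, 20, 25, 30, 35, 40} and takes Q1 and Q3 as the medians of the lower and upper halves, which reproduces those values.

```python
from statistics import median

# Assumed dataset: the one used in the quartiles example.
data = sorted([12, 15, 18, 20, 25, 30, 35, 40])

half = len(data) // 2
lower_half = data[:half]    # [12, 15, 18, 20]
upper_half = data[-half:]   # [25, 30, 35, 40]; skips the middle value if n is odd

q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1
print(q1, q3, iqr)  # 16.5 32.5 16.0
```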

Advantage:

  • Not influenced by extreme values

Disadvantage:

  • Requires sorting and calculation of quartiles

4. Mean Deviation (Absolute Mean Deviation)

Measures the average of the absolute deviations from the mean.

Formula:

$$\text{Mean Deviation} = \frac{\sum |X_i - \text{Mean}|}{N}$$

Example:

Dataset: {10, 20, 30}
Mean = (10 + 20 + 30) / 3 = 20
Mean Deviation = (|10 - 20| + |20 - 20| + |30 - 20|)/3 = (10 + 0 + 10)/3 = 20/3 ≈ 6.67
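
A minimal Python check of this arithmetic: the absolute deviations are 10, 0, and 10, so their average is 20/3.

```python
data = [10, 20, 30]

mean = sum(data) / len(data)

# Average of the absolute distances from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
print(round(mean_deviation, 2))  # 6.67
```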

Advantage:

  • Simple to compute and expressed in the same units as the data

Disadvantage:

  • Less useful for advanced statistical models

5. Standard Deviation (SD)

Measures how far the data values typically lie from the mean, computed as the square root of the average squared deviation.

Formula:

$$\text{SD} = \sqrt{\frac{\sum (X_i - \text{Mean})^2}{N}}$$

Example:

Dataset: {10, 20, 30}
Mean = 20
$$\text{SD} = \sqrt{\frac{(10 - 20)^2 + (20 - 20)^2 + (30 - 20)^2}{3}} = \sqrt{\frac{200}{3}} \approx 8.16$$
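
The formula divides by N, which corresponds to the population standard deviation. A short Python sketch computes it step by step and compares it with `statistics.pstdev`, the standard-library function that also divides by N.

```python
from math import sqrt
from statistics import pstdev

data = [10, 20, 30]

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]  # [100.0, 0.0, 100.0]

# Square root of the average squared deviation (divide by N).
sd = sqrt(sum(squared_deviations) / len(data))
print(round(sd, 2))            # 8.16
print(round(pstdev(data), 2))  # 8.16
```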

Advantage:

  • Widely used in data analysis

Disadvantage:

  • Sensitive to outliers

6. Variance

The square of the standard deviation.

Formula:

$$\text{Variance} = \frac{\sum (X_i - \text{Mean})^2}{N}$$

Example:

Variance = 200/3 ≈ 66.67 for the dataset {10, 20, 30}
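
As a quick check in Python, `statistics.pvariance` divides by N like the formula above, and squaring the population standard deviation gives the same number.

```python
from statistics import pstdev, pvariance

data = [10, 20, 30]

variance = pvariance(data)      # average squared deviation (divide by N)
print(round(variance, 2))       # 66.67

# Variance is the square of the standard deviation.
sd_squared = pstdev(data) ** 2
print(round(sd_squared, 2))     # 66.67
```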

Advantage:

  • Useful for understanding variability

Disadvantage:

  • Hard to interpret directly due to squared units

7. Coefficient of Variation (CV)

Indicates relative variability compared to the mean.

Formula:

$$\text{CV} = \left(\frac{\text{SD}}{\text{Mean}}\right) \times 100$$

Example:

SD = 8.16, Mean = 20
CV ≈ 40.8%
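
The values SD = 8.16 and Mean = 20 match the dataset {10, 20, 30} used in the standard deviation example. Assuming that dataset, a Python sketch of the calculation is:

```python
from statistics import fmean, pstdev

# Assumed dataset: the one from the standard deviation example.
data = [10, 20, 30]

# Standard deviation as a percentage of the mean.
cv = pstdev(data) / fmean(data) * 100
print(round(cv, 1))  # 40.8
```

Because CV has no units, it can be compared across datasets measured on different scales.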

Advantage:

  • Useful for comparing variability across datasets

Disadvantage:

  • Not suitable for data with negative values or a mean close to zero

8. Skewness

Measures the asymmetry of the distribution.

Types:

  • Positive Skew: Right tail is longer
  • Negative Skew: Left tail is longer
  • Zero Skew: Symmetrical data

Formula:

$$\text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{SD}}$$

Example:

A dataset with Mean > Median gives a positive value under this formula, which indicates positive (right) skew.
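
The median-based skewness formula, 3 × (Mean - Median) / SD, can be sketched in Python as below. The right-skewed dataset is hypothetical, chosen so that one large value pulls the mean above the median; the symmetric dataset is the one from the standard deviation example.

```python
from statistics import fmean, median, pstdev


def median_skewness(values):
    """3 * (mean - median) / SD, using the population SD (divide by N)."""
    return 3 * (fmean(values) - median(values)) / pstdev(values)


# Hypothetical right-skewed data: mean 20, median 14.
right_skewed = [10, 12, 14, 16, 48]
print(round(median_skewness(right_skewed), 2))  # 1.27

# Symmetric data: mean equals median, so the skewness is zero.
symmetric = [10, 20, 30]
print(median_skewness(symmetric))  # 0.0
```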

9. Kurtosis

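Measures the heaviness of the tails of a distribution relative to a normal distribution, often described loosely as its "peakedness" or "flatness".

Types:

  • Leptokurtic (Kurtosis > 3): Sharp peak, heavy tails
  • Mesokurtic (Kurtosis = 3): Normal distribution
  • Platykurtic (Kurtosis < 3): Flat peak, light tails

Formula:

$$\text{Kurtosis} = \frac{\sum (X_i - \text{Mean})^4}{N \times \text{SD}^4}$$

Subtracting 3 from this value gives the excess kurtosis, which is 0 for a normal distribution, positive for leptokurtic data, and negative for platykurtic data.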

Example:

If the dataset has many extreme values, it will be leptokurtic.
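
A minimal Python sketch of the fourth-moment calculation, reusing the dataset {10, 20, 30} from the standard deviation example. It prints both the raw ratio Σ(Xi - Mean)⁴ / (N × SD⁴), which equals 3 for a normal distribution, and the excess value obtained by subtracting 3.

```python
data = [10, 20, 30]
n = len(data)

mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n        # SD squared
fourth_moment = sum((x - mean) ** 4 for x in data) / n   # average 4th-power deviation

kurtosis = fourth_moment / variance ** 2                 # SD^4 equals variance squared
excess_kurtosis = kurtosis - 3

print(round(kurtosis, 2), round(excess_kurtosis, 2))     # 1.5 -1.5
```

With a raw kurtosis of 1.5 (below 3), this small, evenly spread dataset is platykurtic.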
