Unit I : Descriptive Statistics
Meaning of Statistics
Statistics is a branch of mathematics that deals with the collection, organization, analysis, interpretation, and presentation of numerical data. It helps in understanding large volumes of data and making informed decisions based on patterns and trends.
In simple terms, statistics helps transform raw data into meaningful information to solve real-world problems and support decision-making in fields like business, economics, healthcare, and social sciences.
Scope of Statistics
The scope of statistics is broad, covering multiple fields and functions where data-driven decisions are essential. Below are the key areas:
Types of Statistics:
1. Descriptive Statistics
Descriptive statistics focus on summarizing and presenting data in a meaningful way, without drawing conclusions beyond the available data.
Key Techniques:
- Measures of Central Tendency: Mean, Median, Mode
- Measures of Dispersion: Range, Variance, Standard Deviation
- Data Visualization: Tables, Bar Graphs, Pie Charts, Histograms
Applications:
- Analyzing customer demographics for a marketing campaign
- Reporting average sales figures for a specific period
2. Inferential Statistics
Inferential statistics involve making predictions, inferences, or generalizations about a population based on sample data.
Key Techniques:
- Hypothesis Testing: Tests whether a claim about a population parameter is supported by the sample data
- Confidence Intervals: Estimates the range in which a population parameter is likely to lie
- Regression Analysis: Models the relationship between variables
- Probability Distributions: Normal and Binomial distributions
Applications:
- Forecasting market trends based on survey results
- Determining the effectiveness of a new product using sample feedback
Limitations of Statistics
Measures of Central Tendency
1. Mean (Arithmetic Mean)
The mean is the average of a dataset, calculated by dividing the sum of all data values by the number of observations.
Formula:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where:
- $x_i$ = Data values
- $n$ = Number of observations
Types of Mean:
- Simple Mean: For ungrouped data
- Weighted Mean: When different values have different weights
Example:
Dataset: {12, 15, 18, 20, 25}
Mean = (12 + 15 + 18 + 20 + 25) / 5 = 90 / 5 = 18
Advantages:
- Easy to calculate
- Considers all data points
Disadvantages:
- Sensitive to extreme values (outliers)
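The simple and weighted means described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names and the weights in the second call are hypothetical, chosen only to show the calculation.

```python
def simple_mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


data = [12, 15, 18, 20, 25]
print(simple_mean(data))  # 18.0

# Hypothetical scores and weights, used only for illustration
print(round(weighted_mean([70, 80, 90], [1, 2, 3]), 2))  # 83.33
```

Adding a single extreme value to the dataset (for example 250) would pull the simple mean well above most of the observations, which is the outlier sensitivity listed under the disadvantages.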
2. Median
The median is the middle value when the data is arranged in ascending or descending order.
Calculation:
- Arrange the data in order.
- If the number of observations is odd, the median is the middle value.
- If the number of observations is even, the median is the average of the two middle values.
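Formula for Even Data:
$\text{Median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}$, where $x_{(k)}$ is the $k$-th value in the ordered data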
Example:
Dataset: {10, 15, 20, 25, 30}
Median = 20 (middle value)
For Even Dataset:
{10, 15, 20, 25}
Median = (15 + 20) / 2 = 17.5
Advantages:
- Not affected by outliers
- Suitable for skewed distributions
Disadvantages:
- Ignores extreme data points
- Less suitable for advanced statistical analysis
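The odd and even cases of the median calculation can be written out as follows. This is a small sketch under the rules stated above; the function name is an assumption, and the last dataset (with the value 1000) is a made-up example to show robustness to outliers.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


print(median([10, 15, 20, 25, 30]))    # 20
print(median([10, 15, 20, 25]))        # 17.5
print(median([10, 15, 20, 25, 1000]))  # 20 (the outlier does not move the median)
```

Python's standard library offers the same calculation as `statistics.median`.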
3. Mode
The mode is the most frequently occurring value in a dataset.
Calculation:
Identify the data value that appears most often.
Example:
Dataset: {12, 14, 14, 15, 18, 18, 18, 20}
Mode = 18 (appears 3 times)
Advantages:
- Useful for categorical data
- Easy to understand
Disadvantages:
- May not exist if no value repeats
- Can have multiple modes (bimodal, multimodal)
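A possible way to find the mode, including the "no mode" and "multiple modes" cases mentioned above, is sketched below. The function name and the convention of returning an empty list when no value repeats are assumptions made for this illustration.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once, so no mode exists
    return sorted(v for v, c in counts.items() if c == top)


print(modes([12, 14, 14, 15, 18, 18, 18, 20]))  # [18]
print(modes([1, 1, 2, 2, 3]))                   # [1, 2] (bimodal)
print(modes([5, 6, 7]))                         # [] (no mode)
print(modes(["red", "blue", "red"]))            # ['red'] (categorical data)
```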
Quartiles, Measures of Dispersion
1. Quartiles (Q1, Q2, Q3)
Quartiles divide data into four equal parts.
- First Quartile (Q1): 25% of data lies below this value.
- Second Quartile (Q2 or Median): 50% of data lies below this value.
- Third Quartile (Q3): 75% of data lies below this value.
Calculation Example:
Dataset: {12, 15, 18, 20, 25, 30, 35, 40}
Step 1: Arrange the data: Already sorted.
Step 2: Q1: Position = (n+1)/4 = (8+1)/4 = 2.25 → Interpolate between 2nd (15) and 3rd (18): 15 + 0.25 × (18 - 15) → Q1 = 15.75
Step 3: Q2 (Median): Average of the two middle values, the 4th (20) and 5th (25): (20 + 25)/2 = 22.5
Step 4: Q3: Position = 3(n+1)/4 = 6.75 → Interpolate between 6th (30) and 7th (35): 30 + 0.75 × (35 - 30) → Q3 = 33.75
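The position-and-interpolation steps can be checked with a short Python sketch. It assumes the (n+1) position rule used in the steps above with linear interpolation between neighbouring values; the function name `quartile` is hypothetical. Other quartile conventions exist and can give slightly different results.

```python
import statistics


def quartile(values, k):
    """Return the k-th quartile (k = 1, 2, 3) using the (n+1) position rule."""
    ordered = sorted(values)
    n = len(ordered)
    pos = k * (n + 1) / 4   # 1-based position, possibly fractional
    lower = int(pos)        # whole part of the position
    frac = pos - lower      # fractional part, used for interpolation
    if lower < 1:           # position falls before the first value
        return ordered[0]
    if lower >= n:          # position falls at or beyond the last value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)


data = [12, 15, 18, 20, 25, 30, 35, 40]
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
print(q1, q2, q3)  # 15.75 22.5 33.75

# Cross-check: the standard library's default ("exclusive") method uses the same rule
print(statistics.quantiles(data, n=4))  # [15.75, 22.5, 33.75]
```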
2. Range
Measures the spread of data from the smallest to the largest value.
Formula:
$\text{Range} = x_{\max} - x_{\min}$
Example:
Dataset: {12, 18, 25, 35, 40}
Range = 40 - 12 = 28
Advantages:
- Simple and quick
Disadvantages:
- Ignores data distribution and is sensitive to outliers
3. Interquartile Range (IQR)
The range of the middle 50% of the data.
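Formula:
$\text{IQR} = Q_3 - Q_1$
Example:
For the dataset {12, 15, 18, 20, 25, 30, 35, 40}, the (n+1) position method gives Q3 = 33.75 and Q1 = 15.75.
IQR = 33.75 - 15.75 = 18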
Advantage:
- Not influenced by extreme values
Disadvantage:
- Requires sorting and calculation of quartiles
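To compare how the range and the IQR react to an extreme value, a sketch like the following may help. It relies on `statistics.quantiles` from the Python standard library (default "exclusive" method, which follows the (n+1) position rule); the helper names and the outlier value 400 are assumptions for illustration only.

```python
import statistics


def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range: Q3 - Q1, using the (n+1) position rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


print(value_range([12, 18, 25, 35, 40]))  # 28

data = [12, 15, 18, 20, 25, 30, 35, 40]
print(iqr(data))  # 18.0

# Hypothetical outlier appended to show the difference in sensitivity
with_outlier = data + [400]
print(value_range(with_outlier))  # 388
print(iqr(with_outlier))          # 21.0
```

The single outlier raises the range from 28 to 388, while the IQR only moves from 18 to 21.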
4. Mean Deviation (Absolute Mean Deviation)
Measures the average of the absolute deviations from the mean.
Formula:
$\text{MD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
Example:
Dataset: {10, 20, 30}
Mean = (10 + 20 + 30) / 3 = 20
Mean Deviation = (|10 - 20| + |20 - 20| + |30 - 20|) / 3 = 20 / 3 ≈ 6.67
Advantage:
- Provides a simple, easily interpreted measure of spread
Disadvantage:
- Less useful for advanced statistical models
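The mean deviation calculation can be verified with a minimal sketch; the function name is an assumption.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


print(round(mean_deviation([10, 20, 30]), 2))  # 6.67
```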
5. Standard Deviation (SD)
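Measures how far the data values typically deviate from the mean.
Formula:
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
This is the population form (dividing by $n$), which the examples in these notes use; the sample form divides by $n - 1$.
Example:
Dataset: {10, 20, 30}
Mean = 20
SD = √((100 + 0 + 100) / 3) = √66.67 ≈ 8.16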
Advantage:
- Widely used in data analysis
Disadvantage:
- Sensitive to outliers
6. Variance
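The average of the squared deviations from the mean; equivalently, the square of the standard deviation.
Formula:
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
Example:
Variance = (100 + 0 + 100) / 3 ≈ 66.67 for the dataset {10, 20, 30}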
Advantage:
- Useful for understanding variability
Disadvantage:
- Hard to interpret directly due to squared units
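Variance and standard deviation can be computed together, as in the sketch below. It assumes the population form (dividing by n), which is what the SD value of 8.16 for {10, 20, 30} implies; the function names are hypothetical.

```python
import math
import statistics


def population_variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_sd(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


data = [10, 20, 30]
print(round(population_variance(data), 2))  # 66.67
print(round(population_sd(data), 2))        # 8.16

# The standard library's pstdev computes the same population SD
print(round(statistics.pstdev(data), 2))    # 8.16
```

The standard library also provides `statistics.variance` and `statistics.stdev`, which divide by n - 1 and are intended for sample data; for this dataset they give 100 and 10.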
7. Coefficient of Variation (CV)
Indicates relative variability compared to the mean.
Formula:
$\text{CV} = \frac{\sigma}{\bar{x}} \times 100\%$
Example:
SD = 8.16, Mean = 20
CV = (8.16 / 20) × 100% ≈ 40.8%
Advantage:
- Useful for comparing variability across datasets
Disadvantage:
- Not meaningful when the mean is zero or close to zero, or when the data can take negative values
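The following sketch shows how the CV lets two datasets with the same spread but different means be compared. The population SD is assumed, and the second dataset {110, 120, 130} is a made-up example.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(values):
    """CV as a percentage: population SD divided by the mean, times 100."""
    mean = fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mean * 100


print(round(coefficient_of_variation([10, 20, 30]), 2))     # 40.82
# Hypothetical dataset with the same SD but a larger mean
print(round(coefficient_of_variation([110, 120, 130]), 2))  # 6.8
```

Both datasets have an SD of about 8.16, yet the first is far more variable relative to its mean.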
8. Skewness
Measures the asymmetry of the distribution.
Types:
- Positive Skew: Right tail is longer
- Negative Skew: Left tail is longer
- Zero Skew: Symmetrical data
Formula (Pearson's second coefficient of skewness):
$Sk = \frac{3(\bar{x} - \text{Median})}{\sigma}$
Example:
A dataset whose mean is greater than its median is typically positively skewed.
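One common way to quantify skewness is Pearson's second coefficient, 3 × (mean - median) / SD. Choosing this measure is an assumption, made because it matches the mean-versus-median rule in the example; moment-based skewness is another widely used definition. The datasets below are made up for illustration.

```python
import statistics


def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # population SD
    if sd == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean - median) / sd


# Hypothetical datasets
print(round(pearson_skewness([10, 12, 14, 16, 48]), 2))  # 1.27 (positive skew)
print(pearson_skewness([10, 20, 30]))                    # 0.0 (symmetrical)
print(pearson_skewness([2, 34, 36, 38, 40]) < 0)         # True (negative skew)
```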
9. Kurtosis
Measures the "peakedness" or "flatness" of the distribution; more precisely, it reflects how heavy the tails are compared with a normal distribution.
Types:
- Leptokurtic (Kurtosis > 3): Sharp peak
- Mesokurtic (Kurtosis = 3): Normal distribution
- Platykurtic (Kurtosis < 3): Flat peak
Formula:
$\text{Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{\sigma^4}$
Example:
A dataset with many extreme values (heavy tails) tends to be leptokurtic.
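Kurtosis on the scale used above (normal distribution = 3) can be computed as the fourth central moment divided by the squared variance. The sketch below assumes this moment-based definition with population moments; the function names and the second dataset are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: fourth central moment / (variance squared)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2


def classify(k, tol=1e-9):
    """Label a kurtosis value relative to the normal-distribution value of 3."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"


flat = [10, 20, 30]
tailed = [10, 12, 14, 16, 48]  # hypothetical data with one extreme value

print(round(kurtosis(flat), 2), classify(kurtosis(flat)))      # 1.5 platykurtic
print(round(kurtosis(tailed), 2), classify(kurtosis(tailed)))  # 3.15 leptokurtic
```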