Central limit theorem
General Idea of the Central Limit Theorem
- Start with a [[ random variable ]]: $X$
- Add $n$ samples of this variable: $X_{1}+X_{2}+\dots+X_{n}$
- The distribution of this sum looks more like a bell curve as $n \to \infty$
Underlying Assumptions
- i.i.d ⟶ independent and identically distributed.
- All $X_{i}$’s are [[ independent random variables|independent ]] from each other: the outcome of one random process doesn’t influence the outcome of any other process.
- Each $X_{i}$ is drawn from the same distribution.
- $0< \mathrm{var}(X_{i})<\infty$ and finite expectation.
[!summary] Central Limit Theorem \(\lim_{ n \to \infty } P\left( a < \frac{(X_{1}+X_{2}+\dots+X_{n})-n\cdot\mu}{\sigma \sqrt{ n }} < b\right) =\int_{a}^{b} \frac{1}{\sqrt{ 2\pi }} e^{-x^{2}/2} \, dx\)
The 68-95-99.7 rule
- $68\%$ of values fall within 1 standard deviation of the mean.
- $95\%$ of values fall within 2 standard deviations of the mean.
- $99.7\%$ of values fall within 3 standard deviations of the mean.
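The three percentages can be checked against the standard normal CDF. This is a minimal sketch using Python's standard library; the helper name `prob_within` is made up for illustration, and the rule is assumed to refer to a normal distribution.

```python
from statistics import NormalDist

def prob_within(k: float) -> float:
    """P(-k < Z < k) for a standard normal Z."""
    z = NormalDist()  # mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    # Prints the probability of falling within k standard deviations.
    print(k, round(prob_within(k), 4))
```

The printed values round to the familiar 68%, 95% and 99.7%.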
! [[ sample_means_coins.png#invert ]] ! [[ sample_means_movies.png#invert ]]
- Class activity: [[ Statistics Group Project Report ]]
- skewed and symmetric histogram plots
- as the sample size increases, the range on the x-axis shrinks and the estimators in both cases become more precise
- histograms for coins: $n = 10$ to $n = 50$
- $n = 10$ is the most skewed, $n = 50$ is the least skewed.
- histograms are becoming more bell-curved
- histograms for movie lengths: $n = 10$ to $n = 50$, histograms become more bell-curved
- difference: for movie lengths the contrast between $n = 10$ and $n = 50$ is much smaller than for coins.
- $n=30$ for movies is more symmetric than $n=30$ for coins.
- (why do we want it symmetric? why does it become symmetric with increasing sample size?) \(sd(\bar{x})=\frac{\sigma}{\sqrt{ n }}\)
- as $n$ increases, standard error of $\bar{x}$ decreases.
- Central Limit Theorem: as the sample size increases, the distribution of the sample mean approaches a normal distribution. \(\bar{x} \overset{\text{approx.}}{\sim} \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right) \text{ for large } n\)
- holds regardless of the shape of the population distribution (given finite variance, as assumed above).
- Rule of thumb: for $n>30$, $\bar{x}$ is approximately $\mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$
- For i.i.d. $X_{1}, X_{2}, \dots, X_{n}$, as $n\to \infty$ the distribution of $\bar{x}$ approaches $\mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$
[!caution] Note The CLT applies only to the sample mean $\bar{x}$; it says nothing about the distribution of the individual $X_{i}$ (the population). A large $n$ does not make the population normal.
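A small simulation makes the two effects visible: sample means get less skewed and their spread shrinks like $\sigma/\sqrt{n}$. This is only a sketch. The exponential population (with $\mu = \sigma = 1$) is an assumption chosen because it is clearly skewed; it is not the coin or movie data from the class activity, and the helper names are hypothetical.

```python
import math
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` samples, each of size n, from an Exponential(1) population."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

def skewness(values):
    """Moment-based skewness: mean of cubed standardized deviations."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for n in (10, 30, 50):
    means = sample_means(n, reps=4000, rng=rng)
    results[n] = (statistics.stdev(means), skewness(means))
    # Compare the simulated sd of the sample mean with sigma / sqrt(n) = 1 / sqrt(n).
    print(n, round(results[n][0], 3), round(1 / math.sqrt(n), 3), round(results[n][1], 3))
```

The population itself stays skewed throughout; only the distribution of $\bar{x}$ becomes more symmetric as $n$ grows.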
https://youtu.be/jvoxEYmQHNM
- Why are the means for coins more skewed? Nature of the distribution
- e.g., height, weight or movies (very few people who are extremely tall to be more in numbers) vs cholesterol or coins (fewer people are at extremes than at mean)
- confidence interval, interval estimation
Disney's Hannah Montana: The Movie opened on Easter weekend in April 2009. Over the three-day weekend, the movie became the number-one box office attraction (The Wall Street Journal, April 13, 2009). The ticket sales revenue in dollars for a sample of 25 theatres is as follows.
Ticket sales (in $) |
---|
20,200 |
8350 |
10,750 |
13,900 |
13,185 |
10,150 |
7300 |
6240 |
4200 |
9200 |
13,000 |
14,000 |
12,700 |
6750 |
21,400 |
11,320 |
9940 |
7430 |
6700 |
11,380 |
9700 |
11,200 |
13,500 |
9330 |
10,800 |
Problem A
What is the 95% confidence interval estimate for the mean ticket sales revenue per theatre? Interpret this result.
[!check] Solution https://www.vaia.com/en-us/textbooks/economics/economics-for-today-6-edition/chapter-8/problem-22-disneys-hannah-montana-the-movie-opened-on-easter/
Let $\bar{x}$ be the sample mean and $n = 25$ the sample size. We need $1-\alpha = 0.95 \implies \alpha=0.05$. The population is not known to be normal, and $\sigma$ is unknown, so strictly a non-parametric method would be needed to construct a 95% confidence interval. Here we explicitly assume that the population is normal and use the $t$-confidence interval: \(\bar{x}-t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} < \mu < \bar{x}+t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }}\)
- $\bar{x} = \frac{1}{n} \sum_{i=1}^{n}X_{i} = 10905$
- $s = \sqrt{ \frac{1}{n-1}\sum_{i=1}^{n}(X_{i} - \bar{x})^{2} } \approx 3962.11$
- Use the function `T.INV` in a spreadsheet to calculate the $t$-score as `T.INV(1 - 0.05/2, 25 - 1)` $\approx 2.064$, then multiply it by $s/\sqrt{ n }$ to get $t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} = 1635.48$
- Hence $\mu \in 10905 \pm 1635.48 \approx \left[ 9270, 12540 \right]$: we are 95% confident that the mean ticket sales revenue per theatre lies in this interval.
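The same computation can be reproduced outside a spreadsheet. This sketch uses only Python's standard library, which has no $t$ quantile function, so the critical value is hard-coded as an assumption ($t_{24,\,0.025} \approx 2.0639$, the value `T.INV(0.975, 24)` returns).

```python
import math
import statistics

# Ticket sales (in dollars) for the 25 sampled theatres, from the table above.
sales = [20200, 8350, 10750, 13900, 13185, 10150, 7300, 6240, 4200, 9200,
         13000, 14000, 12700, 6750, 21400, 11320, 9940, 7430, 6700, 11380,
         9700, 11200, 13500, 9330, 10800]

n = len(sales)
xbar = statistics.mean(sales)   # sample mean
s = statistics.stdev(sales)     # sample standard deviation (n - 1 in the denominator)

T_CRIT = 2.0639                 # assumed: t critical value for 24 df, alpha/2 = 0.025
margin = T_CRIT * s / math.sqrt(n)
lcl, ucl = xbar - margin, xbar + margin

print(n, xbar, round(s, 2), round(margin, 2), round(lcl), round(ucl))
```

The interval is only as trustworthy as the normality assumption stated in the solution.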
Problem B
Using the movie ticket price of $7.16 per ticket, what is the estimate of the mean number of customers per theatre?
- Mean ticket sales per theatre was found to be $\left[ 9270, 12540 \right]$ with 95% confidence.
- So the estimated mean number of customers per theatre is $10905 / 7.16 \approx 1523$, with 95% confidence interval $\left[ 9270, 12540 \right] / 7.16 \approx \left[ 1295, 1751 \right]$.
Problem C
The movie was shown in 3118 theatres. Estimate the total number of customers who saw Hannah Montana: The Movie and the total box office ticket sales for the three-day weekend.
- Total customers $=3118 \times \text{average number of customers per theatre} \approx \left[ 4.04, 5.46 \right]$ million
- Total box office ticket sales $= 3118 \times \left[ 9270, 12540 \right] \approx \left[ 28.9, 39.1 \right]$ million dollars
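Problems B and C are just rescalings of the Problem A result, which a few lines of arithmetic make explicit. This sketch takes the mean (10905) and margin (1635.48) from Problem A as given and uses the unrounded limits, so the last digits can differ slightly from hand calculations based on rounded limits. The helper `scale` is a hypothetical name.

```python
TICKET_PRICE = 7.16   # dollars per ticket (Problem B)
THEATRES = 3118       # number of theatres showing the movie (Problem C)

mean_sales = 10905.0  # sample mean from Problem A
margin = 1635.48      # margin of error from Problem A
sales_interval = (mean_sales - margin, mean_sales + margin)

def scale(interval, factor):
    """Multiply both limits of an interval by a positive constant."""
    return tuple(x * factor for x in interval)

# Problem B: customers per theatre = revenue per theatre / ticket price.
customers_point = mean_sales / TICKET_PRICE
customers_interval = scale(sales_interval, 1 / TICKET_PRICE)

# Problem C: totals over all theatres.
total_customers_point = THEATRES * customers_point
total_customers_interval = scale(customers_interval, THEATRES)
total_sales_point = THEATRES * mean_sales
total_sales_interval = scale(sales_interval, THEATRES)

print(round(customers_point), [round(x) for x in customers_interval])
print(round(total_customers_point), round(total_sales_point))
```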
[[ 2025-01-16|16 January 2025 ]]
Confidence Interval
- Why do racquet games seem easier than other sports like Javelin or dartboard?
- suppose you play with a cricket bat or a wicket instead of a badminton racquet –> it’s harder.
- badminton racquet has a much wider coverage area than a bat or a wicket.
- think: fishing net vs fishing rod $\sim$ interval estimate vs point estimate.
- for practical purposes, we do not rely on a point estimator but we provide a small range.
- a fishing net does not guarantee that you will catch a fish; likewise, we cannot ensure with 100% probability that the interval will catch our unknown parameter.
- we can quantify the uncertainty using probability. This probability is the amount of confidence that the unknown parameter lies in the interval estimate.
- confidence that $\mu \in \left( \bar{x} \pm z_{\alpha/{2}} \frac{\sigma}{\sqrt{ n }} \right)$ is $100(1-\alpha)\%$
- $\bar{x}$ can be found
- $n$ is the sample size.
- $z_{\frac{\alpha}{2}}$ can be found using value of $\alpha$ (How?)
- $\alpha$ is chosen by the analyst (e.g. $\alpha = 0.05$ for 95% confidence)
- The variance $\sigma^{2}=\frac{1}{N}\sum(x_{i}-\mu)^{2}$ is a population parameter ^[`STDEV.P` in spreadsheet software returns its square root, $\sigma$] and usually cannot be calculated, because the whole population is not observed. $s^{2} = \frac{1}{n-1} \sum(X_{i}-\bar{x})^{2}$ ^[`STDEV.S` in spreadsheet software returns its square root, $s$] is a sample measure and can be calculated.
- even though we don't know the population data, in a few circumstances we can use past data such as a census.
- how was this confidence found? using CLT
- $X_{1}, X_{2}, \dots X_{n}$, $\bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$ for large $n$ (greater than 30)
- $N$ does not have to be large
- $n > 30$
- $n > 30$ –> use $z$-confidence interval
- population $N(\mu, \sigma^{2})$ with $\sigma$ known –> use $z$-confidence interval
- $n < 30$, normal population, $\sigma$ unknown (estimated by $s$) –> use $t$-confidence interval
- $\mu \in \left( \bar{x} \pm t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} \right)$
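For the $z$ case, both the critical value $z_{\alpha/2}$ and the interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ can be computed with the standard library. A minimal sketch: the function names and the example numbers ($\bar{x} = 50$, $\sigma = 12$, $n = 36$) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(conf: float) -> float:
    """Upper alpha/2 point of the standard normal, where alpha = 1 - conf."""
    alpha = 1 - conf
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_interval(xbar: float, sigma: float, n: int, conf: float = 0.95):
    """z-confidence interval for the mean when sigma is known (or n is large)."""
    margin = z_critical(conf) * sigma / sqrt(n)
    return xbar - margin, xbar + margin

print(round(z_critical(0.95), 4))
print(z_interval(50, 12, 36))
```

The $t$ interval has the same shape, with $t_{n-1,\alpha/2}$ in place of $z_{\alpha/2}$ and $s$ in place of $\sigma$. It needs a $t$ quantile function, which the standard library does not provide.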
! [[ Drawing 2025-01-16 11.44.06.excalidraw|100% ]]
[[ 2025-01-21|21 January 2025 ]]
- for large $n$, use $\sigma$ or $s$, whichever is available.
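- $t_{\frac{\alpha}{2}}$ denotes the critical value of the $t$ distribution (with $n-1$ degrees of freedom) that leaves an area of $\frac{\alpha}{2}$ in the upper tail.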
- LCL and UCL
problem
[[ CLT Problem ]]
- $z_{\alpha/2}$ in a spreadsheet: `NORM.S.INV(1-alpha/2)`. No degrees of freedom.
- For larger sample sizes, the $t$-score converges to the $z$-score (the $t$ distribution approaches the standard normal as the degrees of freedom grow).
- Population parameters are always constant, sample measures follow a distribution.
- CI for a proportion $p$, based on $\hat{p}$: \(\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }\)
- $se(\hat{p}) = sd(\hat{p}) = \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$
- requires $np > 5$ and $n(1-p) > 5$, or $> 10$ under the stricter assumption
- Pattern: start with a point estimate (pivot)
- $\pm$ the critical value from the estimator's distribution times the standard error
- batman vs estimator of batman
[[ 2025-01-22|22 January 2025 ]]
! [[ Pasted image 20250122091519.png ]]
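The proportion interval $\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}$ follows the same "point estimate $\pm$ critical value $\times$ standard error" pattern. A minimal sketch: the function name and the example counts (60 successes out of 150) are hypothetical, and the threshold of 5 is the looser of the two rules of thumb in these notes.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes: int, n: int, conf: float = 0.95):
    """Large-sample z-interval for a population proportion."""
    p_hat = successes / n
    # Normal approximation check, applied to the sample counts.
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

print(proportion_interval(60, 150))
```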
Recap
- CI for $\mu$: $\sigma$ known: $z$ confidence
- CI for $\mu$: $\sigma$ unknown: $t$ confidence
- Construct intervals:
- point estimate $\pm$ margin of error
- $\underbrace{ \text{point estimator} }_{ \text{pivot} } \pm \underbrace{ \alpha \text{ point } \cdot \text{standard error} }_{ \text{margin of error} }$
- $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
- $\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$
- UCL = $\bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
- LCL = $\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
- width of the CI = $2\cdot\underbrace{ z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} }_{ \text{precision} }$
- precision = margin of error $=E$ = CI width / 2
- more precise ⟹ smaller margin of error $E$.
- this calculation is made before you take sample data: it is used to find the required sample size $n$.
- $\sigma$ high => $n$ needs to be high (height example; we need to account for variability)
- But we don’t know the value of $\sigma$ or $s$ (we haven’t yet taken a sample). How to calculate $\sigma$?
- Use $\sigma$ from census records / previous studies.
- Use $\sigma$ from pilot studies: before you collect data for the main sample, collect data for a small pilot sample.
- calculate sample size for possible different values of $\sigma$
- $\sigma \approx \frac{\text{Range}}{4}$ (keep in mind, though, that the range is affected by extreme values)
So, solving $E = z_{\alpha/2} \frac{\sigma^{*}}{\sqrt{ n }}$ for $n$, our equation becomes: \(n = \frac{z_{\alpha/2}^{2} {\sigma ^{*}}^{2}}{E^{2}} \; \text{for mean}\)
- For proportion:
\(\begin{align} E = z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} } \\ \\ \implies n =\frac{z_{\alpha/2}^{2}\cdot p^{*}(1-p ^{*})}{E^{2}} \end{align}\)
- For the planning value $p^{*}$, use previous studies, pilot studies, etc.
Problem
- always round up
- if you round down, the achieved margin of error or confidence level falls short of the target
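Both sample-size formulas, with the round-up rule built in, can be sketched as follows. The function names and the planning values ($\sigma^{*} = 15$, $E = 2$; $p^{*} = 0.5$, $E = 0.03$) are assumptions chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist

def z_critical(conf: float) -> float:
    """Upper alpha/2 point of the standard normal, where alpha = 1 - conf."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

def n_for_mean(sigma_star: float, E: float, conf: float = 0.95) -> int:
    """n = z^2 * sigma*^2 / E^2, rounded up."""
    z = z_critical(conf)
    return ceil((z * sigma_star / E) ** 2)

def n_for_proportion(p_star: float, E: float, conf: float = 0.95) -> int:
    """n = z^2 * p*(1 - p*) / E^2, rounded up."""
    z = z_critical(conf)
    return ceil(z ** 2 * p_star * (1 - p_star) / E ** 2)

print(n_for_mean(15, 2))            # raw value is about 216.08, so round up
print(n_for_proportion(0.5, 0.03))  # raw value is about 1067.07, so round up
```

`ceil` rather than `round` is the point of the rule above: any $n$ below the raw value misses the target margin of error.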
Observations about $n$
- $\sigma$ increases ⟹ $n$ inc
- $E$ inc ⟹ $n$ dec
- $\alpha$ inc ⟹ $n$ dec
- $100(1-\alpha)\%$ inc ⟹ $n$ inc
- Suppose $n$ comes out to be $150$, but you have the money to collect data with only $n = 100$
- $n = 100$ might give you only 80% confidence
- $n = 120$ might give you a confidence very close to that given by $n = 150$
- graph is useful
- $n$ and margin of error do not change linearly.
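The budget trade-off can be tabulated instead of graphed: fix the target margin $E$ and ask what confidence a smaller $n$ still achieves, via $1-\alpha = 2\,\Phi\!\left(E\sqrt{n}/\sigma\right) - 1$. This is a sketch with hypothetical planning values ($\sigma = 15$, $E = 2.4$), picked so that 95% confidence needs roughly $n = 150$. The resulting confidence levels are therefore illustrative and need not match the 80% figure mentioned above.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 15.0    # assumed planning value for sigma
E_TARGET = 2.4  # assumed target margin of error

def margin_of_error(n: int, sigma: float = SIGMA, conf: float = 0.95) -> float:
    """Margin of error at a fixed confidence level: z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z * sigma / sqrt(n)

def achieved_confidence(n: int, E: float = E_TARGET, sigma: float = SIGMA) -> float:
    """Confidence level at which a sample of size n gives margin E."""
    z = E * sqrt(n) / sigma
    return 2 * NormalDist().cdf(z) - 1

for n in (100, 120, 150):
    print(n, round(achieved_confidence(n), 3), round(margin_of_error(n), 2))
```

Because $E \propto 1/\sqrt{n}$, quadrupling $n$ only halves the margin of error, which is the non-linearity noted in the last bullet.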