Central limit theorem

29 June 2025

General Idea of the Central Limit Theorem

Start with a [[ random variable ]]: $X$
Add $n$ samples of this variable: $X_{1}+X_{2}+\dots+X_{n}$
The distribution of this sum looks more like a bell curve as $n \to \infty$
Underlying Assumptions
i.i.d ⟶ independent and identically distributed.
- All $X_{i}$’s are [[ independent random variables|independent ]] from each other: the outcome of one random process doesn’t influence the outcome of any other process.
- Each $X_{i}$ is drawn from the same distribution.
$0< \mathrm{var}(X_{i})<\infty$ and finite expectation.

[!summary] Central Limit Theorem $\lim_{ n \to \infty } P\left( a < \frac{(X_{1}+X_{2}+\dots+X_{n})-n\cdot\mu}{\sigma \sqrt{ n }} < b\right) =\int_{a}^{b} \frac{1}{\sqrt{ 2\pi }} e^{-x^{2}/2} \, dx$, where $\mu = E[X_{i}]$ and $\sigma^{2} = \mathrm{var}(X_{i})$.
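
The limit statement above can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes an Exp(1) population (so $\mu = \sigma = 1$), and the function names, the number of trials and the seed are illustrative choices. It estimates $P(-1 < Z_{n} < 1)$ for the standardized sum and compares it with the standard normal probability.

```python
import math
import random
from statistics import NormalDist

def standardized_sum(n, rng):
    """Return (X_1 + ... + X_n - n*mu) / (sigma * sqrt(n)) for Exp(1) draws."""
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exp(1) (assumed population)
    total = sum(rng.expovariate(1.0) for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

def estimate_probability(n, a, b, trials=10_000, seed=0):
    """Monte Carlo estimate of P(a < standardized sum < b)."""
    rng = random.Random(seed)
    hits = sum(a < standardized_sum(n, rng) < b for _ in range(trials))
    return hits / trials

# Right-hand side of the theorem for a = -1, b = 1
normal_prob = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)

for n in (1, 5, 50):
    print(n, round(estimate_probability(n, -1.0, 1.0), 3), "vs", round(normal_prob, 3))
```

For $n = 1$ the estimate should sit near $1 - e^{-2}$, the exact value for a single Exp(1) draw, and it should drift towards the normal value as $n$ grows.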


The 68-95-99.7 rule

For a normal distribution:
About $68\%$ of values fall within 1 standard deviation of the mean.
About $95\%$ of values fall within 2 standard deviations of the mean.
About $99.7\%$ of values fall within 3 standard deviations of the mean.
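
These percentages come straight from the standard normal CDF. A small sketch using Python's standard library (the helper name `within_k_sd` is just an illustrative choice):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def within_k_sd(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {within_k_sd(k):.4f}")
```

The three values are approximately 0.6827, 0.9545 and 0.9973, which the rule rounds to 68, 95 and 99.7 percent.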

! [[ sample_means_coins.png#invert ]] ! [[ sample_means_movies.png#invert ]]

Class activity: [[ Statistics Group Project Report ]]
- skewed and symmetric histogram plots
- as the sample size increases, the range on the x-axis decreases; the estimators in both cases become more precise
- histograms for coins: $n = 10$ to $n = 50$
  - $n = 10$ is the most skewed, $n = 50$ is the least skewed.
  - histograms are becoming more bell-curved
- histograms for movie lengths: $n = 10$ to $n = 50$, histograms become more bell-curved
- difference: for movie lengths the contrast between $n = 10$ and $n = 50$ is much smaller than for coins.
- $n=30$ for movies is more symmetric than $n=30$ for coins.
- (why do we want it symmetric? why does it become symmetric with increasing sample size?) $sd(\bar{x})=\frac{\sigma}{\sqrt{ n }}$
- as $n$ increases, standard error of $\bar{x}$ decreases.
Central Limit Theorem: As the sample size increases, the distribution of the sample mean becomes approximately normal: $\bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$, approximately, for large $n$.
- holds regardless of the shape of the population distribution (as long as its variance is finite).
- (Rule of thumb) For $n>30$, $\bar{x}$ is approximately $\mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$
- For i.i.d. $X_{1}, X_{2}, \dots, X_{n}$, the approximation improves as $n\to \infty$.
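
The relation $sd(\bar{x}) = \sigma/\sqrt{n}$ can be seen in a simulation. This is a rough sketch under assumptions that are not in the notes: the population is taken to be Exp(1) (skewed, with $\mu = \sigma = 1$), and the number of repetitions and the seed are arbitrary.

```python
import math
import random
import statistics

def sample_means(n, reps=5_000, seed=1):
    """Draw `reps` samples of size n from an Exp(1) population; return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

sigma = 1.0  # population standard deviation of Exp(1) (assumed population)

for n in (10, 30, 50):
    means = sample_means(n)
    observed = statistics.stdev(means)   # spread of the simulated sample means
    predicted = sigma / math.sqrt(n)     # standard error from the formula
    print(n, round(observed, 3), round(predicted, 3))
```

The observed spread of the sample means should track $\sigma/\sqrt{n}$ closely, which is why the x-axis range of the histograms shrinks as $n$ grows.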

[!caution] Note The CLT applies only to the sample mean $\bar{x}$; it says nothing about the distribution of the individual $X_{i}$ (the population). A large $n$ does not mean the population is normal.

https://youtu.be/jvoxEYmQHNM

Why are the means for coins more skewed? Nature of the distribution
e.g., height, weight or movies (very few people who are extremely tall to be more in numbers) vs cholesterol or coins (fewer people are at extremes than at mean)

\[\begin{align} \text{Large } n &\implies \bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right) \\ Z &= \frac{\bar{x}-\mu}{\sigma/\sqrt{ n }} \sim \mathcal{N}\left( 0, 1 \right) \quad \text{by standardizing the sample mean} \end{align}\]

\[\begin{align} P(-a<Z<a) &= 95\% \\ P\left( -a< \frac{\bar{x} - \mu }{\sigma/\sqrt{ n }} < a \right) &= 95\% \\ P \left( -\frac{a\sigma}{\sqrt{ n }}-\bar{x} < -\mu < \frac{a\sigma}{\sqrt{ n }}-\bar{x} \right) &= 95\% \\ P \left( \bar{x}-\frac{a\sigma}{\sqrt{ n }} < \mu < \bar{x}+\frac{a\sigma}{\sqrt{ n }} \right) &= 95\% \end{align}\]

\[P\left[ \mu \in \underbrace{ \left( \bar{x} \pm a \frac{\sigma}{\sqrt{ n }} \right) }_{ \text{interval} } \right] = 95\%\]

\[P\left[ \mu \in \left( \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} \right) \right] = 100(1-\alpha)\%\]

confidence interval, interval estimation

Disney’s Hannah Montana: The Movie opened on Easter weekend in April 2009. Over the three-day weekend, the movie became the number-one box office attraction (The Wall Street Journal, April 13, 2009). The ticket sales revenue in dollars for a sample of 25 theatres is as follows.

Ticket sales (in $)
20,200
8,350
10,750
13,900
13,185
10,150
7,300
6,240
4,200
9,200
13,000
14,000
12,700
6,750
21,400
11,320
9,940
7,430
6,700
11,380
9,700
11,200
13,500
9,330
10,800

Problem A

What is the 95% confidence interval estimate for the mean ticket sales revenue per theatre? Interpret this result.

[!check] Solution https://www.vaia.com/en-us/textbooks/economics/economics-for-today-6-edition/chapter-8/problem-22-disneys-hannah-montana-the-movie-opened-on-easter/

Let $\bar{x}$ be the sample mean and let $n = 25$ denote the sample size. We need $1-\alpha = 0.95 \implies \alpha=0.05$.

The population is not known to be normal, $\sigma$ is unknown, and $n < 30$, so strictly a non-parametric method would be needed to construct the 95% confidence interval. Here, we’ll explicitly assume that the population is normal and use the $t$-confidence interval with $n-1 = 24$ degrees of freedom:

$\bar{x}-t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} < \mu < \bar{x}+t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }}$

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n}X_{i} = 10905$

$s = \sqrt{ \frac{1}{n-1}\sum_{i=1}^{n}(X_{i} - \bar{x})^{2} } \approx 3962.11$

Use the function T.INV in a spreadsheet to calculate the $t$-score as T.INV(1 - 0.05/2, 25 - 1) $\approx 2.0639$, then multiply it by $s/\sqrt{ n }$ to get $t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} = 1635.48$

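Hence, $\mu \in 10905 \pm 1635.48 \approx \left[ 9270, 12540 \right]$. Interpretation: under the normality assumption, we are 95% confident that the mean ticket sales revenue per theatre lies between about 9270 and 12540 dollars.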
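
The calculation for Problem A can be reproduced with Python's standard library. This is a sketch rather than the original spreadsheet workflow: the standard library has no inverse $t$ function, so the critical value 2.0639 is hard-coded here as an assumption taken from a $t$ table (24 degrees of freedom, upper-tail area 0.025).

```python
import math
import statistics

# Ticket sales revenue (in dollars) for the 25 sampled theatres
ticket_sales = [
    20200, 8350, 10750, 13900, 13185,
    10150, 7300, 6240, 4200, 9200,
    13000, 14000, 12700, 6750, 21400,
    11320, 9940, 7430, 6700, 11380,
    9700, 11200, 13500, 9330, 10800,
]

n = len(ticket_sales)
x_bar = statistics.mean(ticket_sales)   # sample mean
s = statistics.stdev(ticket_sales)      # sample standard deviation (n - 1 denominator)

# Assumption: table value of t_{24, 0.025}, i.e. T.INV(0.975, 24) in a spreadsheet
t_crit = 2.0639

margin = t_crit * s / math.sqrt(n)      # margin of error
lower, upper = x_bar - margin, x_bar + margin

print(f"mean = {x_bar}, s = {s:.2f}, margin = {margin:.2f}")
print(f"95% CI for the mean: [{lower:.0f}, {upper:.0f}]")
```

`statistics.stdev` uses the $n-1$ denominator, matching STDEV.S in a spreadsheet.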

Problem B

Using the movie ticket price of $7.16 per ticket, what is the estimate of the mean number of customers per theatre?

The mean ticket sales revenue per theatre was estimated as 10905, with 95% confidence interval $\left[ 9270, 12540 \right]$.
Dividing by the ticket price, the estimated mean number of customers per theatre is $10905 / 7.16 \approx 1523$, with 95% confidence interval $\left[ 9270/7.16,\ 12540/7.16 \right] \approx \left[ 1294.7, 1751.4 \right]$.

Problem C

The movie was shown in 3118 theatres. Estimate the total number of customers who saw Hannah Montana: The Movie and the total box office ticket sales for the three-day weekend.

Total customers $=3118 \times \text{average number of customers per theatre} \approx 3118 \times 1523 \approx 4.75$ million, with 95% interval $\approx [4.04, 5.46]$ million.
Total box office ticket sales $= 3118 \times 10905 \approx 34.0$ million dollars, with 95% interval $\approx [28.9, 39.1]$ million dollars.
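
The scaling in Problems B and C is plain arithmetic on the interval endpoints. A short sketch, assuming the rounded revenue interval $[9270, 12540]$ and point estimate 10905 from Problem A (the variable names are illustrative):

```python
TICKET_PRICE = 7.16   # dollars per ticket (given in Problem B)
THEATRES = 3118       # number of theatres (given in Problem C)

revenue_mean = 10905.0          # point estimate of mean revenue per theatre
revenue_ci = (9270.0, 12540.0)  # rounded 95% CI for mean revenue per theatre

# Problem B: customers per theatre = revenue per theatre / ticket price
customers_mean = revenue_mean / TICKET_PRICE
customers_ci = tuple(r / TICKET_PRICE for r in revenue_ci)

# Problem C: scale the per-theatre figures up to all theatres
total_customers = THEATRES * customers_mean
total_customers_ci = tuple(THEATRES * c for c in customers_ci)
total_revenue = THEATRES * revenue_mean
total_revenue_ci = tuple(THEATRES * r for r in revenue_ci)

print("customers per theatre:", round(customers_mean), [round(c, 1) for c in customers_ci])
print("total customers:", round(total_customers), [round(c) for c in total_customers_ci])
print("total revenue:", round(total_revenue), [round(r) for r in total_revenue_ci])
```

Because dividing by a positive constant and multiplying by a positive constant preserve order, the endpoints of the interval can be transformed one at a time.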

[[ 2025-01-16|16 January 2025 ]]

Confidence Interval

Why do racquet games seem easier than other sports like javelin or darts?
- suppose you play with a cricket bat or a wicket instead of a badminton racquet –> it’s harder.
- badminton racquet has a much wider coverage area than a bat or a wicket.
- think: fishing net vs fishing rod $\sim$ interval estimate vs point estimate.
For practical purposes, we do not rely on a point estimate alone; we provide a small range around it.
A fishing net does not guarantee that you will catch a fish. Likewise, we cannot ensure with 100% probability that the interval will catch our unknown parameter.
We can quantify the uncertainty using probability. This probability is the confidence that the unknown parameter lies in the interval estimate.
The confidence that $\mu \in \left( \bar{x} \pm z_{\alpha/{2}} \frac{\sigma}{\sqrt{ n }} \right)$ is $100(1-\alpha)\%$
- $\bar{x}$ can be found
- $n$ is the sample size.
- $z_{\frac{\alpha}{2}}$ can be found from the value of $\alpha$: it is the standard normal value with upper-tail area $\alpha/2$.
- $\alpha$ is chosen by the analyst (e.g., $\alpha = 0.05$ for 95% confidence).
- The variance $\sigma^{2}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^{2}$ is a population parameter ^[STDEV.P in spreadsheet software returns $\sigma$] and usually cannot be calculated, because the whole population is not observed. $s^{2} = \frac{1}{n-1} \sum(X_{i}-\bar{x})^{2}$ ^[STDEV.S in spreadsheet software returns $s$] is a sample measure and can be calculated.
- even though we don’t know the population data, in a few circumstances we can use past data, such as a census, to obtain $\sigma$.
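
The fishing-net idea can be illustrated by repeating the sampling many times and counting how often the interval catches $\mu$. This is a minimal sketch with made-up settings: a normal population with $\mu = 50$ and $\sigma = 10$, $n = 40$, and an arbitrary seed are all assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu=50.0, sigma=10.0, n=40, alpha=0.05, reps=5_000, seed=2):
    """Fraction of simulated z-intervals (sigma known) that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    half_width = z * sigma / math.sqrt(n)     # margin of error
    hits = 0
    for _ in range(reps):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - half_width < mu < x_bar + half_width:
            hits += 1
    return hits / reps

print(coverage())  # fraction of intervals that caught mu
```

The printed fraction should be close to $1-\alpha = 0.95$: most nets catch the fish, but roughly 1 in 20 miss.
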
how was this confidence found? using CLT
- $X_{1}, X_{2}, \dots X_{n}$, $\bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$ for large $n$ (greater than 30)
- $N$ does not have to be large
- $n > 30$
$n > 30$ –> use the $z$-confidence interval
population is $N(\mu, \sigma^{2})$ with $\sigma$ known –> use the $z$-confidence interval
$n < 30$ and $\sigma$ unknown (estimated by $s$, population assumed normal) –> use the $t$-confidence interval
$\mu \in \left( \bar{x} \pm t_{n-1,\frac{\alpha}{2}} \frac{s}{\sqrt{ n }} \right)$

! [[ Drawing 2025-01-16 11.44.06.excalidraw|100% ]]

[[ 2025-01-21|21 January 2025 ]]

for large $n$, use $\sigma$ or $s$, whatever is available.
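$t_{\frac{\alpha}{2}}$ denotes the critical value of the $t$ distribution with $n-1$ degrees of freedom that leaves area $\alpha/2$ in the upper tail.
LCL and UCL: lower and upper confidence limits.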

problem

[[ CLT Problem ]]

$z_{\alpha/2}$ in a spreadsheet: NORM.S.INV(1-alpha/2). No degrees of freedom are needed.
For larger sample sizes, the $t$-score converges to the $z$-score, because the $t$ distribution approaches the standard normal as the degrees of freedom grow.
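
For readers working outside a spreadsheet, the same $z_{\alpha/2}$ lookup can be sketched in Python. `NormalDist().inv_cdf` from the standard library plays the role of NORM.S.INV here; the helper name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Standard normal value with upper-tail area alpha/2 (like NORM.S.INV(1 - alpha/2))."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    print(f"{confidence:.0%} confidence: z = {z_critical(alpha):.4f}")
```

The three critical values are approximately 1.6449, 1.9600 and 2.5758.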

Population parameters are always constant, sample measures follow a distribution.

\[\underbrace{ \hat{p} }_{ \text{sample proportion} } \sim N\left( \underbrace{p}_{ \text{mean, population proportion} }, \underbrace{ \frac{ p(1-p)}{n} }_{ \text{variance of } \hat{p} } \right), n \to \infty\]

CI for $p$: $\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$
$se(\hat{p}) = \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$, the sample estimate of $sd(\hat{p}) = \sqrt{ \frac{p(1-p)}{n} }$
Requires $np > 5$ and $n(1-p) > 5$, or $> 10$ for both (stricter assumption)
Pattern: start with a point estimate (pivot)
- $\pm$ critical value of the estimator’s distribution times the standard error
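
The pattern above, applied to a proportion, can be sketched as follows. The function names, the default threshold of 5 and the example counts (40 successes out of 100) are hypothetical; the large-sample check uses observed counts as a stand-in for $np$ and $n(1-p)$.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, alpha=0.05):
    """z-confidence interval for a population proportion p."""
    p_hat = successes / n                               # point estimate (pivot)
    z = NormalDist().inv_cdf(1 - alpha / 2)             # critical value
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)     # critical value * standard error
    return p_hat - margin, p_hat + margin

def large_sample_ok(successes, n, threshold=5):
    """Rule-of-thumb check: both n*p_hat and n*(1 - p_hat) exceed the threshold."""
    return successes > threshold and (n - successes) > threshold

# Hypothetical example: 40 successes in a sample of 100
if large_sample_ok(40, 100):
    lcl, ucl = proportion_ci(40, 100)
    print(f"95% CI for p: [{lcl:.3f}, {ucl:.3f}]")
```

Passing `threshold=10` applies the stricter version of the rule.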

batman vs estimator of batman

[[ 2025-01-22|22 January 2025 ]]

! [[ Pasted image 20250122091519.png ]]

Recap

CI for $\mu$: $\sigma$ known: $z$ confidence
CI for $\mu$: $\sigma$ unknown: $t$ confidence
Construct intervals:
- point estimate $\pm$ margin of error
- $\underbrace{ \text{point estimator} }_{ \text{pivot} } \pm \underbrace{ \text{critical value} \cdot \text{standard error} }_{ \text{margin of error} }$
- $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
- $\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$
- UCL = $\bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
- LCL = $\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
- width of the CI = $2\cdot\underbrace{ z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} }_{ \text{precision} }$
- precision = margin of error $=E$ = CI width / 2
- more precise ⟹ smaller margin of error $E$ (narrower CI).

\[\begin{align} E = z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} \\ \\ \implies n = \left( z_{\alpha/2}\cdot \frac{\sigma}{E} \right)^{2} \\ \\ \boxed{n = \frac{z_{\alpha/2}^{2} \sigma^{2}}{E^{2}}} \end{align}\]

This calculation is made before you take the sample data: it determines the required sample size $n$.
$\sigma$ high => $n$ needs to be high (height example; we need to account for variability)
But we don’t know the value of $\sigma$ or $s$ (we haven’t yet taken a sample). How do we get a planning value $\sigma^{*}$ for $\sigma$?
- Use $\sigma$ from census records / previous studies.
- Use $s$ from a pilot study: before you collect data for the main sample, collect data for a small pilot sample.
- calculate the sample size for several plausible values of $\sigma$
- $\sigma \approx \frac{\text{Range}}{4}$ (keep in mind, though, that the range is affected by extreme values)

So, our equation becomes: $n = \frac{z_{\alpha/2}^{2} {\sigma ^{*}}^{2}}{E^{2}} \; \text{for mean}$
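
The sample-size formula for a mean can be sketched as a small function. The name `sample_size_for_mean` and the example numbers (a range of 60, hence a planning value of 15 from the Range/4 rule, and a target margin of 2) are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma_star, margin, alpha=0.05):
    """Smallest n with z_{alpha/2} * sigma_star / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_exact = (z * sigma_star / margin) ** 2
    return math.ceil(n_exact)   # round up so the margin of error is met

# Hypothetical planning value: Range / 4 with a range of 60
sigma_star = 60 / 4
print(sample_size_for_mean(sigma_star, margin=2))
```

With these inputs the formula gives a little over 216, so the required sample size is 217.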

For proportion:

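$\begin{align} E = z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} } \\ \\ \implies n =\frac{z_{\alpha/2}^{2}\cdot p^{*}(1-p ^{*})}{E^{2}} \end{align}$
- For the planning value $p^{*}$, use previous studies, pilot studies, etc. If nothing is available, $p^{*} = 0.5$ gives the largest (most conservative) $n$.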
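
The proportion version works the same way. In this sketch the function name and the example inputs are illustrative; the default $p^{*} = 0.5$ is an assumption used when no prior estimate exists, since it maximises $p^{*}(1-p^{*})$ and therefore $n$.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, p_star=0.5, alpha=0.05):
    """Smallest n with z_{alpha/2} * sqrt(p*(1 - p*) / n) <= margin."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_exact = z ** 2 * p_star * (1 - p_star) / margin ** 2
    return math.ceil(n_exact)   # always round up

print(sample_size_for_proportion(0.05))              # conservative: p* = 0.5
print(sample_size_for_proportion(0.05, p_star=0.3))  # with a hypothetical pilot estimate
```

The conservative choice gives 385 for a margin of 0.05 at 95% confidence; a pilot estimate of 0.3 lowers this to 323.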

Problem

Always round $n$ up to the next integer.
If you round down, the sample is slightly too small, so the target margin of error is no longer met at the stated confidence level.

Observations about $n$

$\sigma$ increases ⟹ $n$ increases
$E$ increases ⟹ $n$ decreases
$\alpha$ increases ⟹ $n$ decreases
$100(1-\alpha)\%$ increases ⟹ $n$ increases
Suppose $n$ comes out to be $150$, but you have the money to collect data with only $n = 100$
- $n = 100$ might give you only 80% confidence at the same margin of error
- $n = 120$ might give you a confidence very close to that given by $n = 150$
- a graph of confidence (or margin of error) against $n$ is useful
- $n$ and the margin of error are not linearly related: $E \propto 1/\sqrt{ n }$.
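
The budget trade-off can be explored numerically. The sketch below is illustrative only: $\sigma = 10$ is a hypothetical planning value, and the target margin is defined as the one that $n = 150$ achieves at 95% confidence, so the printed confidence levels are specific to this setup and need not match the rough figures in the notes.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def margin_of_error(n, sigma, alpha=0.05):
    """E = z_{alpha/2} * sigma / sqrt(n)."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) * sigma / math.sqrt(n)

def achieved_confidence(n, sigma, margin):
    """Confidence level at which a sample of size n attains the given margin."""
    return 2 * STD_NORMAL.cdf(margin * math.sqrt(n) / sigma) - 1

sigma = 10.0                                  # hypothetical planning value
target_margin = margin_of_error(150, sigma)   # margin that n = 150 gives at 95%

for n in (100, 120, 150):
    print(n,
          round(margin_of_error(n, sigma), 3),                     # margin at 95% confidence
          round(achieved_confidence(n, sigma, target_margin), 3))  # confidence at target margin
```

Plotting either column against $n$ gives the graph mentioned above. Because $E \propto 1/\sqrt{n}$, halving the margin of error requires four times the sample size.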