Central limit theorem

General Idea of the Central Limit Theorem

  • Start with a [[ random variable ]]: $X$
  • Add $n$ samples of this variable: $X_{1}+X_{2}+\dots+X_{n}$
  • The distribution of this sum looks more like a bell curve as $n \to \infty$

Underlying Assumptions

  • i.i.d ⟶ independent and identically distributed.
    • All $X_{i}$’s are [[ independent random variables|independent ]] from each other: the outcome of one random process doesn’t influence the outcome of any other process.
    • Each $X_{i}$ is drawn from the same distribution.
  • $0< \mathrm{var}(X_{i})<\infty$ and finite expectation.

[!summary] Central Limit Theorem \(\lim_{ n \to \infty } P\left( a < \frac{(X_{1}+X_{2}+\dots+X_{n})-n\cdot\mu}{\sigma \sqrt{ n }} < b\right) =\int_{a}^{b} \frac{1}{\sqrt{ 2\pi }} e^{-x^{2}/2} \, dx\)


The 68-95-99.7 rule

  • $68\%$ of values fall within 1 standard deviation of the mean.
  • $95\%$ of values fall within 2 standard deviations of the mean.
  • $99.7\%$ of values fall within 3 standard deviations of the mean.
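These percentages can be checked against the standard normal CDF; a quick sketch using Python's stdlib `statistics.NormalDist` (tooling choice is mine, not from the lecture):

```python
from statistics import NormalDist

# Standard normal distribution: the limiting distribution in the CLT.
Z = NormalDist(mu=0, sigma=1)

# P(-k < Z < k) for k = 1, 2, 3 standard deviations.
for k in (1, 2, 3):
    prob = Z.cdf(k) - Z.cdf(-k)
    print(f"within {k} sd: {prob:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```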

! [[ sample_means_coins.png#invert ]] ! [[ sample_means_movies.png#invert ]]

  • Class activity: [[ Statistics Group Project Report ]]
    • skewed and symmetric histogram plots
    • as sample size increases, the range on the x-axis decreases; the estimators in both cases become more precise
    • histograms for coins: $n = 10$ to $n = 50$
      • $n = 10$ is the most skewed, $n = 50$ is the least skewed.
      • histograms are becoming more bell-curved
    • histograms for movie lengths: $n = 10$ to $n = 50$, histograms become more bell-curved
    • difference: for movie lengths the contrast between $n = 10$ and $n = 50$ is much smaller than for coins.
    • $n=30$ for movies is more symmetric than $n=30$ for coins.
    • (why do we want it symmetric? why does it become symmetric with increasing sample size?) \(sd(\bar{x})=\frac{\sigma}{\sqrt{ n }}\)
    • as $n$ increases, standard error of $\bar{x}$ decreases.
  • Central Limit Theorem: As sample size increases, the distribution of the sample mean becomes normal. \(\bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right) \text{ for large } n\)
    • holds regardless of population distribution.
    • (Generally) For $n>30$, $\bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$
    • For $X_{1}, X_{2}, \dots, X_{n}$, as $n\to \infty$, $\bar{x} \to \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$

[!caution] Note CLT applies only to the sample mean $\bar{x}$; it says nothing about the $X_{i}$ (the population). Large $n$ does not mean the population is normal.
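The caution above can be seen empirically: start from a clearly non-normal (right-skewed) population, and only the *sample means* turn bell-shaped. A simulation sketch (the Exponential population and the sizes are illustrative choices of mine):

```python
import random
from statistics import mean, stdev

random.seed(42)

# Skewed population: Exponential(1), so mu = 1 and sigma = 1 (not normal!).
n, reps = 50, 2000

# Distribution of the sample mean: reps independent samples of size n.
xbars = [mean(random.expovariate(1) for _ in range(n)) for _ in range(reps)]

# CLT predicts xbar ~ N(mu, sigma^2/n): mean ~ 1, sd ~ 1/sqrt(50) ~ 0.141.
print(mean(xbars), stdev(xbars))
```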

https://youtu.be/jvoxEYmQHNM

  • Why are the sample means for coins more skewed? It comes down to the nature of the underlying distribution.
  • e.g., height, weight, or movie lengths (very few people are extremely tall) vs. cholesterol or coins (fewer observations at the extremes than near the mean)
\[\begin{align} \text{Large } n \implies \bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right) \\ \\ Z =\left( \frac{\bar{x}-\mu}{\sigma/{\sqrt{ n }}} \right) \sim \mathcal{N}\left( 0, 1 \right) & \text{ by standardizing the sample mean} \end{align}\]
\[\begin{align} P(-a<Z<a) = 95 \% \\ \\ P\left( -a<\left( \frac{\bar{x} - \mu }{\sigma/{\sqrt{ n }}} \right) < a \right) = 95\% \\ \\ P \left( -\frac{a\sigma}{\sqrt{ n }}-\bar{x} < -\mu < \frac{a\sigma}{\sqrt{ n }}-\bar{x} \right) = 95\% \end{align}\]
\[P\left[ \mu \in \underbrace{ \left( \bar{x} \pm a \frac{\sigma}{\sqrt{ n }} \right) }_{ \text{interval} } \right] = 95\%\]
\[P\left[ \mu \in \left( \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} \right) \right] = 100(1-\alpha)\%\]
  • confidence interval, interval estimation
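The probability statement above is about the interval, not about $\mu$: across repeated samples, roughly 95% of the intervals catch the fixed $\mu$. A coverage simulation sketch (the population parameters are made-up assumptions):

```python
import random
from statistics import mean

random.seed(0)

mu, sigma, n, reps = 100.0, 15.0, 36, 2000
z = 1.959964  # z_{0.025}, from standard normal tables

hits = 0
for _ in range(reps):
    xbar = mean(random.gauss(mu, sigma) for _ in range(n))
    half = z * sigma / n**0.5           # margin of error
    if xbar - half < mu < xbar + half:  # did this interval catch mu?
        hits += 1

coverage = hits / reps
print(coverage)  # close to 0.95
```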

Hannah Montana: The Movie opened on Easter weekend in April 2009. Over the three-day weekend, the movie became the number-one box office attraction (The Wall Street Journal, April 13, 2009). The ticket sales revenue in dollars for a sample of 25 theatres is as follows.

Ticket sales (in $)
20,200
8,350
10,750
13,900
13,185
10,150
7,300
6,240
4,200
9,200
13,000
14,000
12,700
6,750
21,400
11,320
9,940
7,430
6,700
11,380
9,700
11,200
13,500
9,330
10,800

Problem A

What is the 95% confidence interval estimate for the mean ticket sales revenue per theatre? Interpret this result.

[!check] Solution https://www.vaia.com/en-us/textbooks/economics/economics-for-today-6-edition/chapter-8/problem-22-disneys-hannah-montana-the-movie-opened-on-easter/

Let $\bar{x}$ be the sample mean of the sample data, and let $n\, (= 25)$ denote the given sample size. We need $1-\alpha = 0.95 \implies \alpha=0.05$. The population is not known to be normal, and $\sigma$ is also not known, so strictly speaking non-parametric methods would be needed to construct a 95% confidence interval. Here, however, we’ll explicitly assume that the population is normal and use the $t$-confidence interval. For 95% confidence, \(\bar{x}-t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} < \mu < \bar{x}+t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }}\)

  • $\bar{x} = \frac{1}{n} \sum_{i=1}^{n}X_{i} = 10905$
  • $s = \sqrt{ \frac{1}{n-1}\sum_{i=1}^{n}(X_{i} - \bar{x})^{2} } \approx 3962.11$
  • Use the function T.INV in spreadsheet to calculate the $t$-score as T.INV(1 - 0.05/2, 25 - 1), then multiply it by $s/\sqrt{ n }$ to get $t_{n-1,\alpha/2} \frac{s}{\sqrt{ n }} = 1635.48$
  • Hence, $\mu = 10905 \pm 1635.48 = \left[ 9270, 12540 \right]$
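The arithmetic above can be reproduced directly from the 25 observations. This sketch hardcodes $t_{24,\,0.025} \approx 2.0639$ from standard $t$-tables rather than calling T.INV:

```python
from statistics import mean, stdev

sales = [20200, 8350, 10750, 13900, 13185, 10150, 7300, 6240, 4200, 9200,
         13000, 14000, 12700, 6750, 21400, 11320, 9940, 7430, 6700, 11380,
         9700, 11200, 13500, 9330, 10800]

n = len(sales)                    # 25
xbar = mean(sales)                # 10905
s = stdev(sales)                  # ~3962.11 (sample sd: divides by n - 1)

t = 2.0639                        # t_{n-1, alpha/2} for 24 df, 95% confidence
margin = t * s / n**0.5           # ~1635.5
print(xbar - margin, xbar + margin)  # ~ (9269.5, 12540.5)
```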

Problem B

Using the movie ticket price of $7.16 per ticket, what is the estimate of the mean number of customers per theatre?

  • Mean ticket sales per theatre was found to be $\left[ 9270, 12540 \right]$ with 95% confidence.
  • So, the estimate of the mean number of customers per theatre will be $\left[ 1294.7, 1751.46 \right]$ with 95% confidence.

Problem C

The movie was shown in 3118 theatres. Estimate the total number of customers who saw Hannah Montana: The Movie and the total box office ticket sales for the three-day weekend.

  • Total customers $=3118 \times \text{average number of customers per theatre}$ = [4,036,874.6, 5,461,052.28]
  • Box office collection = [28,904,022.14, 39,101,134.32]

[[ 2025-01-16|16 January 2025 ]]

Confidence Interval

  • Why do racquet games seem easier than other sports like Javelin or dartboard?
    • suppose you play with a cricket bat or a wicket instead of a badminton racquet –> it’s harder.
    • badminton racquet has a much wider coverage area than a bat or a wicket.
    • think: fishing net vs fishing rod $\sim$ interval estimate vs point estimate.
  • for practical purposes, we do not rely on a point estimator but we provide a small range.
  • a fishing net will not always guarantee you catch a fish: we cannot ensure with 100% probability that the interval will always catch our unknown parameter.
  • we can quantify the uncertainty using probability. This probability is the amount of confidence that the unknown parameter lies in the interval estimate.

  • confidence that $\mu \in \left( \bar{x} \pm z_{\alpha/{2}} \frac{\sigma}{\sqrt{ n }} \right)$ is $100(1-\alpha)\%$
    • $\bar{x}$ can be found
    • $n$ is the sample size.
    • $z_{\frac{\alpha}{2}}$ can be found using the value of $\alpha$ (e.g., from standard normal tables or NORM.S.INV)
    • $\alpha$ is chosen in advance (e.g., $\alpha = 0.05$ for 95% confidence)
    • The variance $\sigma^{2}=\frac{1}{N}\sum(x_{i}-\mu)^{2}$ is a population parameter ^[STDDEV.P in Spreadsheet software] and cannot be calculated. $s^{2} = \frac{1}{n-1} \sum(X_{i}-\bar{x})^{2}$ ^[STDDEV.S in spreadsheet software] is a sample measure and can be calculated.
    • even though we don’t know the population data, in a few circumstances we can use past data such as census records.
  • how was this confidence found? using CLT
    • $X_{1}, X_{2}, \dots X_{n}$, $\bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^{2}}{n} \right)$ for large $n$ (greater than 30)
    • $N$ does not have to be large
    • $n > 30$
  • $n > 30$ –> use z-confidence interval
  • population is $N(\mu, \sigma^{2})$ with $\sigma$ known –> use z-confidence interval
  • $n < 30$ and $\sigma$ unknown (population assumed normal) –> use $t$-confidence interval
  • $\mu \in \left( \bar{x} \pm t_{n-1,\frac{\alpha}{2}} \frac{s}{\sqrt{ n }} \right)$

! [[ Drawing 2025-01-16 11.44.06.excalidraw|100% ]]


[[ 2025-01-21|21 January 2025 ]]

  • for large $n$, use $\sigma$ or $s$, whatever is available.

  • $t_{n-1,\frac{\alpha}{2}}$ denotes the critical value of the $t$-distribution with $n-1$ degrees of freedom (upper-tail area $\alpha/2$)
  • LCL and UCL: lower and upper confidence limits

Problem

[[ CLT Problem ]]

  • $z_{\alpha/2}$ in spreadsheet NORM.S.INV(1-alpha/2). No degrees of freedom.
  • For larger sample sizes, $t$-score converges/merges to $z$-score (by CLT).
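The $t$-to-$z$ convergence is easy to tabulate. In the sketch below, $z_{0.025}$ comes from `NormalDist`, while the $t$ critical values are hardcoded from standard $t$-tables (the Python stdlib has no $t$ inverse CDF):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # z_{0.025} ~ 1.9600; no degrees of freedom
print(f"z: {z:.4f}")

# t_{df, 0.025} from standard t-tables (hardcoded)
t_table = {10: 2.228, 30: 2.042, 100: 1.984, 1000: 1.962}
for df, t in t_table.items():
    print(f"df={df:4d}: t={t:.3f}")  # decreases toward z as df grows
```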

  • Population parameters are always constant, sample measures follow a distribution.
\[\underbrace{ \hat{p} }_{ \text{sample proportion} } \sim N\left( \underbrace{p}_{ \text{mean, population proportion} }, \underbrace{ \frac{ p(1-p)}{n} }_{ \text{variance of } \hat{p} } \right), n \to \infty\]
  • CI for $\hat{p}$: \(\hat{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }\)
  • $se(\hat{p}) = sd(\hat{p}) = \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$

  • Normal approximation rule of thumb: $np > 5$ and $n(1-p) > 5$ (or $> 10$, stricter assumption)
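Putting the proportion formulas together: a sketch with made-up numbers ($\hat{p}=0.52$ from a hypothetical $n=1000$ poll), checking the normal-approximation rule first:

```python
from statistics import NormalDist

p_hat, n = 0.52, 1000          # hypothetical sample proportion and size
assert n * p_hat > 5 and n * (1 - p_hat) > 5  # normal approximation ok

z = NormalDist().inv_cdf(0.975)              # z_{0.025} for 95% confidence
se = (p_hat * (1 - p_hat) / n) ** 0.5        # se(p_hat) = sd(p_hat)
margin = z * se                              # ~0.031
print(p_hat - margin, p_hat + margin)        # ~ (0.489, 0.551)
```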

  • Pattern: start with a point estimate (pivot)
    • $\pm$ critical value of the estimator’s distribution times standard error

  • batman vs estimator of batman

[[ 2025-01-22|22 January 2025 ]]

! [[ Pasted image 20250122091519.png ]]

Recap

  • CI for $\mu$: $\sigma$ known: $z$ confidence
  • CI for $\mu$: $\sigma$ unknown: $t$ confidence
  • Construct intervals:
    • point estimate $\pm$ margin of error
    • $\underbrace{ \text{point estimator} }_{ \text{pivot} } \pm \underbrace{ (\alpha/2 \text{ point}) \cdot \text{standard error} }_{ \text{margin of error} }$
    • $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
    • $\bar{p} \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }$
    • UCL = $\bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
    • LCL = $\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{ n }}$
    • width of the CI = $2\cdot\underbrace{ z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} }_{ \text{precision} }$
    • precision = margin of error $=E$ = CI width / 2
    • more precise ⟹ lower magnitude of $E$ (narrower interval).
\[\begin{align} E = z_{\alpha/2} \frac{\sigma}{\sqrt{ n }} \\ \\ \implies n = \left( z_{\alpha/2}\cdot \frac{\sigma}{E} \right)^{2} \\ \\ \boxed{n = \frac{z_{\alpha/2}^{2} \sigma^{2}}{E^{2}}} \end{align}\]
  • this calculation is to be made before you take sample data: it is made to find the size of $n.$
  • $\sigma$ high => $n$ needs to be high (height example; we need to account for variability)
  • But we don’t know the value of $\sigma$ or $s$ (we haven’t yet taken a sample). How to calculate $\sigma$?
    • Use $\sigma$ from census records / previous studies.
    • Use $\sigma$ from pilot studies: before you collect data for the main sample, collect data for a small pilot sample.
    • calculate sample size for possible different values of $\sigma$
    • $\sigma \approx \frac{\text{Range}}{4}$ (keep in mind, though, that range is affected by extreme values)

So, our equation becomes: \(n = \frac{z_{\alpha/2}^{2} {\sigma ^{*}}^{2}}{E^{2}} \; \text{for mean}\)

  • For proportion:

\(\begin{align} E = z_{\alpha/2} \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} } \\ \\ \implies n =\frac{z_{\alpha/2}^{2}\cdot p^{*}(1-p ^{*})}{E^{2}} \end{align}\) - Use previous studies, pilot studies, etc.
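Both sample-size formulas in code, with the round-up rule applied via `math.ceil`. The planning values ($\sigma^{*}$, $E$, $p^{*}$) are illustrative assumptions; for proportions, $p^{*}=0.5$ is the conservative worst case, since it maximizes $p(1-p)$:

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # z_{0.025} for 95% confidence

# Mean: planning sd sigma* = 4000, desired margin of error E = 1000
n_mean = math.ceil((z * 4000 / 1000) ** 2)
print(n_mean)  # 62 (always round up, never down)

# Proportion: worst case p* = 0.5, desired margin E = 0.03
n_prop = math.ceil(z**2 * 0.5 * (1 - 0.5) / 0.03**2)
print(n_prop)  # 1068
```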

Problem

  • always round up
  • if you round down, the margin of error and/or confidence level may change drastically

Observations about $n$

  • $\sigma$ increases ⟹ $n$ inc
  • $E$ inc ⟹ $n$ dec
  • $\alpha$ inc ⟹ $n$ dec
  • $100(1-\alpha)\%$ inc ⟹ $n$ inc

  • Suppose $n$ comes out to be $150$, but you have the money to collect data with only $n = 100$
    • $n = 100$ might give you only 80% confidence
    • $n = 120$ might give a confidence very close to that given by $n = 150$
    • graph is useful
    • $n$ and margin of error do not change linearly.
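That non-linearity is easy to compute: fix $E$ and $\sigma$, invert $E = z\sigma/\sqrt{n}$ to get $z = E\sqrt{n}/\sigma$, and convert $z$ back into a confidence level. The numbers ($\sigma = 10$, $E = 2$) are illustrative assumptions:

```python
from statistics import NormalDist

Z = NormalDist()
sigma, E = 10.0, 2.0   # hypothetical planning sd and desired margin of error

def confidence(n):
    """Confidence level achieved with sample size n at fixed margin E."""
    z = E * n**0.5 / sigma          # solve E = z*sigma/sqrt(n) for z
    return 2 * Z.cdf(z) - 1         # P(-z < Z < z)

for n in (50, 100, 150):
    print(n, round(confidence(n), 4))
# confidence rises steeply at first, then flattens out with n
```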