[Lecture Notes] ShanghaiTech SI140 Probability and statistics

Typical distributions

In the tables below, "mean" refers to the expectation.

Discrete distributions

| Name | Param | PMF | Mean | Var |
| --- | --- | --- | --- | --- |
| Bernoulli | $p$ | $P(X=1)=p$, $P(X=0)=q$ | $p$ | $pq$ |
| Binomial | $n,p$ | $\binom{n}{k}p^k q^{n-k}$ | $np$ | $npq$ |
| First Success (FS) | $p$ | $pq^{k-1}$ | $1/p$ | $q/p^2$ |
| Geometric | $p$ | $pq^k$ | $q/p$ | $q/p^2$ |
| Negative Binomial | $r,p$ | $\binom{r+n-1}{r-1}p^r q^n$ | $rq/p$ | $rq/p^2$ |
| Hypergeometric | $w,b,n$ | $\frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}$ | $\mu = \frac{nw}{w+b}$ | $\left(\frac{w+b-n}{w+b-1}\right)n\frac{\mu}{n}\left(1-\frac{\mu}{n}\right)$ |
| Poisson | $\lambda$ | $\frac{e^{-\lambda}\lambda^k}{k!}$ | $\lambda$ | $\lambda$ |

(Throughout, $q = 1-p$.)

Continuous distributions

| Name | Param | PDF | Mean | Var |
| --- | --- | --- | --- | --- |
| Uniform | $a<b$ | $\frac{1}{b-a}$ for $x\in (a,b)$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Normal | $\mu, \sigma^2$ | $\frac{1}{\sigma \sqrt{2\pi} }e^{-(x-\mu)^2 / (2\sigma^2)}$ | $\mu$ | $\sigma^2$ |
| Expo | $\lambda$ | $\lambda e^{-\lambda x}$ for $x>0$ | $1/\lambda$ | $1/\lambda^2$ |
| Gamma | $a, \lambda$ | $\Gamma(a)^{-1} (\lambda x)^a e^{-\lambda x} x^{-1}$ for $x>0$ | $a/\lambda$ | $a/\lambda^2$ |
| Beta | $a,b$ | $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1} (1-x)^{b-1}$ for $0<x<1$ | $\mu = \frac{a}{a+b}$ | $\frac{\mu(1-\mu)}{a+b+1}$ |
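A quick way to sanity-check a few of the table entries is to compare them against a library (a minimal sketch of mine, assuming `scipy` is available; note scipy's own parametrizations differ slightly from the table):

```python
# Sanity-check a few table entries against scipy.stats.
from scipy import stats

n, p = 10, 0.3
q = 1 - p

binom = stats.binom(n, p)
assert abs(binom.mean() - n * p) < 1e-9        # mean np
assert abs(binom.var() - n * p * q) < 1e-9     # variance npq

geom = stats.geom(p, loc=-1)                   # shift so the support counts failures, like the table
assert abs(geom.mean() - q / p) < 1e-9         # mean q/p
assert abs(geom.var() - q / p**2) < 1e-9       # variance q/p^2

a, lam = 3.0, 2.0
gamma = stats.gamma(a, scale=1 / lam)          # scipy uses shape a and scale = 1/lambda
assert abs(gamma.mean() - a / lam) < 1e-9      # mean a/lambda
assert abs(gamma.var() - a / lam**2) < 1e-9    # variance a/lambda^2

print("all table entries confirmed")
```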

Lecture 2: Conditional Probability

1. Definition & Intuition

  • Definition of Conditional Probability

$$P(A\mid B) = \frac{P(A\cap B)}{P(B)}$$

2. Bayes’ Rule & LOTP

  • Chain Rule

    $$P(A_1, A_2) = P(A_1)P(A_2\mid A_1)$$

    $$P(A_1, \cdots, A_n) = P(A_1)P(A_2\mid A_1)P(A_3\mid A_1, A_2) \cdots P(A_n\mid A_1, \cdots, A_{n-1})$$

  • Bayes’ Rule

    $$P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}$$

  • The Law of Total Probability / LOTP

    Let $A_1, \cdots, A_n$ be a partition of the sample space $S$, with $P(A_i)>0$. Then

    $$P(B) = \sum_{i=1}^{n} P(B\mid A_i)P(A_i)$$

  • Inference & Bayes’ Rule (simply Bayes’ rule with LOTP substituted into the denominator)

$$P(A_i\mid B) = \frac{P(A_i)P(B\mid A_i)}{P(A_1)P(B\mid A_1) + \cdots + P(A_n)P(B\mid A_n)}$$
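A tiny numeric illustration (the two-machine setup is my own, not from the lecture): machine $A_1$ makes 60% of all parts with a 1% defect rate, machine $A_2$ makes 40% with a 5% defect rate. Given a defective part (event $B$), which machine likely produced it?

```python
# Posterior over a partition {A_1, A_2} given evidence B, via LOTP + Bayes' rule.
prior = [0.6, 0.4]   # P(A_1), P(A_2): a partition of the sample space
lik = [0.01, 0.05]   # P(B | A_1), P(B | A_2)

# LOTP: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(l * p for l, p in zip(lik, prior))

# Bayes' rule: P(A_i | B) = P(A_i) P(B | A_i) / P(B)
posterior = [l * p / p_b for l, p in zip(lik, prior)]

print(f"P(B)       = {p_b:.4f}")           # 0.0260
print(f"P(A_1 | B) = {posterior[0]:.4f}")  # 0.2308
print(f"P(A_2 | B) = {posterior[1]:.4f}")  # 0.7692
```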

3. Conditional Probabilities are Probabilities

  • Every conditional probability is a probability; that is, conditional probabilities satisfy all the axioms of probability. For example, replacing the unconditional probabilities $P(\cdot)$ in the formulas above with conditional probabilities $P(\cdot\mid E)$ yields new identities, such as the following.

  • Bayes’ Rule with extra conditioning

$$P(A\mid B, E) = \frac{P(B\mid A, E)P(A\mid E)}{P(B\mid E)}$$

  • LOTP with extra conditioning

$$P(B\mid E) = \sum_{i=1}^{n} P(B\mid A_i, E)P(A_i\mid E)$$

4. Independence of Events

  • Independence of two events

    Events $A$ and $B$ are independent if

$$P(A\cap B) = P(A)P(B)$$

If $P(A)>0$ and $P(B)>0$, this is equivalent to

$$P(A\mid B) = P(A), \quad P(B\mid A)=P(B)$$

Lectures 3, 4, 5: unfinished

Nothing here

Lecture 6: Joint Distributions

Covariance

Definition

$$\mathrm{Cov}(X,Y) = E((X-EX)(Y-EY)) = E(XY) - E(X)E(Y)$$

Properties

  • $Cov(X,X) = Var(X)$

  • $Cov(X,Y) = Cov(Y,X)$

  • $Cov(X,c) = 0$ for any constant $c$

  • $Cov(aX, Y) = a \cdot Cov(X,Y)$

  • $Cov(X+Y,Z) = Cov(X,Z) + Cov(Y,Z)$

  • $Cov(X+Y, Z+W) = Cov(X,Z) + Cov(X,W) + Cov(Y,Z) + Cov(Y,W)$

  • $Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)$

  • For $n$ r.v.s $X_1, \cdots, X_n$,

    $$Var(X_1 + \cdots + X_n) = Var(X_1) + \cdots + Var(X_n) + 2 \sum_{i<j} Cov(X_i, X_j)$$

Correlation

Definition

$$Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$$

Properties
Theorem:

  • $X$ and $Y$ are uncorrelated if $Cov(X,Y) = 0$ (equivalently, $Corr(X,Y) = 0$).

  • Independent $\Rightarrow$ uncorrelated. (The converse is false in general.)

  • $-1 \leq Corr(X,Y) \leq 1$
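The failure of the converse can be seen in a standard counterexample, $X\sim\mathcal{N}(0,1)$ and $Y=X^2$ (the simulation below is my own sketch):

```python
# Uncorrelated but not independent: X ~ N(0,1), Y = X^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2

# Cov(X, Y) = E(X^3) - E(X)E(X^2) = 0 by symmetry, so the correlation is ~0 ...
print("Corr(X, Y) ~", np.corrcoef(x, y)[0, 1])

# ... but Y is a deterministic function of X, so they are clearly dependent:
print("P(Y > 1)           ~", (y > 1).mean())                  # about 0.32
print("P(Y > 1 | |X| > 1) =", (y[np.abs(x) > 1] > 1).mean())   # exactly 1.0
```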

Multinomial Distribution

Multinomial Joint PMF: $X\sim Mult_k(n, p)$; then the joint PMF is

$$P(X_1=n_1, \cdots, X_k=n_k) = \frac{n!}{n_1!n_2!\cdots n_k!}p_1^{n_1}\cdots p_k^{n_k}$$

for $n_1 + \cdots + n_k = n$.

Multinomial Marginal:

$$X\sim Mult_k(n,p) \ \Rightarrow\ X_j \sim Bin(n, p_j)$$

Multivariate Normal Distribution (MVN)

Definition: A random vector $X=(X_1, \cdots, X_k)$ is said to have an MVN distribution if every linear combination of the $X_j$ has a normal distribution.

That is, $t_1 X_1 + \cdots + t_k X_k$ has a normal distribution for any choice of constants $t_1, \cdots, t_k$.

When $k=2$, this is called the Bivariate Normal (BVN).

Theorem: If $(X_1, X_2, X_3)$ is MVN, then $(X_1, X_2)$ is MVN.


Lecture 7: Transformations

1. Change of variables

Given the PDF of $X$, find the PDF of $g(X)$.

Let $X$ be a continuous r.v. with PDF $f_X$, and let $Y=g(X)$, where $g$ is differentiable and strictly monotone (increasing or decreasing). Then the PDF of $Y$ is

$$f_Y(y) = f_X(x) \left\lvert \frac{\mathrm{d}x}{\mathrm{d}y} \right\rvert$$
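A quick simulation check of this formula (my own sketch): with $X\sim\mathcal{N}(0,1)$ and $Y=e^X$, we have $x=\log y$ and $\lvert dx/dy \rvert = 1/y$, so $f_Y(y)$ is the standard normal PDF evaluated at $\log y$, divided by $y$ (the log-normal PDF):

```python
# Check f_Y(y) = f_X(log y) * (1/y) for Y = e^X, X ~ N(0,1), against a histogram.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y_samples = np.exp(rng.standard_normal(500_000))   # draw X, transform to Y = e^X

hist, edges = np.histogram(y_samples, bins=200, range=(0, 5), density=True)
centers = (edges[:-1] + edges[1:]) / 2

for y in (0.5, 1.0, 2.0, 3.0):
    formula = stats.norm.pdf(np.log(y)) / y        # f_X(x) |dx/dy| with x = log y
    empirical = hist[np.argmin(np.abs(centers - y))]
    print(f"y={y:.1f}  change-of-variables={formula:.4f}  histogram={empirical:.4f}")
```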

The Jacobian determinant

Let $X=(X_1, \cdots, X_n)$ be a continuous random vector with joint PDF $f_X(x)$, and let $Y=g(X)$, where $g$ is an invertible function, $y=g(x)$, and the partial derivatives $\frac{\partial x_i}{\partial y_j}$ exist.

The Jacobian matrix is

$$\frac{\partial x}{\partial y} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n}\\ \vdots & \vdots & & \vdots\\ \frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n} \end{pmatrix}$$

Then the joint PDF of $Y$ is

$$f_Y(y) = f_X(x) \left\lvert \frac{\partial x}{\partial y} \right\rvert$$

2. Convolutions

Convolution sums and integrals (the distribution of a sum of two random variables)

Theorem: If $X,Y$ are independent and discrete, the PMF of $T=X+Y$ is

$$\begin{aligned} P(T=t) &= \sum_x P(Y=t-x)P(X=x)\\ &= \sum_y P(X=t-y)P(Y=y) \end{aligned}$$

Theorem: If $X,Y$ are independent and continuous, the PDF of $T=X+Y$ is

$$\begin{aligned} f_T(t) &= \int_{-\infty}^{\infty} f_Y(t-x)f_X(x)dx\\ &= \int_{-\infty}^{\infty} f_X(t-y)f_Y(y)dy \end{aligned}$$
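For two i.i.d. $Expo(\lambda)$ r.v.s the integral can be done in closed form, giving $f_T(t) = \lambda^2 t e^{-\lambda t}$, the $Gamma(2,\lambda)$ PDF (see the Gamma section below). A sketch of mine comparing a numerical convolution against it:

```python
# Numerically convolve two Expo(lambda) PDFs and compare with Gamma(2, lambda).
import numpy as np
from scipy import stats

lam = 1.5
t = np.linspace(0, 10, 2001)
dt = t[1] - t[0]
f = lam * np.exp(-lam * t)                      # Expo(lambda) PDF sampled on a grid

conv = np.convolve(f, f)[: len(t)] * dt         # Riemann-sum version of the integral
exact = stats.gamma.pdf(t, a=2, scale=1 / lam)  # Gamma(2, lambda): lambda^2 t e^{-lambda t}

print("max abs error:", np.max(np.abs(conv - exact)))  # small, shrinks with a finer grid
```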

3. Order statistics

CDF & PDF of order statistics:
Let $X_1, \cdots, X_n$ be i.i.d. continuous r.v.s with CDF $F$ and PDF $f$. Then the CDF and PDF of the $j$-th order statistic $X_{(j)}$ are

$$P(X_{(j)}\leq x) = \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1-F(x))^{n-k}$$

$$f_{X_{(j)}}(x) = n \binom{n-1}{j-1} f(x) F(x)^{j-1} (1-F(x))^{n-j}$$
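A simulation sketch of the PDF formula, using $n=5$ i.i.d. $Unif(0,1)$ r.v.s, for which the formula reduces to the $Beta(j, n-j+1)$ PDF (the parameter choices are mine):

```python
# PDF of the j-th order statistic of n i.i.d. Unif(0,1) r.v.s: formula vs. Beta vs. simulation.
import numpy as np
from scipy import stats
from scipy.special import comb

rng = np.random.default_rng(0)
n, j = 5, 2
x_j = np.sort(rng.uniform(size=(200_000, n)), axis=1)[:, j - 1]   # samples of X_(j)

x = np.linspace(0.05, 0.95, 7)
# f(x) = 1 and F(x) = x for Unif(0,1), so the formula becomes:
formula = n * comb(n - 1, j - 1) * x ** (j - 1) * (1 - x) ** (n - j)
print(np.allclose(formula, stats.beta.pdf(x, j, n - j + 1)))      # True: X_(j) ~ Beta(j, n-j+1)

hist, edges = np.histogram(x_j, bins=20, range=(0, 1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
sim = n * comb(n - 1, j - 1) * centers ** (j - 1) * (1 - centers) ** (n - j)
print("max |simulated - formula|:", np.max(np.abs(hist - sim)))   # small
```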

Joint PDF of the order statistics: for $x_1 < x_2 < \cdots < x_n$,

$$f_{X_{(1)}, X_{(2)},\cdots, X_{(n)}}(x_1, \cdots, x_n) = n! \prod_{i=1}^{n} f(x_i)$$

Related Identity (connecting the discrete and the continuous)

Theorem: For $0<p<1$ and any nonnegative integer $k < n$,

$$\sum_{j=0}^{k} \binom{n}{j} p^j (1-p)^{n-j} = \frac{n!}{k! (n-k-1)!}\int_p^1 x^k (1-x)^{n-k-1} dx$$

4. Beta distribution

The Beta distribution

  • $X\sim Beta(a,b)$ ($a>0, b>0$)
  • PDF (for $0<x<1$):

$$f(x) = \frac{1}{\beta(a,b)} x^{a-1} (1-x)^{b-1}$$

  • Beta function:

$$\beta(a,b) = \int_{0}^{1} x^{a-1} (1-x)^{b-1}dx$$

$$\beta(a,b) = \frac{(a-1)!(b-1)!}{(a+b-1)!} = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

(The factorial form applies when $a$ and $b$ are positive integers.)

Story: Bayes’ billiards

$$\int_0^1 \binom{n}{k} x^k (1-x)^{n-k} dx = \frac{1}{n+1}$$

for any integers $k$ and $n$ with $0\leq k \leq n$.

Story: Beta-binomial conjugacy

After $n$ tosses of a coin, $k$ of which land heads, what is the estimate $\hat{p}$?

Take the prior $p\sim Beta(a,b)$; then the posterior is $p\sim Beta(a+k, b+n-k)$, whose mean is $E(p\mid X=k) = \frac{a+k}{a+b+n}$.

If the prior is beta and the data are conditionally binomial given $p$, then the posterior is also beta; we say beta is the conjugate prior of the binomial.
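A minimal sketch of the update (the prior and data values below are arbitrary):

```python
# Beta-binomial conjugate update: prior Beta(a, b), then observe k heads in n tosses.
from scipy import stats

a, b = 2, 2      # prior Beta(2, 2): mildly favors p near 1/2
n, k = 20, 14    # data: 14 heads in 20 tosses

posterior = stats.beta(a + k, b + n - k)       # posterior Beta(a+k, b+n-k)
print("posterior mean:", posterior.mean())     # (a+k)/(a+b+n) = 16/24
print("95% interval  :", posterior.interval(0.95))
```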

5. Gamma distribution

The gamma function $\Gamma(\cdot)$

For $a>0$,

$$\Gamma(a) = \int_0^\infty x^{a-1} e^{-x} dx$$

Properties:

  • $\Gamma(a+1) = a\Gamma(a)$ ($a>0$)
  • $\Gamma(n) = (n-1)!$ ($n$ a positive integer)

Gamma distribution

  • $Y\sim Gamma(a, \lambda)$ ($a>0$, $\lambda>0$)
  • PDF

$$f(y) = \frac{1}{\Gamma(a)} (\lambda y)^a e^{-\lambda y} \frac{1}{y}$$

  • The Gamma distribution generalizes the exponential distribution:

$$Gamma(1, \lambda) = Expo(\lambda)$$

  • Moments of the gamma distribution:

    • $X\sim Gamma(a, 1)$:

      $$E(X) = a, \quad E(X^2) = a(a+1), \quad Var(X) = a$$

    • $Y = \frac{X}{\lambda} \sim Gamma(a, \lambda)$:

      $$E(Y) = \frac{a}{\lambda}, \quad Var(Y) = \frac{a}{\lambda^2}$$

Gamma as a convolution of exponentials:

Let $X_1, \cdots, X_n$ be i.i.d. $Expo(\lambda)$; then $X_1 + \cdots + X_n \sim Gamma(n, \lambda)$.

That is, the Gamma distribution can be viewed as the convolution (sum) of $n$ i.i.d. exponential distributions.

Beta-Gamma connection (bank-post office story):

Let $X\sim Gamma(a, \lambda)$ and $Y\sim Gamma(b, \lambda)$ be independent. Then

$$\begin{aligned} X+Y &\sim Gamma(a+b, \lambda)\\ \frac{X}{X+Y} &\sim Beta(a, b) \end{aligned}$$

and they are independent.
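A simulation sketch of this result (the parameter values are my own, arbitrary choices):

```python
# Check: X ~ Gamma(a, lam), Y ~ Gamma(b, lam) independent
#   =>  X+Y ~ Gamma(a+b, lam), X/(X+Y) ~ Beta(a, b), and they are independent.
import numpy as np

rng = np.random.default_rng(0)
a, b, lam = 2.0, 3.0, 1.7
x = rng.gamma(a, 1 / lam, size=500_000)        # numpy uses scale = 1/lambda
y = rng.gamma(b, 1 / lam, size=500_000)

t, w = x + y, x / (x + y)
print("E(T):", t.mean(), "vs", (a + b) / lam)    # Gamma(a+b, lam) mean
print("E(W):", w.mean(), "vs", a / (a + b))      # Beta(a, b) mean
print("Corr:", np.corrcoef(t, w)[0, 1])          # ~0, consistent with independence
```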

Lecture 8: Bayesian Statistical Inference

Bayesian statistics

General LOTP

| Case | LOTP |
| --- | --- |
| $X$ discrete, $Y$ discrete | $P(X=x) = \sum_y P(X=x\mid Y=y) P(Y=y)$ |
| $X$ discrete, $Y$ continuous | $P(X=x) = \int_{-\infty}^{\infty} P(X=x\mid Y=y) f_Y(y) dy$ |
| $X$ continuous, $Y$ discrete | $f_X(x) = \sum_y f_X(x\mid Y=y) P(Y=y)$ |
| $X$ continuous, $Y$ continuous | $f_X(x) = \int_{-\infty}^{\infty} f_{X\mid Y} (x\mid y) f_Y(y) dy$ |

General Bayes’ Rule

| Case | Bayes’ Rule |
| --- | --- |
| $X$ discrete, $Y$ discrete | $P(Y=y\mid X=x) = \frac{P(X=x\mid Y=y)P(Y=y)}{P(X=x)}$ |
| $X$ discrete, $Y$ continuous | $f_Y(y\mid X=x) = \frac{P(X=x\mid Y=y)f_Y(y)}{P(X=x)}$ |
| $X$ continuous, $Y$ discrete | $P(Y=y\mid X=x) = \frac{f_X(x\mid Y=y)P(Y=y)}{f_X(x)}$ |
| $X$ continuous, $Y$ continuous | $f_{Y\mid X} (y\mid x) = \frac{f_{X\mid Y}(x\mid y) f_Y(y)}{f_X(x)}$ |

MAP (Maximum A Posteriori Probability)

Given the observed value $x$, the MAP rule selects a value $\hat{\theta}$ that maximizes the posterior distribution $p_{\Theta\mid X} (\theta\mid x)$ or $f_{\Theta\mid X} (\theta\mid x)$ over $\theta$.
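A sketch of the MAP rule on the beta-binomial posterior from Lecture 7 (the grid search is my own illustration; the closed-form value is the standard mode of a Beta distribution):

```python
# MAP estimate of p from a Beta(a+k, b+n-k) posterior, by grid search.
import numpy as np
from scipy import stats

a, b, n, k = 2, 2, 20, 14
theta = np.linspace(0.001, 0.999, 9999)

posterior = stats.beta.pdf(theta, a + k, b + n - k)   # f_{Theta|X}(theta | x)
theta_hat = theta[np.argmax(posterior)]               # maximize the posterior over theta

print("grid-search MAP :", theta_hat)
print("closed-form mode:", (a + k - 1) / (a + b + n - 2))   # mode of the Beta posterior
```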

3. Conditional expectation

Definition

Conditional expectation given an event:

$$E(Y\mid A) = \sum_y y P(Y=y\mid A) \quad \text{(discrete)}$$

$$E(Y\mid A) = \int_{-\infty}^{\infty} y f(y\mid A) dy \quad \text{(continuous)}$$

Over many repetitions of the experiment, $E(Y\mid A)$ is approximately the average value of $Y$ over the repetitions in which $A$ occurs.

LOTE (law of total expectation)

$$E(Y) = \sum_{i=1}^{n} E(Y\mid A_i) P(A_i)$$

Definition: Conditional expectation given an r.v.

$$g(x) = E(Y\mid X=x)$$

$E(Y\mid X)$ is a function of $X$, and it is also a random variable.

Properties

  • Dropping what’s independent:

    If $X$ and $Y$ are independent, $E(Y\mid X) = E(Y)$.

  • Taking out what’s known:

    For any function $h$, $E(h(X)Y\mid X) = h(X) E(Y\mid X)$.

  • Linearity:

    $E(Y_1 + Y_2\mid X) = E(Y_1\mid X) + E(Y_2 \mid X)$

  • Adam’s Law (the tower property):

    $E(E(Y\mid X)) = E(Y)$

  • Adam’s Law with extra conditioning:

    $E(E(Y\mid X, Z)\mid Z) = E(Y\mid Z)$

    $E(E(X\mid Z, Y)\mid Y) = E(X\mid Y)$

Conditional Variance

$$Var(Y\mid X) = E[(Y-E(Y\mid X))^2\mid X]$$

$$Var(Y\mid X) = E(Y^2\mid X) - (E(Y\mid X))^2$$

Eve’s Law / EVVE

$$Var(Y) = E(Var(Y\mid X)) + Var(E(Y\mid X))$$
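A simulation sketch verifying both Adam's and Eve's laws on a hierarchical model of my own choosing, $X\sim Pois(10)$ and $Y\mid X \sim Bin(X, 0.3)$, where $E(Y\mid X)=0.3X$ and $Var(Y\mid X)=0.21X$:

```python
# Verify Adam's law E(E(Y|X)) = E(Y) and Eve's law
# Var(Y) = E(Var(Y|X)) + Var(E(Y|X)) on X ~ Pois(10), Y | X ~ Bin(X, 0.3).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.poisson(10, size=1_000_000)
y = rng.binomial(x, p)

e_y_given_x = p * x                  # E(Y|X): a random variable, a function of X
var_y_given_x = p * (1 - p) * x      # Var(Y|X)

print("Adam:", y.mean(), "vs", e_y_given_x.mean())                       # both ~3.0
print("Eve :", y.var(), "vs", var_y_given_x.mean() + e_y_given_x.var())  # both ~3.0
```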

4. Prediction and estimation

Linear Regression

The linear regression model uses a single explanatory variable $X$ to predict a response variable $Y$, and it assumes that the conditional expectation of $Y$ is linear in $X$: $E(Y\mid X) = a+bX$.

An equivalent way to express this is to write $Y=a+bX+\epsilon$.

$$\begin{cases} a = E(Y) - bE(X) = E(Y) - \frac{Cov(X,Y)}{Var(X)}\cdot E(X) \\ b = \frac{Cov(X,Y)}{Var(X)} \end{cases}$$

LLSE / Linear Least Squares Estimate

The LLSE of $Y$ given $X$, denoted by $L[Y\mid X]$, is the linear function $a+bX$ that minimizes $E[(Y-a-bX)^2]$. In fact,

$$L[Y\mid X] = E(Y) + \frac{Cov(X,Y)}{Var(X)} (X-E(X))$$
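A sketch computing $a$ and $b$ from simulated data (the data-generating model below is made up) and comparing with `np.polyfit`:

```python
# LLSE coefficients from samples: b = Cov(X,Y)/Var(X), a = E(Y) - b E(X).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, size=x.size)   # hypothetical linear model plus noise

b = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()

print("LLSE   :", a, b)                       # close to (1.0, 0.5)
print("polyfit:", np.polyfit(x, y, 1)[::-1])  # least-squares fit agrees
```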

MMSE / Minimum Mean Square Error Estimator

The MMSE of YY given XX is given by

$$g(X) = E(Y\mid X)$$

Projection Interpretation / Geometric perspective

For any function $h$,

$$Y-E(Y\mid X) \bot h(X)$$

$$E((Y-E(Y\mid X)) \cdot h(X)) = 0$$

Orthogonality Property of MMSE

Theorem:

(a) For any function $\phi(\cdot)$, $E((Y-E(Y\mid X)) \cdot \phi(X)) = 0$.

(b) Moreover, if the function $g(X)$ is such that $E((Y-g(X)) \cdot \phi(X)) = 0$ for any $\phi$, then $g(X) = E(Y\mid X)$.

MMSE for jointly Gaussian random variables

Theorem: Let $X$, $Y$ be jointly Gaussian random variables. Then

$$E[Y\mid X] = L[Y\mid X] = E(Y) + \frac{Cov(X,Y)}{Var(X)} (X-E(X))$$

Lecture 9: Classical Statistical Inference

1. Inference Rule: MLE / Maximum Likelihood Estimation

MLE

  • The MLE is the value of $\theta$ that maximizes the joint probability of the observed data:
    $$\hat{\theta_n} = \arg\max_{\theta} P_X(x_1, \cdots, x_n; \theta)$$

MLE under the independent case (independence makes the MLE easier to compute)

  • Observations $X_i$ are independent. We observe $x = (x_1, \cdots, x_n)$.
  • Log-likelihood function:

$$\log\left[P_X(x_1, \cdots, x_n;\theta)\right] = \log \prod_{i=1}^{n} P_{X_i} (x_i;\theta) = \sum_{i=1}^{n} \log\left[ P_{X_i}(x_i;\theta) \right]$$

$$\log\left[f_X(x_1, \cdots, x_n;\theta)\right] = \log \prod_{i=1}^{n} f_{X_i} (x_i;\theta) = \sum_{i=1}^{n} \log\left[ f_{X_i}(x_i;\theta) \right]$$

  • MLE under the independent case:

$$\hat{\theta_n} = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[P_{X_i}(x_i;\theta) \right]$$

$$\hat{\theta_n} = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[f_{X_i}(x_i;\theta) \right]$$
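A sketch of numerically maximizing a log-likelihood (i.i.d. $Expo(\lambda)$ data of my own choosing, where the closed-form MLE is $\hat{\lambda} = 1/\overline{x}$):

```python
# MLE for lambda from i.i.d. Expo(lambda) data: maximize the log-likelihood numerically.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=10_000)   # true lambda = 2.5

def neg_log_lik(lam):
    # sum_i log f(x_i; lam) = n log(lam) - lam * sum_i x_i
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print("numeric MLE:", res.x)
print("closed form:", 1 / x.mean())   # analytic MLE 1/sample-mean; should agree
```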

3. Central Limit Theorem

Central Limit Theorem

  • As $n\to \infty$,

$$\sqrt{n} \left(\frac{\overline{X_n} - \mu}{\sigma} \right) \to \mathcal{N}(0,1)$$

in distribution. In words, the CDF of the left-hand side approaches the CDF of the standard normal distribution.

CLT approximation

  • For large $n$, the distribution of $\overline{X_n}$ is approximately $\mathcal{N}(\mu,\sigma^2/n)$.
  • For large $n$, the distribution of $n\overline{X_n}$ is approximately $\mathcal{N}(n\mu,n\sigma^2)$.
  • (That is, for large $n$, whatever the original distribution of the $X_i$, the sample mean $\overline{X_n}$ can be approximated by a normal distribution.)

Poisson convergence to normal

Let $Y\sim Pois(n)$. We can consider it to be a sum of $n$ i.i.d. $Pois(1)$ r.v.s. Therefore for large $n$, $Y$ is approximately $\mathcal{N}(n,n)$.

Gamma convergence to normal

Let $Y\sim Gamma(n, \lambda)$. We can consider it to be a sum of $n$ i.i.d. $Expo(\lambda)$ r.v.s. Therefore for large $n$, $Y$ is approximately $\mathcal{N}(\frac{n}{\lambda}, \frac{n}{\lambda^2})$.

Binomial convergence to normal

Let $Y\sim Bin(n,p)$. We can consider it to be a sum of $n$ i.i.d. $Bern(p)$ r.v.s. Therefore for large $n$, $Y$ is approximately $\mathcal{N}(np, np(1-p))$.

Continuity Correction: De Moivre-Laplace Approximation

$$\begin{aligned} P(Y=k) &= P(k-\frac{1}{2} < Y < k+\frac{1}{2})\\ &\approx \Phi\left(\frac{k + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) \end{aligned}$$

$$\begin{aligned} P(k\leq Y\leq l) &= P(k-\frac{1}{2} < Y < l+\frac{1}{2})\\ &\approx \Phi\left(\frac{l + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) \end{aligned}$$

where $\Phi$ is the standard normal CDF.
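A quick check of the corrected approximation against the exact binomial PMF (the numbers are my own, arbitrary choices):

```python
# De Moivre-Laplace with continuity correction vs. the exact Bin(n, p) PMF.
import numpy as np
from scipy import stats

n, p, k = 100, 0.4, 42
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.pmf(k, n, p)
corrected = stats.norm.cdf((k + 0.5 - mu) / sd) - stats.norm.cdf((k - 0.5 - mu) / sd)
uncorrected = stats.norm.pdf((k - mu) / sd) / sd   # naive density approximation

print(f"exact={exact:.5f}  corrected={corrected:.5f}  uncorrected={uncorrected:.5f}")
```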

4. Confidence Interval

Lecture 10: Monte Carlo Statistical Methods

2. Law of Large Numbers

Recall: Sample Mean

  • Let $X_1, \cdots, X_n$ be i.i.d. r.v.s with mean $\mu$ and variance $\sigma^2$. The sample mean $\overline{X_n} = \frac{1}{n}\sum_{j=1}^n X_j$ is itself an r.v., with mean $\mu$ and variance $\sigma^2 / n$.

  • (The sample mean is itself a random variable; as $n\to\infty$, its variance tends to zero.)

Strong Law of Large Numbers / SLLN

  • The sample mean $\overline{X_n}$ converges to the true mean $\mu$ as $n\to\infty$ with probability $1$. In other words, the event $\overline{X_n}\to\mu$ has probability $1$.

Weak Law of Large Numbers / WLLN

  • For all $\epsilon > 0$, $P(\lvert \overline{X_n} - \mu \rvert > \epsilon) \to 0$ as $n\to\infty$.
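A sketch of the sample mean settling down to $\mu$ (fair die rolls, $\mu = 3.5$; the example is my own):

```python
# Watch the running sample mean of fair die rolls converge to mu = 3.5.
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)                     # uniform on {1, ..., 6}
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  sample mean = {running_mean[n - 1]:.4f}")   # -> 3.5
```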

3. Non-asymptotic Analysis: Inequalities

Cauchy-Schwarz Inequality

  • For any r.v.s $X,Y$ with finite variances,

$$\lvert E(XY)\rvert \leq \sqrt{E(X^2)E(Y^2)}$$

Second Moment Method

  • Let $X$ be a nonnegative r.v. Then

$$P(X=0)\leq \frac{Var(X)}{E(X^2)}$$

Jensen’s Inequality

  • If $f$ is a convex function (second derivative nonnegative), $0\leq \lambda_1, \lambda_2 \leq 1$, and $\lambda_1 + \lambda_2 = 1$, then for any $x_1, x_2$,

$$f(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 f(x_1) + \lambda_2 f(x_2)$$

  • Let $X$ be an r.v. If $g$ is a convex function, then $E(g(X))\geq g(E(X))$. If $g$ is a concave function, then $E(g(X))\leq g(E(X))$. Equality holds if and only if $g(X) = a+bX$ for some constants $a, b$ (i.e., $g$ is linear).

Entropy

  • Let $X$ be a discrete r.v. taking $n$ distinct values with probabilities $p_1, \cdots, p_n$. The entropy of $X$ is

$$H(X) = \sum_{j=1}^{n} p_j \log_2 \left(\frac{1}{p_j}\right)$$

  • By Jensen’s inequality, the entropy is maximized when $X$ is uniform.
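A sketch checking that the uniform PMF maximizes entropy on a fixed support (the random comparison PMFs are my own choice):

```python
# Entropy H(X) = sum_j p_j log2(1/p_j); the uniform PMF maximizes it.
import numpy as np

def entropy(p):
    p = p[p > 0]                       # treat 0 * log2(1/0) as 0
    return np.sum(p * np.log2(1 / p))

n = 8
uniform = np.full(n, 1 / n)
print("uniform:", entropy(uniform))    # log2(8) = 3, the maximum

rng = np.random.default_rng(0)
for _ in range(3):
    p = rng.dirichlet(np.ones(n))      # a random PMF on the same support
    print("random :", entropy(p))      # always <= 3, by Jensen's inequality
```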
