Typical distributions

Discrete distributions

| Name | Param | PMF | Mean | Var |
| --- | --- | --- | --- | --- |
| Bernoulli | $p$ | $\cdots$ | $p$ | $pq$ |
| Binomial | $n, p$ | $\binom{n}{k}p^k q^{n-k}$ | $np$ | $npq$ |
| First Success | $p$ | $pq^{k-1}$ | $1/p$ | $q/p^2$ |
| Geometric | $p$ | $pq^{k}$ | $q/p$ | $q/p^2$ |
| Negative Binomial | $r, p$ | $\binom{r+n-1}{r-1}p^r q^n$ | $rq/p$ | $rq/p^2$ |
| Hypergeometric | $w, b, n$ | $\frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}$ | $\mu = \frac{nw}{w+b}$ | $\left(\frac{w+b-n}{w+b-1}\right)n\frac{\mu}{n}\left(1-\frac{\mu}{n}\right)$ |
| Poisson | $\lambda$ | $\frac{e^{-\lambda}\lambda^k}{k!}$ | $\lambda$ | $\lambda$ |
Continuous distributions

| Name | Param | PDF | Mean | Var |
| --- | --- | --- | --- | --- |
| Uniform | $a<b$ | $\frac{1}{b-a}$ for $x\in (a,b)$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Normal | $\mu, \sigma^2$ | $\frac{1}{\sigma \sqrt{2\pi}}e^{-(x-\mu)^2 / (2\sigma^2)}$ | $\mu$ | $\sigma^2$ |
| Exponential | $\lambda$ | $\lambda e^{-\lambda x}$ for $x>0$ | $1/\lambda$ | $1/\lambda^2$ |
| Gamma | $a, \lambda$ | $\Gamma(a)^{-1} (\lambda x)^a e^{-\lambda x} x^{-1}$ for $x>0$ | $a/\lambda$ | $a/\lambda^2$ |
| Beta | $a, b$ | $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1} (1-x)^{b-1}$ for $0<x<1$ | $\mu = \frac{a}{a+b}$ | $\frac{\mu(1-\mu)}{a+b+1}$ |
Lecture 2: Conditional Probability

1. Definition & Intuition

Definition of Conditional Probability:

$$P(A\mid B) = \frac{P(A\cap B)}{P(B)}$$
2. Bayes’ Rule & LOTP

Chain Rule:

$$P(A_1, A_2) = P(A_1)P(A_2\mid A_1)$$

$$P(A_1, \cdots, A_n) = P(A_1)P(A_2\mid A_1)P(A_3\mid A_1, A_2) \cdots P(A_n\mid A_1, \cdots, A_{n-1})$$
Bayes’ Rule:

$$P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}$$
The Law of Total Probability (LOTP):

Let $A_1, \cdots, A_n$ be a partition of the sample space $S$, with $P(A_i)>0$. Then

$$P(B) = \sum_{i=1}^{n} P(B\mid A_i)P(A_i)$$
Inference & Bayes’ Rule (simply Bayes’ rule with LOTP plugged into the denominator):

$$P(A_i\mid B) = \frac{P(A_i)P(B\mid A_i)}{P(A_1)P(B\mid A_1) + \cdots + P(A_n)P(B\mid A_n)}$$
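As a quick numerical illustration of this formula (hypothetical numbers, not from the lecture): a test with 95% sensitivity and 90% specificity, a 1% prevalence, and the partition {infected, not infected}.

```python
# Hypothetical Bayes-with-LOTP example: P(infected | positive test).
# Partition: A1 = infected, A2 = not infected (assumed numbers).
prior = [0.01, 0.99]          # P(A1), P(A2)
likelihood = [0.95, 0.10]     # P(positive | A1), P(positive | A2)

# LOTP: P(positive) = sum_i P(positive | A_i) P(A_i)
p_positive = sum(like * pr for like, pr in zip(likelihood, prior))

# Bayes' rule: P(A1 | positive)
posterior = likelihood[0] * prior[0] / p_positive
print(p_positive)   # ~0.1085
print(posterior)    # ~0.088: even a positive test leaves the probability below 9%
```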
3. Conditional Probabilities are Probabilities

Bayes’ rule with extra conditioning:

$$P(A\mid B, E) = \frac{P(B\mid A, E)P(A\mid E)}{P(B\mid E)}$$

LOTP with extra conditioning:

$$P(B\mid E) = \sum_{i=1}^{n} P(B\mid A_i, E)P(A_i\mid E)$$
4. Independence of Events

$$P(A\cap B) = P(A)P(B)$$

If $P(A)>0$ and $P(B)>0$, this is equivalent to

$$P(A\mid B) = P(A), \qquad P(B\mid A)=P(B)$$
Lectures 3, 4, 5: unfinished

Lecture 6: Joint Distributions

Covariance

Definition:
$$\mathrm{Cov}(X,Y) = E((X-EX)(Y-EY)) = E(XY) - E(X)E(Y)$$
Properties
$$\mathrm{Cov}(X,X) = \mathrm{Var}(X)$$

$$\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$$

$$\mathrm{Cov}(X,c) = 0$$

$$\mathrm{Cov}(aX, Y) = a \cdot \mathrm{Cov}(X,Y)$$

$$\mathrm{Cov}(X+Y,Z) = \mathrm{Cov}(X,Z)+ \mathrm{Cov}(Y,Z)$$

$$\mathrm{Cov}(X+Y, Z+W) = \mathrm{Cov}(X,Z) + \mathrm{Cov}(X,W) + \mathrm{Cov}(Y,Z) + \mathrm{Cov}(Y,W)$$

$$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$$
For $n$ r.v.s $X_1, \cdots, X_n$,

$$\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) + 2 \sum_{i<j}\mathrm{Cov}(X_i, X_j)$$
Correlation

Definition:

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$$
Properties / Theorems:

$X$ and $Y$ are called uncorrelated if $\mathrm{Cov}(X,Y) = 0$ (equivalently, $\mathrm{Corr}(X,Y) = 0$).

$$\mathrm{Independent} \ \Rightarrow\ \mathrm{Uncorrelated} \quad \text{(the converse does not hold in general)}$$

$$-1 \leq \mathrm{Corr}(X,Y) \leq 1$$
Multinomial Distribution

Multinomial Joint PMF: if $X\sim Mult_k(n, p)$, then the joint PMF is

$$P(X_1=n_1, \cdots, X_k=n_k) = \frac{n!}{n_1!n_2!\cdots n_k!}p_1^{n_1}\cdots p_k^{n_k}$$

for $n_1 + \cdots + n_k = n$.
Multinomial Marginal:

$$X\sim Mult_k(n,p) \ \Rightarrow\ X_j \sim Bin(n, p_j)$$
Multivariate Normal Distribution (MVN)

Definition: a random vector $X=(X_1, \cdots, X_k)$ is said to have an MVN distribution if every linear combination of the $X_j$ has a normal distribution.

That is, $t_1 X_1 + \cdots + t_k X_k$ has a normal distribution for any choice of constants $t_1, \cdots, t_k$.

When $k=2$, this is the Bivariate Normal (BVN) distribution.

Theorem: if $(X_1, X_2, X_3)$ is MVN, then $(X_1, X_2)$ is MVN.
Theorem :
…
1. Change of variables (given the PDF of $X$, find the PDF of $g(X)$)

Let $X$ be a continuous r.v. with PDF $f_X$, and let $Y=g(X)$, where $g$ is differentiable and strictly monotone (increasing or decreasing). Then the PDF of $Y$ is

$$f_Y(y) = f_X(x) \left\lvert \frac{\mathrm{d}x}{\mathrm{d}y} \right\rvert$$

where $x = g^{-1}(y)$.
Jacobian determinant

Let $X=(X_1, \cdots, X_n)$ be a continuous random vector with joint PDF $f_X(x)$, and let $Y=g(X)$, where $g$ is an invertible function, $y=g(x)$, and the partial derivatives $\frac{\partial x_i}{\partial y_j}$ exist. The Jacobian matrix is

$$\frac{\partial x}{\partial y} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n}\\ \vdots & \vdots & & \vdots\\ \frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n} \end{pmatrix}$$

Then the joint PDF of $Y$ is

$$f_Y(y) = f_X(x) \left\lvert \frac{\partial x}{\partial y} \right\rvert$$

where $\lvert \cdot \rvert$ denotes the absolute value of the determinant.
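A minimal simulation sketch of the one-variable case above, under the assumption $X\sim Expo(1)$ and $Y=g(X)=X^2$ (so $f_Y(y) = f_X(\sqrt{y})\cdot\frac{1}{2\sqrt{y}} = \frac{e^{-\sqrt{y}}}{2\sqrt{y}}$): a histogram of simulated $Y$ values should track this density.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)   # X ~ Expo(1)
y = x ** 2                                     # Y = g(X) = X^2, strictly increasing on x > 0

# Change of variables: f_Y(y) = f_X(sqrt(y)) * |dx/dy| = exp(-sqrt(y)) / (2 sqrt(y))
grid = np.array([0.25, 0.5, 1.0, 2.0, 3.0])
f_theory = np.exp(-np.sqrt(grid)) / (2 * np.sqrt(grid))

# Empirical density from the simulated sample
hist, edges = np.histogram(y, bins=400, range=(0, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
f_empirical = np.interp(grid, centers, hist)

print(np.round(f_theory, 3))     # theoretical density at a few points
print(np.round(f_empirical, 3))  # empirical values should be close to the theory
```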
2. Convolutions

Convolution sums and integrals (the distribution of a sum of two random variables).

Theorem: if $X, Y$ are independent and discrete, the PMF of $T=X+Y$ is

$$P(T=t) = \sum_x P(Y=t-x)P(X=x) = \sum_y P(X=t-y)P(Y=y)$$

Theorem: if $X, Y$ are independent and continuous, the PDF of $T=X+Y$ is

$$f_T(t) = \int_{-\infty}^{\infty} f_Y(t-x)f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(t-y)f_Y(y)\,dy$$
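A small sketch of the discrete case, assuming independent $X\sim Bin(3, 0.5)$ and $Y\sim Bin(4, 0.5)$: the convolution of their PMFs should equal the $Bin(7, 0.5)$ PMF.

```python
import numpy as np
from scipy.stats import binom

# Independent X ~ Bin(3, 0.5) and Y ~ Bin(4, 0.5); T = X + Y should be Bin(7, 0.5).
pmf_x = binom.pmf(np.arange(4), 3, 0.5)
pmf_y = binom.pmf(np.arange(5), 4, 0.5)

# P(T = t) = sum_x P(X = x) P(Y = t - x) is exactly a discrete convolution.
pmf_t = np.convolve(pmf_x, pmf_y)

print(np.allclose(pmf_t, binom.pmf(np.arange(8), 7, 0.5)))   # True
```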
3. Order statistics

CDF & PDF of order statistics: let $X_1, \cdots, X_n$ be i.i.d. continuous r.v.s with CDF $F$ and PDF $f$. Then the CDF and PDF of $X_{(j)}$ are

$$P(X_{(j)}\leq x) = \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1-F(x))^{n-k}$$

$$f_{X_{(j)}}(x) = n \binom{n-1}{j-1} f(x) F(x)^{j-1} (1-F(x))^{n-j}$$

Joint PDF of the order statistics (for $x_1 < x_2 < \cdots < x_n$):

$$f_{X_{(1)}, X_{(2)},\cdots, X_{(n)}}(x_1, \cdots, x_n) = n! \prod_{i=1}^{n} f(x_i)$$
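As a quick check of the marginal PDF formula: for i.i.d. $Unif(0,1)$ it reduces to the $Beta(j, n-j+1)$ density, whose mean is $j/(n+1)$. A simulation sketch with assumed parameters $n=10$, $j=3$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, j = 10, 3                                   # 3rd smallest of 10 i.i.d. Unif(0,1)
samples = rng.uniform(size=(100_000, n))
x_j = np.sort(samples, axis=1)[:, j - 1]       # j-th order statistic of each row

# For uniforms, f_{X_(j)}(x) = n * C(n-1, j-1) * x^{j-1} (1-x)^{n-j}, i.e. Beta(j, n-j+1),
# whose mean is j / (n + 1).
print(x_j.mean())        # ~0.2727
print(j / (n + 1))       # 0.2727...
```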
Related identity (linking the discrete and continuous cases)

Theorem: for $0<p<1$ and nonnegative integers $k < n$,

$$\sum_{j=0}^{k} \binom{n}{j} p^j (1-p)^{n-j} = \frac{n!}{k! (n-k-1)!}\int_p^1 x^k (1-x)^{n-k-1}\,dx$$
4. Beta distribution

$X\sim Beta(a,b)$ with $a>0$, $b>0$. PDF (for $0<x<1$):

$$f(x) = \frac{1}{\beta(a,b)} x^{a-1} (1-x)^{b-1}$$

$$\beta(a,b) = \int_{0}^{1} x^{a-1} (1-x)^{b-1}\,dx$$

$$\beta(a,b) = \frac{(a-1)!(b-1)!}{(a+b-1)!} = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

(the factorial form holds when $a$ and $b$ are positive integers; the Gamma form holds in general)
Story: Bayes’ billiards

$$\int_0^1 \binom{n}{k} x^k (1-x)^{n-k}\,dx = \frac{1}{n+1}$$

for any integers $k$ and $n$ with $0\leq k \leq n$.
Story: Beta-binomial conjugacy

Out of $n$ tosses, $k$ land heads; what is the estimator $\hat{p}$?

With prior $p\sim Beta(a,b)$, the posterior is $p \sim Beta(a+k, b+n-k)$, whose mean is $E(p\mid \text{data}) = \frac{a+k}{a+b+n}$.

If the prior is Beta and the data are conditionally Binomial given $p$, then the posterior is also Beta; the Beta distribution is called the conjugate prior of the Binomial.
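A minimal sketch of the conjugate update, assuming a uniform $Beta(1,1)$ prior and $k=7$ heads in $n=10$ tosses; the posterior is $Beta(a+k,\, b+n-k)$ with mean $(a+k)/(a+b+n)$.

```python
from scipy.stats import beta

a, b = 1, 1        # prior Beta(a, b); Beta(1, 1) is the uniform prior (assumed choice)
n, k = 10, 7       # observed data: 7 heads in 10 tosses

posterior = beta(a + k, b + n - k)     # conjugacy: posterior is Beta(a + k, b + n - k)
print(posterior.mean())                # 8/12 ~ 0.667
print((a + k) / (a + b + n))           # same value: posterior mean (a + k)/(a + b + n)
```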
5. Gamma distribution

The gamma function $\Gamma(\cdot)$: for $a>0$,

$$\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$$

Properties:

$\Gamma(a+1) = a\Gamma(a)$ for $a>0$; $\Gamma(n) = (n-1)!$ for $n$ a positive integer.

Gamma distribution: $Y\sim Gamma(a, \lambda)$ with $a>0$, $\lambda>0$. PDF (for $y>0$):

$$f(y) = \frac{1}{\Gamma(a)} (\lambda y)^a e^{-\lambda y} \frac{1}{y}$$
$$Gamma(1, \lambda) = Expo(\lambda)$$

Gamma as a convolution of exponentials:

If $X_1, \cdots, X_n$ are i.i.d. $Expo(\lambda)$, then $X_1 + \cdots + X_n \sim Gamma(n, \lambda)$.

That is, the $Gamma(n,\lambda)$ distribution can be viewed as the convolution (sum) of $n$ i.i.d. exponentials.
Beta-Gamma connection (bank-post office story):

Let $X\sim Gamma(a, \lambda)$ and $Y\sim Gamma(b, \lambda)$ be independent. Then

$$X+Y \sim Gamma(a+b, \lambda), \qquad \frac{X}{X+Y} \sim Beta(a, b)$$

and $X+Y$ and $\frac{X}{X+Y}$ are independent.
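A simulation sketch of this result, assuming $a=2$, $b=3$, $\lambda=1$: the ratio should behave like $Beta(2,3)$ (mean $0.4$) and be essentially uncorrelated with the sum.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, lam = 2.0, 3.0, 1.0
x = rng.gamma(a, 1 / lam, size=200_000)   # X ~ Gamma(a, lambda)
y = rng.gamma(b, 1 / lam, size=200_000)   # Y ~ Gamma(b, lambda), independent of X

total = x + y          # should be Gamma(a + b, lambda), mean (a + b) / lambda = 5
ratio = x / total      # should be Beta(a, b), mean a / (a + b) = 0.4

print(total.mean(), ratio.mean())        # ~5.0 and ~0.4
print(np.corrcoef(total, ratio)[0, 1])   # ~0, consistent with independence
```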
Lecture 8: Bayesian Statistical Inference

Bayesian statistics

General LOTP:

- $X$ discrete, $Y$ discrete: $P(X=x) = \sum_y P(X=x\mid Y=y) P(Y=y)$
- $X$ discrete, $Y$ continuous: $P(X=x) = \int_{-\infty}^{\infty} P(X=x\mid Y=y) f_Y(y)\,dy$
- $X$ continuous, $Y$ discrete: $f_X(x) = \sum_y f_X(x\mid Y=y) P(Y=y)$
- $X$ continuous, $Y$ continuous: $f_X(x) = \int_{-\infty}^{\infty} f_{X\mid Y} (x\mid y) f_Y(y)\,dy$

General Bayes’ Rule:

- $X$ discrete, $Y$ discrete: $P(Y=y\mid X=x) = \frac{P(X=x\mid Y=y)P(Y=y)}{P(X=x)}$
- $X$ discrete, $Y$ continuous: $f_Y(y\mid X=x) = \frac{P(X=x\mid Y=y)f_Y(y)}{P(X=x)}$
- $X$ continuous, $Y$ discrete: $P(Y=y\mid X=x) = \frac{f_X(x\mid Y=y)P(Y=y)}{f_X(x)}$
- $X$ continuous, $Y$ continuous: $f_{Y\mid X} (y\mid x) = \frac{f_{X\mid Y}(x\mid y) f_Y(y)}{f_X(x)}$
MAP (Maximum A Posteriori Probability)

Given the observed value $x$, the MAP rule selects a value $\hat{\theta}$ that maximizes over $\theta$ the posterior distribution $p_{\Theta\mid X} (\theta\mid x)$ or $f_{\Theta\mid X} (\theta\mid x)$.
3. Conditional expectation

Definition (conditional expectation given an event):

$$E(Y\mid A) = \sum_y y\, P(Y=y\mid A) \qquad \text{(discrete)}$$

$$E(Y\mid A) = \int_{-\infty}^{\infty} y\, f(y\mid A)\,dy \qquad \text{(continuous)}$$

Intuition: over many repetitions of the experiment, $E(Y\mid A)$ is approximately the average value of $Y$ over those repetitions in which $A$ occurs.
LOTE (law of total expectation):

$$E(Y) = \sum_{i=1}^{n} E(Y\mid A_i) P(A_i)$$
Definition (conditional expectation given an r.v.):

$$g(x) = E(Y\mid X=x)$$

$E(Y\mid X) = g(X)$ is a function of $X$, and it is also a random variable.
Properties

Dropping what’s independent: if $X$ and $Y$ are independent, then $E(Y\mid X) = E(Y)$.

Taking out what’s known: for any function $h$, $E(h(X)Y\mid X) = h(X) E(Y\mid X)$.

Linearity:

$$E(Y_1 + Y_2\mid X) = E(Y_1\mid X) + E(Y_2 \mid X)$$
Adam’s Law (iterated expectation, the “tower property”):

$$E(E(Y\mid X)) = E(Y)$$

Adam’s Law with extra conditioning:

$$E(E(Y\mid X, Z)\mid Z) = E(Y\mid Z)$$

$$E(E(X\mid Z, Y)\mid Y) = E(X\mid Y)$$
Conditional Variance

$$\mathrm{Var}(Y\mid X) = E[(Y-E(Y\mid X))^2\mid X]$$

$$\mathrm{Var}(Y\mid X) = E(Y^2\mid X) - (E(Y\mid X))^2$$

Eve’s Law / EVVE:

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y\mid X)) + \mathrm{Var}(E(Y\mid X))$$
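A simulation sketch of Adam’s and Eve’s laws on an assumed hierarchical example (not from the lecture): $N\sim Pois(\lambda)$ and, given $N$, $Y\sim Bin(N, p)$, so $E(Y)=\lambda p$ and $\mathrm{Var}(Y) = \lambda p(1-p) + \lambda p^2 = \lambda p$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, p = 4.0, 0.3
n_sims = 500_000

N = rng.poisson(lam, size=n_sims)    # N ~ Pois(lambda)
Y = rng.binomial(N, p)               # Y | N ~ Bin(N, p)

# Adam's law: E(Y) = E(E(Y|N)) = E(N * p) = lambda * p
print(Y.mean(), lam * p)                                  # both ~1.2

# Eve's law: Var(Y) = E(Var(Y|N)) + Var(E(Y|N)) = lam*p*(1-p) + lam*p^2 = lam*p
print(Y.var(), lam * p * (1 - p) + lam * p ** 2)          # both ~1.2
```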
4. Prediction and estimation

Linear Regression: the linear regression model uses a single explanatory variable $X$ to predict a response variable $Y$, and it assumes that the conditional expectation of $Y$ is linear in $X$: $E(Y\mid X) = a+bX$.

An equivalent way to express this is to write $Y=a+bX+\epsilon$, where $\epsilon = Y - E(Y\mid X)$ satisfies $E(\epsilon\mid X)=0$.

$$\begin{cases} a = E(Y) - bE(X) = E(Y) - \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\cdot E(X) \\ b = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} \end{cases}$$
LLSE / Linear Least Squares Estimate

The LLSE of $Y$ given $X$, denoted by $L[Y\mid X]$, is the linear function $a+bX$ that minimizes $E[(Y-a-bX)^2]$. In fact,

$$L[Y\mid X] = E(Y) + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} (X-E(X))$$
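A minimal sketch computing $L[Y\mid X]$ from simulated data, under the assumed model $Y = 2 + 3X + \text{noise}$: the slope $\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ and intercept $E(Y)-bE(X)$ should recover roughly 3 and 2.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=10_000)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=10_000)   # assumed linear model

b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # b = Cov(X, Y) / Var(X)
a = y.mean() - b * x.mean()                  # a = E(Y) - b * E(X)
print(a, b)                                  # close to 2 and 3

llse = a + b * x                             # L[Y|X] evaluated at the sample points
print(np.mean((y - llse) * x))               # ~0: the residual is orthogonal to X
```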
MMSE / Minimum Mean Square Error Estimator

The MMSE estimator of $Y$ given $X$ is

$$g(X) = E(Y\mid X)$$
Projection Interpretation / Geometric perspective: the estimation error $Y-E(Y\mid X)$ is orthogonal to every function $h(X)$ of the data:

$$E\big((Y-E(Y\mid X)) \cdot h(X)\big) = 0$$

Orthogonality Property of MMSE

Theorem:

(a) For any function $\phi(\cdot)$, $E\big((Y-E(Y\mid X)) \cdot \phi(X)\big) = 0$.

(b) Conversely, if a function $g(X)$ is such that $E\big((Y-g(X)) \cdot \phi(X)\big) = 0$ for any $\phi$, then $g(X) = E(Y\mid X)$.
MMSE for jointly Gaussian random variables

Theorem: let $X$, $Y$ be jointly Gaussian random variables. Then

$$E[Y\mid X] = L[Y\mid X] = E(Y) + \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} (X-E(X))$$
Lecture 9: Classical Statistical Inference

1. Inference Rule: MLE / Maximum Likelihood Estimation

MLE: the MLE is the value of $\theta$ that maximizes the joint probability (or density) of the observed data:

$$\hat{\theta}_n = \arg\max_{\theta} P_X(x_1, \cdots, x_n; \theta)$$

MLE under the independent case (independence makes the MLE easier to compute): the observations $X_i$ are independent, and we observe $x = (x_1, \cdots, x_n)$. The log-likelihood function is

$$\log\left[P_X(x_1, \cdots, x_n;\theta)\right] = \log \prod_{i=1}^{n} P_{X_i} (x_i;\theta) = \sum_{i=1}^{n} \log\left[ P_{X_i}(x_i;\theta) \right]$$

$$\log\left[f_X(x_1, \cdots, x_n;\theta)\right] = \log \prod_{i=1}^{n} f_{X_i} (x_i;\theta) = \sum_{i=1}^{n} \log\left[ f_{X_i}(x_i;\theta) \right]$$

MLE under the independent case:

$$\hat{\theta}_n = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[P_{X_i}(x_i;\theta) \right]$$

$$\hat{\theta}_n = \arg\max_{\theta} \sum_{i=1}^{n} \log\left[f_{X_i}(x_i;\theta) \right]$$
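A minimal MLE sketch for the independent case, assuming i.i.d. $Expo(\lambda)$ data: the log-likelihood is $n\log\lambda - \lambda\sum_i x_i$, maximized at $\hat{\lambda} = 1/\bar{x}$; the numeric maximization agrees with the closed form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
true_lam = 2.0
x = rng.exponential(scale=1 / true_lam, size=5_000)   # i.i.d. Expo(lambda) data

def neg_log_likelihood(lam):
    # -sum_i log f(x_i; lambda), with f(x; lambda) = lambda * exp(-lambda * x)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(res.x)         # numeric MLE, close to 2
print(1 / x.mean())  # closed-form MLE: 1 / sample mean; the two agree
```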
3. Central Limit Theorem

Central Limit Theorem:

$$\sqrt{n} \left(\frac{\overline{X_n} - \mu}{\sigma} \right) \to \mathcal{N}(0,1)$$
in distribution. In words, the CDF of the left-hand side approaches the CDF of the standard normal distribution.
CLT approximation

For large $n$, the distribution of $\overline{X_n}$ is approximately $\mathcal{N}(\mu,\sigma^2/n)$. For large $n$, the distribution of $n\overline{X_n}$ is approximately $\mathcal{N}(n\mu,n\sigma^2)$. (In other words, when $n$ is large, $\overline{X_n}$ can be approximated by a normal distribution regardless of the original distribution of the $X_i$.)

Poisson convergence to normal
Let $Y\sim Pois(n)$. We can consider it to be a sum of $n$ i.i.d. $Pois(1)$ r.v.s. Therefore, for large $n$, $Y$ is approximately $\mathcal{N}(n,n)$.
Gamma convergence to normal

Let $Y\sim Gamma(n, \lambda)$. We can consider it to be a sum of $n$ i.i.d. $Expo(\lambda)$ r.v.s. Therefore, for large $n$, $Y$ is approximately $\mathcal{N}\left(\frac{n}{\lambda}, \frac{n}{\lambda^2}\right)$.
Binomial convergence to normal

Let $Y\sim Bin(n,p)$. We can consider it to be a sum of $n$ i.i.d. $Bern(p)$ r.v.s. Therefore, for large $n$, $Y$ is approximately $\mathcal{N}(np, np(1-p))$.
Continuity Correction: De Moivre-Laplace Approximation

$$P(Y=k) = P\left(k-\tfrac{1}{2} < Y < k+\tfrac{1}{2}\right) \approx \Phi\left(\frac{k + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right)$$

$$P(k\leq Y\leq l) = P\left(k-\tfrac{1}{2} < Y < l+\tfrac{1}{2}\right) \approx \Phi\left(\frac{l + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right)$$

where $\Phi$ is the standard normal CDF.
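A quick numerical sketch comparing the exact binomial PMF with the approximation above, assuming $n=100$, $p=0.3$, $k=30$.

```python
import numpy as np
from scipy.stats import binom, norm

n, p, k = 100, 0.3, 30
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.pmf(k, n, p)
# Continuity correction: P(Y = k) ~ Phi((k + 1/2 - np)/sd) - Phi((k - 1/2 - np)/sd)
approx = norm.cdf((k + 0.5 - mu) / sd) - norm.cdf((k - 0.5 - mu) / sd)

print(exact, approx)   # both ~0.087
```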
4. Confidence Interval

Lecture 10: Monte Carlo Statistical Methods

2. Law of large numbers

Recall: Sample Mean. Let $X_1, \cdots, X_n$ be i.i.d. r.v.s with mean $\mu$ and variance $\sigma^2$. The sample mean $\overline{X_n} = \frac{1}{n}\sum_{j=1}^n X_j$ is itself an r.v., with mean $\mu$ and variance $\sigma^2 / n$.

The sample mean is itself a random variable; as $n$ tends to infinity, its variance tends to zero.
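A small simulation sketch of the two facts above, assuming i.i.d. $Expo(1)$ samples ($\mu = 1$, $\sigma^2 = 1$): the sample mean concentrates around $\mu$ and its variance across repetitions shrinks like $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2 = 1.0, 1.0          # Expo(1) has mean 1 and variance 1

for n in (10, 100, 1000):
    # 2000 independent repetitions of "average n i.i.d. Expo(1) draws"
    means = rng.exponential(scale=1.0, size=(2000, n)).mean(axis=1)
    print(n, means.mean(), means.var(), sigma2 / n)   # mean ~1, variance ~1/n
```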
Strong Law of Large Numbers (SLLN): the sample mean $\overline{X_n}$ converges to the true mean $\mu$ as $n\to\infty$ with probability $1$. In other words, the event $\overline{X_n}\to\mu$ has probability $1$.

Weak Law of Large Numbers (WLLN): for all $\epsilon > 0$, $P(\lvert \overline{X_n} - \mu \rvert > \epsilon) \to 0$ as $n\to\infty$.

3. Non-asymptotic Analysis: Inequalities

Cauchy-Schwarz Inequality: for any r.v.s $X,Y$ with finite variances,

$$\lvert E(XY)\rvert \leq \sqrt{E(X^2)E(Y^2)}$$
Second Moment Method:

$$P(X=0)\leq \frac{\mathrm{Var}(X)}{E(X^2)}$$
Jensen’s Inequality

If $f$ is a convex function (nonnegative second derivative) and $0\leq \lambda_1, \lambda_2 \leq 1$ with $\lambda_1 + \lambda_2 = 1$, then for any $x_1, x_2$,

$$f(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 f(x_1) + \lambda_2 f(x_2)$$

Let $X$ be an r.v. If $g$ is a convex function, then $E(g(X))\geq g(E(X))$. If $g$ is a concave function, then $E(g(X))\leq g(E(X))$. Equality holds if and only if $g(X) = a+bX$ for some constants $a, b$.

Entropy

$X$ is a discrete r.v. taking $n$ values with probabilities $p_1, \cdots, p_n$. The entropy of $X$ is

$$H(X) = \sum_{j=1}^{n} p_j \log_2 \left(\frac{1}{p_j}\right)$$

Using Jensen’s inequality, one can show that the entropy is maximized when $X$ is uniform.
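A small sketch checking this on a few distributions with four outcomes, using base-2 logs as in the definition above: the uniform distribution attains the maximum $\log_2 4 = 2$.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log2(1/0) as 0
    return float(np.sum(p * np.log2(1 / p)))

uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]
point   = [1.00, 0.00, 0.00, 0.00]

print(entropy(uniform))   # 2.0 = log2(4), the maximum over 4 outcomes
print(entropy(skewed))    # ~1.357, strictly smaller
print(entropy(point))     # 0.0, the minimum
```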