Exploration of the Exponential Family of Distributions

While reading about the exponential family of distributions in a paper, I decided to organize my thoughts here. These notes may not be entirely accurate, and I hope experts will not hesitate to offer corrections.

The Exponential Family, also known as the exponential class or exponential distributions, is one of the most important parameterized families of distributions in statistics.

When studying the exponential family, it should be distinguished from the Exponential Distribution. The two are not the same.

The term "family" in English refers to a group with similar characteristics. The exponential family is a set of distributions whose probability density functions and probability distribution functions change with the variation of distribution parameters.

Common examples of the exponential family include:
Normal distribution, Chi-squared distribution, Binomial distribution, Multinomial distribution, Poisson distribution, Pascal distribution, Beta ($\beta$) distribution, Gamma ($\Gamma$) distribution, Log-normal distribution, etc. For specifics, see the Wikipedia entry and the Zhihu column.

Exponential Family#

The probability density function of the exponential family has the following form:

$$f_\mathbf{X}(x;\theta) = h(x)\exp\{\langle\eta(\theta), T(x)\rangle-A(\theta)\}$$

where $\theta$ is the parameter, and all $\theta$ satisfying the above form the parameter space $\Theta$; the corresponding parametric family of distributions $\{P_\theta:\theta\in\Theta\}$ is the exponential family. Note that the parameter $\theta$ here is not restricted to a scalar real number; it can also be an $n$-dimensional vector $\theta\in \mathbb{R}^n$.

As the parameter $\theta$ changes, the shape of the distribution $X$ (its probability density function, probability distribution function, and the corresponding graphs) will also change. The random variable $x$ follows the distribution $X$. The functions $T(x), h(x), \eta(\theta), A(\theta)$ are all known functions, and $h(x)$ is a non-negative function.

$h(x)$ is commonly referred to as the base measure.

$T(x)$ is the sufficient statistic.

$A(\theta)$ is the cumulant generating function, or the log-partition function (the logarithm of the partition function). Clearly, $A(\theta)$ is a real-valued function: it returns a real number.

Here, $\eta(\theta)$ and $T(x)$ can be real numbers or vectors.

From the definition it can be seen that, by the properties of the exponential function, $\exp\{\cdot\} = e^{\{\cdot\}} > 0$ is strictly positive. Thus, the support of an exponential-family density depends only on $h(x)$, which means it depends only on $x$ and is independent of the unknown parameter $\theta$. We can use this to rule out distributions that are not exponential families (such as uniform distributions).

A brief note on the concept of the support set is needed here. Simply put, for a real-valued function $f$, the support of $f$ is defined as:

$$\text{supp}(f)=\{x\in X:f(x)\neq 0\}$$

The support is a subset of the original domain $X$ of the function $f$. For more information, refer to the Wikipedia entry or the CSDN blog. For a probability density function, since probabilities are non-negative, the support of the random variable can be defined as (see the Zhihu column):

$$\text{supp}(X) =\{x\in \mathbb{R} : f_X(x)> 0\}$$
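
For instance (a small example added here for concreteness), consider the uniform distribution $U(0,\theta)$ with density $f(x;\theta)=\frac1\theta\,\mathbf 1_{[0,\theta]}(x)$, so that

$$\text{supp}(X)=[0,\theta]$$

The support depends on the unknown parameter $\theta$, so the density cannot be written with a parameter-free factor $h(x)$ carrying the support; this is why the uniform distribution is not a member of the exponential family.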

Several Equivalent Forms#

Based on the rules of operations with exponentials, two equivalent forms of the exponential family are given through equivalent transformations:

$$f_\mathbf{X}(x;\theta) = h(x)g(\theta)\exp\{\langle\eta(\theta), T(x)\rangle\}$$

$$f_\mathbf{X}(x;\theta) = \exp\{\langle\eta(\theta), T(x)\rangle-A(\theta)+B(x)\}$$

The corresponding substitution relationships are: $-A(\theta) = \ln g(\theta)$, $B(x)=\ln h(x)$.

In particular, if we take $Z(\theta) = \dfrac{1}{g(\theta)}$, we obtain another very common expression of the exponential family, shown below. Here, $Z(\theta)$ is the partition function of the distribution.

$$f_\mathbf{X}(x;\theta) = \frac{1}{Z(\theta)}h(x)\exp\{\langle\eta(\theta), T(x)\rangle\}$$

Canonical Form#

In the above definition, $\eta(\theta)$ is a function of the parameter $\theta$. For the exponential family, we require $\eta(\cdot)$ to be a bijective function (i.e., a one-to-one correspondence). A bijection guarantees the existence of an inverse function; here $\eta$ is moreover taken to be monotonic and differentiable.

Using the properties of bijective functions, we can simplify the form of the exponential family. Let $\hat\theta = \eta(\theta)$; this transformation is reversible: $\theta = \eta^{-1}(\hat\theta)$. Thus, we obtain:

$$f_\mathbf{X}(x;\hat\theta) = h(x)\exp\{\langle\hat\theta, T(x)\rangle-A^\prime(\hat\theta)\}$$

(here $A^\prime$ denotes the re-expressed log-partition function, not a derivative).

By equivalently replacing symbols, we obtain the Canonical Form of the exponential family as follows:

$$f_\mathbf{X}(x;\theta) = h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}$$

We usually call this updated parameter $\theta$ the canonical parameter of the exponential family.
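
To make the canonical form concrete, here is a minimal Python sketch (my own illustration, not from the referenced materials) that evaluates a density of the form $h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}$, using the exponential distribution $f(x;\lambda)=\lambda e^{-\lambda x}$ ($x\geq 0$) as the example, with $\theta=-\lambda$, $T(x)=x$, $A(\theta)=-\log(-\theta)$, and $h(x)=1$:

```python
import numpy as np

def expfam_density(x, theta, h, T, A):
    """Evaluate h(x) * exp(<theta, T(x)> - A(theta)) for a canonical exponential family."""
    inner = np.dot(np.atleast_1d(theta), np.atleast_1d(T(x)))  # <theta, T(x)>
    return h(x) * np.exp(inner - A(theta))

# Exponential distribution with rate lam: theta = -lam, T(x) = x, A(theta) = -log(-theta), h(x) = 1
lam = 1.5
theta = -lam
A = lambda t: -np.log(-t)

for x in (0.1, 1.0, 2.0):
    direct = lam * np.exp(-lam * x)                              # lam * e^(-lam * x)
    via_expfam = expfam_density(x, theta, h=lambda x: 1.0, T=lambda x: x, A=A)
    print(x, direct, via_expfam)                                 # the two values coincide
```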

Natural Form#

Although there are different definitions, it is generally believed that the Natural Form of the exponential family is equivalent, or nearly equivalent, to the canonical form. For example, see the Stanford University materials, UC Berkeley materials, MIT course materials, a blog, Zhihu column 1 and Zhihu column 2.

Wikipedia provides another understanding, which will not be introduced here.

Natural Parameter Space#

Before introducing the Natural Parameter Space, we first introduce the log-partition function $A(\theta)$:

$$A(\theta) = \log\left(\int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x\right)$$

The partition function can be understood as a special form of normalization constant.

The log-partition function $A(\theta)$ ensures that $f_\mathbf{X}(x;\theta)$ is normalized, i.e., it guarantees that $f_\mathbf{X}(x;\theta)$ is a valid probability density function. To see this normalization, recall the expression from the previous section, Several Equivalent Forms:

$$f_\mathbf{X}(x;\theta) = \frac{1}{Z(\theta)}h(x)\exp\{\langle\eta(\theta), T(x)\rangle\}$$

where $Z(\theta) = \int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x$ is independent of $x$. Integrating both sides over $x$ gives:

$$\int_X f_\mathbf{X}(x;\theta)~{\rm d}x = \frac{1}{Z(\theta)}\int_X h(x)\exp\{\langle\eta(\theta), T(x)\rangle\}~{\rm d}x = 1$$

The so-called natural parameter space is the set of parameters $\theta$ for which the partition function is finite ($< \infty$), that is:

$$\mathcal N = \left\{\theta:\int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x < \infty\right\} = \left\{\theta:Z(\theta) < \infty\right\}$$
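
As a small illustration (my own example): take $h(x)=\mathbf 1_{[0,\infty)}(x)$ and $T(x)=x$, i.e. the family containing the exponential distribution. Then

$$Z(\theta) = \int_0^\infty e^{\theta x}~{\rm d}x = \begin{cases}-\dfrac{1}{\theta}, & \theta<0\\ \infty, & \theta\geq 0\end{cases} \qquad\Longrightarrow\qquad \mathcal N=\{\theta:\theta<0\}$$

so the natural parameter space can be a strict subset of $\mathbb R$.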

The natural parameter space has some special properties. First, the natural parameter space $\mathcal N$ is a convex set, and the log-partition function $A(\theta)$ is a convex function. The proof is as follows:

Consider two different parameters $\theta_1\in\mathcal N,~\theta_2\in\mathcal N$ and a given $0<\lambda<1$; we prove that $\theta=\lambda\theta_1+(1-\lambda)\theta_2$ also lies in the natural parameter space $\mathcal N$ (i.e., that $\theta\in\mathcal N$ holds).

$$\begin{aligned} Z(\theta) &= \exp\{A(\theta)\} = \exp\{A(\lambda\theta_1+(1-\lambda)\theta_2)\}\\ &=\int_X h(x)\exp\{\langle(\lambda\theta_1+(1-\lambda)\theta_2), T(x)\rangle\}~{\rm d}x \\ & = \int_X \left(h(x)^{\lambda}\exp\{\langle\lambda\theta_1, T(x)\rangle \}\right)\left(h(x)^{1-\lambda}\exp\{\langle(1-\lambda)\theta_2, T(x)\rangle\}\right)~{\rm d}x \\ &\leq \left(\int_X h(x)\exp\{\frac1\lambda\langle\lambda\theta_1, T(x)\rangle\} ~{\rm d}x \right)^\lambda \left(\int_X h(x)\exp\{\frac1{1-\lambda}\langle(1-\lambda)\theta_2, T(x)\rangle\} ~{\rm d}x \right)^{1-\lambda} \\ &=Z(\theta_1)^\lambda \cdot Z(\theta_2)^{1-\lambda} \end{aligned}$$

The $\leq$ in the above comes from Hölder's inequality, which can be referenced on Wolfram MathWorld or the Zhihu column. It is worth mentioning that the famous mathematical software Mathematica was developed by Wolfram Research.

Since $\theta_1,\theta_2\in \mathcal N$, it follows that $Z(\theta_1),Z(\theta_2)<\infty$. Therefore, $Z(\theta) \leq Z(\theta_1)^\lambda \cdot Z(\theta_2)^{1-\lambda} < \infty$ also holds, and by definition $\theta\in\mathcal N$. This proves that the natural parameter space $\mathcal N$ is a convex set.

Taking the logarithm of the above expression gives:

$$A(\theta) = A(\lambda\theta_1+(1-\lambda)\theta_2) \leq \lambda A(\theta_1) + (1-\lambda)A(\theta_2)$$

This proves that the log-partition function $A(\theta)$ is a convex function. When $\theta_1\neq\theta_2$, Hölder's inequality cannot achieve equality (unless $\langle\theta_1-\theta_2, T(x)\rangle$ is almost surely constant), and then $A(\theta)$ is a strictly convex function.

For definitions of convex sets and convex functions, refer to the convex optimization tutorial (Zhihu column) or the classic textbook on convex optimization, the cvxbook by Stephen Boyd.
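
As a quick numerical illustration of this convexity (a sketch of my own, using the Poisson family, whose log-partition function $A(\theta)=e^\theta$ is derived in the Poisson example later in this post):

```python
import numpy as np

# Convexity check for the Poisson log-partition function A(theta) = exp(theta).
A = np.exp

rng = np.random.default_rng(0)
for _ in range(5):
    theta1, theta2 = rng.normal(size=2)            # two natural parameters
    lam = rng.uniform()                            # mixing weight in [0, 1)
    lhs = A(lam * theta1 + (1 - lam) * theta2)     # A(convex combination)
    rhs = lam * A(theta1) + (1 - lam) * A(theta2)  # convex combination of A values
    print(lhs <= rhs)                              # convexity: prints True every time
```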

Examples of Exponential Family#

Recalling the canonical form of the exponential family, we will prove that several common distributions belong to the exponential family.

$$f_\mathbf{X}(x;\theta) = h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}$$

Bernoulli Distribution (Two-point Distribution)#

The probability mass function of the Bernoulli distribution (which is discrete, hence a probability mass function) is:

$$p(x;\lambda) = \lambda^x\cdot (1-\lambda)^{(1-x)}$$

where $\lambda$ is the parameter of the Bernoulli distribution (the probability of the event occurring), and $x=0$ (event does not occur) or $x=1$ (event occurs); no other values of $x$ are possible.

We rewrite the expression:

$$\begin{aligned} p(x;\lambda) &= \lambda^x\cdot (1-\lambda)^{(1-x)}\\ &\color{red}=\exp\left\{\log\left(\frac{\lambda}{1-\lambda}\right)x+\log(1-\lambda) \right\} \end{aligned}$$

We take

$$\theta = \log\frac{\lambda}{1-\lambda}, \quad T(x)=x,\quad A(\theta) = -\log(1-\lambda) = \log(1+e^\theta),\quad h(x) = 1$$

Thus, it can be proven that the Bernoulli distribution belongs to the single-parameter exponential family.
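
A quick numerical sanity check of this identification (a sketch of my own, not part of the original derivation):

```python
import numpy as np

lam = 0.3                                  # Bernoulli parameter (probability of the event)
theta = np.log(lam / (1 - lam))            # canonical parameter theta = log(lam / (1 - lam))
A = np.log1p(np.exp(theta))                # A(theta) = log(1 + e^theta) = -log(1 - lam)

for x in (0, 1):
    pmf = lam**x * (1 - lam)**(1 - x)      # original probability mass function
    expfam = 1.0 * np.exp(theta * x - A)   # h(x) * exp(theta * T(x) - A(theta)), h = 1, T(x) = x
    print(x, pmf, expfam)                  # the two values agree
```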

Poisson Distribution#

The probability mass function of the Poisson distribution is as follows:

$$\begin{aligned} p(x;\lambda) &= \frac{\lambda^x e^{-\lambda}}{x!} \\ &\color{red} = \frac{1}{x!}\exp\{x\log\lambda-\lambda\} \end{aligned}$$

Taking

$$\theta = \log\lambda,\quad T(x) = x,\quad A(\theta)=\lambda=e^\theta,\quad h(x)=\frac{1}{x!}$$

It can be proven that the Poisson distribution belongs to the single-parameter exponential family.
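
Again, a small numerical check of this identification (my own illustrative sketch):

```python
import math

lam = 2.5                                              # Poisson rate
theta = math.log(lam)                                  # natural parameter
A = math.exp(theta)                                    # A(theta) = e^theta = lam

for x in range(5):
    pmf = lam**x * math.exp(-lam) / math.factorial(x)      # original pmf
    expfam = math.exp(theta * x - A) / math.factorial(x)   # h(x) * exp(theta*x - A(theta)), h(x) = 1/x!
    print(x, round(pmf, 6), round(expfam, 6))              # identical values
```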

Gaussian Distribution (Normal Distribution)#

The probability density function of the Gaussian distribution is as follows:

$$\begin{aligned} p(x;\mu,\sigma^2) &= \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}\\ & \color{red}=\frac{1}{\sqrt{2\pi}}\exp\left\{ \frac{\mu}{\sigma^2}x-\frac{1}{2\sigma^2}x^2-\frac{1}{2\sigma^2}\mu^2-\log\sigma \right\} \end{aligned}$$

Taking

$$\theta = \begin{bmatrix}\mu / \sigma^2 \\ -\dfrac{1}{2\sigma^2}\end{bmatrix},\quad T(x) = \begin{bmatrix} x \\ x^2\end{bmatrix},\quad A(\theta) = \frac{\mu^2}{2\sigma^2}+\log\sigma=-\frac{\theta_1^2}{4\theta_2}-\frac12\log(-2\theta_2),\quad h(x)=\frac{1}{\sqrt{2\pi}}$$

It can be proven that the Gaussian distribution belongs to the multi-parameter exponential family.
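
A numerical check for this two-parameter (vector) case (again a sketch of my own):

```python
import numpy as np

mu, sigma = 1.0, 2.0
theta = np.array([mu / sigma**2, -1 / (2 * sigma**2)])   # natural parameter vector
A = mu**2 / (2 * sigma**2) + np.log(sigma)               # log-partition function A(theta)
h = 1 / np.sqrt(2 * np.pi)                               # base measure h(x)

for x in (-1.0, 0.0, 2.5):
    T = np.array([x, x**2])                              # sufficient statistic T(x) = (x, x^2)
    pdf = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    expfam = h * np.exp(theta @ T - A)                   # h(x) * exp(<theta, T(x)> - A(theta))
    print(x, pdf, expfam)                                # the two densities coincide
```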

Properties of the Exponential Family#

Sufficient Statistic#

For understanding sufficient statistics, in addition to this article, you can refer to Zhihu column or blog. These materials will also greatly aid in understanding the content. The notes in this article are partly derived from these materials.

Let $X_1,\cdots,X_n$ be a set of samples from $X$. Before observation, the samples $X_1,\cdots,X_n$ are random variables; after observation, they are specific values.

From the perspective of mathematical statistics, we hope to infer the original distribution from the samples. A sufficient statistic, like any statistic defined on the sample space, is a measurable function, denoted $T(X_1,\cdots, X_n)$ and often written $T(X)$. As a statistic, it reduces the information contained in the original random variables.

For example, when calculating the sample mean, the order of the sample values is information we do not care about.

For a set of samples, there exists a joint probability density function, denoted $f(x)$. If this distribution has no parameters (or the parameters are known), then this function essentially characterizes all the information contained in the samples.

If the joint probability density function has an unknown parameter $\theta$, it is denoted $f(x;\theta)$ or $f_\theta(x)$. Given the value of the statistic $T$ as $T=t$, if the corresponding conditional distribution $F_\theta(X|T=t)$ does not depend on the unknown parameter $\theta$ (i.e., it is a fully determined distribution), then the statistic $T$ is a sufficient statistic.

A sufficient statistic retains all useful information about the parameter $\theta$ and eliminates useless information.

Building on sufficient statistics, we further introduce the minimal sufficient statistic. Intuitively, we prefer sufficient statistics whose form is as simple as possible, and this leads to the definition of the minimal sufficient statistic.

If $T^\star = T^\star(X)$ is a sufficient statistic, and for any sufficient statistic $T=T(X)$ there exists a measurable function $\varphi$ such that $T^\star = \varphi(T)$, then $T^\star$ is a minimal sufficient statistic.

The logic of this definition is that $T^\star$ can be computed from any sufficient statistic $T$, so it compresses the sample at least as much as every other sufficient statistic while remaining sufficient.

Derivatives and Expectations#

When learning about expectations, we know that calculating an expectation involves computing an integral. However, the special properties of the exponential family can link expectations with derivatives. Since differentiation is generally simpler than integration, we prefer derivatives.

Taking the first derivative of the cumulant generating function $A(\theta)$ gives the expectation of the sufficient statistic $T$.

$$\begin{align*} \frac{\partial A(\theta)}{\partial \theta^T} &= \frac{\partial}{\partial \theta^T}\left\{ \log \int_Xh(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x\right\} \\ &= \frac{\int_X T(x)h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x}{\int_Xh(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x} \\ &=\frac{1}{Z(\theta)} \int_X T(x)h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x\\ &=\int_X T(x)h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}~{\rm d}x \\ &=\int_X T(x)f_\mathbf{X}(x;\theta)~{\rm d}x \\ &=\mathbb E[T(X)] \end{align*}$$

There are several points to note in this formula:

  1. Why differentiate with respect to $\theta^T$? This can be simply understood as ensuring that, when applying the chain rule, the derivative of $\langle\theta,T(x)\rangle$ yields $T(x)$ rather than $T(x)^T$.
  2. Why can the derivative and integral symbols be interchanged? Because the conditions of the Dominated Convergence Theorem are satisfied.
  3. Why does the expression include an additional $A(\theta)$? The denominator can be factored out as a partition function $Z(\theta)$, which is independent of the integration variable $x$, allowing us to move it inside the exponential according to the rules of exponentials. This step can refer to the previous section Several Equivalent Forms.
  4. How does the last step become an expectation? Because $f_\mathbf{X}(x;\theta)$ is a probability distribution.
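
As a concrete check (my own example), take the Poisson family derived above, where $\theta=\log\lambda$, $T(x)=x$, and $A(\theta)=e^\theta$:

$$\frac{\partial A(\theta)}{\partial \theta} = e^\theta = \lambda = \mathbb E[X] = \mathbb E[T(X)]$$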

Derivatives and Variance#

Taking the second derivative of the cumulant generating function $A(\theta)$ gives the variance of the sufficient statistic $T$.

$$\begin{align*} \frac{\partial}{\partial \theta}\left(\frac{\partial A(\theta)}{\partial \theta^T}\right) &= \frac{\partial}{\partial \theta} \int_X T(x)h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}~{\rm d}x \\ &= \int_X T(x)h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}\left(T(x)^T - \frac{\partial}{\partial \theta}A(\theta)\right)~{\rm d}x \\ &= \int_X T(x)\left(T(x) - \frac{\partial}{\partial \theta^T}A(\theta)\right)^T h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}~{\rm d}x \\ &= \int_X T(x)\left(T(x) - \mathbb E[T(X)]\right)^T h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}~{\rm d}x \\ &= \int_X T(x)T(x)^T h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}~{\rm d}x \\ &\quad- \left(\int_X T(x) h(x)\exp\{\langle\theta,T(x)\rangle - A(\theta)\}~{\rm d}x\right)\mathbb E[T(X)]^T \\ &= \mathbb E[T(X)T(X)^T]-\mathbb E[T(X)]\cdot\mathbb E[T(X)]^T \\ &= Var[T(X)] \end{align*}$$

Similar to the previous section, the interchange of derivatives and integrals is also used here, and the specific details can refer to the Dominated Convergence Theorem.

For derivatives and transposes of matrices and vectors, refer to Blog Garden. The blog and the referenced links provide detailed explanations.
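
Again as a concrete check (my own example), take the Bernoulli family derived above, where $A(\theta)=\log(1+e^\theta)$ and $\lambda=\dfrac{e^\theta}{1+e^\theta}$:

$$\frac{\partial^2 A(\theta)}{\partial \theta^2} = \frac{e^\theta}{(1+e^\theta)^2} = \lambda(1-\lambda) = Var[X] = Var[T(X)]$$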

Parameterization#

Parameterization means representing something using parameters.

If the elements of the parameter $\theta$ of the exponential family are linearly independent, and the elements of the sufficient statistic $T(x)$ are also linearly independent, then we call this exponential family a minimal exponential family.

It seems there is no corresponding Chinese translation for minimal exponential family, so a literal translation is used. However, translating it as the simplest exponential family might be more appropriate for the following reasons:
For those non-minimal exponential families, we can obtain a minimal exponential family through some suitable parameter substitution or transformation.

The log-partition function $A(\theta)$ of a minimal exponential family is a strictly convex function, and it satisfies Fenchel's inequality with equality (as shown below). Before introducing Fenchel's inequality, we first introduce the convex conjugate.

Refer to Wikipedia.
Convex Conjugate
For an extended real-valued function $f: X\rightarrow\mathbb R~\cup~\{-\infty, +\infty\}$ on the original space $X$, its conjugate function on the dual space $X^*$ is denoted

$$f^*: X^*\rightarrow\mathbb R~\cup~\{-\infty, +\infty\}$$

We define the value of the conjugate at a point $x^*\in X^*$ of the dual space, in terms of the points $x\in X$ of the original space, as:

$$f^*(x^*)=\sup_{x\in X}\{\langle x^*,x\rangle-f(x)\}$$

where $\sup$ is the supremum (least upper bound) and $\inf$ is the infimum (greatest lower bound), with the distinction that (see the CSDN blog and Zhihu column):

  1. A bounded real-valued function always has a supremum/infimum, but it may not have a maximum/minimum, since the extreme value may not be attained. For example, $f(x)=\frac{\sin x}{x}$ has supremum $1$ but never attains it.
  2. If the maximum/minimum is attained, it equals the supremum/infimum.

For $A(\theta)$, its convex conjugate $A^*(\theta^*)$ is given by

$$A^*(\theta^*) = \sup_{\theta}\{\langle\theta^*,\theta\rangle-A(\theta)\}$$

We define $\mu = \mathbb E[T(X)]$; thus $\dfrac{\partial}{\partial\theta^T}\left(\langle\theta^*,\theta\rangle-A(\theta)\right) = \theta^*-\mu$.

Therefore, when $\theta^*=\mu$, the derivative is zero and the supremum is attained. The corresponding convex conjugate is $A^*(\mu)=\langle\mu,\theta\rangle-A(\theta)$, and rearranging gives:

$$A^*(\mu) +A(\theta) = \langle\mu,\theta\rangle$$

Fenchel's Inequality
On the other hand, according to Fenchel's inequality, for any $x\in X,~x^*\in X^*$, we have:

$$f(x)+f^*(x^*)\geq\langle x^*, x\rangle$$

Since $\mu\in\partial A(\theta)$, the above holds with equality.
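
As a worked example (my own, not from the references): for the Poisson family with $A(\theta)=e^\theta$, the supremum in $A^*(\mu)=\sup_\theta\{\mu\theta-e^\theta\}$ is attained at $\theta=\log\mu$, which gives

$$A^*(\mu)=\mu\log\mu-\mu,\qquad A^*(\mu)+A(\log\mu)=(\mu\log\mu-\mu)+\mu=\mu\log\mu=\langle\mu,\log\mu\rangle$$

so Fenchel's inequality indeed holds with equality at the mean parameter $\mu=e^\theta=\mathbb E[T(X)]$.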

Mean Representation: The exponential family can be represented using either the standard (canonical) parameterization or the mean parameterization, since $\theta$ and the mean $\mu$ are in one-to-one correspondence. That is, the family can be viewed as a function of $\theta$ or of the mean $\mu$.

Statistical Inference#

Maximum Likelihood Estimation for Population Mean#

First, let's review the concept of Maximum Likelihood Estimation (MLE).

There is an unknown distribution, and we have a series of sample observations. We aim to use these sample observations to infer the most likely distribution. This raises two questions:

  1. Is the model determined? Generally, to simplify the problem, the model is specified. In practical problems, if the model is not specified, it may require trying each model one by one.
  2. Are the parameters determined? The parameters are uncertain. If the model is known, the typical operation is to fit the model using this set of sample observations and then infer the parameters.

We now use Maximum Likelihood Estimation to find the population mean $\mu$. Steps:

  1. Given a set of independent and identically distributed sample observations from $N$ repetitions, denoted $\mathcal D=(x_1,x_2,\cdots,x_N)$.
  2. Write the likelihood function by substituting the sample values into the probability density function (in canonical form) and multiplying the results:
$$L(\theta|\mathcal D) =\prod_{i=1}^N f(x_i;\theta) = \prod_{i=1}^N h(x_i)\exp\{\langle\theta, T(x_i)\rangle-A(\theta)\}$$
  3. Take the logarithm of the likelihood function and differentiate to obtain the score function:
$$\begin{align*} &l(\theta|\mathcal D) = \log L(\theta|\mathcal D) = \log\left(\prod_{i=1}^N h(x_i)\right) + \theta^T\left( \sum_{i=1}^N T(x_i) \right) - NA(\theta) \\ & \nabla_\theta l(\theta|\mathcal D)= \sum_{i=1}^N T(x_i) - N\nabla_\theta A(\theta) \end{align*}$$
  4. Set the derivative to zero and solve the likelihood equation:
$$\nabla_\theta l(\theta|\mathcal D) = 0 \quad \Longrightarrow \quad \nabla_\theta A(\hat\theta) = \frac1N\sum_{i=1}^N T(x_i)$$

The essence of maximum likelihood estimation is to maximize the likelihood function. However, there are special cases:

  1. The log-likelihood function may be monotonic, so its derivative has no zero point.
  2. Or, when there are too few samples, a zero point of the derivative may exist but cannot be attained.

Generally, endpoint values are taken.

We define the population mean $\mu = \mathbb E[T(X)]$, and combining this with the above, we obtain:

$$\hat\mu_{MLE} = \mathbb E[T(X)] ~{\color{red} = }~ \nabla_\theta A(\hat\theta) = \frac1N\sum_{i=1}^N T(x_i)$$

This equality (the red equality) holds because we have already proven it in the previous section Derivatives and Expectations.
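
The following short Python sketch (my own illustration) runs this recipe for the Poisson family, where the likelihood equation $\nabla_\theta A(\hat\theta)=e^{\hat\theta}=\frac1N\sum_i x_i$ gives $\hat\lambda = \bar x$:

```python
import numpy as np

rng = np.random.default_rng(42)
lam_true = 3.0
x = rng.poisson(lam_true, size=10_000)   # i.i.d. Poisson observations, with T(x) = x

# Likelihood equation: grad A(theta_hat) = exp(theta_hat) = (1/N) * sum of T(x_i)
theta_hat = np.log(x.mean())             # MLE of the natural parameter
mu_hat = np.exp(theta_hat)               # MLE of the population mean mu = E[T(X)] (= lambda)

print(mu_hat, x.mean())                  # both equal the sample mean, close to 3.0
```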

$\hat\mu_{MLE}$ is unbiased, because

$$\mathbb E [\hat\mu_{MLE}] = \frac1N\sum_{i=1}^N\mathbb E[T(X_i)] = \frac1N\cdot N\mu = \mu$$

$\hat\mu_{MLE}$ is also efficient: it can be proven that $\hat\mu_{MLE}$ is the uniformly minimum-variance unbiased estimator (UMVUE).

As mentioned above, the first derivative of the log-likelihood function is also called the score function, denoted as:

$$\begin{align*} S(X;\theta) &= \nabla_\theta l(X;\theta) = \nabla_\theta\log L(X;\theta) = \nabla_\theta\log \prod_{i=1}^N f(x_i;\theta) \\ & = \sum_{i=1}^N T(x_i) - N\nabla_\theta A(\theta) \end{align*}$$

where $X$ is the sample sequence $\{X_1,X_2,\cdots, X_N\}$, with corresponding sample observations $\{x_1, x_2,\cdots,x_N\}$.

Refer to Zhihu Q&A for the introduction of Fisher Information. Fisher Information is the second moment of the score function.

$$I(\theta) = \mathbb E[S^2(X;\theta)]$$

Fisher Information is used to measure the precision of parameter estimation.
The Fisher Information obtained from N observations is N times that obtained from a single observation.
In the following, we will take the Fisher Information from a single observation as an example.

The score function is a function of $\theta$, and obviously the Fisher Information matrix is also a function of $\theta$. Referring to Wikipedia and an online blog, we can prove:

$$\begin{align*} \mathbb E[S(X;\theta)] & = \int_X S(X;\theta) f(x;\theta) ~{\rm d}x = \int_X\frac{\frac{\partial}{\partial \theta} f(x;\theta)}{f(x;\theta)}f(x;\theta) ~{\rm d}x\\ &=\color{red} \frac{\partial}{\partial \theta}\int_X f(x;\theta)~{\rm d}x = \frac{\partial}{\partial \theta} 1 = 0 \end{align*}$$

Here, the red part equals zero because the integral and the derivative can be interchanged.
In the discrete case, the integral sign can be replaced by summation. This may produce an N-fold relationship.

Thus, $I(\theta) = \mathbb E[S^2(X;\theta)] - \mathbb E^2[S(X;\theta)] = Var[S(X;\theta)]$. That is, Fisher Information is the variance of the score function.

Assuming the log-likelihood is twice differentiable with respect to $\theta$ (so that $S(X;\theta)$ is differentiable), we can prove:

$$\mathbb E[S^2(X;\theta)] =-\mathbb E\left[\frac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right]$$

The proof process is similar, because

$$\begin{align*} \mathbb E\left[\frac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right] &= \int_X \frac{\partial^2}{\partial\theta^2}\log L(X;\theta)f(x;\theta)~{\rm d}x = \int_X\frac{\partial}{\partial\theta}S(X;\theta)f(x;\theta)~{\rm d}x \\ &=\int_X\frac{\partial}{\partial\theta}\left(\frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)}\right)f(x;\theta)~{\rm d}x\\ &=\int_X\left(\frac{\frac{\partial^2}{\partial\theta^2}f(x;\theta)}{f(x;\theta)}-\left(\frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)}\right)^2\right)f(x;\theta)~{\rm d}x \\ &={\color{red} 0}-\int_X\left(\frac{\partial}{\partial\theta}\log L(X;\theta)\right)^2 f(x;\theta)~{\rm d}x \\ &=-\int_X S^2(X;\theta)f(x;\theta)~{\rm d}x\\ &=-\mathbb E[S^2(X;\theta)] \end{align*}$$

The red part equals zero because the integral and second derivative can be interchanged.
In the discrete case, the integral sign can be replaced by summation. This may produce an N-fold relationship.

We summarize several equivalent expressions for Fisher Information:

$$I(\theta) = \mathbb E[S^2(X;\theta)] = -\mathbb E\left[\frac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right] = -\mathbb E\left[\frac{\partial}{\partial\theta}S(X;\theta)\right] = Var[S(X;\theta)]$$

On the other hand, for a single observation we have $L(\theta) = f_X(x;\theta) = h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}$.
Taking the logarithm and then the second derivative gives:

$$\frac{\partial^2}{\partial\theta^2}\log L(X;\theta) = -\frac{\partial^2}{\partial\theta^2} A(\theta)$$

Thus, we can obtain:

$$I(\theta) = -\mathbb E\left[\frac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right] = -\mathbb E\left[-\frac{\partial^2}{\partial\theta^2} A(\theta) \right] = \frac{\partial^2}{\partial\theta^2} A(\theta) = Var[T(X)]$$

We find that the Fisher Information of the natural parameter $\theta$ is exactly the variance of the sufficient statistic, $Var[T(X)]$ (consistent with the result in the section Derivatives and Variance above).
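
For instance (my own check), for the Poisson family with $\theta=\log\lambda$ and $A(\theta)=e^\theta$:

$$I(\theta) = \frac{\partial^2}{\partial\theta^2} A(\theta) = e^\theta = \lambda = Var[X] = Var[T(X)]$$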

On the other hand,

【To be continued】
