Reading about the exponential family of distributions in the paper, I decided to organize my thoughts. It may not be entirely accurate, and I hope the experts will not hesitate to provide guidance.

The Exponential Family, also known as the exponential class or exponential distributions, is one of the most important **parameterized families of distributions** in statistics.

When studying the exponential family, it should be distinguished from the Exponential Distribution. The two are not the same.

The term "family" in English refers to a group with similar characteristics. The exponential family is a set of distributions whose probability density functions and probability distribution functions **change with the variation of distribution parameters.**

Common examples of the exponential family include:

Normal distribution, Chi-squared distribution, Binomial distribution, Multinomial distribution, Poisson distribution, Pascal distribution, $\beta$ distribution, $\Gamma$ distribution, Log-normal distribution, etc. For specifics, see Wikipedia entry and Zhihu column.

## Exponential Family

The probability density function of the exponential family has the following form:

$f_\mathbf{X}(x;\theta) = h(x)\exp\{\langle\eta(\theta), T(x)\rangle-A(\theta)\}$

where $\theta$ is the only parameter, and all $\theta$ for which the above expression defines a valid density form the parameter space $\Theta$. The corresponding family of distributions $\{P_\theta:\theta\in\Theta\}$ is the exponential family. Note that the parameter $\theta$ is not restricted to a real scalar; it can also be an $n$-dimensional vector $\theta\in \mathbb{R}^n$.

As the parameter $\theta$ changes, the shape of the distribution $X$ (probability density function, probability distribution function, and corresponding graphs) will also change. The random variable $x$ follows the distribution $X$. The functions $T(x), h(x), \eta(\theta), A(\theta)$ are all known functions. The function $h(x)$ is a non-negative function.

$h(x)$ is commonly referred to as the base measure.

$T(x)$ is the sufficient statistic.

$A(\theta)$ is the cumulant generating function or the log-partition function (the logarithm of the partition function). Clearly, $A(\theta)$ is a real-valued function that returns a real number.

Here, $\eta(\theta)$ and $T(x)$ can be real numbers or vectors.

From the definition, since the exponential function is strictly positive, $\exp\{\cdot\} = e^{\{\cdot\}} > 0$, the **support set** of an exponential family depends only on $h(x)$. That is, the support depends only on $x$ and is independent of the unknown parameter $\theta$. We can use this to rule out families that are not exponential families (such as the uniform distribution on $[0,\theta]$, whose support depends on $\theta$).

Here, a brief addition about the concept of the support set is needed. Simply put, for a real-valued function $f$, the support set of $f$ is defined as:

$\text{supp}(f)=\{x\in X:f(x)\neq0\}$

The support set is a subset of the original domain $X$ of the function $f$. For more information, refer to Wikipedia entry or CSDN blog. For a probability density function, since densities are non-negative, the support set of the random variable can be defined as (see Zhihu column):

$\text{supp}(X) =\{x\in \mathbb R : f_X(x)\gt 0\}$

## Several Equivalent Forms

Based on the rules of operations with exponentials, two equivalent forms of the exponential family are given through equivalent transformations:

$f_\mathbf{X}(x;\theta) = h(x)g(\theta)\exp\{\langle\eta(\theta), T(x)\rangle\}$

$f_\mathbf{X}(x;\theta) = \exp\{\langle\eta(\theta), T(x)\rangle-A(\theta)+B(x)\}$

The corresponding substitution relationships are: $-A(\theta) = \ln g(\theta)$, $B(x)=\ln h(x)$

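In particular, if we take $Z(\theta) = \dfrac{1}{g(\theta)}$, so that $A(\theta)=\ln Z(\theta)$, we obtain another very common expression of the exponential family. Here, $Z(\theta)$ is the partition function of this distribution:

$f_\mathbf{X}(x;\theta) = \dfrac{1}{Z(\theta)}h(x)\exp\{\langle\eta(\theta), T(x)\rangle\}$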

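## Canonical Form

In the above definition, $\eta(\theta)$ is a function of the parameter $\theta$. In the exponential family, we require that $\eta(\cdot)$ is a bijective function (i.e., a one-to-one correspondence). A bijection has an inverse function. Bijectivity alone does not imply differentiability, so smoothness of $\eta$ is an additional assumption that is usually made (in the scalar case, a continuous bijection is necessarily monotonic).

Using the properties of bijective functions, we can simplify the form of the exponential family. Let $\hat\theta = \eta(\theta)$, and this transformation is reversible: $\theta = \eta^{-1}(\hat\theta)$. Thus, we obtain: $f_\mathbf{X}(x;\hat\theta) = h(x)\exp\{\langle\hat\theta, T(x)\rangle-A^\prime(\hat\theta)\}$, where $A^\prime(\hat\theta) = A(\eta^{-1}(\hat\theta))$ is the log-partition function expressed in the new parameter (the prime does not denote a derivative).

By renaming the symbols ($\hat\theta\to\theta$, $A^\prime\to A$), we obtain the **Canonical Form** of the exponential family as follows:

$f_\mathbf{X}(x;\theta) = h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}$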

We usually refer to this updated parameter $\theta$ as the canonical parameter of the exponential family.

## Natural Form

Although there are different definitions, it is generally believed that the **Natural Form** of the exponential family is **equivalent or nearly equivalent** to the canonical form. For example, see Stanford University materials, Berkeley University materials, MIT course materials, blog, Zhihu column 1 and Zhihu column 2.

Wikipedia provides another understanding, which will not be introduced here.

## Natural Parameter Space

Before introducing the Natural Parameter Space, we first introduce the log-partition function $A(\theta)$

The partition function can be understood as a special form of normalization constant.

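The log-partition function $A(\theta)$ ensures that $f_\mathbf{X}(x;\theta)$ is normalized, meaning it guarantees that $f_\mathbf{X}(x;\theta)$ is a probability density function. To see this, recall the partition-function expression from the previous section Several Equivalent Forms, written here in canonical form:

$f_\mathbf{X}(x;\theta) = \dfrac{1}{Z(\theta)}h(x)\exp\{\langle\theta, T(x)\rangle\}$

where $Z(\theta) = \int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x$ is a function that is independent of $x$. Then integrating both sides gives:

$\int_X f_\mathbf{X}(x;\theta)~{\rm d}x = \dfrac{1}{Z(\theta)}\int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x = 1$

The so-called natural parameter space is the set of parameters $\theta$ for which the partition function is finite ($\lt \infty$), that is:

$\mathcal N = \left\{\theta: \int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x \lt \infty\right\}$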

The natural parameter space has some special properties. First, the natural parameter space $\mathcal N$ is a convex set, and the log-partition function $A(\theta)$ is a convex function. The proof is as follows:

Consider two different parameters $\theta_1\in\mathcal N,~\theta_2\in\mathcal N$, given $0\lt\lambda\lt 1$, prove that $\theta=\lambda\theta_1+(1-\lambda)\theta_2$ is also in the natural parameter space $\mathcal N$ (i.e., prove that $\theta\in\mathcal N$ also holds)

$\begin{aligned} Z(\theta) &= \exp\{A(\theta)\} = \exp\{A(\lambda\theta_1+(1-\lambda)\theta_2)\}\\ &=\int_X h(x)\exp\{\langle(\lambda\theta_1+(1-\lambda)\theta_2), T(x)\rangle\}~{\rm d}x \\ & = \int_X \left(h(x)^{\lambda}\exp\{\langle\lambda\theta_1, T(x)\rangle \}\right)\left(h(x)^{1-\lambda}\exp\{\langle(1-\lambda)\theta_2, T(x)\rangle\}\right)~{\rm d}x \\ &\leq \left(\int_X h(x)\exp\{\frac1\lambda\langle\lambda\theta_1, T(x)\rangle\} ~{\rm d}x \right)^\lambda \left(\int_X h(x)\exp\{\frac1{1-\lambda}\langle(1-\lambda)\theta_2, T(x)\rangle\} ~{\rm d}x \right)^{1-\lambda} \\ &=Z(\theta_1)^\lambda \cdot Z(\theta_2)^{1-\lambda} \end{aligned}$

The $\leq$ in the above comes from Hölder's inequality (with exponents $1/\lambda$ and $1/(1-\lambda)$), which can be referenced in Wolfram MathWorld or Zhihu column. It is worth mentioning that the famous mathematical software Mathematica was developed by Wolfram Research.

Since $\theta_1,\theta_2\in \mathcal N$, it follows that $Z(\theta_1),Z(\theta_2)\lt\infty$ holds. Therefore, $Z(\theta) \leq Z(\theta_1)^\lambda \cdot Z(\theta_2)^{1-\lambda} \lt \infty$ also holds, and by definition, we have $\theta\in\mathcal N$. This proves that the natural parameter space $\mathcal N$ is a convex set.

Taking the logarithm of the above expression gives:

$A(\theta) = A(\lambda\theta_1+(1-\lambda)\theta_2) \leq \lambda A(\theta_1) + (1-\lambda)A(\theta_2)$

This proves that the log-partition function $A(\theta)$ is a convex function.

When $\theta_1\neq\theta_2$ and $\langle\theta_1-\theta_2, T(x)\rangle$ is not constant on the support (which holds whenever the representation is minimal), Hölder's inequality cannot achieve equality, and $A(\theta)$ is a strictly convex function.

For definitions of convex sets and convex functions, refer to the convex optimization tutorial Zhihu column or the classic textbook on convex optimization cvxbook by Stephen Boyd.
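The convexity inequality can be checked numerically on a concrete case. The sketch below is only an illustration under an assumption: it uses the Bernoulli log-partition function $A(\theta)=\ln(1+e^\theta)$ as the example, and the helper names `log_partition` and `convexity_gap` are made up for this note.

```python
import math

def log_partition(theta: float) -> float:
    """Bernoulli log-partition A(theta) = log(1 + e^theta), in a stable form."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def convexity_gap(theta1: float, theta2: float, lam: float) -> float:
    """lam*A(theta1) + (1-lam)*A(theta2) - A(lam*theta1 + (1-lam)*theta2)."""
    mix = lam * theta1 + (1.0 - lam) * theta2
    return (lam * log_partition(theta1)
            + (1.0 - lam) * log_partition(theta2)
            - log_partition(mix))

# All pairs below have theta1 != theta2, so strict convexity predicts a positive gap.
gaps = [convexity_gap(t1, t2, lam)
        for t1 in (-3.0, -0.5, 2.0)
        for t2 in (-1.0, 0.7, 4.0)
        for lam in (0.1, 0.5, 0.9)]
print(min(gaps) > 0.0)
```

A positive gap for every tested pair is consistent with strict convexity, although a finite grid of points is of course not a proof.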

## Examples of Exponential Family

Recalling the canonical form of the exponential family, we will prove that several common distributions belong to the exponential family.

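## Bernoulli Distribution (Two-point Distribution)

The probability mass function of the Bernoulli distribution (which is discrete, hence a probability mass function) is:

$p(x;\lambda) = \lambda^x(1-\lambda)^{1-x}$

where $\lambda$ is the parameter of this Bernoulli distribution (the probability of the event occurring), $x =0$ (event does not occur), $x =1$ (event occurs). No other values of $x$ exist.

We rewrite the expression:

$p(x;\lambda) = \exp\{x\ln\lambda+(1-x)\ln(1-\lambda)\} = \exp\left\{x\ln\dfrac{\lambda}{1-\lambda}+\ln(1-\lambda)\right\}$

We take

$\theta = \ln\dfrac{\lambda}{1-\lambda},\quad T(x) = x,\quad A(\theta) = -\ln(1-\lambda) = \ln(1+e^\theta),\quad h(x) = 1$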

Thus, it can be proven that the Bernoulli distribution belongs to the **single-parameter** exponential family.
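As a quick sanity check, the standard Bernoulli parameterization $\theta=\ln\frac{\lambda}{1-\lambda}$, $T(x)=x$, $A(\theta)=\ln(1+e^\theta)$, $h(x)=1$ can be compared against the usual probability mass function. This is a minimal sketch; the function names are illustrative only.

```python
import math

def bernoulli_pmf(x: int, lam: float) -> float:
    """Textbook form: lam^x * (1 - lam)^(1 - x)."""
    return lam ** x * (1.0 - lam) ** (1 - x)

def bernoulli_expfam(x: int, lam: float) -> float:
    """Canonical form h(x) * exp(theta * T(x) - A(theta)) with h(x) = 1, T(x) = x."""
    theta = math.log(lam / (1.0 - lam))      # natural parameter (log-odds)
    a = math.log1p(math.exp(theta))          # A(theta) = log(1 + e^theta)
    return math.exp(theta * x - a)

# Largest absolute difference between the two forms over a few parameter values.
max_diff = max(abs(bernoulli_pmf(x, lam) - bernoulli_expfam(x, lam))
               for lam in (0.1, 0.5, 0.9) for x in (0, 1))
print(max_diff < 1e-12)
```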

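## Poisson Distribution

The probability mass function of the Poisson distribution is as follows:

$p(x;\lambda) = \dfrac{\lambda^x e^{-\lambda}}{x!} = \dfrac{1}{x!}\exp\{x\ln\lambda-\lambda\},\quad x=0,1,2,\cdots$

Taking

$\theta = \ln\lambda,\quad T(x) = x,\quad A(\theta) = \lambda = e^\theta,\quad h(x) = \dfrac{1}{x!}$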

It can be proven that the Poisson distribution belongs to the **single-parameter** exponential family.
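The same kind of check works for the Poisson case, using the standard choice $\theta=\ln\lambda$, $T(x)=x$, $A(\theta)=e^\theta$, $h(x)=1/x!$. The sketch below is illustrative; the truncation at 60 terms is an arbitrary assumption that is more than enough for $\lambda=3.5$.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Textbook form: lam^x * e^(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x: int, lam: float) -> float:
    """Canonical form h(x) * exp(theta * T(x) - A(theta))."""
    theta = math.log(lam)                # natural parameter
    a = math.exp(theta)                  # A(theta) = e^theta = lam
    h = 1.0 / math.factorial(x)          # base measure
    return h * math.exp(theta * x - a)

lam = 3.5
max_diff = max(abs(poisson_pmf(x, lam) - poisson_expfam(x, lam)) for x in range(60))
total_mass = sum(poisson_expfam(x, lam) for x in range(60))
print(max_diff < 1e-12, abs(total_mass - 1.0) < 1e-9)
```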

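## Gaussian Distribution (Normal Distribution)

The probability density function of the Gaussian distribution is as follows:

$f(x;\mu,\sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\dfrac{(x-\mu)^2}{2\sigma^2}\right\} = \dfrac{1}{\sqrt{2\pi}}\exp\left\{\dfrac{\mu}{\sigma^2}x-\dfrac{1}{2\sigma^2}x^2-\dfrac{\mu^2}{2\sigma^2}-\ln\sigma\right\}$

Taking

$\theta = \begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix} = \begin{pmatrix}\dfrac{\mu}{\sigma^2}\\ -\dfrac{1}{2\sigma^2}\end{pmatrix},\quad T(x) = \begin{pmatrix}x\\ x^2\end{pmatrix},\quad A(\theta) = \dfrac{\mu^2}{2\sigma^2}+\ln\sigma = -\dfrac{\theta_1^2}{4\theta_2}-\dfrac12\ln(-2\theta_2),\quad h(x) = \dfrac{1}{\sqrt{2\pi}}$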

It can be proven that the Gaussian distribution belongs to the **multi-parameter** exponential family.
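For the two-parameter Gaussian case, the standard natural parameters are $\theta_1=\mu/\sigma^2$ and $\theta_2=-1/(2\sigma^2)$ with $T(x)=(x,x^2)$. The following sketch compares this form with the usual density at a few points; the function names and test points are chosen only for illustration.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Textbook Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def gaussian_expfam(x: float, mu: float, sigma: float) -> float:
    """Canonical form h(x) * exp(<theta, T(x)> - A(theta)) with T(x) = (x, x^2)."""
    theta1 = mu / sigma ** 2
    theta2 = -1.0 / (2.0 * sigma ** 2)
    # A(theta) = -theta1^2 / (4 * theta2) - 0.5 * log(-2 * theta2)
    a = -theta1 ** 2 / (4.0 * theta2) - 0.5 * math.log(-2.0 * theta2)
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(theta1 * x + theta2 * x * x - a)

points = [(-1.0, 0.0, 1.0), (0.3, 1.5, 0.5), (4.0, 2.0, 3.0)]
all_close = all(math.isclose(gaussian_pdf(x, m, s), gaussian_expfam(x, m, s), rel_tol=1e-9)
                for x, m, s in points)
print(all_close)
```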

## Properties of the Exponential Family

## Sufficient Statistic

For understanding sufficient statistics, in addition to this article, you can refer to Zhihu column or blog. These materials will also greatly aid in understanding the content. The notes in this article are partly derived from these materials.

Let $X_1,\cdots,X_n$ be a set of samples from $X$. Before observation, the samples $X_1,\cdots,X_n$ are random variables, and after observation, the samples $X_1,\cdots,X_n$ are specific values.

From the perspective of **mathematical statistics**, we hope to infer the original distribution from the samples. The sufficient statistic, as a statistic defined on the sample space, is a measurable function denoted as $T(X_1,\cdots, X_n)$, often written as $T(X)$. As a statistic, it compresses the information contained in the original random variables.

For example, when calculating the sample mean, the order of the sample values is information we do not care about.

For a set of samples, there exists a joint probability density function, denoted as $f(x)$. If this distribution does not have parameters (or parameters are known), then this function essentially characterizes all the information contained in this set of samples.

If the joint probability density function has an unknown parameter $\theta$, it is denoted as $f(x;\theta)$ or $f_\theta(x)$. Given the value of the statistic $T$ as $T=t$, if the corresponding conditional distribution $F_\theta(X|T=t)$ is a distribution independent of the unknown parameter $\theta$ (i.e., a determined distribution), then this statistic $T$ is a **sufficient statistic**.

A sufficient statistic retains all useful information about the parameter $\theta$ and eliminates useless information.
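The definition can be illustrated on a tiny example. The sketch below assumes three i.i.d. Bernoulli samples with $T(X)=\sum_i X_i$ (an example chosen for this note, not taken from the referenced materials) and computes the conditional distribution of the sample given $T$ by brute-force enumeration for two different parameter values.

```python
import math
from itertools import product

def conditional_given_sum(lam: float, n: int = 3) -> dict:
    """Return {x: P(X = x | T = sum(x))} for n i.i.d. Bernoulli(lam) samples."""
    # Joint probability of every possible sample vector x in {0, 1}^n.
    joint = {x: math.prod(lam if xi else 1.0 - lam for xi in x)
             for x in product((0, 1), repeat=n)}
    # Marginal probability of each value of the statistic T = sum(x).
    totals = {}
    for x, p in joint.items():
        totals[sum(x)] = totals.get(sum(x), 0.0) + p
    return {x: p / totals[sum(x)] for x, p in joint.items()}

cond_a = conditional_given_sum(0.2)
cond_b = conditional_given_sum(0.7)
same = all(math.isclose(cond_a[x], cond_b[x]) for x in cond_a)
print(same, cond_a[(1, 0, 0)])
```

The conditional probabilities coincide for both parameter values (each arrangement with the same sum is equally likely), which is what sufficiency of the sum means in this example.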

Building on sufficient statistics, we further introduce the **minimal sufficient statistic**. Intuitively, we would like a sufficient statistic to be as simple as possible, and this is what the definition of a minimal sufficient statistic captures.

If $T^\star = T^\star(X)$ is a sufficient statistic, and for any sufficient statistic $T=T(X)$, there exists a measurable function $\varphi$ such that $T^\star = \varphi(T)$, then $T^\star$ is a minimal sufficient statistic.

The logic of this definition is that $T^\star$ can be computed from every other sufficient statistic $T$, so it compresses the sample at least as much as any of them. Conversely, whenever $T^\star=\varphi(T)$ is sufficient, $T$ must also be sufficient.

## Derivatives and Expectations

When learning about expectations, we know that calculating an expectation involves computing an integral. However, the special properties of the exponential family can link expectations with derivatives. Since differentiation is generally simpler than integration, we prefer derivatives.

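Taking the first derivative of the cumulant generating function $A(\theta)$ gives us the expectation of the sufficient statistic $T$:

$\begin{aligned} \dfrac{\partial A(\theta)}{\partial\theta^T} &= \dfrac{\partial}{\partial\theta^T}\ln\int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x \\ &= \dfrac{\int_X T(x)h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x}{\int_X h(x)\exp\{\langle\theta, T(x)\rangle\}~{\rm d}x} \\ &= \int_X T(x)h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}~{\rm d}x \\ &= \int_X T(x)f_\mathbf{X}(x;\theta)~{\rm d}x = \mathbb E[T(X)] \end{aligned}$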

There are several points to note in this formula:

- **Why differentiate with respect to $\theta^T$?** This can be simply understood as ensuring that when applying the chain rule, the derivative of $\langle\theta,T(x) \rangle$ yields $T(x)$ rather than $T(x)^T$.
- **Why can the derivative and integral symbols be interchanged?** The interchange is justified by the Dominated Convergence Theorem.
- **Why does the expression include an additional $A(\theta)$?** The denominator is the partition function $Z(\theta)=\exp\{A(\theta)\}$, which is independent of the integration variable $x$, so it can be moved inside the integral and absorbed into the exponent according to the rules of exponentials. For this step, refer to the previous section Several Equivalent Forms.
- **How does the last step become an expectation?** Because $f_\mathbf{X}(x;\theta)$ is a probability density.
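The identity $\partial A/\partial\theta=\mathbb E[T(X)]$ can be checked numerically. The sketch below assumes the Poisson family ($h(x)=1/x!$, $T(x)=x$) as a test case, evaluates $A(\theta)$ by a truncated sum rather than its closed form, and compares a central finite difference with the directly computed mean. The truncation length and step size are arbitrary choices.

```python
import math

def log_partition(theta: float, terms: int = 100) -> float:
    """A(theta) = log sum_x h(x) exp(theta * x), with h(x) = 1/x! (truncated sum)."""
    return math.log(sum(math.exp(theta * x - math.lgamma(x + 1)) for x in range(terms)))

def mean_of_T(theta: float, terms: int = 100) -> float:
    """E[T(X)] = sum_x x * h(x) * exp(theta * x - A(theta))."""
    a = log_partition(theta, terms)
    return sum(x * math.exp(theta * x - math.lgamma(x + 1) - a) for x in range(terms))

theta = math.log(2.5)   # corresponds to Poisson rate 2.5
eps = 1e-5
dA = (log_partition(theta + eps) - log_partition(theta - eps)) / (2.0 * eps)
print(abs(dA - mean_of_T(theta)) < 1e-6)
```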

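## Derivatives and Variance

Taking the second derivative of the cumulant generating function $A(\theta)$ gives us the variance of the sufficient statistic $T$:

$\dfrac{\partial^2 A(\theta)}{\partial\theta\,\partial\theta^T} = \mathbb E\left[T(X)T(X)^T\right]-\mathbb E[T(X)]\,\mathbb E[T(X)]^T = Var[T(X)]$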

Similar to the previous section, the interchange of derivatives and integrals is also used here, and the specific details can refer to the Dominated Convergence Theorem.
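A numerical check of the second-derivative identity, again only as a sketch: it assumes the Bernoulli family ($h(x)=1$, $T(x)=x$ on the support $\{0,1\}$), computes $A(\theta)$ as a log-sum over the support, and compares a second-order finite difference with the variance obtained by enumeration.

```python
import math

SUPPORT = (0, 1)  # Bernoulli support; h(x) = 1 and T(x) = x are assumed

def log_partition(theta: float) -> float:
    """A(theta) = log sum_x exp(theta * x) over the support."""
    return math.log(sum(math.exp(theta * x) for x in SUPPORT))

def variance_of_T(theta: float) -> float:
    """Var[T(X)] computed directly from the probabilities exp(theta * x - A(theta))."""
    a = log_partition(theta)
    probs = [math.exp(theta * x - a) for x in SUPPORT]
    mean = sum(x * p for x, p in zip(SUPPORT, probs))
    return sum((x - mean) ** 2 * p for x, p in zip(SUPPORT, probs))

theta = 0.8
eps = 1e-4
d2A = (log_partition(theta + eps) - 2.0 * log_partition(theta)
       + log_partition(theta - eps)) / eps ** 2
print(abs(d2A - variance_of_T(theta)) < 1e-5)
```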

For derivatives and transposes of matrices and vectors, refer to Blog Garden. The blog and the referenced links provide detailed explanations.

## Parameterization

Parameterization means representing something using parameters.

If the elements of the parameter $\theta$ of the exponential family are linearly independent, and the elements of the sufficient statistic $T(x)$ are also linearly independent, then we can call this exponential family a **minimal exponential family**.

It seems there is no corresponding Chinese translation for minimal exponential family, so a literal translation is used. However, translating it as the simplest exponential family might be more appropriate for the following reasons:

For those non-minimal exponential families, we can obtain a minimal exponential family through some suitable parameter substitution or transformation.

The log-partition function $A(\theta)$ of the minimal exponential family is a strictly convex function that satisfies Fenchel's inequality. Before introducing Fenchel's inequality, we first introduce convex conjugates.

Refer to Wikipedia.

**Convex Conjugate**

For an extended real-valued function $f: X\rightarrow\mathbb R~\cup~\{-\infty, +\infty\}$ on the original space $X$, its conjugate function on the dual space $X^*$ is denoted as

$f^*: X^*\rightarrow\mathbb R~\cup~\{-\infty, +\infty\}$

For a point $x^*\in X^*$ in the dual space, the conjugate is defined by taking the supremum over the points $x\in X$ of the original space:

$f^*(x^*)=\sup_{x\in X}\{\langle x^*,x\rangle-f(x)\}$

where $\sup$ is the supremum (least upper bound), and $\inf$ is the infimum (greatest lower bound). They differ from the maximum and minimum as follows (see CSDN blog, Zhihu column):

- A real-valued function that is bounded above (below) always has a supremum (infimum), but it may not have a maximum or minimum, because the supremum/infimum may not be attained. For example, $f(x)=\frac{\sin x}{x}$ on $x\neq0$ has supremum $1$, which is never attained.
- If the maximum/minimum can be attained, it is the supremum/infimum.

For $A(\theta)$, its convex conjugate $A^*(\theta^*)$ is given by

$A^*(\theta^*)=\sup_{\theta\in\mathcal N}\{\langle\theta^*,\theta\rangle-A(\theta)\}$

We define $\mu = \mathbb E[T(X)]$, thus $\dfrac{\partial}{\partial\theta^T}\left(\langle\theta^*,\theta\rangle-A(\theta)\right) = \theta^*-\mu$.

Therefore, when $\theta^*=\mu$, the derivative is zero and the supremum is attained. The corresponding convex conjugate is $A^*(\mu)=\langle\mu,\theta\rangle-A(\theta)$, and with some rearrangement, we obtain:

$A(\theta)+A^*(\mu)=\langle\mu,\theta\rangle$

**Fenchel's Inequality**

On the other hand, according to Fenchel's inequality, for any $x\in X,~x^*\in X^*$, we have:

$f(x)+f^*(x^*)\geq\langle x^*,x\rangle$

Since $\mu\in\partial A(\theta)$, the above achieves equality.
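To make the conjugate concrete, the sketch below assumes the Bernoulli log-partition function $A(\theta)=\ln(1+e^\theta)$, whose conjugate is known to be the negative entropy $\mu\ln\mu+(1-\mu)\ln(1-\mu)$. It approximates the supremum by a plain grid search (the grid bounds and resolution are arbitrary assumptions) and checks Fenchel's inequality at two points.

```python
import math

def log_partition(theta: float) -> float:
    """Bernoulli log-partition A(theta) = log(1 + e^theta), in a stable form."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def conjugate_numeric(mu: float, lo: float = -20.0, hi: float = 20.0, steps: int = 40000) -> float:
    """Grid approximation of A*(mu) = sup_theta { mu * theta - A(theta) }."""
    best = -math.inf
    for i in range(steps + 1):
        theta = lo + (hi - lo) * i / steps
        best = max(best, mu * theta - log_partition(theta))
    return best

def negative_entropy(mu: float) -> float:
    """Closed-form conjugate for the Bernoulli case."""
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

mu = 0.3
a_star = conjugate_numeric(mu)
theta_star = math.log(mu / (1.0 - mu))   # the theta whose mean parameter is mu
gap_at_optimum = log_partition(theta_star) + negative_entropy(mu) - mu * theta_star
gap_elsewhere = log_partition(1.5) + negative_entropy(mu) - mu * 1.5
print(abs(a_star - negative_entropy(mu)) < 1e-6, abs(gap_at_optimum) < 1e-12, gap_elsewhere > 0.0)
```

The gap $A(\theta)+A^*(\mu)-\langle\mu,\theta\rangle$ vanishes only when $\theta$ and $\mu$ correspond to each other, and is positive for the mismatched pair.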

**Mean Representation** The exponential family can be represented using standard parameterization (canonical parameterization) or mean parameterization, as $\theta$ and the mean $\mu$ are in one-to-one correspondence. That is, it can be viewed as a function of $\theta$ or as a function of the mean $\mu$.

## Statistical Inference

## Maximum Likelihood Estimation for Population Mean

First, let's review the concept of Maximum Likelihood Estimation (MLE).

There is an **unknown distribution**, and we have a series of sample observations. We aim to use these sample observations to infer the most likely distribution. This raises two questions:

- **Is the model determined?** Generally, to simplify the problem, the model is specified. In practical problems, if the model is not specified, it may require trying each model one by one.
- **Are the parameters determined?** The parameters are uncertain. If the model is known, the typical operation is to fit the model using this set of sample observations and then infer the parameters.

**Using Maximum Likelihood Estimation to find the population mean $\mu$**. Steps:

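- Given a set of $N$ independent and identically distributed sample observations, denoted as $\mathcal D=(x_1,x_2,\cdots,x_N)$.
- Write the likelihood function. The method is to directly substitute these sample values into the probability density function and multiply the results:

  $L(\theta;\mathcal D)=\prod_{i=1}^N f_\mathbf{X}(x_i;\theta)=\left(\prod_{i=1}^N h(x_i)\right)\exp\left\{\left\langle\theta,\sum_{i=1}^N T(x_i)\right\rangle-NA(\theta)\right\}$

- Take the logarithm of the likelihood function and differentiate to obtain the score function:

  $\dfrac{\partial}{\partial\theta^T}\log L(\theta;\mathcal D)=\sum_{i=1}^N T(x_i)-N\dfrac{\partial A(\theta)}{\partial\theta^T}$

- Set the derivative to zero and solve the likelihood equation:

  $\dfrac{\partial A(\theta)}{\partial\theta^T}\bigg|_{\theta=\hat\theta_{MLE}}=\dfrac1N\sum_{i=1}^N T(x_i)$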

The essence of maximum likelihood estimation is to maximize the likelihood function. However, there are special cases in which the likelihood equation has no usable solution:

- The log-likelihood function is monotonic, so its derivative has no zero.
- There are too few samples, so a zero of the derivative exists but cannot be attained.

In these cases, endpoint (boundary) values are generally taken.

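We define the population mean $\mu = \mathbb E[T(X)]$, and combining with the above, we obtain:

$\hat\mu_{MLE} = \mathbb E_{\hat\theta_{MLE}}[T(X)] ~{\color{red}{=}}~ \dfrac{\partial A(\theta)}{\partial\theta^T}\bigg|_{\theta=\hat\theta_{MLE}} = \dfrac1N\sum_{i=1}^N T(x_i)$

This equality (the red equality) holds because we have already proven it in the previous section Derivatives and Expectations.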
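The result that the MLE of the mean parameter is the sample average of $T(x_i)$ can be tried out on simulated data. The sketch below assumes a Bernoulli model with true parameter 0.3 and a fixed random seed, both picked only for illustration. It computes $\hat\mu$ as the sample mean, maps it to the natural parameter, and confirms that the score vanishes there.

```python
import math
import random

def sigmoid(theta: float) -> float:
    """A'(theta) for the Bernoulli family, i.e. the mean parameter."""
    return 1.0 / (1.0 + math.exp(-theta))

def logit(p: float) -> float:
    """Inverse of sigmoid: maps the mean parameter back to the natural parameter."""
    return math.log(p / (1.0 - p))

def score(theta: float, data: list) -> float:
    """d/dtheta log L = sum_i T(x_i) - N * A'(theta), with T(x) = x."""
    return sum(data) - len(data) * sigmoid(theta)

rng = random.Random(0)          # fixed seed so the run is reproducible
true_lam = 0.3                  # assumed ground-truth parameter for the simulation
samples = [1 if rng.random() < true_lam else 0 for _ in range(5000)]

mu_hat = sum(samples) / len(samples)   # MLE of the mean: average of T(x_i)
theta_hat = logit(mu_hat)              # corresponding natural parameter
print(abs(score(theta_hat, samples)) < 1e-8)
```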

**$\hat\mu_{MLE}$ is unbiased**, because

$\mathbb E[\hat\mu_{MLE}] = \dfrac1N\sum_{i=1}^N\mathbb E[T(X_i)] = \dfrac1N\cdot N\mu = \mu$

**$\hat\mu_{MLE}$ is efficient**. It can be proven that $\hat\mu_{MLE}$ is the uniformly minimum-variance unbiased estimator (UMVUE).

As mentioned above, **the first derivative of the log-likelihood function is also called the score function**, denoted as:

$S(X;\theta)=\dfrac{\partial}{\partial\theta}\log L(X;\theta)$

where $X$ is the sample sequence $\{X_1,X_2,\cdots, X_n\}$, and the corresponding sample observations are $\{x_1, x_2,\cdots,x_n\}$.

Refer to Zhihu Q&A for the introduction of **Fisher Information**. Fisher Information is the second moment of the score function:

$I(\theta)=\mathbb E[S^2(X;\theta)]$

Fisher Information is used to measure the precision of parameter estimation.

The Fisher Information obtained from N observations is N times that obtained from a single observation.

In the following, we will take the Fisher Information from a single observation as an example.

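**The score function is a function of $\theta$, and obviously, this Fisher Information matrix is also a function of $\theta$.** Following Wikipedia and online blog, we can prove that the expectation of the score function is zero:

$\begin{aligned} \mathbb E[S(X;\theta)] &= \int_X \frac{\partial}{\partial\theta}\log L(X;\theta)f(x;\theta)~{\rm d}x = \int_X\frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)}f(x;\theta)~{\rm d}x \\ &= \int_X\frac{\partial}{\partial\theta}f(x;\theta)~{\rm d}x = {\color{red}{\frac{\partial}{\partial\theta}\int_X f(x;\theta)~{\rm d}x}} = \frac{\partial}{\partial\theta}1 = 0 \end{aligned}$

Here, the red part equals zero because the integral and the derivative can be interchanged, and $f(x;\theta)$ integrates to 1.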

In the discrete case, the integral sign can be replaced by summation. This may produce an N-fold relationship.

Thus, $I(\theta) = \mathbb E[S^2(X;\theta)] - \mathbb E^2[S(X;\theta)] = Var[S(X;\theta)]$. That is, Fisher Information is the variance of the score function.

Since $\log L(X;\theta)$ is twice differentiable, we can also prove:

$I(\theta)=-\mathbb E\left[\dfrac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right]$

The proof process is similar, because

$\begin{aligned} \mathbb E\left[\frac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right] &= \int_X \frac{\partial^2}{\partial\theta^2}\log L(X;\theta)f(x;\theta)~{\rm d}x = \int_X\frac{\partial}{\partial\theta}S(X;\theta)f(x;\theta)~{\rm d}x \\ &=\int_X\frac{\partial}{\partial\theta}\left(\frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)}\right)f(x;\theta)~{\rm d}x\\ &=\int_X\left(\frac{\frac{\partial^2}{\partial\theta^2}f(x;\theta)}{f(x;\theta)}-\left(\frac{\frac{\partial}{\partial\theta}f(x;\theta)}{f(x;\theta)}\right)^2\right)f(x;\theta)~{\rm d}x \\ &={\color{red} 0}-\int_X\left(\frac{\partial}{\partial\theta}\log L(X;\theta)\right)^2 f(x;\theta)~{\rm d}x \\ &=-\int_X S^2(X;\theta)f(x;\theta)~{\rm d}x\\ &=-\mathbb E[S^2(X;\theta)] \end{aligned}$

The red part equals zero because the integral and second derivative can be interchanged.

In the discrete case, the integral sign can be replaced by summation. This may produce an N-fold relationship.

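**We summarize several equivalent expressions for Fisher Information**:

$I(\theta)=\mathbb E[S^2(X;\theta)]=Var[S(X;\theta)]=-\mathbb E\left[\dfrac{\partial^2}{\partial\theta^2}\log L(X;\theta)\right]$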

On the other hand, we have $L(\theta) = f_X(x;\theta) = h(x)\exp\{\langle\theta, T(x)\rangle-A(\theta)\}$

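Taking the logarithm and then the second derivative gives us:

$\dfrac{\partial^2}{\partial\theta\,\partial\theta^T}\log L(\theta)=-\dfrac{\partial^2 A(\theta)}{\partial\theta\,\partial\theta^T}$

Thus, we can obtain:

$I(\theta)=-\mathbb E\left[\dfrac{\partial^2}{\partial\theta\,\partial\theta^T}\log L(\theta)\right]=\dfrac{\partial^2 A(\theta)}{\partial\theta\,\partial\theta^T}=Var[T(X)]$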

We find that the Fisher Information of the natural parameter $\theta$ is exactly the variance of the sufficient statistic $Var[T(X)]$.
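The equivalent expressions can be compared on a single Bernoulli observation, where every expectation is a two-term sum. In the sketch below (an illustration under the assumption $T(x)=x$, $A(\theta)=\ln(1+e^\theta)$), the score is $x-A'(\theta)$ and the second derivative of the log-likelihood is $-A''(\theta)$, which does not depend on $x$.

```python
import math

def sigmoid(theta: float) -> float:
    """A'(theta) for the Bernoulli family."""
    return 1.0 / (1.0 + math.exp(-theta))

def fisher_information_forms(theta: float):
    """Return (E[S], E[S^2], -E[d^2 log L / d theta^2], Var[T]) for one observation."""
    p1 = sigmoid(theta)
    probs = {0: 1.0 - p1, 1: p1}
    score = {x: x - p1 for x in probs}              # d/dtheta [theta * x - A(theta)]
    mean_score = sum(probs[x] * score[x] for x in probs)
    second_moment = sum(probs[x] * score[x] ** 2 for x in probs)
    neg_expected_hessian = p1 * (1.0 - p1)          # A''(theta), the same for every x
    mean_T = sum(probs[x] * x for x in probs)
    var_T = sum(probs[x] * x * x for x in probs) - mean_T ** 2
    return mean_score, second_moment, neg_expected_hessian, var_T

mean_score, second_moment, neg_hessian, var_T = fisher_information_forms(0.4)
print(abs(mean_score) < 1e-12,
      math.isclose(second_moment, neg_hessian),
      math.isclose(neg_hessian, var_T))
```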

On the other hand,

【To be continued】