Derivation of KL divergence between normal distributions

KL divergence is widely used in machine learning algorithms. I was implementing a Variational Autoencoder in Chainer, which requires computing the KL divergence between normal distributions, so I tried to derive it myself.

KL divergence

KL divergence is a measure of how different two given probability distributions are. It is defined as follows.
Let $P$ and $Q$ be probability distributions with probability density functions $p$ and $q$. Then the KL divergence between $P$ and $Q$ is defined as $$ D_{KL}(P || Q) = \int p(x)\log{\frac{p(x)}{q(x)}}dx = E_P[\log{\frac{p(x)}{q(x)}}] $$ where $E_P[f(x)]$ denotes the expectation of $f(x)$ under the probability distribution $P$.
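To make the definition concrete, here is a minimal sketch that approximates the integral numerically with a midpoint rule. The helper names `normal_pdf` and `kl_numeric`, the example parameters, and the integration range $[-20, 20]$ are all assumptions chosen for illustration; nothing here is specific to Chainer.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def kl_numeric(p, q, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x)) over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0.0:  # the integrand tends to 0 where p(x) vanishes
            total += px * math.log(px / qx) * dx
    return total


def p(x):
    return normal_pdf(x, 0.0, 1.0)  # example P (assumed values)


def q(x):
    return normal_pdf(x, 1.0, 2.0)  # example Q (assumed values)


kl_pq = kl_numeric(p, q, -20.0, 20.0)
kl_qp = kl_numeric(q, p, -20.0, 20.0)
kl_pp = kl_numeric(p, p, -20.0, 20.0)
print(kl_pq, kl_qp, kl_pp)
```

The two directions give different values, which shows that KL divergence is not symmetric, and the divergence of a distribution from itself is zero.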

KL divergence between normal distributions

Here are the final results for the KL divergence $D_{KL}(P || Q)$ between normal distributions.

Univariate Case

Let $P$ be $N(\mu_1, \sigma_1)$ and $Q$ be $N(\mu_2, \sigma_2)$, where $\sigma_1$ and $\sigma_2$ are standard deviations, and let $p$ and $q$ be their probability density functions. Then the KL divergence between $P$ and $Q$ is $$ D_{KL}(P || Q) = \log{\frac{\sigma_2}{\sigma_1}} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2} $$
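The univariate formula translates directly into code. The sketch below uses only the standard library; the function name `kl_normal` and the example parameters are assumptions for illustration.

```python
import math


def kl_normal(mu1, sigma1, mu2, sigma2):
    """KL divergence D_KL(P || Q) for P = N(mu1, sigma1), Q = N(mu2, sigma2).

    sigma1 and sigma2 are standard deviations (not variances).
    """
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


kl_example = kl_normal(0.0, 1.0, 1.0, 2.0)  # assumed example values
kl_same = kl_normal(0.0, 1.0, 0.0, 1.0)     # identical distributions
print(kl_example, kl_same)
```

For identical distributions every term cancels and the result is zero, which is a quick sanity check on the signs.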

Multivariate Case

Let $P$ be $N_d(\boldsymbol{\mu_1}, \boldsymbol{\Sigma_1})$ and $Q$ be $N_d(\boldsymbol{\mu_2}, \boldsymbol{\Sigma_2})$ ($d$-dimensional normal distributions), and let $p$ and $q$ be their probability density functions. Then the KL divergence between $P$ and $Q$ is $$ D_{KL}(P || Q) = \frac{1}{2}\{\log{\frac{|\boldsymbol{\Sigma_2}|}{|\boldsymbol{\Sigma_1}|}} + \mathrm{Tr}(\boldsymbol{\Sigma_2^{-1}}\boldsymbol{\Sigma_1}) + (\boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}}\boldsymbol{\Sigma_2^{-1}}(\boldsymbol{\mu_1} - \boldsymbol{\mu_2}) - d\} $$
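The multivariate formula can be sketched with NumPy as follows. The function name `kl_mvn` and the example parameters are assumptions for illustration. The sketch assumes both covariance matrices are symmetric positive definite, and it uses `np.linalg.slogdet` and `np.linalg.solve` rather than explicit determinants and inverses, which is a numerical-stability choice and not part of the formula itself.

```python
import numpy as np


def kl_mvn(mu1, cov1, mu2, cov2):
    """D_KL(P || Q) for P = N_d(mu1, cov1) and Q = N_d(mu2, cov2).

    Assumes cov1 and cov2 are symmetric positive definite.
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    d = mu1.shape[0]
    diff = mu1 - mu2

    # slogdet returns (sign, log|det|); the sign is +1 for positive definite input
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)

    trace_term = np.trace(np.linalg.solve(cov2, cov1))  # Tr(cov2^{-1} cov1)
    quad_term = diff @ np.linalg.solve(cov2, diff)      # diff^T cov2^{-1} diff

    return 0.5 * (logdet2 - logdet1 + trace_term + quad_term - d)


# Assumed example values
kl_2d = kl_mvn([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]],
               [1.0, -1.0], [[2.0, 0.0], [0.0, 1.0]])
# With d = 1 the result should agree with the univariate formula
kl_1d = kl_mvn([0.0], [[1.0]], [1.0], [[4.0]])
print(kl_2d, kl_1d)
```

Note that in the $d = 1$ call the covariance is the variance $\sigma_2^2 = 4$, which corresponds to $\sigma_2 = 2$ in the univariate formula.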

Derivation of univariate case

Let's calculate $\log{\frac{p(x)}{q(x)}}$ first.
The densities $p(x)$ and $q(x)$ are $p(x) = \frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{1}{2\sigma_1^2}(x - \mu_1)^2}$ and $q(x) = \frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{1}{2\sigma_2^2}(x - \mu_2)^2}$.

$$ \begin{align} \log{\frac{p(x)}{q(x)}} &= \log\{\frac{\sqrt{2\pi}\sigma_2}{\sqrt{2\pi}\sigma_1}e^{-\frac{1}{2\sigma_1^2}(x - \mu_1)^2 + \frac{1}{2\sigma_2^2}(x - \mu_2)^2}\} \\ &= \log{\frac{\sigma_2}{\sigma_1}} - \frac{1}{2\sigma_1^2}(x - \mu_1)^2 + \frac{1}{2\sigma_2^2}(x - \mu_2)^2 \end{align} $$
Taking the expectation under $P$ and using $E_P[(x - \mu_1)^2] = \sigma_1^2$ gives
$$ \begin{align} E_P[\log{\frac{p(x)}{q(x)}}] &= E_P[\log{\frac{\sigma_2}{\sigma_1}} - \frac{1}{2\sigma_1^2}(x - \mu_1)^2 + \frac{1}{2\sigma_2^2}(x - \mu_2)^2] \\ &= E_P[\log{\frac{\sigma_2}{\sigma_1}}] - E_P[\frac{1}{2\sigma_1^2}(x - \mu_1)^2] + E_P[\frac{1}{2\sigma_2^2}(x - \mu_2)^2] \\ &= \log{\frac{\sigma_2}{\sigma_1}} - \frac{1}{2\sigma_1^2}E_P[(x - \mu_1)^2] + \frac{1}{2\sigma_2^2}E_P[(x - \mu_2)^2] \\ &= \log{\frac{\sigma_2}{\sigma_1}} - \frac{1}{2\sigma_1^2}\sigma_1^2 + \frac{1}{2\sigma_2^2}E_P[(x - \mu_2)^2] \\ &= \log{\frac{\sigma_2}{\sigma_1}} - \frac{1}{2} + \frac{1}{2\sigma_2^2}E_P[(x - \mu_2)^2] \end{align} $$
For $E_P[(x - \mu_2)^2]$, we add and subtract $\mu_1$ and use $E_P[x - \mu_1] = 0$:
$$ \begin{align} E_P[(x - \mu_2)^2] &= E_P[(x - \mu_1 + \mu_1 - \mu_2)^2] \\ &= E_P[(x - \mu_1)^2 + 2(x - \mu_1)(\mu_1 - \mu_2) + (\mu_1 - \mu_2)^2] \\ &= E_P[(x - \mu_1)^2] + 2(\mu_1 - \mu_2)E_P[(x - \mu_1)] + (\mu_1 - \mu_2)^2 \\ &= \sigma_1^2 + 2(\mu_1 - \mu_2)\cdot 0 + (\mu_1 - \mu_2)^2 \\ &= \sigma_1^2 + (\mu_1 - \mu_2)^2 \end{align} $$
Finally,
$$ \begin{align} E_P[\log{\frac{p(x)}{q(x)}}] = \log{\frac{\sigma_2}{\sigma_1}} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2} \end{align} $$
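The univariate derivation can be sanity-checked by sampling from $P$ and averaging. The sketch below is a Monte Carlo check, not part of the derivation: the example parameters, the sample size, and the fixed seed are all assumptions, and the estimates only agree with the exact values up to sampling noise.

```python
import math
import random

mu1, sigma1 = 0.0, 1.0  # parameters of P (assumed example values)
mu2, sigma2 = 1.0, 2.0  # parameters of Q (assumed example values)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000
samples = [rng.gauss(mu1, sigma1) for _ in range(n)]

# Intermediate step: E_P[(x - mu2)^2] versus sigma1^2 + (mu1 - mu2)^2
second_moment_mc = sum((x - mu2) ** 2 for x in samples) / n
second_moment_exact = sigma1 ** 2 + (mu1 - mu2) ** 2


def log_ratio(x):
    """log(p(x) / q(x)) after cancelling the sqrt(2*pi) factors."""
    return (math.log(sigma2 / sigma1)
            - (x - mu1) ** 2 / (2.0 * sigma1 ** 2)
            + (x - mu2) ** 2 / (2.0 * sigma2 ** 2))


# Full result: sample mean of log(p/q) under P versus the closed form
kl_mc = sum(log_ratio(x) for x in samples) / n
kl_exact = (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

print(second_moment_mc, second_moment_exact)
print(kl_mc, kl_exact)
```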

Derivation of multivariate case

$p(\boldsymbol{x})$ and $q(\boldsymbol{x})$ are defined as
$$ \begin{align} p(\boldsymbol{x}) &= \frac{1}{(2\pi)^{\frac{d}{2}}|\boldsymbol{\Sigma_1}|^{\frac{1}{2}}}e^{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}\boldsymbol{\Sigma_1}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_1}) } \\ q(\boldsymbol{x}) &= \frac{1}{(2\pi)^{\frac{d}{2}}|\boldsymbol{\Sigma_2}|^{\frac{1}{2}}}e^{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}\boldsymbol{\Sigma_2}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_2}) } \end{align} $$
We will use the following relationship between the matrix trace and a quadratic form. It holds because a scalar equals its own trace and the trace is invariant under cyclic permutation.
$$ \boldsymbol{x}^{\mathrm{T}}\boldsymbol{A}\boldsymbol{x} = \mathrm{Tr}( \boldsymbol{x}^{\mathrm{T}}\boldsymbol{A}\boldsymbol{x} ) = \mathrm{Tr}( \boldsymbol{A}\boldsymbol{x}\boldsymbol{x}^{\mathrm{T}} ) $$
Let's calculate $\log{\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}}$ first.
$$ \begin{align} \log{\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}} &= \log\{ \frac{(2\pi)^{\frac{d}{2}} |\boldsymbol{\Sigma_2}|^{\frac{1}{2}} }{ (2\pi)^{\frac{d}{2}} |\boldsymbol{\Sigma_1}|^{\frac{1}{2}}} e^{-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}\boldsymbol{\Sigma_1}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_1}) + \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}\boldsymbol{\Sigma_2}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_2}) } \} \\ &= \log\{ \frac{|\boldsymbol{\Sigma_2}|^{\frac{1}{2}}}{|\boldsymbol{\Sigma_1}|^{\frac{1}{2}}} e^{ -\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}\boldsymbol{\Sigma_1}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_1}) + \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}\boldsymbol{\Sigma_2}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_2}) } \} \\ &= \frac{1}{2}\log{\frac{|\boldsymbol{\Sigma_2}|}{|\boldsymbol{\Sigma_1}|}} - \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}\boldsymbol{\Sigma_1}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_1}) + \frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}\boldsymbol{\Sigma_2}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_2}) \end{align} $$
Then, applying the relationship between the trace and the quadratic form,
$$ \log{\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}} = \frac{1}{2}\log{\frac{|\boldsymbol{\Sigma_2}|}{|\boldsymbol{\Sigma_1}|}} - \frac{1}{2}\mathrm{Tr}\{ \boldsymbol{\Sigma_1}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_1})(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}} \} + \frac{1}{2}\mathrm{Tr}\{ \boldsymbol{\Sigma_2}^{-1}(\boldsymbol{x} - \boldsymbol{\mu_2})(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}} \} $$ applying expectation on it gives
$$ E_P[\log{\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}}] = \frac{1}{2}\log{\frac{|\boldsymbol{\Sigma_2}|}{|\boldsymbol{\Sigma_1}|}} - \frac{1}{2}\mathrm{Tr}\{ \boldsymbol{\Sigma_1}^{-1}E_P[(\boldsymbol{x} - \boldsymbol{\mu_1})(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}] \} + \frac{1}{2}\mathrm{Tr}\{ \boldsymbol{\Sigma_2}^{-1}E_P[(\boldsymbol{x} - \boldsymbol{\mu_2})(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}] \} $$ For $E_P[(\boldsymbol{x} - \boldsymbol{\mu_1})(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}]$, by the definition of the population covariance matrix,
$$ E_P[(\boldsymbol{x} - \boldsymbol{\mu_1})(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}}] = \boldsymbol{\Sigma_1} $$ for $E_P[(\boldsymbol{x} - \boldsymbol{\mu_2})(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}]$
$$ \begin{align} E_P[(\boldsymbol{x} - \boldsymbol{\mu_2})(\boldsymbol{x} - \boldsymbol{\mu_2})^{\mathrm{T}}] &= E_P[(\boldsymbol{x} - \boldsymbol{\mu_1} + \boldsymbol{\mu_1} - \boldsymbol{\mu_2})(\boldsymbol{x} - \boldsymbol{\mu_1} + \boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}}] \\ &= E_P[ (\boldsymbol{x} - \boldsymbol{\mu_1})(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}} + (\boldsymbol{x} - \boldsymbol{\mu_1})(\boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}} + (\boldsymbol{\mu_1} - \boldsymbol{\mu_2})(\boldsymbol{x} - \boldsymbol{\mu_1})^{\mathrm{T}} + (\boldsymbol{\mu_1} - \boldsymbol{\mu_2})(\boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}}] \\ &= \boldsymbol{\Sigma_1} + (\boldsymbol{\mu_1} - \boldsymbol{\mu_2})(\boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}} \end{align} $$ where the two cross terms vanish because $E_P[\boldsymbol{x} - \boldsymbol{\mu_1}] = \boldsymbol{0}$. Finally,
$$ \begin{align} E_P[\log{\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}}] &= \frac{1}{2}\log{\frac{|\boldsymbol{\Sigma_2}|}{|\boldsymbol{\Sigma_1}|}} - \frac{1}{2}\mathrm{Tr}\boldsymbol{I}_d + \frac{1}{2}\mathrm{Tr}\{\boldsymbol{\Sigma_2}^{-1}\boldsymbol{\Sigma_1} + \boldsymbol{\Sigma_2}^{-1}(\boldsymbol{\mu_1} - \boldsymbol{\mu_2})(\boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}}\} \\ &= \frac{1}{2}\{\log{\frac{|\boldsymbol{\Sigma_2}|}{|\boldsymbol{\Sigma_1}|}} + \mathrm{Tr}(\boldsymbol{\Sigma_2^{-1}}\boldsymbol{\Sigma_1}) + (\boldsymbol{\mu_1} - \boldsymbol{\mu_2})^{\mathrm{T}}\boldsymbol{\Sigma_2^{-1}}(\boldsymbol{\mu_1} - \boldsymbol{\mu_2}) - d\} \end{align} $$
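Two ingredients of the multivariate derivation are easy to check numerically: the trace identity, and the second moment of $\boldsymbol{x}$ about $\boldsymbol{\mu_2}$ under $P$. The sketch below uses NumPy; the example parameters, the sample size, and the seed are assumptions, and the second check is a Monte Carlo estimate that matches the exact matrix only up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumed example parameters
mu1 = np.array([0.0, 0.0])
cov1 = np.array([[1.0, 0.3], [0.3, 2.0]])
mu2 = np.array([1.0, -1.0])

# Trace identity: x^T A x equals Tr(A x x^T) for any vector x and square matrix A
x = rng.normal(size=2)
A = rng.normal(size=(2, 2))
quad_form = x @ A @ x
trace_form = np.trace(A @ np.outer(x, x))

# Second moment about mu2 under P, estimated from samples drawn from P
samples = rng.multivariate_normal(mu1, cov1, size=200_000)
centered = samples - mu2
second_moment_mc = centered.T @ centered / len(samples)

# Exact value from the derivation: Sigma_1 + (mu1 - mu2)(mu1 - mu2)^T
diff = mu1 - mu2
second_moment_exact = cov1 + np.outer(diff, diff)

print(quad_form, trace_form)
print(second_moment_mc)
print(second_moment_exact)
```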
