Bayesian logistic regression[1] is the application of Bayesian inference to logistic regression. Exact evaluation of the posterior distribution and the predictive distribution is intractable. However, the Laplace approximation[2][3] and the variational approximation[4] can be applied to derive a Gaussian representation of the posterior distribution. In particular, the greater flexibility of the variational approximation, compared with the Laplace approach, results in improved accuracy. The predictive distribution for new data points can then be evaluated from the approximate posterior.
Posterior distribution
Laplace approximation
The Laplace approximation, also known as the S–L approximation (Spiegelhalter and Lauritzen, 1990), finds the mode of the posterior distribution and then fits a Gaussian centered at that point.
Starting with a Gaussian prior is natural when seeking a Gaussian approximation to the posterior distribution:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$
In the two-class classification problem, the likelihood function is
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = \sigma(\mathbf{w}^{\mathsf{T}} \phi_n),$$
where $t_n \in \{0, 1\}$ is the class label and $\phi_n = \phi(\mathbf{x}_n)$ is the feature vector.
The log posterior distribution over $\mathbf{w}$ is then
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\tfrac{1}{2} (\mathbf{w} - \mathbf{m}_0)^{\mathsf{T}} \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \text{const}.$$
The maximum a posteriori (MAP) solution $\mathbf{w}_{\mathrm{MAP}}$ is taken as the mean, and the negative Hessian of the log posterior,
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^{\mathsf{T}},$$
with $y_n$ evaluated at $\mathbf{w}_{\mathrm{MAP}}$, gives the inverse covariance matrix of the Gaussian. Therefore, the Gaussian approximation to the posterior distribution has the form
$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N).$$
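The procedure above can be summarized in a short numerical sketch. The following is a minimal NumPy illustration of the Laplace approximation, assuming a design matrix Phi of feature vectors, binary targets t, and a Gaussian prior with mean m0 and covariance S0; the function and variable names are illustrative rather than taken from any particular library.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_approximation(Phi, t, m0, S0, n_iter=100, tol=1e-8):
    """Gaussian (Laplace) approximation N(w_MAP, S_N) to the posterior of
    Bayesian logistic regression with Gaussian prior N(m0, S0)."""
    S0_inv = np.linalg.inv(S0)

    def grad_and_hess(w):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + S0_inv @ (w - m0)       # gradient of the negative log posterior
        R = y * (1.0 - y)                                # Bernoulli variances y_n (1 - y_n)
        H = Phi.T @ (Phi * R[:, None]) + S0_inv          # Hessian of the negative log posterior
        return grad, H

    # Newton's method to find the posterior mode w_MAP
    w = m0.astype(float)
    for _ in range(n_iter):
        grad, H = grad_and_hess(w)
        step = np.linalg.solve(H, grad)
        w = w - step
        if np.linalg.norm(step) < tol:
            break

    _, H = grad_and_hess(w)                              # curvature at the mode
    S_N = np.linalg.inv(H)                               # posterior covariance
    return w, S_N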
Variational approximation
The main idea is to find a lower bound on the logistic sigmoid, and hence on the joint distribution, that takes the form of the exponential of a quadratic function of $\mathbf{w}$; this bound is then used to obtain a Gaussian approximation to the posterior distribution.
The lower bound[4] on the logistic sigmoid function is
$$\sigma(x) \ge \sigma(\xi) \exp\!\left\{ \frac{x - \xi}{2} - \lambda(\xi)\,(x^2 - \xi^2) \right\},$$
where $\lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \tfrac{1}{2}\right]$ and $\xi$ is a point near $x$ regarded as a variational parameter.
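As a quick numerical sanity check of this bound, the following sketch (illustrative names only) verifies that the bound never exceeds the sigmoid and is exact at $x = \pm\xi$:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_lower_bound(x, xi):
    """Jaakkola-Jordan style lower bound on sigmoid(x) with variational parameter xi."""
    lam = (sigmoid(xi) - 0.5) / (2.0 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam * (x**2 - xi**2))

x = np.linspace(-6.0, 6.0, 7)
for xi in (0.5, 2.0, 4.0):
    # The bound never exceeds the sigmoid ...
    assert np.all(sigmoid_lower_bound(x, xi) <= sigmoid(x) + 1e-12)
    # ... and is tight at x = +xi and x = -xi
    assert np.isclose(sigmoid_lower_bound(xi, xi), sigmoid(xi))
    assert np.isclose(sigmoid_lower_bound(-xi, xi), sigmoid(-xi))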
Writing the conditional distribution of each observation as $p(t_n \mid \mathbf{w}) = e^{a_n t_n} \sigma(-a_n)$ with $a_n = \mathbf{w}^{\mathsf{T}} \phi_n$, and applying the bound to $\sigma(-a_n)$, then
$$p(t_n \mid \mathbf{w}) \ge e^{a_n t_n} \sigma(\xi_n) \exp\!\left\{ -\frac{a_n + \xi_n}{2} - \lambda(\xi_n)\,(a_n^2 - \xi_n^2) \right\}.$$
Thus, the joint distribution has the lower bound
$$p(\mathbf{t}, \mathbf{w}) = p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w}) \ge h(\mathbf{w}, \boldsymbol{\xi})\, p(\mathbf{w}), \qquad h(\mathbf{w}, \boldsymbol{\xi}) = \prod_{n=1}^{N} \sigma(\xi_n) \exp\!\left\{ a_n t_n - \frac{a_n + \xi_n}{2} - \lambda(\xi_n)\,(a_n^2 - \xi_n^2) \right\}.$$
Note that the right-hand side becomes a proper probability density once normalized.
With the assumption of the Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$, the inequality becomes
$$\ln p(\mathbf{t}, \mathbf{w}) \ge \ln p(\mathbf{w}) + \sum_{n=1}^{N} \left\{ \ln \sigma(\xi_n) + a_n t_n - \frac{a_n + \xi_n}{2} - \lambda(\xi_n)\,(a_n^2 - \xi_n^2) \right\}.$$
The lower bound is a quadratic function of $\mathbf{w}$, and therefore the variational approximation to the posterior distribution can be identified as the Gaussian
$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),$$
where
$$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \sum_{n=1}^{N} \left( t_n - \tfrac{1}{2} \right) \phi_n \right), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + 2 \sum_{n=1}^{N} \lambda(\xi_n)\, \phi_n \phi_n^{\mathsf{T}}.$$
The posterior covariance matrix still depends on the unknown variational parameters $\xi_n$ through $\lambda(\xi_n)$. The EM algorithm[4] can be used to optimize these parameters and find the values that make the lower bound tight.
Note that the bound given above applies only to the two-class problem. For problems with more than two classes, an alternative bound has been studied by Gibbs.[5]
Optimizing the variational parameters
The variational parameters can be determined by maximizing the lower bound on the marginal likelihood,
$$\ln p(\mathbf{t}) = \ln \int p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w} \ge \ln \int h(\mathbf{w}, \boldsymbol{\xi})\, p(\mathbf{w})\, d\mathbf{w} = \mathcal{L}(\boldsymbol{\xi}).$$
A convenient approach is to view $\mathbf{w}$ as a latent variable and apply the EM algorithm, starting from some initial values of the parameters $\{\xi_n\}$. In the E step, these parameter values are used to obtain the posterior distribution $q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$. In the M step, the expected complete-data log likelihood is given by
$$Q(\boldsymbol{\xi}, \boldsymbol{\xi}^{\text{old}}) = \mathbb{E}\!\left[ \ln \left\{ h(\mathbf{w}, \boldsymbol{\xi})\, p(\mathbf{w}) \right\} \right],$$
where the expectation is taken with respect to the posterior distribution $q(\mathbf{w})$ evaluated using $\boldsymbol{\xi}^{\text{old}}$.
Setting the derivative with respect to $\xi_n$ equal to zero gives
$$\lambda'(\xi_n) \left( \mathbb{E}\!\left[ (\mathbf{w}^{\mathsf{T}} \phi_n)^2 \right] - \xi_n^2 \right) = 0.$$
Note that $\lambda'(\xi) \ne 0$ for $\xi \ne 0$, so that
$$(\xi_n^{\text{new}})^2 = \mathbb{E}\!\left[ (\mathbf{w}^{\mathsf{T}} \phi_n)^2 \right] = \phi_n^{\mathsf{T}} \left( \mathbf{S}_N + \mathbf{m}_N \mathbf{m}_N^{\mathsf{T}} \right) \phi_n.$$
The E and M steps are repeated until a suitable convergence criterion is met. Each EM update yields a monotone improvement of the lower bound, and hence of the posterior approximation.
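Putting the E and M steps together, a minimal NumPy sketch of the variational EM iteration might look as follows; it assumes the same Phi, t, m0 and S0 as in the Laplace sketch above, and all names are illustrative.

import numpy as np

def lam(xi):
    """lambda(xi) = (sigma(xi) - 1/2) / (2 * xi) from the sigmoid lower bound."""
    return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

def variational_em(Phi, t, m0, S0, n_iter=50):
    """Alternate between the Gaussian variational posterior N(m_N, S_N)
    for fixed xi (E step) and the xi updates (M step)."""
    S0_inv = np.linalg.inv(S0)
    xi = np.ones(Phi.shape[0])                           # initial variational parameters
    for _ in range(n_iter):
        # E step: posterior for the current xi
        L = lam(xi)
        S_N = np.linalg.inv(S0_inv + 2.0 * Phi.T @ (Phi * L[:, None]))
        m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
        # M step: xi_n^2 = phi_n^T (S_N + m_N m_N^T) phi_n
        A = S_N + np.outer(m_N, m_N)
        xi = np.sqrt(np.einsum('nd,de,ne->n', Phi, A, Phi))
    return m_N, S_N, xi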
Predictive distribution
In the two-class classification problem, the predictive distribution for class $C_1$ given a new feature vector $\phi(\mathbf{x})$ is approximated by marginalizing with respect to the posterior distribution, itself approximated by a Gaussian $q(\mathbf{w})$:
$$p(C_1 \mid \phi, \mathbf{t}) = \int p(C_1 \mid \phi, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w} \approx \int \sigma(\mathbf{w}^{\mathsf{T}} \phi)\, q(\mathbf{w})\, d\mathbf{w}.$$
Denoting $a = \mathbf{w}^{\mathsf{T}} \phi$,
$$\sigma(\mathbf{w}^{\mathsf{T}} \phi) = \int \delta(a - \mathbf{w}^{\mathsf{T}} \phi)\, \sigma(a)\, da,$$
where $\delta(\cdot)$ is the Dirac delta function. Therefore, the approximate predictive distribution has the form
$$p(C_1 \mid \phi, \mathbf{t}) \approx \int \sigma(a)\, p(a)\, da, \qquad p(a) = \int \delta(a - \mathbf{w}^{\mathsf{T}} \phi)\, q(\mathbf{w})\, d\mathbf{w}.$$
Here $p(a)$ can be regarded as the result of integrating out all directions orthogonal to $\phi$ from the joint distribution $q(\mathbf{w})$, and is therefore Gaussian.[1] Its mean and variance are
$$\mu_a = \mathbf{m}_N^{\mathsf{T}} \phi, \qquad \sigma_a^2 = \phi^{\mathsf{T}} \mathbf{S}_N \phi.$$
Thus, the variational approximation to the predictive distribution is
$$p(C_1 \mid \phi, \mathbf{t}) \approx \int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da.$$
A good approximation of this integral can be obtained by using the relationship between the probit function and the logistic sigmoid function.[6][2] With the rescaling parameter $\lambda^2 = \pi/8$, $\sigma(a)$ can be approximated by $\Phi(\lambda a)$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function. Since the integral of a probit function with respect to a Gaussian can be written analytically as another probit function, the approximation to the predictive distribution, which is the integral of a logistic sigmoid with respect to a Gaussian, can be written as another logistic sigmoid:
$$\int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da \approx \sigma\!\left( \kappa(\sigma_a^2)\, \mu_a \right), \qquad \kappa(\sigma^2) = \left( 1 + \pi \sigma^2 / 8 \right)^{-1/2}.$$
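A small sketch of this predictive computation, assuming a posterior mean m_N and covariance S_N obtained from either approximation above (names illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_probability(phi, m_N, S_N):
    """Approximate p(C1 | phi, t) = integral of sigmoid(a) N(a | mu_a, sigma_a^2) da
    via the probit-based approximation sigmoid(kappa(sigma_a^2) * mu_a)."""
    mu_a = m_N @ phi                                    # mean of a = w^T phi
    sigma2_a = phi @ S_N @ phi                          # variance of a
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return sigmoid(kappa * mu_a)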
The Laplace and variational approximations to the posterior distribution therefore lead to predictive distributions of the same form.
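For illustration, the sketches above can be combined on toy data to compare the two predictive distributions; this assumes the illustrative functions laplace_approximation, variational_em and predictive_probability defined earlier are in scope, and the data-generating setup is entirely made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 200 points, D = 3 features (including a bias column)
N, D = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-Phi @ w_true))).astype(float)

m0, S0 = np.zeros(D), 10.0 * np.eye(D)

# Posterior approximations from the earlier illustrative sketches
w_map, S_lap = laplace_approximation(Phi, t, m0, S0)
m_N, S_var, xi = variational_em(Phi, t, m0, S0)

# Both approximate posteriors plug into the same predictive formula
phi_new = np.array([1.0, 0.3, -0.8])
print(predictive_probability(phi_new, w_map, S_lap))    # Laplace-based predictive
print(predictive_probability(phi_new, m_N, S_var))      # variational predictive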
References
This article "Bayesian Logistic Regression" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Bayesian Logistic Regression. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.