Bayesian logistic regression[1] is the application of Bayesian inference to logistic regression. Exact evaluation of the posterior distribution and the predictive distribution is intractable. However, the Laplace approximation[2][3] and the variational approximation[4] can be applied to derive a Gaussian representation of the posterior distribution. In particular, the greater flexibility of the variational approximation, compared with the Laplace approach, results in improved accuracy. The predictive distribution for new data points can then be evaluated from the approximate posterior.
Posterior distribution
Laplace approximation
The Laplace approximation, also known as the S–L approximation (Spiegelhalter and Lauritzen, 1990), finds the mode of the posterior distribution and then fits a Gaussian centered at that point.
Starting with a Gaussian prior is natural when seeking a Gaussian approximation to the posterior distribution:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$
In the two-class classification problem, the likelihood function is
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = \sigma(\mathbf{w}^{\mathsf{T}} \phi_n),$$
where $t_n \in \{0, 1\}$ is the class label and $\phi_n = \phi(\mathbf{x}_n)$ is the feature vector.
The log posterior distribution over $\mathbf{w}$ is then
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\tfrac{1}{2} (\mathbf{w} - \mathbf{m}_0)^{\mathsf{T}} \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \text{const}.$$
The maximum a posteriori (MAP) solution $\mathbf{w}_{\mathrm{MAP}}$ is taken as the mean, and the negative Hessian of the log posterior,
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^{\mathsf{T}},$$
with $y_n$ evaluated at $\mathbf{w}_{\mathrm{MAP}}$, gives the inverse covariance matrix of the Gaussian. Therefore, the Gaussian approximation to the posterior distribution has the form
$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{S}_N).$$
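The procedure above can be summarized in a short numerical sketch. The following is a minimal NumPy illustration of the Laplace approximation, assuming a design matrix Phi of feature vectors, binary targets t, and a Gaussian prior with mean m0 and covariance S0; the function and variable names are illustrative rather than taken from any particular library.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_approximation(Phi, t, m0, S0, n_iter=100, tol=1e-8):
    """Gaussian (Laplace) approximation N(w_MAP, S_N) to the posterior of
    Bayesian logistic regression with Gaussian prior N(m0, S0)."""
    S0_inv = np.linalg.inv(S0)

    def grad_and_hess(w):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + S0_inv @ (w - m0)       # gradient of the negative log posterior
        R = y * (1.0 - y)                                # Bernoulli variances y_n (1 - y_n)
        H = Phi.T @ (Phi * R[:, None]) + S0_inv          # Hessian of the negative log posterior
        return grad, H

    # Newton's method to find the posterior mode w_MAP
    w = m0.astype(float)
    for _ in range(n_iter):
        grad, H = grad_and_hess(w)
        step = np.linalg.solve(H, grad)
        w = w - step
        if np.linalg.norm(step) < tol:
            break

    _, H = grad_and_hess(w)                              # curvature at the mode
    S_N = np.linalg.inv(H)                               # posterior covariance
    return w, S_N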
Variational approximation
The main idea is to find a lower bound on the logistic sigmoid, and hence on the joint distribution, that takes the form of the exponential of a quadratic function of $\mathbf{w}$; this bound is then used to obtain a Gaussian approximation to the posterior distribution.
The lower bound[4] on the logistic sigmoid function is
$$\sigma(x) \ge \sigma(\xi) \exp\!\left\{ \frac{x - \xi}{2} - \lambda(\xi)\,(x^2 - \xi^2) \right\},$$
where $\lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \tfrac{1}{2}\right]$ and $\xi$ is a point near $x$ regarded as a variational parameter.
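As a quick numerical sanity check of this bound, the following sketch (illustrative names only) verifies that the bound never exceeds the sigmoid and is exact at $x = \pm\xi$:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_lower_bound(x, xi):
    """Jaakkola-Jordan style lower bound on sigmoid(x) with variational parameter xi."""
    lam = (sigmoid(xi) - 0.5) / (2.0 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam * (x**2 - xi**2))

x = np.linspace(-6.0, 6.0, 7)
for xi in (0.5, 2.0, 4.0):
    # The bound never exceeds the sigmoid ...
    assert np.all(sigmoid_lower_bound(x, xi) <= sigmoid(x) + 1e-12)
    # ... and is tight at x = +xi and x = -xi
    assert np.isclose(sigmoid_lower_bound(xi, xi), sigmoid(xi))
    assert np.isclose(sigmoid_lower_bound(-xi, xi), sigmoid(-xi))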
Writing the conditional distribution of each observation as $p(t_n \mid \mathbf{w}) = e^{a_n t_n} \sigma(-a_n)$ with $a_n = \mathbf{w}^{\mathsf{T}} \phi_n$, and applying the bound to $\sigma(-a_n)$, then
$$p(t_n \mid \mathbf{w}) \ge e^{a_n t_n} \sigma(\xi_n) \exp\!\left\{ -\frac{a_n + \xi_n}{2} - \lambda(\xi_n)\,(a_n^2 - \xi_n^2) \right\}.$$
Thus, the joint distribution has the lower bound
$$p(\mathbf{t}, \mathbf{w}) = p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w}) \ge h(\mathbf{w}, \boldsymbol{\xi})\, p(\mathbf{w}), \qquad h(\mathbf{w}, \boldsymbol{\xi}) = \prod_{n=1}^{N} \sigma(\xi_n) \exp\!\left\{ a_n t_n - \frac{a_n + \xi_n}{2} - \lambda(\xi_n)\,(a_n^2 - \xi_n^2) \right\}.$$
Note that the right-hand side becomes a proper probability density once normalized.
With the assumption of the Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$, the inequality becomes
$$\ln p(\mathbf{t}, \mathbf{w}) \ge \ln p(\mathbf{w}) + \sum_{n=1}^{N} \left\{ \ln \sigma(\xi_n) + a_n t_n - \frac{a_n + \xi_n}{2} - \lambda(\xi_n)\,(a_n^2 - \xi_n^2) \right\}.$$
The lower bound is a quadratic function of $\mathbf{w}$, and therefore the variational approximation to the posterior distribution can be identified as the Gaussian
$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),$$
where
$$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \sum_{n=1}^{N} \left( t_n - \tfrac{1}{2} \right) \phi_n \right), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + 2 \sum_{n=1}^{N} \lambda(\xi_n)\, \phi_n \phi_n^{\mathsf{T}}.$$
The posterior covariance matrix still depends on the unknown variational parameters $\xi_n$ through $\lambda(\xi_n)$. The EM algorithm[4] can be used to optimize these parameters and find the values that make the lower bound tight.
Note that the bound given above applies only to the two-class problem. For problems with more than two classes, an alternative bound has been studied by Gibbs.[5]
Optimizing the variational parameters
The variational parameters can be determined by maximizing the lower bound on the marginal likelihood,
$$\ln p(\mathbf{t}) = \ln \int p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w} \ge \ln \int h(\mathbf{w}, \boldsymbol{\xi})\, p(\mathbf{w})\, d\mathbf{w} = \mathcal{L}(\boldsymbol{\xi}).$$
A convenient approach is to view $\mathbf{w}$ as a latent variable and apply the EM algorithm, starting from some initial values of the parameters $\{\xi_n\}$. In the E step, these parameter values are used to obtain the posterior distribution $q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$. In the M step, the expected complete-data log likelihood is given by
$$Q(\boldsymbol{\xi}, \boldsymbol{\xi}^{\text{old}}) = \mathbb{E}\!\left[ \ln \left\{ h(\mathbf{w}, \boldsymbol{\xi})\, p(\mathbf{w}) \right\} \right],$$
where the expectation is taken with respect to the posterior distribution $q(\mathbf{w})$ evaluated using $\boldsymbol{\xi}^{\text{old}}$.
Setting the derivative with respect to $\xi_n$ equal to zero gives
$$\lambda'(\xi_n) \left( \mathbb{E}\!\left[ (\mathbf{w}^{\mathsf{T}} \phi_n)^2 \right] - \xi_n^2 \right) = 0.$$
Note that $\lambda'(\xi) \ne 0$ for $\xi \ne 0$, so that
$$(\xi_n^{\text{new}})^2 = \mathbb{E}\!\left[ (\mathbf{w}^{\mathsf{T}} \phi_n)^2 \right] = \phi_n^{\mathsf{T}} \left( \mathbf{S}_N + \mathbf{m}_N \mathbf{m}_N^{\mathsf{T}} \right) \phi_n.$$
The E and M steps are repeated until a suitable convergence criterion is met. Each EM update yields a monotone improvement of the lower bound, and hence of the posterior approximation.
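Putting the E and M steps together, a minimal NumPy sketch of the variational EM iteration might look as follows; it assumes the same Phi, t, m0 and S0 as in the Laplace sketch above, and all names are illustrative.

import numpy as np

def lam(xi):
    """lambda(xi) = (sigma(xi) - 1/2) / (2 * xi) from the sigmoid lower bound."""
    return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

def variational_em(Phi, t, m0, S0, n_iter=50):
    """Alternate between the Gaussian variational posterior N(m_N, S_N)
    for fixed xi (E step) and the xi updates (M step)."""
    S0_inv = np.linalg.inv(S0)
    xi = np.ones(Phi.shape[0])                           # initial variational parameters
    for _ in range(n_iter):
        # E step: posterior for the current xi
        L = lam(xi)
        S_N = np.linalg.inv(S0_inv + 2.0 * Phi.T @ (Phi * L[:, None]))
        m_N = S_N @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
        # M step: xi_n^2 = phi_n^T (S_N + m_N m_N^T) phi_n
        A = S_N + np.outer(m_N, m_N)
        xi = np.sqrt(np.einsum('nd,de,ne->n', Phi, A, Phi))
    return m_N, S_N, xi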
Predictive distribution
In the two-class classification problem, the predictive distribution for class $C_1$ given a new feature vector $\phi(\mathbf{x})$ is approximated by marginalizing with respect to the posterior distribution, itself approximated by a Gaussian $q(\mathbf{w})$:
$$p(C_1 \mid \phi, \mathbf{t}) = \int p(C_1 \mid \phi, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w} \approx \int \sigma(\mathbf{w}^{\mathsf{T}} \phi)\, q(\mathbf{w})\, d\mathbf{w}.$$
Denoting $a = \mathbf{w}^{\mathsf{T}} \phi$,
$$\sigma(\mathbf{w}^{\mathsf{T}} \phi) = \int \delta(a - \mathbf{w}^{\mathsf{T}} \phi)\, \sigma(a)\, da,$$
where $\delta(\cdot)$ is the Dirac delta function. Therefore, the approximate predictive distribution has the form
$$p(C_1 \mid \phi, \mathbf{t}) \approx \int \sigma(a)\, p(a)\, da, \qquad p(a) = \int \delta(a - \mathbf{w}^{\mathsf{T}} \phi)\, q(\mathbf{w})\, d\mathbf{w}.$$
Here $p(a)$ can be regarded as the result of integrating out all directions orthogonal to $\phi$ from the joint distribution $q(\mathbf{w})$, and is therefore Gaussian.[1] Its mean and variance are
$$\mu_a = \mathbf{m}_N^{\mathsf{T}} \phi, \qquad \sigma_a^2 = \phi^{\mathsf{T}} \mathbf{S}_N \phi.$$
Thus, the variational approximation to the predictive distribution is
$$p(C_1 \mid \phi, \mathbf{t}) \approx \int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da.$$
A good approximation of this integral can be obtained by using the relationship between the probit function and the logistic sigmoid function.[6][2] With the rescaling parameter $\lambda^2 = \pi/8$, $\sigma(a)$ can be approximated by $\Phi(\lambda a)$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function. Since the integral of a probit function with respect to a Gaussian can be written analytically as another probit function, the approximation to the predictive distribution, which is the integral of a logistic sigmoid with respect to a Gaussian, can be written as another logistic sigmoid:
$$\int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da \approx \sigma\!\left( \kappa(\sigma_a^2)\, \mu_a \right), \qquad \kappa(\sigma^2) = \left( 1 + \pi \sigma^2 / 8 \right)^{-1/2}.$$
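A small sketch of this predictive computation, assuming a posterior mean m_N and covariance S_N obtained from either approximation above (names illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_probability(phi, m_N, S_N):
    """Approximate p(C1 | phi, t) = integral of sigmoid(a) N(a | mu_a, sigma_a^2) da
    via the probit-based approximation sigmoid(kappa(sigma_a^2) * mu_a)."""
    mu_a = m_N @ phi                                    # mean of a = w^T phi
    sigma2_a = phi @ S_N @ phi                          # variance of a
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return sigmoid(kappa * mu_a)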
The Laplace and variational approximations to the posterior distribution therefore lead to predictive distributions of the same form.
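For illustration, the sketches above can be combined on toy data to compare the two predictive distributions; this assumes the illustrative functions laplace_approximation, variational_em and predictive_probability defined earlier are in scope, and the data-generating setup is entirely made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 200 points, D = 3 features (including a bias column)
N, D = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-Phi @ w_true))).astype(float)

m0, S0 = np.zeros(D), 10.0 * np.eye(D)

# Posterior approximations from the earlier illustrative sketches
w_map, S_lap = laplace_approximation(Phi, t, m0, S0)
m_N, S_var, xi = variational_em(Phi, t, m0, S0)

# Both approximate posteriors plug into the same predictive formula
phi_new = np.array([1.0, 0.3, -0.8])
print(predictive_probability(phi_new, w_map, S_lap))    # Laplace-based predictive
print(predictive_probability(phi_new, m_N, S_var))      # variational predictive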
References
This article "Bayesian Logistic Regression" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Bayesian Logistic Regression. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.