Reparametrization trick

In machine learning, the reparametrization trick is a method to construct unbiased estimators of gradients, typically used for performing gradient descent on expectations.

It was introduced independently^[1] to variational inference by Kingma & Welling (2014)^[2], Rezende et al. (2014)^[3], and Titsias & Lázaro-Gredilla (2014)^[4].

Mathematical principle

A problem commonly encountered in machine learning is estimating a gradient-of-expectation. The problem is as follows:

Given a family of distributions^{[note 1]} $\{p_{θ}\}_{θ \in Θ}$, where each $θ$ defines a probability measure $p_{θ} (x) d x$ on a base space $𝒳$, and given any nice (nice enough to allow differentiating under the integral sign) function $f_{θ} : 𝒳 \to ℝ$, the gradient-of-expectation is:

$\nabla_{θ} E_{x \sim p_{θ}} [f_{θ} (x)] = \int \nabla_{θ} f_{θ} (x) p_{θ} (x) d x + \int f_{θ} (x) \nabla_{θ} p_{θ} (x) d x$

(main)

The key problem is to estimate the gradient-of-expectation efficiently. There are several possible solutions.^{[note 2]}

Analytic

If the expectation has a closed-form solution, then its gradient also has a closed-form solution. This is essentially the only method used by statisticians before large-scale computing was available.

There are several problems with this. One, the closed-form solution rarely exists. Two, the closed-form solution might be very expensive to compute, especially when $Θ$ has a large dimension (curse of dimensionality). The typical solution for such intractable problems is the Monte Carlo method.
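As a small illustration of a case where the analytic route does apply (an example chosen here for illustration, not taken from the cited sources), take $p_{θ} = N (μ, σ^{2})$ with $θ = (μ, σ)$ and $f (x) = x^{2}$. The expectation and its gradient are then available in closed form:

$𝔼_{x \sim N (μ, σ^{2})} [x^{2}] = μ^{2} + σ^{2}, \qquad \nabla_{(μ, σ)} 𝔼_{x \sim N (μ, σ^{2})} [x^{2}] = (2 μ, 2 σ)$

Such a closed form is convenient as a reference value when checking a sampling-based gradient estimator.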

Monte Carlo method

The naive method of estimating Equation (main) is to sample many $x_{i} \sim p_{θ}$, estimate $𝔼_{x \sim p_{θ}} [f_{θ} (x)]$, then vary $θ$ slightly to $θ + δ θ_{j}$, estimate $𝔼_{x \sim p_{θ + δ θ_{j}}} [f_{θ + δ θ_{j}} (x)]$, and so on, and finally fit a linear operator $L$ such that $𝔼_{x \sim p_{θ + δ θ_{j}}} [f_{θ + δ θ_{j}} (x)] - 𝔼_{x \sim p_{θ}} [f_{θ} (x)] \approx L (δ θ_{j})$ for all $j$. This is highly inefficient. To save time, one should instead compute the gradient exactly where it is tractable, rather than estimating the gradient everywhere. In most commonly used examples, while $\nabla_{θ} 𝔼_{x \sim p_{θ}} [f_{θ} (x)]$ is intractable, $\nabla_{θ} f_{θ} (x)$ and $\nabla_{θ} p_{θ} (x)$ are tractable (they are often designed by practitioners to be tractable).

As such, we consider the following expansion of Equation (main):

$\nabla_{θ} E_{x \sim p_{θ}} [f_{θ} (x)] = \int \nabla_{θ} f_{θ} (x) p_{θ} (x) d x + \int f_{θ} (x) \nabla_{θ} p_{θ} (x) d x$

(Monte Carlo)

Instead of estimating the integrals and then estimating the gradient, we now compute the gradients exactly and then estimate the integrals. The integrals are still intractable in general, but they can be estimated by Monte Carlo integration. To perform Monte Carlo integration, an integral $\int_{x \in 𝒳} g (x) d x$ must have a probability distribution to sample $x$ from. If we use $p_{θ}$ for Monte Carlo integration, we obtain

$\int \frac{\nabla_{θ} f_{θ} (x) p_{θ} (x) + f_{θ} (x) \nabla_{θ} p_{θ} (x)}{p_{θ} (x)} p_{θ} (x) d x = E_{x \sim p_{θ}} [\nabla_{θ} f_{θ} (x) + f_{θ} (x) \nabla_{θ} \ln p_{θ} (x)]$

and thus

$\nabla_{θ} E_{x \sim p_{θ}} [f_{θ} (x)] = E_{x \sim p_{θ}} [\nabla_{θ} f_{θ} (x) + f_{θ} (x) \nabla_{θ} \ln p_{θ} (x)]$

(REINFORCE)

This is the equation used in policy gradient methods, to be detailed below.
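A minimal numerical sketch of Equation (REINFORCE), under assumptions chosen here for illustration: $p_{θ} = N (μ, σ^{2})$ with $θ = (μ, σ)$, and $f (x) = x^{2}$, which does not depend on $θ$, so the $\nabla_{θ} f_{θ}$ term vanishes. The function name `score_function_gradient` is hypothetical. The exact gradient $(2 μ, 2 σ)$ follows from $𝔼 [x^{2}] = μ^{2} + σ^{2}$ and serves as a reference value.

```python
import numpy as np

def score_function_gradient(mu, sigma, n_samples, rng):
    """Estimate the gradient of E_{x ~ N(mu, sigma^2)}[x^2] w.r.t. (mu, sigma)."""
    x = rng.normal(mu, sigma, size=n_samples)
    f = x ** 2  # f does not depend on theta, so grad_theta f = 0
    # Gradient of ln p_theta(x) for the normal density.
    score_mu = (x - mu) / sigma ** 2
    score_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    # Monte Carlo average of f(x) * grad_theta ln p_theta(x).
    return np.array([np.mean(f * score_mu), np.mean(f * score_sigma)])

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5
grad_estimate = score_function_gradient(mu, sigma, n_samples=200_000, rng=rng)
exact_gradient = np.array([2 * mu, 2 * sigma])  # gradient of mu^2 + sigma^2
print(grad_estimate, exact_gradient)
```

The estimate is unbiased, but each sample of $f (x) \nabla_{θ} \ln p_{θ} (x)$ can have a large variance, which is why many samples are used here.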

The reparametrization trick

The main issue in estimating Equation (main) is an "entanglement" between the distribution $x \sim p_{θ}$ and the expectation of $f_{θ} (x)$ to be estimated. The entanglement comes to the fore when we vary $θ$. The reparametrization trick pushes all dependence on $θ$ into a deterministic function, and then performs the obvious interchange of gradient and expectation:

$\nabla_{θ} 𝔼_{x \sim p} [f_{θ} (x)] = 𝔼_{x \sim p} [\nabla_{θ} f_{θ} (x)]$

The prototypical example is the family of 1D normal distributions: $\{N (μ, σ^{2})\}_{μ \in ℝ, σ > 0}$. Given any $μ \in ℝ, σ > 0$, we can sample $x \sim N (μ, σ^{2})$ as $x = f_{μ, σ} (x^{'})$, with $x^{'} \sim N (0, 1)$ and $f_{μ, σ} (x) = μ + σ x$.

Remark: The idea is similar to probability transforms such as the Box–Muller transform, where we have only one "seed" random number generator, and must construct other probability distributions by performing deterministic transforms on the random numbers generated by it.

In general, given a family of distributions $\{p_{θ}\}_{θ \in Θ}$, we can perform the reparametrization trick if there exists a seed distribution $p$ and a transform function $g_{θ} (x)$ such that for any $θ \in Θ$, we can sample from $p_{θ}$ by sampling $x \sim p$ and then computing $g_{θ} (x)$. That is, $p_{θ} = p \circ g_{θ}^{- 1}$. With the seed distribution and the transform, we obtain the reparametrization trick equation:

$\nabla_{θ} E_{x \sim p_{θ}} [f_{θ} (x)] = \nabla_{θ} E_{x \sim p} [f_{θ} (g_{θ} (x))] = E_{x \sim p} [\nabla_{θ} f_{θ} (g_{θ} (x))]$

(reparametrization trick)
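The reparametrization trick equation can be sketched numerically for the 1D normal family. The setup is an assumption made for illustration: seed distribution $p = N (0, 1)$, transform $g_{μ, σ} (x^{'}) = μ + σ x^{'}$, and test function $f (x) = x^{2}$. The name `reparam_gradient` is hypothetical, and the derivative of $f$ is written out by hand rather than obtained by automatic differentiation.

```python
import numpy as np

def reparam_gradient(mu, sigma, n_samples, rng):
    """Estimate the gradient of E_{x ~ N(mu, sigma^2)}[x^2] w.r.t. (mu, sigma)."""
    eps = rng.standard_normal(n_samples)  # samples from the seed distribution N(0, 1)
    x = mu + sigma * eps                  # deterministic transform g_theta(eps)
    df_dx = 2 * x                         # derivative of f(x) = x^2
    # Chain rule: dx/dmu = 1 and dx/dsigma = eps.
    return np.array([np.mean(df_dx), np.mean(df_dx * eps)])

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5
grad_estimate = reparam_gradient(mu, sigma, n_samples=200_000, rng=rng)
exact_gradient = np.array([2 * mu, 2 * sigma])  # gradient of mu^2 + sigma^2
print(grad_estimate, exact_gradient)
```

All randomness enters through `eps`, whose distribution does not depend on $(μ, σ)$, so the gradient passes through the expectation and only the derivative of $f$ along the sample path is needed.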

Motivation

Many problems in statistics and machine learning are of the form: find the "best" parameters. Usually, "best" is defined as "achieving minimal loss" or "maximum reward", and taking the gradient, if possible, is useful for optimization.

Parameter estimation in statistics

For example, consider the parameter estimation problem: we are given data $x \in 𝒳$ sampled from a distribution $p_{θ_{0}}$, or at least from a distribution close enough to some $p_{θ_{0}}$, and the problem is to estimate $θ_{0}$. As we will see below, this often reduces to solving one of the following problems: $\min_{θ} 𝔼_{x \sim p} [f_{θ} (x)]; \min_{θ} 𝔼_{x \sim p_{θ}} [f (x)]$

Treating the data as a single point $x$ is no loss of generality, because if we have a sequence of independently sampled data $x_{1}, \dots, x_{n} \in 𝒳$, then we can simply consider them as one big data point $x := (x_{1}, \dots, x_{n}) \in 𝒳^{n}$ with distribution $p_{θ_{0}}^{\otimes n}$: $p_{θ_{0}}^{\otimes n} (x) d x = p_{θ_{0}} (x_{1}) \dots p_{θ_{0}} (x_{n}) d x_{1} \dots d x_{n}$

First approach (Blackwell 1951)^[5]^[6]: find a distribution that can best mock up the observed samples. $\min_{θ} 𝔼_{x^{'} \sim p_{θ}} [d (x, x^{'})]$ where $d : 𝒳 \times 𝒳 \to [0, \infty)$ measures the difference between two points in $𝒳$. It can be a metric, or be more general.

Set $f (x^{'}) : = d (x, x^{'})$ , then this problem reduces to: $\min_{θ} 𝔼_{x^{'} \sim p_{θ}} [f (x^{'})]$

Second approach (frequentist estimation): define a loss function $L : Θ \times Θ \to [0, \infty)$ , which measures the difference between two "points" (each point is a probability distribution, very big points indeed) in the space of distributions under consideration.

Then, minimize estimation risk (expected loss): $\min_{ζ \in Z} 𝔼_{x \sim p_{θ_{0}}} [L ({\hat{θ}}_{ζ} (x), θ_{0})]$ where ${\hat{θ}}_{ζ} : 𝒳 \to Θ$ is an estimator, parametrized by $ζ \in Z$ .

The above definition is not very interesting, since we could define the following "blind guess" estimator $\hat{θ} (x) = θ_{1}$ . It would happen to be exactly right if $θ_{1} = θ_{0}$ . Thus, we must additionally require the estimator to perform well on many different possible $θ_{0}$ .

There are many possible ways to formalize the idea of "perform well on many different $θ_{0}$ ". A common version is the following: $\min_{ζ \in Z} \max_{θ_{0} \in Θ} 𝔼_{x \sim p_{θ_{0}}} [L ({\hat{θ}}_{ζ} (x), θ_{0})]$ Set $f_{ζ} (x, θ_{0}) = L ({\hat{θ}}_{ζ} (x), θ_{0})$ , then this problem reduces to: $\min_{ζ \in Z} (\max_{θ_{0} \in Θ} 𝔼_{x \sim p_{θ_{0}}} [f_{ζ} (x, θ_{0})])$ which is almost in the form of $\min_{θ} 𝔼_{x \sim p} [f_{θ} (x)]$ , but not quite.

For example, when $Θ$ is a subset of $ℝ^{n}$, $L (θ, θ_{0}) := ‖ θ - θ_{0} ‖^{2}$, and the set of estimators $\{{\hat{θ}}_{ζ}\}_{ζ \in Z}$ contains only unbiased estimators, then if the minimum-variance unbiased estimator exists, it is the solution to the above problem.^{[note 3]}

Third approach (Bayesian estimation): Imposing a maximum gives frequentist estimation a kind of "game-theoretic" flavor, since it is formally equivalent to a zero-sum game between a statistician and nature. The statistician proposes an estimator ${\hat{θ}}_{ζ}$ , and nature replies with $p_{θ_{0}}$ .

However, nature is uninterested, as the statistician's choice of ${\hat{θ}}_{ζ}$ has no effect on $p_{θ_{0}}$. Consequently, Bayesian estimation models this by imposing a prior distribution $μ$ over $Θ$, and optimizing the following: $\min_{ζ \in Z} 𝔼_{θ \sim μ} [𝔼_{x \sim p_{θ}} [L ({\hat{θ}}_{ζ} (x), θ)]]$ Set $f_{ζ} (x, θ) = L ({\hat{θ}}_{ζ} (x), θ)$; then this problem reduces to: $\min_{ζ \in Z} 𝔼_{θ \sim μ, x \sim p_{θ}} [f_{ζ} (x, θ)]$ which is in the form of $\min_{θ} 𝔼_{x \sim p} [f_{θ} (x)]$.

Reinforcement learning

In reinforcement learning, there is an "entanglement" between the distribution and the function.

To perform a gradient descent, one must estimate the gradient.

Main methods

There are many different ways to perform the reparametrization trick, for diverse purposes.

Reparametrizing a distribution family

Given a family of distributions $\{p_{θ}\}_{θ}$, we can apply the reparametrization trick if we have a way to decompose the family into a constant "seed random generator" and a family of parametrized deterministic functions.

For example, the family of normal distributions on $ℝ^{n}$ is $\{N (μ, Σ)\}_{μ \in ℝ^{n}, Σ ⪰ 0}$. Given any $μ \in ℝ^{n}, Σ ⪰ 0$, we can sample $x \sim N (μ, Σ)$ as $x = f_{μ, Σ} (x^{'})$, with $x^{'} \sim N (0, I_{n \times n})$ and $f_{μ, Σ} (x) = μ + M x$, where $M M^{T} = Σ$.

Since $Σ$ is non-negative-definite, $M$ is guaranteed to exist by the spectral theorem, but it is not unique. It can be found by Cholesky decomposition, or singular value decomposition. Different choices have different theoretical and practical advantages.^[7]
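A minimal sketch of this construction with $M$ taken from the Cholesky decomposition. The values of $μ$ and $Σ$ and the function name `sample_mvn` are illustrative assumptions. Note that `numpy.linalg.cholesky` requires $Σ$ to be strictly positive-definite, which is a stronger condition than $Σ ⪰ 0$.

```python
import numpy as np

def sample_mvn(mu, cov, n_samples, rng):
    """Sample from N(mu, cov) by transforming standard normal seeds."""
    M = np.linalg.cholesky(cov)  # lower-triangular factor with M @ M.T == cov
    seeds = rng.standard_normal((n_samples, mu.shape[0]))  # rows are x' ~ N(0, I)
    return mu + seeds @ M.T      # each row is mu + M x'

mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
rng = np.random.default_rng(0)
samples = sample_mvn(mu, cov, n_samples=100_000, rng=rng)

print(samples.mean(axis=0))          # close to mu
print(np.cov(samples, rowvar=False)) # close to cov
```

Because the seeds do not depend on $(μ, Σ)$, gradients with respect to $μ$ and the entries of $M$ can be taken through the affine map.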

Gumbel max tricks

The prototype of Gumbel tricks is the Gumbel-max trick, which allows one to create any categorical distribution using just a Gumbel distribution random number generator.
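A minimal sketch of the Gumbel-max trick, assuming the target categorical distribution is given by its probabilities $π_{i}$: add independent standard Gumbel noise $g_{i}$ to $\ln π_{i}$ and take the index of the largest value. The function name `gumbel_max_sample` and the example probabilities are hypothetical.

```python
import numpy as np

def gumbel_max_sample(log_probs, n_samples, rng):
    """Draw categorical samples as argmax_i (log_probs[i] + g_i), g_i ~ Gumbel(0, 1)."""
    g = rng.gumbel(loc=0.0, scale=1.0, size=(n_samples, log_probs.shape[0]))
    return np.argmax(log_probs + g, axis=1)

probs = np.array([0.5, 0.3, 0.2])
rng = np.random.default_rng(0)
draws = gumbel_max_sample(np.log(probs), n_samples=100_000, rng=rng)
frequencies = np.bincount(draws, minlength=probs.shape[0]) / draws.size
print(frequencies)  # empirical frequencies, close to probs
```

The same construction works with unnormalized log-probabilities, since adding a constant to every entry does not change the argmax.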

The Gumbel-max trick generalizes to other distributions:^[8]

Theorem — Gumbel

This provides a proof that the Gumbel, Weibull, and Fréchet distributions are max-stable, which is one-half of the extreme value theorem.

Gumbel softmax method

The Gumbel-max trick allows sampling from the categorical distribution, but it cannot be used for training with gradient descent, because $\nabla_{x} \arg \max_{i} (x_{i} + g_{i}) = 0$ wherever the gradient is defined. This is because $\arg \max_{i} (x_{i} + g_{i})$ is "hard", that is, insensitive to small variations of $x$. Intuitively speaking, this can be interpreted as saying that the model simply predicts "category $i$ is most likely" without saying by how much it is the most likely category.

To create better gradients, the model should predict a distribution over the categories, and the standard method is the softmax function, creating the Gumbel softmax method:^[1]^[9] $p_{i} := \frac{e^{β (x_{i} + g_{i})}}{\sum_{j} e^{β (x_{j} + g_{j})}}$ where the $g_{i}$ are independent Gumbel noise samples and $β > 0$ is an inverse temperature. After imposing a good loss function, such as the cross-entropy loss, one can train the method by standard gradient descent: $\mathrm{Loss} = - \ln p_{i} = - \ln \frac{e^{β (x_{i} + g_{i})}}{\sum_{j} e^{β (x_{j} + g_{j})}}$ where $i$ is the correct label.
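A minimal sketch of the Gumbel softmax formula and its cross-entropy loss for one fixed draw of the noise. The function names, the logits, and the value of $β$ are illustrative assumptions. The gradient $β (p - e_{i})$, with $e_{i}$ the one-hot vector of the correct label, is derived by hand from the loss formula and is nonzero in general, in contrast to the hard arg max.

```python
import numpy as np

def gumbel_softmax(x, g, beta):
    """Return p_i = exp(beta (x_i + g_i)) / sum_j exp(beta (x_j + g_j))."""
    z = beta * (x + g)
    z = z - z.max()  # subtract the max for numerical stability; p is unchanged
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(x, g, beta, label):
    """Cross-entropy loss -ln p_label and its gradient with respect to x."""
    p = gumbel_softmax(x, g, beta)
    one_hot = np.zeros_like(p)
    one_hot[label] = 1.0
    return -np.log(p[label]), beta * (p - one_hot)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, -1.0])  # hypothetical logits
g = rng.gumbel(size=x.shape)    # one draw of Gumbel noise, held fixed
beta = 2.0
p = gumbel_softmax(x, g, beta)
loss, grad = loss_and_grad(x, g, beta, label=0)
print(p, loss, grad)
```

Larger values of $β$ push $p$ toward a one-hot vector, recovering the hard Gumbel-max sample in the limit.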

Estimating bounds of partition functions

Inspired by the connection between information theory and statistical mechanics^[10], energy-based models, such as the Boltzmann machine and the deep belief network, are statistical models defined by an energy function (or "potential function") and a temperature.

Consider a set of discrete random variables $X_{1}, \dots, X_{n}$, each $X_{i}$ being the state of a particle $i$. Let the system of $n$ particles interact, and let the energy of the entire system be $ϕ (x_{1}, \dots, x_{n})$ when $X_{1} = x_{1}, \dots, X_{n} = x_{n}$. Then, when the system is in contact with a heat bath with temperature $T$, after reaching equilibrium, the distribution of the states of the system is the Boltzmann distribution: $\Pr (X_{1} = x_{1}, \dots, X_{n} = x_{n} | β) = \frac{e^{- β ϕ (x_{1}, \dots, x_{n})}}{\sum_{x'_{1}, \dots, x'_{n}} e^{- β ϕ (x'_{1}, \dots, x'_{n})}}$ where $β = \frac{1}{T}$ is the inverse temperature of the heat bath.

The normalizing constant $Z (β) = \sum_{x'_{1}, \dots, x'_{n}} e^{- β ϕ (x'_{1}, \dots, x'_{n})}$ is the partition function of the system. It depends on the temperature and the energy function.

In general, partition functions are intractable to compute exactly, making it important to estimate them in practice. There is a family of reparametrization tricks for accomplishing the estimation.
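One identity underlying Gumbel-based estimators can be checked by brute force on a tiny system: if every configuration receives independent standard Gumbel noise, the maximum of the perturbed values $- β ϕ (x) + g (x)$ has mean $\ln Z (β)$ plus the Euler–Mascheroni constant. The sketch below uses a hypothetical energy function (an open chain of four $\pm 1$ spins) and enumerates all 16 configurations, which is only feasible because the system is small. It illustrates the exact full-perturbation identity, not the upper and lower bounds of the works cited below.

```python
import itertools
import numpy as np

def energy(state):
    """Hypothetical energy: nearest-neighbour coupling on an open chain of +/-1 spins."""
    s = np.asarray(state)
    return -np.sum(s[:-1] * s[1:])

beta = 0.7
states = list(itertools.product([-1, 1], repeat=4))
neg_beta_phi = np.array([-beta * energy(s) for s in states])

# Exact value by enumeration.
log_Z_exact = np.log(np.sum(np.exp(neg_beta_phi)))

# Gumbel estimate: E[max_x(-beta phi(x) + g(x))] = ln Z + Euler-Mascheroni constant.
rng = np.random.default_rng(0)
g = rng.gumbel(size=(50_000, len(states)))
log_Z_estimate = np.mean(np.max(neg_beta_phi + g, axis=1)) - np.euler_gamma
print(log_Z_exact, log_Z_estimate)
```

For realistic systems the number of configurations grows exponentially in $n$, so perturbing every configuration independently is itself intractable.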

Upper bounds^[11]

Lower bounds^[12]

Applications

Variational autoencoder

In variational autoencoder,

Two similar methods

Inference compilation

Amortized inference

^[13]^[14]

Related methods

The reparametrization trick is a general technique

REINFORCE

The policy gradient method in reinforcement learning, proposed in (Williams, 1992),^[15] uses the following expansion of Equation (main):

$\nabla_{θ} E_{x \sim p_{θ}} [f_{θ} (x)] = E_{x \sim p_{θ}} [\nabla_{θ} f_{θ} (x) + f_{θ} (x) \nabla_{θ} \ln p_{θ} (x)]$

(policy gradient)

It is also called the "likelihood ratio method" and the "score function method".
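A minimal sketch of the policy gradient equation in a discrete setting chosen here for illustration: a softmax policy over three actions with fixed rewards (a one-step bandit). The reward $f$ does not depend on $θ$, so only the score term remains. The names `reinforce_gradient`, `theta`, and `rewards` are hypothetical, and the exact gradient $π_{k} (r_{k} - \sum_{a} π_{a} r_{a})$ is computed by hand for comparison.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_gradient(theta, rewards, n_samples, rng):
    """Estimate grad_theta E_{a ~ pi_theta}[r(a)] with the score function method."""
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=n_samples, p=pi)
    # For a softmax policy, grad_theta ln pi(a) = one_hot(a) - pi.
    score = np.eye(len(theta))[actions] - pi
    return np.mean(rewards[actions][:, None] * score, axis=0)

theta = np.array([0.2, -0.1, 0.5])
rewards = np.array([1.0, 0.0, 2.0])
rng = np.random.default_rng(0)
estimate = reinforce_gradient(theta, rewards, n_samples=200_000, rng=rng)

pi = softmax(theta)
exact = pi * (rewards - pi @ rewards)  # gradient of sum_a pi_a r_a
print(estimate, exact)
```

Unlike the reparametrization trick, this estimator needs no derivative of the reward, so it applies to discrete actions and non-differentiable rewards.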

The policy gradient method has many variants, such as policy gradient with function approximation^[16] and deterministic policy gradient^[17]. For an introduction, see ^[18].

References

[1] Maddison, C.; Mnih, A.; Teh, Y. (2019). "The concrete distribution: A continuous relaxation of discrete random variables". Proceedings of the International Conference on Learning Representations.
[2] Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
[3] Rezende, Danilo Jimenez; Mohamed, Shakir; Wierstra, Daan (2014-06-18). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models". International Conference on Machine Learning. PMLR: 1278–1286. arXiv:1401.4082.
[4] Titsias, Michalis; Lázaro-Gredilla, Miguel (2014-06-18). "Doubly Stochastic Variational Bayes for non-Conjugate Inference". International Conference on Machine Learning. PMLR: 1971–1979.
[5] Blackwell, David (1951-01-01). "Comparison of Experiments". Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. 2: 93–103.
[6] Keener, Robert W. (2010). Theoretical Statistics: Topics for a Core Course (Springer Texts in Statistics). p. 44. ISBN 978-1461426707.
[7] Kessy, Agnan; Lewin, Alex; Strimmer, Korbinian (2018-10-02). "Optimal Whitening and Decorrelation". The American Statistician. 72 (4): 309–314. doi:10.1080/00031305.2016.1277159. ISSN 0003-1305.
[8] Balog, Matej; Tripuraneni, Nilesh; Ghahramani, Zoubin; Weller, Adrian (2017-07-17). "Lost Relatives of the Gumbel Trick". International Conference on Machine Learning. PMLR: 371–379. arXiv:1706.04161.
[9] Jang, Eric; Gu, Shixiang; Poole, Ben (April 2017). "Categorical Reparameterization with Gumbel-Softmax". ICLR 2017 - Conference Track.
[10] Jaynes, E. T. (1957-05-15). "Information Theory and Statistical Mechanics". Physical Review. 106 (4): 620–630. Bibcode:1957PhRv..106..620J. doi:10.1103/PhysRev.106.620.
[11] Hazan, Tamir; Jaakkola, Tommi (2012-06-27). "On the Partition Function and Random Maximum A-Posteriori Perturbations". arXiv:1206.6410 [cs.LG].
[12] Hazan, Tamir; Maji, Subhransu; Jaakkola, Tommi (2013). "On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations". Advances in Neural Information Processing Systems. Curran Associates, Inc. 26. arXiv:1309.7598.
[13] Le, Tuan Anh; Baydin, Atilim Gunes; Wood, Frank (2017-04-10). "Inference Compilation and Universal Probabilistic Programming". Artificial Intelligence and Statistics. PMLR: 1338–1348. arXiv:1610.09900.
[14] Le, Tuan Anh (19 December 2017). "Amortized Inference". www.tuananhle.co.uk. Archived from the original on 2022-06-26. Retrieved 2022-06-26.
[15] Williams, Ronald J. (1992), "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning", Reinforcement Learning, Boston, MA: Springer US, pp. 5–32, doi:10.1007/978-1-4615-3618-5_2, ISBN 978-1-4613-6608-9, retrieved 2022-06-26
[16] Sutton, Richard S; McAllester, David; Singh, Satinder; Mansour, Yishay (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation". Advances in Neural Information Processing Systems. MIT Press. 12.
[17] Silver, David; Lever, Guy; Heess, Nicolas; Degris, Thomas; Wierstra, Daan; Riedmiller, Martin (2014-01-27). "Deterministic Policy Gradient Algorithms". International Conference on Machine Learning. PMLR: 387–395.
[18] Sutton, Richard S.; Barto, Andrew G. (2018). "13". Reinforcement Learning: An Introduction (2 ed.). Cambridge, Massachusetts. ISBN 978-0-262-03924-6. OCLC 1043175824.

This article "Reparametrization trick" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Reparametrization trick. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
