Sampling

Exact sampling

The procedure stems from the fact that generic DPPs are mixtures of projection DPPs, which suggests the following two-step algorithm. Given the spectral decomposition of the kernel

(27)\[ K(x,y)=\sum_{i=1}^{\infty} \lambda_i \phi_i(x) \overline{\phi_i(y)},\]

Step 1. Draw \(B_i\sim\operatorname{\mathcal{B}er}(\lambda_i)\) independently and denote \(\{i_1,\dots,i_{N}\} = \{i~;~B_i=1\}\),

Step 2. Sample from the projection DPP with kernel \(\tilde{K}(x,y) = \sum_{n=1}^{N}\phi_{i_n}(x) \overline{\phi_{i_n}(y)}\).

Important

  • Step 1. selects a component of the mixture, see [LMollerR12] Section 2.4.1.

  • Step 2. generates a sample from the projection \(\operatorname{DPP}(\tilde{K})\). This can be done with Algorithm 18 of [HKPVirag06], a generic projection DPP sampler that we describe in the next section; a minimal sketch of the whole two-step procedure is given below.
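
To fix ideas, here is a minimal sketch in Python/NumPy of the two-step procedure, assuming the decomposition (27) is available (truncated to finitely many terms) as an array of eigenvalues in \([0,1]\) and a list of callable eigenfunctions. The names spectral_sampler, eig_funs and sample_projection_dpp are purely illustrative, not part of any library API, and the projection sampler required in Step 2. is the object of the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_sampler(eig_vals, eig_funs, sample_projection_dpp):
    """Two-step sketch for sampling DPP(K), given a (truncated) spectral
    decomposition K(x, y) = sum_i eig_vals[i] * eig_funs[i](x) * eig_funs[i](y).

    - eig_vals: 1D array of eigenvalues, assumed to lie in [0, 1]
    - eig_funs: list of callables, the corresponding (real) eigenfunctions
    - sample_projection_dpp: routine sampling the projection DPP attached to a
      feature map Phi, e.g., a chain-rule sampler as in the next section
    """
    # Step 1: independent Bernoulli draws select a component of the mixture
    selected = rng.random(len(eig_vals)) < eig_vals
    chosen = [f for f, keep in zip(eig_funs, selected) if keep]

    # Feature map of the selected projection kernel K~(x, y) = Phi(x) . Phi(y)
    def Phi(x):
        return np.array([f(x) for f in chosen])

    # Step 2: sample exactly N = len(chosen) points from the projection DPP(K~)
    return sample_projection_dpp(Phi, rank=len(chosen))
```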

Attention

The strong requirement of the procedure is that the eigendecomposition of the continuous kernel in (27) must be available.

See also

Spectral method for sampling finite DPPs.

Projection DPPs: the chain rule

In this section, we describe a generic procedure to perform Step 2., i.e., to sample from projection DPPs. It was originally proposed, in an abstract form, as Algorithm 18 in [HKPVirag06].

For simplicity, consider a projection \(\operatorname{DPP}(K)\) with a real-valued orthogonal projection kernel. We denote \(r=\operatorname{rank}(K)\) and write

\[K(x,y) = \sum_{i=1}^{r} \phi_i(x) \phi_i(y) = \Phi(x)^{\top} \Phi(y),\]

where \(\Phi(x) \triangleq (\phi_{1}(x), \dots, \phi_{r}(x))\) denotes the feature vector associated to \(x\in \mathbb{X}\).
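
For illustration only, assuming \(\mathbb{X}=[0,1]\) and \(\mu\) the Lebesgue measure, one possible feature map uses the first \(r\) elements of the cosine basis, which are orthonormal in \(L^2([0,1], dx)\), so that \(K(x,y)=\Phi(x)^{\top}\Phi(y)\) is indeed a rank-\(r\) orthogonal projection kernel.

```python
import numpy as np

r = 5  # rank of the projection kernel

def Phi(x):
    """Feature vector built from the first r cosine basis functions,
    orthonormal w.r.t. Lebesgue measure on [0, 1]."""
    k = np.arange(1, r)
    return np.concatenate(([1.0], np.sqrt(2.0) * np.cos(k * np.pi * x)))

def K(x, y):
    """Rank-r orthogonal projection kernel K(x, y) = Phi(x)^T Phi(y)."""
    return Phi(x) @ Phi(y)
```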

Important

The eigendecomposition of the kernel is not mandatory for sampling projection \(\operatorname{DPP}(K)\).

Recall that the number of points of this projection \(\operatorname{DPP}(K)\) is \(\mu\)-almost surely equal to \(r=\operatorname{rank}(K)\).

Using the invariance of the determinant under simultaneous permutations of rows and columns, and the fact that \(K\) is an orthogonal projection kernel, it is sufficient to apply the chain rule to sample \((x_1, \dots, x_r)\) with joint distribution

(28)\[\begin{split}\mathbb{P}[(x_1, \dots, x_r)] &= \frac{1}{r!} \det [K(x_p,x_q)]_{p,q=1}^r \mu^{\otimes r}(d x_{1:r})\\ &= \frac{1}{r!} \det [\Phi(x_p)^{\top} \Phi(x_q)]_{p,q=1}^r \mu^{\otimes r}(d x_{1:r})\\ &= \frac{1}{r!} \operatorname{Volume}^2 \{\Phi(x_1), \dots, \Phi(x_r)\} \mu^{\otimes r}(d x_{1:r}),\end{split}\]

and then forget the order in which the points were selected to obtain a valid sample \(X=\{x_{1}, \dots, x_{r}\} \sim \operatorname{DPP}(K)\).

Hint

In the end, the joint distribution (28) shows that projection DPPs favor sets of \(r=\operatorname{rank}(K)\) points whose associated feature vectors span large volumes. This is another way of understanding repulsiveness.

The chain rule can be interpreted from a geometrical perspective

(29)\[\begin{split}\mathbb{P}[(x_1, \dots, x_r)] &= \dfrac{K(x_1,x_1)}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{1}{r-(i-1)} \frac{\det \mathbf{K}_{i}} {\det \mathbf{K}_{i-1}} \mu(d x_{i})\\ &= \dfrac{\left\| \Phi(x_1) \right\|^2}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{ \operatorname{distance}^2 \left(\Phi(x_i), \operatorname{Span} \{ \Phi(x_1), \dots, \Phi(x_{i-1}) \}\right) } {r-(i-1)} \mu(d x_{i}),\end{split}\]

where \(\mathbf{K}_{i} = [K(x_p,x_q)]_{p,q=1}^{i}\) denotes the kernel matrix of the first \(i\) points.

Using the determinant formula for Schur complements, the ratios of determinants in (29) can be expanded into

(30)\[\mathbb{P}[(x_1, \dots, x_r)] = \dfrac{K(x_1,x_1)}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{ K(x_i, x_i) - \mathbf{K}_{i-1}(x_i)^{\top} \mathbf{K}_{i-1}^{-1} \mathbf{K}_{i-1}(x_i) } {r-(i-1)} \mu(d x_{i}),\]

where \(\mathbf{K}_{i-1} = [K(x_p,x_q)]_{p,q=1}^{i-1}\) and \(\mathbf{K}_{i-1}(x) = (K(x,x_1), \dots, K(x,x_{i-1}))^{\top}\).
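
As a sketch only (not a library routine), the chain rule (30) can be mimicked by replacing \(\mu\) with a quadrature or fine-grid approximation, so that each conditional becomes a discrete distribution which can be normalized and sampled exactly. The function below is illustrative and accepts any feature map Phi, e.g., the cosine one sketched above.

```python
import numpy as np

rng = np.random.default_rng(1)

def chain_rule_sampler_on_grid(Phi, rank, grid, weights):
    """Sketch of the chain rule (30) with mu approximated by the discrete
    measure sum_j weights[j] * delta_{grid[j]} (quadrature or uniform grid)."""
    feats = np.array([Phi(x) for x in grid])        # (n_grid, rank)
    K_diag = np.sum(feats ** 2, axis=1)             # K(x, x) on the grid

    sample, sample_feats = [], []
    for i in range(rank):
        if i == 0:
            cond = K_diag.copy()
        else:
            F = np.array(sample_feats)              # (i, rank) features drawn so far
            K_prev_inv = np.linalg.inv(F @ F.T)     # K_{i-1}^{-1}
            K_prev_x = feats @ F.T                  # K_{i-1}(x) for every grid point
            # Numerator of (30): K(x, x) - K_{i-1}(x)^T K_{i-1}^{-1} K_{i-1}(x)
            cond = K_diag - np.einsum('gj,jk,gk->g', K_prev_x, K_prev_inv, K_prev_x)
        probs = np.clip(cond, 0.0, None) * weights
        probs /= probs.sum()                        # exact constant in (29): rank - i here
        j = rng.choice(len(grid), p=probs)
        sample.append(grid[j])
        sample_feats.append(feats[j])
    return sample

# Illustrative usage, with Phi and r as in the cosine sketch above:
#   grid = np.linspace(0, 1, 2000)
#   weights = np.full(grid.size, 1 / grid.size)
#   X = chain_rule_sampler_on_grid(Phi, r, grid, weights)
```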

Hint

  1. The chain rule (29) can be understood as an application of the base \(\times\) height formula.

  2. MLers will recognize in (30) the incremental posterior variance of the Gaussian Process (GP) associated to \(K\), see [RW06] Equation 2.26.

Caution

The connection between the chain rule (30) and Gaussian Processes is only valid when the GP kernel is an orthogonal projection kernel, see also Caution.

Important

  1. The expression (28) indeed defines a probability distribution, with normalization constant \(r!\). In particular, this distribution is exchangeable, i.e., invariant under permutations of \(x_1, \dots, x_r\).

  2. The successive ratios that appear in (29) and (30) are the normalized conditional densities (w.r.t. \(\mu\)) that drive the chain rule. The associated normalizing constants \(r-(i-1)\) do not depend on the previously selected points.

Caution

The main difference with the finite case is that we need to be able to sample from the continuous conditionals that appear in (30). This can be done with a rejection sampling mechanism, but finding the right proposal density is challenging, though achievable, see, e.g., Multivariate Jacobi Ensemble.
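
As a hedged illustration (a generic scheme, not necessarily the proposal used for the multivariate Jacobi ensemble): since the quadratic form in (30) is nonnegative, the \(i\)-th conditional density w.r.t. \(\mu\) is dominated by \(\frac{r}{r-(i-1)}\,\frac{K(x,x)}{r}\). Hence, if one can sample from the density \(x\mapsto K(x,x)/r\) (assumed tractable here), accepting a proposed \(x\) with probability \(1 - \mathbf{K}_{i-1}(x)^{\top}\mathbf{K}_{i-1}^{-1}\mathbf{K}_{i-1}(x)/K(x,x)\) yields an exact draw from the conditional, with acceptance rate \((r-(i-1))/r\).

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_ith_conditional(Phi, prev_feats, sample_from_diagonal):
    """Hypothetical sketch: draw the i-th point of the chain rule (30) by
    rejection, given the feature vectors of the previously selected points
    (prev_feats, shape (i-1, r)) and a routine sample_from_diagonal() that
    draws from the density x -> K(x, x) / r with respect to mu."""
    if len(prev_feats) == 0:
        return sample_from_diagonal()  # first point: the target is exactly K(x, x) / r
    K_prev_inv = np.linalg.inv(prev_feats @ prev_feats.T)   # K_{i-1}^{-1}
    while True:
        x = sample_from_diagonal()
        phi_x = Phi(x)
        k_x = prev_feats @ phi_x                             # K_{i-1}(x)
        # Accept w.p. 1 - K_{i-1}(x)^T K_{i-1}^{-1} K_{i-1}(x) / K(x, x),
        # which turns the proposal K(x, x) / r into the conditional of (30)
        if rng.random() < 1.0 - (k_x @ K_prev_inv @ k_x) / (phi_x @ phi_x):
            return x
```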

See also

Perfect sampling

[DFL13] uses Coupling From The Past (CFTP).

Approximate sampling

See also

Approximation of \(K(x,y)=K(x-y)\) by Fourier series [LMollerR12] Section 4
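
As a rough, illustrative sketch of this idea (not the exact construction of [LMollerR12]), one can compute the Fourier coefficients of a 1-periodic covariance function numerically and use them as approximate eigenvalues, the complex exponentials playing the role of the eigenfunctions in (27). The helper below is hypothetical; it truncates and clips the coefficients so that the result is a valid DPP kernel.

```python
import numpy as np

def fourier_approx_eigendecomposition(k_func, n_modes=50, n_quad=4096):
    """Sketch: approximate a translation-invariant kernel K(x, y) = k(x - y)
    on [0, 1] (treated periodically) by a truncated Fourier series. The Fourier
    coefficients act as eigenvalues and exp(2i pi k x) as eigenfunctions, to be
    fed to a spectral sampler such as the sketch of the Exact sampling section.
    """
    t = np.linspace(0.0, 1.0, n_quad, endpoint=False)
    modes = np.arange(-n_modes, n_modes + 1)
    # c_k = int_0^1 k(t) exp(-2i pi k t) dt, approximated by the rectangle rule
    coeffs = np.array([np.mean(k_func(t) * np.exp(-2j * np.pi * k * t)) for k in modes])
    # A symmetric covariance yields real coefficients; clip to [0, 1] for a valid DPP
    eig_vals = np.clip(coeffs.real, 0.0, 1.0)
    eig_funs = [lambda x, k=k: np.exp(2j * np.pi * k * x) for k in modes]
    return eig_vals, eig_funs
```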