Sampling

In contrast to the finite case, where the ML community has put a lot of effort into improving the efficiency and tractability of the sampling routines, much less has been done in the continuous setting.

Exact sampling

In this section, we describe the main techniques for sampling continuous DPPs exactly.

As for finite DPPs, the most prominent technique relies on the fact that generic DPPs are mixtures of projection DPPs.

Projection DPPs: the chain rule

Let’s focus on sampling from projection \(\operatorname{DPP}(K)\) with a real-valued orthogonal projection kernel \(K:\mathbb{X}\times \mathbb{X}\to \mathbb{R}\) and reference measure \(\mu\), that is

\[K(x,y)=K(y,x) \quad\text{and}\quad \int_{\mathbb{X}} K(x, z) K(z, y) \mu(d z) = K(x, y)\]

In this setting, recall that the number of points is \(\mu\)-almost surely equal to \(r=\operatorname{rank}(K)\).

To generate a valid sample \(X=\{x_{1}, \dots, x_{r}\} \sim \operatorname{DPP}(K)\), it is sufficient, as shown in [HKPVirag06] Proposition 19, to apply the chain rule to sample \((x_1, \dots, x_r)\) with joint distribution

(27)\[\mathbb{P}[(x_1, \dots, x_r)] = \frac{1}{r!} \det [K(x_p, x_q)]_{p,q=1}^r \mu^{\otimes r}(d x_{1:r})\]

and forget the order in which the points were selected.

The original projection DPP sampler, [HKPVirag06] Algorithm 18, was given in an abstract form; it can be implemented using the following strategy: write the determinant in (27) as a telescopic product of ratios of determinants and use Schur complements to get

(28)\[\begin{split}\mathbb{P}[(x_1, \dots, x_r)] &= \dfrac{K(x_1,x_1)}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{1}{r-(i-1)} \frac{\det \mathbf{K}_{i}} {\det \mathbf{K}_{i-1}} \mu(d x_{i})\\ &= \dfrac{K(x_1,x_1)}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{ K(x_i, x_i) - \mathbf{K}_{i-1}(x_i)^{\top} \mathbf{K}_{i-1}^{-1} \mathbf{K}_{i-1}(x_i) } {r-(i-1)} \mu(d x_{i}),\end{split}\]

where \(\mathbf{K}_{i-1} = [K(x_p,x_q)]_{p,q=1}^{i-1}\) and \(\mathbf{K}_{i-1}(x) = (K(x,x_1), \dots, K(x,x_{i-1}))^{\top}\).

Important

  1. The expression (27) indeed defines a probability distribution, with normalization constant \(r!\). In particular, this distribution is exchangeable, i.e., invariant under permutations of the coordinates.

  2. The successive ratios that appear in (28) are the normalized conditional densities (w.r.t. \(\mu\)) that drive the chain rule. The associated normalizing constants \(r-(i-1)\) are independent of the previous points.

  3. Sampling projection DPPs does not require the eigendecomposition of the kernel!

Hint

MLers will recognize (28) as the incremental posterior variance of a noise-free Gaussian Process (GP) model with kernel \(K\), see [RW06] Equation 2.26.
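
To make this correspondence concrete, here is a minimal NumPy sketch evaluating the Schur complement appearing in (28), i.e., the noise-free GP posterior variance, for a hypothetical rank-3 orthogonal projection kernel on \([0, 1]\); the kernel and the points are illustrative assumptions, not part of the original text:

    import numpy as np

    # Rank-3 orthogonal projection kernel on [0, 1] w.r.t. Lebesgue measure,
    # built from the orthonormal family 1, sqrt(2)cos(2 pi x), sqrt(2)sin(2 pi x)
    def Phi(x):
        return np.array([1.0,
                         np.sqrt(2) * np.cos(2 * np.pi * x),
                         np.sqrt(2) * np.sin(2 * np.pi * x)])

    def K(x, y):
        return Phi(x) @ Phi(y)

    prev = [0.1, 0.4]  # previously sampled points x_1, ..., x_{i-1}
    x = 0.55           # candidate point x_i

    K_prev = np.array([[K(p, q) for q in prev] for p in prev])  # K_{i-1}
    k_x = np.array([K(x, p) for p in prev])                     # K_{i-1}(x_i)

    # Schur complement in (28) = posterior variance of a noise-free GP at x;
    # dividing by r - (i - 1) gives the normalized conditional density w.r.t. mu
    print(K(x, x) - k_x @ np.linalg.solve(K_prev, k_x))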

Caution

The connection between the chain rule (28) and Gaussian Processes is valid only when the GP kernel is an orthogonal projection kernel.

Geometrical interpretation

When the eigendecomposition of the kernel is available, the chain rule can be interpreted and implemented from a geometrical perspective, see, e.g., [LMollerR12] Algorithm 1.

Write the Gram formulation of the kernel as

\[K(x,y) = \sum_{i=1}^{r} \phi_i(x) \phi_i(y) = \Phi(x)^{\top} \Phi(y),\]

where \(\Phi(x) \triangleq (\phi_{1}(x), \dots, \phi_{r}(x))\) denotes the feature vector associated with \(x\in \mathbb{X}\).

The joint distribution (27) then reads

(29)\[\begin{split}\mathbb{P}[(x_1, \dots, x_r)] &= \frac{1}{r!} \det [\Phi(x_p)^{\top} \Phi(x_q)]_{p,q=1}^r \mu^{\otimes r}(d x_{1:r})\\ &= \frac{1}{r!} \operatorname{Volume}^2 \{\Phi(x_1), \dots, \Phi(x_r)\} \mu^{\otimes r}(d x_{1:r}).\end{split}\]

Hint

The joint distribution (29) characterizes the fact that projection \(\operatorname{DPP}(K)\) favors sets of \(r=\operatorname{rank}(K)\) points \(\left(x_{1}, \dots, x_{r} \right)\) whose feature vectors \(\Phi(x_1), \dots, \Phi(x_r)\) span a large volume. This is another way of understanding the repulsive/diversity feature of DPPs.

Then, the previous telescopic product of ratios of determinants in (28) can be understood as the base \(\times\) height formula applied to compute \(\operatorname{Volume}^2 \{\Phi(x_1), \dots, \Phi(x_r)\}\), so that

(30)\[\begin{split}\mathbb{P}[(x_1, \dots, x_r)] &= \dfrac{\left\langle \Phi(x_1), \Phi(x_1) \right\rangle}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{1}{r-(i-1)} \frac{\det \mathbf{K}_{i}} {\det \mathbf{K}_{i-1}} \mu(d x_{i})\\ &= \dfrac{\left\| \Phi(x_1) \right\|^2}{r} \mu(d x_{1}) \prod_{i=2}^{r} \dfrac{ \operatorname{distance}^2 \left(\Phi(x_i), \operatorname{Span} \{ \Phi(x_1), \dots, \Phi(x_{i-1}) \}\right) } {r-(i-1)} \mu(d x_{i}),\end{split}\]

where \(\mathbf{K}_{i-1} = \left[\left\langle \Phi(x_p), \Phi(x_q) \right\rangle\right]_{p,q=1}^{i-1}\).

Hint

The overall procedure is akin to a sequential Gram-Schmidt orthogonalization of the feature vectors \(\Phi(x_{1}), \dots, \Phi(x_{r})\).
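
This equivalence between (29) and (30) is easy to check numerically, since a QR decomposition performs precisely this Gram-Schmidt orthogonalization. A minimal sketch with arbitrary, hypothetical feature vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    r = 5
    # Hypothetical feature vectors Phi(x_1), ..., Phi(x_r), stacked as columns
    Phi = rng.standard_normal((r, r))

    # Squared volume as a Gram determinant, cf. (29)
    vol2_gram = np.linalg.det(Phi.T @ Phi)

    # Base x height, cf. (30): |R[i, i]| is the distance of Phi(x_i)
    # to Span{Phi(x_1), ..., Phi(x_{i-1})}
    _, R = np.linalg.qr(Phi)
    vol2_qr = np.prod(np.diag(R) ** 2)

    print(np.allclose(vol2_gram, vol2_qr))  # True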

Attention

In contrast to the finite case, where the conditionals are simply probability vectors, the chain rule formulations (28) and (30) require sampling from continuous distributions. This can be done with a rejection sampling mechanism, but finding a good proposal density with tight rejection bounds is a challenging problem, see [LMollerR12] Section 2.4. It is nevertheless achievable in some specific cases, see, e.g., Multivariate Jacobi Ensemble. A naive rejection sketch is given below.
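
For illustration, here is a minimal sketch of the chain rule (28) driven by naive rejection sampling, with the reference measure \(\mu\) itself as proposal; the function names and the boundedness assumption \(K(x, x) \le M\) are ours, not part of the original text:

    import numpy as np

    def chain_rule_projection_dpp(K, r, sample_mu, bound, rng=None):
        """Sample a projection DPP(K) via the chain rule (28).

        Sketch only. Assumes K is an orthogonal projection kernel of rank r,
        sample_mu(rng) draws a point from the reference measure mu, and
        K(x, x) <= bound for mu-almost every x.
        """
        rng = np.random.default_rng(rng)
        points = []
        for i in range(r):
            while True:
                x = sample_mu(rng)  # proposal: the reference measure itself
                # Unnormalized i-th conditional in (28):
                # K(x, x) - K_{i-1}(x)^T K_{i-1}^{-1} K_{i-1}(x)
                if points:
                    K_prev = np.array([[K(p, q) for q in points] for p in points])
                    k_x = np.array([K(x, p) for p in points])
                    density = K(x, x) - k_x @ np.linalg.solve(K_prev, k_x)
                else:
                    density = K(x, x)
                # Rejection step: accept with probability density / bound <= 1
                if rng.uniform() < density / bound:
                    points.append(x)
                    break
        return points

With the rank-3 Fourier kernel sketched earlier, one could take sample_mu = lambda rng: rng.uniform() and bound = 3, since K(x, x) = 3 there; the expected acceptance rate at step \(i\) is then \((r-(i-1))/3\), which illustrates why tighter proposals matter as points accumulate.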


Generic DPPs: the spectral method

The procedure stems from the fact that generic DPPs are mixtures of projection DPPs, suggesting the following two-step algorithm. Given the spectral decomposition of the kernel

(31)\[ K(x,y)=\sum_{i=1}^{\infty} \lambda_i \phi_i(x) \overline{\phi_i(y)},\]

Step 1. Draw \(B_i\sim\operatorname{\mathcal{B}er}(\lambda_i)\) independently and denote \(\{i_1,\dots,i_{N}\} = \{i~;~B_i=1\}\),

Step 2. Sample from the projection DPP with kernel \(\tilde{K}(x,y) = \sum_{n=1}^{N}\phi_{i_n}(x) \overline{\phi_{i_n}(y)}\).

Important

  • Step 1. selects a component of the mixture, see [LMollerR12] Section 2.4.1

  • Step 2. generates a sample from the projection \(\operatorname{DPP}(\tilde{K})\), cf. previous section.

Attention

Contrary to projection DPPs, the general case requires the eigendecomposition of the kernel (31).
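
A minimal sketch of this two-step procedure, assuming the expansion (31) is truncated to finitely many eigenpairs and that some projection DPP sampler (e.g., the chain-rule sketch above) is available; all names here are hypothetical:

    import numpy as np

    def spectral_sampler(eig_vals, eig_funs, projection_dpp_sampler, rng=None):
        """Two-step spectral method for DPP(K), with K as in (31).

        eig_vals : eigenvalues lambda_i, each in [0, 1]
        eig_funs : corresponding eigenfunctions phi_i (callables)
        """
        rng = np.random.default_rng(rng)

        # Step 1: independent Bernoulli draws select a mixture component
        selected = [i for i, lam in enumerate(eig_vals) if rng.uniform() < lam]

        # Step 2: projection kernel K~ built from the selected eigenfunctions
        def K_tilde(x, y):
            return sum(eig_funs[i](x) * np.conj(eig_funs[i](y))
                       for i in selected)

        return projection_dpp_sampler(K_tilde, len(selected))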

See also

Spectral method for sampling finite DPPs.

Perfect sampling

[DFL13] provides a perfect sampler based on Coupling From The Past (CFTP).

Approximate sampling

See also

  • Approximation of \(K(x,y)=K(x-y)\) by Fourier series [LMollerR12] Section 4
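
For instance, for a translation-invariant kernel \(K(x,y)=K(x-y)\) on \([0,1]\), periodization makes the complex exponentials \(e^{2i\pi n x}\) eigenfunctions, with the Fourier coefficients of the kernel as eigenvalues. A hedged sketch of how such a truncated expansion could feed the spectral method above; the function names are ours, see [LMollerR12] Section 4 for the actual construction:

    import numpy as np

    def fourier_eigenpairs(k, n_max, n_grid=2**12):
        """Truncated Fourier expansion of a periodized kernel k on [0, 1].

        Returns (eigenvalues, eigenfunctions) usable in (31); for a valid
        DPP the eigenvalues must lie in [0, 1].
        """
        t = np.arange(n_grid) / n_grid
        coeffs = np.fft.fft(k(t)) / n_grid       # Fourier coefficients of k
        orders = np.arange(-n_max, n_max + 1)
        eig_vals = coeffs[orders].real           # real-valued for symmetric k
        eig_funs = [lambda x, n=n: np.exp(2j * np.pi * n * x) for n in orders]
        return eig_vals, eig_funs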