Sampling¶
In contrast to the finite case, where the ML community has put effort into improving the efficiency and tractability of the sampling routine, much less has been done in the continuous setting.
Exact sampling¶
In this section, we describe the main techniques for exact sampling from continuous DPPs.
As in the finite case, the most prominent one relies on the fact that generic DPPs are mixtures of projection DPPs.
Projection DPPs: the chain rule¶
Let’s focus on sampling from projection \(\operatorname{DPP}(K)\) with a real-valued orthogonal projection kernel \(K:\mathbb{X}\times \mathbb{X}\to \mathbb{R}\) and reference measure \(\mu\), that is, a kernel satisfying

\[
\int_{\mathbb{X}} K(x, y) K(y, z) \mu(dy) = K(x, z),
\qquad
K(x, y) = K(y, x).
\]
In this setting, recall that the number of points is \(\mu\)-almost surely equal to \(r=\operatorname{rank}(K)\).
To generate a valid sample \(X=\{x_{1}, \dots, x_{r}\} \sim \operatorname{DPP}(K)\), [HKPVirag06] Proposition 19 showed that it is sufficient to apply the chain rule to sample \((x_1, \dots, x_r)\) with joint distribution

\[
\frac{1}{r!} \det \left[ K(x_p, x_q) \right]_{p,q=1}^{r} \mu^{\otimes r}(d x_{1}, \dots, d x_{r}), \tag{27}
\]

and then forget the order in which the points were selected.
The original projection DPP sampler of [HKPVirag06] Algorithm 18 was given in an abstract form, which can be implemented using the following strategy. Write the determinant in (27) as a telescoping product of ratios of determinants and use Schur complements to get

\[
\frac{K(x_1, x_1)}{r} \mu(d x_1) \prod_{i=2}^{r} \frac{K(x_i, x_i) - \mathbf{K}_{i-1}(x_i)^{\top} \mathbf{K}_{i-1}^{-1} \mathbf{K}_{i-1}(x_i)}{r - (i - 1)} \mu(d x_i), \tag{28}
\]

where \(\mathbf{K}_{i-1} = [K(x_p,x_q)]_{p,q=1}^{i-1}\) and \(\mathbf{K}_{i-1}(x) = (K(x,x_1), \dots, K(x,x_{i-1}))^{\top}\).
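To make the chain rule (28) concrete, here is a hedged NumPy sketch (not DPPy code). It assumes a hypothetical projection kernel onto the first three Fourier modes on \([0, 1]\) with \(\mu\) the Lebesgue measure, so that \(K(x, x) = r\) everywhere, and samples each conditional by rejection with a uniform proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature map: first r = 3 Fourier modes on [0, 1],
# orthonormal w.r.t. Lebesgue, so K(x, y) = Phi(x) @ Phi(y) is a
# rank-3 orthogonal projection kernel with K(x, x) = 3 for all x.
def Phi(x):
    return np.array([1.0,
                     np.sqrt(2.0) * np.cos(2.0 * np.pi * x),
                     np.sqrt(2.0) * np.sin(2.0 * np.pi * x)])

def K(x, y):
    return Phi(x) @ Phi(y)

def sample_projection_dpp(r=3):
    """Chain rule (28): the i-th conditional density w.r.t. mu is the
    Schur complement divided by r - (i - 1), sampled here by rejection."""
    points = []
    for i in range(r):
        while True:
            x = rng.uniform()  # uniform proposal on [0, 1]
            if not points:
                schur = K(x, x)
            else:
                K_prev = np.array([[K(p, q) for q in points] for p in points])
                k_x = np.array([K(x, p) for p in points])
                schur = K(x, x) - k_x @ np.linalg.solve(K_prev, k_x)
            # The conditional density is schur / (r - i); it is bounded by
            # K(x, x) / (r - i) = r / (r - i), so accept w.p. schur / r.
            if rng.uniform() < schur / r:
                points.append(x)
                break
    return points

print(sample_projection_dpp())
```

The rejection bound exploits the fact that the Schur complement is at most \(K(x, x)\), which is constant for this particular kernel; for a kernel with non-constant diagonal, the proposal and the bound would need to be adapted.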
Important
The expression (27) indeed defines a probability distribution, with normalization constant \(r!\). In particular this distribution is exchangeable, i.e., invariant by permutation of the coordinates.
The successive ratios that appear in (28) are the normalized conditional densities (w.r.t. \(\mu\)) that drive the chain rule. The associated normalizing constants \(r-(i-1)\) are independent of the previous points.
Sampling projection DPPs does not require the eigendecomposition of the kernel!
Hint
MLers will recognize (28) as the incremental posterior variance of a noise-free Gaussian Process (GP) model with kernel \(K\), see [RW06] Equation 2.26.
Geometrical interpretation¶
When the eigendecomposition of the kernel is available, the chain rule can be interpreted and implemented from a geometrical perspective, see, e.g., [LMollerR12] Algorithm 1.
Write the Gram formulation of the kernel as

\[
K(x, y) = \Phi(x)^{\top} \Phi(y),
\]

where \(\Phi(x) \triangleq (\phi_{1}(x), \dots, \phi_{r}(x))\) denotes the feature vector associated to \(x\in \mathbb{X}\).

The joint distribution (27) then reads

\[
\frac{1}{r!} \det \left[ \Phi(x_p)^{\top} \Phi(x_q) \right]_{p,q=1}^{r} \mu^{\otimes r}(d x_{1}, \dots, d x_{r})
= \frac{1}{r!} \operatorname{Volume}^{2} \{\Phi(x_1), \dots, \Phi(x_r)\} \, \mu^{\otimes r}(d x_{1}, \dots, d x_{r}). \tag{29}
\]
Hint
The joint distribution (29) characterizes the fact that the projection \(\operatorname{DPP}(K)\) favors sets of \(r=\operatorname{rank}(K)\) points \(\left(x_{1}, \dots, x_{r} \right)\) whose feature vectors \(\Phi(x_1), \dots, \Phi(x_r)\) span a large volume. This is another way of understanding the repulsive/diversity feature of DPPs.
Then, the previous telescoping product of ratios of determinants in (28) can be understood as the base \(\times\) height formula applied to compute \(\operatorname{Volume}^2 \{\Phi(x_1), \dots, \Phi(x_r)\}\), so that

\[
\frac{\left\| \Phi(x_1) \right\|^{2}}{r} \mu(d x_1) \prod_{i=2}^{r} \frac{\operatorname{dist}^{2} \left( \Phi(x_i), \operatorname{span} \{\Phi(x_1), \dots, \Phi(x_{i-1})\} \right)}{r - (i - 1)} \mu(d x_i), \tag{30}
\]

where the Gram matrix of (28) now collects the inner products of the feature vectors, \(\mathbf{K}_{i-1} = \left[\left\langle \Phi(x_p), \Phi(x_q) \right\rangle\right]_{p,q=1}^{i-1}\).
Hint
The overall procedure is akin to a sequential Gram-Schmidt orthogonalization of \(\Phi(x_{1}), \dots, \Phi(x_{r})\).
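A quick numerical check of this geometric reading, using a random, purely illustrative feature matrix: the Schur complement appearing in (28) equals the squared norm of the Gram-Schmidt residual, i.e. the squared distance of \(\Phi(x)\) to the span of the previous feature vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative feature vectors in R^4: three "previous" points and one new.
feats = rng.standard_normal((4, 4))
V, phi_x = feats[:3], feats[3]   # rows of V: Phi(x_1), Phi(x_2), Phi(x_3)

# Schur complement K(x,x) - K_{i-1}(x)^T K_{i-1}^{-1} K_{i-1}(x), as in (28).
K_prev = V @ V.T
k_x = V @ phi_x
schur = phi_x @ phi_x - k_x @ np.linalg.solve(K_prev, k_x)

# Squared distance of Phi(x) to span{Phi(x_1), Phi(x_2), Phi(x_3)}, via QR.
Q, _ = np.linalg.qr(V.T)         # orthonormal basis of the span
residual = phi_x - Q @ (Q.T @ phi_x)
assert np.isclose(schur, residual @ residual)
```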
Attention
In contrast to the finite case, where the conditionals are simply probability vectors, the chain rule formulations (28) and (30) require sampling from a continuous distribution. This can be done using a rejection sampling mechanism, but finding a good proposal density with tight rejection bounds is a challenging problem, see [LMollerR12] Section 2.4. It is nevertheless achievable in some specific cases, see, e.g., Multivariate Jacobi Ensemble.
See also
Algorithm 18 [HKPVirag06] for the original abstract projection DPP sampler
Projection DPPs: the chain rule in the finite case
Some Orthogonal Polynomial Ensembles (specific instances of projection DPPs) can be sampled in \(\mathcal{O}(r^2)\) by computing the eigenvalues of properly randomised tridiagonal matrices.
The multivariate Jacobi ensemble, whose sample() method relies on the chain rule described by (30).
Generic DPPs: the spectral method¶
The procedure stems from the fact that generic DPPs are mixtures of projection DPPs, suggesting the following two-step algorithm. Given the spectral decomposition of the kernel

\[
K(x, y) = \sum_{i=1}^{\infty} \lambda_{i} \phi_{i}(x) \overline{\phi_{i}(y)}, \quad \text{with } 0 \leq \lambda_{i} \leq 1, \tag{31}
\]
Step 1. Draw \(B_i\sim\operatorname{\mathcal{B}er}(\lambda_i)\) independently and denote \(\{i_1,\dots,i_{N}\} = \{i~;~B_i=1\}\),
Step 2. Sample from the projection DPP with kernel \(\tilde{K}(x,y) = \sum_{n=1}^{N}\phi_{i_n}(x) \overline{\phi_{i_n}(y)}\).
Important
Step 1. selects a component of the mixture, see [LMollerR12] Section 2.4.1
Step 2. generates a sample from the projection \(\operatorname{DPP}(\tilde{K})\), cf. previous section.
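A minimal sketch of the two steps, assuming made-up eigenvalues \(\lambda_i\) and a cosine eigenbasis on \([0, 1]\) (none of these choices come from the text; the resulting \(\tilde{K}\) would then be sampled with the chain rule of the previous section):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical spectrum: four eigenvalues in [0, 1].
lam = np.array([0.9, 0.7, 0.5, 0.2])

def phi(i, x):
    # Orthonormal cosine basis w.r.t. Lebesgue on [0, 1], i = 0, 1, 2, ...
    return 1.0 if i == 0 else np.sqrt(2.0) * np.cos(i * np.pi * x)

# Step 1: draw B_i ~ Ber(lambda_i) independently to select a component.
B = rng.uniform(size=lam.size) < lam
selected = np.flatnonzero(B)

# Step 2: the selected indices define the projection kernel K_tilde,
# which projects onto span{phi_i : B_i = 1}.
def K_tilde(x, y):
    return sum(phi(i, x) * phi(i, y) for i in selected)
```

Each run of Step 1 picks a different projection kernel; averaging over the Bernoulli draws recovers \(\operatorname{DPP}(K)\).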
Attention
Contrary to projection DPPs, the general case requires the eigendecomposition of the kernel (31).
See also
Spectral method for sampling finite DPPs.
Approximate sampling¶
See also
Approximation of \(K(x,y)=K(x-y)\) by Fourier series [LMollerR12] Section 4