Properties

Throughout this section, we assume \(\mathbf{K}\) and \(\mathbf{L}\) satisfy the sufficient conditions (3) and (4) respectively.

Relation between correlation and likelihood kernels

  1. Considering the DPP defined by \(\mathbf{L} \succeq 0_N\), the associated correlation kernel \(\mathbf{K}\) (1) can be derived as

    (6)\[\mathbf{K} = \mathbf{L}(I+\mathbf{L})^{-1} = I - (I+\mathbf{L})^{-1}.\]

    See also

    Theorem 2.2 [KT12].

  2. Considering the DPP defined by \(0_N \preceq \mathbf{K} \prec I_N\), the associated likelihood kernel \(\mathbf{L}\) (2) can be derived as

    (7)\[\mathbf{L} = \mathbf{K}(I-\mathbf{K})^{-1} = -I + (I-\mathbf{K})^{-1}.\]

    See also

    Equation 25 [KT12].

Important

Thus, except for correlation kernels \(\mathbf{K}\) with some eigenvalues equal to \(1\), both \(\mathbf{K}\) and \(\mathbf{L}\) are diagonalizable in the same basis

(8)\[\mathbf{K} = U \Lambda U^{\dagger}, \quad \mathbf{L} = U \Gamma U^{\dagger} \qquad \text{with} \qquad \lambda_n = \frac{\gamma_n}{1+\gamma_n}.\]
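The relations (6), (7) and (8) are easy to check numerically. The following is a minimal NumPy sketch, assuming a randomly generated likelihood kernel \(\mathbf{L}\); it is not part of the DPPy API.

import numpy as np
from scipy.linalg import qr

rng = np.random.RandomState(0)

r, N = 4, 10
gamma = 5 * rng.rand(r)  # positive eigenvalues of L
U, _ = qr(rng.randn(N, r), mode='economic')

L = (U * gamma).dot(U.T)              # L = U diag(gamma) U.T
I = np.eye(N)

K = L.dot(np.linalg.inv(I + L))       # (6)
L_back = K.dot(np.linalg.inv(I - K))  # (7)
print(np.allclose(L, L_back))         # True

# (8): the nonzero eigenvalues of K are gamma / (1 + gamma)
print(np.allclose(np.sort(np.linalg.eigvalsh(K))[-r:],
                  np.sort(gamma / (1 + gamma))))  # True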

Note

For DPPs with projection correlation kernel \(\mathbf{K}\), the likelihood kernel \(\mathbf{L}\) cannot be computed via (7), since \(\mathbf{K}\) has at least one eigenvalue equal to \(1\) (\(\mathbf{K}^2=\mathbf{K}\)).

Nevertheless, recalling that a projection DPP has exactly \(\operatorname{rank}(\mathbf{K})\) points almost surely, its likelihood reads

\[\mathbb{P}[\mathcal{X}=S] = \det \mathbf{K}_S 1_{|S|=\operatorname{rank}(\mathbf{K})} \quad \forall S\subset [N].\]
from numpy.random import randn, rand
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

r, N = 4, 10
eig_vals = rand(r)  # eigenvalues of K, in (0, 1)
eig_vecs, _ = qr(randn(N, r), mode='economic')

DPP = FiniteDPP('correlation', **{'K_eig_dec': (eig_vals, eig_vecs)})
DPP.compute_L()

# - L (likelihood) kernel computed via:
# - eig_L = eig_K/(1-eig_K)
# - U diag(eig_L) U.T

Generic DPPs as mixtures of projection DPPs

Projection DPPs are the building blocks of the model in the sense that generic DPPs are mixtures of projection DPPs.

Important

Consider \(\mathcal{X} \sim \operatorname{DPP}(\mathbf{K})\) and write the spectral decomposition of the corresponding kernel as

\[\mathbf{K} = \sum_{n=1}^N \lambda_n u_n u_n^{\dagger}.\]

Then, denote \(\mathcal{X}^B \sim \operatorname{DPP}(\mathbf{K}^B)\) with

\[\mathbf{K}^B = \sum_{n=1}^N B_n u_n u_n^{\dagger}, \quad \text{where} \quad B_n \overset{\text{i.i.d.}}{\sim} \mathcal{B}er(\lambda_n),\]

That is, \(\mathcal{X}^B\) is obtained by first drawing \(B_1, \dots, B_N\) independently and then sampling from \(\operatorname{DPP}(\mathbf{K}^B)\), the DPP with orthogonal projection kernel \(\mathbf{K}^B\).

Finally, we have \(\mathcal{X} \overset{d}{=} \mathcal{X}^B\).
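As an illustration, the two steps of the mixture construction can be carried out by hand: draw the Bernoulli variables, then sample the projection DPP attached to the selected eigenvectors. This is only a sketch of the mechanism (DPPy's exact samplers implement it internally); the variable names and seed are illustrative.

import numpy as np
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

r, N = 4, 10
eig_vals = rng.rand(r)  # eigenvalues of K, in (0, 1)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

# Step 1: keep eigenvector u_n with probability lambda_n
B = rng.rand(r) < eig_vals

# Step 2: sample from the projection DPP with kernel K^B
if B.any():
    DPP_B = FiniteDPP('correlation', projection=True,
                      **{'K_eig_dec': (np.ones(int(B.sum())), eig_vecs[:, B])})
    DPP_B.sample_exact(random_state=rng)
    sample = DPP_B.list_of_samples[-1]
else:
    sample = []  # K^B = 0, the empty set is drawn

print(sample)  # a realization of X^B, i.e., of X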

Number of points

For projection DPPs, i.e., when \(\mathbf{K}\) is an orthogonal projection matrix, one can show that \(|\mathcal{X}|=\operatorname{rank}(\mathbf{K})=\operatorname{Trace}(\mathbf{K})\) almost surely (see, e.g., Lemma 17 of [HKPVirag06] or Lemma 2.7 of [KT12]).

In the general case, based on the fact that generic DPPs are mixtures of projection DPPs, the number of points is distributed as a sum of independent Bernoulli random variables,

(9)\[|\mathcal{X}| \overset{d}{=} \sum_{n=1}^N \operatorname{\mathcal{B}er} \left( \lambda_n \right) = \sum_{n=1}^N \operatorname{\mathcal{B}er} \left( \frac{\gamma_n}{1+\gamma_n} \right).\]

Note

From (9) it is clear that \(|\mathcal{X}|\leq \operatorname{rank}(\mathbf{K})=\operatorname{rank}(\mathbf{L})\).

Expectation

(10)\[\mathbb{E}[|\mathcal{X}|] = \operatorname{trace} \mathbf{K} = \sum_{n=1}^N \lambda_n = \sum_{n=1}^N \frac{\gamma_n}{1+\gamma_n}.\]

The expected size of a DPP with likelihood matrix \(\mathbf{L}\) is also related to the effective dimension \(d_{\text{eff}}(\mathbf{L}) = \operatorname{trace} (\mathbf{L}(\mathbf{L}+\mathbf{I})^{-1})= \operatorname{trace} \mathbf{K} = \mathbb{E}[|\mathcal{X}|]\) of \(\mathbf{L}\), a quantity with many applications in randomized numerical linear algebra and statistical learning theory (see e.g., [DerezinskiCV19]).
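As a quick sanity check, one can compare \(d_{\text{eff}}(\mathbf{L})\), computed directly from a likelihood matrix \(\mathbf{L}\), with the average size of samples drawn from \(\operatorname{DPP}(\mathbf{L})\). The sketch below builds \(\mathbf{L}\) as an arbitrary Gram matrix and assumes the 'L' parametrization of FiniteDPP; sizes, seed and sample budget are illustrative.

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

N, d = 10, 6
Phi = rng.randn(d, N)   # arbitrary feature matrix
L = Phi.T.dot(Phi)      # L >= 0, rank <= d

d_eff = np.trace(L.dot(np.linalg.inv(L + np.eye(N))))

DPP = FiniteDPP('likelihood', **{'L': L})
for _ in range(2000):
    DPP.sample_exact(random_state=rng)

sizes = list(map(len, DPP.list_of_samples))
print('E[|X|]: emp={:.3f}, d_eff(L)={:.3f}'.format(np.mean(sizes), d_eff))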

Variance

(11)\[\operatorname{\mathbb{V}ar}[|\mathcal{X}|] = \operatorname{trace} \mathbf{K} - \operatorname{trace} \mathbf{K}^2 = \sum_{n=1}^N \lambda_n(1-\lambda_n) = \sum_{n=1}^N \frac{\gamma_n}{(1+\gamma_n)^2}.\]

See also

Expectation and variance of Linear statistics.

import numpy as np
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(1)

r, N = 5, 10
eig_vals = rng.rand(r)  # eigenvalues of K, in (0, 1)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')

dpp_K = FiniteDPP('correlation', projection=False,
                  **{'K_eig_dec': (eig_vals, eig_vecs)})

nb_samples = 2000
for _ in range(nb_samples):
    dpp_K.sample_exact(random_state=rng)

sizes = list(map(len, dpp_K.list_of_samples))
print('E[|X|]:\n emp={:.3f}, theo={:.3f}'
      .format(np.mean(sizes), np.sum(eig_vals)))
print('Var[|X|]:\n emp={:.3f}, theo={:.3f}'
      .format(np.var(sizes), np.sum(eig_vals*(1-eig_vals))))
E[|X|]:
 emp=1.581, theo=1.587
Var[|X|]:
 emp=0.795, theo=0.781

Special cases

  1. When the correlation kernel \(\mathbf{K}\) (1) of \(\operatorname{DPP}(\mathbf{K})\) is an orthogonal projection kernel, i.e., \(\operatorname{DPP}(\mathbf{K})\) is a projection DPP, we have

    (12)\[|\mathcal{X}| = \operatorname{rank}(\mathbf{K}) = \operatorname{trace}(\mathbf{K}), \quad \text{almost surely}.\]
    import numpy as np
    from scipy.linalg import qr
    from dppy.finite_dpps import FiniteDPP
    
    rng = np.random.RandomState(1)
    
    r, N = 4, 10
    eig_vals = np.ones(r)  # projection kernel: all eigenvalues equal to 1
    eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
    
    DPP = FiniteDPP('correlation', projection=True,
                    **{'K_eig_dec': (eig_vals, eig_vecs)})
    
    for _ in range(1000):
        DPP.sample_exact()
    
    sizes = list(map(len, DPP.list_of_samples))
    # np.array(DPP.list_of_samples).shape = (1000, 4)
    
    assert np.mean(sizes) == r and np.var(sizes) == 0
    

    Important

    Since \(\mathcal{X}\) has \(|\mathcal{X}|=\operatorname{rank}(\mathbf{K})\) points almost surely, the likelihood of the projection \(\operatorname{DPP}(\mathbf{K})\) reads

    (13)\[\mathbb{P}[\mathcal{X}=S] = \det \mathbf{K}_S 1_{|S|=\operatorname{rank} \mathbf{K}}.\]

    In other words, the projection DPP whose correlation kernel is the orthogonal projection matrix \(\mathbf{K}\) coincides with the k-DPP with likelihood kernel \(\mathbf{K}\) when \(k=\operatorname{rank}(\mathbf{K})\).

  2. When the likelihood kernel \(\mathbf{L}\) of \(\operatorname{DPP}(\mathbf{L})\) (2) is an orthogonal projection kernel we have

    (14)\[|\mathcal{X}| \sim \operatorname{Binomial}(\operatorname{rank}(\mathbf{L}), 1/2).\]
    import numpy as np
    from scipy.stats import binom, chisquare
    from scipy.linalg import qr
    import matplotlib.pyplot as plt
    from dppy.finite_dpps import FiniteDPP
    
    
    r, N = 5, 10
    e_vals = np.ones(r)
    e_vecs, _ = qr(np.random.randn(N, r), mode='economic')
    
    dpp_L = FiniteDPP('likelihood',
                      projection=True,
                      **{'L_eig_dec': (e_vals, e_vecs)})
    
    nb_samples = 1000
    dpp_L.flush_samples()
    for _ in range(nb_samples):
        dpp_L.sample_exact()
    
    sizes = list(map(len, dpp_L.list_of_samples))
    
    p = 0.5 # binomial parameter
    rv = binom(r, p)
    
    fig, ax = plt.subplots(1, 1)
    
    x = np.arange(0, r + 1)
    
    pdf = rv.pmf(x)
    ax.plot(x, pdf,
            'ro', ms=8,
            label=r'pdf $Bin({}, {})$'.format(r, p))
    
    hist = np.histogram(sizes, bins=np.arange(0, r + 2), density=True)[0]
    ax.vlines(x, 0, hist,
              colors='b', lw=5, alpha=0.5,
              label='hist of sizes')
    
    ax.legend(loc='best', frameon=False)
    
    # chi-square test on counts (expected counts = nb_samples * pmf)
    plt.title('p_value = {:.3f}'.format(chisquare(nb_samples * hist, nb_samples * pdf)[1]))
    plt.show()
    


    Fig. 5 Distribution of the number of points of \(\operatorname{DPP}(\mathbf{L})\) with orthogonal projection kernel \(\mathbf{L}\) of rank \(5\).

Geometrical insights

Kernels satisfying the sufficient conditions (3) and (4) can be expressed as

\[\mathbf{K}_{ij} = \langle \phi_i, \phi_j \rangle \quad \text{and} \quad \mathbf{L}_{ij} = \langle \psi_i, \psi_j \rangle,\]

where each item is represented by a feature vector \(\phi_i\) (resp. \(\psi_i\)).

The geometrical view is then straightforward.

  1. The inclusion probabilities read

    \[\mathbb{P}[S\subset \mathcal{X}] = \det \mathbf{K}_S = \operatorname{Vol}^2 \{\phi_s\}_{s\in S}.\]
  2. The likelihood reads

    \[\mathbb{P}[\mathcal{X} = S] \propto \det \mathbf{L}_S = \operatorname{Vol}^2 \{\psi_s\}_{s\in S}.\]

That is to say, DPPs favor subsets \(S\) whose corresponding feature vectors span a large volume, i.e., DPPs sample softened orthogonal bases.
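The squared-volume interpretation can be checked on a small Gram-kernel example. This is a minimal sketch where the feature vectors \(\psi_i\) are random and the subset \(S\) is arbitrary:

import numpy as np

rng = np.random.RandomState(0)

N, d = 10, 6
Psi = rng.randn(d, N)   # columns are the feature vectors psi_i
L = Psi.T.dot(Psi)      # L_ij = <psi_i, psi_j>

S = [1, 4, 7]
# Volume of the parallelotope spanned by (psi_s): |det R| from a thin QR
_, R = np.linalg.qr(Psi[:, S])
vol_squared = np.linalg.det(R) ** 2

print(np.allclose(np.linalg.det(L[np.ix_(S, S)]), vol_squared))  # True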

Diversity

The determinantal structure of DPPs encodes the notion of diversity. Computing the pair inclusion probability, also called the 2-point correlation function, using (1), we obtain

\[\begin{split}\mathbb{P}[\{i, j\} \subset \mathcal{X}] &= \begin{vmatrix} \mathbb{P}[i \in \mathcal{X}] & \mathbf{K}_{i j}\\ \overline{\mathbf{K}_{i j}} & \mathbb{P}[j \in \mathcal{X}] \end{vmatrix}\\ &= \mathbb{P}[i \in \mathcal{X}] \mathbb{P}[j \in \mathcal{X}] - |\mathbf{K}_{i j}|^2,\end{split}\]

so that the larger \(|\mathbf{K}_{i j}|\), the less likely items \(i\) and \(j\) are to co-occur. If \(\mathbf{K}_{ij}\) models the similarity between items \(i\) and \(j\), DPPs thus sample random sets of diverse items.
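This repulsion can be observed empirically. The following Monte Carlo sketch (arbitrary random kernel, seed and sample budget chosen for illustration) compares the empirical pair inclusion frequency with \(\mathbf{K}_{ii}\mathbf{K}_{jj} - |\mathbf{K}_{ij}|^2\).

import numpy as np
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

r, N = 4, 10
eig_vals = rng.rand(r)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
K = (eig_vecs * eig_vals).dot(eig_vecs.T)

DPP = FiniteDPP('correlation', **{'K': K})
nb_samples = 5000
for _ in range(nb_samples):
    DPP.sample_exact(random_state=rng)

i, j = 0, 1
emp = np.mean([(i in s) and (j in s) for s in DPP.list_of_samples])
theo = K[i, i] * K[j, j] - K[i, j]**2
print('P[i,j both in X]: emp={:.3f}, theo={:.3f}'.format(emp, theo))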

Conditioning

Like many other statistics of DPPs, the conditional probabilities can be expressed by means of a determinant and involve the correlation kernel \(\mathbf{K}\) (1).

For any disjoint subsets \(S, T \subset [N]\), i.e., such that \(S\cap T = \emptyset\), we have

(15)\[\mathbb{P}[T \subset \mathcal{X} \mid S \subset \mathcal{X}] = \det\left[\mathbf{K}_T - \mathbf{K}_{TS} \mathbf{K}_S^{-1} \mathbf{K}_{ST}\right],\]
(16)\[\mathbb{P}[T \subset \mathcal{X} \mid S \cap \mathcal{X} = \emptyset] = \det\left[\mathbf{K}_T - \mathbf{K}_{TS} (\mathbf{K}_S - I)^{-1} \mathbf{K}_{ST}\right].\]
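Formula (15) can also be checked by Monte Carlo. In the sketch below (arbitrary kernel, sets \(S\), \(T\) and sample budget chosen for illustration), samples containing \(S\) are retained and the fraction also containing \(T\) is compared to the Schur-complement determinant.

import numpy as np
from scipy.linalg import qr
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(0)

r, N = 6, 10
eig_vals = rng.rand(r)
eig_vecs, _ = qr(rng.randn(N, r), mode='economic')
K = (eig_vecs * eig_vals).dot(eig_vecs.T)

DPP = FiniteDPP('correlation', **{'K': K})
for _ in range(5000):
    DPP.sample_exact(random_state=rng)

S, T = [0], [1, 2]
K_S = K[np.ix_(S, S)]
K_T = K[np.ix_(T, T)]
K_TS = K[np.ix_(T, S)]
theo = np.linalg.det(K_T - K_TS.dot(np.linalg.inv(K_S)).dot(K_TS.T))

with_S = [s for s in DPP.list_of_samples if set(S).issubset(s)]
emp = np.mean([set(T).issubset(s) for s in with_S])
print('P[T in X | S in X]: emp={:.3f}, theo={:.3f}'.format(emp, theo))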

See also