Chapter 3 Dirichlet Process
Given a base distribution \(G_0\) over a measurable space \(\Theta\) and a concentration parameter \(\alpha>0\), we can draw a random probability measure \(G\) from a Dirichlet process
\[G\sim DP(\alpha,G_0).\]
The Dirichlet process is a stochastic process, i.e., a distribution over a space of functions. In particular, it is a distribution over probability measures (for example, probability distributions on \(\mathbb{R}^D\)). Even if \(G_0\) is continuous, each realization \(G\) is almost surely a discrete probability distribution. Moreover, its finite-dimensional marginals follow Dirichlet distributions. That is, if \(B_1,B_2,\ldots, B_r\) forms a finite measurable partition of \(\Theta\), then (Teh, 2011)
\[\left( G(B_1),G(B_2),\ldots, G(B_r)\right)\sim Dir\left(\alpha G_0(B_1),\alpha G_0(B_2),\ldots, \alpha G_0(B_r)\right)\] Furthermore, we can show the following properties:
For any measurable set \(B\subset \Theta\)
\[\text{(i)}\qquad E[G(B)] = G_0(B) \] and
\[\text{(ii)}\qquad Var[G(B)] = \frac{G_0(B)(1-G_0(B))}{\alpha+1} \]
Property (ii) shows that, for a fixed base distribution \(G_0\), the variance of \(G(B)\) decreases as \(\alpha\) grows, so draws of \(G\) concentrate more tightly around \(G_0\). Consequently, when the Dirichlet process is used as a prior in a Bayesian model, a larger \(\alpha\) corresponds to a stronger prior.
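Both properties follow from the Dirichlet marginals above. Taking the two-set partition \(\{B,\Theta\setminus B\}\) and writing \(p=G_0(B)\), the marginal of \(G(B)\) is a Beta distribution,
\[G(B)\sim Beta\big(\alpha p,\ \alpha(1-p)\big),\]
whose mean and variance give
\[E[G(B)] = \frac{\alpha p}{\alpha p+\alpha(1-p)} = p,\qquad Var[G(B)] = \frac{\alpha p\cdot \alpha(1-p)}{\alpha^2(\alpha+1)} = \frac{p(1-p)}{\alpha+1}.\]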
3.1 Dirichlet Process Mixture Model (DPMM)
We can use the Dirichlet process to build a mixture model that allows infinitely many components, i.e., the Dirichlet process mixture model (DPMM). Figure 3.1 is a visualization of a typical DPMM. The observed data, \(x_1,x_2,\ldots,x_n\), are modeled through latent parameters \(\theta_1,\theta_2,\ldots,\theta_n\), which are drawn i.i.d. from the distribution \(G\) and associated with a kernel \(F\): each \(x_k\) then follows the distribution \(F(\theta_k)\). Moreover, since \(G\) is a discrete distribution, it is possible that \(\theta_i=\theta_j\), in which case \(x_i\) and \(x_j\) come from the same component. The mixing proportions of \(G\), \(\pi_1,\pi_2,\ldots\), are drawn from a GEM (stick-breaking) distribution with parameter \(\alpha\). Therefore, we can alternatively write \(G\) as
\[G = \sum_{k=1}^\infty \pi_k\delta_{\theta^*_k},\qquad \theta^*_k\overset{\text{i.i.d.}}{\sim}G_0\]
Here, \(\delta_{\theta^*_k}\) is a point mass at the atom \(\theta^*_k\). Alternatively, we can view it as an indicator of whether a data point's parameter equals \(\theta^*_k\), i.e., whether the data point belongs to the \(k^{th}\) component.
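As a quick illustration of the GEM (stick-breaking) construction, here is a minimal sketch, assuming the standard stick-breaking representation \(\pi_k = V_k\prod_{j<k}(1-V_j)\) with \(V_k\sim Beta(1,\alpha)\); the truncation level K and the function name gem_weights are illustrative choices, not part of the model.

# Stick-breaking construction of GEM(alpha) weights, truncated at K components
gem_weights <- function(alpha, K) {
  V <- rbeta(K, 1, alpha)           # stick-breaking proportions V_k ~ Beta(1, alpha)
  V * cumprod(c(1, 1 - V))[1:K]     # pi_k = V_k * prod_{j < k} (1 - V_j)
}
set.seed(1)
round(gem_weights(alpha = 2, K = 10), 3)

Because each break removes a fraction of the remaining stick, the weights decay rapidly in expectation, so only a few components carry most of the mass.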
Example
We illustrate this idea with a Gaussian mixture model in which every component shares a known standard deviation \(\sigma\). In this case, the component parameter \(\theta^*_k\) is simply the cluster center \(\mu_k\), and each \(\mu_k\) is drawn from the base distribution \(G_0=Normal(\mu_0,\sigma_0)\). With cluster assignments \(z_k\) for the data points \(x_k\), \(k=1,2,\ldots,n\), the model is
\[x_k\mid \left(z_k,\left\{\mu_i\right\}\right) \sim Normal(\mu_{z_k},\sigma)\] \[z_k\mid\pi \overset{\text{i.i.d.}}{\sim} Mult(\pi)\]
\[\mu_k \overset{\text{i.i.d.}}{\sim} Normal(\mu_0,\sigma_0)\]
\[\pi=\left(\pi_1,\pi_2,\ldots\right),\qquad \pi\mid\alpha\sim GEM(\alpha) \] This can be summarized by the Dirichlet process model:
\[G = \sum_{k=1}^\infty \pi_k\delta_{\mu_k}\sim DP(\alpha,Normal(\mu_0,\sigma_0))\]
The following video illustrates the Dirichlet process mixture model as a generative model; the code behind the simulation appears in Section 3.4.1.
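For a static, non-interactive counterpart to that simulation, the sketch below draws \(n\) points from the Gaussian DPMM above by truncating the stick-breaking representation; the truncation level K, the default parameter values, and the function name rdpmm are illustrative choices rather than part of the model.

# Draw n points from a 1-D Gaussian DPMM, truncating the stick-breaking at K
rdpmm <- function(n, alpha = 2, mu0 = 0, sig_0 = 1.5, sig = 0.3, K = 50) {
  V  <- rbeta(K, 1, alpha)
  pi <- V * cumprod(c(1, 1 - V))[1:K]            # GEM(alpha) weights (truncated)
  pi <- pi / sum(pi)                             # renormalize the truncated weights
  mu <- rnorm(K, mu0, sig_0)                     # cluster centers mu_k ~ G_0
  z  <- sample(K, n, replace = TRUE, prob = pi)  # cluster assignments z_k ~ Mult(pi)
  x  <- rnorm(n, mu[z], sig)                     # observations x_k ~ Normal(mu_{z_k}, sigma)
  list(x = x, z = z, mu = mu, pi = pi)
}
set.seed(1)
draw <- rdpmm(200)
table(draw$z)   # a few large clusters and many small ones

The truncation at K components is only a computational convenience; with K large relative to \(\alpha\), the discarded stick mass is negligible.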
3.2 A Bayesian Perspective
The flexible nature of the Dirichlet process makes it one of the most widely used Bayesian non-parametric models. Unlike a parametric prior, which restricts the scope of inferences, a non-parametric prior has wide support (Teh, 2011). Moreover, the Dirichlet-multinomial conjugacy extends naturally to the Dirichlet process.
Given observed values \(\theta_1,\theta_2,\ldots, \theta_n\) taking values in \(\Theta\), let \(B_1,B_2,\ldots,B_r\) be a finite partition of \(\Theta\) and define \(n_k = \sum_{i=1}^n\delta_{\theta_i}(B_k)=|\{i:\theta_i\in B_k\}|\). We have
\[\left( G(B_1),G(B_2),\ldots, G(B_r)\right)\mid\theta_1,\theta_2,\ldots,\theta_n\sim Dir\left(\alpha G_0(B_1)+n_1,\alpha G_0(B_2)+n_2,\ldots, \alpha G_0(B_r)+n_r\right)\]
That is, the posterior distribution of \(G\) is also a Dirichlet process. Furthermore, we can show the posterior Dirichlet process has an updated concentration parameter \(\alpha+n\) and an updated base distribution \(\frac{\alpha G_0+\sum_{i=1}^n\delta_{\theta_i}}{\alpha+n} = \frac{\alpha}{\alpha+n}G_0+\frac{n}{\alpha+n}\frac{\sum_{i=1}^n\delta_{\theta_i}}{n}\) (Teh, 2011):
\[G\mid \left(\theta_1,\theta_2,\ldots,\theta_n\right)\sim DP\left(\alpha+n,\ \frac{\alpha}{\alpha+n}G_0+\frac{n}{\alpha+n}\frac{\sum_{i=1}^n\delta_{\theta_i}}{n}\right) \] Thus, we can view the posterior base distribution as a weighted average of the prior base distribution \(G_0\) and the empirical distribution \(\frac{\sum_{i=1}^n\delta_{\theta_i}}{n}\). On one hand, the smaller \(\alpha\) is, the smaller the contribution of \(G_0\) to the posterior base distribution; on the other hand, a large number of observations \(n\) makes the empirical distribution dominate the posterior.
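As a small numerical illustration of this update, the sketch below samples from the posterior base distribution given a handful of observations; the observed values theta_obs and the helper name rpost_base are made up for the example, and \(G_0\) is taken to be \(Normal(\mu_0,\sigma_0)\) as in Section 3.1.

# Sample from the posterior base distribution
#   (alpha/(alpha+n)) * G_0 + (n/(alpha+n)) * empirical distribution of theta
# with G_0 = Normal(mu0, sig_0); the observed theta values are hypothetical.
rpost_base <- function(m, theta, alpha = 2, mu0 = 0, sig_0 = 1.5) {
  n <- length(theta)
  from_prior <- runif(m) < alpha / (alpha + n)       # choose the G_0 part vs the empirical part
  out <- numeric(m)
  out[from_prior]  <- rnorm(sum(from_prior), mu0, sig_0)
  out[!from_prior] <- sample(theta, sum(!from_prior), replace = TRUE)
  out
}
set.seed(1)
theta_obs <- c(-1.2, -1.1, 0.4, 2.0, 2.1, 2.2)       # hypothetical observations
draws <- rpost_base(1000, theta_obs)
# the posterior concentration parameter would be alpha + n = 2 + 6 = 8

With only six observations, roughly a quarter of the posterior draws still come from \(G_0\); as \(n\) grows, the empirical part takes over, matching the weighted-average interpretation above.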
3.3 Predictive Distribution
The posterior base distribution given observed \(\theta_k\) for \(k=1,2,\ldots,n\) is also the posterior predictive distribution for \(\theta_{n+1}\).
Let \(A\subset \Theta\). Since \(\theta_{n+1}\mid G,\theta_1,\theta_2,\ldots,\theta_n\sim G\), we have
\[\begin{align*} P(\theta_{n+1}\in A\mid \theta_1,\theta_2,\ldots,\theta_n)= &\int P(\theta_{n+1}\in A\mid G)\,dP(G\mid \theta_1,\theta_2,\ldots,\theta_n)\\ =&\int G(A)\,dP(G\mid \theta_1,\theta_2,\ldots,\theta_n)\\ =& E[G(A)\mid \theta_1,\theta_2,\ldots,\theta_n]\\ =& \frac{1}{\alpha+n}\left(\alpha G_0(A)+\sum_{i=1}^n\delta_{\theta_i}(A)\right) \end{align*}\]
That is (Teh, 2011),
\[\theta_{n+1}\mid \theta_1,\theta_2,\ldots,\theta_n\sim \frac{1}{\alpha+n}\left(\alpha G_0+\sum_{i=1}^n\delta_{\theta_i}\right),\]
which can be described by the Blackwell-MacQueen urn scheme.
3.3.1 Blackwell-MacQueen Urn Scheme
The Blackwell-MacQueen urn scheme is a generalization of the Polya urn scheme described in Chapter 2, allowing infinitely many colors. Specifically (Blackwell and MacQueen, 1973),
- We assume the base distribution \(G_0\) is a distribution over colors, and each draw \(\theta_k\) represents a color.
- We start with an empty urn.
- At step \(n+1\), when the urn contains \(n\) balls:
  - with probability \(\frac{\alpha}{\alpha+n}\), we pick a new color from the distribution \(G_0\) over the color space \(\Theta\), paint a new ball with this color, and put this ball into the urn;
  - with probability \(\frac{n}{\alpha+n}\), we pick a random ball from the urn, record its color, paint a new ball with this existing color, and put both the drawn ball and the new ball back into the urn.
- In particular, in the second case, the probability of drawing a ball with color \(\theta^*_k\) from the urn is proportional to \(n_k\), the number of balls of color \(\theta^*_k\) already in the urn.
The last observation above reflects a "rich-gets-richer" phenomenon: if we treat balls of the same color as a cluster, then larger clusters tend to grow faster. Suppose the observed \(\theta_1,\theta_2,\ldots, \theta_n\) take \(m\) distinct values \(\theta^*_1,\ldots,\theta^*_m\) (the clusters), with \(n_k\) observations equal to \(\theta^*_k\). To reflect this view, we can also write the posterior predictive distribution as
\[\theta_{n+1}\mid \theta_1,\theta_2,\ldots,\theta_n\sim \frac{1}{\alpha+n}\left(\alpha G_0+\sum_{k=1}^m n_k\delta_{\theta^*_k}\right)\]
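The sketch below simulates this urn scheme (equivalently, sequential sampling from the predictive distribution) with a Normal base distribution; the function name bm_urn and the parameter values are illustrative choices.

# Blackwell-MacQueen urn scheme: sequentially draw theta_1, ..., theta_n from
# the predictive distribution (alpha*G_0 + sum_i delta_{theta_i}) / (alpha + i - 1),
# with G_0 = Normal(mu0, sig_0).
bm_urn <- function(n, alpha = 2, mu0 = 0, sig_0 = 1.5) {
  theta <- numeric(n)
  for (i in 1:n) {
    if (runif(1) < alpha / (alpha + i - 1)) {
      theta[i] <- rnorm(1, mu0, sig_0)           # new color drawn from G_0
    } else {
      theta[i] <- theta[sample.int(i - 1, 1)]    # existing color; prob. proportional to n_k
    }
  }
  theta
}
set.seed(1)
theta <- bm_urn(100)
head(sort(table(theta), decreasing = TRUE), 5)   # a few large clusters dominate ("rich gets richer")

Picking a ball uniformly at random is what makes an existing color \(\theta^*_k\) reappear with probability proportional to its current count \(n_k\), which produces the cluster-size imbalance seen in the table.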
3.4 Appendix
3.4.1 Simulation Codes
The simulation for Section 3.1 is based on the code (Broderick, n.d.) below:
maxiters = 100

dpmm_sample <- function(
  alpha = 2,
  mu0 = 0,
  sig_0 = 1.5,
  sig = 0.3
) {
  # Args:
  #   alpha: DP-stick-breaking/GEM parameter
  #   mu0: normal mean for cluster param prior (repeated twice)
  #   sig_0: normal sd for cluster param prior (covar is diag with this value)
  #   sig: normal sd for points in a cluster (covar is diag with this value)

  # some colors for the points in a cluster
  palette(rainbow(10))

  # set up the figure into two sized figures
  par(mar = rep(2,4))
  layout(matrix(c(1,2), 2, 1, byrow=TRUE),
         heights=c(1,4)
  )

  # initializations
  rho = c()        # GEM-generated probabilities
  rhosum = 0       # sum of probabilities generated so far
  csum = c(0)      # cumulative sums of probs gen'd so far
  bar_colors = c() # colors to distinguish clusters
  x = c()          # points generated from the DPMM
  z = c()          # cluster assignments generated in the DPMM
  mu = c()         # cluster means generated in the DPMM
  N = 0            # number of data points generated so far
  newN = 0         # number of data points to generate next round (user input)

  for(iter in 1:maxiters) {
    print(iter)

    # generate new DPMM data points
    if(newN > 0) {
      for(n in 1:newN) {
        # uniform draw that decides which component is chosen
        u = runif(1)

        # instantiate components until the chosen one is reached
        while(rhosum < u) {
          # beta stick-break
          V = rbeta(1,1,alpha)
          # mass resulting from stick-break
          newrho = (1-rhosum)*V

          # update list of instantiated probabilities
          rho = rbind(rho,newrho)
          rhosum = rhosum + newrho
          csum = c(csum, rhosum)

          # fill in bars on the plot for instantiated probabilities
          bar_colors = c(bar_colors,"grey")

          # generate new cluster means for instantiated probabilities
          newmu = rnorm(2,mu0,sig_0)
          mu = rbind(mu, newmu)
        }
        # decide which cluster was chosen for this data point
        thisz = max(which(csum < u))
        z = c(z,thisz)
        # generate a data point in this cluster
        thismu = mu[thisz,]
        newx = rnorm(cbind(1,1),thismu,cbind(sig,sig))
        x = rbind(x,newx)
      }
    }
    # record correct number of data points
    N = N + newN

    # plot the component probabilities instantiated so far
    barplot(rbind(rho,1-rhosum),
            beside=FALSE,
            horiz=TRUE,
            col=c(bar_colors, "white"), # remaining mass in (0,1)
            ylim=c(0,1),
            width=0.7,
            main=bquote(rho~"~GEM("~.(alpha)~")")
    )
    # after initial display, plot how the next component is chosen
    if(N > 0) {
      points(u, 1, pch=25, col="red", bg="red")
    }

    # just gets the labels and figure limits right; real plot still to come
    plot(x,
         pch=".",
         xlim=c(-5,5),
         ylim=c(-5,5),
         main=paste("N = ", toString(N),
                    ", #means = ", toString(length(rho)),
                    ", #clust = ", toString(length(unique(z))),sep="")
    )
    # plot all the instantiated means
    points(mu,
           pch=15,
           col="black"
    )
    # plot the data points generated from the DPMM thus far
    points(x,
           pch=19,
           col=z
    )

    # Generate one new draw for each press of "enter".
    # Writing a number generates that many new samples.
    # Press 'x' when finished
    line <- readline()
    if(line == "x") {
      return("done")
    } else if(line == "") {
      newN = 1
    } else {
      newN = as.numeric(line)
    }
  }
}

dpmm_sample()
References
Blackwell, D., and MacQueen, J. B. (1973). Ferguson Distributions via Polya Urn Schemes. The Annals of Statistics, 1, 353-355.
Broderick, T. (n.d.). Tutorials on Bayesian Nonparametrics. Retrieved November 13, 2021, from https://tamarabroderick.com/tutorials.html.
El-Arini, K. (2008). Dirichlet Processes: A Gentle Tutorial. Select Lab Meeting.
Teh, Y. W. (2011). Dirichlet Process. In: Sammut C., Webb G. I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_219