Chandler May,${}^{\u2020}$ Alex Clemmer${}^{\u2021}$
Benjamin Van Durme${}^{\u2020}$

${}^{\u2020}$Human Language Technology Center of Excellence

Johns Hopkins University

${}^{\u2021}$Microsoft

cjmay@jhu.edu, clemmer.alexander@gmail.com, vandurme@cs.jhu.edu

${}^{\u2020}$Human Language Technology Center of Excellence

Johns Hopkins University

${}^{\u2021}$Microsoft

cjmay@jhu.edu, clemmer.alexander@gmail.com, vandurme@cs.jhu.edu

Previous research has established several methods of *online*
learning for latent Dirichlet allocation (LDA). However,
*streaming* learning for LDA—allowing only one pass over the
data and constant storage complexity—is not as well explored. We
use reservoir sampling to reduce the storage complexity of a
previously-studied online algorithm, namely the particle filter, to
constant. We then show that a simpler particle filter
implementation performs just as well, and that the quality of the
initialization dominates other factors of performance.

We extend a popular model, latent Dirichlet allocation (LDA), to unbounded streams of documents. In order for inference to be practical in this setting it must use constant space asymptotically and run in pseudo-linear time, perhaps $O\left(n\right)$ or $O\left(n\mathrm{log}n\right)$.

Canini et al. (2009) presented a method for LDA inference based on particle filters, where a sample set of models is updated online with each new token observed from a stream. In general, these models should be regularly resampled and rejuvenated using Markov Chain Monte Carlo (MCMC) steps over the history in order to improve the efficiency of the particle filter [10]. The particle filter of Canini et al. (2009) rejuvenates over independent draws from the history by storing all past observations and states. This algorithm thus has linear storage complexity and is not an online learning algorithm in a strict sense [6].

In the current work we propose using reservoir sampling in the rejuvenation step to reduce the storage complexity of the particle filter to $O\left(1\right)$. This improvement is practically useful in the large-data setting and is also scientifically interesting in that it recovers some of the cognitive plausibility which originally motivated Börschinger and Johnson (2012). However, in experiments on the dataset studied by Canini et al. (2009), we show that rejuvenation does not benefit the particle filter’s performance. Rather, performance is dominated by the effects of random initialization (a problem for which we provide a correction while abiding by the same constraints as Canini et al. (2009)). This result re-opens the question of whether rejuvenation is of practical importance in online learning for static Bayesian models.

For a sequence of $N$ words collected into documents of varying length, we denote the $j$-th word as ${w}_{j}$, and the document it occurs in as ${d}_{i}$. LDA [3] “explains” the occurrence of each word by postulating that a document was generated by repeatedly: (1) sampling a topic $z$ from ${\theta}^{\left(d\right)}$, the document-specific mixture of $T$ topics, and (2) sampling a word $w$ from ${\varphi}^{\left(z\right)}$, the probability distribution the $z$-th topic defines over the vocabulary.

The goal is to infer $\theta $ and $\varphi $, under the model:

${w}_{i}\mid {z}_{i},{\varphi}^{\left({z}_{i}\right)}$ | $\sim \text{Categorical}\left({\varphi}^{\left({z}_{i}\right)}\right)$ | ||

${\varphi}^{\left(z\right)}$ | $\sim \text{Dirichlet}\left(\beta \right)$ | ||

${z}_{i}\mid {\theta}^{\left({d}_{i}\right)}$ | $\sim \text{Categorical}\left({\theta}^{\left({d}_{i}\right)}\right)$ | ||

${\theta}^{\left(d\right)}$ | $\sim \text{Dirichlet}\left(\alpha \right)$ |

Computing $\varphi $ and $\theta $ exactly is generally intractable, motivating methods for approximate inference such as variational Bayesian inference [3], expectation propagation [17], and collapsed Gibbs sampling [11].

A limitation of these techniques is they require multiple passes over the data to obtain good samples of $\varphi $ and $\theta $. This requirement makes them impractical when the corpus is too large to fit directly into memory and in particular when the corpus grows without bound. This motivates online learning techniques, including sampling-based methods [2, 7] and stochastic variational inference [12, 15, 13]. However, where these approaches generally assume the ability to draw independent samples from the full dataset, we consider the case when it is infeasible to access arbitrary elements from the history. The one existing algorithm that can be directly applied under this constraint, to our knowledge, is the streaming variational Bayes framework [4] in which the posterior is recursively updated as new data arrives using a variational approximation.

[t]

initialize weights ${\omega}_{0}^{\left(p\right)}=1/P$ for $p=1,\mathrm{\dots},P$

\For$i=1,\mathrm{\dots},N$ \For $p=1,\mathrm{\dots},P$ set ${\omega}_{i}^{\left(p\right)}={\omega}_{i-1}^{\left(p\right)}\mathbf{P}({w}_{i}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i-1})$

sample ${z}_{i}^{\left(p\right)}$ w.p. $\mathbf{P}({z}_{i}^{\left(p\right)}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i})$. \If${\parallel \omega \parallel}_{2}^{-2}\le $ ESS \For $j\in \mathcal{R}\left(i\right)$ \For $p=1,\mathrm{\dots},P$ sample ${z}_{j}^{\left(p\right)}$ w.p. $\mathbf{P}({z}_{j}^{\left(p\right)}\mid {\mathbf{z}}_{i\backslash j}^{\left(p\right)},{\mathbf{w}}_{i})$

set ${\omega}_{i}^{\left(p\right)}=1/P$ for each particle

Particle filters are a family of sequential Monte Carlo (SMC) sampling
algorithms designed to estimate the posterior distribution of a system
with dynamic state
[8]. A particle filter approximates the posterior by
a weighted
sample of points, or particles, from the state space.
The particle cloud is updated recursively
for each new observation using importance sampling (an approach
called *sequential importance sampling*).

Canini et al. (2009) apply this approach to LDA after analytically integrating out $\varphi $ and $\theta $, obtaining a Rao-Blackwellized particle filter [9] that estimates the collapsed posterior $\mathbf{P}(\mathbf{z}\mid \mathbf{w})$. In this setting, the $P$ particles are samples of the topic assignment vector ${\mathbf{z}}^{\left(p\right)}$, and they are propagated forward in state space one token at a time. In general, the larger $P$ is, the more accurately we approximate the posterior; for small $P$, the approximation of the tails of the posterior will be particularly poor [19]. However, a larger value of $P$ increases the runtime and storage requirements of the algorithm.

We now describe the Rao-Blackwellized particle filter for LDA in detail (pseudocode is given in Algorithm 3). At the moment token $i$ is observed, the particles form a discrete approximation of the posterior up to the $\left(i-1\right)$-th word:

$\mathbf{P}({\mathbf{z}}_{i-1}\mid {\mathbf{w}}_{i-1})\approx {\displaystyle \sum _{p}}{\omega}_{i-1}^{\left(p\right)}{I}_{{\mathbf{z}}_{i-1}}\left({\mathbf{z}}_{i-1}^{\left(p\right)}\right)$ |

where ${I}_{\mathbf{z}}\left({\mathbf{z}}^{\prime}\right)$ is the indicator function, evaluating to 1 if $\mathbf{z}={\mathbf{z}}^{\prime}$ and 0 otherwise. Now each particle $p$ is propagated forward by drawing a topic ${z}_{i}^{\left(p\right)}$ from the conditional posterior distribution $\mathbf{P}({z}_{i}^{\left(p\right)}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i})$ and scaling the particle weight by $\mathbf{P}({w}_{i}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i-1})$. The particle cloud now approximates the posterior up to the $i$-th word:

$\mathbf{P}({\mathbf{z}}_{i}\mid {\mathbf{w}}_{i})\approx {\displaystyle \sum _{p}}{\omega}_{i}^{\left(p\right)}{I}_{{\mathbf{z}}_{i}}\left({\mathbf{z}}_{i}^{\left(p\right)}\right).$ |

Dropping the superscript $\left(p\right)$ for notational convenience, the conditional posterior used in the propagation step is given by

$\mathbf{P}\left({z}_{i}\right|{\mathbf{z}}_{i-1},{\mathbf{w}}_{i})$ | $\propto \mathbf{P}({z}_{i},{w}_{i}\mid {\mathbf{z}}_{i-1},{\mathbf{w}}_{i-1})$ | ||

$={\displaystyle \frac{{n}_{{z}_{i},i\backslash i}^{\left({w}_{i}\right)}+\beta}{{n}_{{z}_{i},i\backslash i}^{(\cdot )}+W\beta}}{\displaystyle \frac{{n}_{{z}_{i},i\backslash i}^{\left({d}_{i}\right)}+\alpha}{{n}_{\cdot ,i\backslash i}^{\left({d}_{i}\right)}+T\alpha}}$ |

where ${n}_{{z}_{i},i\backslash i}^{\left({w}_{i}\right)}$ is the number of times word ${w}_{i}$ has been assigned topic ${z}_{i}$ so far, ${n}_{{z}_{i},i\backslash i}^{(\cdot )}$ is the number of times any word has been assigned topic ${z}_{i}$, ${n}_{{z}_{i},i\backslash i}^{\left({d}_{i}\right)}$ is the number of times topic ${z}_{i}$ has been assigned to any word in document ${d}_{i}$, and ${n}_{\cdot ,i\backslash i}^{\left({d}_{i}\right)}$ is the number of words observed in document ${d}_{i}$. The particle weights are scaled as

$\frac{{\omega}_{i}^{\left(p\right)}}{{\omega}_{i-1}^{\left(p\right)}}$ | $\propto {\displaystyle \frac{\mathbf{P}({w}_{i}\mid {\mathbf{z}}_{i}^{\left(p\right)},{\mathbf{w}}_{i})\mathbf{P}({z}_{i}^{\left(p\right)}\mid {\mathbf{z}}_{i-1}^{\left(p\right)})}{Q({z}_{i}^{\left(p\right)}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i})}}$ | ||

$=\mathbf{P}({w}_{i}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i-1})$ |

where $Q$ is the proposal distribution for the particle state transition; in our case,

$Q({z}_{i}^{\left(p\right)}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i})=\mathbf{P}({z}_{i}^{\left(p\right)}\mid {\mathbf{z}}_{i-1}^{\left(p\right)},{\mathbf{w}}_{i}),$ |

minimizing the variance of the importance weights conditioned on ${\mathbf{w}}_{i}$ and ${\mathbf{z}}_{i-1}$ [9].

Over time the particle weights tend to diverge. To combat this inefficiency, after every state transition we estimate the effective sample size (ESS) of the particle weights as ${\parallel {\omega}_{i}\parallel}_{2}^{-2}$ [14] and resample the particles when that estimate drops below a prespecified threshold. Several resampling strategies have been proposed [9]; we perform multinomial resampling as in Pitt and Shephard (1999) and Ahmed et al. (2011), treating the weights as unnormalized probability masses on the particles.

After resampling we are likely to have several copies of the same particle, yielding a degenerate approximation to the posterior. To reintroduce diversity to the particle cloud we take MCMC steps over a sequence of states from the history [9, 10]. We call the indices of these states the rejuvenation sequence, denoted $\mathcal{R}\left(i\right)$ [7]. The transition probability for a state $j\in \mathcal{R}\left(i\right)$ is given by

$\mathbf{P}({z}_{j}\mid {\mathbf{z}}_{N\backslash j},{\mathbf{w}}_{N})\propto {\displaystyle \frac{{n}_{{z}_{j},N\backslash j}^{\left({w}_{j}\right)}+\beta}{{n}_{{z}_{j},N\backslash j}^{(\cdot )}+W\beta}}{\displaystyle \frac{{n}_{{z}_{j},N\backslash j}^{\left({d}_{j}\right)}+\alpha}{{n}_{\cdot ,N\backslash j}^{\left({d}_{j}\right)}+T\alpha}}$ |

where subscript $N\backslash j$ denotes counts up to token $N$, excluding those for token $j$.

The rejuvenation sequence can be chosen by the practitioner. Choosing a long sequence (large $\left|\mathcal{R}\left(i\right)\right|$) may result in a more accurate posterior approximation but also increases runtime and storage requirements. The tokens in $\mathcal{R}\left(i\right)$ may be chosen uniformly at random from the history or under a biased scheme that favors recent observations. The particle filter studied empirically by Canini et al. (2009) stored the entire history, incurring linear storage complexity in the size of the stream. Ahmed et al. (2011) instead sampled ten documents from the most recent 1000, achieving constant storage complexity at the cost of a recency bias. If we want to fit a model to a long non-i.i.d. stream, we require an unbiased rejuvenation sequence as well as sub-linear storage complexity.

Reservoir sampling is a widely-used family of algorithms for choosing an array (“reservoir”) of $k$ items. The most common example, presented in Vitter (1985) as Algorithm R, chooses $k$ elements of a stream such that each possible subset of $k$ elements is equiprobable. This effects sampling $k$ items uniformly without replacement, using runtime $O\left(n\right)$ (constant per update) and storage $O\left(k\right)$.

Initialize $k$-element array $R$ Stream $S$ \For $i=1,\mathrm{\dots},k$ $R\left[i\right]\leftarrow S\left[i\right]$ \For $i=k+1,\mathrm{\dots},length\left(S\right)$ $j\leftarrow random\left(1,i\right)$ \If $j\le k$ $R\left[j\right]\leftarrow S\left[i\right]$

To ensure constant space over an unbounded stream, we draw the rejuvenation sequence $\mathcal{R}\left(i\right)$ uniformly from a reservoir. As each token of the training data is ingested by the particle filter, we decide to insert that token into the reservoir, or not, independent of the other tokens in the current document. Thus, at the end of step $i$ of the particle filter, each of the $i$ tokens seen so far in the training sequence has an equal probability of being in the reservoir, hence being selected for rejuvenation.

We evaluate our particle filter on three datasets studied in
Canini et al. (2009): diff3, rel3, and
sim3. Each of these datasets is a collection of posts under
three categories from the 20 Newsgroups
dataset.^{1}^{1}diff3: {rec.sport.baseball,
sci.space, alt.atheism}; rel3:
talk.politics.{misc, guns, mideast}; and sim3:
comp.{graphics, os.ms-windows.misc, windows.x}. We use
a 60% training/40% testing split of this data that is available online.^{2}^{2}http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz

We preprocess the data by splitting each line on non-alphabet characters, converting the resulting tokens to lower-case, and filtering out any tokens that appear in a list of common English stop words. In addition, we remove the header of every file and filter every line that does not contain a non-trailing space (which removes embedded ASCII-encoded attachments). Finally, we shuffle the order of the documents. After these steps, we compute the vocabulary for each dataset as the set of all non-singleton types in the training data augmented with a special out-of-vocabulary symbol.

During training we report the out-of-sample NMI, calculated by holding the word proportions $\varphi $ fixed, running five sweeps of collapsed Gibbs sampling on the test set, and computing the topic for each document as the topic assigned to the most tokens in that document. Two Gibbs sweeps have been shown to yield good performance in practice [23]; we increase the number of sweeps to five after inspecting the stability on our dataset. The variance of the particle filter is often large, so for each experiment we perform 30 runs and plot the mean NMI inside bands spanning one sample standard deviation in either direction.

Our first set of experiments has a similar
parameterization^{3}^{3}$T=3,\alpha =\beta =0.1,P=100,ess=20,\left|\mathcal{R}\left(i\right)\right|=30$
to the experiments of Canini et al. (2009) except we draw the rejuvenation sequence from a reservoir.
We initialize the particle filter with 200 Gibbs sweeps on the first 10% of each dataset. Then, for each dataset, for rejuvenation disabled, rejuvenation based on a reservoir of size 1000, and rejuvenation based on the entire history (in turn), we perform 30 runs of the particle filter from that fixed initial model. Our results (Figure 1) resemble those of Canini et al. (2009); we believe the discrepancies are mostly attributable to differences in preprocessing.

In these experiments, the initial model was not chosen arbitrarily. Rather, an initial model that yielded out-of-sample NMI close to the initial out-of-sample NMI scores reported in the previous study was chosen from a set of 100 candidates.

We now investigate the significance of the initial model selection step used in the previous experiments. We run a new set of experiments in which the reservoir size is held fixed at 1000 and the size of the initialization sample is varied. Specifically, we vary the size of the initialization sample, in documents, between zero (corresponding to no Gibbs initialization), 30, 100, and 300, and also perform a run of batch Gibbs sampling (with no particle filter). In each case, 2000 Gibbs sweeps are performed. In these experiments, the initial models are not held fixed; for each of the 30 runs for each dataset, the initial model was generated by a different Gibbs chain. The results for these experiments, depicted in Figure 2, indicate that the size of the initialization sample improves mean NMI and reduces variance, and that the variance of the particle filter itself is dominated by the variance of the initial model.

We observed previously that variance in the Gibbs initialization of the model contributes significantly to variance of the overall algorithm, as measured by NMI. With this in mind, we consider whether we can reduce variance in the initialization by tuning the initial model. Thus we perform a set of experiments in which we perform Gibbs initialization 20 times on the initialization set, setting the particle filter’s initial model to the model out of these 20 with the highest in-sample NMI. This procedure is performed independently for each run of the particle filter. We may not always have labeled data for initialization, so we also consider a variation in which Gibbs initialization is performed 20 times on the first 80% of the initialization sample, held-out perplexity (per word) is estimated on the remaining 20%, using a first-moment particle learning approximation [20], and the particle filter is started from the model out of these 20 with the lowest held-out perplexity. The results, shown in Figure 3, show that we can ameliorate the variance due to initialization by tuning the initial model to NMI or perplexity.

Motivated by a desire for cognitive plausibility, Börschinger and Johnson (2011) used a particle filter to learn Bayesian word segmentation models, following the work of Canini et al. (2009). They later showed that rejuvenation improved performance [6], but this impaired cognitive plausibility by necessitating storage of all previous states and observations. We attempted to correct this by drawing the rejuvenation sequence from a reservoir, but our results indicate that the particle filter for LDA on our dataset is highly sensitive to initialization and not influenced by rejuvenation.

In the experiments of Börschinger and Johnson (2012), the particle
cloud appears to be resampled once per utterance with a large
rejuvenation sequence;^{4}^{4}
The ESS threshold is $P$; the
rejuvenation sequence is 100 or 1600 utterances,
almost one sixth of the training data.
each particle takes many more rejuvenation MCMC steps than
new state transitions and thus resembles a batch MCMC sampler.
In our experiments resampling is done on the order of once
per document,
leading to less than one rejuvenation step per transition.
Future work should carefully note this ratio: sampling
history much more often than new states improves performance
but contradicts the intuition behind particle filters.

We have also shown that tuning the initial model using in-sample NMI or held-out perplexity can improve mean NMI and reduce variance. Perplexity (or likelihood) is often used to estimate model performance in LDA [3, 11, 22, 12], and does not compare the inferred model against gold-standard labels, yet it appears to be a good proxy for NMI in our experiment. Thus, if initialization continues to be crucial to performance, at least we may have the flexibility of initializing without gold-standard labels.

We have focused on NMI as our evaluation metric for comparison with Canini et al. (2009). However, evaluation of topic models is a subject of considerable debate [22, 23, 18, 16] and it may be informative to investigate the effects of initialization and rejuvenation using other metrics such as perplexity or semantic coherence.

We have proposed reservoir sampling for reducing the storage complexity of a particle filter from linear to constant. This work was motivated as an expected improvement on the model of Canini et al. (2009). However, in the process of establishing an empirical baseline we discovered that rejuvenation does not play a significant role in the experiments of Canini et al. (2009). Moreover, we found that performance of the particle filter was strongly affected by the random initialization of the model, and suggested a simple approach to reduce the variability therein without using additional data. In conclusion, it is now an open question whether—and if so, under what assumptions—rejuvenation benefits particle filters for LDA and similar static Bayesian models.

We thank Frank Ferraro, Keith Levin, and Mark Dredze for discussions.

- [1] A. Ahmed, Q. Ho, J. Eisenstein, E. P. Xing, A. J. Smola and C. H. Teo(2011) Unified analysis of streaming news. pp. 267–276. Cited by: 3, 3.
- [2] A. Banerjee and S. Basu(2007) Topic models over text streams: a study of batch and online unsupervised learning. pp. 431–436. Cited by: 2.
- [3] D. M. Blei, A. Y. Ng and M. I. Jordan(2003-01) Latent Dirichlet allocation. Journal of Machine Learning Research 3, pp. 993–1022. Cited by: 2, 2, 6.
- [4] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson and M. I. Jordan(2013) Streaming variational Bayes. Cited by: 2.
- [5] B. Börschinger and M. Johnson(2011) A particle filter algorithm for Bayesian wordsegmentation. pp. 10–18. Cited by: 6.
- [6] B. Börschinger and M. Johnson(2012) Using rejuvenation to improve particle filtering for Bayesian word segmentation. pp. 85–89. Cited by: 1, 1, 6, 6.
- [7] K. R. Canini, L. Shi and T. L. Griffiths(2009) Online inference of topics with latent Dirichlet allocation. Cited by: 1, 1, 2, 3, 3, 3, 5, 5, 6, 6, 7.
- [8] A. Doucet, N. de Freitas and N. Gordon (Eds.)(2001) Sequential Monte Carlo methods in practice. Springer, New York. Cited by: 3.
- [9] A. Doucet, N. de Freitas, K. Murphy and S. Russell(2000) Rao-Blackwellised particle filtering for dynamic Bayesian networks. pp. 176–183. Cited by: 3, 3, 3, 3.
- [10] W. R. Gilks and C. Berzuini(2001) Following a moving target—Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society 63 (1), pp. 127–146. Cited by: 1, 3.
- [11] T. L. Griffiths and M. Steyvers(2004-04) Finding scientific topics. Proceedings of the National Academy of Sciences 101 (suppl 1), pp. 5228–5235. Cited by: 2, 6.
- [12] M. D. Hoffman, D. M. Blei and F. Bach(2010) Online learning for latent Dirichlet allocation. Cited by: 2, 6.
- [13] M. D. Hoffman, D. M. Blei, C. Wang and J. Paisley(2013-05) Stochastic variational inference. Journal of Machine Learning Research 14, pp. 1303–1347. Cited by: 2.
- [14] J. S. Liu and R. Chen(1998-09) Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93 (443), pp. 1032–1044. Cited by: 3.
- [15] D. Mimno, M. D. Hoffman and D. M. Blei(2012) Sparse stochastic inference for latent Dirichlet allocation. Cited by: 2.
- [16] D. Mimno, H. M. Wallach, E. Talley, M. Leenders and A. McCallum(2011) Optimizing semantic coherence in topic models. pp. 262–272. Cited by: 6.
- [17] T. Minka and J. Lafferty(2002) Expectation-propagation for the generative aspect model. pp. 352–359. Cited by: 2.
- [18] D. Newman, J. H. Lau, K. Grieser and T. Baldwin(2010) Automatic evaluation of topic coherence. pp. 100–108. Cited by: 6.
- [19] M. K. Pitt and N. Shephard(1999-06) Filtering via simulation: auxiliary particle filters. Journal of the American Statistical Association 94 (446), pp. 590–599. Cited by: 3, 3.
- [20] J. G. Scott and J. Baldridge(2013) A recursive estimate for the predictive likelihood in a topic model. Cited by: 5.
- [21] J. S. Vitter(1985-03) Random sampling with a reservoir. ACM Transactions on Mathematical Software 11 (1), pp. 37–57. Cited by: 4.
- [22] H. M. Wallach, I. Murray, R. Salakhutdinov and D. Mimno(2009) Evaluation methods for topic models. Cited by: 6, 6.
- [23] L. Yao, D. Mimno and A. McCallum(2009) Efficient methods for topic model inference on streaming document collections. pp. 937–946. Cited by: 5, 6.

Generated on Wed Jun 11 17:56:33 2014 by LaTeXML