Basic Reading About LDA (Latent Dirichlet allocation)

Mohammad Anisul Islam
3 min read · Apr 25, 2021

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
LDA assumes the following generative process for each document w in a corpus D (a short simulation sketch follows the list).

  1. Select N ∼ Poisson(ξ).
  2. Select θ ∼ Dir(α).
  3. For each of the N words wn:
    (a) Select a topic zn ∼ Multinomial(θ).
    (b) Select a word wn from p(wn | zn, β), a multinomial probability conditioned on the
    topic zn.
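To make this generative process concrete, here is a minimal simulation sketch in Python/NumPy. The values of k, the vocabulary size V, ξ, and α are toy choices for illustration, and words are represented as vocabulary indices rather than actual terms:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3          # number of topics (assumed known and fixed)
V = 10         # vocabulary size (toy value)
xi = 8         # Poisson parameter for the document length
alpha = np.full(k, 0.5)                   # Dirichlet parameter (toy value)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word probabilities

# 1. Select the document length N ~ Poisson(xi)
N = rng.poisson(xi)

# 2. Select the topic mixture theta ~ Dir(alpha)
theta = rng.dirichlet(alpha)

document = []
for _ in range(N):
    # 3(a). Select a topic z_n ~ Multinomial(theta)
    z_n = rng.choice(k, p=theta)
    # 3(b). Select a word w_n from p(w_n | z_n, beta)
    w_n = rng.choice(V, p=beta[z_n])
    document.append(w_n)

print("theta:", np.round(theta, 3))
print("document (word indices):", document)
```

Running the sketch several times shows how each document gets its own topic mixture θ, while the topic-word probabilities β stay fixed across documents.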

Several simplifying assumptions are made in this basic model, some of which are relaxed in later extensions. First, the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed to be known and fixed. Second, the word probabilities are parameterized by a k × V matrix β, where βij = p(wj = 1 | zi = 1), which is treated for now as a fixed quantity to be estimated.

Finally, the Poisson assumption is not critical to anything that follows, and more realistic document-length distributions can be used as needed. Furthermore, note that N is independent of all the other data-generating variables (θ and z).

It is therefore an ancillary variable, and we can generally ignore its randomness in the subsequent development. A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex (a k-vector θ lies in the (k−1)-simplex if θi ≥ 0 and ∑ i=1..k θi = 1), and has the following probability density on this simplex:

p(θ | α) = ( Γ(α1 + … + αk) / ( Γ(α1) … Γ(αk) ) ) θ1^(α1 − 1) … θk^(αk − 1),

where the parameter α is a k-vector with components αi > 0 and Γ(x) is the Gamma function.

Finally, marginalizing over θ and z for each document and taking the product over the documents in the corpus, the probability of a corpus D is:

p(D | α, β) = ∏d ∫ p(θd | α) ( ∏n ∑ zdn p(zdn | θd) p(wdn | zdn, β) ) dθd,

where the outer product runs over the M documents of the corpus and the inner product over the Nd words of document d.

The figure above represents the LDA model as a probabilistic graphical model in plate notation. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus.

The variables θd are document-level variables, sampled once per document. Finally, the variables zdn and wdn are word-level variables, sampled once for each word in each document.

It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A classical clustering approach would involve a two-level model in which a Dirichlet is sampled once for the corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words is selected for the document conditional on the cluster variable.

As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, by contrast, involves three levels, and notably the topic node is sampled repeatedly within each document. Under this model, documents can be associated with multiple topics.
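This difference is easy to see in practice. The sketch below (assuming scikit-learn is available; the corpus and the choice of k = 2 topics are toy values) fits LDA and prints each document's topic mixture θd, which can spread its weight over several topics instead of committing to a single cluster:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: each document can mix several themes.
docs = [
    "the cat sat on the mat with another cat",
    "stocks and bonds fell as markets closed lower",
    "the cat watched the stock ticker on the screen",
]

# Bag-of-words counts (the word-level observations w_dn).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with k = 2 topics (k is assumed known and fixed).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Per-document topic mixtures theta_d: unlike a single-topic
# clustering model, each row can put weight on several topics.
theta = lda.transform(X)
for d, mix in enumerate(theta):
    print(f"doc {d}: topic mixture = {mix.round(2)}")
```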

Structures of this kind are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995) or, more precisely, as conditionally independent hierarchical models (Kass and Steffey, 1989). Such models are also often referred to as parametric empirical Bayes models, a term that refers not only to a particular model structure but also to the methods used for estimating parameters in the model (Morris, 1983). Indeed, an empirical Bayes approach is used to estimate parameters such as α and β in simple implementations of LDA, while fuller Bayesian approaches are also being explored.

This model is used for information classification and retrieval [J. Cao, T. Xia, J. Li, Y. Zhang and S. Tang, "A density-based method for adaptive LDA model selection," Neurocomputing 72 (2009), 1775–1781].
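As a rough illustration of the retrieval use case (this is not the method of the cited paper; the corpus, query, and number of topics are made up for the example), documents can be ranked against a query by comparing their topic mixtures rather than raw word counts:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "cats and dogs are popular pets",
    "the stock market rallied on strong earnings",
    "dog owners walked their pets in the park",
]
query = ["which pets are easiest to keep"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta_docs = lda.fit_transform(X)                          # topic mixtures for the corpus
theta_query = lda.transform(vectorizer.transform(query))   # topic mixture for the query

# Rank documents by cosine similarity of topic mixtures to the query.
scores = cosine_similarity(theta_query, theta_docs)[0]
for d in np.argsort(scores)[::-1]:
    print(f"doc {d}: score = {scores[d]:.2f}")
```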
