Language Models -1

Imagine we have a corpus of English text that we use as training data, and we would like to assign a probability to a given document: how likely is it to be valid English text? A language model assigns a probability to the document by assuming a probability distribution on the set of all documents. The distribution is assumed to have a fixed form (shape) with unknown parameters \theta.

Unigram Language Model: In this model we assume that the document is a sequence of words, where the number of words is a random variable and the words at each position are INDEPENDENT random variables: \displaystyle P({\bf W}) = P(N) P(W_1)P(W_2)\cdots P(W_N). Assume a model with P(W = w) = \theta_w, for some parameters \theta_w. Write down the likelihood of the corpus given the parameters, and maximize this likelihood to find a choice of the parameters. In this case, the maximum likelihood estimates turn out to be exactly what we expect: \theta_w is the fraction of occurrences of the word w in the corpus.
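To make this concrete, here is a minimal sketch in Python of the maximum likelihood estimate for the unigram parameters; the toy corpus and the function name `unigram_mle` are illustrative, not part of the post.

```python
# Sketch: maximum-likelihood estimation of the unigram parameters theta_w.
from collections import Counter

def unigram_mle(corpus_tokens):
    """theta_w = count(w) / total number of tokens in the corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

corpus = "the cat sat on the mat the cat slept".split()  # illustrative corpus
theta = unigram_mle(corpus)
print(theta["the"])  # 3/9 = 0.333..., the fraction of tokens equal to "the"
```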

Smoothing: This model has many issues. Apart from the very simple assumptions about document structure and the independence between words, this model assigns zero probability to words not contained in the corpus, and hence zero probability to any document containing such a word. One way to deal with this is the idea of smoothing: introduce more parameters to ensure the probabilities are positive. How do we determine these parameters? If we just use maximum likelihood on the corpus, we get the same answer as before and we haven’t solved anything. So we determine these parameters by training (maximum likelihood) on a separate batch of “held-out” training documents.
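As an illustration, here is a sketch of one common scheme, add-\lambda (additive) smoothing, with the smoothing parameter \lambda chosen by likelihood on a held-out batch. The corpus, vocabulary, and function names are made up for the example, and this is only one possible way to smooth.

```python
# Sketch: add-lambda smoothing, with lambda tuned on held-out data.
import math
from collections import Counter

def add_lambda_probs(train_tokens, vocab, lam):
    """P(w) = (count(w) + lambda) / (N + lambda * |V|): every word in the
    vocabulary gets positive probability, even if unseen in training."""
    counts = Counter(train_tokens)
    denom = len(train_tokens) + lam * len(vocab)
    return {w: (counts[w] + lam) / denom for w in vocab}

def held_out_log_likelihood(probs, held_out_tokens):
    return sum(math.log(probs[w]) for w in held_out_tokens)

train = "the cat sat on the mat".split()   # illustrative training corpus
held_out = "the dog sat".split()           # illustrative held-out batch
vocab = set(train) | set(held_out)

# Choose the smoothing parameter by maximum likelihood on the held-out data,
# not on the training corpus (on the training corpus the best lambda is 0,
# which just reproduces the unsmoothed estimates).
best_lam = max([0.01, 0.1, 0.5, 1.0, 2.0],
               key=lambda lam: held_out_log_likelihood(
                   add_lambda_probs(train, vocab, lam), held_out))
print(best_lam)
```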

n-gram models: In these models, we assume a Markov chain structure on the sequence of words. The probability of a word at a particular position depends only on the few words occurring before it. These models also come with parameters corresponding to the transition probabilities. The maximum likelihood estimates give the obvious choices, and we again need to smooth using a separate batch of training data.
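For instance, a bigram model (the simplest n-gram model) conditions each word only on the previous one. Here is a sketch of its maximum likelihood transition probabilities, again with an illustrative toy corpus:

```python
# Sketch: bigram (first-order Markov) model with maximum-likelihood
# transition probabilities P(w_i | w_{i-1}).
from collections import Counter

def bigram_mle(tokens):
    """P(w | prev) = count(prev, w) / count(prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {(prev, w): c / prev_counts[prev]
            for (prev, w), c in pair_counts.items()}

tokens = "<s> the cat sat on the mat </s>".split()  # illustrative corpus
trans = bigram_mle(tokens)
print(trans[("the", "cat")])  # count("the cat") / count("the") = 1/2
```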

KN Smoothing: Kneser-Ney smoothing is a different type of smoothing from the ones above, where we use the number of distinct word types that precede a word, rather than its raw frequency, as the basis for smoothing.
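The sketch below shows only the continuation counts that Kneser-Ney smoothing is built on: for each word, count how many distinct word types immediately precede it. The full Kneser-Ney estimator also involves absolute discounting, which is omitted here, and the toy data is illustrative.

```python
# Sketch: continuation counts, the ingredient behind Kneser-Ney smoothing.
from collections import defaultdict

def continuation_counts(tokens):
    """For each word, the number of distinct word types appearing right before it."""
    preceders = defaultdict(set)
    for prev, w in zip(tokens, tokens[1:]):
        preceders[w].add(prev)
    return {w: len(p) for w, p in preceders.items()}

tokens = "san francisco new york new jersey old york".split()  # illustrative data
cont = continuation_counts(tokens)
# "francisco" appears only after "san", so its continuation count is 1,
# while "york" follows two different words ("new" and "old").
print(cont["francisco"], cont["york"])  # 1 2
```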

