Constituent Structure
Distribution of Words in Sentences: N-grams, Phrase Structure Syntax and Parsing
Probability of each token chosen randomly (and independently of other tokens)
unigram(t) = Count(times t appears) / Count(total words)
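A minimal sketch of this formula in code, assuming the corpus is already tokenized into a list of strings (the function name and toy corpus are illustrative):

```python
from collections import Counter

def unigram_probability(tokens, t):
    # unigram(t) = Count(times t appears) / Count(total words)
    counts = Counter(tokens)
    return counts[t] / len(tokens)

tokens = "the cat sat on the mat".split()
print(unigram_probability(tokens, "the"))  # 2 / 6 = 0.333...
```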
Markov Assumption
Probability of a token given the previous token
bigram(t, t_previous) = Count(t_previous -> t) / Count(t_previous)
Count(the) = 69_971
Count(the -> same) = 628
bigram(same, the) = count(the -> same) / count(the) = 628 / 69_971 ≈ 0.009
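A sketch of the same calculation over a tokenized corpus (the helper name is illustrative; the counts above would come from a much larger corpus than any toy example):

```python
from collections import Counter

def bigram_probability(tokens, current, previous):
    # bigram(current, previous) = Count(previous -> current) / Count(previous)
    pair_counts = Counter(zip(tokens, tokens[1:]))  # counts of each adjacent pair
    word_counts = Counter(tokens[:-1])              # positions that have a successor
    return pair_counts[(previous, current)] / word_counts[previous]

# With the counts above: bigram_probability(corpus, "same", "the") = 628 / 69_971 ≈ 0.009
```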
Additional Steps
Include the probability that a word occurs at the beginning of a sentence, e.g. bigram(the, START)
Include probability that a token occurs at the end of a sentence, e.g. bigram(END, .)
Include a non-zero probability for the case when an unknown word follows a known one.
If a bigram has a zero count, "back off" to the unigram of the word; that is, replace bigram(current_word, previous_word) with unigram(current_word) (see the sketch after this list).
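A sketch combining all three steps, assuming explicit START/END padding tokens and a crude floor of one pseudo-count for unknown words (the token names `<s>`/`</s>` and the floor are illustrative choices, not a standard):

```python
from collections import Counter

START, END = "<s>", "</s>"

def train(sentences):
    # Pad each sentence with START/END so sentence boundaries get probabilities too.
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in sentences:
        padded = [START] + sent + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
        total += len(padded)
    return unigrams, bigrams, total

def probability(current, previous, unigrams, bigrams, total):
    # Backoff: if the bigram was never seen, fall back to unigram(current).
    if bigrams[(previous, current)] > 0:
        return bigrams[(previous, current)] / unigrams[previous]
    # Unknown words get at least one pseudo-count so the result is non-zero.
    return max(unigrams[current], 1) / total
```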
Markov Assumption
Probability of a word depends only on the previous word.
Trigrams, 4-grams, N-grams
Trigram Probability
trigram(t, t-1, t-2) = Count(t-2 -> t-1 -> t) / Count(t-2 -> t-1)
Example: count(the -> same -> as) / count(the -> same)
4-gram Probability
fourgram(t, t-1, t-2, t-3) = Count(t-3 -> t-2 -> t-1 -> t) / Count(t-3 -> t-2 -> t-1)
Example: count(the -> same -> as -> an) / count(the -> same -> as)
N-gram Probability
ngram(t, t-1, ..., t-n+1) = Count(t-n+1 -> ... -> t-1 -> t) / Count(t-n+1 -> ... -> t-1)
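The general case in code, a sketch where the context is passed as a tuple of the previous n-1 tokens (names are illustrative; the context is assumed to contain at least one token):

```python
from collections import Counter

def ngram_probability(tokens, context, word):
    # ngram = Count(context -> word) / Count(context)
    n = len(context) + 1
    counts_n = Counter(zip(*[tokens[i:] for i in range(n)]))        # all n-grams
    counts_ctx = Counter(zip(*[tokens[i:] for i in range(n - 1)]))  # all (n-1)-grams
    return counts_n[tuple(context) + (word,)] / counts_ctx[tuple(context)]

# The trigram example from above: ngram_probability(corpus, ("the", "same"), "as")
```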
Markov Assumptions
Trigram Model: probability of a word depends only on the previous two words.
N-gram Model: probability of a word depends only on the previous N-1 words.
Probability of a sentence = product of the probabilities of each word given its preceding context.
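A sketch under the bigram model, reusing probability() and the START/END padding tokens from the Additional Steps sketch above; the sum of logs is used because the raw product of many small probabilities underflows:

```python
import math

def sentence_log_probability(sentence, unigrams, bigrams, total):
    # log P(sentence) = sum of log bigram probabilities (the product, in log space)
    padded = [START] + sentence + [END]
    return sum(
        math.log(probability(cur, prev, unigrams, bigrams, total))
        for prev, cur in zip(padded, padded[1:])
    )
```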
Noun Phrases and Noun Groups
Both can have left modifiers. Only noun phrases can have right modifiers.
A noun group consists of the head noun and the left modifiers of that head noun.
We will assume that all punctuation and coordinating conjunctions are outside of a noun group.