Constituent Structure
Distribution of Words in Sentences: N-grams, Phrase Structure Syntax and Parsing
unigram
Probability of a token, treating each token as chosen randomly and independently of all other tokens
Markov Assumption
Each token is independent of all other tokens, so no preceding words are taken into account.
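A minimal sketch of a unigram model, assuming probabilities are estimated by relative frequency (count of a token divided by the total number of tokens). The function name unigram_probs and the toy corpus are illustrative, not part of the notes.

```python
from collections import Counter

def unigram_probs(tokens):
    """Relative-frequency estimate: P(w) = count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

# Hypothetical toy corpus, already split into tokens.
tokens = "the cat saw the dog .".split()
probs = unigram_probs(tokens)
```

Because every token is treated independently, the estimates depend only on how often each token occurs, not on where it occurs.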
bigram
Probability of a token given the previous token
Example
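One way to estimate a bigram probability is as a ratio of counts: count(previous -> current) / count(previous). The sketch below assumes that estimate and follows the argument order bigram(current_word, previous_word) used in these notes; the toy token list is made up for illustration.

```python
from collections import Counter

def bigram(current_word, previous_word, tokens):
    """Estimate P(current_word | previous_word) from adjacent token pairs."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count previous_word only in positions where another token follows it.
    previous_count = Counter(tokens[:-1])[previous_word]
    if previous_count == 0:
        return 0.0
    return pair_counts[(previous_word, current_word)] / previous_count

# Hypothetical toy corpus.
tokens = "the same as the other the same".split()
p_same_given_the = bigram("same", "the", tokens)
```

Here "the" is followed by another token three times and by "same" twice, so the estimate is 2/3.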
Additional Steps
Include the probability that a token occurs at the beginning of a sentence, e.g. bigram(the, START)
Include the probability that a token occurs at the end of a sentence, e.g. bigram(END, .)
Include a non-zero probability for the case when an unknown word follows a known one.
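The three steps above can be sketched as follows. The notes do not say how the non-zero probability for unknown words is obtained; this sketch assumes add-one (Laplace) smoothing with a single UNK placeholder, which is only one of several possible choices. The marker strings and function names are hypothetical.

```python
from collections import Counter

# Assumed marker strings; any tokens that cannot occur in real text would do.
START, END, UNK = "<START>", "<END>", "<UNK>"

def train(sentences):
    """Count unigrams and bigrams over sentences padded with START and END."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        padded = [START] + sentence + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram(current_word, previous_word, unigrams, bigrams):
    """Add-one smoothed P(current_word | previous_word); unseen words map to UNK."""
    vocab_size = len(unigrams) + 1  # +1 reserves a slot for UNK
    if current_word not in unigrams:
        current_word = UNK
    if previous_word not in unigrams:
        previous_word = UNK
    numerator = bigrams[(previous_word, current_word)] + 1
    return numerator / (unigrams[previous_word] + vocab_size)

# Hypothetical training sentences, already tokenized.
sentences = [["the", "dog", "barks", "."], ["the", "cat", "sleeps", "."]]
unigrams, bigrams = train(sentences)
p_start = bigram("the", START, unigrams, bigrams)    # sentence-initial "the"
p_end = bigram(END, ".", unigrams, bigrams)          # "." closing a sentence
p_unknown = bigram("zebra", "the", unigrams, bigrams)  # unknown after known
```

With smoothing, the unknown word "zebra" gets a small but non-zero probability after the known word "the".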
Backoff Model
If a bigram has a zero count, back off to the unigram of the current word.
That is, replace bigram(current_word, previous_word) with unigram(current_word).
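A minimal sketch of this backoff rule, assuming both probabilities are relative-frequency estimates. The function name backoff_prob and the toy corpus are illustrative only.

```python
from collections import Counter

def backoff_prob(current_word, previous_word, unigrams, bigrams):
    """Use the bigram estimate if the bigram was seen, else the unigram estimate."""
    pair_count = bigrams[(previous_word, current_word)]
    if pair_count > 0:
        return pair_count / unigrams[previous_word]
    # Zero bigram count: back off to unigram(current_word).
    return unigrams[current_word] / sum(unigrams.values())

# Hypothetical toy corpus.
tokens = "the same as the other".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

p_seen = backoff_prob("same", "the", unigrams, bigrams)      # bigram was seen
p_backoff = backoff_prob("other", "same", unigrams, bigrams)  # falls back to unigram
```

This plain substitution does not renormalize, so the values for a given previous word need not sum to one; backoff schemes that discount the seen bigrams address that, but they go beyond what the notes describe.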
Markov Assumption
Probability of a word depends only on the previous word.
Trigrams, 4-grams, N-grams
Trigram Probability
Example: count(the -> same -> as) / count(the -> same)
4-gram Probability
Example: count(the -> same -> as -> an) / count(the -> same -> as)
N-gram Probability
General form: count(w1 -> ... -> wN) / count(w1 -> ... -> w(N-1))
Markov Assumptions
Trigram Model: probability of a word depends only on the previous two words.
N-gram Model: probability of a word depends only on the previous N-1 words.
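The count ratios in the trigram and 4-gram examples generalize to any N. The sketch below assumes the same relative-frequency estimate, count(context -> word) / count(context), where the context is the previous N-1 words; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    """P(word | context) for N >= 2, where len(context) = N - 1."""
    context = tuple(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[context + (word,)]
    denominator = ngram_counts(tokens, n - 1)[context]
    return numerator / denominator if denominator else 0.0

# Hypothetical toy corpus.
tokens = "the same as an apple and the same as a pear".split()
p_trigram = ngram_prob(tokens, ["the", "same"], "as")         # trigram model
p_fourgram = ngram_prob(tokens, ["the", "same", "as"], "an")  # 4-gram model
```

Longer contexts are seen less often, so the counts get sparser as N grows, which is why the backoff and unknown-word steps matter more for larger N.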
Probability of a sentence = product of the probabilities of its words, each conditioned on the previous N-1 words.
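A sketch of the sentence probability under a bigram model (N = 2), assuming unsmoothed relative-frequency estimates and START/END padding. Summing logarithms instead of multiplying probabilities directly is a common way to avoid numerical underflow on long sentences; it is a choice of this sketch, not something stated in the notes.

```python
import math
from collections import Counter

START, END = "<START>", "<END>"  # assumed marker strings

def train(sentences):
    """Count unigrams and bigrams over sentences padded with START and END."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        padded = [START] + sentence + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams):
    """Product of P(word | previous word) over the padded sentence."""
    padded = [START] + sentence + [END]
    log_prob = 0.0
    for previous_word, current_word in zip(padded, padded[1:]):
        pair_count = bigrams[(previous_word, current_word)]
        if pair_count == 0:
            return 0.0  # one unseen bigram makes the whole product zero
        log_prob += math.log(pair_count / unigrams[previous_word])
    return math.exp(log_prob)

# Hypothetical training sentences.
unigrams, bigrams = train([["the", "dog", "barks"], ["the", "cat", "barks"]])
p_seen = sentence_prob(["the", "dog", "barks"], unigrams, bigrams)
p_unseen = sentence_prob(["dog", "the"], unigrams, bigrams)
```

The second sentence gets probability zero because it contains bigrams never seen in training, which is the problem that backoff and unknown-word handling are meant to soften.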
Noun Phrases and Noun Groups
Both can have left modifiers. Only noun phrases can have right modifiers.
A noun group consists of the head noun and its left modifiers.
We will assume that all punctuation and coordinating conjunctions are outside of a noun group.
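The definition above can be sketched as a simple scan over part-of-speech-tagged tokens. The notes do not name a tag set, so the Penn Treebank style tags below, and the choice of which tags count as left modifiers, are assumptions of this sketch.

```python
# Assumed tag sets (Penn Treebank style); the notes do not specify a tag set.
LEFT_MODIFIER_TAGS = {"DT", "PRP$", "CD", "JJ", "JJR", "JJS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def _close(current, groups):
    """Trim trailing modifiers so the group ends at its head noun, then store it."""
    while current and current[-1][1] not in NOUN_TAGS:
        current.pop()
    if current:
        groups.append([word for word, _ in current])

def noun_groups(tagged_tokens):
    """Return noun groups: runs of left modifiers and nouns that end in a noun."""
    groups, current = [], []
    for word, tag in tagged_tokens:
        if tag in LEFT_MODIFIER_TAGS or tag in NOUN_TAGS:
            current.append((word, tag))
        else:
            # Verbs, prepositions, punctuation and conjunctions (CC) end the group.
            _close(current, groups)
            current = []
    _close(current, groups)
    return groups

# Hypothetical hand-tagged sentence.
tagged = [("the", "DT"), ("big", "JJ"), ("red", "JJ"), ("dog", "NN"),
          ("and", "CC"), ("the", "DT"), ("cat", "NN"), ("chased", "VBD"),
          ("a", "DT"), ("ball", "NN"), ("in", "IN"), ("the", "DT"),
          ("park", "NN"), (".", ".")]
groups = noun_groups(tagged)
```

In this example "a ball" and "the park" come out as separate noun groups, whereas a noun phrase could include the right modifier and cover "a ball in the park" as a whole.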