Constituent Structure
Distribution of Words in Sentences: N-grams, Phrase Structure Syntax and Parsing
Unigram Model: probability of each token when it is chosen randomly and independently of the other tokens.
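A minimal sketch of estimating unigram probabilities by relative frequency, assuming the text is already tokenized. The function name `unigram_probabilities` and the toy sentence are illustrative, not taken from the notes.

```python
from collections import Counter

def unigram_probabilities(tokens):
    """Maximum-likelihood unigram estimates: count(token) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

# Toy data (an assumption for illustration): 7 tokens, "the" occurs twice.
tokens = "the cat sat on the mat .".split()
probs = unigram_probabilities(tokens)
print(probs["the"])  # relative frequency of "the", i.e. 2 out of 7 tokens
```

Because each token is treated as independent, the estimates depend only on how often a token occurs, not on where it occurs.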
Bigram Model: probability of a token given the previous token.
Include the probability that a word occurs at the beginning of a sentence, e.g. bigram(the, START)
Include the probability that a token occurs at the end of a sentence, e.g. bigram(END, .)
Include a non-zero probability for the case when an unknown word follows a known one.
If a bigram has a zero count, back off to (use) the unigram of the word. That is, replace bigram(current_word, previous_word) with unigram(current_word).
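The bigram steps above can be sketched as follows. This is one possible reading, not the notes' own implementation: the marker strings `<START>` and `<END>`, the helper names, and the add-one smoothing used to give unknown words a non-zero unigram probability are all assumptions made for illustration. The argument order bigram(current, previous) follows the notation in the notes.

```python
from collections import Counter

START, END = "<START>", "<END>"  # hypothetical sentence-boundary markers

def train(sentences):
    """Collect unigram and bigram counts, padding each sentence with START and END."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        padded = [START] + sentence + [END]
        unigram_counts.update(padded)
        bigram_counts.update(zip(padded, padded[1:]))  # (previous, current) pairs
    return unigram_counts, bigram_counts

def unigram(current, unigram_counts):
    """Add-one smoothed unigram probability (an assumption of this sketch),
    so that a word never seen in training still gets a non-zero value."""
    total = sum(unigram_counts.values())
    vocab_size = len(unigram_counts) + 1  # +1 reserves a slot for unknown words
    return (unigram_counts.get(current, 0) + 1) / (total + vocab_size)

def bigram(current, previous, unigram_counts, bigram_counts):
    """P(current | previous); backs off to unigram(current) on a zero bigram count."""
    pair_count = bigram_counts.get((previous, current), 0)
    if pair_count == 0:
        return unigram(current, unigram_counts)
    return pair_count / unigram_counts[previous]

corpus = [["the", "cat", "sat", "."], ["the", "dog", "sat", "."]]
uni, bi = train(corpus)
print(bigram("the", START, uni, bi))   # sentence-initial "the"
print(bigram(END, ".", uni, bi))       # sentence end after "."
print(bigram("bird", "the", uni, bi))  # unknown word after a known one: backoff
```

Plain backoff without discounting, as here, does not yield probabilities that sum to one over all possible next words; it only guarantees that no bigram is scored as zero.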
Bigram Model: probability of a word depends only on the previous word.
Example (trigram): P(as | the same) = count(the -> same -> as) / count(the -> same)
Example (4-gram): P(an | the same as) = count(the -> same -> as -> an) / count(the -> same -> as)
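The count ratios in these examples can be computed directly from a token list. The sketch below is illustrative only: the helper names `ngram_counts` and `conditional_probability` and the toy text are assumptions, and no smoothing is applied.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    context_count = ngram_counts(tokens, len(context))[context]
    if context_count == 0:
        return 0.0  # unseen context: no estimate is available without smoothing
    longer_count = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    return longer_count / context_count

# Toy text (an assumption): "the same" occurs 3 times, "the same as" twice,
# and "the same as an" once.
tokens = "the same as the same as an old one and the same one".split()
print(conditional_probability(tokens, ["the", "same"], "as"))        # 2 / 3
print(conditional_probability(tokens, ["the", "same", "as"], "an"))  # 1 / 2
```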
Trigram Model: probability of a word depends only on the previous two words.
N-gram Model: probability of a word depends only on the previous N-1 words.
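Probability of a sentence = product of the probabilities of each word, where each word's probability is conditioned on the previous words that the model takes into account.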
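As a rough illustration of the product rule for a bigram model, the sketch below sums log probabilities instead of multiplying raw probabilities, which avoids numerical underflow on long sentences. The marker strings, function names, and toy corpus are assumptions, and the model is deliberately unsmoothed.

```python
import math
from collections import Counter

START, END = "<START>", "<END>"  # hypothetical sentence-boundary markers

def train_bigrams(sentences):
    """Count contexts and (previous, current) pairs over padded sentences."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        padded = [START] + sentence + [END]
        context_counts.update(padded[:-1])           # every token that has a successor
        pair_counts.update(zip(padded, padded[1:]))  # (previous, current) pairs
    return context_counts, pair_counts

def sentence_log_probability(sentence, context_counts, pair_counts):
    """Sum of log P(word | previous word); -inf if any bigram was never seen."""
    padded = [START] + sentence + [END]
    log_prob = 0.0
    for previous, current in zip(padded, padded[1:]):
        count = pair_counts[(previous, current)]
        if count == 0:
            return float("-inf")  # unsmoothed: one unseen bigram zeroes the product
        log_prob += math.log(count / context_counts[previous])
    return log_prob

corpus = [["the", "cat", "sat", "."], ["the", "dog", "sat", "."]]
context_counts, pair_counts = train_bigrams(corpus)
log_p = sentence_log_probability(["the", "cat", "sat", "."], context_counts, pair_counts)
probability = math.exp(log_p)
print(probability)  # product of the five bigram probabilities, about 0.5 here
```

In this toy corpus every bigram in the sentence has probability 1 except (the, cat), which has probability 1/2, so the product is about 0.5.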
Both noun groups and noun phrases can have left modifiers. Only noun phrases can have right modifiers.
A noun group consists of the left modifiers of the head noun, followed by the head noun itself.
We will assume that all punctuation and coordinate conjunctions are outside of a noun group.
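A minimal sketch of extracting noun groups under this definition, assuming the input is already part-of-speech tagged. The Penn Treebank-style tag names, the choice of which tags count as left modifiers, and the function name `noun_groups` are all assumptions of this example, not part of the notes.

```python
# Penn Treebank-style tags are an assumption of this sketch.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
# Determiners, possessive pronouns, numbers, adjectives, and nouns used as
# modifiers are treated as possible left modifiers of the head noun.
LEFT_MODIFIER_TAGS = {"DT", "PRP$", "CD", "JJ", "JJR", "JJS"} | NOUN_TAGS

def noun_groups(tagged_tokens):
    """Return noun groups as lists of words: left modifiers followed by a head noun."""
    groups, current = [], []  # current holds (word, tag) pairs of the open group

    def close_group():
        # Trim trailing non-nouns so that the group ends on its head noun.
        while current and current[-1][1] not in NOUN_TAGS:
            current.pop()
        if current:
            groups.append([word for word, _ in current])
        current.clear()

    for word, tag in tagged_tokens:
        if tag in LEFT_MODIFIER_TAGS:
            current.append((word, tag))
        else:
            close_group()  # punctuation, conjunctions, verbs, etc. end the group
    close_group()
    return groups

tagged = [("the", "DT"), ("old", "JJ"), ("dog", "NN"), ("and", "CC"),
          ("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
          ("the", "DT"), ("kitchen", "NN"), ("mat", "NN"), (".", ".")]
print(noun_groups(tagged))
# [['the', 'old', 'dog'], ['the', 'cat'], ['the', 'kitchen', 'mat']]
```

The conjunction "and" and the final period fall outside every group, and no right modifiers are attached, since those belong to full noun phrases rather than noun groups.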