Constituent Structure

Distribution of Words in Sentences: N-grams, Phrase Structure Syntax and Parsing

unigram

Probability of each token chosen randomly (and independently of other tokens)
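
As a minimal sketch (the toy corpus below is illustrative, not from the notes), a unigram probability is just a token's count divided by the total number of tokens:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()  # toy corpus, illustrative only
counts = Counter(tokens)
total = len(tokens)

def unigram(word):
    # P(word) = count(word) / total number of tokens
    return counts[word] / total

print(unigram("the"))  # 2/6 ≈ 0.333
```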

Markov Assumption

Unigram model: the probability of a word does not depend on any other words (each token is chosen independently).

bigram

Probability of a token given the previous token

Example

Count(the) = 69_971
Count(the -> same) = 628

bigram(same, the) = count(the -> same) / count(the) = 628 / 69_971 ≈ 0.00898
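
The same computation as a quick sketch in Python, using the counts from the example above (the variable names are illustrative):

```python
count_the = 69_971       # Count(the)
count_the_same = 628     # Count(the -> same)

# bigram(same, the) = P(same | the) = count(the -> same) / count(the)
p_same_given_the = count_the_same / count_the
print(f"{p_same_given_the:.5f}")  # ≈ 0.00898
```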

Additional Steps

  1. Include the probability that a word occurs at the beginning of a sentence, e.g. bigram(the, START)

  2. Include the probability that a token occurs at the end of a sentence, e.g. bigram(END, .)

  3. Include a non-zero probability for the case where an unknown word follows a known one (all three steps are sketched below).
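
A sketch covering all three steps, assuming START/END padding tokens and an `<unk>` token for unknown words; the add-1 smoothing used for step 3 is one common choice and is not specified in the notes:

```python
from collections import Counter

START, END, UNK = "<s>", "</s>", "<unk>"

# Toy corpus; each sentence is padded so that sentence-initial and
# sentence-final transitions get their own bigram counts (steps 1 and 2).
sentences = [["the", "cat", "sat", "."], ["the", "dog", "ran", "."]]

unigram_counts = Counter()
bigram_counts = Counter()
vocab = {START, END, UNK}
for sent in sentences:
    vocab.update(sent)
    padded = [START] + sent + [END]
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

def bigram(current, previous, k=1.0):
    # Map out-of-vocabulary words to UNK, then apply add-k smoothing so
    # that unseen pairs (step 3) still get a non-zero probability.
    current = current if current in vocab else UNK
    previous = previous if previous in vocab else UNK
    pair_count = bigram_counts[(previous, current)]
    return (pair_count + k) / (unigram_counts[previous] + k * len(vocab))

print(bigram("the", START))    # step 1: P(the | START)
print(bigram(END, "."))        # step 2: P(END | .)
print(bigram("zebra", "the"))  # step 3: unknown word, still non-zero
```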

Backoff Model

If a bigram has a zero count, "back off" to (i.e. use) the unigram probability of the word.

That is, replace bigram(current_word, previous_word) with unigram(current_word).
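
A minimal sketch of this rule, assuming precomputed probability tables keyed as shown (the names are illustrative):

```python
def backoff(current_word, previous_word, bigram_probs, unigram_probs):
    # Use the bigram estimate when the pair was observed;
    # otherwise "back off" to the unigram estimate of current_word.
    pair = (previous_word, current_word)
    if bigram_probs.get(pair, 0.0) > 0.0:
        return bigram_probs[pair]
    return unigram_probs.get(current_word, 0.0)
```

Note that this simple form is not a normalized distribution; full backoff models (e.g., Katz backoff) discount the higher-order estimates so the total probability mass still sums to one.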

Markov Assumption

Probability of a word depends only on the previous word.

Trigrams, 4-grams, N-grams

Trigram Probability

Example: count(the -> same -> as) / count(the -> same)

4-gram Probability

Example: count(the -> same -> as -> an) / count(the -> same -> as)

N-gram Probability

In general: count(w_1 -> ... -> w_N) / count(w_1 -> ... -> w_(N-1))
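
A sketch of the general case for arbitrary N, assuming a flat list of tokens (the function names and toy data are illustrative):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous run of n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_probability(tokens, context, word):
    # P(word | context) = count(context -> word) / count(context)
    n = len(context) + 1
    full = ngram_counts(tokens, n)
    prefix = ngram_counts(tokens, n - 1)
    return full[tuple(context) + (word,)] / prefix[tuple(context)]

tokens = "the same as the same as an".split()  # toy data, illustrative
print(ngram_probability(tokens, ("the", "same"), "as"))        # trigram: 2/2 = 1.0
print(ngram_probability(tokens, ("the", "same", "as"), "an"))  # 4-gram: 1/2 = 0.5
```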

Markov Assumptions

Trigram Model: probability of a word depends only on the previous two words.

N-gram Model: probability of a word depends only on the previous N-1 words.

Probability of a sentence = product of the probabilities of each word given its preceding context (sketched below).
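
A minimal sketch of this product under the bigram Markov assumption, reusing the START/END convention above; summing log-probabilities instead of multiplying is an implementation detail to avoid numeric underflow, not something from the notes:

```python
import math

def sentence_log_probability(sentence, bigram):
    # Under the bigram Markov assumption:
    # P(w1 ... wn) = P(w1 | START) * P(w2 | w1) * ... * P(END | wn)
    padded = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(bigram(cur, prev))
               for prev, cur in zip(padded, padded[1:]))

# Usage with the smoothed bigram() sketched earlier:
# print(math.exp(sentence_log_probability(["the", "cat", "sat", "."], bigram)))
```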

Noun Phrases and Noun Groups

Both can have left modifiers. Only noun phrases can have right modifiers. For example, in "the tall student from Paris", the noun group is "the tall student"; the right modifier "from Paris" belongs to the full noun phrase but not to the noun group.

  • A noun group consists of the head noun and its left modifiers

  • We will assume that all punctuation and coordinate conjunctions are outside of a noun group
