Constituent Structure

Distribution of Words in Sentences: N-grams, Phrase Structure Syntax and Parsing


unigram

Probability of each token chosen randomly (and independently of other tokens)

unigram(t) = \frac{Count(\text{times } t \text{ appears})}{Count(\text{total word appearances})}
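As a minimal sketch, this estimate can be computed from a token list (the `unigram_probs` function and toy corpus below are illustrative, not from the notes):

```python
from collections import Counter

def unigram_probs(tokens):
    """Relative frequency of each token: Count(t) / Count(all tokens)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

probs = unigram_probs("the cat sat on the mat".split())
print(probs["the"])  # 2 / 6 ≈ 0.333
```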

Markov Assumption

Unigram model: each token's probability is independent of all other tokens (no dependence on context).

bigram

Probability of a token given the previous token

bigram(t, t_{previous}) = \frac{Count(t_{previous} \rightarrow t)}{Count(t_{previous})}
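A sketch of the same estimate in code, assuming a flat token list (names are illustrative):

```python
from collections import Counter

def bigram_probs(tokens):
    """P(t | t_previous) = Count(t_previous -> t) / Count(t_previous)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])  # tokens in the "previous" position
    return {(prev, t): c / prev_counts[prev]
            for (prev, t), c in pair_counts.items()}

probs = bigram_probs("the same day in the same way".split())
print(probs[("the", "same")])  # Count(the -> same)=2, Count(the)=2 -> 1.0
```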

Example

Count(the) = 69_971
Count(the -> same) = 628

bigram(same, the) = Count(the -> same) / Count(the) = 628 / 69_971 ≈ 0.00898

Additional Steps

  1. Include the probability that a word occurs at the beginning of a sentence, e.g. bigram(the, START).

  2. Include the probability that a token occurs at the end of a sentence, e.g. bigram(END, .).

  3. Include a non-zero probability for the case when an unknown word follows a known one (one way to do this is sketched below).
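A sketch of how these three steps might look, assuming START/END boundary markers and add-alpha smoothing for unknown words; the marker names and the choice of smoothing are illustrative, since the notes only require a non-zero probability:

```python
from collections import Counter

START, END, UNK = "<s>", "</s>", "<unk>"

def padded_bigram_counts(sentences, vocab):
    """Count bigrams over sentences padded with START/END;
    out-of-vocabulary tokens are mapped to UNK."""
    pairs, prevs = Counter(), Counter()
    for sent in sentences:
        padded = [START] + [t if t in vocab else UNK for t in sent] + [END]
        pairs.update(zip(padded, padded[1:]))
        prevs.update(padded[:-1])
    return pairs, prevs

def smoothed_bigram(pairs, prevs, prev, t, vocab_size, alpha=1.0):
    """Add-alpha smoothing gives unseen pairs (e.g. unknown words) non-zero probability."""
    return (pairs[(prev, t)] + alpha) / (prevs[prev] + alpha * vocab_size)

pairs, prevs = padded_bigram_counts([["the", "same", "day"]], {"the", "same", "day"})
print(pairs[(START, "the")])  # 1: "the" began the sentence once
```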

Backoff Model

If a bigram has a zero count, "back off" to (i.e., use) the unigram probability of the word.

That is, replace bigram(current_word, previous_word) with unigram(current_word).
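A sketch of this rule, reusing bigram and unigram counts from above. This is the simple unweighted form the notes describe; standard backoff models (e.g. Katz backoff) also discount the bigram estimate so the distribution still sums to one:

```python
def backoff_prob(current, previous, pair_counts, prev_counts, unigram):
    """Use the bigram estimate when available; otherwise back off
    to the unigram probability of the current word."""
    if pair_counts.get((previous, current), 0) > 0:
        return pair_counts[(previous, current)] / prev_counts[previous]
    return unigram.get(current, 0.0)
```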

Markov Assumption

Probability of a word depends only on the previous word.

Trigrams, 4-grams, N-grams

Trigram Probability

Example: Count(the -> same -> as) / Count(the -> same)

trigram(t, t_{-1}, t_{-2}) = \frac{Count(t_{-2} \rightarrow t_{-1} \rightarrow t)}{Count(t_{-2} \rightarrow t_{-1})}

4-gram Probability

Example: Count(the -> same -> as -> an) / Count(the -> same -> as)

fourgram(t, t_{-1}, t_{-2}, t_{-3}) = \frac{Count(t_{-3} \rightarrow t_{-2} \rightarrow t_{-1} \rightarrow t)}{Count(t_{-3} \rightarrow t_{-2} \rightarrow t_{-1})}

N-gram Probability

ngram(t, t_{-1}, \ldots, t_{-n+1}) = \frac{Count(t_{-n+1} \rightarrow \ldots \rightarrow t_{-1} \rightarrow t)}{Count(t_{-n+1} \rightarrow \ldots \rightarrow t_{-1})}

Markov Assumptions

Trigram Model: probability of a word depends only on the previous two words.

N-gram Model: probability of a word depends only on the previous N-1 words.
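A sketch generalizing the trigram and 4-gram estimates above to any n, assuming a flat token list (the function name is illustrative):

```python
from collections import Counter

def ngram_probs(tokens, n):
    """P(t | previous n-1 tokens) = Count(context -> t) / Count(context)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    gram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)
    return {g: c / context_counts[g[:-1]] for g, c in gram_counts.items()}

tokens = "the same as the same as an".split()
probs = ngram_probs(tokens, 3)
print(probs[("the", "same", "as")])  # Count(the->same->as)=2 / Count(the->same)=2 -> 1.0
```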

Probability of a sentence = product of the probabilities of each word, each conditioned on its preceding context.
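In the bigram case, with w_0 = START as in the additional steps above, this is:

P(w_1 \ldots w_n) \approx \prod_{i=1}^{n} bigram(w_i, w_{i-1}) = \prod_{i=1}^{n} \frac{Count(w_{i-1} \rightarrow w_i)}{Count(w_{i-1})}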

Noun Phrases and Noun Groups

Both can have left modifiers; only noun phrases can have right modifiers. For example, in "the tall man in the hat", the noun group is "the tall man", while the full noun phrase also includes the right modifier "in the hat".

  • A noun group consists of: left modifiers of the head noun and the head noun

  • We will assume that all punctuation and coordinate conjunctions are outside of a noun group
