
Information Retrieval

Information Retrieval and Related Applications. TF/IDF, Cosine Similarity.

TF/IDF

Term Frequency (TF)

TF: the number of times term t occurs in document d (or, as a normalized alternative, that count divided by the total number of terms in the document)

$$TF(t, d) = Count(\text{times term } t \text{ appears in } d)$$
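A minimal sketch of the raw-count definition above; the tokenized example document is invented for illustration:

```python
from collections import Counter

def tf(term, document_tokens):
    # Raw count: how many times `term` appears in the tokenized document.
    return Counter(document_tokens)[term]

doc = "the cat sat on the mat".split()
print(tf("the", doc))  # 2
```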

Inverse Document Frequency (IDF)

IDF: the logarithm of the total number of documents in the corpus divided by the number of documents containing term t

$$IDF(t) = \log\frac{Count(\text{documents in total})}{Count(\text{documents containing term } t)}$$
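A matching sketch for IDF, assuming the corpus is represented as a list of tokenized documents (the sample corpus is made up):

```python
import math

def idf(term, corpus):
    # corpus: list of tokenized documents.
    n_total = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    # Assumes `term` occurs in at least one document (otherwise division by zero).
    return math.log(n_total / n_containing)

corpus = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
    "a cat and a dog slept".split(),
]
print(round(idf("cat", corpus), 3))  # log(3/2) ≈ 0.405
```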

TF-IDF

$$TF\_IDF(t, d) = TF(t, d) \cdot IDF(t)$$
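Putting the two together, assuming the tf and idf helpers (and the sample corpus) from the sketches above are in scope:

```python
def tf_idf(term, document_tokens, corpus):
    # Product of the raw term count and the inverse document frequency.
    return tf(term, document_tokens) * idf(term, corpus)

print(round(tf_idf("cat", corpus[0], corpus), 3))  # 1 * log(3/2) ≈ 0.405
```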

Cosine Similarity

The cosine of the angle between the two vectors. For non-negative term-weight vectors such as TF-IDF, the range is [0, 1]; the higher the value, the more similar the vectors.

$$Cosine(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \cdot \lVert v_2 \rVert}$$
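A direct translation of the formula into a small function (pure Python, no external libraries):

```python
import math

def cosine(v1, v2):
    # Dot product divided by the product of the two vector norms.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```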

Example

$$v_1 = [0, 5, 0, 5, 0]$$

$$v_2 = [0, 7, 0, 9, 0]$$

$$Cosine(v_1, v_2) = \frac{0 \cdot 0 + 5 \cdot 7 + 0 \cdot 0 + 5 \cdot 9 + 0 \cdot 0}{\sqrt{0^2+5^2+0^2+5^2+0^2} \cdot \sqrt{0^2+7^2+0^2+9^2+0^2}} = \frac{80}{\sqrt{50} \cdot \sqrt{130}} \approx 0.992$$
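The same computation with the cosine sketch above, just to confirm the arithmetic:

```python
v1 = [0, 5, 0, 5, 0]
v2 = [0, 7, 0, 9, 0]
print(round(cosine(v1, v2), 3))  # 80 / (sqrt(50) * sqrt(130)) ≈ 0.992
```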
