Information Retrieval
Information Retrieval and Related Applications. TF/IDF, Cosine Similarity.
TF/IDF
Term Frequency (TF)
TF: the number of times term t occurs in document d. A common alternative normalizes this count by the length of the document (the total number of terms in d).
$$\mathrm{TF}(t, d) = \mathrm{Count}(\text{times term } t \text{ appears in } d)$$
Inverse Document Frequency (IDF)
IDF: the logarithm of the total number of documents in the corpus divided by the number of documents containing term t. Terms that appear in fewer documents receive a higher IDF.
$$\mathrm{IDF}(t) = \log \frac{\mathrm{Count}(\text{documents in total})}{\mathrm{Count}(\text{documents containing term } t)}$$
TF-IDF
$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)$$
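The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy corpus, the whitespace tokenization, and the choice to return 0 for a term that appears in no document are all assumptions made for the example. It uses the raw-count form of TF and the natural logarithm.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Raw count of `term` in a tokenized document."""
    return Counter(doc_tokens)[term]


def inverse_document_frequency(term, corpus):
    """log(N / df), where df is the number of documents containing `term`."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        # Assumption for this sketch: a term seen in no document gets weight 0
        # instead of triggering a division by zero.
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

print(tf_idf("the", corpus[0], corpus))  # 0.0: "the" occurs in every document
print(tf_idf("mat", corpus[0], corpus))  # 1 * log(3 / 1)
```

A term that occurs in every document, such as "the" here, has an IDF of log(1) = 0, so its TF-IDF is 0 no matter how often it appears.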
Cosine Similarity
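The cosine of the angle between two vectors. In general the value lies in [-1, 1]; for vectors with non-negative components, such as TF-IDF vectors, the range is [0, 1]. The higher the value, the more similar the vectors.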
$$\mathrm{Cosine}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert_2 \, \lVert v_2 \rVert_2}$$
Example
$$v_1 = [0, 5, 0, 5, 0]$$
$$v_2 = [0, 7, 0, 9, 0]$$
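$$\mathrm{Cosine}(v_1, v_2) = \frac{0 \cdot 0 + 5 \cdot 7 + 0 \cdot 0 + 5 \cdot 9 + 0 \cdot 0}{\sqrt{0^2 + 5^2 + 0^2 + 5^2 + 0^2} \, \sqrt{0^2 + 7^2 + 0^2 + 9^2 + 0^2}} = \frac{80}{\sqrt{50} \, \sqrt{130}} \approx 0.992$$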
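The worked example can be checked with a short Python sketch. The function name `cosine_similarity` and the decision to raise an error for zero-length vectors (where the cosine is undefined) are assumptions made for this illustration.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their L2 norms."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption for this sketch: reject zero vectors rather than return a default.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


v1 = [0, 5, 0, 5, 0]
v2 = [0, 7, 0, 9, 0]
print(round(cosine_similarity(v1, v2), 3))  # 0.992
```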