📅  Last modified: 2023-12-03 14:47:57.893000             🧑  Author: Mango
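TF-IDF (term frequency-inverse document frequency) is a method used to quantify how important a term is to a document within a corpus. It is commonly used in Natural Language Processing to rank the importance of words in a text.
TF-IDF takes into account two factors: term frequency (TF), how often a term appears in a document, and inverse document frequency (IDF), how rare the term is across the documents of the corpus.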
TF-IDF is usually calculated for individual words. In NLP, however, the context in which words appear also matters, so it can be useful to calculate TF-IDF for Bigrams (sequences of two adjacent words) and Trigrams (sequences of three adjacent words) as well.
To calculate TF-IDF for Bigrams and Trigrams, we need to define three quantities: document frequency (DF), term frequency (TF) and inverse document frequency (IDF).
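Written as formulas, the quantities used in this article take roughly the following form. The symbols are notation chosen here for illustration: t is an n-gram, d is a document and N is the total number of documents. The logarithm is the natural logarithm, to match the `math.log` call used in the IDF code of this article.

```latex
\begin{align}
\mathrm{tf}(t, d) &= \text{number of times } t \text{ occurs in } d \\
\mathrm{df}(t) &= \left|\{\, d : t \text{ occurs in } d \,\}\right| \\
\mathrm{idf}(t) &= \ln \frac{N}{\mathrm{df}(t)} \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{align}
```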
Document frequency (DF) is the number of documents that contain a certain term. When calculating the DF for Bigrams and Trigrams, each n-gram is treated as a single term and is counted at most once per document, no matter how often it occurs there.
We can calculate the DF for Bigrams and Trigrams using the following code:
from collections import defaultdict

# Define documents
documents = [
    'This is a test document',
    'This document is another test',
    'And this is yet another test document',
]

# Define function to return list of n-grams
def get_ngrams(text, n):
    words = text.lower().split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

# Calculate DF for Bigrams and Trigrams
df = defaultdict(int)
for document in documents:
    for n in range(2, 4):  # n = 2 (Bigrams) and n = 3 (Trigrams)
        # set() makes sure each n-gram is counted once per document
        for term in set(get_ngrams(document, n)):
            df[term] += 1
print(df)
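The code above stores Bigrams and Trigrams in a single dictionary. If the two sizes need to be inspected separately, one possible variant keeps one table per n-gram size. This is only a sketch: `df_by_n` is a name introduced here for illustration, and `collections.Counter` is used in place of `defaultdict`.

```python
from collections import Counter

documents = [
    'This is a test document',
    'This document is another test',
    'And this is yet another test document',
]

def get_ngrams(text, n):
    words = text.lower().split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

# df_by_n[n] maps each n-gram of size n to the number of documents containing it
# (df_by_n is a hypothetical name, not part of the original code)
df_by_n = {}
for n in (2, 3):
    counts = Counter()
    for document in documents:
        # set() so that repeated n-grams count once per document
        counts.update(set(get_ngrams(document, n)))
    df_by_n[n] = counts

print(df_by_n[2])
print(df_by_n[3])
```

For the three sample documents, the Bigrams 'this is', 'test document' and 'another test' each occur in two documents, while every Trigram occurs in exactly one.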
Term frequency (TF) is the number of times a term appears in a document. When calculating the TF for Bigrams and Trigrams, every occurrence of an n-gram in a document counts, and the counts are kept separately for each document.
We can calculate the TF for Bigrams and Trigrams using the following code:
# Define function to return dictionary of TF for n-grams
def get_tf(text, n):
    ngrams = get_ngrams(text, n)
    tf = defaultdict(int)
    for ngram in ngrams:
        tf[ngram] += 1
    return tf

# Calculate TF for Bigrams and Trigrams
tf = []  # one dictionary per document, in the same order as documents
for document in documents:
    tf_doc = {}
    for n in range(2, 4):
        tf_doc.update(get_tf(document, n))
    tf.append(tf_doc)
print(tf)
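The TF above is a raw count, so longer documents tend to produce larger values. A common variant, which is not what this article's code does, divides each count by the total number of n-grams in the document. The sketch below assumes that variant; `get_normalized_tf` and the sample sentence are hypothetical and introduced only for illustration.

```python
from collections import Counter

def get_ngrams(text, n):
    words = text.lower().split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

def get_normalized_tf(text, n):
    """Hypothetical helper: relative frequency of each n-gram in text."""
    ngrams = get_ngrams(text, n)
    total = len(ngrams)
    if total == 0:
        return {}  # text is shorter than n words
    counts = Counter(ngrams)
    return {term: count / total for term, count in counts.items()}

tf_example = get_normalized_tf('the cat saw the cat', 2)
print(tf_example)
```

The sample sentence has four Bigrams, and 'the cat' occurs twice, so its normalized TF is 0.5 while the other two Bigrams get 0.25 each.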
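Inverse document frequency (IDF) measures how rare a term is across the documents. Here it is computed as the natural logarithm of the total number of documents divided by the number of documents that contain the term (its DF). When calculating IDF for Bigrams and Trigrams, note that an n-gram that appears in every document gets an IDF of 0, and the rarer an n-gram is, the higher its IDF.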
We can calculate the IDF for Bigrams and Trigrams using the following code:
import math

# Calculate IDF for Bigrams and Trigrams
idf = {}
for n in range(2, 4):  # Bigrams first, then Trigrams
    for term in df.keys():
        if len(term.split()) == n:
            idf[term] = math.log(len(documents) / df[term])
print(idf)
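With the formula used above, an n-gram that occurs in every document has an IDF of log(1) = 0, so its TF-IDF is 0 regardless of how often it appears. If that is not wanted, one widely used smoothed form adds one to the numerator and denominator and one to the result. The sketch below assumes that variant; it is not the formula used in this article, and `smoothed_idf` is a hypothetical helper name.

```python
import math

def smoothed_idf(n_documents, doc_freq):
    """Hypothetical helper: add-one smoothed IDF, which never reaches zero."""
    return math.log((1 + n_documents) / (1 + doc_freq)) + 1

plain_all_docs = math.log(3 / 3)        # plain IDF, term found in all 3 documents
smooth_all_docs = smoothed_idf(3, 3)    # smoothed IDF for the same term
smooth_one_doc = smoothed_idf(3, 1)     # smoothed IDF, term found in 1 of 3 documents

print(plain_all_docs, smooth_all_docs, smooth_one_doc)
```

The smoothed value is 1.0 for a term found in all three documents, compared with 0.0 for the plain formula, and rarer terms still receive a higher weight.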
Now that we have calculated the TF, IDF and DF for Bigrams and Trigrams, we can calculate the TF-IDF for each term in the documents using the following code:
# Calculate TF-IDF for Bigrams and Trigrams
tf_idf = []  # one dictionary per document, in the same order as documents
for t in tf:
    tf_idf_doc = {}
    for term in t.keys():
        tf_idf_doc[term] = t[term] * idf[term]
    tf_idf.append(tf_idf_doc)
print(tf_idf)
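To see which Bigrams and Trigrams stand out in each document, the scores can be sorted. The following is a self-contained sketch that repeats the steps of this article in compact form and then prints the three highest-scoring n-grams per document. The helper `tf_idf_scores`, its `sizes` parameter and the tie-breaking by alphabetical order are assumptions made for this example.

```python
import math
from collections import Counter

documents = [
    'This is a test document',
    'This document is another test',
    'And this is yet another test document',
]

def get_ngrams(text, n):
    words = text.lower().split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

def tf_idf_scores(docs, sizes=(2, 3)):
    """Hypothetical helper: one {n-gram: TF-IDF} dictionary per document."""
    # TF: raw count of every n-gram of the requested sizes, per document
    tf_per_doc = [
        Counter(term for n in sizes for term in get_ngrams(doc, n))
        for doc in docs
    ]
    # DF: iterating a Counter yields each distinct n-gram once per document
    df = Counter(term for tf in tf_per_doc for term in tf)
    return [
        {term: count * math.log(len(docs) / df[term]) for term, count in tf.items()}
        for tf in tf_per_doc
    ]

scores = tf_idf_scores(documents)
for doc, doc_scores in zip(documents, scores):
    # Highest score first; ties are broken alphabetically for a stable order
    top = sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))[:3]
    print(doc, '->', top)
```

In this small corpus, n-grams that occur in only one document score log(3), while those shared by two documents score log(3/2), so the n-grams unique to a document rank first.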
In this article, we have discussed how to calculate TF-IDF for Bigrams and Trigrams. We have seen how to calculate DF, TF and IDF for Bigrams and Trigrams, and finally how to combine them to calculate TF-IDF.
Happy Coding!