📅  Last modified: 2023-12-03 15:35:19.426000             🧑  Author: Mango
TF-IDF (Term Frequency-Inverse Document Frequency) is a simple yet powerful weighting scheme used in information retrieval and text mining. It measures how important a term is to a document relative to the rest of a corpus. The idea is that a term that appears frequently in one document but rarely elsewhere in the corpus is likely to be a key term for that document.
TF-IDF consists of two parts, TF (Term Frequency) and IDF (Inverse Document Frequency).
Term Frequency is a measure of how often a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in the document.
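// Requires: using System;
// Fraction of the words in `document` that equal `term` (case-insensitive).
double ComputeTermFrequency(string term, string document)
{
    var words = document.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    if (words.Length == 0)
    {
        return 0.0; // An empty document would otherwise yield 0/0 = NaN.
    }

    string target = term.Trim();
    int count = 0;
    foreach (var word in words)
    {
        // Trim strips tabs or newlines that a split on spaces leaves attached.
        if (string.Equals(word.Trim(), target, StringComparison.OrdinalIgnoreCase))
        {
            count++;
        }
    }
    return (double)count / words.Length;
}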
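In symbols, the calculation above can be written as follows. The notation is an assumption introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, and the denominator sums the counts of every term $t'$ in $d$, which is simply the document's length in words.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
$$

In the function above, `count` plays the role of $f_{t,d}$ and `words.Length` plays the role of the denominator.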
Inverse Document Frequency is a measure of how common or rare a term is across all the documents in a corpus. It is calculated by taking the natural logarithm of the total number of documents in the corpus divided by the number of documents that contain the term. A term that appears in every document therefore has an IDF of zero, and the rarer the term, the larger its IDF.
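// Requires: using System; using System.Collections.Generic;
// Natural log of (number of documents / number of documents containing `term`).
double ComputeInverseDocumentFrequency(string term, List<string> documents)
{
    int count = 0;
    foreach (var document in documents)
    {
        // Match whole words, as ComputeTermFrequency does; a substring test
        // such as Contains would also count "cat" inside "category".
        if (ComputeTermFrequency(term, document) > 0)
        {
            count++;
        }
    }
    // A term found in no document would divide by zero; this sketch returns 0.
    if (count == 0)
    {
        return 0.0;
    }
    return Math.Log((double)documents.Count / count);
}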
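The same definition can be written compactly. The symbols are assumptions introduced here for illustration: $N$ is the total number of documents in the corpus $D$, and $n_t$ is the number of documents that contain term $t$. The logarithm is the natural one, matching the single-argument `Math.Log`.

$$
\mathrm{idf}(t, D) = \ln \frac{N}{n_t}
$$

The expression is undefined when $n_t = 0$, so a term that occurs in no document needs special handling in code.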
The TF-IDF value of a term is the product of its term frequency and inverse document frequency:
// TF-IDF of `term` in `document`, weighted against the corpus `documents`.
double ComputeTFIDF(string term, string document, List<string> documents)
{
    double tf = ComputeTermFrequency(term, document);
    double idf = ComputeInverseDocumentFrequency(term, documents);
    return tf * idf;
}
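A short worked example may make the product more concrete. The numbers are made up for illustration: suppose a term appears 3 times in a 100-word document, and 10 of the 1000 documents in the corpus contain it.

$$
\mathrm{tf} = \frac{3}{100} = 0.03, \qquad
\mathrm{idf} = \ln \frac{1000}{10} = \ln 100 \approx 4.605, \qquad
\text{tf-idf} = 0.03 \times 4.605 \approx 0.138
$$

By contrast, a term found in all 1000 documents has $\mathrm{idf} = \ln 1 = 0$, so its TF-IDF is zero however often it occurs in the document.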
TF-IDF is a simple and effective way to measure the importance of a term in a document relative to a corpus. Term frequency rewards terms that dominate a document, while inverse document frequency discounts terms that are common everywhere, which makes their product a useful weight for information retrieval and text mining.