📅  Last modified: 2023-12-03 15:20:09.262000             🧑  Author: Mango

Scikit-Learn TFIDF - Python

TFIDF, short for Term Frequency-Inverse Document Frequency, is a widely used technique in Natural Language Processing for turning text into numerical features. It scores each word by combining how often the word occurs in a document (term frequency) with how rare the word is across the corpus (inverse document frequency), so words that are frequent in one document but uncommon elsewhere receive the highest scores. Scikit-Learn is a popular Python machine learning library that includes a TFIDF implementation alongside its learning algorithms.
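To make the scoring idea concrete, here is a minimal sketch that computes the textbook form of the score, tf * log(N / df), in plain Python. The function name `tfidf_scores`, the whitespace tokenization, and the three-sentence toy corpus are assumptions made for illustration. Scikit-Learn's default uses a smoothed IDF and normalizes each document vector, so its numbers will differ from these.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    # Naive tokenization (illustrative assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tfidf_scores(docs)
print(scores[1])
```

In the second document, "the" occurs in every document and scores 0, while "dog" occurs only there and scores highest.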

Installation

Scikit-Learn can be installed using pip, the Python package manager.

pip install -U scikit-learn

Usage

In Scikit-Learn, the TfidfVectorizer class can be used to compute TFIDF values for a set of documents. The following is a basic example of how to use it:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(tfidf)

The result is a sparse matrix of shape (4, 9). Printing it lists each stored entry as (row, column) followed by its score. The values below are rounded to four decimals, and the header line and entry order can differ between SciPy versions:

    (0, 1)	0.4698
    (0, 2)	0.5803
    (0, 3)	0.3841
    (0, 6)	0.3841
    (0, 8)	0.3841
    (1, 1)	0.6876
    (1, 3)	0.2811
    (1, 5)	0.5386
    (1, 6)	0.2811
    (1, 8)	0.2811
    (2, 0)	0.5118
    (2, 3)	0.2671
    (2, 4)	0.5118
    (2, 6)	0.2671
    (2, 7)	0.5118
    (2, 8)	0.2671
    (3, 1)	0.4698
    (3, 2)	0.5803
    (3, 3)	0.3841
    (3, 6)	0.3841
    (3, 8)	0.3841

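Each row corresponds to a document in the corpus, and each column corresponds to a unique word in the vocabulary learned from the corpus. Each stored value is the TFIDF score of that word in that document; words that do not occur in a document have no stored entry, which means a score of zero.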

Parameters

  • max_df : Maximum document frequency. This can be either a float between 0.0 and 1.0 (a proportion of documents, e.g. 0.5) or an integer (an absolute count, e.g. 10). Words that appear in more than this proportion or number of documents are ignored. The default is 1.0.
  • min_df : Minimum document frequency. This can be either a float between 0.0 and 1.0 or an integer. Words that appear in fewer than this proportion or number of documents are ignored. The default is 1.
  • ngram_range : The range of n-gram sizes to extract, given as a tuple (min_n, max_n) of two integers (e.g. (1, 2) for unigrams and bigrams). The default is (1, 1).
  • stop_words : The stopwords to remove. This can be either "english" (to use the built-in English stopword list), a list of stopwords, or None (the default, to remove no stopwords).
  • tokenizer : A custom callable that splits the text into tokens, replacing the default tokenization step. It is used only when analyzer is "word".
  • analyzer : The feature extraction strategy to use. This can be "word" (word n-grams), "char" (character n-grams), "char_wb" (character n-grams taken only from text inside word boundaries), or a callable that extracts the features itself.

Conclusion

In this article, we discussed the Scikit-Learn TFIDF implementation in Python. TFIDF is an important technique in NLP and can be used for various tasks like text classification and document clustering. With Scikit-Learn, we can easily compute the TFIDF values for a given set of documents.