📅  Last modified: 2023-12-03 15:32:37             🧑  Author: Mango
Latent Dirichlet Allocation (LDA) is a topic modeling technique for uncovering the hidden topics in a collection of documents. It models each document as a mixture of topics, and each topic as a probability distribution over the words in the vocabulary.
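To make the mixture idea concrete, here is a minimal sketch that samples one document from the generative story LDA assumes. The toy vocabulary, the two topics, and the Dirichlet parameters are illustrative assumptions, not values taken from the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and topic count (assumed for illustration only)
vocab = ["game", "team", "score", "disk", "memory", "cpu"]
n_topics = 2

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

def generate_document(n_words, doc_topic_prior=0.5):
    """Sample one document: a topic mixture first, then one topic per word."""
    # Each document gets its own mixture over topics.
    theta = rng.dirichlet(np.full(n_topics, doc_topic_prior))
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)             # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])   # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print("topic mixture:", theta.round(3))
print("document:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.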
In this article, we will demonstrate how to perform LDA using Scikit-Learn in Python.
Before implementing LDA, let's load the dataset. We will use the 20 Newsgroups dataset, a collection of newsgroup posts from 20 different categories. Scikit-Learn downloads it on first use and caches it locally.
from sklearn.datasets import fetch_20newsgroups
newsgroups_data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
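Next, we preprocess the text: lowercase it, replace everything except letters with spaces (which removes punctuation and numbers), drop stop words, lemmatize the remaining words, and discard words of two characters or fewer. The code relies on NLTK's stop word list and WordNet data, which must be downloaded once with nltk.download.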
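import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# One-time download of the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# A set makes the stop word membership test fast
stopwords_eng = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower()
    # Replace every run of non-letters (digits, punctuation) with a space
    text = re.sub('[^a-z]+', ' ', text)
    words = text.split()
    # Remove stop words, then lemmatize what is left
    words = [lemmatizer.lemmatize(word) for word in words if word not in stopwords_eng]
    # Drop very short tokens
    words = [word for word in words if len(word) > 2]
    return ' '.join(words)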
preprocessed_data = [preprocess_text(text) for text in newsgroups_data.data]
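To see what this kind of cleaning does to a single string without downloading any NLTK data, the sketch below swaps in Scikit-Learn's built-in English stop word list and skips lemmatization. The function name `preprocess_text_light` and the sample sentence are hypothetical, and the result will differ from the NLTK-based function above because the stop word lists differ and no lemmatization is applied.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text_light(text):
    """Lowercase, keep letters only, drop stop words and short tokens."""
    text = re.sub(r"[^a-z]+", " ", text.lower())
    tokens = [t for t in text.split()
              if t not in ENGLISH_STOP_WORDS and len(t) > 2]
    return " ".join(tokens)

sample = "The 3 GPUs were running at 95% load!"  # made-up example sentence
cleaned = preprocess_text_light(sample)
print(cleaned)
```

Digits and punctuation disappear because they are replaced by spaces before tokenizing, and common function words such as "the" and "were" are filtered out by the stop word list.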
Now we convert the preprocessed text into a document-term matrix of word counts using Scikit-Learn's CountVectorizer class, keeping only the 5,000 most frequent terms.
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=5000)
feature_matrix = count_vectorizer.fit_transform(preprocessed_data)
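With the feature matrix ready, we fit an LDA model using Scikit-Learn's LatentDirichletAllocation class. We request 20 topics (the same as the number of newsgroup categories) and use the online variational Bayes learning method with at most 10 passes over the data.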
from sklearn.decomposition import LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components=20, max_iter=10, learning_method='online', random_state=42)
lda_model.fit(feature_matrix)
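Once a model is fitted, `transform` (or `fit_transform`) returns the topic mixture of each document. The sketch below illustrates this on a made-up four-document corpus with two topics; the corpus, the variable names, and the parameter values are assumptions chosen to keep the example fast, and a corpus this small will not produce meaningful topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration
toy_docs = [
    "team game score team win",
    "game score win season",
    "disk memory cpu disk",
    "cpu memory software disk",
]

counts = CountVectorizer().fit_transform(toy_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

print(doc_topic.round(3))              # each row is a distribution over topics
print(doc_topic.argmax(axis=1))        # index of the dominant topic per document
```

With the article's model, the equivalent call would be `lda_model.transform(feature_matrix)`, giving one 20-element topic distribution per post.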
To display the results, we can print the top words associated with each topic.
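def display_topics(model, feature_names, num_top_words):
    for index, topic in enumerate(model.components_):
        # Indices of the num_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-num_top_words - 1:-1]
        message = f'Topic {index}: ' + ' '.join(feature_names[i] for i in top_indices)
        print(message)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
display_topics(lda_model, count_vectorizer.get_feature_names_out(), 20)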
This will display the top 20 words associated with each of the 20 topics generated by our LDA model.
In this article, we have demonstrated how to perform LDA using Scikit-Learn in Python. We loaded the dataset, preprocessed the text, extracted word-count features, and fitted an LDA model. Finally, we displayed the top words associated with each topic.