This chapter covers creating Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) topic models in Gensim.
Latent Semantic Indexing (LSI) is, along with Latent Dirichlet Allocation (LDA), one of the topic modeling algorithms first implemented in Gensim. It is also known as Latent Semantic Analysis (LSA). It was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter.
In this section we are going to set up our LSI model. This can be done in much the same way as setting up an LDA model: we need to import the LSI model from gensim.models.
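For reference, the import itself is a single line:

from gensim.models import LsiModel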
LSI is actually an NLP technique, used especially in distributional semantics. It analyzes the relationships between a set of documents and the terms those documents contain. Talking about how it works, it constructs a matrix that contains word counts per document from a large piece of text.
Once constructed, the LSI model uses a mathematical technique called singular value decomposition (SVD) to reduce the number of rows. Besides reducing the number of rows, it also preserves the similarity structure among the columns.
In the matrix, the rows represent unique words and the columns represent each document. LSI works on the basis of the distributional hypothesis, i.e. it assumes that words that are close in meaning will occur in the same kind of text.
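To make the SVD step concrete, here is a minimal sketch (the toy term-document matrix and all variable names are illustrative, not part of this tutorial's dataset) that computes a truncated SVD with numpy:

import numpy as np

# Toy term-document count matrix: rows = unique words, columns = documents.
A = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 3, 1],
    [1, 0, 2],
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values to obtain a low-rank "topic" space;
# the rank-k approximation preserves the dominant similarity structure.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))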
Here we are going to use LSI (Latent Semantic Indexing) to extract the naturally discussed topics from a dataset.
The dataset we are going to use is the '20 Newsgroups' dataset, which has thousands of news articles from various sections of a news report. It is available under the Sklearn datasets. We can easily download it with the help of the following Python script:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
Let's look at some of the sample news with the help of the following script:
newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing)\nSubject:
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization:
University of Maryland, College Park\nLines: 15\n\n
I was wondering if anyone out there could enlighten me on this car
I saw\nthe other day. It was a 2-door sports car,
looked to be from the late 60s/\nearly 70s. It was called a Bricklin.
The doors were really small. In addition,\nthe front bumper was separate from
the rest of the body. This is \nall I know. If anyone can tellme a model name,
engine specs, years\nof production, where this car is made, history, or
whatever info you\nhave on this funky looking car,
please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood
Lerxst ----\n\n\n\n\n",
"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject:
SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords:
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization:
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA
fair number of brave souls who upgraded their SI clock oscillator have\nshared their
experiences for this poll. Please send a brief message detailing\nyour experiences with
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies
are especially requested.\n\nI will be summarizing in the next two days, so please add
to the network\nknowledge base if you have done the clock upgrade and haven't answered
this\npoll. Thanks.\n\nGuy Kuo \n",
'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject:
PB questions...\nOrganization: Purdue University Engineering Computer
Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the
ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i\'m in the
market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into
picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully)
somebody can answer:\n\n* does anybody know any dirt on when the next round of
powerbook\nintroductions are expected? i\'d heard the 185c was supposed to make
an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t
have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has
anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s
just went through recently?\n\n* what\'s the impression of the display on the 180? i
could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don\'t
really have\na feel for how much "better" the display is (yea, it looks great in
the\nstore, but is that all "wow" or is it really that good?). could i solicit\nsome
opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk
size and money hit to get the active display? (i realize\nthis is a real subjective
question, but i\'ve only played around with the\nmachines in a computer store breifly
and figured the opinions of somebody\nwho actually uses the machine daily might prove
helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any
info - if you could email, i\'ll post a\nsummary (news reading time is at a premium
with finals just around the\ncorner... :( )\n--\nTom Willis \\ twillis@ecn.purdue.edu
\\ Purdue Electrical
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',
'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: Harris
Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host:
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert J.C. Kyanko
(rob@rjck.UUCP) wrote:\n > abraxis@iastate.edu writes in article <
abraxis.734340159@class1.iastate.edu>:\n> > Anyone know about the Weitek P9000
graphics chip?\n > As far as the low-level stuff goes, it looks pretty nice. It\'s
got this\n > quadrilateral fill command that requires just the four
points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get some
information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris
Corporation\njgreen@csd.harris.com\t\t\tComputer Systems Division\n"The only thing that
really scares me is a person with no sense of humor."\n\t\t\t\t\t\t-- Jonathan
Winters\n']
We need stopwords from NLTK and the English model from spaCy. Both can be downloaded as follows:
import nltk
nltk.download('stopwords')
import spacy
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
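Note that spacy.load() will fail if the en_core_web_md model has not been installed yet. If needed, it can usually be fetched from the command line (this setup step is assumed, not part of the original script):

python -m spacy download en_core_web_md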
In order to build the LSI model, we need to import the following necessary packages:
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt
Now we need to import the stopwords and use them:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
Now, with the help of Gensim's simple_preprocess(), we need to tokenise each sentence into a list of words. We should also remove punctuation and unnecessary characters. In order to do this, we will create a function named sent_to_words():
data = newsgroups_train.data
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
As we know, bigrams are two words that frequently occur together in a document, and a trigram is three words that frequently occur together in a document. With the help of Gensim's Phrases model, we can build them:
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
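As a quick sanity check (purely illustrative, not part of the original walkthrough), the trained phrase detectors can be applied to one tokenized document to see which frequent word pairs get merged:

# Apply the phrase models to the first tokenized document.
print(trigram_mod[bigram_mod[data_words[0]]])
# Pairs that pass min_count/threshold are joined with an underscore,
# so a frequent pair may show up as a single 'word_word' token.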
Next, we need to filter out the stopwords. Along with that, we will also create functions to make bigrams, trigrams and to perform lemmatisation:
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
        if word not in stop_words] for doc in texts]
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
Now we need to build the dictionary and the corpus. We did the same in the previous examples as well:
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
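Each entry of the corpus is a bag-of-words representation of one document, i.e. a list of (token id, frequency) pairs. A minimal sketch to inspect the first document in human-readable form:

# Map token ids back to the actual words for the first document.
print([(id2word[word_id], freq) for word_id, freq in corpus[0]])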
We have now implemented everything required to train the LSI model. It is time to build the LSI topic model. For our implementation example, it can be done with the help of the following lines of code:
lsi_model = gensim.models.lsimodel.LsiModel(
    corpus=corpus, id2word=id2word, num_topics=20, chunksize=100
)
Let's see the complete implementation example to build an LSI topic model:
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"\'", "", sent) for sent in data]
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
print(data_words[:4]) # it will print the tokenized data
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
        if word not in stop_words] for doc in texts]
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(
    data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
)
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]]
#it will print the words with their frequencies.
lsi_model = gensim.models.lsimodel.LsiModel(
    corpus=corpus, id2word=id2word, num_topics=20, chunksize=100
)
Now we can use the above created LSI model to get the topics.
The above created LSI model (lsi_model) can be used to view the topics of the documents. It can be done with the help of the following script:
pprint(lsi_model.print_topics())
doc_lsi = lsi_model[corpus]
[
(0,
'1.000*"ax" + 0.001*"_" + 0.000*"tm" + 0.000*"part" + 0.000*"pne" + '
'0.000*"biz" + 0.000*"mbs" + 0.000*"end" + 0.000*"fax" + 0.000*"mb"'),
(1,
'0.239*"say" + 0.222*"file" + 0.189*"go" + 0.171*"know" + 0.169*"people" + '
'0.147*"make" + 0.140*"use" + 0.135*"also" + 0.133*"see" + 0.123*"think"')
]
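Here doc_lsi holds the corpus projected into the LSI topic space. As a minimal illustrative sketch (not part of the original walkthrough), a single document's topic weights can be inspected by applying the model to its bag-of-words vector:

# Project one bag-of-words document into the 20-dimensional LSI space.
print(lsi_model[corpus[0]]) # list of (topic id, weight) pairs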
Topic models such as LDA and LSI help in summarising and organising large archives of texts that are impossible to analyse by hand. Apart from LDA and LSI, another powerful topic model in Gensim is HDP (Hierarchical Dirichlet Process). It is basically a mixed-membership model for unsupervised analysis of grouped data. Unlike LDA (its finite counterpart), HDP infers the number of topics from the data.
In order to implement HDP in Gensim, we need the trained corpus and dictionary (built as in the above examples while implementing the LDA and LSI topic models). The HDP topic model is available as gensim.models.hdpmodel.HdpModel. Here we will also implement the HDP topic model on the 20 Newsgroups data, and the steps are the same.
For our corpus and dictionary (created in the above examples for the LSI and LDA models), we can build the HdpModel as follows:
Hdp_model = gensim.models.hdpmodel.HdpModel(corpus=corpus, id2word=id2word)
The HDP model (Hdp_model) can be used to view the topics of the documents. It can be done with the help of the following script:
pprint(Hdp_model.print_topics())
[
(0,
'0.009*line + 0.009*write + 0.006*say + 0.006*article + 0.006*know + '
'0.006*people + 0.005*make + 0.005*go + 0.005*think + 0.005*be'),
(1,
'0.016*line + 0.011*write + 0.008*article + 0.008*organization + 0.006*know '
'+ 0.006*host + 0.006*be + 0.005*get + 0.005*use + 0.005*say'),
(2,
'0.810*ax + 0.001*_ + 0.000*tm + 0.000*part + 0.000*mb + 0.000*pne + '
'0.000*biz + 0.000*end + 0.000*wwiz + 0.000*fax'),
(3,
'0.015*line + 0.008*write + 0.007*organization + 0.006*host + 0.006*know + '
'0.006*article + 0.005*use + 0.005*thank + 0.004*get + 0.004*problem'),
(4,
'0.004*line + 0.003*write + 0.002*believe + 0.002*think + 0.002*article + '
'0.002*belief + 0.002*say + 0.002*see + 0.002*look + 0.002*organization'),
(5,
'0.005*line + 0.003*write + 0.003*organization + 0.002*article + 0.002*time '
'+ 0.002*host + 0.002*get + 0.002*look + 0.002*say + 0.001*number'),
(6,
'0.003*line + 0.002*say + 0.002*write + 0.002*go + 0.002*gun + 0.002*get + '
'0.002*organization + 0.002*bill + 0.002*article + 0.002*state'),
(7,
'0.003*line + 0.002*write + 0.002*article + 0.002*organization + 0.001*none '
'+ 0.001*know + 0.001*say + 0.001*people + 0.001*host + 0.001*new'),
(8,
'0.004*line + 0.002*write + 0.002*get + 0.002*team + 0.002*organization + '
'0.002*go + 0.002*think + 0.002*know + 0.002*article + 0.001*well'),
(9,
'0.004*line + 0.002*organization + 0.002*write + 0.001*be + 0.001*host + '
'0.001*article + 0.001*thank + 0.001*use + 0.001*work + 0.001*run'),
(10,
'0.002*line + 0.001*game + 0.001*write + 0.001*get + 0.001*know + '
'0.001*thing + 0.001*think + 0.001*article + 0.001*help + 0.001*turn'),
(11,
'0.002*line + 0.001*write + 0.001*game + 0.001*organization + 0.001*say + '
'0.001*host + 0.001*give + 0.001*run + 0.001*article + 0.001*get'),
(12,
'0.002*line + 0.001*write + 0.001*know + 0.001*time + 0.001*article + '
'0.001*get + 0.001*think + 0.001*organization + 0.001*scope + 0.001*make'),
(13,
'0.002*line + 0.002*write + 0.001*article + 0.001*organization + 0.001*make '
'+ 0.001*know + 0.001*see + 0.001*get + 0.001*host + 0.001*really'),
(14,
'0.002*write + 0.002*line + 0.002*know + 0.001*think + 0.001*say + '
'0.001*article + 0.001*argument + 0.001*even + 0.001*card + 0.001*be'),
(15,
'0.001*article + 0.001*line + 0.001*make + 0.001*write + 0.001*know + '
'0.001*say + 0.001*exist + 0.001*get + 0.001*purpose + 0.001*organization'),
(16,
'0.002*line + 0.001*write + 0.001*article + 0.001*insurance + 0.001*go + '
'0.001*be + 0.001*host + 0.001*say + 0.001*organization + 0.001*part'),
(17,
'0.001*line + 0.001*get + 0.001*hit + 0.001*go + 0.001*write + 0.001*say + '
'0.001*know + 0.001*drug + 0.001*see + 0.001*need'),
(18,
'0.002*option + 0.001*line + 0.001*flight + 0.001*power + 0.001*software + '
'0.001*write + 0.001*add + 0.001*people + 0.001*organization + 0.001*module'),
(19,
'0.001*shuttle + 0.001*line + 0.001*roll + 0.001*attitude + 0.001*maneuver + '
'0.001*mission + 0.001*also + 0.001*orbit + 0.001*produce + 0.001*frequency')
]
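Since HDP infers the number of topics from the data instead of taking it as a fixed parameter, it can be useful to list more than the default 20 topics, or to convert the result into an equivalent LDA model. A minimal sketch using gensim's HdpModel API (show_topics() and suggested_lda_model(); exact behaviour can vary across gensim versions, so treat this as an assumption to verify):

# List every topic in HDP's (truncated) topic space, not just the first 20.
all_topics = Hdp_model.show_topics(num_topics=-1, formatted=False)
print(len(all_topics))

# Obtain an LdaModel whose topics closely match the HDP posterior.
suggested_lda = Hdp_model.suggested_lda_model()
pprint(suggested_lda.print_topics(num_topics=5, num_words=10))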