Part-of-Speech Tagging with Stop Words Using NLTK in Python

Last modified: 2020-04-27 14:22:06 | Author: Mango

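The Natural Language Toolkit (NLTK) is a platform for building text analysis programs. Part-of-speech tagging is one of the more powerful features of the NLTK module.
To run the Python program below, you must have NLTK installed. Follow these installation steps.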

Open a terminal and run pip install nltk.
Type python at the command prompt to start the Python interactive shell, ready to execute your code or scripts.
Type import nltk
nltk.download()

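A GUI will pop up. Choose “all” to download all packages, then click “Download”. This gives you all the tokenizers, chunkers, other algorithms, and all of the corpora, which is why the installation takes a long time.
Example: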

import nltk
nltk.download()
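
Downloading everything is not strictly required for the example in this article. As a lighter alternative, the sketch below fetches only the resources the later code relies on. The helper name download_minimal is hypothetical, and the resource identifiers are assumptions that depend on the installed NLTK version: punkt, stopwords, and averaged_perceptron_tagger are the classic names, while newer releases may expect punkt_tab and averaged_perceptron_tagger_eng instead.

```python
# Resource identifiers assumed for this article's example; adjust for your NLTK version.
REQUIRED_RESOURCES = ["punkt", "stopwords", "averaged_perceptron_tagger"]


def download_minimal(resources=REQUIRED_RESOURCES):
    """Download only the listed NLTK resources; return a {name: success} dict."""
    import nltk  # imported here so the helper can be defined before NLTK is set up

    # nltk.download returns True when the resource is available after the call.
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_minimal() once from the interactive shell should be enough; any entry mapped to False points at a resource name that needs adjusting.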

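Let's knock out some quick vocabulary:
Corpus: a body of text.
Lexicon: words and their meanings.
Token: each “entity” that is a part of whatever was split up based on rules.
In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech, based on both its definition and its context.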

Input: Everything is all about money.
Output: [('Everything', 'NN'), ('is', 'VBZ'),
         ('all', 'DT'), ('about', 'IN'),
         ('money', 'NN'), ('.', '.')]
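
The tagger returns an ordinary Python list of (token, tag) tuples, so the result can be post-processed without any NLTK-specific API. The sketch below reuses the output shown above as literal data; the helper name group_by_tag is hypothetical and only illustrates one way to organize the pairs.

```python
from collections import defaultdict

# Tagged output copied from the example above.
tagged = [('Everything', 'NN'), ('is', 'VBZ'),
          ('all', 'DT'), ('about', 'IN'),
          ('money', 'NN'), ('.', '.')]


def group_by_tag(pairs):
    """Collect tokens under their POS tag, preserving the original order."""
    groups = defaultdict(list)
    for token, tag in pairs:
        groups[tag].append(token)
    return dict(groups)


by_tag = group_by_tag(tagged)
print(by_tag['NN'])  # tokens tagged as singular nouns
```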

Below is the list of tags, what they mean, and some examples:
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
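
Many of the tags above share a prefix that identifies a broader word class (NN for nouns, VB for verbs, JJ for adjectives, RB for adverbs). When the fine distinctions are not needed, the tags can be collapsed with a prefix check. The function coarse_pos and its category labels are hypothetical choices for illustration, not part of NLTK.

```python
# Prefix -> coarse word class; the labels are arbitrary names chosen for this sketch.
COARSE_CLASSES = {
    'NN': 'noun',
    'VB': 'verb',
    'JJ': 'adjective',
    'RB': 'adverb',
}


def coarse_pos(tag):
    """Map a tag from the list above to a coarse class, or 'other' if none matches."""
    for prefix, label in COARSE_CLASSES.items():
        if tag.startswith(prefix):
            return label
    return 'other'


for tag in ['NNPS', 'VBZ', 'JJR', 'RBS', 'RP', 'DT']:
    print(tag, '->', coarse_pos(tag))
```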

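Text may contain stop words such as “the”, “is”, and “are”. Stop words can be filtered out of the text to be processed. There is no universal list of stop words in NLP research, but the nltk module ships with one.
You can add your own stop words. Go to your NLTK download directory -> corpora -> stopwords, and update the stop word file for the language you are using. Here we use English (stopwords.words('english')).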
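
Editing the stop word file on disk changes the list for every program that uses it. A less invasive option is to extend the set in code. The sketch below assumes a hypothetical helper extend_stop_words; the short base list is a stand-in so the example runs without the corpus, and in practice stopwords.words('english') would be passed instead.

```python
def extend_stop_words(base_words, extra_words):
    """Return a new lowercase set containing the base and the extra stop words."""
    return {w.lower() for w in base_words} | {w.lower() for w in extra_words}


# Stand-in for stopwords.words('english'); the real list is much longer.
base = ['the', 'is', 'are']

# 'Sukanya' and 'next' are arbitrary additions chosen for this illustration.
custom_stop_words = extend_stop_words(base, ['Sukanya', 'next'])
print(sorted(custom_stop_words))
```

Because the resulting set is stored in lowercase, tokens should be lowercased before the membership test.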

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
# Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
    "Sukanya is getting married next year. " \
    "Marriage is a big step in one’s life. " \
    "It is both exciting and frightening. " \
    "But friendship is a sacred bond between people. " \
    "It is a special kind of love between us. " \
    "Many of you must have tried searching for a friend " \
    "but never found the right one."
# sent_tokenize uses a pre-trained PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
    # The word tokenizer finds the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # Remove stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]
    # Use the tagger, i.e. the part-of-speech (POS) tagger
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'),
('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]
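
In the output above, 'It' and 'But' survive the filter even though 'it' and 'but' are in the English stop word list: the list is lowercase and the membership test is case-sensitive. word_tokenize also emits punctuation marks as separate tokens, which the stop word list does not cover. A minimal sketch of a stricter filter follows; the function name filter_tokens is hypothetical, and the small stop word set is a stand-in for set(stopwords.words('english')).

```python
import string


def filter_tokens(tokens, stop_words):
    """Drop stop words (ignoring case) and tokens made only of ASCII punctuation."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # stop word, regardless of capitalization
        if all(ch in string.punctuation for ch in tok):
            continue  # pure punctuation such as ',' or '.'
        kept.append(tok)
    return kept


# Stand-in stop word set; replace with set(stopwords.words('english')) in practice.
stop_words = {'it', 'is', 'both', 'and', 'but', 'a', 'between'}

tokens = ['It', 'is', 'both', 'exciting', 'and', 'frightening', '.']
print(filter_tokens(tokens, stop_words))
```

The surviving tokens keep their original capitalization, which a tagger may use as a cue for proper nouns, so only the comparison is lowercased.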

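Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, in most cases, correspond to words and symbols (e.g. punctuation).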
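
To make the notion of a token concrete, the sketch below splits a sentence into word and punctuation tokens with a regular expression. This is a simplified stand-in, not how nltk.word_tokenize works internally, and the function name simple_tokenize is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Everything is all about money."))
# prints ['Everything', 'is', 'all', 'about', 'money', '.']
```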