Part-of-Speech Tagging with Stop Words Using NLTK in Python

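The Natural Language Toolkit (NLTK) is a platform for building text analysis programs. One of the more powerful aspects of the NLTK module is part-of-speech tagging.
To run the Python program below, you must have NLTK installed. Follow these installation steps: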

Open your terminal and run pip install nltk.
Type python at the command prompt so that the Python interactive shell is ready to execute your code or script.
Type import nltk
nltk.download()

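A GUI will pop up. Choose "all" to download every package, then click "Download". This gives you all of the tokenizers, chunkers, other algorithms, and all of the corpora, which is why the installation takes quite a while.
Example: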

import nltk
nltk.download()
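Downloading everything is not strictly necessary for the tagging example in this article. The sketch below fetches only a few named resources. The resource names are an assumption and depend on the installed NLTK version (newer releases may look for "punkt_tab" and "averaged_perceptron_tagger_eng" instead), and download_resources is a hypothetical helper, not part of NLTK.

```python
# Resource names are an assumption; they vary between NLTK versions.
REQUIRED_RESOURCES = ["punkt", "stopwords", "averaged_perceptron_tagger"]


def download_resources(names, download=None):
    """Download each named resource and report which ones succeeded.

    `download` is any callable that takes a resource name and returns a
    truthy value on success. It defaults to nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be exercised offline
        download = nltk.download
    return {name: bool(download(name)) for name in names}
```

Calling download_resources(REQUIRED_RESOURCES) in the interactive shell should return a dictionary showing which downloads succeeded. A name that the installed version does not recognise is expected to come back as False rather than raise an error.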

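Let's quickly go over some vocabulary:
Corpus: a body of text, singular. Corpora is the plural of corpus.
Lexicon: words and their meanings.
Token: each "entity" that is a part of whatever was split up based on rules.
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech.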
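To make the idea of a token concrete, here is a minimal sketch that splits a sentence with one simple rule. It is a simplified regular-expression tokenizer written for illustration, not the rule set that NLTK's word_tokenize uses, and simple_tokenize is a hypothetical name.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Everything is all about money.")
print(tokens)
# ['Everything', 'is', 'all', 'about', 'money', '.']
```

Note that the full stop becomes a token of its own, which is why punctuation shows up with its own tag in tagger output.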

Input: Everything is all about money.
Output: [('Everything', 'NN'), ('is', 'VBZ'), 
          ('all', 'DT'),('about', 'IN'), 
          ('money', 'NN'), ('.', '.')]
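The tagger output is an ordinary Python list of (token, tag) tuples, so it can be processed with plain list operations. The sketch below takes the output shown above as a literal and counts the tags; the variable names are chosen for illustration only.

```python
from collections import Counter

# The tagged sentence from the example above, written as a literal.
tagged = [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'),
          ('about', 'IN'), ('money', 'NN'), ('.', '.')]

# Each item is a (token, tag) pair, so it unpacks directly in a loop.
singular_nouns = [word for word, tag in tagged if tag == 'NN']
print(singular_nouns)  # ['Everything', 'money']

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts['NN'])  # 2
```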

Below is the list of tags, what they mean, and some examples:

CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective – ‘big’
JJR adjective, comparative – ‘bigger’
JJS adjective, superlative – ‘biggest’
LS list marker 1)
MD modal – could, will
NN noun, singular – ‘desk’
NNS noun plural – ‘desks’
NNP proper noun, singular – ‘Harrison’
NNPS proper noun, plural – ‘Americans’
PDT predeterminer – ‘all the kids’
POS possessive ending – parent’s
PRP personal pronoun – I, he, she
PRP$ possessive pronoun – my, his, hers
RB adverb – very, silently
RBR adverb, comparative – better
RBS adverb, superlative – best
RP particle – give up
TO to – go ‘to’ the store
UH interjection – errrrrrrrm
VB verb, base form – take
VBD verb, past tense – took
VBG verb, gerund/present participle – taking
VBN verb, past participle – taken
VBP verb, sing. present, non-3rd person – take
VBZ verb, 3rd person sing. present – takes
WDT wh-determiner – which
WP wh-pronoun – who, what
WP$ possessive wh-pronoun, eg- whose
WRB wh-adverb, eg- where, when
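Many of the tags above share a prefix: every noun tag begins with NN, every verb tag with VB, and so on. A small sketch of collapsing the fine-grained tags into coarse categories follows. The category names and the coarse_category helper are hypothetical and are not part of NLTK.

```python
# Hypothetical grouping for illustration; these labels are not defined by NLTK.
COARSE_PREFIXES = [
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]


def coarse_category(tag):
    """Map a tag from the list above to a coarse category by its prefix."""
    for prefix, label in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return label
    return "other"


print(coarse_category("VBZ"))   # verb
print(coarse_category("NNPS"))  # noun
print(coarse_category("DT"))    # other
```

Tags such as WRB and RP fall into "other" here because the check looks only at the start of the tag.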


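Text may contain stop words such as "the", "is", and "are". Stop words can be filtered out of the text to be processed. There is no universal list of stop words in NLP research, but the nltk module contains a list of stop words.
You can add your own stop words. Go to your NLTK download directory -> corpora -> stopwords and update the stop word file for the language you are using. Here we use English (stopwords.words('english')).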
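Editing the corpus file is one option. A stop word set can also be extended in code, which leaves the downloaded data untouched. The sketch below uses a short stand-in set so that it runs without any NLTK data; in practice the base set would come from set(stopwords.words('english')). The extra words are hypothetical examples.

```python
# Stand-in for set(stopwords.words('english')); only a few entries are listed.
base_stop_words = {"the", "is", "are", "my", "and"}

# Hypothetical domain-specific additions.
extra_stop_words = {"good", "next"}

# Set union keeps the base list intact and adds the custom words.
stop_words = base_stop_words | extra_stop_words

tokens = ["Sukanya", "is", "getting", "married", "next", "year"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# ['Sukanya', 'getting', 'married', 'year']
```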

Python

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
 
# Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
    "Sukanya is getting married next year. " \
    "Marriage is a big step in one’s life. " \
    "It is both exciting and frightening. " \
    "But friendship is a sacred bond between people. " \
    "It is a special kind of love between us. " \
    "Many of you must have tried searching for a friend " \
    "but never found the right one."
 
# sent_tokenize splits the text into sentences using an instance of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
 
tokenized = sent_tokenize(txt)
for i in tokenized:
     
    # word_tokenize splits a sentence into words
    # and punctuation tokens
    wordsList = nltk.word_tokenize(i)

    # Remove stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]

    # Tag each remaining token with the
    # part-of-speech (POS) tagger
    tagged = nltk.pos_tag(wordsList)
 
    print(tagged)

Output:

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), 
('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]
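In the output above, 'It' and 'But' survive the filter even though "it" and "but" are stop words. The entries in NLTK's English stop word list are lowercase and the membership test is case-sensitive, so capitalised tokens do not match. A minimal sketch of a case-insensitive filter follows. remove_stop_words is a hypothetical helper, and the short stop word set stands in for the full NLTK list.

```python
def remove_stop_words(tokens, stop_words, ignore_case=True):
    """Return the tokens that are not stop words.

    With ignore_case=True each token is lowercased before the lookup,
    so 'It' is matched against 'it'.
    """
    if ignore_case:
        return [t for t in tokens if t.lower() not in stop_words]
    return [t for t in tokens if t not in stop_words]


# Stand-in for set(stopwords.words('english')).
stop_words = {"it", "is", "both", "and"}
tokens = ["It", "is", "both", "exciting", "and", "frightening"]

print(remove_stop_words(tokens, stop_words, ignore_case=False))
# ['It', 'exciting', 'frightening']
print(remove_stop_words(tokens, stop_words))
# ['exciting', 'frightening']
```

Lowercasing only for the lookup keeps the original capitalisation of the surviving tokens, which the tagger can use as a cue for proper nouns.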

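Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).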