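Python | Part-of-Speech Tagging using TextBlob
The TextBlob module is used for building text analysis programs. One of its more powerful features is part-of-speech tagging. To install TextBlob, run the following commands: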
$ pip install -U textblob
$ python -m textblob.download_corpora
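This installs TextBlob and downloads the necessary NLTK corpora. Because of the large number of tokenizers, chunkers, other algorithms, and all the corpora to be downloaded, the installation above can take quite a while.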
Let’s knock out some quick vocabulary:
Corpus : Body of text, singular. Corpora is the plural of this.
Lexicon : Words and their meanings.
Token : Each “entity” that is a part of whatever was split up based on rules.
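In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech, based on both its definition and its context.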
Input: Everything is all about money.
Output: [('Everything', 'NN'), ('is', 'VBZ'),
('all', 'DT'), ('about', 'IN'),
('money', 'NN')]
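Because the tagger returns plain (word, tag) tuples, ordinary Python is enough to filter them. The following is a minimal sketch that hard-codes the example output above as its data; the helper name `words_with_tag_prefix` is made up for illustration and is not part of TextBlob.

```python
# Tagged tokens copied from the example above; in practice this list
# would come from the tags property of a TextBlob object.
tagged = [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'),
          ('about', 'IN'), ('money', 'NN')]


def words_with_tag_prefix(pairs, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


# 'NN' also matches NNS, NNP and NNPS; 'VB' matches every verb tag.
nouns = words_with_tag_prefix(tagged, 'NN')
verbs = words_with_tag_prefix(tagged, 'VB')
print(nouns)
print(verbs)
```

Matching on a prefix rather than the exact tag is a deliberate choice here, since all noun tags share the prefix NN and all verb tags share VB.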
Here is a list of the tags, what they mean, and some examples:
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store
UH interjection errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
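Many tags in the list above share a prefix (NN for nouns, VB for verbs, JJ for adjectives, RB for adverbs), so a fine-grained tag can be collapsed into a coarse word class when that is all an analysis needs. The sketch below is one possible way to do this; the mapping `COARSE_CLASSES` and the function `coarse_class` are hypothetical names, not something TextBlob provides.

```python
# Hypothetical prefix-to-class mapping, derived from the tag list above.
COARSE_CLASSES = {
    'NN': 'noun',       # NN, NNS, NNP, NNPS
    'VB': 'verb',       # VB, VBD, VBG, VBN, VBP, VBZ
    'JJ': 'adjective',  # JJ, JJR, JJS
    'RB': 'adverb',     # RB, RBR, RBS
}


def coarse_class(tag):
    """Map a fine-grained POS tag to a coarse word class by its prefix."""
    for prefix, label in COARSE_CLASSES.items():
        if tag.startswith(prefix):
            return label
    # Everything else (DT, IN, CC, RP, WRB, ...) falls through.
    return 'other'


for tag in ['NNPS', 'VBZ', 'JJR', 'RBS', 'DT']:
    print(tag, coarse_class(tag))
```

Note that this simple prefix rule places WRB (wh-adverb) and RP (particle) under 'other'; a fuller mapping would list such tags explicitly.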
Python3
# Import the TextBlob class from the textblob library
from textblob import TextBlob

text = ("Sukanya, Rajib and Naba are my good friends. " +
        "Sukanya is getting married next year. " +
        "Marriage is a big step in one’s life." +
        "It is both exciting and frightening. " +
        "But friendship is a sacred bond between people." +
        "It is a special kind of love between us. " +
        "Many of you must have tried searching for a friend " +
        "but never found the right one.")

# Create a TextBlob object
blob_object = TextBlob(text)

# Part-of-speech tags are available through the
# tags property of the blob object.
# Print each word with its POS tag.
print(blob_object.tags)
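Output (the input string has no space after “life.” and “people.”, so the tokens 'life.It' and 'people.It' come out fused):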
[('Sukanya', 'NNP'),
('Rajib', 'NNP'),
('and', 'CC'),
('Naba', 'NNP'),
('are', 'VBP'),
('my', 'PRP$'),
('good', 'JJ'),
('friends', 'NNS'),
('Sukanya', 'NNP'),
('is', 'VBZ'),
('getting', 'VBG'),
('married', 'VBN'),
('next', 'JJ'),
('year', 'NN'),
('Marriage', 'NN'),
('is', 'VBZ'),
('a', 'DT'),
('big', 'JJ'),
('step', 'NN'),
('in', 'IN'),
('one', 'CD'),
('’', 'NN'),
('s', 'NN'),
('life.It', 'NN'),
('is', 'VBZ'),
('both', 'DT'),
('exciting', 'VBG'),
('and', 'CC'),
('frightening', 'NN'),
('But', 'CC'),
('friendship', 'NN'),
('is', 'VBZ'),
('a', 'DT'),
('sacred', 'JJ'),
('bond', 'NN'),
('between', 'IN'),
('people.It', 'NN'),
('is', 'VBZ'),
('a', 'DT'),
('special', 'JJ'),
('kind', 'NN'),
('of', 'IN'),
('love', 'NN'),
('between', 'IN'),
('us', 'PRP'),
('Many', 'JJ'),
('of', 'IN'),
('you', 'PRP'),
('must', 'MD'),
('have', 'VB'),
('tried', 'VBN'),
('searching', 'VBG'),
('for', 'IN'),
('a', 'DT'),
('friend', 'NN'),
('but', 'CC'),
('never', 'RB'),
('found', 'VBD'),
('the', 'DT'),
('right', 'JJ'),
('one', 'NN')]
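A list of (word, tag) pairs like the one above is easy to summarize with the standard library. As a small sketch, the snippet below counts how often each tag occurs, using the pairs for the first sentence copied from the output above as hard-coded data; with TextBlob installed, the same code could be applied to the full tags list instead.

```python
from collections import Counter

# Pairs for the first sentence, copied from the output above.
first_sentence_tags = [
    ('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('and', 'CC'),
    ('Naba', 'NNP'), ('are', 'VBP'), ('my', 'PRP$'),
    ('good', 'JJ'), ('friends', 'NNS'),
]

# Count the tags only, ignoring the words themselves.
tag_counts = Counter(tag for _word, tag in first_sentence_tags)

# Most frequent tag first.
for tag, count in tag_counts.most_common():
    print(tag, count)
```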
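Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).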
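To make the idea of tokens concrete, here is a deliberately simplified tokenizer that splits text into words and punctuation symbols with a regular expression. It is only an illustration under that assumption and is not the tokenizer TextBlob uses internally; the function name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+      matches a run of letters, digits or underscores (a word)
    # [^\w\s]  matches one character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Everything is all about money.")
print(tokens)
```

Here the final period becomes a token of its own, whereas the tags output shown earlier does not list the sentence-final periods.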