NLP | Likely Word Tags
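nltk.probability.FreqDist is used to find the most common words by counting word frequencies in the treebank corpus. A ConditionalFreqDist is then built from the tagged words, counting how often each tag occurs for each word. These counts are used to construct a model whose keys are the most frequent words and whose values are the most frequent tag for each of those words.
Code #1: Creating the function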
Python3
# Loading libraries
from nltk.probability import FreqDist, ConditionalFreqDist

# Build a {word: most frequent tag} model for the `limit` most common words
def word_tag_model(words, tagged_words, limit=200):
    fd = FreqDist(words)                     # word -> count
    cfd = ConditionalFreqDist(tagged_words)  # word -> (tag -> count)
    most_freq = (word for word, count in fd.most_common(limit))
    # cfd[word].max() returns the tag seen most often with that word
    return dict((word, cfd[word].max())
                for word in most_freq)
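To make the counting step concrete, here is a minimal sketch that runs the same function on a tiny hand-made word list instead of the treebank corpus. The toy data and the names `toy_tagged`, `toy_words` and `toy_model` are assumptions made up for illustration; they are not part of the original article.

```python
from nltk.probability import FreqDist, ConditionalFreqDist

def word_tag_model(words, tagged_words, limit=200):
    fd = FreqDist(words)
    cfd = ConditionalFreqDist(tagged_words)
    most_freq = (word for word, _ in fd.most_common(limit))
    return {word: cfd[word].max() for word in most_freq}

# Hypothetical toy data: "run" appears once as NN and twice as VB
toy_tagged = [("the", "DT"), ("run", "NN"), ("the", "DT"),
              ("run", "VB"), ("run", "VB"), ("dog", "NN")]
toy_words = [word for word, _ in toy_tagged]

# Keep only the 2 most frequent words ("run" and "the")
toy_model = word_tag_model(toy_words, toy_tagged, limit=2)
print(toy_model)
```

With `limit=2`, "dog" is left out of the model because it is only the third most frequent word, and "run" maps to VB because that tag outnumbers NN.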
Code #2: Using the function with UnigramTagger
Python3
# Loading libraries
from tag_util import word_tag_model
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

# Initializing training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Building the likely-tag model from the whole treebank corpus
model = word_tag_model(treebank.words(),
                       treebank.tagged_words())

# Initializing a UnigramTagger from the model (no training data, no backoff)
tag = UnigramTagger(model=model)
print("Accuracy : ", tag.evaluate(test_data))
Output :
Accuracy : 0.559680552557738
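One likely reason for the modest score is that a model-only UnigramTagger with no backoff assigns None to every word missing from its model, and the model here covers only the 200 most frequent words. The sketch below shows that behaviour on a hand-written model; `toy_model` and the sample sentence are assumptions for illustration only.

```python
from nltk.tag import UnigramTagger

# Hypothetical two-word model standing in for the 200-word model
toy_model = {"the": "DT", "of": "IN"}

# No training data and no backoff tagger, as in Code #2
toy_tagger = UnigramTagger(model=toy_model)

tagged = toy_tagger.tag(["the", "price", "of", "oil"])
print(tagged)
# [('the', 'DT'), ('price', None), ('of', 'IN'), ('oil', None)]
```

Every None counts as a wrong tag during evaluation, which is why adding a backoff tagger can be expected to help.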
Code #3: Trying a backoff chain
Python3
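# Loading libraries
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
# Assumption: backoff_tagger lives in the same tag_util module as word_tag_model
from tag_util import backoff_tagger

# Tag of last resort for words that no other tagger recognizes
default_tagger = DefaultTagger('NN')
# The likely-tag model sits just above the default tagger
likely_tagger = UnigramTagger(model=model, backoff=default_tagger)
# Train the n-gram taggers on train_data (from Code #2), backing off to likely_tagger
tag = backoff_tagger(train_data,
                     [UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=likely_tagger)
print("Accuracy : ", tag.evaluate(test_data))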
Output :
Accuracy : 0.8806820634578028
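The backoff_tagger helper is called above but never defined in this article. A minimal sketch of what such a helper might look like follows, assuming it simply trains each tagger class in order and passes the previously built tagger as its backoff. The body and the toy training sentence are assumptions, not the article's actual implementation.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Hypothetical helper: each new tagger falls back to the one built before it
    for tagger_class in tagger_classes:
        backoff = tagger_class(train_sents, backoff=backoff)
    return backoff

# Hypothetical one-sentence training set
toy_train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]

chain = backoff_tagger(toy_train,
                       [UnigramTagger, BigramTagger],
                       backoff=DefaultTagger("NN"))

result = chain.tag(["the", "cat"])
print(result)
# [('the', 'DT'), ('cat', 'NN')]
```

The last class in the list ends up at the front of the chain, so lookups here go bigram, then unigram, then the default tag. "cat" never appears in the training data and so falls through to NN.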
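Note: The backoff chain increases accuracy. The result can be improved further by using the UnigramTagger class more effectively.
Code #4: Manually overriding the trained tagger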
Python3
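# Loading libraries
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
# Assumption: backoff_tagger lives in the same tag_util module as word_tag_model
from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')
# Train the n-gram chain on train_data (from Code #2) with only the default tagger behind it
tagger = backoff_tagger(train_data,
                        [UnigramTagger, BigramTagger, TrigramTagger],
                        backoff=default_tagger)
# Put the likely-tag model in front, so it overrides the trained chain for the words it covers
likely_tag = UnigramTagger(model=model, backoff=tagger)
print("Accuracy : ", likely_tag.evaluate(test_data))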
Output :
Accuracy : 0.8824088063889488
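The only difference between Code #3 and Code #4 is where the likely-tag model sits in the chain. The sketch below contrasts the two orderings on toy data, assuming a hand-written `override_model` that deliberately disagrees with the training sentences. All names and data here are hypothetical.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical training data in which "run" is always a verb
toy_train = [[("the", "DT"), ("run", "VB")],
             [("a", "DT"), ("run", "VB")]]
# Hypothetical override that insists "run" is a noun
override_model = {"run": "NN"}
default = DefaultTagger("NN")

# Code #3 ordering: trained tagger first, model only as a backoff
model_as_backoff = UnigramTagger(
    toy_train, backoff=UnigramTagger(model=override_model, backoff=default))

# Code #4 ordering: model first, trained tagger as its backoff
model_first = UnigramTagger(
    model=override_model, backoff=UnigramTagger(toy_train, backoff=default))

tags_backoff = model_as_backoff.tag(["the", "run"])
tags_first = model_first.tag(["the", "run"])
print(tags_backoff)  # [('the', 'DT'), ('run', 'VB')]
print(tags_first)    # [('the', 'DT'), ('run', 'NN')]
```

When the model is placed first, its entry for "run" wins over what the trained tagger learned. This is the manual override that Code #4 applies to the 200 most frequent words.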