NLP | Likely Word Tags



nltk.probability.FreqDist is used to find the most common words by counting word frequencies in the treebank corpus. A ConditionalFreqDist is then built from the tagged words, giving the frequency of each tag for every word. These counts are used to build a model: a dictionary whose keys are the most frequent words and whose values are the single most frequent tag for each of those words.

Code #1: Creating the function

Python3
# Loading libraries
from nltk.probability import FreqDist, ConditionalFreqDist

# Building a word -> most likely tag model
def word_tag_model(words, tagged_words, limit = 200):

    fd = FreqDist(words)
    cfd = ConditionalFreqDist(tagged_words)
    most_freq = (word for word, count in fd.most_common(limit))

    # Map each frequent word to its single most frequent tag
    return dict((word, cfd[word].max())
                for word in most_freq)
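
Since the later snippets import this function with "from tag_util import word_tag_model", it is assumed to be saved in a local tag_util.py file. As a quick illustration of what the function returns, the hypothetical snippet below applies it to a tiny hand-made tagged corpus (the words and tags are invented purely for demonstration):

Python3
# A toy tagged corpus (hypothetical example data)
tagged = [('the', 'DT'), ('cat', 'NN'), ('the', 'DT'),
          ('cat', 'NN'), ('runs', 'VBZ'), ('the', 'DT')]
words = [word for word, tag in tagged]

# Keep only the 2 most frequent words, mapping each to its
# most frequent tag
print(word_tag_model(words, tagged, limit = 2))
# {'the': 'DT', 'cat': 'NN'}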



Code #2: Using the function with UnigramTagger

Python3

# Loading libraries
from tag_util import word_tag_model
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

# Initializing the training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Building the word -> most likely tag model
model = word_tag_model(treebank.words(),
                       treebank.tagged_words())

# Initializing the UnigramTagger with the model
tag = UnigramTagger(model = model)

print("Accuracy : ", tag.evaluate(test_data))

Output:

Accuracy : 0.559680552557738
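
To get a feel for what this model-only tagger does, it can be run on a single sentence: words that appear among the 200 most frequent treebank words receive their most likely tag, while all other words are left untagged (None). The sentence below is an arbitrary illustration, not from the original article:

Python3
# Tagging a sample sentence with the model-based tagger
sample = ['the', 'company', 'said', 'it', 'will', 'buy', 'shares']
print(tag.tag(sample))
# Frequent words get their most likely tag; the rest are tagged None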

Code #3: Let's try a backoff chain

Python3

# Loading libraries
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import DefaultTagger
# backoff_tagger is assumed to be another helper in tag_util
# (a sketch of it is shown after the output below)
from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')

likely_tagger = UnigramTagger(
        model = model, backoff = default_tagger)

# Training a Unigram/Bigram/Trigram backoff chain on the training
# data, with the likely-tag tagger as the final backoff
tag = backoff_tagger(train_data, [
        UnigramTagger, BigramTagger,
        TrigramTagger], backoff = likely_tagger)

print("Accuracy : ", tag.evaluate(test_data))

Output:

Accuracy : 0.8806820634578028
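
The backoff_tagger helper used above is not defined in this article; like word_tag_model, it is assumed to live in tag_util. A minimal sketch of such a helper, which trains each tagger class in turn and chains it onto the previous one as its backoff, could look like this (the real tag_util version may differ):

Python3
# Sketch of a backoff_tagger helper (assumed, not from the article)
def backoff_tagger(train_sents, tagger_classes, backoff = None):
    # Train each tagger class on the training sentences, using the
    # previously built tagger as its backoff
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff = backoff)
    return backoff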

Note: The backoff chain increases the accuracy. We can improve this result even further by using the UnigramTagger class effectively, as in the next example.

Code #4: Manual override of a trained tagger

Python3

# Loading libraries
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import DefaultTagger
from tag_util import backoff_tagger

default_tagger = DefaultTagger('NN')

# Training the backoff chain first, on top of the default tagger
tagger = backoff_tagger(train_data, [
        UnigramTagger, BigramTagger,
        TrigramTagger], backoff = default_tagger)

# Putting the likely-tag model in front of the trained chain
likely_tag = UnigramTagger(model = model, backoff = tagger)

print("Accuracy : ", likely_tag.evaluate(test_data))

Output:

Accuracy : 0.8824088063889488
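
Because the model is consulted before the trained backoff chain, any entry placed in it manually takes precedence over what the trained taggers would predict. A small hedged sketch of such a manual override (the word and tag below are chosen arbitrarily for illustration):

Python3
# Forcing a specific tag for one word (hypothetical override)
model['would'] = 'NN'

custom_tag = UnigramTagger(model = model, backoff = tagger)
print(custom_tag.tag(['would']))
# [('would', 'NN')] - the manual model entry wins over the trained chain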