NLP | Training a Tagger-Based Chunker | Set 1

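Training a chunker is an alternative to manually specifying regular expression (regex) chunk patterns. Writing those patterns by hand is tedious, because it takes trial and error to arrive at exactly the right ones. Existing corpus data can be used to train a chunker instead.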

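The code below uses the treebank_chunk corpus, which provides chunked sentences in the form of trees.
-> To train a tagger-based chunker: the TagChunker class is given the trees returned by the corpus's chunked_sents() method.
-> To extract a list of (pos, iob) tuples from a list of trees: the TagChunker class uses the helper function conll_tag_chunks().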

These tuples are then used to train a tagger, which learns to assign IOB tags to part-of-speech tags.
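To make the (pos, iob) training data concrete, here is a minimal plain-Python sketch of the flattening step. It does not use NLTK: the nested-list sentence format and the names to_iob() and to_pos_iob() are assumptions made for illustration, standing in for a chunk tree, tree2conlltags() and conll_tag_chunks() respectively.

```python
# Hypothetical stand-in for a chunk tree: each item is either a (word, pos)
# pair outside any chunk, or a (label, [(word, pos), ...]) pair for a chunk.
sentence = [
    ("NP", [("the", "DT"), ("cat", "NN")]),
    ("sat", "VBD"),
    ("on", "IN"),
    ("NP", [("the", "DT"), ("mat", "NN")]),
]


def to_iob(sent):
    """Flatten a chunked sentence into (word, pos, iob) triples."""
    triples = []
    for first, second in sent:
        if isinstance(second, list):
            # A chunk: the first token gets B-<label>, the rest get I-<label>.
            for i, (word, pos) in enumerate(second):
                prefix = "B-" if i == 0 else "I-"
                triples.append((word, pos, prefix + first))
        else:
            # A token outside any chunk gets the tag O.
            triples.append((first, second, "O"))
    return triples


def to_pos_iob(sent):
    """Drop the words, keeping the (pos, iob) pairs a tagger would train on."""
    return [(pos, iob) for (_, pos, iob) in to_iob(sent)]


print(to_pos_iob(sentence))
```

Each part-of-speech tag plays the role of the "word" and each IOB tag plays the role of the "tag", which is why an ordinary sequence tagger can be trained on this data.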

Code #1: Let's understand the Chunker class for training.

from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger, BigramTagger
# backoff_tagger comes from a local tag_util module; it is not part of NLTK
from tag_util import backoff_tagger


def conll_tag_chunks(chunk_data):
    # convert each chunk tree into (word, pos, iob) triples
    tagged_data = [tree2conlltags(tree) for tree in chunk_data]

    # drop the words, keeping (pos, iob) pairs as training data
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_data]


class TagChunker(ChunkParserI):

    def __init__(self, train_chunks,
                 tagger_classes=[UnigramTagger, BigramTagger]):

        train_data = conll_tag_chunks(train_chunks)
        self.tagger = backoff_tagger(train_data, tagger_classes)

    def parse(self, tagged_sent):
        # nothing to parse for an empty sentence
        if not tagged_sent:
            return None

        (words, tags) = zip(*tagged_sent)
        # the tagger maps each POS tag to a (pos, iob) pair
        chunks = self.tagger.tag(tags)
        wtc = zip(words, chunks)

        # rebuild (word, pos, iob) triples and convert them into a tree
        return conlltags2tree([(w, t, c) for (w, (t, c)) in wtc])

Output:

Training TagChunker
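The backoff_tagger helper is imported from tag_util and is not shown in this article. The sketch below is one plausible shape for such a helper, under the assumption that it trains each tagger class in turn and passes the previous tagger as the backoff; the real tag_util implementation may differ. The ToyUnigramTagger and ToyBigramTagger classes are hypothetical stand-ins so the example runs without NLTK, and they are not NLTK's taggers.

```python
from collections import Counter, defaultdict


def backoff_tagger(train_sents, tagger_classes, backoff=None):
    """Train each class in order; each new tagger backs off to the previous one."""
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff


class ToyContextTagger:
    """Toy n-gram style tagger (illustrative only, not NLTK's implementation)."""

    def __init__(self, train_sents, backoff=None):
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            history = []
            for token, tag in sent:
                counts[self.context(token, history)][tag] += 1
                history.append(tag)
        # Keep the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def choose(self, token, history):
        tag = self.table.get(self.context(token, history))
        if tag is None and self.backoff is not None:
            tag = self.backoff.choose(token, history)  # unseen context: back off
        return tag

    def tag(self, tokens):
        history = []
        for token in tokens:
            history.append(self.choose(token, history))
        return list(zip(tokens, history))


class ToyUnigramTagger(ToyContextTagger):
    def context(self, token, history):
        return token  # the token alone


class ToyBigramTagger(ToyContextTagger):
    def context(self, token, history):
        return (history[-1] if history else None, token)  # previous tag + token


# (pos, iob) training sentences: POS tags are the tokens, IOB tags are the labels.
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
     ("IN", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O"), ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
]

tagger = backoff_tagger(train, [ToyUnigramTagger, ToyBigramTagger])
print(tagger.tag(["NN", "VBD"]))  # bigram context knows a sentence-initial NN
print(tagger.tag(["IN", "DT"]))   # sentence-initial IN is unseen, so it backs off
```

In this toy data the unigram tagger alone would label NN as I-NP, its most frequent tag, while the bigram tagger uses the previous IOB tag to label a sentence-initial NN as B-NP. That is the motivation for listing BigramTagger after UnigramTagger in tagger_classes.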

Code #2: Using the TagChunker.

# loading libraries
# TagChunker is the class from Code #1, assumed to be saved as chunkers.py
from chunkers import TagChunker
from nltk.corpus import treebank_chunk
  
# data from treebank_chunk corpus
train_data = treebank_chunk.chunked_sents()[:3000]
test_data = treebank_chunk.chunked_sents()[3000:]
  
# Initializing: train the chunker on the training trees
chunker = TagChunker(train_data)

Code #3: Evaluating the TagChunker

# testing
score = chunker.evaluate(test_data)
  
a = score.accuracy()
p = score.precision()
r = score.recall()
  
print("Accuracy of TagChunker : ", a)
print("\nPrecision of TagChunker : ", p)
print("\nRecall of TagChunker : ", r)

Output:

Accuracy of TagChunker : 0.9732039335251428

Precision of TagChunker : 0.9166534370535006

Recall of TagChunker : 0.9465573770491803
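To show what these three numbers measure, here is a simplified plain-Python sketch: accuracy is computed per IOB tag, while precision and recall compare whole chunks. The functions chunk_spans() and chunk_scores() are hypothetical names used for illustration; this is not NLTK's scoring code, and it handles a single sentence only.

```python
def chunk_spans(iob_tags):
    """Return the set of (start, end, label) spans encoded by a list of IOB tags."""
    spans, start, label = set(), None, None
    # A trailing "O" sentinel closes a chunk that runs to the end of the sentence.
    for i, tag in enumerate(list(iob_tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.add((start, i, label))  # the open chunk ends before position i
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]     # a new chunk starts at position i
    return spans


def chunk_scores(gold, guess):
    """Return (tag accuracy, chunk precision, chunk recall) for one sentence."""
    gold_spans, guess_spans = chunk_spans(gold), chunk_spans(guess)
    correct = len(gold_spans & guess_spans)
    precision = correct / len(guess_spans) if guess_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    accuracy = sum(g == p for g, p in zip(gold, guess)) / len(gold)
    return accuracy, precision, recall


gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP"]
guess = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O"]

accuracy, precision, recall = chunk_scores(gold, guess)
print(accuracy, precision, recall)
```

Here a single wrong tag leaves tag accuracy at 5/6 but cuts the second noun phrase short, so only one of the two chunks matches and precision and recall both drop to 0.5. This is why accuracy is typically the highest of the three figures.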