NLP | Classifier-based Chunking | Set 1

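Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. A ClassifierChunker class can therefore be built on top of it so that it learns from both the words and the part-of-speech tags, rather than from the part-of-speech tags alone as the TagChunker class does.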

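chunk_trees2train_chunks() takes the (word, pos, iob) 3-tuples produced by tree2conlltags() and converts them into ((word, pos), iob) 2-tuples. This keeps the data compatible with the 2-tuple (token, tag) format required to train a ClassifierBasedTagger: each token is a (word, pos) pair and its tag is the IOB label.

Code #1: Converting chunk trees into training data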

# Loading Libraries
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger
  
def chunk_trees2train_chunks(chunk_sents):
  
    # Using tree2conlltags
    tag_sents = [tree2conlltags(sent) for 
                 sent in chunk_sents]
  
    # Convert each (word, pos, iob) 3-tuple to a ((word, pos), iob) 2-tuple
    return [[((w, t), c) for 
             (w, t, c) in sent] for sent in tag_sents]
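
To make the reshaping concrete, here is a minimal sketch that applies the same list comprehension to one hand-written sentence. The sample sentence and the variable names are illustrative assumptions, and the data is typed in directly rather than produced by tree2conlltags(), so the block runs without NLTK.

```python
# Hand-written sample in the (word, pos, iob) shape that tree2conlltags() returns.
# The sentence itself is an illustrative assumption, not corpus data.
tag_sents = [[('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('sat', 'VBD', 'O')]]

# Same comprehension as in chunk_trees2train_chunks():
# (word, pos, iob) -> ((word, pos), iob)
train_chunks = [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]

# From the tagger's point of view, each (word, pos) pair is a token
# and the IOB label is the tag to be predicted.
tokens = [tok for (tok, _) in train_chunks[0]]
labels = [iob for (_, iob) in train_chunks[0]]

# The inverse reshaping, as used later in parse(), restores the 3-tuples.
restored = [[(w, t, c) for ((w, t), c) in sent] for sent in train_chunks]

print(train_chunks[0])
print(tokens)
print(labels)
```

Because the conversion only regroups the tuple fields, it loses no information, which is why parse() can undo it before building the tree.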

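Next, a feature detector function is needed to pass to the ClassifierBasedTagger. Any feature detector function used with the ClassifierChunker class (defined next) should expect tokens to be a list of (word, pos) tuples and should have the same function signature as prev_next_pos_iob(). To give the classifier as much information as possible, this feature set contains the current, previous and next word and part-of-speech tag, along with the previous IOB tag.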

Code #2: Feature detector function

def prev_next_pos_iob(tokens, index, history):
      
    word, pos = tokens[index]
    if index == 0:
        prevword, prevpos, previob = ('', )*3
    else:
        prevword, prevpos = tokens[index-1]
        previob = history[index-1]
          
    if index == len(tokens) - 1:
        # Last token: there is no next word or tag
        nextword, nextpos = ('', )*2
    else:
        nextword, nextpos = tokens[index + 1]

    # Build the feature dict for every token, including the last one
    feats = {'word': word,
             'pos': pos,
             'nextword': nextword,
             'nextpos': nextpos,
             'prevword': prevword,
             'prevpos': prevpos,
             'previob': previob
             }
    return feats
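
The sketch below shows how such a detector is called while tagging left to right, with history holding the IOB tags chosen so far. The detector is restated in compact form so the block runs on its own. toy_classify() and greedy_tag() are hypothetical helpers written only for this illustration: the first is a hand-written rule standing in for a trained classifier, the second is a simplified imitation of a sequential tagging loop and not NLTK's code.

```python
def prev_next_pos_iob(tokens, index, history):
    """Feature detector, restated compactly so this block is self-contained."""
    word, pos = tokens[index]
    prevword, prevpos, previob = ('', '', '')
    if index > 0:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]
    nextword, nextpos = ('', '')
    if index < len(tokens) - 1:
        nextword, nextpos = tokens[index + 1]
    return {'word': word, 'pos': pos,
            'nextword': nextword, 'nextpos': nextpos,
            'prevword': prevword, 'prevpos': prevpos,
            'previob': previob}


def toy_classify(feats):
    """Hypothetical stand-in for a trained classifier: one hand-written rule."""
    if feats['pos'] in ('DT', 'JJ', 'NN'):
        # Continue a noun phrase if the previous token was already inside one.
        return 'I-NP' if feats['previob'] in ('B-NP', 'I-NP') else 'B-NP'
    return 'O'


def greedy_tag(tokens, detector, classify):
    """Tag left to right; each decision sees the tags already chosen."""
    history = []
    for index in range(len(tokens)):
        feats = detector(tokens, index, history)
        history.append(classify(feats))
    return list(zip(tokens, history))


tokens = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]
first_feats = prev_next_pos_iob(tokens, 0, [])
last_feats = prev_next_pos_iob(tokens, 2, ['B-NP', 'I-NP'])
tagged = greedy_tag(tokens, prev_next_pos_iob, toy_classify)

print(first_feats)
print(last_feats)
print(tagged)
```

At the first token the previous-word features are empty strings, and at the last token the next-word features are empty, so the detector returns a full feature dict at every position.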

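Now the ClassifierChunker class is needed. It uses an internal ClassifierBasedTagger trained on sentences from chunk_trees2train_chunks(), with features extracted by prev_next_pos_iob(). As a subclass of ChunkParserI, ClassifierChunker implements the parse() method, which uses conlltags2tree() to convert the ((w, t), c) tuples produced by the internal tagger into a Tree.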

Code #3: The ClassifierChunker class

class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, 
                 feature_detector = prev_next_pos_iob, **kwargs):
          
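        if not feature_detector:
            # Fallback used only when the caller passes a falsy detector;
            # a subclass would have to define a feature_detector method
            feature_detector = self.feature_detector

        # Always convert the training trees and train the internal tagger
        train_chunks = chunk_trees2train_chunks(train_sents)
        self.tagger = ClassifierBasedTagger(train = train_chunks,
            feature_detector = feature_detector, **kwargs)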
              
    def parse(self, tagged_sent):
          
        if not tagged_sent: return None
        chunks = self.tagger.tag(tagged_sent)
          
        return conlltags2tree(
                [(w, t, c) for ((w, t), c) in chunks])
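
To illustrate what the last step of parse() achieves, the following sketch groups IOB-tagged tokens into labelled chunks using plain Python lists. group_iob() is a hypothetical, simplified stand-in written for this illustration: it is not NLTK's conlltags2tree() and it returns nested lists rather than a Tree. The sample tagger output is also an assumption.

```python
def group_iob(chunks):
    """Group ((word, pos), iob) pairs into chunks.

    Returns a list whose items are either a bare (word, pos) tuple for a
    token tagged 'O', or a [label, tokens] list for a chunk.
    """
    result = []
    current = None  # the chunk being built, as [label, tokens], or None
    for (word, pos), iob in chunks:
        if iob == 'O':
            # Outside any chunk: close the open chunk and keep the bare token.
            current = None
            result.append((word, pos))
            continue
        prefix, label = iob.split('-', 1)
        # 'B-' always opens a new chunk; an 'I-' with no matching open chunk
        # is treated as a chunk start as well.
        if prefix == 'B' or current is None or current[0] != label:
            current = [label, []]
            result.append(current)
        current[1].append((word, pos))
    return result


# Sample tagger output in the ((word, pos), iob) shape (illustrative data).
chunks = [(('the', 'DT'), 'B-NP'), (('cat', 'NN'), 'I-NP'),
          (('sat', 'VBD'), 'O'), (('on', 'IN'), 'O'),
          (('the', 'DT'), 'B-NP'), (('mat', 'NN'), 'I-NP')]

grouped = group_iob(chunks)
for item in grouped:
    print(item)
```

In the real class this grouping is delegated to conlltags2tree(), which is why parse() first reshapes the ((w, t), c) pairs back into (w, t, c) 3-tuples.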