NLP | Location Tags Extraction

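A different kind of ChunkParserI subclass can be used to identify LOCATION chunks by looking words up in a gazetteer. The gazetteers corpus is a WordListCorpusReader that contains the following location words: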

Country names
U.S. states and abbreviations
Mexican states
Major U.S. cities
Canadian provinces

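The LocationChunker class iterates over a tagged sentence, looking for words that appear in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced by iob_locations(), and the parse() method converts those IOB tags into a Tree.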
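
To make the IOB scheme concrete, here is a minimal pure-Python sketch of how (word, tag, IOB) triples can be grouped into chunks. The helper name group_iob is hypothetical and only approximates what conlltags2tree does: it returns plain lists rather than a Tree, and it silently drops an I- tag that has no open chunk.

```python
def group_iob(triples, label='LOCATION'):
    """Group (word, tag, iob) triples into lists of (word, tag) pairs.

    Hypothetical helper for illustration only; NLTK's conlltags2tree
    builds a Tree instead of plain lists.
    """
    chunks, current = [], None
    for word, tag, iob in triples:
        if iob == 'B-' + label:
            # A B- tag always starts a new chunk.
            current = [(word, tag)]
            chunks.append(current)
        elif iob == 'I-' + label and current is not None:
            # An I- tag continues the chunk that is currently open.
            current.append((word, tag))
        else:
            # 'O' (or an I- tag with no open chunk) closes any open chunk.
            current = None
    return chunks


triples = [
    ('San', 'NNP', 'B-LOCATION'),
    ('Francisco', 'NNP', 'I-LOCATION'),
    ('CA', 'NNP', 'I-LOCATION'),
    ('is', 'BE', 'O'),
    ('cold', 'JJ', 'O'),
]
chunks = group_iob(triples)
print(chunks)
```

The B- tag marks the first word of a chunk, I- tags extend it, and O marks words outside any chunk, so the three location words above end up in a single group.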

Code #1: LocationChunker class

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        # All gazetteer entries; some of them span several words.
        self.locations = set(gazetteers.words())
        # Largest number of spaces found in any single entry.
        self.lookahead = 0
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords
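
The lookahead computation can be tried without the corpus. The sketch below uses a small hand-made set in place of gazetteers.words(); the entries are illustrative assumptions, not a dump of the real gazetteer.

```python
# Toy gazetteer for illustration; the real set comes from gazetteers.words().
locations = {'CA', 'San Jose', 'San Francisco', 'New York City'}

# The number of spaces in an entry is the number of words that follow
# its first word, so the maximum bounds how far ahead the chunker looks.
lookahead = max((loc.count(' ') for loc in locations), default=0)
print(lookahead)
```

Using max() with a generator is equivalent to the explicit loop in the constructor; default=0 covers an empty gazetteer.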

Code #2: iob_locations() method

    # Methods of LocationChunker, continuing the class from Code #1.
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False

        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False

            # Grow the candidate phrase one word at a time, up to the lookahead.
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    # A match that directly follows another location
                    # continues the same chunk.
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'

                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'

                    loc, inside = True, True
                    i = j
                    break

                if j < l:
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break

            # No location starts at this word, so it lies outside any chunk.
            if not loc:
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)

Code #3: Parsing a sentence with the LocationChunker class

from chunkers import sub_leaves
from chunkers import LocationChunker

# Create the chunker; this loads the gazetteers corpus.
loc = LocationChunker()

t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])

print("Location : \n", sub_leaves(t, 'LOCATION'))

Output:

Location : 
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], 
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
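
The sub_leaves helper is imported from the chunkers module and is not shown in this article. A minimal sketch of what such a helper might look like, assuming it simply collects the leaves of every subtree with the given label (this behaviour is an assumption, and the tree below is built by hand rather than produced by LocationChunker):

```python
from nltk.tree import Tree


def sub_leaves(tree, label):
    # Assumed behaviour: return the leaves of each subtree carrying `label`.
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]


# Hand-built tree standing in for the result of LocationChunker.parse().
t = Tree('S', [
    Tree('LOCATION', [('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]),
    ('is', 'BE'),
    ('warm', 'JJ'),
])
locations = sub_leaves(t, 'LOCATION')
print(locations)
```

Each LOCATION subtree contributes one inner list, which matches the shape of the output shown above.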