NLP | Location Tags Extraction
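A different kind of ChunkParserI subclass can be used to identify LOCATION chunks: one that looks words up in the gazetteers corpus to recognize location words. The gazetteers corpus is a WordListCorpusReader that contains the following location words:
- Country names
- U.S. states and abbreviations
- Mexican states
- Major U.S. cities
- Canadian provinces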
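The LocationChunker class iterates over a tagged sentence, looking for words that appear in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced in iob_locations(), and the parse() method converts those IOB tags into a Tree.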
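To make the role of the IOB tags concrete, here is a minimal pure-Python sketch of how B-, I- and O tags delimit chunks. The group_iob function is a hypothetical helper written for this illustration. It is a simplified stand-in, not NLTK's conlltags2tree, and it returns plain tuples rather than a Tree.

```python
def group_iob(iob_triples):
    """Group (word, tag, iob) triples into (label, [(word, tag), ...]) chunks.

    Tokens tagged 'O' are returned as plain (word, tag) pairs.
    """
    result = []
    current = None  # the chunk currently being built, if any
    for word, tag, iob in iob_triples:
        if iob == 'O':
            # outside any chunk: close the open chunk and keep the token as-is
            current = None
            result.append((word, tag))
        elif iob.startswith('B-') or current is None or current[0] != iob[2:]:
            # 'B-' opens a new chunk; so does an 'I-' tag with no matching open chunk
            current = (iob[2:], [(word, tag)])
            result.append(current)
        else:
            # 'I-' continues the chunk that is already open
            current[1].append((word, tag))
    return result


triples = [
    ('San', 'NNP', 'B-LOCATION'),
    ('Francisco', 'NNP', 'I-LOCATION'),
    ('is', 'BE', 'O'),
    ('cold', 'JJ', 'O'),
]
print(group_iob(triples))
```

The two LOCATION tokens end up in a single chunk, while the O tokens stay at the top level, which is the same shape the Tree returned by parse() has.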
Code #1: LocationChunker class
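from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        # gazetteers is a WordListCorpusReader of location words and phrases
        self.locations = set(gazetteers.words())
        self.lookahead = 0
        # find how many words beyond the first the longest location contains,
        # so iob_locations() knows how far to look ahead in a sentence
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords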
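The lookahead computed in the constructor can be sketched on its own, without downloading the corpus. The locations set below is a hypothetical stand-in for set(gazetteers.words()), not real corpus data, and extra_words is a name introduced only for this illustration.

```python
# Hypothetical stand-in for set(gazetteers.words()); not real corpus data.
locations = {'CA', 'San Jose', 'San Francisco', 'New York City'}

# The number of spaces in an entry equals the number of words after its first word.
extra_words = {loc: loc.count(' ') for loc in locations}

# Same result as the loop in __init__: the largest count, or 0 for an empty set.
lookahead = max(extra_words.values(), default=0)

print(extra_words['New York City'])
print(lookahead)
```

Counting spaces rather than words means that a single-word entry such as 'CA' contributes 0, so the value describes how many additional tokens may need to be read after the current one.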
Code #2: iob_locations() and parse() methods
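    # Both methods below belong to the LocationChunker class.
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False
        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            # + 1 so that the longest entries (lookahead + 1 words) are also
            # tested; each candidate is checked before the next word is appended
            k = j + self.lookahead + 1
            nextwords, nexttags = [], []
            loc = False
            # look ahead in the sentence to find multi-word locations
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    # adjacent locations are merged into a single chunk
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'
                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'
                    loc, inside = True, True
                    # move i past the end of the matched location
                    i = j
                    break
                if j < l:
                    # extend the candidate phrase by one more word
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break
            if not loc:
                # no location starts at this word
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        # convert the (word, tag, IOB) triples into a Tree
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)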
Code #3: Using the LocationChunker class to parse a sentence
# assumes LocationChunker and a sub_leaves helper are saved in a local chunkers.py module
from chunkers import sub_leaves
from chunkers import LocationChunker

# create the chunker before using it
loc = LocationChunker()
t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])
print("Location : \n", sub_leaves(t, 'LOCATION'))
Output :
Location :
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')],
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
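The sub_leaves helper imported from chunkers is not defined in this article. A minimal sketch of what such a helper might look like, assuming the NLTK 3 Tree API (subtrees(), label() and leaves()), is shown here; the actual helper in chunkers may differ. The tree is built by hand so that the example needs no corpus download.

```python
from nltk.tree import Tree


def sub_leaves(tree, label):
    """Return the leaves of every subtree of `tree` whose label equals `label`."""
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]


# Hand-built tree standing in for the result of LocationChunker().parse(...)
tree = Tree('S', [
    Tree('LOCATION', [('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]),
    ('is', 'BE'),
    ('cold', 'JJ'),
])
print(sub_leaves(tree, 'LOCATION'))
```

Each LOCATION subtree contributes one inner list of (word, tag) pairs, which matches the nested-list shape of the output above.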