NLP | Extracting Named Entities

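Named entity recognition is a specific kind of chunk extraction that uses entity tags along with, or in place of, chunk tags.
Common entity tags include PERSON, LOCATION and ORGANIZATION. A POS-tagged sentence is parsed into a chunk tree as in normal chunking, but the tree labels can be entity tags instead of chunk phrase tags. NLTK ships with a pre-trained named entity chunker, available through the ne_chunk() function in the nltk.chunk module. This function chunks a single sentence into a Tree.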

Code #1: Using ne_chunk() on a tagged sentence from the treebank_chunk corpus

from nltk.corpus import treebank_chunk
from nltk.chunk import ne_chunk
  
ne_chunk(treebank_chunk.tagged_sents()[0])

Output:

Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), Tree('ORGANIZATION', 
[('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), 
('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'),
('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), 
('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])

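Two entity tags are found: PERSON and ORGANIZATION. Each of these subtrees contains a list of the words recognized as a PERSON or an ORGANIZATION.

Code #2: A function that extracts named entities by collecting the leaves of all matching subtrees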

def sub_leaves(tree, label):
    # Collect the leaves of every subtree whose label equals `label`.
    return [t.leaves() 
            for t in tree.subtrees(
                    lambda s: s.label() == label)]
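
The function can be tried without loading the pre-trained chunker. The following is a minimal sketch that assumes NLTK 3 is installed; the `sample` tree is hand-built, hypothetical data that mirrors part of the output of Code #1, and `sub_leaves()` is repeated only so the snippet runs on its own.

```python
from nltk.tree import Tree


def sub_leaves(tree, label):
    """Return the leaves of every subtree whose label equals `label`."""
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]


# Hand-built tree (illustrative data, not produced by ne_chunk here).
sample = Tree('S', [
    Tree('PERSON', [('Pierre', 'NNP')]),
    Tree('ORGANIZATION', [('Vinken', 'NNP')]),
    (',', ','),
    ('61', 'CD'),
])

print(sub_leaves(sample, 'PERSON'))
print(sub_leaves(sample, 'ORGANIZATION'))
print(sub_leaves(sample, 'LOCATION'))  # no LOCATION subtree, so an empty list
```

Because subtrees() also visits the root, the filter compares labels rather than positions, so the top-level 'S' node is skipped simply because its label does not match.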

Code #3: Using the function to get all PERSON or ORGANIZATION leaves from a tree

from nltk.chunk import ne_chunk
from nltk.corpus import treebank_chunk
# Assumes sub_leaves() from Code #2 is saved in a local chunkers.py module
from chunkers import sub_leaves

tree = ne_chunk(treebank_chunk.tagged_sents()[0])

print("Named entities of PERSON :",
      sub_leaves(tree, 'PERSON'))

print("\nNamed entities of ORGANIZATION :",
      sub_leaves(tree, 'ORGANIZATION'))

Output:

Named entities of PERSON : [[('Pierre', 'NNP')]]

Named entities of ORGANIZATION : [[('Vinken', 'NNP')]]
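
Each result is a list of entities, and each entity is a list of (word, tag) pairs. If only the entity text is needed, the tags can be dropped and the words joined. The helper below is a hypothetical addition, not part of NLTK, and the sample leaves are illustrative data in the same shape that sub_leaves() returns.

```python
def entity_names(leaf_lists):
    """Join the words of each entity into one string, dropping the POS tags."""
    return [" ".join(word for word, _tag in leaves) for leaves in leaf_lists]


# Illustrative input in the shape returned by sub_leaves().
person_leaves = [[('Pierre', 'NNP')]]
org_leaves = [[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]]

print(entity_names(person_leaves))
print(entity_names(org_leaves))
```

Joining with a single space is a simplification; it works for plain tokens but does not restore original spacing around punctuation.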

To process multiple sentences at once, use ne_chunk_sents(). The code below processes the first 10 sentences from treebank_chunk.tagged_sents() and collects the ORGANIZATION sub_leaves() of each.

Code #4: Using ne_chunk_sents()

from nltk.chunk import ne_chunk_sents
from nltk.corpus import treebank_chunk
  
trees = ne_chunk_sents(treebank_chunk.tagged_sents()[:10])
[sub_leaves(t, 'ORGANIZATION') for t in trees]

Output:

[[[('Vinken', 'NNP')]], [[('Elsevier', 'NNP')]], [[('Consolidated', 'NNP'), 
('Gold', 'NNP'), ('Fields', 'NNP')]], [], [], [[('Inc.', 'NNP')], 
[('Micronite', 'NN')]], [[('New', 'NNP'), ('England', 'NNP'),
('Journal', 'NNP')]], [[('Lorillard', 'NNP')]], [], []]
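
The per-sentence result is nested three levels deep: sentences, then entities, then (word, tag) pairs. One way to summarize it is to flatten it into a count of entity strings. This is a sketch only: `count_entities()` is a hypothetical helper, and `results` is the output above copied in as literal data so the snippet runs without NLTK.

```python
from collections import Counter

# The output shown above, copied in as literal data.
results = [
    [[('Vinken', 'NNP')]],
    [[('Elsevier', 'NNP')]],
    [[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]],
    [], [],
    [[('Inc.', 'NNP')], [('Micronite', 'NN')]],
    [[('New', 'NNP'), ('England', 'NNP'), ('Journal', 'NNP')]],
    [[('Lorillard', 'NNP')]],
    [], [],
]


def count_entities(per_sentence):
    """Count how often each entity string occurs across all sentences."""
    counts = Counter()
    for sentence_entities in per_sentence:
        for leaves in sentence_entities:
            counts[" ".join(word for word, _tag in leaves)] += 1
    return counts


counts = count_entities(results)
print(counts.most_common())
```

Sentences with no ORGANIZATION contribute an empty list and are skipped naturally by the inner loop.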