命名实体识别

命名实体识别（NER）是最重要的数据预处理任务之一。它涉及识别文本中的关键信息并将其分类为一组预定义的类别。实体基本上是在文本中一直谈论或引用的事物。

NER是NLP的形式。

从本质上讲，NLP 只是一个两步过程，以下是所涉及的两个步骤：

从文本中检测实体
将它们分为不同的类别

一些类别是 NER 中最重要的架构，例如：

人
组织
地点/地点

其他常见任务包括对以下内容进行分类：

约会时间。
表达
数字测量（金钱、百分比、重量等）
电子邮件地址

NE 中的歧义

对于一个人来说，类别的定义在直觉上是相当清晰的，但是对于计算机来说，在分类上就有些模糊了。让我们看一些模棱两可的例子：
- 英格兰（组织）赢得 2019 年世界杯 vs 2019 年世界杯发生在英格兰（地点）。
- 华盛顿（位置）是美国的首都 vs 美国的第一任总统是华盛顿（人） 。

NER的方法

一种方法是使用不同的机器学习算法训练多类分类模型，但它需要大量标记。除了标注模型，还需要对上下文有深刻的理解，以处理句子的歧义。这使其成为简单机器学习的一项具有挑战性的任务 /
另一种方法是由 NLP Speech Tagger 和 NLTK 实现的条件随机字段。它是一种概率模型，可用于对单词等顺序数据进行建模。 CRF 可以捕捉对句子上下文的深刻理解。在这个模型中，输入 ${\text{X}} = \left \{ \vec{x}_{1} ,\vec{x}_{2} ,\vec{x}_{3}, \ldots,\vec{x}_{T} \right \}$

${\text{p}}\left( {{\text{y|}}{\mathbf{x}}} \right) = \frac{1}{{z\left( \vec{x} \right)}}\mathop \prod \limits_{t = 1}^{T} { \exp }\left\{ {\mathop \sum \limits_{k = 1}^{k} \omega_{k} f_{k} \left( {y_{t} ,y_{t - 1} ,\vec{x}_{t} } \right)} \right\}$

基于深度学习的 NER：深度学习 NER 比以前的方法准确得多，因为它能够组合单词。这是因为它使用了一种叫做词嵌入的方法，能够理解各种词之间的语义和句法关系。它还能够自动学习分析特定主题以及高级单词。这使得深度学习 NER 适用于执行多项任务。深度学习本身可以完成大部分重复性工作，因此研究人员可以更有效地利用他们的时间。

执行

在这个实现中，我们将使用两个不同的框架执行命名实体识别：Spacy 和 NLTK。此代码可以在 colab 上运行，但用于可视化目的。我推荐当地的环境。我们可以使用 pip install 安装以下框架
首先，我们使用 Spacy 执行命名实体识别。

Python3

# command to run before code
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
  
# imports and load spacy english language package
import spacy
from spacy import displacy
from spacy import tokenizer
nlp = spacy.load('en_core_web_sm')
  
#Load the text and process it
# I copied the text from python wiki
text =("Python is an interpreted, high-level and general-purpose programming language 
       "Pythons design philosophy emphasizes code readability with" 
       "its notable use of significant indentation."
       "Its language constructs and object-oriented approachaim to"
       "help programmers write clear and"
       "logical code for small and large-scale projects")
# text2 = # copy the paragrphs from  https://www.python.org/doc/essays/ 
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)
# tokenization
for token in doc:
    print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use displaycy function on doc2
displacy.render(doc, style='ent', jupyter=True)

Python3

# import modules and download packages
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
  
# process the text and print Named entities
# tokenization
train_text = state_union.raw()
  
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
# function 
def get_named _entity():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=False)
            namedEnt.draw()
    except:
        pass
get_named_entity()

[Python is an interpreted, high-level and general-purpose programming language.,
 Pythons design philosophy emphasizes code readability with its notable use of significant indentation.,
 Its language constructs and object-oriented approachaim to help programmers write clear, logical code for small and large-scale projects]
 # tokens
 Python
is
an
interpreted
,
high
-
level
and
general
-
purpose
programming
language
.
Pythons
design
philosophy
emphasizes
code
readability
with
its
notable
use
of
significant
indentation
.
Its
language
constructs
and
object
-
oriented
approachaim
to
help
programmers
write
clear
,
logical
code
for
small
and
large
-
scale
projects
# named entity
[('Python', 0, 6, 'ORG')]

#here ORG stands for Organization

doc2 上的 Spacy 实体标签

以下是 spacy 实体标签的列表及其含义：

Spacy 命名实体识别标签

现在我们在 NLTK 上执行命名实体识别任务。

蟒蛇3

# import modules and download packages
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
  
# process the text and print Named entities
# tokenization
train_text = state_union.raw()
  
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
# function 
def get_named _entity():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=False)
            namedEnt.draw()
    except:
        pass
get_named_entity()

NER的一个句子示例