NLP | Chunking and Chinking with RegEx

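Chunk extraction, or partial parsing, is the process of extracting meaningful short phrases from a sentence whose words have been tagged with their parts of speech.
Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. One can also define patterns for words that must not be part of a chunk; such words are known as chinks. The ChunkRule class specifies which words or patterns to include in a chunk, and the ChinkRule class specifies which to exclude.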

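Defining chunk patterns:
Chunk patterns are ordinary regular expressions, modified so that they match sequences of part-of-speech tags. Angle brackets specify a single tag; for example, <NN> matches a noun tag. Multiple tags are written the same way, one after another, as in <DT><NN>.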

Code #1: Converting a chunk pattern to a RegEx pattern.

Python3

# Loading library
from nltk.chunk.regexp import tag_pattern2re_pattern

# Chunk pattern to RegEx pattern
print("Chunk Pattern : ", tag_pattern2re_pattern('<DT>?<NN.*>+'))


Output:

Chunk Pattern :  (<(DT)>)?(<(NN[^\{\}]*)>)+
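Because the converted pattern is an ordinary regular expression, it can be tried out with Python's re module against a string of angle-bracketed tags. The sketch below assumes a hypothetical helper, matches_tags, and an illustrative pattern, <DT>?<NN.*>+. NLTK applies such patterns internally to its own chunk string, not through a helper like this one.

```python
import re
from nltk.chunk.regexp import tag_pattern2re_pattern

def matches_tags(tag_pattern, tags):
    """Hypothetical helper: does tag_pattern match the whole tag sequence?"""
    regex = tag_pattern2re_pattern(tag_pattern)
    # Encode the tags in the same angle-bracket form the pattern expects.
    tag_string = ''.join('<{}>'.format(tag) for tag in tags)
    return re.fullmatch(regex, tag_string) is not None

pattern = '<DT>?<NN.*>+'  # optional determiner, then one or more noun tags
print(matches_tags(pattern, ['DT', 'NN']))
print(matches_tags(pattern, ['NN', 'NNS']))
print(matches_tags(pattern, ['DT', 'VBZ']))
```

The first two sequences should match and the third should not, since a verb tag cannot stand in for the required noun tag.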

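Curly braces specify a chunk, as in {<DT><NN>}; to specify a chink pattern, flip the braces, as in }<VB>{. For a particular phrase type, these rules (chunk and chink patterns) can be combined into a grammar.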

Code #2: Parsing a sentence with RegexpParser.

Python3

from nltk.chunk import RegexpParser

# Define the grammar: chunk every tag, then chink verbs back out
chunker = RegexpParser(r'''
NP:
{<.*>+}
}<VB.*>{
''')

tree = chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
                      ('many', 'JJ'), ('chapters', 'NNS')])
print(repr(tree))

Output:

Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), 
Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
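Once a sentence has been parsed, the chunks can be read back out of the tree as plain phrases. The following is a minimal sketch of that step, assuming a grammar that chunks everything and then chinks verbs; the names grammar, tagged and noun_phrases are illustrative.

```python
from nltk.chunk import RegexpParser

# Illustrative grammar: chunk every tag, then chink verbs back out.
grammar = r'''
NP:
{<.*>+}
}<VB.*>{
'''
chunker = RegexpParser(grammar)

tagged = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
          ('many', 'JJ'), ('chapters', 'NNS')]
tree = chunker.parse(tagged)

# Join the words of every NP subtree into a plain string.
noun_phrases = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
```

For this sentence the list should contain the two phrases on either side of the verb, 'the book' and 'many chapters'.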