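NLP | Chunk Tree to Text and Chaining Chunk Transformation
We can convert a tree or subtree back into a sentence or chunk string. To see how, the code below uses the first tree of the treebank_chunk corpus.
Code #1: Joining the words in a tree with spaces.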
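# Load the chunked Penn Treebank sample corpus
from nltk.corpus import treebank_chunk
# First chunked sentence of the corpus, as a tree
tree = treebank_chunk.chunked_sents()[0]
print("Tree : \n", tree)
# leaves() returns the (word, tag) pairs in sentence order
print("\nTree leaves : \n", tree.leaves())
# Keep only the words and join them with spaces
print("\nSentence from tree : \n", ' '.join(
    [w for w, t in tree.leaves()]))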
Output :
Tree :
(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)
Tree leaves :
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Sentence from tree :
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
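As the output above shows, the punctuation is not right: the period and the commas are treated as ordinary words, so they get surrounding spaces too. The code below fixes this with a regular expression substitution.
Code #2: chunk_tree_to_sent() function to improve on Code #1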
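import re
# Match a whitespace character followed by one of , . ; ?
punct_re = re.compile(r'\s([,\.;\?])')
def chunk_tree_to_sent(tree, concat=' '):
    # Join the words of the tree's (word, tag) leaves
    s = concat.join([w for w, t in tree.leaves()])
    # Keep the punctuation mark, drop the whitespace before it
    return re.sub(punct_re, r'\g<1>', s)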
Code #3: Evaluating chunk_tree_to_sent()
# Load the corpus and the function from Code #2
# (transforms is assumed to be a local module that contains it)
from nltk.corpus import treebank_chunk
from transforms import chunk_tree_to_sent
# First chunked sentence of the corpus
tree = treebank_chunk.chunked_sents()[0]
print("Tree : \n", tree)
print("\nTree leaves : \n", tree.leaves())
print("Tree to sentence : ", chunk_tree_to_sent(tree))
Output :
Tree :
(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)
Tree leaves :
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tree to sentence :  Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
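The effect of the substitution can also be checked without loading the corpus. The following is a minimal sketch that applies the pattern from Code #2 to a short, hand-written list of (word, tag) pairs standing in for tree.leaves(). The helper name leaves_to_sent and the shortened sentence are assumptions made for illustration only.

```python
import re

# Same pattern as in Code #2: whitespace followed by , . ; or ?
punct_re = re.compile(r'\s([,\.;\?])')

# Hand-written (word, tag) pairs standing in for tree.leaves()
leaves = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','),
          ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','),
          ('will', 'MD'), ('join', 'VB'), ('the', 'DT'),
          ('board', 'NN'), ('.', '.')]


def leaves_to_sent(leaves, concat=' '):
    """Hypothetical helper: like chunk_tree_to_sent, but takes the leaves directly."""
    s = concat.join(w for w, t in leaves)
    # Replace "whitespace + punctuation" with the punctuation alone
    return punct_re.sub(r'\g<1>', s)


naive = ' '.join(w for w, t in leaves)   # plain join, as in Code #1
fixed = leaves_to_sent(leaves)           # join plus regex cleanup

print(naive)
print(fixed)
```

The pattern only covers the four marks listed in its character class, so other punctuation such as ! or : would keep its leading space unless the class is extended.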
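Chaining Chunk Transformation
Transformation functions can be chained together to normalize chunks. The resulting chunks are often shorter while keeping the same meaning.
In the code below, a single chunk and an optional list of transformation functions are passed to the function. It calls each transformation function on the chunk in turn and returns the final chunk.
Code #4: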
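# The four default transforms are not defined in this section;
# they must already be in scope where transform_chunk is defined.
def transform_chunk(
        chunk, chain=[filter_insignificant,
                      swap_verb_phrase, swap_infinitive_phrase,
                      singularize_plural_noun], trace=0):
    for f in chain:
        # Feed the result of each transform into the next one
        chunk = f(chunk)
        if trace:
            # Show the intermediate chunk after each step
            print(f.__name__, ':', chunk)
    return chunk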
Code #5: Evaluating transform_chunk
from transforms import transform_chunk
chunk = [('the', 'DT'), ('book', 'NN'), ('of', 'IN'),
         ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]
print("Chunk : \n", chunk)
print("\nTransformed Chunk : \n", transform_chunk(chunk))
Output :
Chunk :
[('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'),
('is', 'VBZ'), ('delicious', 'JJ')]
Transformed Chunk :
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
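The chaining mechanism itself can be tried without the transforms module. The sketch below uses two simplified stand-in transforms, drop_determiners and singularize_simple, which are hypothetical and are not the functions used in Code #4, so the result differs from the output above. It also passes trace=1 to print the chunk after each step, and uses a None default for chain as one possible way to avoid a list as a default argument value.

```python
def drop_determiners(chunk):
    """Hypothetical transform: remove words tagged DT."""
    return [(w, t) for w, t in chunk if t != 'DT']


def singularize_simple(chunk):
    """Hypothetical transform: strip a final 's' from NNS words and retag them NN."""
    out = []
    for w, t in chunk:
        if t == 'NNS' and w.endswith('s'):
            out.append((w[:-1], 'NN'))
        else:
            out.append((w, t))
    return out


def transform_chunk(chunk, chain=None, trace=0):
    # Fall back to the stand-in transforms when no chain is given
    if chain is None:
        chain = [drop_determiners, singularize_simple]
    for f in chain:
        # The output of one transform is the input of the next
        chunk = f(chunk)
        if trace:
            print(f.__name__, ':', chunk)
    return chunk


chunk = [('the', 'DT'), ('book', 'NN'), ('of', 'IN'),
         ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]

# trace=1 prints the intermediate chunk after each transform
result = transform_chunk(chunk, trace=1)
print(result)
```

Because each stand-in returns a new list, the original chunk is left unchanged, and the order of the functions in chain determines the order in which they are applied.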