Python – Tokenize text using Enchant
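Enchant is a Python module for checking the spelling of words and suggesting corrections for misspelled ones. It works by checking whether a word exists in a dictionary.
Enchant also provides the enchant.tokenize module for tokenizing text. Tokenizing means splitting a body of text into its individual words.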
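Some terms that will be used frequently are:
- Corpus: a body of text (singular). "Corpora" is the plural form.
- Lexicon: words and their meanings.
- Token: each "entity" that is a part of whatever was split up according to a set of rules. For example, when a sentence is "tokenized" into words, each word is a token.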
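We will use get_tokenizer() to tokenize the text. It takes a language code as input and returns the appropriate tokenization class. We then instantiate this class with some text, and it returns an iterator that yields the words contained in that text.
The items produced by the tokenizer are tuples of the form (WORD, POS), where WORD is the tokenized word and POS is the position in the string at which that word starts (a zero-based character offset, not a part-of-speech tag).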
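To make the (WORD, POS) format concrete, here is a minimal sketch that produces the same kind of tuples using only the standard library. It is not Enchant's implementation: the function name simple_tokenize and the rule that a word is a run of ASCII letters (optionally joined by apostrophes) are assumptions made for illustration.

```python
import re


def simple_tokenize(text):
    """Yield (word, pos) tuples; pos is the index of the word's first character."""
    # Assumption: a "word" is a run of ASCII letters, optionally joined by apostrophes.
    for match in re.finditer(r"[A-Za-z]+(?:'[A-Za-z]+)*", text):
        yield match.group(), match.start()


sample = "Natural language processing (NLP) is a field"
tokens = list(simple_tokenize(sample))
print(tokens)
# [('Natural', 0), ('language', 8), ('processing', 17), ('NLP', 29), ('is', 34), ('a', 37), ('field', 39)]
```

Each position can be used to slice the word back out of the original string, for example sample[29:32] is "NLP". Enchant's own tokenizer may split some text differently, since its rules depend on the language code passed to get_tokenizer().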
# import the module
from enchant.tokenize import get_tokenizer

# the text to be tokenized
# (as written, there is no space between "language" and "generation",
# so the tokenizer reports them as the single token "languagegeneration")
text = ("Natural language processing (NLP) is a field " +
        "of computer science, artificial intelligence " +
        "and computational linguistics concerned with " +
        "the interactions between computers and human " +
        "(natural) languages, and, in particular, " +
        "concerned with programming computers to " +
        "fruitfully process large natural language " +
        "corpora. Challenges in natural language " +
        "processing frequently involve natural " +
        "language understanding, natural language" +
        "generation frequently from formal, machine" +
        "-readable logical forms), connecting language " +
        "and machine perception, managing human-" +
        "computer dialog systems, or some combination " +
        "thereof.")

# get the tokenizer class for US English
tokenizer = get_tokenizer("en_US")

# collect the (WORD, POS) tuples yielded by the tokenizer
token_list = []
for token in tokenizer(text):
    token_list.append(token)

# print the words with their positions
print(token_list)
Output:
[('Natural', 0), ('language', 8), ('processing', 17), ('NLP', 29), ('is', 34), ('a', 37), ('field', 39), ('of', 45), ('computer', 48), ('science', 57), ('artificial', 66), ('intelligence', 77), ('and', 90), ('computational', 94), ('linguistics', 108), ('concerned', 120), ('with', 130), ('the', 135), ('interactions', 139), ('between', 152), ('computers', 160), ('and', 170), ('human', 174), ('natural', 181), ('languages', 190), ('and', 201), ('in', 206), ('particular', 209), ('concerned', 221), ('with', 231), ('programming', 236), ('computers', 248), ('to', 258), ('fruitfully', 261), ('process', 272), ('large', 280), ('natural', 286), ('language', 294), ('corpora', 303), ('Challenges', 312), ('in', 323), ('natural', 326), ('language', 334), ('processing', 343), ('frequently', 354), ('involve', 365), ('natural', 373), ('language', 381), ('understanding', 390), ('natural', 405), ('languagegeneration', 413), ('frequently', 432), ('from', 443), ('formal', 448), ('machine', 456), ('readable', 464), ('logical', 473), ('forms', 481), ('connecting', 489), ('language', 500), ('and', 509), ('machine', 513), ('perception', 521), ('managing', 533), ('human', 542), ('computer', 548), ('dialog', 557), ('systems', 564), ('or', 573), ('some', 576), ('combination', 581), ('thereof', 593)]
To print only the words, without their positions:
# print only the words
word_list = []
for token in token_list:
    # each token is a (WORD, POS) tuple; keep only the word
    word_list.append(token[0])
print(word_list)
Output:
['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'languagegeneration', 'frequently', 'from', 'formal', 'machine', 'readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human', 'computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']
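As a small follow-up, the sketch below shows one way to separate words from positions in a single step, and to confirm that each position really points at its word in the source text. It runs without Enchant: the token list is copied by hand from the first seven tuples of the output above, and the helper name word_at is hypothetical.

```python
# The start of the article's text and its first seven (WORD, POS) tuples,
# copied by hand so the example runs without Enchant installed.
text = "Natural language processing (NLP) is a field"
token_list = [("Natural", 0), ("language", 8), ("processing", 17),
              ("NLP", 29), ("is", 34), ("a", 37), ("field", 39)]

# zip(*...) splits the tuples into one sequence of words and one of positions.
words, positions = zip(*token_list)
word_list = list(words)


def word_at(text, token):
    """Slice the word back out of the text using its recorded position."""
    word, pos = token
    return text[pos:pos + len(word)]


recovered = [word_at(text, token) for token in token_list]
print(word_list)
print(recovered == word_list)
```

Keeping the positions around is useful whenever a word has to be located in the original string again, for instance to highlight or replace it in place.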