Python – Tokenize text using Enchant

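Enchant is a Python module (distributed as the PyEnchant package) that checks the spelling of a word and suggests corrections for misspelled words. It does this by checking whether the word exists in a dictionary.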

Enchant also provides the enchant.tokenize module for tokenizing text. Tokenization means splitting a body of text into its individual words.

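Some terms that will be used frequently:

Corpus: a body of text (singular). Corpora is the plural form.
Lexicon: words and their meanings.
Token: each "entity" that is a part of whatever was split up according to a set of rules. For example, when a sentence is "tokenized" into words, each word is a token.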
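To make the three terms concrete, here is a minimal sketch in plain Python. The sample sentence, the tiny `lexicon` dictionary, and the whitespace split are illustrative assumptions only; they are not part of Enchant.

```python
# A corpus: a body of text (here, one short made-up sentence).
corpus = "the cat sat on the mat"

# Tokens: the pieces obtained by splitting the corpus according to a rule.
# The rule here is simply "split on whitespace".
tokens = corpus.split()

# A lexicon: words paired with their meanings (a tiny hypothetical one).
lexicon = {
    "cat": "a small domesticated feline",
    "mat": "a piece of floor covering",
}

# Tokens that also appear in the lexicon.
known = [token for token in tokens if token in lexicon]

print(tokens)
print(known)
```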

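We will use get_tokenizer() to tokenize the text. It takes a language code as input and returns the appropriate tokenization class. We then instantiate this class with some text, and it returns an iterator that yields the words contained in that text.
The items produced by the tokenizer are tuples of the form (WORD, POS), where WORD is the tokenized word and POS is the position in the string at which that word was found.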
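The shape of the (WORD, POS) tuples can be sketched without Enchant installed. The class below is a hypothetical stand-in built on a regular expression; the name `SimpleTokenizer` and its word pattern are assumptions made for illustration, and Enchant's real tokenizer for a given language may split text differently.

```python
import re


class SimpleTokenizer:
    """Hypothetical stand-in for a tokenization class (not Enchant's code).

    Like the class described above, it is instantiated with some text and
    then iterated over to obtain (WORD, POS) tuples.
    """

    # Assumed word rule: runs of ASCII letters, with optional inner apostrophes.
    _word = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

    def __init__(self, text):
        self._text = text

    def __iter__(self):
        for match in self._word.finditer(self._text):
            # match.start() is the character offset of the word in the text
            yield (match.group(), match.start())


tokens = list(SimpleTokenizer("Natural language processing (NLP) is a field"))
print(tokens)
```

Punctuation such as the parentheses around NLP is skipped, which is why the positions are not simply consecutive word lengths.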

# import the tokenizer factory
from enchant.tokenize import get_tokenizer

# the text to be tokenized
# (the pieces "natural language" and "generation ..." below are joined
# without a space, so they are tokenized as the single word
# "languagegeneration")
text = ("Natural language processing (NLP) is a field " +
       "of computer science, artificial intelligence " +
       "and computational linguistics concerned with " +
       "the interactions between computers and human " +
       "(natural) languages, and, in particular, " +
       "concerned with programming computers to " +
       "fruitfully process large natural language " +
       "corpora. Challenges in natural language " +
       "processing frequently involve natural " +
       "language understanding, natural language" +
       "generation frequently from formal, machine" +
       "-readable logical forms), connecting language " +
       "and machine perception, managing human-" +
       "computer dialog systems, or some combination " +
       "thereof.")

# get the tokenizer class for US English
tokenizer = get_tokenizer("en_US")

# collect the (WORD, POS) tuples yielded by the tokenizer
token_list = []
for token in tokenizer(text):
    token_list.append(token)

# print the words with their positions
print(token_list)

Output:

[('Natural', 0), ('language', 8), ('processing', 17), ('NLP', 29), ('is', 34), ('a', 37), ('field', 39), ('of', 45), ('computer', 48), ('science', 57), ('artificial', 66), ('intelligence', 77), ('and', 90), ('computational', 94), ('linguistics', 108), ('concerned', 120), ('with', 130), ('the', 135), ('interactions', 139), ('between', 152), ('computers', 160), ('and', 170), ('human', 174), ('natural', 181), ('languages', 190), ('and', 201), ('in', 206), ('particular', 209), ('concerned', 221), ('with', 231), ('programming', 236), ('computers', 248), ('to', 258), ('fruitfully', 261), ('process', 272), ('large', 280), ('natural', 286), ('language', 294), ('corpora', 303), ('Challenges', 312), ('in', 323), ('natural', 326), ('language', 334), ('processing', 343), ('frequently', 354), ('involve', 365), ('natural', 373), ('language', 381), ('understanding', 390), ('natural', 405), ('languagegeneration', 413), ('frequently', 432), ('from', 443), ('formal', 448), ('machine', 456), ('readable', 464), ('logical', 473), ('forms', 481), ('connecting', 489), ('language', 500), ('and', 509), ('machine', 513), ('perception', 521), ('managing', 533), ('human', 542), ('computer', 548), ('dialog', 557), ('systems', 564), ('or', 573), ('some', 576), ('combination', 581), ('thereof', 593)]
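Each POS value in the output above is a character offset into the original string, so slicing the text at that offset recovers the word. The short sketch below checks this for the first few tuples of the output, using only the first piece of the sample text; the helper name `word_at` is made up for this illustration.

```python
# First piece of the sample text and the first tuples from the output above.
text = "Natural language processing (NLP) is a field "
tokens = [("Natural", 0), ("language", 8), ("processing", 17),
          ("NLP", 29), ("is", 34), ("a", 37), ("field", 39)]


def word_at(text, token):
    """Return the slice of text that a (WORD, POS) tuple points at."""
    word, pos = token
    return text[pos:pos + len(word)]


# Slice the text at each position and compare with the reported word.
recovered = [word_at(text, token) for token in tokens]
all_match = all(word_at(text, token) == token[0] for token in tokens)

print(recovered)
print(all_match)
```

This property is what makes the positions useful, for example when a program needs to highlight or replace a word in the original text.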

To print only the words, without the positions:

# keep only the word from each (WORD, POS) tuple
word_list = [word for word, pos in token_list]

# print only the words
print(word_list)

Output:

['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'languagegeneration', 'frequently', 'from', 'formal', 'machine', 'readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human', 'computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']
