Python – Filter Text Using Enchant
Enchant is a Python module used to check the spelling of words and suggest corrections for misspelled ones; in other words, it checks whether a word is present in a dictionary.
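As a quick illustration of that spell-checking API, here is a minimal sketch (it assumes the en_US dictionary is installed; the sample words are made up):

# import the module
import enchant

# load a dictionary (assumes en_US is installed)
d = enchant.Dict("en_US")

# check whether a word exists in the dictionary
print(d.check("Hello"))   # True
print(d.check("Helo"))    # False

# suggest corrections for a misspelled word
print(d.suggest("Helo"))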
Enchant also provides the enchant.tokenize module for tokenizing text. Tokenization means splitting the individual words out of a body of text, but sometimes not every word should be tokenized: when spell checking, for instance, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters.
The currently implemented filters are:
- EmailFilter
- URLFilter
- WikiWordFilter
Example 1: EmailFilter
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter
# the text to be tokenized
text = "The email is abc@gmail.com"
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]
Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]
Example 2: URLFilter
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter
# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org/"
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [URLFilter])
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('geeksforgeeks', 28), ('org', 42)]
Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]
Example 3: WikiWordFilter
A WikiWord is a word made up of two or more words with initial capitals, run together.
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter
# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]
Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]
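Several filters can also be passed to get_tokenizer together, in which case email addresses, URLs, and WikiWords are all skipped in a single pass. A minimal sketch (the sample sentence is made up for illustration):

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter, URLFilter, WikiWordFilter

# the text to be tokenized
text = "Mail abc@gmail.com or visit https://www.geeksforgeeks.org/ for a WikiWord"

# getting tokenizer class with all three filters
tokenizer_filter = get_tokenizer("en_US", [EmailFilter, URLFilter, WikiWordFilter])

# only ordinary words survive the filtering
print([words for words in tokenizer_filter(text)])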