📜  Python – Filtering Text with Enchant

📅  Last modified: 2022-05-13 01:54:31.676000             🧑  Author: Mango


Enchant is a Python module for checking the spelling of words and suggesting corrections for misspelled ones. It looks words up in a dictionary to determine whether they exist.
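
As a quick illustration, the enchant.Dict class provides check() and suggest(). This is a minimal sketch; it assumes the "en_US" dictionary is installed on your system, and the sample words are illustrative.

# import the enchant module
import enchant

# load the US English dictionary (assumes "en_US" is installed)
d = enchant.Dict("en_US")

# check() returns True if the word exists in the dictionary
print(d.check("Hello"))   # True
print(d.check("Helo"))    # False

# suggest() returns a list of likely corrections for a misspelled word
print(d.suggest("Helo"))  # e.g. includes 'Hello'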

Enchant also provides the enchant.tokenize module for tokenizing text. Tokenization means splitting a body of text into individual words. Sometimes, however, not every word should be tokenized: when spell checking, for example, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters.
The currently implemented filters are listed below (a combined sketch follows the list):

  • EmailFilter
  • URLFilter
  • WikiWordFilter
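
Several filters can also be applied at once by passing more than one of them to get_tokenizer. A minimal sketch, assuming the "en_US" tokenizer is available; the sample sentence is illustrative:

# import the tokenizer and all three built-in filters
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter, URLFilter, WikiWordFilter

# build a tokenizer that skips email addresses, URLs and WikiWords
tokenizer = get_tokenizer("en_US", [EmailFilter, URLFilter, WikiWordFilter])

# each token is a (word, offset) tuple
text = "Mail abc@gmail.com or visit https://www.geeksforgeeks.org/ for GeeksForGeeks"
print([word for (word, pos) in tokenizer(text)])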

Example 1: EmailFilter

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter
  
# the text to be tokenized
text = "The email is abc@gmail.com"
  
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
  
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
  
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])
  
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)

Output:

Example 2: URLFilter

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter
  
# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org/"
  
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
  
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
  
  
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [URLFilter])
  
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)

Output:

Example 3: WikiWordFilter
A WikiWord is a word made up of two or more words with initial capital letters run together, such as VersionFiveDotThree.

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter
  
# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"
  
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
  
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
  
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])
  
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)

Output:
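
Beyond tokenization, the same filters can be passed to enchant.checker.SpellChecker so that email addresses, URLs and WikiWords are skipped during spell checking. A minimal sketch; the sample sentence and the deliberate misspelling "Mial" are illustrative:

# import SpellChecker and the filters
from enchant.checker import SpellChecker
from enchant.tokenize import EmailFilter, URLFilter, WikiWordFilter

# build a checker that ignores emails, URLs and WikiWords
chkr = SpellChecker("en_US", filters=[EmailFilter, URLFilter, WikiWordFilter])

# feed the text; iterating yields only the genuine misspellings
chkr.set_text("Mial abc@gmail.com about GeeksForGeeks at https://www.geeksforgeeks.org/")
for err in chkr:
    print("Misspelled word:", err.word)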