Python – Filter Text Using Enchant
Enchant is a Python module used to check the spelling of words and suggest corrections for misspelled ones; in other words, it checks whether a word is present in a dictionary.
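As a quick illustration of that spell-checking API, here is a minimal sketch (it assumes the en_US dictionary is installed; the sample words are made up):

# import the module
import enchant

# load a dictionary (assumes en_US is installed)
d = enchant.Dict("en_US")

# check whether a word exists in the dictionary
print(d.check("Hello"))   # True
print(d.check("Helo"))    # False

# suggest corrections for a misspelled word
print(d.suggest("Helo"))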
Enchant also provides the enchant.tokenize module for tokenizing text. Tokenization means splitting the individual words out of a body of text, but sometimes not every word should be tokenized: when spell checking, for instance, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters.
The currently implemented filters are:
- EmailFilter
- URLFilter
- WikiWordFilter
Example 1: EmailFilter
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter
# the text to be tokenized
text = "The email is abc@gmail.com"
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]
Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]
Example 2: URLFilter
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter
# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org/"
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [URLFilter])
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('geeksforgeeks', 28), ('org', 42)]
Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]
Example 3: WikiWordFilter
A WikiWord is a word made up of two or more words with initial capitals, run together.
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter
# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)
# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])
# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]
Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]
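Several filters can also be passed to get_tokenizer together, in which case email addresses, URLs, and WikiWords are all skipped in a single pass. A minimal sketch (the sample sentence is made up for illustration):

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter, URLFilter, WikiWordFilter

# the text to be tokenized
text = "Mail abc@gmail.com or visit https://www.geeksforgeeks.org/ for a WikiWord"

# getting tokenizer class with all three filters
tokenizer_filter = get_tokenizer("en_US", [EmailFilter, URLFilter, WikiWordFilter])

# only ordinary words survive the filtering
print([words for words in tokenizer_filter(text)])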