📅  Last modified: 2020-04-27 14:04:33             🧑  Author: Mango
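Stemming is the process of reducing the morphological variants of a word to a common root or base form. Programs that do this are usually called stemming algorithms or stemmers. For example, a stemming algorithm reduces the words "chocolates", "chocolatey", and "chocolate" to a root such as "choco", and "retrieval", "retrieved", and "retrieves" to the stem "retrieve". The exact stem depends on the algorithm, and it does not have to be a dictionary word.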
More examples of words that reduce to the root "like":
-> "likes"
-> "liked"
-> "likely"
-> "liking"
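To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer in plain Python. It is an illustration only and not the Porter algorithm: the rule list `SUFFIXES`, the function name `toy_stem`, and the `min_stem_len` parameter are all assumptions made up for this example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s", "e")  # hypothetical rule list


def toy_stem(word, min_stem_len=3):
    """Strip known suffixes repeatedly while at least min_stem_len letters remain."""
    word = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                stripped = True
                break  # restart from the first rule on the shortened word
    return word


for w in ["like", "likes", "liked", "likely", "liking"]:
    print(w, ":", toy_stem(w))
```

With these rules all five words collapse to the same stem, "lik". That is enough for matching purposes: a stem only has to be shared by the related forms, it does not have to be a real word.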
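Errors in stemming: There are two main kinds of stemming error, overstemming and understemming. Overstemming occurs when two words with different roots are reduced to the same stem. Understemming occurs when two words with the same root are reduced to different stems. For example, the Porter stemmer maps "universal", "university", and "universe" all to "univers" (overstemming), while it leaves "datum" and "data" with different stems (understemming).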
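The two error types can be checked mechanically once a human has labelled which word pairs share a root. The sketch below is one possible way to do that, under stated assumptions: `crude_stem` is a deliberately crude stand-in that keeps only the first four letters, and `classify` and the hand-labelled `pairs` are hypothetical names and data chosen for illustration.

```python
def crude_stem(word, keep=4):
    """Crude stand-in stemmer (assumption): keep only the first `keep` letters."""
    return word.lower()[:keep]


def classify(word_a, word_b, same_root, stem=crude_stem):
    """Compare the two stems against a human judgement of whether the words share a root."""
    same_stem = stem(word_a) == stem(word_b)
    if same_stem and not same_root:
        return "overstemming"   # unrelated words were merged
    if not same_stem and same_root:
        return "understemming"  # related words were kept apart
    return "ok"


# (word_a, word_b, do they share a root?) -- hand-labelled illustration data
pairs = [
    ("wand", "wander", False),
    ("datum", "data", True),
    ("program", "programs", True),
]
for a, b, same_root in pairs:
    print(a, b, "->", classify(a, b, same_root))
```

To evaluate a real stemmer instead of the stand-in, pass its stemming function as the `stem` argument, for example `PorterStemmer().stem` from NLTK.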
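Applications of stemming:
Stemming is widely used in information retrieval, for example in search engines, so that a query matches the different inflected forms of a word. It is desirable because it can reduce redundancy: in most cases a stem and its inflected or derived forms carry the same meaning.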
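The reduction in redundancy can be seen by counting distinct tokens before and after stemming. The following is a small sketch, assuming a fixed lookup table `STEMS` whose entries are copied from the Porter output shown for the sentence example further down in this article; the table is only a stand-in so that the snippet runs without NLTK.

```python
from collections import Counter

# Stems copied from this article's sentence example output, used as a fixed
# lookup table so the sketch runs without NLTK (an assumption for brevity).
STEMS = {
    "programers": "program",
    "program": "program",
    "with": "with",
    "programing": "program",
    "languages": "languag",
}

tokens = "Programers program with programing languages".lower().split()
token_counts = Counter(tokens)                    # vocabulary before stemming
stem_counts = Counter(STEMS[t] for t in tokens)   # vocabulary after stemming

print(len(token_counts), "distinct tokens ->", len(stem_counts), "distinct stems")
print(stem_counts.most_common(1))
```

Five distinct tokens shrink to three distinct stems, and the three "program" variants are counted together, which is the effect an index or a word-frequency model benefits from.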
Below are implementations of stemming using NLTK:
Code 1:
# Import the Porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()
# Some words to stem
words = ["program", "programs", "programer", "programing", "programers"]
for w in words:
    # Print each word next to its stem
    print(w, ":", ps.stem(w))
Output:
program : program
programs : program
programer : program
programing : program
programers : program
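Code 2: Stemming the words in a sentence
# Import the stemmer and the tokenizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing,
# download it once with nltk.download("punkt") (newer NLTK releases ask for "punkt_tab")
ps = PorterStemmer()
sentence = "Programers program with programing languages"
# Split the sentence into word tokens
words = word_tokenize(sentence)
for w in words:
    # The stemmer lowercases each token before stemming it
    print(w, ":", ps.stem(w))
Output: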
Programers : program
program : program
with : with
programing : program
languages : languag