📜  Python | Stemming words with NLTK

📅  Last modified: 2022-05-13 01:55:22.077000             🧑  Author: Mango


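Prerequisite: Introduction to Stemming
Stemming is the process of reducing the morphological variants of a word to a common root/base form. Programs that do this are commonly called stemming algorithms or stemmers. Ideally, a stemming algorithm reduces the words "chocolates", "chocolatey", and "choco" to the root "chocolate", and "retrieval", "retrieve", and "retrieves" to the stem "retrieve". The stem a real stemmer produces need not be a dictionary word: the Porter stemmer, for example, maps "languages" to "languag".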

Some more examples of words that stem to the root word "like" include:

-> "likes"
-> "liked"
-> "likely"
-> "liking"
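As a quick check of the list above, here is a minimal sketch that runs the four words through NLTK's `PorterStemmer`, the stemmer used in the code later in this article. The variable names are illustrative only. With the Porter algorithm, all four variants should reduce to "like".

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# inflected/derived forms of "like" from the list above
variants = ["likes", "liked", "likely", "liking"]

# map each variant to the stem the Porter algorithm produces
stems = {w: ps.stem(w) for w in variants}

for w, s in stems.items():
    print(w, "->", s)
```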

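Errors in stemming:
There are two main kinds of stemming error: over-stemming and under-stemming. Over-stemming occurs when two words with different roots are reduced to the same stem (a false positive). Under-stemming occurs when two words that share the same root are reduced to different stems (a false negative).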
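Both kinds of error can be observed with a real stemmer. The sketch below uses NLTK's `PorterStemmer`. The word groups are commonly cited illustrations chosen for this example, not taken from the article, and `stems_of` is a hypothetical helper.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stems_of(words):
    """Return the set of distinct stems produced for the given words."""
    return {ps.stem(w) for w in words}

# Over-stemming: words with different meanings collapse into a single stem.
over = stems_of(["universe", "university", "universal"])
print("over-stemming:", over)

# Under-stemming: forms of the same word end up with different stems.
under = stems_of(["alumnus", "alumni"])
print("under-stemming:", under)
```

A set with one element for the first group signals over-stemming, and a set with more than one element for the second group signals under-stemming.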

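Applications of stemming:

  • Stemming is used in information retrieval systems such as search engines.
  • It is used to determine domain vocabularies in domain analysis.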
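To make the information retrieval use case concrete, here is a minimal sketch of a stem-based inverted index. The mini-corpus, the `search` helper, and the whitespace tokenization are all assumptions made for illustration. A real search engine would do far more (ranking, stop words, proper tokenization).

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# hypothetical mini-corpus: document id -> text
docs = {
    1: "Programmers program with programming languages",
    2: "She likes reading about search engines",
    3: "A program that runs every night",
}

# inverted index keyed by stem rather than by surface form
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():  # naive whitespace tokenization
        index[ps.stem(token)].add(doc_id)

def search(query):
    """Return the sorted ids of documents that share the query word's stem."""
    return sorted(index.get(ps.stem(query), set()))

print(search("programs"))  # matches documents containing "program"/"programming"
print(search("liked"))     # matches the document containing "likes"
```

Because both the documents and the query are stemmed, a query for one word form can match documents that contain only other forms of that word.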

Stemming is desirable because it reduces redundancy: most of the time, a stem and its inflected or derived forms carry the same meaning.
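The reduction in redundancy can be measured as a drop in vocabulary size. The following is a small sketch with a made-up token list. In practice the tokens would come from a tokenizer run over real text.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# hypothetical token list, standing in for tokenized text
tokens = ["like", "likes", "liked", "liking",
          "program", "programs", "programming"]

vocabulary = set(tokens)                           # distinct surface forms
stemmed_vocabulary = {ps.stem(t) for t in tokens}  # distinct stems

print(len(vocabulary), "word forms ->", len(stemmed_vocabulary), "stems")
```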

Below is an implementation of stemming using NLTK:

Code #1:

Python3
# import the Porter stemmer
from nltk.stem import PorterStemmer
  
ps = PorterStemmer()
 
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
 
for w in words:
    print(w, " : ", ps.stem(w))


Output:

program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm

Code #2: Stemming words from a sentence

Python3

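# importing modules
# word_tokenize needs NLTK's Punkt tokenizer data, downloaded once with
# nltk.download() (resource "punkt" or "punkt_tab", depending on the NLTK version)
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize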
  
ps = PorterStemmer()
  
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
  
for w in words:
    print(w, " : ", ps.stem(w))

Output:

Programmers  :  programm
program  :  program
with  :  with
programming  :  program
languages  :  languag