Python | Measure the similarity between two sentences using cosine similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.
Similarity = (A · B) / (||A|| * ||B||), where A and B are vectors.
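As a quick numeric sketch of the formula in plain Python (the vectors A and B below are made up purely for illustration):
# Hypothetical vectors, chosen only to illustrate the formula
A = [1, 0, 1, 1]
B = [1, 1, 0, 0]
dot = sum(a * b for a, b in zip(A, B))   # A · B = 1
norm_A = sum(a * a for a in A) ** 0.5    # ||A|| = sqrt(3)
norm_B = sum(b * b for b in B) ** 0.5    # ||B|| = sqrt(2)
print(dot / (norm_A * norm_B))           # ~0.408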
This program uses cosine similarity together with the nltk toolkit module. To run it, nltk must be installed on your system. To install the nltk module, follow the steps below (a lighter download option is sketched after the list):
1. Open a terminal (Linux).
2. sudo pip3 install nltk
3. python3
4. import nltk
5. nltk.download('all')
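Downloading 'all' is the simplest option but fetches every NLTK corpus. A lighter alternative is to download only what this program needs (the exact package names depend on your NLTK version; newer releases may also require 'punkt_tab'):
import nltk
nltk.download('punkt')       # tokenizer models used by word_tokenize
nltk.download('stopwords')   # the English stopword list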
Functions used:
nltk.tokenize: It is used for tokenization. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns a list.
nltk.corpus: In this program, it is used to get the list of stopwords. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”); see the short illustration after this list.
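A short illustration of both functions (output shown as comments; the exact stopword list may differ slightly between NLTK versions):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
print(word_tokenize("Lights out is a horror movie"))
# ['Lights', 'out', 'is', 'a', 'horror', 'movie']
sw = stopwords.words('english')
print('is' in sw, 'horror' in sw)
# True False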
Below is the Python implementation:
# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X ="I love horror movies"
Y ="Lights out is a horror movie"
# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)
# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []
# remove stop words from the strings
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}
# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    if w in X_set: l1.append(1)  # create a vector
    else: l1.append(0)
    if w in Y_set: l2.append(1)
    else: l2.append(0)
c = 0
# cosine formula
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)
Output:
similarity: 0.2886751345948129
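As an optional cross-check (this assumes scikit-learn is installed and reuses the l1 and l2 vectors built in the program above), sklearn.metrics.pairwise.cosine_similarity produces the same value:
from sklearn.metrics.pairwise import cosine_similarity
# l1 and l2 are the binary keyword vectors built above
print(cosine_similarity([l1], [l2])[0][0])   # ~0.2887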