Find the Sentence in a File Most Similar to an Input Sentence | Natural Language Processing

Last modified: 2022-05-13 01:55:06.948000 | Author: Mango

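In this article, we find the sentence in a file that is most similar to an input sentence. Similarity is measured as the number of words the two sentences share after cleaning.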

Example:

File content:
"This is movie."
"This is romantic movie"
"This is a girl."

Input: "This is a boy"

Similar sentence to input: 
"This is a girl", "This is movie".

Approach:

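  1. Create a list that stores every unique word in the file.
  2. After cleaning (removing stop words, stemming, etc.), convert each sentence of the file into binary form: for every word in the list, record 1 if the sentence contains it and 0 otherwise.
  3. Convert the input sentence into binary form in the same way.
  4. For each sentence in the file, count the words it shares with the input sentence and store the counts in a list called the similarity index.
  5. Find the maximum value in the similarity index and return the sentence(s) with the most shared words.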
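
The five steps can be sketched on an in-memory list of sentences before bringing in NLTK. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical stop-word list, stemming is left out, the helper names `clean` and `most_similar` are invented here, and the query "a romantic movie" is a made-up input chosen so that one sentence clearly wins.

```python
import re

# Hypothetical mini stop-word list; the article's code uses NLTK's English list.
STOP_WORDS = {"this", "is", "a", "an", "the"}


def clean(sentence):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming here)."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS}


def most_similar(sentences, query):
    """Return (best_sentences, scores) using binary bag-of-words vectors."""
    cleaned = [clean(s) for s in sentences]
    # Step 1: vocabulary of unique words, in a fixed order
    vocab = sorted(set().union(*cleaned))
    # Steps 2 and 3: binary vectors for the file sentences and the input
    vectors = [[int(w in words) for w in vocab] for words in cleaned]
    query_words = clean(query)
    query_vec = [int(w in query_words) for w in vocab]
    # Step 4: similarity index = number of shared vocabulary words
    scores = [sum(q * v for q, v in zip(query_vec, vec)) for vec in vectors]
    # Step 5: every sentence that reaches the maximum score
    best = max(scores)
    return [s for s, sc in zip(sentences, scores) if sc == best], scores


sentences = ["This is movie.", "This is romantic movie", "This is a girl."]
best, scores = most_similar(sentences, "a romantic movie")
print(scores)  # [1, 2, 0]
print(best)    # ['This is romantic movie']
```

Sorting the vocabulary is a choice made for this sketch so that the vector positions are reproducible; any fixed order works as long as every vector uses the same one.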

Code to find the similar sentences:

Python3
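import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

# fetch the stop-word list and the tokenizer models
# (recent NLTK releases may also need 'punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')
ps = PorterStemmer()

# read the file and split it into sentences
with open('romyyy.txt') as f:
    a = sent_tokenize(f.read())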
  
# stop words to remove (a set gives fast membership tests)
stop_words = set(stopwords.words('english'))

# tokenize every sentence of the file
s = [word_tokenize(sentence) for sentence in a]
outer_1 = []

for tokens in s:
    inner_1 = []

    for tok in tokens:
        tok = tok.lower()

        # skip stop words and tokens with no letter or digit (punctuation);
        # the check runs before stemming so that "this" is still recognized
        if tok in stop_words or not any(ch.isalnum() for ch in tok):
            continue
        inner_1.append(ps.stem(tok))

    # unique stemmed words of this sentence
    outer_1.append(set(inner_1))
# vocabulary: every unique stemmed word in the file
rvector = set().union(*outer_1)

# one binary vector per sentence: 1 if the vocabulary word occurs in it
outer = [[1 if w in words else 0 for w in rvector]
         for words in outer_1]
comparison = input("Input: ")

# clean the input like the file: lowercase, drop stop words, then stem
check = {ps.stem(w.lower()) for w in word_tokenize(comparison)
         if w.lower() not in stop_words}

# binary vector for the input sentence
check1 = [1 if w in check else 0 for w in rvector]
  
# similarity index: number of vocabulary words each sentence shares
# with the input (dot product of the two binary vectors)
ds = [sum(x * y for x, y in zip(check1, vec)) for vec in outer]

maximum = max(ds)
print()
print("Similar sentences: ")

# print every sentence that reaches the maximum score
for sentence, score in zip(a, ds):
    if score == maximum:
        print(sentence)
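
Because the vectors are binary, the dot product computed above is simply the number of stemmed words the two sentences have in common, so it equals the size of the intersection of their word sets. A small sketch to check that equivalence; the two word sets are made up for illustration and are not taken from the article's file.

```python
# Hypothetical stemmed word sets, standing in for one file sentence and the input
sentence_words = {"romant", "movi"}
input_words = {"movi", "girl"}

# Shared vocabulary in a fixed order
vocab = sorted(sentence_words | input_words)

# Binary vectors over the vocabulary
sentence_vec = [int(w in sentence_words) for w in vocab]
input_vec = [int(w in input_words) for w in vocab]

# Dot product of binary vectors counts positions where both are 1
dot = sum(x * y for x, y in zip(sentence_vec, input_vec))
overlap = len(sentence_words & input_words)

print(dot, overlap)  # 1 1
```

The score is a raw count, so a long sentence has more chances to share words with the input than a short one.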

