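Find the Most Similar Sentence in a File to an Input Sentence | Natural Language Processing
In this article, we will find the sentence in a file that is most similar to a given input sentence.
Example: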
File content:
"This is movie."
"This is romantic movie"
"This is a girl."
Input: "This is a boy"
Similar sentence to input:
"This is a girl", "This is movie".
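Approach:
- Create a list that stores all the unique words of the file.
- After cleaning (removing stop words, stemming, etc.), convert every sentence of the file to a binary vector by checking each word of the list against the sentence.
- Convert the input sentence to a binary vector in the same way.
- For each sentence of the file, count the words it shares with the input sentence, and store the counts in a list called the similarity index.
- Find the maximum value of the similarity index and return the sentences that share the most words with the input.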
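The steps above can be sketched on a toy example without NLTK. This is a minimal illustration under stated assumptions, not the article's implementation: the sentences are assumed to be already cleaned into sets of stemmed words, the input word set is made up for the demonstration, and the helper names `to_binary` and `shared_words` are hypothetical.

```python
# Toy data: each sentence is assumed to be already cleaned
# (stop words removed, words stemmed and lowercased).
sentences = [
    {"movi"},
    {"romant", "movi"},
    {"girl"},
]
query = {"romant", "girl"}  # hypothetical cleaned input sentence

# Step 1: list of all unique words of the file, in a fixed order.
vocabulary = sorted(set().union(*sentences))


def to_binary(words, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `words`."""
    return [1 if w in words else 0 for w in vocabulary]


def shared_words(u, v):
    """Count positions where both binary vectors are 1 (a dot product)."""
    return sum(x * y for x, y in zip(u, v))


# Steps 2 and 3: binary vectors for the file sentences and for the input.
vectors = [to_binary(s, vocabulary) for s in sentences]
query_vector = to_binary(query, vocabulary)

# Steps 4 and 5: similarity index, then the positions of its maximum.
similarity_index = [shared_words(query_vector, v) for v in vectors]
best = max(similarity_index)
most_similar = [i for i, score in enumerate(similarity_index) if score == best]

print(vocabulary)        # ['girl', 'movi', 'romant']
print(similarity_index)  # [0, 1, 1]
print(most_similar)      # [1, 2]
```

Because every sentence that reaches the maximum is kept, ties produce more than one result, which is why the example above can list two similar sentences.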
File content: [figure not available]
Code to get the similar sentences:
Python3
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the stop word list and the tokenizer models.
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))


def clean(sentence):
    """Return the set of stemmed, lowercased words of a sentence,
    without stop words and punctuation signs."""
    words = set()
    for token in word_tokenize(sentence):
        token = token.lower()
        # Skip stop words and tokens made up only of punctuation signs.
        if token in stop_words or all(ch in string.punctuation for ch in token):
            continue
        words.add(ps.stem(token))
    return words


# Read the file and split it into sentences.
with open('romyyy.txt') as f:
    a = sent_tokenize(f.read())

# Cleaned word set of every sentence in the file.
outer_1 = [clean(sentence) for sentence in a]

# All unique words of the file, in a fixed order.
rvector = sorted(set().union(*outer_1))

# Binary vector of every sentence in the file.
outer = [[1 if w in words else 0 for w in rvector] for words in outer_1]

# Binary vector of the input sentence.
comparison = input("Input: ")
check = clean(comparison)
check1 = [1 if w in check else 0 for w in rvector]

# Similarity index: number of words shared with the input sentence.
ds = []
for vector in outer:
    if check1 == vector:
        # An identical vector scores 0, so a sentence with exactly
        # the same words as the input is not favoured.
        ds.append(0)
    else:
        ds.append(sum(x * y for x, y in zip(check1, vector)))

maximum = max(ds)
print()
print("Similar sentences: ")
# Every sentence that reaches the maximum is printed, so ties are all shown.
for i in range(len(ds)):
    if ds[i] == maximum:
        print(a[i])
Output: