Create an Inverted Index for a File using Python


An inverted index is an index data structure that stores a mapping from content (such as words or numbers) to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that points you from a word to a document or a web page.
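
As a minimal, made-up sketch (the two sentences below are invented for illustration and are not the article's file), the idea can be expressed with a plain Python dictionary that maps each word to the IDs of the documents containing it:

Python3

# Tiny illustration of an inverted index: map each word to the set of
# document IDs in which it appears (the documents here are made up).
docs = {
    1: "python builds an inverted index",
    2: "an index maps words to documents",
}

inverted = {}
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

print(inverted["index"])   # {1, 2} -> appears in both documents
print(inverted["python"])  # {1}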

Creating an Inverted Index

We will create a word-level inverted index, that is, it will return the list of lines in which a word is present. We will also build a dictionary whose keys are the words present in the file and whose values are lists of the line numbers on which those words occur. To create the file in a Jupyter notebook, use the magic command:

%%writefile file.txt
This is the first word.
This is the second text, Hello! How are you?
This is the third, this is it now.

This creates a file named file.txt containing the text shown above.
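
The %%writefile magic works only inside a notebook. Outside Jupyter, a short plain-Python sketch like the following (assuming the same file name file.txt) produces an equivalent file; note that it writes no trailing newline after the last line, which is what the line-counting logic below expects.

Python3

# Create file.txt without the %%writefile magic (plain-Python sketch).
lines = [
    "This is the first word.",
    "This is the second text, Hello! How are you?",
    "This is the third, this is it now.",
]

with open("file.txt", "w", encoding="utf8") as f:
    # no trailing newline after the last line,
    # so the newline-based line count below stays 3
    f.write("\n".join(lines))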

Reading the file:

Python3
# open the file and read its full contents
file = open('file.txt', encoding='utf8')
read = file.read()
# move the cursor back to the start so the
# file can be read line by line afterwards
file.seek(0)
read

# count the number of lines in the file:
# 1 plus the number of newline characters
line = 1
for char in read:
    if char == '\n':
        line += 1
print("Number of lines in file is: ", line)

# store each line of the file as
# an element of a list
array = []
for i in range(line):
    array.append(file.readline())

array


Output:

Number of lines in file is: 3
['This is the first word.\n',
'This is the second text, Hello! How are you?\n',
'This is the third, this is it now.']

Functions used:

  • open(): used to open the file.
  • read(): reads the contents of the file.
  • seek(0): returns the cursor to the beginning of the file.
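
The same step can also be written with a context manager and splitlines(), which closes the file automatically; this is only an alternative sketch, and the rest of the article keeps using the code above.

Python3

# Alternative sketch: read the file with a with-statement and let
# splitlines() do the line splitting instead of counting newlines.
with open('file.txt', encoding='utf8') as f:
    read = f.read()

array = read.splitlines(keepends=True)  # same elements as the readline() loop
line = len(array)
print("Number of lines in file is: ", line)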

Removing punctuation:

Python3

# punctuation characters to strip from the text
punc = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
for ele in read:
    if ele in punc:
        read = read.replace(ele, " ")

read

# lowercase everything to maintain uniformity
read = read.lower()
read

Output:

'this is the first word \n
this is the second text hello how are you \n
this is the third this is it now '
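
Python's built-in str.translate together with string.punctuation covers roughly the same characters in a single pass; the sketch below is only an alternative and assumes read still holds the raw file contents.

Python3

import string

# Alternative sketch: replace every punctuation character with a space
# in one pass, then lowercase (assumes `read` is the raw file text).
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = read.translate(table).lower()
cleaned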

Cleaning the data by removing stopwords:

Stopwords are words that carry no meaning of their own (such as "is", "the", "are") and can safely be ignored without sacrificing the meaning of the sentence.

Python3

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# download the resources that word_tokenize
# and the stopword list rely on
nltk.download('stopwords')
nltk.download('punkt')

# split the cleaned text into tokens
text_tokens = word_tokenize(read)

# keep only the tokens that are not English stopwords
tokens_without_sw = [
    word for word in text_tokens if word not in stopwords.words('english')]

print(tokens_without_sw)

Output:

['first', 'word', 'second', 'text', 'hello', 'third']
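
Because the condition in the list comprehension above rebuilds the stopword list for every token, a common refinement is to precompute it once as a set; a minimal sketch:

Python3

from nltk.corpus import stopwords

# Build the stopword set once, then filter the tokens against it.
stop_set = set(stopwords.words('english'))
tokens_without_sw = [word for word in text_tokens if word not in stop_set]
print(tokens_without_sw)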

Creating the inverted index:

Python3

# use a descriptive name instead of shadowing the built-in dict
inverted_index = {}

for i in range(line):
    # compare tokens against the lowercased line
    check = array[i].lower()
    for item in tokens_without_sw:
        # substring check: record the 1-based line number
        # of every line in which the token occurs
        if item in check:
            if item not in inverted_index:
                inverted_index[item] = []
            inverted_index[item].append(i + 1)

inverted_index

Output:

{'first': [1],
'word': [1],
'second': [2], 
'text': [2], 
'hello': [2], 
'third': [3]}
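
With the index in hand, looking up a word is a single dictionary access; the helper name search_lines below is made up for this sketch.

Python3

def search_lines(index, word):
    # Return the line numbers on which the (lowercased) word occurs,
    # or an empty list if the word is not in the index.
    return index.get(word.lower(), [])

print(search_lines(inverted_index, "Hello"))   # [2]
print(search_lines(inverted_index, "python"))  # []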