Word Prediction Using the Concepts of N-grams and CDF

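Prerequisite: a basic understanding of CDFs (cumulative distribution functions) and N-grams.
Problem statement: Given an input word and a text file, predict the next n words that are likely to follow the input word in the text file.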

Examples:

Input  :  is 
Output :  is it simply makes sure that there are never

Input  :  is
Output :  is split, all the maximum amount of objects, it

Input  :  the
Output : the exact same position. There will be some.

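Note: To illustrate the example, I have assigned some sample text to the variable corpus. If you want to test on real-world text data, you can find the data here.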

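Solution: We can solve this problem using basic probability. First, we count how often each word appears directly after the input in the text file. These are n-gram counts; because we only ever look for the single next word, each count is over a pair (the input followed by one word), which is usually called a bigram count. From these frequencies we compute the CDF over all candidate words and then pick a word at random. To pick it, we draw a random number between 0 and 1 and find the smallest CDF value that is greater than or equal to that number. The CDF gives the cumulative probability of each word in the list, so each word owns a slice of the interval [0, 1] whose width equals its relative frequency. Words that follow the input more often are therefore chosen more often, while rarer words can still appear.
Once the CDF value is found, we look up the corresponding word and append it to the output string. To find the following word, you can either append the predicted word to the input string and send the whole string through the process again, or send only the word that was just found using the CDF. I have used the former approach.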
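The counting and selection steps described above can be sketched on a tiny example. This is only an illustrative sketch, not the article's implementation: the helper names `follower_counts` and `pick_from_cdf` and the toy sentence are assumptions made for the example, and `bisect` is used here in place of a manual scan.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

def follower_counts(words, target):
    """Count the words that immediately follow `target` in `words`."""
    target = target.lower()
    return Counter(
        nxt for cur, nxt in zip(words, words[1:]) if cur.lower() == target
    )

def pick_from_cdf(counts, r):
    """Return the first word whose cumulative probability is >= r."""
    words = list(counts)
    total = sum(counts.values())
    # Running totals divided by the grand total give the CDF.
    cdf = [c / total for c in accumulate(counts.values())]
    # bisect_left finds the smallest index with cdf[index] >= r;
    # min() guards against floating-point round-off in the last entry.
    idx = min(bisect.bisect_left(cdf, r), len(words) - 1)
    return words[idx]

# Hypothetical toy corpus, used only for this illustration.
toy = "the cat sat on the mat and the cat ran".split()
counts = follower_counts(toy, "the")   # 'cat' follows twice, 'mat' once
print(counts)
print(pick_from_cdf(counts, random.random()))
```

With these counts the CDF is roughly [0.67, 1.0], so any random number up to about 0.67 selects "cat" and anything above it selects "mat".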

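Note: If you enter the same word several times, you may get different outputs, because each next word is drawn at random. How much the outputs vary depends on the size of the data file: a larger file usually contains more distinct words after any given input, so different outputs become more likely.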
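The variation comes entirely from the random draw. As a hedged sketch (the counts below are hypothetical, and `sample_next` is a helper name invented for this example), the standard library's `random.choices` performs the same frequency-weighted selection, and seeding the generator makes a run repeatable:

```python
import random

# Hypothetical follower counts for some input word (illustrative only).
counts = {"it": 2, "split,": 1, "unlikely": 1}

def sample_next(counts, rng):
    """Draw one word with probability proportional to its count."""
    words = list(counts)
    weights = list(counts.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Two generators created with the same seed produce the same draws,
# so the "different output each time" effect disappears.
a = random.Random(42)
b = random.Random(42)
draws_a = [sample_next(counts, a) for _ in range(5)]
draws_b = [sample_next(counts, b) for _ in range(5)]
print(draws_a == draws_b)  # True
```

Without a fixed seed, repeated calls draw from the same distribution but generally return different words.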

Code for the above algorithm:

Python3

import random
from collections import Counter
 
# This function counts how often each word immediately
# follows "sentence" in the corpus, where the corpus is
# passed in as a list of words ("array").
 
def next_word_freq(array, sentence):
     
    sen_len, word_list = len(sentence.split()), []
     
    for i in range(len(array)):
 
        # If the words in the range (i, i + sen_len) match the sentence
        # and a next word exists within the corpus, append that next
        # word to word_list.
         
        if ' '.join(array[i : i + sen_len]).lower() == sentence.lower():
 
            if i + sen_len < len(array):
 
                word_list.append(array[i + sen_len])
 
    # Return the count of each word in word_list
     
    return dict(Counter(word_list))
 
# Calculate the CDF of each word in the
# Counter dictionary.
 
def CDF(d):
     
    prob_sum, sum_vals = 0, sum(d.values())
     
    for k, v in d.items():
 
        # Calculate the PMF of each word by dividing
        # the freq. by total of all frequencies then add
        # all the PMFs till ith word which is the CDF of
        # the ith word.
         
        pmf = v / sum_vals
        prob_sum += pmf
        d[k] = prob_sum
 
    # Return cdf dictionary
     
    return d
 
# The main function takes the input sentence/word "sent", the number
# of words it contains "x", and the total number of words "n" wanted
# in the output. It predicts n - x words from the corpus.
 
 
def main(sent, x, n):
 
    # I am using this sample text here to illustrate the output.
    # To use a text file instead, uncomment the line below that
    # reads the corpus from a file.
 
    # corpus = open('a.txt','r').read()
 
    corpus = '''text The chance is unlikely if not done programmatically.
    However, imagine the game spawning multiple players at a spawn point,
    this would be the exact same location. I'm not quite sure what you
    mean with spin,     what does the integer reflect? Why is it a
    mismatch between data and structure? The structure does not
    assume a set amount of objects, it can be anything, that's why new
    nodes are created. It simply makes sure that there are not more than
    X leafs inside 1 node. The random is no option of course.
    My splitting algorithm always created the maximum amount of nodes
    already, split over the current node. But I guess I have to change
    this behaviour? Actually, all the books have different authors. And
    most have a different location too. There will be some with the same
    location, but different authors, though. I think my library should be
    able to store books with the same position. There are never
    equally-attractive leaf nodes. If a node is split, all childs will
    reflect a different part of the parent node.'''
     
    l = corpus.split()
 
    # "temp_out" will be used to store each partial sentence
    # which will later be stored into "sent". "out" is used to store
    # the final output.
     
    temp_out = ''
    out = sent + ' '
     
    for i in range(n - x):
 
        # calling the next_word_freq method that returns
        # the frequency of each word next to sent in the
        # whole word corpus.
         
        func_out = next_word_freq(l, sent)
 
        # cdf_dict stores the cdf of each word in the above map
        # that is calculated using method CDF.
         
        cdf_dict = CDF(func_out)
         
        # We use a random number in [0, 1] to predict the next word.
        # The predicted word is the first one whose CDF is greater
        # than or equal to rand.
         
        rand = random.uniform(0, 1)
 
        # If cdf_dict is empty, the word/sentence you entered does not
        # occur in the corpus (or nothing follows it there). In that case
        # stop predicting and print what has been built so far.
         
        if not cdf_dict:
            break
         
        key, val = zip(*cdf_dict.items())
 
        # Iterate through the cdf values and find the smallest value
        # greater than or equal to the random number. That value is the
        # cdf of the predicted word. Append its key to the output string
        # and use it as the next "sent". "pos" defaults to the last word
        # in case floating-point round-off leaves the final cdf just below 1.
         
        pos = len(val) - 1
         
        for j in range(len(val)):
             
            if rand <= val[j]:
                pos = j
                break
                     
        temp_out = key[pos]
        out = out + temp_out + ' '
        sent = temp_out
         
    print(out, end = '\n\n')
 
if __name__ == '__main__':
 
    inp_sent = 'is'
    # The output will have 10 words, including the input sentence/word.
    # The second argument is the number of words in the input.
    main(inp_sent, len(inp_sent.split()), 10)
 
# Code contributed by Gagan Talreja.

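The concept shown above is used in fields such as natural language processing. This is a naive approach, meant only to illustrate the idea. In practice, there are many more algorithms for word prediction. You can find one of them here.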