N-Gram Language Modelling with NLTK

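Language modeling is the task of determining the probability of any sequence of words. Language models are used in a wide variety of applications, such as speech recognition and spam filtering. In fact, language modeling is the key aim behind many state-of-the-art natural language processing models.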

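Methods of Language Modeling:

There are two types of language modeling:

Statistical language modeling: Statistical language modeling, or simply language modeling, is the development of probabilistic models that can predict the next word in a sequence given the words that precede it. N-gram language modeling is one example.
Neural language modeling: Neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger models for challenging tasks such as speech recognition and machine translation. One way of building a neural language model is through word embeddings.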

N-gram

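An N-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be letters, words, or base pairs. N-grams are typically collected from a text or speech corpus (a long text dataset).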

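N-gram Language Model:

An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. A good N-gram model can predict the next word in a sentence, i.e. the value of p(w|h).

Examples of N-grams are unigrams ("This", "article", "is", "on", "NLP") and bigrams ("This article", "article is", "is on", "on NLP").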
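As a minimal sketch of this definition, the following pure-Python helper slides a window of size n over a list of tokens. The name `make_ngrams` is hypothetical and is not an NLTK function; NLTK provides `nltk.util.ngrams` for the same purpose.

```python
def make_ngrams(tokens, n):
    """Return the contiguous n-grams of `tokens` as a list of tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This article is on NLP".split()
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

A sentence of five tokens yields five unigrams and four bigrams, matching the lists in the example above.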

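Now we will establish the relation used to find the next word in a sentence. We need to calculate p(w|h), where w is the candidate for the next word and h is the history of preceding words. For example, in the sentence above, suppose we want the probability that the last word is "NLP" given the previous words: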

$p(NLP | this\, article\, is\, on)$

Generalizing the above, for a five-word sentence this is:

$p(w_5 | w_1, w_2, w_3, w_4)$

and for a sentence of n words:

$p(w_n | w_1, w_2, ..., w_{n-1})$

But how do we calculate it? The answer lies in the chain rule of probability:

$P(A|B) = \frac{P(A,B)}{P(B)}\\ P(A,B) = P(A|B)P(B)$

Generalizing the above equation to n variables:

$P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) ... P(X_n | X_1, X_2, ..., X_{n-1})\\ P(w_1 w_2 w_3 ... w_n) = \prod_i P(w_i | w_1 w_2 ... w_{i-1})$
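As a worked instance of the chain rule, the probability of the five-word example sentence expands into one factor per word, each conditioned on all the words before it:

$P(this\, article\, is\, on\, NLP) = P(this)\, P(article | this)\, P(is | this\, article)\, P(on | this\, article\, is)\, P(NLP | this\, article\, is\, on)$

The last factor is exactly the quantity $p(NLP | this\, article\, is\, on)$ introduced earlier.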

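Simplifying the above formula using the Markov assumption, which conditions each word only on the previous k words: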

$P(w_i | w_1, w_2, ...w_{i-1}) \approx P(w_i | w_{i-k},... w_{i-1} )$

For unigram:

$P(w_1 w_2, ... w_n) \approx \prod_i P(w_i)$

For bigram:

$P(w_i | w_1 w_2, ..w_{i-1}) \approx P(w_i | w_{i-1})$
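One common way to estimate these bigram probabilities is the maximum-likelihood estimate, count(w_{i-1}, w_i) / count(w_{i-1}). The text does not name an estimator, so treat that choice as an assumption. The toy corpus, the boundary markers, and the names `bigram_prob` and `sentence_prob` are all hypothetical.

```python
from collections import Counter

# Toy corpus (made up for illustration); <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "this", "article", "is", "on", "nlp", "</s>"],
    ["<s>", "this", "article", "is", "short", "</s>"],
    ["<s>", "nlp", "is", "fun", "</s>"],
]

context_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        context_counts[prev] += 1        # times `prev` is followed by any word
        bigram_counts[prev, word] += 1   # times `prev` is followed by `word`

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev, word] / context_counts[prev]

def sentence_prob(sent):
    """Probability of a padded sentence under the bigram Markov assumption."""
    prob = 1.0
    for prev, word in zip(sent, sent[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("is", "on"))
print(sentence_prob(["<s>", "this", "article", "is", "on", "nlp", "</s>"]))
```

Any bigram that never occurs in the corpus gets probability zero here, which drives the probability of the whole sentence to zero.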

Implementation

Python3

# imports
import string
import random
from collections import Counter, defaultdict

import nltk
from nltk import FreqDist
from nltk.corpus import reuters, stopwords
from nltk.util import ngrams

# download the required NLTK resources
# (newer NLTK releases may additionally need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')

# input the reuters sentences
sents = reuters.sents()

# build the removal list: stopwords, punctuation and markup remnants
stop_words = set(stopwords.words('english'))
punctuation = string.punctuation + '“”-’‘—'
removal_list = stop_words | set(punctuation) | {'lt', 'rt'}

# generate unigrams, bigrams and trigrams
unigram = []
bigram = []
trigram = []
tokenized_text = []
for sentence in sents:
    # lowercase every token and drop the full stops
    # (building a new list avoids removing items while iterating)
    sentence = [word.lower() for word in sentence if word != '.']
    unigram.extend(sentence)
    tokenized_text.append(sentence)
    bigram.extend(ngrams(sentence, 2, pad_left=True, pad_right=True))
    trigram.extend(ngrams(sentence, 3, pad_left=True, pad_right=True))

# remove the n-grams that consist only of removable words
def remove_stopwords(x):
    # keep an n-gram if at least one of its words is not in the removal list
    return [gram for gram in x
            if any(word not in removal_list for word in gram)]

# unigrams are plain strings, so they are filtered word by word
unigram = [word for word in unigram if word not in removal_list]
bigram = remove_stopwords(bigram)
trigram = remove_stopwords(trigram)

# generate frequency of n-grams
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

# map each two-word prefix to a Counter of the words that follow it
d = defaultdict(Counter)
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[a, b][c] += freq_tri[a, b, c]

# Next word prediction
def pick_word(counter):
    "Chooses a random element, weighted by its count."
    return random.choice(list(counter.elements()))

prefix = "he", "said"
s = " ".join(prefix)
print(s)
for i in range(19):
    if not d[prefix]:
        break  # no trigram starts with this prefix
    suffix = pick_word(d[prefix])
    s = s + ' ' + suffix
    print(s)
    prefix = prefix[1], suffix

he said
he said kotc
he said kotc made
he said kotc made profits
he said kotc made profits of
he said kotc made profits of 265
he said kotc made profits of 265 ,
he said kotc made profits of 265 , 457
he said kotc made profits of 265 , 457 vs
he said kotc made profits of 265 , 457 vs loss
he said kotc made profits of 265 , 457 vs loss eight
he said kotc made profits of 265 , 457 vs loss eight cts
he said kotc made profits of 265 , 457 vs loss eight cts net
he said kotc made profits of 265 , 457 vs loss eight cts net loss
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 ,
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 , 266
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 , 266 ,
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 , 266 , 000
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 , 266 , 000 shares
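The sampling step in `pick_word` can also be written with `random.choices`, which draws in proportion to the counts without expanding the counter into a list. This is an alternative sketch, not the code above: the counter contents and the name `pick_weighted` are made up, and a seeded `random.Random` is used so that runs are repeatable.

```python
import random
from collections import Counter

def pick_weighted(counter, rng):
    """Draw one key from `counter` with probability proportional to its count."""
    words = list(counter)
    weights = [counter[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical follower counts for some two-word prefix.
followers = Counter({"the": 6, "it": 3, "that": 1})
rng = random.Random(0)  # fixed seed for repeatable draws
draws = Counter(pick_weighted(followers, rng) for _ in range(10_000))
print(draws.most_common())
```

Because the text is sampled rather than chosen greedily, every run of the generator without a fixed seed produces a different continuation of the prefix.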

Metrics for Language Modeling

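Entropy: Entropy, a measure of the amount of information conveyed, was introduced by Claude Shannon. The formula for entropy is:

$H(p) = \sum_{x} p(x) \cdot (-\log(p(x)))$

H(p) is always greater than or equal to 0.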
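The entropy formula translates directly into a few lines of Python. This is a small sketch rather than a library routine: the function name `entropy` is hypothetical, and base 2 (bits) is assumed because the formula leaves the base of the logarithm unspecified.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Terms with p(x) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # fair coin
print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over four outcomes
print(entropy([1.0]))                      # a certain outcome
```

The uniform distributions give the largest values, and a certain outcome gives zero, the lower bound.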

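Cross-entropy: It measures how well the trained model represents the test data.

$H(p) = \sum_{i=1}^{n} \frac{1}{n} (-\log_2(p(w_i | w_{1}^{i-1})))$

Here n is the number of words in the test data and $w_{1}^{i-1}$ denotes the words that precede $w_i$.

Cross-entropy is always greater than or equal to entropy, i.e. the model's uncertainty can be no less than the true uncertainty.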
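Given the probability a model assigns to each word of a test text, the cross-entropy is the average negative base-2 log-probability per word. The sketch below assumes those conditional probabilities are already available as a list; the values and the function name `cross_entropy` are hypothetical.

```python
import math

def cross_entropy(cond_probs):
    """Average negative log2-probability the model assigns to each word."""
    n = len(cond_probs)
    return sum(-math.log2(p) for p in cond_probs) / n

# Hypothetical model probabilities p(w_i | w_1^{i-1}) for a four-word test text.
probs = [0.25, 0.5, 0.125, 0.5]
print(cross_entropy(probs))
```

A model that assigns higher probabilities to the words that actually occur obtains a lower cross-entropy.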

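Perplexity: Perplexity measures how well a probability distribution predicts a sample, and can be understood as a measure of uncertainty. It is computed by raising 2 to the power of the cross-entropy.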

$2^{Cross-Entropy}$

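Equivalently, the perplexity of a test set is the inverse of the probability that the language model assigns to it, normalized by the number of words N. For a bigram model:

$PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i | w_{i-1})}}$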

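For example:

Take the sentence "Natural Language Processing". To predict the first word, suppose the candidate words have the following probabilities: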

word	P(word | start of sentence)
The	0.4
Processing	0.3
Natural	0.12
Language	0.18

Now we know the probability of getting "Natural" as the first word. Next, what is the probability of getting "Language" as the word after "Natural"?

word	P(word | 'Natural')
The	0.05
Processing	0.3
Natural	0.15
Language	0.5

Having generated the words "Natural Language", what is the probability of getting "Processing" next?

word	P(word | 'Language')
The	0.1
Processing	0.7
Natural	0.1
Language	0.1

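Now, the perplexity can be calculated as:

$PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i | w_{i-1})}} = \sqrt[3]{\frac{1}{0.12 \times 0.5 \times 0.7}} \approx 2.877$

From this we can also compute the cross-entropy, which is the base-2 logarithm of the perplexity:

$Entropy = \log_2(2.877) \approx 1.524$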
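The arithmetic of this worked example can be checked with a short script. It is only a verification sketch: the three probabilities are read from the tables above, and the variable names are hypothetical.

```python
import math

# P(Natural | start), P(Language | Natural), P(Processing | Language)
probs = [0.12, 0.5, 0.7]

n = len(probs)
# N-th root of the product of inverse probabilities.
perplexity = math.prod(1.0 / p for p in probs) ** (1.0 / n)
# Average negative log2-probability per word.
cross_entropy = sum(-math.log2(p) for p in probs) / n

print(round(perplexity, 3))
print(round(cross_entropy, 3))
print(math.isclose(perplexity, 2 ** cross_entropy))
```

The final line confirms that the root-of-products form of perplexity and the 2-to-the-cross-entropy form agree.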

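Shortcomings:

To capture more of the context of a text, we need higher values of n, but this also increases the computational overhead.
Increasing n also makes the data sparser: most of the possible longer n-grams never appear in the training corpus.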
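The sparsity problem can be illustrated by measuring what fraction of all possible n-grams actually occurs in a text as n grows. The sketch below uses a tiny made-up corpus and a hypothetical helper `observed_fraction`; real corpora are far larger, but the trend is the same.

```python
# Tiny made-up corpus with a vocabulary of eight words.
text = ("the cat sat on the mat the dog sat on the log "
        "the cat saw the dog on the mat").split()
vocab_size = len(set(text))

def observed_fraction(tokens, n, vocab_size):
    """Fraction of all vocab_size**n possible n-grams that occur in `tokens`."""
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(seen) / vocab_size ** n

for n in (1, 2, 3):
    print(n, observed_fraction(text, n, vocab_size))
```

The number of possible n-grams grows exponentially with n while the amount of text stays fixed, so the observed fraction shrinks quickly.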
