📜  Natural Language Processing - Sequencing



NLP sequencing converts sentences from a large corpus into sequences of numeric tokens that can then be fed to a neural network. We take a set of sentences and assign a numeric token to each word based on the training-set sentences.

Example:

sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]

Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4,
             'what': 5, 'do': 6, 'think': 7, 'about': 8}

Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
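The Keras Tokenizer assigns indices in order of descending word frequency, which is why 'geeksforgeeks' (three occurrences) receives index 1. Below is a minimal pure-Python sketch of that indexing rule; it mimics the tokenizer's lowercasing and punctuation stripping for this particular corpus only:

from collections import Counter

sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]

# count lowercased words, stripping the '?' that the tokenizer would filter out
counts = Counter(word.strip('?')
                 for sentence in sentences
                 for word in sentence.lower().split())

# indices start at 1 and follow descending frequency
word_index = {word: i
              for i, (word, _) in enumerate(counts.most_common(), start=1)}
print(word_index)
# {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4,
#  'what': 5, 'do': 6, 'think': 7, 'about': 8}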

Now, if the test set contains words the network has never seen before, or if we have to predict a missing word in a sentence, we can add a simple placeholder token to stand in for the unknown words.

Let the test set be:

test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]

We then define an extra placeholder token for words the tokenizer has not seen before. By default, this placeholder receives index 1.

Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4,
              'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

Since the words 'really' and 'like' have not been encountered before, they are simply replaced by the placeholder, which has index 1.

So the test sequences now become:

Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
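Under the hood this substitution is just a dictionary lookup with a fallback to the placeholder's index. A minimal sketch, assuming the word index shown above:

# word index produced with the placeholder token (index 1)
word_index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4,
              'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]

# unseen words ('really', 'like') fall back to index 1
test_seq = [[word_index.get(word, word_index['placeholder'])
             for word in sentence.lower().split()]
            for sentence in test_data]
print(test_seq)   # [[5, 1, 3, 2], [7, 4, 1, 2]]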

Code: Implementation using TensorFlow
# importing the required module
from tensorflow.keras.preprocessing.text import Tokenizer
  
# the initial corpus of sentences or the training set
sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]
  
tokenizer = Tokenizer(num_words=100)
  
# fitting builds the vocabulary; the tokenizer also
# lowercases words and removes punctuation
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
  
# defining an out-of-vocabulary token and naming it "placeholder"
tokenizer = Tokenizer(num_words=100,
                      oov_token="placeholder")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences = ", sequences)
  
  
# the test data with words the tokenizer hasn't encountered
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
  
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

Output:

Word Index:  {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences:  [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]

Sequences =  [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

Test Sequence =  [[5, 1, 3, 2], [7, 4, 1, 2]]
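The resulting sequences have different lengths, and a network expects uniformly shaped input, so the usual next step is to pad them to a common length with the pad_sequences utility. A minimal sketch (not part of the listing above), using the sequences generated from the training set:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

# by default, zeros are added on the left until every row
# matches the longest sequence
padded = pad_sequences(sequences)
print(padded)
# [[0 0 0 5 3 2]
#  [0 0 0 4 3 2]
#  [6 7 4 8 9 2]]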