NLP | Training a tokenizer and filtering stopwords in a sentence

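Why do we need to train a sentence tokenizer?
NLTK's default sentence tokenizer is general-purpose and usually works well. It can still split some text poorly, for example when the text uses non-standard punctuation or has an unusual format. In such cases, training a sentence tokenizer on that kind of text can give more accurate sentence boundaries.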

Consider the following text to understand the idea. Text like this is common in any web text corpus.

Example of TEXT:
A guy: So, what are your plans for the party?
B girl: well! I am not going!
A guy: Oh, but u should enjoy.

Code #1: Training the tokenizer

# Load the tokenizer class and the webtext corpus reader
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# Read the raw training text
text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')

# Passing raw text to the constructor trains the tokenizer on it
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)

print(sents_1[0])
print("\n" + sents_1[678])

Output:

'White guy: So, do you have any plans for this evening?'

'Hobo: Got any spare change?'

Code #2: Default sentence tokenizer

# Default NLTK sentence tokenizer, applied to the same text as in Code #1
from nltk.tokenize import sent_tokenize
sents_2 = sent_tokenize(text)

print(sents_2[0])
print("\n" + sents_2[678])

Output:

'White guy: So, do you have any plans for this evening?'

'Girl: But you already have a Big Mac...\r\nHobo: Oh, this is all theatrical.'

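The difference in the second output shows why training your own sentence tokenizer is useful: the default tokenizer merges two speakers' lines into a single sentence, while the trained tokenizer returns one line of dialogue. This matters most when your text does not follow the typical paragraph-and-sentence structure.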
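To find where two tokenizations start to disagree, the two sentence lists can be compared index by index. The sketch below is only an illustration: `first_difference` is a hypothetical helper, and the two short lists are made-up stand-ins for `sents_1` and `sents_2`, not real tokenizer output.

```python
from itertools import zip_longest


def first_difference(sents_a, sents_b):
    """Return the first index at which two sentence lists differ, or None."""
    for index, (a, b) in enumerate(zip_longest(sents_a, sents_b)):
        if a != b:
            return index
    return None


# Made-up stand-ins for sents_1 and sents_2 (assumption, not tokenizer output)
list_a = [
    "A guy: So, what are your plans for the party?",
    "B girl: well!",
    "I am not going!",
]
list_b = [
    "A guy: So, what are your plans for the party?",
    "B girl: well! I am not going!",
]

print(first_difference(list_a, list_b))  # 1
```

With the real lists, calling `first_difference(sents_1, sents_2)` would point to the first sentence worth inspecting by hand.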

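How does training work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because it needs no labeled training data, only raw text.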
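As a minimal sketch of what "only raw text" means in practice, the snippet below trains a tokenizer on a plain string with no sentence-boundary labels. The training string reuses the short dialogue from the example text above; a sample this small is only meant to show the call pattern, and a useful tokenizer would need far more text.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Tiny raw-text sample taken from the example dialogue (illustration only)
training_text = (
    "A guy: So, what are your plans for the party?\n"
    "B girl: well! I am not going!\n"
    "A guy: Oh, but u should enjoy.\n"
)

# Passing raw text to the constructor trains the tokenizer on it;
# no labeled sentence boundaries are supplied.
tokenizer = PunktSentenceTokenizer(training_text)

# tokenize() returns a list of sentence strings
sentences = tokenizer.tokenize(training_text)
for sentence in sentences:
    print(repr(sentence))
```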

Filtering stopwords in a tokenized sentence

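Stopwords are common words that appear in text but usually contribute little to the meaning of a sentence. They matter little for information retrieval and natural language processing. Examples are "the" and "a". Most search engines filter stopwords out of search queries and documents.
The NLTK library provides a stopwords corpus, stored under nltk_data/corpora/stopwords/, which contains word lists for many languages.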

Code #3: Stopwords with Python

# Load the stopwords corpus
from nltk.corpus import stopwords

# Build a set of English stopwords for fast membership tests
english_stops = set(stopwords.words('english'))

# Sample tokens to filter
words = ["Let's", 'see', 'how', "it's", 'working']

print("Before stopwords removal: ", words)
print("\nAfter stopwords removal : ",
      [word for word in words if word not in english_stops])

Output:

Before stopwords removal:  ["Let's", 'see', 'how', "it's", 'working']

After stopwords removal :  ["Let's", 'see', 'working']
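The filter in Code #3 is an exact string match, so a capitalized token is kept even when its lowercase form is a stopword. The sketch below shows one possible way to handle this by lowercasing each token before the lookup. `sample_stops` is a small hand-written set used only for illustration (an assumption); in practice it would be `set(stopwords.words('english'))`.

```python
# Small hand-written stop set for illustration only (assumption);
# it stands in for set(stopwords.words('english')).
sample_stops = {"the", "a", "how", "it's"}

words = ["How", "it's", "working", "The", "end"]

# Exact match: capitalized tokens slip through
case_sensitive = [w for w in words if w not in sample_stops]

# Lowercase each token before the membership test
case_insensitive = [w for w in words if w.lower() not in sample_stops]

print(case_sensitive)    # ['How', 'working', 'The', 'end']
print(case_insensitive)  # ['working', 'end']
```

The original casing of the surviving tokens is preserved, because only the lookup key is lowercased.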

Code #4: Complete list of languages available in the NLTK stopwords corpus.

stopwords.fileids()

Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german',
'hungarian', 'italian', 'norwegian', 'portuguese', 'russian',
'spanish', 'swedish', 'turkish']