Python | Tokenizing Text Using TextBlob

Last modified: 2021-04-16 08:09:34    Author: Mango

TextBlob is a Python library that provides a simple API for carrying out basic NLP tasks. It is built on top of NLTK.

Install TextBlob by running the following commands in a terminal:

pip install -U textblob
python -m textblob.download_corpora

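The first command installs TextBlob and the second downloads the NLTK corpora it needs. The download covers a large number of tokenizers, chunkers, other algorithms, and corpora, so the installation can take quite a while.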
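Once the commands finish, a quick check from Python can confirm that the packages are importable. The snippet below is only a minimal sketch: `is_installed` is a hypothetical helper name, it relies solely on the standard library's `importlib.util.find_spec`, and it does not verify that the corpora were downloaded.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found on the import path."""
    # find_spec returns None when no such top-level module is available.
    return importlib.util.find_spec(module_name) is not None


# Report whether the two packages used in this article are available.
print("textblob installed:", is_installed("textblob"))
print("nltk installed:", is_installed("nltk"))
```

If either line prints False, rerunning the pip command above in the same Python environment is the first thing to try.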

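Some frequently used terms:

  • Corpus: a body of text (singular). The plural is corpora.
  • Lexicon: words and their meanings.
  • Token: each "entity" that results from splitting text according to some rule. For example, when a sentence is tokenized into words, each word is a token. When a paragraph is tokenized into sentences, each sentence is a token.

Tokenization, then, is essentially the process of splitting a body of text into sentences and words.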
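To make the idea of a token concrete, here is a rough standard-library sketch of both kinds of splitting. It is an illustration only, not how TextBlob or NLTK tokenize: the helper names `split_sentences` and `split_words` are hypothetical, and the regular expressions are deliberately naive (they would mishandle abbreviations such as "e.g.", for instance).

```python
import re


def split_sentences(text):
    """Naive sentence tokenizer: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(text):
    """Naive word tokenizer: runs of letters/digits, optionally joined by ' or -."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)


sample = "Tokenizing is simple. Each word is a token! Is each sentence a token too?"

# Sentence-level tokens: each sentence is one token.
sentences = split_sentences(sample)
print(sentences)

# Word-level tokens: each word of the first sentence is one token.
print(split_words(sentences[0]))
```

Real tokenizers handle many edge cases that these patterns ignore, which is the main reason to rely on a library instead.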

The example below uses TextBlob to tokenize a paragraph into words and into sentences:

# import the TextBlob class from the textblob package
from textblob import TextBlob

text = ("Natural language processing (NLP) is a field "
        "of computer science, artificial intelligence "
        "and computational linguistics concerned with "
        "the interactions between computers and human "
        "(natural) languages, and, in particular, "
        "concerned with programming computers to "
        "fruitfully process large natural language "
        "corpora. Challenges in natural language "
        "processing frequently involve natural "
        "language understanding, natural language "
        "generation (frequently from formal, machine"
        "-readable logical forms), connecting language "
        "and machine perception, managing human-"
        "computer dialog systems, or some combination "
        "thereof.")

# create a TextBlob object from the paragraph
blob_object = TextBlob(text)

# tokenize the paragraph into words (punctuation is dropped)
print(" Word Tokenize :\n", blob_object.words)

# tokenize the paragraph into sentences
print("\n Sentence Tokenize :\n", blob_object.sentences)

Output: