Python | Tokenizing text with TextBlob

📅 Last modified: 2021-04-16 08:09:34 🧑 Author: Mango

TextBlob is a Python library that provides a simple API for performing basic NLP tasks. It is built on top of NLTK.

Install TextBlob by running the following commands in a terminal:

pip install -U textblob
python -m textblob.download_corpora

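This installs TextBlob and downloads the NLTK corpora it needs. Because the download includes many tokenizers, chunkers, other algorithms, and all of the corpora, the installation can take quite a while.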
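After the commands finish, it can help to confirm that the packages are visible to the Python interpreter you intend to use. The sketch below is only an illustration: `is_installed` is a hypothetical helper name, and it checks that the modules can be found, not that the corpora were downloaded.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the two packages used in this article are available.
for name in ("textblob", "nltk"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, the most likely cause is that `pip` installed into a different environment than the one running the script.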

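Some frequently used terms:

Corpus: a body of text (singular). The plural is corpora.
Lexicon: words and their meanings.
Token: each "entity" that results from splitting text according to some rule. For example, when a sentence is tokenized into words, each word is a token. When a paragraph is tokenized into sentences, each sentence is a token.

In short, tokenization means splitting a body of text into sentences and words.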
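To make the idea concrete, here is a minimal sketch of rule-based tokenization using only the standard library. It is not how TextBlob or NLTK tokenize internally; the function names and the two regular expressions are assumptions chosen for illustration, and the rules are deliberately naive (for instance, they would split incorrectly after an abbreviation such as "Dr.").

```python
import re


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Naive rule: a word is a run of letters or digits, optionally joined
    by inner hyphens or apostrophes; all other punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)


sample = "Tokenizing is simple. It splits text into sentences and words!"
sentences = split_sentences(sample)          # sentence tokens
words = [split_words(s) for s in sentences]  # word tokens, grouped by sentence
print(sentences)
print(words)
```

The same text yields different tokens depending on the rule applied, which is why each sentence can itself be a token at one level and a source of word tokens at the next.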

# import the TextBlob class from the textblob library
from textblob import TextBlob

# sample paragraph, built from adjacent string pieces
text = ("Natural language processing (NLP) is a field " +
       "of computer science, artificial intelligence " +
       "and computational linguistics concerned with " +
       "the interactions between computers and human " +
       "(natural) languages, and, in particular, " +
       "concerned with programming computers to " +
       "fruitfully process large natural language " +
       "corpora. Challenges in natural language " +
       "processing frequently involve natural " +
       "language understanding, natural language " +
       "generation (frequently from formal, machine" +
       "-readable logical forms), connecting language " +
       "and machine perception, managing human-" +
       "computer dialog systems, or some combination " +
       "thereof.")

# create a TextBlob object
blob_object = TextBlob(text)

# tokenize the paragraph into words (punctuation is dropped)
print("Word Tokenize:")
print(blob_object.words)

# tokenize the paragraph into sentences
print("\nSentence Tokenize:")
print(blob_object.sentences)

Output:

Word Tokenize:
['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machine-readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human-computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']

Sentence Tokenize:
[Sentence("Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora."), Sentence("Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.")]
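Once text has been split into word tokens, a common next step is to count them. The following is a small sketch under stated assumptions: `most_common_tokens` is a hypothetical helper, and the token list is a short hand-copied sample rather than real TextBlob output, so the snippet runs without TextBlob installed. In a full script, `blob_object.words` could be passed in its place, since it behaves like a list of strings.

```python
from collections import Counter


def most_common_tokens(tokens, n=3):
    """Count tokens case-insensitively and return the n most frequent
    as (token, count) pairs."""
    counts = Counter(token.lower() for token in tokens)
    return counts.most_common(n)


# A hand-picked sample of word tokens, used here for illustration only.
tokens = ["Natural", "language", "processing", "NLP", "is", "a", "field",
          "natural", "language", "corpora", "natural", "language"]
print(most_common_tokens(tokens, n=2))
```

Lowercasing before counting treats "Natural" and "natural" as the same token; whether that is desirable depends on the task.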
