📜  NLP | Categorized Text Corpus

📅  Last modified: 2022-05-13 01:54:18.393000             🧑  Author: Mango

If we have a large amount of text data, we can organize it into separate categories.

Code #1: Categories

Python3
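# Load the Brown corpus (run nltk.download('brown') once if it is not installed)
from nltk.corpus import brown

# List the categories the corpus is divided into
print(brown.categories())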


Output:

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']

How to categorize a corpus?
The simplest way is to create one file per category. The following two files hold excerpts from the movie_reviews corpus:

  • movie_pos.txt
  • movie_neg.txt

With these two files, we get two categories: pos and neg.
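
The reader code in this article expects both files to exist already. As a minimal sketch for trying it out, the snippet below writes two small stand-in files. The sample sentences, the `SAMPLES` dictionary and the `make_sample_corpus` helper are hypothetical placeholders, not the actual movie_reviews excerpts.

```python
import tempfile
from pathlib import Path

# Hypothetical placeholder text; NOT the real movie_reviews excerpts.
SAMPLES = {
    "movie_pos.txt": "a wonderful , moving film with a great cast .\n",
    "movie_neg.txt": "a dull , lifeless film with a weak plot .\n",
}


def make_sample_corpus(root=None):
    """Write one file per category into root and return the directory path."""
    # Fall back to a fresh temporary directory when no root is given.
    corpus_dir = Path(root) if root is not None else Path(tempfile.mkdtemp())
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for name, text in SAMPLES.items():
        (corpus_dir / name).write_text(text, encoding="utf-8")
    return corpus_dir


corpus_dir = make_sample_corpus()
print(sorted(p.name for p in corpus_dir.iterdir()))
# prints ['movie_neg.txt', 'movie_pos.txt']
```

To use these files with a corpus reader, either call `make_sample_corpus('.')` to write them into the current directory, or pass `str(corpus_dir)` as the reader's root argument.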

Code #2: Let's categorize

Python3

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_pattern is a regex whose first group captures the category from each file name
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

print("Categorize : ", reader.categories())

print("\nNegative field : ", reader.fileids(categories=['neg']))

print("\nPositive field : ", reader.fileids(categories=['pos']))

Output:

Categorize : ['neg', 'pos']

Negative field : ['movie_neg.txt']

Positive field : ['movie_pos.txt']
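
To see why these file names yield the categories neg and pos, it can help to apply the same regular expression by hand. The sketch below uses only Python's `re` module. The `category_from_fileid` helper is a hypothetical illustration of the idea and is not NLTK's own code.

```python
import re

# Same pattern as the cat_pattern argument used above.
CAT_PATTERN = r'movie_(\w+)\.txt'


def category_from_fileid(fileid, pattern=CAT_PATTERN):
    """Return the first regex group matched against fileid, or None if no match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


fileids = ['movie_neg.txt', 'movie_pos.txt']
mapping = {f: category_from_fileid(f) for f in fileids}
print(mapping)
# prints {'movie_neg.txt': 'neg', 'movie_pos.txt': 'pos'}
```

The parenthesized group `(\w+)` is what becomes the category name, so a file called `movie_mixed.txt` would fall into a category named mixed under the same pattern.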

Code #3: Using cat_map instead of cat_pattern

Python3

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_map explicitly maps each file name to a list of its categories
reader = CategorizedPlaintextCorpusReader(
    '.', r'movie_.*\.txt', cat_map={'movie_pos.txt': ['pos'],
                                    'movie_neg.txt': ['neg']})

print("Categorize : ", reader.categories())

Output:

Categorize : ['neg', 'pos']
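
Because cat_map maps each file to a list of categories, one file can belong to several categories at once. As a rough sketch of how a lookup by category could be derived from such a mapping, the snippet below inverts the dictionary in plain Python. The `invert_cat_map` helper is hypothetical and only illustrates the data structure; it is not how NLTK is implemented.

```python
from collections import defaultdict

# Same mapping as the cat_map argument used above.
cat_map = {'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']}


def invert_cat_map(cat_map):
    """Build a category -> sorted list of file names mapping from cat_map."""
    by_category = defaultdict(list)
    for fileid, categories in cat_map.items():
        for category in categories:
            by_category[category].append(fileid)
    return {category: sorted(files) for category, files in by_category.items()}


by_category = invert_cat_map(cat_map)
print(sorted(by_category))
# prints ['neg', 'pos']
print(by_category['neg'])
# prints ['movie_neg.txt']
```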