Python - Preprocessing of Tamil Text

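Preprocessing is a major part of natural language processing (NLP). Clean data plays an important role in classifying any text with high accuracy, so preprocessing is the first step in NLP before analysis or classification. Many Python libraries support preprocessing for English, but very few are available for Tamil. Below are some examples of preprocessing techniques for Tamil text.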

The preprocessing techniques covered in this article are:

Punctuation removal
Tokenization
Stop word removal

Punctuation removal:

Python3

# Import Python's built-in string module
import string
# Print the built-in constant that lists the ASCII punctuation characters
print(string.punctuation)

Output:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Any of the punctuation marks above that appear in the text are removed during preprocessing. This can be done using Python's string module.

Python3

# Function for removing punctuation
def punctuation_remove(text_data):
    # Keep only the characters that are not ASCII punctuation
    cleaned_text = "".join([t for t in text_data if t not in string.punctuation])
    return cleaned_text
  
# Passing input to the function
punctuation_removed = punctuation_remove("வெற்றி *பெற வேண்டும், என்ற பதற்றம் ^இல்லாமல் _இருப்பது தான் 'வெற்றி பெறுவதற்கான சிறந்த வழி.") 
print(punctuation_removed)

Output:

வெற்றி பெற வேண்டும் என்ற பதற்றம் இல்லாமல் இருப்பது தான் வெற்றி பெறுவதற்கான சிறந்த வழி

Explanation: All punctuation marks in the given text are removed.
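Note that `string.punctuation` lists only ASCII punctuation, so marks such as curly quotation marks would survive the function above. The following is a minimal sketch, assuming you also want to drop every character in a Unicode punctuation category; the helper name `remove_all_punctuation` is hypothetical and not part of the original article.

```python
import string
import unicodedata

def remove_all_punctuation(text_data):
    # Drop ASCII punctuation (string.punctuation) and, in addition, any
    # character whose Unicode general category starts with "P" (punctuation).
    # Tamil letters and vowel signs fall in the L* and M* categories, so they are kept.
    return "".join(
        ch for ch in text_data
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )

sample = "“வெற்றி” பெற ^வேண்டும், என்ற!"
cleaned = remove_all_punctuation(sample)
print(cleaned)
```

The ASCII check is kept alongside the category check because some characters in `string.punctuation`, such as `^` and `$`, are classified by Unicode as symbols rather than punctuation.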

Tokenization:

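Tokenization simply splits each word in a sentence into a token, which is then used for further classification. Here the text is converted into tokens with Python's regular expression module.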

Python3

# Importing Python's regular expression module
import re

# Function for tokenization
def tokenization(text_data):
    # Split the sentence into words wherever a space is found
    tokens_text = re.split(' ', text_data)
    return tokens_text
    
    
# Passing the punctuation removed text as parameter for tokenization  
tokenized_text = tokenization(punctuation_removed)  
print(tokenized_text)

Output:

['வெற்றி', 'பெற', 'வேண்டும்', 'என்ற', 'பதற்றம்', 'இல்லாமல்', 'இருப்பது', 'தான்', 'வெற்றி', 'பெறுவதற்கான', 'சிறந்த', 'வழி']

Explanation: All the words in the sentence are split into tokens.
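Splitting on a single space produces empty tokens when the text contains consecutive spaces, and it does not split on tabs or newlines. The sketch below shows a whitespace-tolerant variant, assuming that is the behavior you want; the name `tokenize_whitespace` is hypothetical.

```python
import re

def tokenize_whitespace(text_data):
    # Strip leading/trailing whitespace first so no empty tokens appear at the ends,
    # then split on runs of any whitespace (spaces, tabs, newlines).
    stripped = text_data.strip()
    return re.split(r"\s+", stripped) if stripped else []

# Splitting on a single space leaves an empty string between two spaces.
print(re.split(' ', "வெற்றி  பெற"))  # ['வெற்றி', '', 'பெற']

# The whitespace-tolerant variant returns only the words.
tokens = tokenize_whitespace("வெற்றி  பெற\tவேண்டும்\n")
print(tokens)  # ['வெற்றி', 'பெற', 'வேண்டும்']
```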

Stop word removal:

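Stop words are words that are used very frequently in a language and contribute little to the meaning of a sentence. They can be removed with the NLTK package in Python. NLTK supports stop words for many languages, such as English, French, German, Finnish, and Italian, but not Tamil. Therefore, download the Tamil stop word list from the given link (GitHub link), name the file tamil, and place it in the following location on your system: "....\AppData\Roaming\nltk_data\corpora\stopwords"

After this step, the NLTK package also supports Tamil stop words.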
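If copying the file into the `nltk_data` folder is inconvenient, the list can also be read directly from disk. This is only a sketch, assuming the downloaded file is UTF-8 text with one stop word per line; the function name `load_stopwords` and the demo file contents are hypothetical stand-ins for the real list.

```python
import tempfile
from pathlib import Path

def load_stopwords(path):
    # Read one stop word per line, skipping blank lines.
    # A set is returned so that membership checks are fast.
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}

# Demo: a temporary file stands in for the downloaded stop word list.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "tamil"
    demo_file.write_text("வேண்டும்\nஎன்ற\n\nதான்\n", encoding="utf-8")
    tamil_stopwords = load_stopwords(demo_file)

print(sorted(tamil_stopwords))
```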

Python3

# Importing the Natural Language Toolkit (NLTK) library
import nltk

# Load the Tamil stop words from the file named 'tamil'
stopwords = nltk.corpus.stopwords.words('tamil')

# Function for removing stop words
def stopwords_remove(text_data):
    # Keep only the tokens that are not stop words
    removed = [s for s in text_data if s not in stopwords]
    return removed
  
# Passing tokenized text as parameter for removing stop words
stopwords_removed = stopwords_remove(tokenized_text) 
print(stopwords_removed)

Output:

['வெற்றி', 'பெற', 'பதற்றம்', 'இல்லாமல்', 'இருப்பது', 'வெற்றி', 'பெறுவதற்கான', 'சிறந்த', 'வழி']

Explanation: The stop words "வேண்டும்", "என்ற", and "தான்" are removed.
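The three steps can also be chained into a single function. The sketch below is illustrative only: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, and the stop word set contains just the three words reported as removed above, not the full Tamil list.

```python
import string

# Illustrative stop words only: the three words the article reports as removed.
DEMO_STOPWORDS = {"வேண்டும்", "என்ற", "தான்"}

def preprocess(text_data, stopwords=DEMO_STOPWORDS):
    # Step 1: remove ASCII punctuation character by character.
    no_punct = "".join(ch for ch in text_data if ch not in string.punctuation)
    # Step 2: tokenize on whitespace.
    tokens = no_punct.split()
    # Step 3: drop tokens that are stop words.
    return [tok for tok in tokens if tok not in stopwords]

sentence = "வெற்றி *பெற வேண்டும், என்ற பதற்றம் ^இல்லாமல் _இருப்பது தான் 'வெற்றி பெறுவதற்கான சிறந்த வழி."
result = preprocess(sentence)
print(result)
```

Passing the stop words as a parameter makes it possible to substitute the full list loaded through NLTK without changing the function.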