📜  Normalizing Text Data with Python

📅  Last modified: 2022-05-13 01:55:20.766000             🧑  Author: Mango

In this article, we will learn how to normalize text data using Python. Let's first go over a few concepts:

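  • Text data is systematically collected material consisting of written, printed, or electronically published words, usually either written for a purpose or transcribed from speech.
  • Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, because the input is guaranteed to be consistent before any operation is performed on it. Text normalization requires knowing what type of text is being normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.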

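Steps Required

Here, we will discuss some basic steps needed for text normalization.

  • Take a text string as input,
  • Convert all letters of the string to one case (lower or upper),
  • Convert numbers to words if they matter for the analysis; otherwise remove them,
  • Remove punctuation and other grammatical marks,
  • Remove white space,
  • Remove stop words,
  • And perform any other processing that is needed.

We will normalize text by following the steps above. Each step can be done in more than one way, so we will go through each part of the process in turn.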
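
The steps above can be sketched as a small pipeline of functions applied in order. This is only an illustrative sketch: the helper names (`to_lower`, `drop_digits`, `drop_punctuation`, `squeeze_spaces`, `normalize`) and the `STEPS` list are made up for this example, and stop-word removal is left out here.

```python
import re

def to_lower(text):
    # step: convert every letter to lower case
    return text.lower()

def drop_digits(text):
    # step: remove runs of digits
    return re.sub(r'\d+', '', text)

def drop_punctuation(text):
    # step: remove everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', text)

def squeeze_spaces(text):
    # step: trim the ends and collapse inner runs of whitespace
    return ' '.join(text.split())

# hypothetical ordering of the steps; reorder or extend as the task requires
STEPS = [to_lower, drop_digits, drop_punctuation, squeeze_spaces]

def normalize(text, steps=STEPS):
    # apply each step to the result of the previous one
    for step in steps:
        text = step(text)
    return text

print(normalize("  Python 3.0, released in 2008!  "))   # python released in
```

Keeping each step as its own function makes it easy to drop or swap a step when a particular kind of text needs different treatment.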

Text String

Python3
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)



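Case Conversion (Lower Case)

In Python, lower() is a built-in method for string handling. It returns a copy of the given string in which all uppercase characters are converted to lowercase. If the string contains no uppercase characters, the result is equal to the original string.

Python3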

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
print(lower_string)

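Output:

       python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows).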
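
For text that is not purely English, `str.casefold()` is a more aggressive alternative to `lower()` that is intended for caseless comparison. A small sketch, with sample words chosen only for illustration:

```python
# lower() leaves the German sharp s untouched, casefold() expands it to "ss"
word = "Straße"
lowered = word.lower()
folded = word.casefold()

print(lowered)   # straße
print(folded)    # strasse

# after case folding, the two spellings compare equal
print(folded == "STRASSE".casefold())   # True
```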

Removing Numbers

If numbers are not relevant to your analysis, remove them. Regular expressions are the usual tool for this.

Python3

# import regex
import re
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)

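Output:

       python ., released in , was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows  (and old installers not restricted to -bit windows).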
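
When numbers do matter, they can be converted to words instead of being deleted. The sketch below spells out each digit separately using a hand-written lookup table; `DIGIT_WORDS` and `digits_to_words` are names invented for this example, and a real project would probably want proper number names rather than digit-by-digit spelling.

```python
import re

# hypothetical lookup table for this example
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def digits_to_words(text):
    # replace every run of digits with its digits spelled out, space separated
    def spell(match):
        return ' '.join(DIGIT_WORDS[int(d)] for d in match.group())
    return re.sub(r'\d+', spell, text)

print(digits_to_words("python 3 was released in 2008"))
# python three was released in two zero zero eight
```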

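Removing Punctuation

Punctuation can also be removed with a regular expression. Here, the pattern [^\w\s] matches every character that is neither a word character nor whitespace, and each match is replaced with an empty string.

Python3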

# import regex
import re
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
 
# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)

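Output:

       python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows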
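
One detail of the pattern `[^\w\s]` is that `\w` also matches the underscore, so underscores survive this step. As a possible alternative for ASCII punctuation, the standard library offers `string.punctuation` together with `str.translate`. The sample sentence below is made up for the comparison.

```python
import re
import string

sample = "snake_case, end-of-life (e.g. x[30])!"

# regex approach: keeps letters, digits, underscores, and whitespace
with_regex = re.sub(r'[^\w\s]', '', sample)

# translation table that deletes every ASCII punctuation character, including "_"
table = str.maketrans('', '', string.punctuation)
with_translate = sample.translate(table)

print(with_regex)       # snake_case endoflife eg x30
print(with_translate)   # snakecase endoflife eg x30
```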

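Removing White Space

strip() is a built-in Python string method that returns a copy of the string with leading and trailing characters removed. Called without an argument, it removes whitespace; with a string argument, it removes any of the characters contained in that string.

Python3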

# import regex
import re
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
 
# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
 
# remove leading and trailing whitespace
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)

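Output:

python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows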
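
Note that `strip()` only touches the two ends of the string, so the double spaces left behind where numbers were removed are still there. If inner runs of whitespace should be collapsed as well, one common idiom is to split on whitespace and join with single spaces, sketched here on a shortened sample:

```python
text = "   python  released in  was a major revision   "

# strip() removes only leading and trailing whitespace
stripped = text.strip()
print(repr(stripped))    # 'python  released in  was a major revision'

# split() with no argument splits on any whitespace run and drops empty pieces
collapsed = ' '.join(text.split())
print(repr(collapsed))   # 'python released in was a major revision'
```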

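Removing Stop Words

Stop words are the most common words in a language, such as "the", "a", "on", "is", and "all". These words carry little meaning on their own and are usually removed from text. They can be removed with the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.

Python3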

# download the stop-word corpus
import nltk
nltk.download('stopwords')
 
# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
 
# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'
 
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
 
# remove stopwords and join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
print(no_stpwords_string)

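Output: the set of English stop words, the list of tokens, and then the text with stop words removed. The exact words that are kept may depend on the stop-word list shipped with the installed NLTK data.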
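
Stop-word filtering works with any collection of words, not only NLTK's list. The sketch below uses a tiny hand-picked set (`my_stop_words`, an assumption for this example and much smaller than NLTK's English list) so that it runs without downloading anything.

```python
# tiny illustrative stop-word set; NLTK's English list is much longer
my_stop_words = {"in", "was", "a", "of", "the", "that", "is", "not", "and"}

text = "python released in was a major revision of the language"

# keep only the words that are not in the stop-word set
kept = [word for word in text.split() if word not in my_stop_words]
result = ' '.join(kept)

print(result)   # python released major revision language
```

Storing the stop words in a set rather than a list keeps each membership test cheap, which matters once the text gets long.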

With these steps, we can normalize text data using Python. Below is the complete Python program:

Python3

# import regex
import re
 
# download the stop-word corpus
import nltk
nltk.download('stopwords')
 
# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
 
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
 
# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
 
# remove leading and trailing whitespace
no_wspace_string = no_punc_string.strip()
 
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
 
# remove stopwords and join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
 
# output
print(no_stpwords_string)

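Output: the list of tokens, followed by the text with stop words removed. The exact words that are kept may depend on the stop-word list shipped with the installed NLTK data.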