Normalizing Textual Data with Python

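In this article, we will learn how to normalize textual data with Python. Let's first go over a few concepts:

Textual data is systematically collected material made up of written, printed, or electronically published words, typically either written on purpose or transcribed from speech.
Text normalization is the process of transforming text into a single canonical form that it may not have had before. Normalizing text before storing or processing it allows for separation of concerns, because the input is guaranteed to be consistent before any operation is performed on it. Text normalization requires knowing what kind of text is being normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.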

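Steps Required

Here, we will discuss the basic steps needed for text normalization.

Take a text string as input,
Convert all letters of the string to one case (either lower or upper case),
Convert numbers to words if they matter for the analysis, otherwise remove all numbers,
Remove punctuation and other grammatical marks,
Remove white space,
Remove stop words,
And any other computations.

We will normalize text by following the steps above, and each step can be done in more than one way. So we will go through each part of the process in turn.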
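The rest of this article removes numbers, but the third step also mentions converting them to words. A minimal sketch of that alternative is shown below. The helper spell_out_digits and the DIGIT_WORDS table are hypothetical names introduced only for this illustration, and the function spells numbers digit by digit rather than producing full number names.

```python
import re

# Hypothetical lookup table for this sketch (not part of the original article)
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_out_digits(text):
    """Replace each run of ASCII digits with its digits spelled out in English."""
    return re.sub(
        r"[0-9]+",
        lambda match: " ".join(DIGIT_WORDS[d] for d in match.group()),
        text,
    )


print(spell_out_digits("python 2 code"))
print(spell_out_digits("released in 2008"))
```

Spelling digit by digit keeps the sketch short; producing names such as "two thousand eight" would need a fuller conversion routine or a dedicated library.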

Text String

Python3

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)

Output:

"       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

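Case Conversion (Lower Case)

In Python, lower() is a built-in string method. It returns a copy of the given string with all uppercase characters converted to lowercase. If the string contains no uppercase characters, the result is equal to the original string.

Python3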

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
print(lower_string)

Output:

"       python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows)."
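For text that is not purely English, lower() may not be enough to make two spellings compare equal. The sketch below contrasts it with str.casefold(), a more aggressive built-in variant intended for caseless comparison. The sample word is an assumption chosen for illustration and does not come from the input string above.

```python
# A string with no uppercase characters comes back unchanged in value
unchanged = "already lower".lower()
print(unchanged)

# Sample word chosen for this sketch: the German sharp s has no uppercase
# form that lower() would touch, but casefold() expands it to "ss"
german = "Straße"
lowered = german.lower()
folded = german.casefold()

print(lowered)
print(folded)
```

For the English sample text used in this article, both methods give the same result.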

Removing Numbers

Remove numbers if they are not relevant to your analysis. Regular expressions are commonly used to remove them.

Python3

# import regex
import re
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)

Output:

"       python ., released in , was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows  (and old installers not restricted to -bit windows)."
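The pattern \d+ removes only the digits, so version numbers such as 3.0 leave a stray dot behind. If punctuation is not going to be stripped afterwards, one possible variation is to match dotted numbers as a whole. The sketch below compares the two patterns on a shortened sample sentence that is an assumption made for this example.

```python
import re

# Shortened sample sentence used only for this comparison
text = "python 3.0, released in 2008, supports python 3.6.x"

# The pattern used in this article: removes digit runs, leaves the dots
digits_only = re.sub(r"\d+", "", text)

# Variation: also consume ".digits" groups so "3.0" disappears as one unit
whole_numbers = re.sub(r"\d+(?:\.\d+)*", "", text)

print(digits_only)
print(whole_numbers)
```

In the pipeline of this article the difference disappears in the next step, because the leftover dots are removed together with the rest of the punctuation.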

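Removing Punctuation

Punctuation can also be removed with a regular expression. Here, we use the pattern [^\w\s] to replace every character that is neither a word character nor white space with an empty string.

Python3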

# import regex
import re
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
 
# remove all punctuation except words and space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)

Output:

"       python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows"
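A regular expression is not the only option here. As a sketch of an alternative, str.translate() with a table built from string.punctuation removes the same ASCII punctuation. One difference worth knowing is that \w counts the underscore as a word character, so the regex keeps it while string.punctuation removes it. The sample text is an assumption chosen to show that difference.

```python
import re
import string

# Sample text chosen for this sketch; it contains an underscore on purpose
text = "end-of-life, e.g. snake_case!"

# Approach used in this article: keep word characters and white space
with_regex = re.sub(r"[^\w\s]", "", text)

# Alternative: delete every character listed in string.punctuation
table = str.maketrans("", "", string.punctuation)
with_translate = text.translate(table)

print(with_regex)
print(with_translate)
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes and similar characters would survive the translate() approach.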

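Removing White Space

strip() is a built-in string method in Python that returns a copy of the string with leading and trailing characters removed (based on the string argument passed). When called without an argument, it removes leading and trailing white space.

Python3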

# import regex
import re
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
 
# remove all punctuation except words and space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
 
# remove white spaces
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)

Output:

"python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows"
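Because strip() only touches the two ends of the string, the double spaces left behind by the removed numbers are still there. If single spacing matters, a small sketch of two common ways to collapse inner white space follows. The sample text is a shortened, assumed version of the string above.

```python
import re

# Shortened sample with leading, trailing, and doubled inner spaces
text = "   python  released in  was a major revision   "

# strip() alone: the ends are cleaned, the inner double spaces remain
stripped = text.strip()

# split() with no argument splits on any run of white space,
# so joining the pieces with one space collapses every run
collapsed = " ".join(text.split())

# The same result with a regular expression
collapsed_re = re.sub(r"\s+", " ", text).strip()

print(repr(stripped))
print(repr(collapsed))
```

The stop word step later in this article calls split() anyway, so the extra spaces do not affect its final result.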

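Removing Stop Words

Stop words are the most common words in a language, such as "the", "a", "on", "is", and "all". These words carry little meaning on their own and are usually removed from texts. Stop words can be removed using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.

Python3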

# download stopwords
import nltk
nltk.download('stopwords')
 
# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
 
# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'
 
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
 
# remove stop words and rejoin the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
print(no_stpwords_string)
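To try the filtering logic without downloading the NLTK corpus, the sketch below uses a tiny hand-written stop word set. SMALL_STOP_WORDS and remove_stop_words are hypothetical names introduced for this example, and the set is far smaller than the NLTK English list, so real results will differ.

```python
# A tiny hand-written stop word set, used only so this sketch runs offline;
# it is an assumption and not the NLTK list
SMALL_STOP_WORDS = {"in", "was", "a", "of", "the", "that", "is", "not", "and"}


def remove_stop_words(text, stop_words):
    """Return text with every word found in stop_words dropped."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)


sample = "python released in was a major revision of the language"
result = remove_stop_words(sample, SMALL_STOP_WORDS)
print(result)
```

Keeping the stop words in a set makes each membership test a constant-time lookup on average, which is why the article converts the NLTK list with set() as well.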

With these steps, we can normalize textual data using Python. Below is the complete Python program:

Python3

# import regex
import re
 
# download stopwords
import nltk
nltk.download('stopwords')
 
# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
 
 
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
 
# convert to lower case
lower_string = string.lower()
 
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
 
# remove all punctuation except words and space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
 
# remove white spaces
no_wspace_string = no_punc_string.strip()
 
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
 
# remove stop words and rejoin the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
 
# output
print(no_stpwords_string)
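The same steps can also be collected into one reusable function. The sketch below is one possible arrangement, not the only one: normalize_text is a hypothetical name, and the stop words are passed in as a parameter so that the function works with the NLTK set or with any other collection of words.

```python
import re


def normalize_text(text, stop_words=frozenset()):
    """Lowercase, drop digits and punctuation, then drop stop words."""
    text = text.lower()                   # case conversion
    text = re.sub(r"\d+", "", text)       # remove numbers
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    words = text.split()                  # also discards surrounding white space
    return " ".join(word for word in words if word not in stop_words)


# The stop word set here is a one-word stand-in chosen for this demo
demo = normalize_text("  Python 3.0, released in 2008!  ", stop_words={"in"})
print(demo)
```

Because split() already ignores leading, trailing, and repeated white space, the function does not need a separate strip() call.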