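Normalizing Text Data with Python
In this article, we will learn how to normalize text data using Python. Let's go over a few concepts first:
- Text data is systematically collected material consisting of written, printed, or electronically published words, typically either purposefully written or transcribed from speech.
- Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, because the input is guaranteed to be consistent before any operation is performed on it. Text normalization requires knowing what type of text is being normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.
Steps Required
Here we discuss some basic steps needed for text normalization.
- Take a text string as input,
- convert all letters of the string to one case (lowercase or uppercase),
- convert numbers to words if they are essential to the analysis, otherwise remove them,
- remove punctuation and other grammatical markers,
- remove whitespace,
- remove stop words,
- and any other processing the task requires.
We carry out text normalization following the steps above, and each step can be done in more than one way. The sections below go through each step in turn.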
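The steps listed above can be sketched as one small function. This is a minimal illustration rather than a definitive implementation: the function name `normalize` and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for this example, and a real project would normally use a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word set used only for illustration.
SAMPLE_STOP_WORDS = {"the", "a", "is", "on", "of", "and"}


def normalize(text, stop_words=SAMPLE_STOP_WORDS):
    """Apply the basic normalization steps in order and return the result."""
    text = text.lower()                   # one case for all letters
    text = re.sub(r"\d+", "", text)       # drop numbers
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    tokens = text.split()                 # split on any run of whitespace
    return " ".join(t for t in tokens if t not in stop_words)


print(normalize("  The cat sat on 2 mats, and the dog is on 1!  "))
# cat sat mats dog
```

Splitting and re-joining the tokens also takes care of leading, trailing, and repeated whitespace in one pass.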
Text String
Python3
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)
Output:
" Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
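Case Conversion (Lowercase)
In Python, lower() is a built-in method for string handling. It returns a copy of the given string with all uppercase characters converted to lowercase. If the string contains no uppercase characters, the result is identical to the original string.
Python3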
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
print(lower_string)
Output:
" python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows)."
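For text that is not purely English, the built-in str.casefold() method is a more aggressive alternative to lower(). The short sketch below uses a made-up list of spellings to show the difference; it illustrates the idea and is not part of the original walkthrough.

```python
# Three spellings of the same German word (sample data for illustration).
words = ["Straße", "STRASSE", "strasse"]

lowered = {w.lower() for w in words}     # lower() keeps the "ß" character
folded = {w.casefold() for w in words}   # casefold() maps "ß" to "ss"

print(len(lowered))  # 2
print(folded)        # {'strasse'}
```

With lower(), the three spellings still normalize to two different strings, while casefold() brings them to a single form.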
Removing Numbers
Remove numbers if they are not relevant to your analysis. Regular expressions are typically used to remove them.
Python3
# import regex
import re
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)
Output:
" python ., released in , was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows (and old installers not restricted to -bit windows)."
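When numbers do matter, the other option mentioned in the steps is to turn them into words instead of deleting them. The sketch below is a simplified, assumption-laden version: the helper name `spell_out_digits` and the `DIGIT_WORDS` table are invented for this example, and it spells each digit separately rather than producing full number names such as "two thousand eight".

```python
import re

# Lookup table indexed by digit value (hypothetical helper data).
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_out_digits(text):
    """Replace every run of ASCII digits with the digits spelled out one by one."""
    def replace(match):
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())
    return re.sub(r"[0-9]+", replace, text)


print(spell_out_digits("python 3 was released in 2008"))
# python three was released in two zero zero eight
```

Passing a function as the replacement argument of re.sub() lets each match be rewritten individually.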
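Removing Punctuation
Punctuation can also be removed with regular expressions. Here we use a regular expression to replace every punctuation character with an empty string.
Python3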
# import regex
import re
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)
Output:
' python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
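One detail of the pattern `[^\w\s]` is that `\w` also matches the underscore, so underscores survive the substitution. As a hedged comparison, the sketch below sets the regular expression next to a str.translate() approach built from string.punctuation; the sample sentence is made up for this example.

```python
import re
import string

sample = "snake_case, it's end-of-life!"

# The regular expression keeps word characters (including "_") and whitespace.
regex_result = re.sub(r"[^\w\s]", "", sample)

# str.translate() deletes every character listed in string.punctuation,
# which is ASCII-only and does include the underscore.
table = str.maketrans("", "", string.punctuation)
translate_result = sample.translate(table)

print(regex_result)      # snake_case its endoflife
print(translate_result)  # snakecase its endoflife
```

Which variant fits better depends on whether underscores and non-ASCII punctuation, such as curly quotes, should be kept or removed.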
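Removing Whitespace
strip() is a built-in string method in Python that returns a copy of the string with leading and trailing characters removed (based on the string argument passed). Called without an argument, it removes leading and trailing whitespace; whitespace inside the string is left unchanged.
Python3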
# import regex
import re
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
# strip leading and trailing whitespace
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)
Output:
'python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
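Because strip() only touches the two ends of a string, runs of spaces left behind inside the text (for example, where a number was removed) stay in place. A common way to collapse them, shown here as a small sketch on a made-up string, is to split on whitespace and join the pieces with a single space.

```python
text = "  python  released in   was a major revision  "

stripped = text.strip()              # only the ends are trimmed
collapsed = " ".join(text.split())   # ends trimmed and inner runs collapsed

print(repr(stripped))   # 'python  released in   was a major revision'
print(repr(collapsed))  # 'python released in was a major revision'
```

Printing with repr() makes the remaining spaces visible.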
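Removing Stop Words
Stop words are the most common words in a language, such as "the", "a", "on", "is", and "all". These words carry little meaning on their own and are usually removed from text. Stop words can be removed with the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
Python3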
# download the NLTK stop word list
import nltk
nltk.download('stopwords')
# import the stop word corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
# assign string
no_wspace_string='python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
# remove stop words
no_stpwords_string=""
for i in lst_string:
    if i not in stop_words:
        no_stpwords_string += i+' '
# remove the trailing space
no_stpwords_string = no_stpwords_string[:-1]
print(no_stpwords_string)
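A stop-word list is rarely one-size-fits-all. Words such as "not" appear in many lists, yet they can change the meaning of a sentence. The sketch below shows one way to adjust a list before filtering; the helper name `remove_stop_words` and the small `base_stop_words` set are assumptions for this example and stand in for whatever list is actually loaded.

```python
def remove_stop_words(text, stop_words):
    """Return text with every word found in stop_words removed."""
    return " ".join(w for w in text.split() if w not in stop_words)


# Hypothetical sample list standing in for a real stop-word list.
base_stop_words = {"is", "not", "and", "the", "on"}

# Keep negations by subtracting them from the set before filtering.
custom_stop_words = base_stop_words - {"not"}

sentence = "python code does not run unmodified on python"
print(remove_stop_words(sentence, base_stop_words))
# python code does run unmodified python
print(remove_stop_words(sentence, custom_stop_words))
# python code does not run unmodified python
```

Building the result with a generator expression and join() also avoids the trailing space that string concatenation in a loop leaves behind.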
With these steps, we can normalize text data using Python. Below is the complete Python program:
Python3
# import regex
import re
# download the NLTK stop word list
import nltk
nltk.download('stopwords')
# import the stop word corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
# strip leading and trailing whitespace
no_wspace_string = no_punc_string.strip()
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
# remove stop words
no_stpwords_string=""
for i in lst_string:
    if i not in stop_words:
        no_stpwords_string += i+' '
# remove the trailing space
no_stpwords_string = no_stpwords_string[:-1]
# output
print(no_stpwords_string)
Output: