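Extracting Punctuation from a Specified Column of a DataFrame Using Regex
Prerequisite: Regular Expressions in Python
In this article, we will see how to use regex to extract the punctuation used in a specified column of a DataFrame.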
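First, we build a regular expression character class that lists the punctuation characters to match: [!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~] (note that this class does not include < or >). Next, we pass each row of the chosen column to re.findall() to extract the punctuation, and then assign the extracted punctuation to a new column in the DataFrame.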
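re.findall() extracts every substring of a string that matches a pattern. It scans the string from left to right and returns the matches in the order found.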
Syntax: re.findall(regex, string)
Return: All non-overlapping matches of pattern in string, as a list of strings.
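To make the behaviour of re.findall() concrete, here is a minimal sketch on a single string. The names PUNCT_CLASS and sample are illustrative assumptions and are not part of the article's code. It also shows why a trailing * on the character class produces empty-string matches that have to be filtered out or joined away.

```python
import re

# Character class covering the punctuation characters used in this article.
PUNCT_CLASS = r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]'

# Hypothetical sample text, chosen only for illustration.
sample = "Hey! Akash, how r u?"

# Without a quantifier, each match is a single punctuation character.
single_matches = re.findall(PUNCT_CLASS, sample)

# With the * quantifier the pattern can also match the empty string, so the
# result contains '' at every position where no punctuation is found.
star_matches = re.findall(PUNCT_CLASS + '*', sample)

print(single_matches)
print([m for m in star_matches if m])  # drop the empty matches
```

Both print statements show the same three punctuation characters. The version without the quantifier avoids the extra filtering step.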
Now, let's create a DataFrame:
Python3
# import required libraries
import pandas as pd
import re

# create a DataFrame with names
# and their comments
df = pd.DataFrame({
    'Name': ['Akash', 'Ashish', 'Ayush',
             'Diksha', 'Radhika'],
    'Comments': ['Hey! Akash how r u',
                 'Why are you asking this to me?',
                 'Today, what we are going to do.',
                 'No plans for today why?',
                 'Wedding plans, what are you saying?']},
    columns=['Name', 'Comments']
)

# show the DataFrame
df
Output:
Now, extract the punctuation from the Comments column:
Python3
# define a function for extracting
# the punctuation
def check_find_punctuations(text):
    # character class listing the punctuation to match; without a
    # quantifier, each match is exactly one punctuation character
    pattern = r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]'

    # list of punctuation characters, in order of appearance
    return re.findall(pattern, text)

# create a new column named punctuation_used by applying
# the function to each row of the Comments column
df['punctuation_used'] = df['Comments'].apply(check_find_punctuations)

# show the DataFrame
df
Output:
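As an alternative to apply() with a user-defined function, the same extraction can be sketched with the pandas string accessor. This is a sketch under two assumptions that are not in the original article: string.punctuation (all ASCII punctuation, including < and >) is used in place of the hand-written character class, and demo is a small illustrative DataFrame.

```python
import re
import string

import pandas as pd

# Assumption: string.punctuation is an acceptable stand-in for the
# hand-written character class; re.escape makes it safe inside [...].
punct_pattern = "[" + re.escape(string.punctuation) + "]"

# Small illustrative DataFrame reusing two comments from the article.
demo = pd.DataFrame({
    "Name": ["Akash", "Ayush"],
    "Comments": ["Hey! Akash how r u", "Today, what we are going to do."],
})

# Series.str.findall applies re.findall to every element of the column.
demo["punctuation_used"] = demo["Comments"].str.findall(punct_pattern)

print(demo)
```

Building the pattern from string.punctuation keeps the character list out of the source code and avoids escaping mistakes, at the cost of matching a slightly larger set of characters than the article's pattern.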