How to preprocess string data in a Pandas DataFrame?

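Sometimes the data we are working with is packed into a single column, but to process it we need it spread across separate columns, each with its own data type. When all the data is combined into one string, that string has to be preprocessed. This article covers preprocessing string data in a Pandas DataFrame.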

Method 1: Using the pandas Series.str.extract() function:

Syntax:

Series.str.extract(pat, flags=0, expand=True)

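Parameters:

pat: regular expression pattern with capture groups; each group becomes one extracted column.
flags: int, default 0 (no flags). Flags from the re module, such as re.IGNORECASE.
expand: if True, returns a DataFrame with one column per capture group. If False, returns a Series (or Index) when there is one capture group and a DataFrame when there are several.

Returns:

A DataFrame, or a Series (or Index), depending on expand and the number of capture groups.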
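The effect of expand and of the number of capture groups can be checked on a tiny Series. The following is a minimal sketch; the sample values and the group names letter and digit are made up for illustration and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical sample values, not taken from the article's dataset
s = pd.Series(['a1', 'b2', 'c3'])

# One capture group, expand=True (the default): a one-column DataFrame
as_frame = s.str.extract(r'([ab])', expand=True)

# One capture group, expand=False: a Series
as_series = s.str.extract(r'([ab])', expand=False)

# Two named groups: one column per group, named after the groups
both = s.str.extract(r'(?P<letter>[a-z])(?P<digit>\d)')

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(both.columns.tolist())
```

Rows where the pattern does not match (here 'c3' for the pattern ([ab])) come back as missing values rather than raising an error.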


Step 1: Import the package

The pandas package is imported.

Python3

# import packages
import pandas as pd

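Step 2: Create the DataFrame:

The pd.DataFrame() method is used to create a DataFrame from the given dictionary. We create a DataFrame that needs preprocessing: initially, all the data sits in a single column as strings.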

Python3

# creating data
data = {'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                      'Beijing 14.0 2020-01-22 17:00:00',
                      'Washington 1.0 2020-01-24 17:00:00',
                      'Victoria 3.0 2020-01-31 23:59:00',
                      'Macau 10.0 2020-02-06 14:23:04']}
  
# creating a pandas DataFrame
dataset = pd.DataFrame(data)

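str.extract() takes a regex string, along with other parameters, and extracts the data into columns. '(....-..-.. ..:..:..)' extracts the timestamp in the yyyy-mm-dd hh:mm:ss format, the format used by datetime objects. Each '.' matches any single character, so the pattern relies only on the positions of the hyphens, the space and the colons.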

Python3

dataset['LastUpdated'] = dataset['CovidData'].str.extract(
    '(....-..-.. ..:..:..)', expand=True)
dataset['LastUpdated']

Output:

str.extract() is given the regex string '([A-Za-z]+)'. It extracts the first run of letters in each string, which here is the state name.

Python3

dataset['State'] = dataset['CovidData'].str.extract('([A-Za-z]+)', expand=True)
dataset['State']

Output:

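r'(\d+\.\d)' is used to match decimal numbers: \d+ matches one or more digits before the decimal point, \. matches the decimal point itself, and \d matches one digit after it. For example: 12.1, 3.5, and so on. The dot is escaped because an unescaped '.' matches any character, and the pattern is written as a raw string so that Python does not treat the backslashes as string escapes.

Python3

# raw string with an escaped dot, so only a literal decimal point matches
dataset['confirmed_cases'] = dataset['CovidData'].str.extract(
    r'(\d+\.\d)', expand=True)
dataset['confirmed_cases']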

Output:

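Before preprocessing, the DataFrame holds only the CovidData column. After preprocessing, it also holds the LastUpdated, State and confirmed_cases columns.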
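The columns produced by str.extract() still hold strings, while the goal stated at the start is columns with different data types. The following is a minimal sketch of one way to get there: it pulls all three fields out in a single pass with named capture groups and then converts the types. The combined pattern, the stricter \d-based date expression and the variable name parts are assumptions made for this illustration, not the article's own code.

```python
import pandas as pd

# Same sample rows as in the article
dataset = pd.DataFrame({'CovidData': [
    'Anhui 1.0 2020-01-22 17:00:00',
    'Beijing 14.0 2020-01-22 17:00:00',
    'Washington 1.0 2020-01-24 17:00:00',
    'Victoria 3.0 2020-01-31 23:59:00',
    'Macau 10.0 2020-02-06 14:23:04']})

# One pattern with three named groups; each group becomes a column
pattern = (r'(?P<State>[A-Za-z]+) '
           r'(?P<confirmed_cases>\d+\.\d+) '
           r'(?P<LastUpdated>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')
parts = dataset['CovidData'].str.extract(pattern)

# The extracted values are strings; convert them to float and datetime
parts['confirmed_cases'] = parts['confirmed_cases'].astype(float)
parts['LastUpdated'] = pd.to_datetime(parts['LastUpdated'],
                                      format='%Y-%m-%d %H:%M:%S')

print(parts.dtypes)
```

A single pattern keeps the three fields aligned with each other, at the cost of a row producing missing values in every column if any one field fails to match.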

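Method 2: Using the apply() function

In this method, we preprocess a dataset of movie reviews, the Rotten Tomatoes dataset. The pandas, re and stop_words packages are imported. We store the stop words in a variable named stop_words. The dataset is loaded with pd.read_csv(). We use the apply() method to preprocess the string data: str.lower converts all the text to lower case, re.sub(r'[^\w\s]', '', x) removes punctuation, and finally we remove the stop words from the text. Because the CSV file is large, only a part of the data is displayed to show the difference.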
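The three steps described above can be tried without the CSV file or the stop_words package. The following is a minimal sketch under stated assumptions: the two sample phrases, the four-word stop word set and the helper name clean are all hypothetical stand-ins chosen for illustration.

```python
import re

import pandas as pd

# Hypothetical sample phrases standing in for the Phrase column of the CSV file
data = pd.DataFrame({'Phrase': ['An intermittently pleasing film!',
                                'The acting is, at best, mediocre.']})

# Small illustrative stop word set; the article uses get_stop_words('en')
stop_words = {'an', 'the', 'is', 'at'}


def clean(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return ' '.join(w for w in text.split() if w not in stop_words)


# apply() calls clean() once for every value in the column
data['Phrase'] = data['Phrase'].apply(clean)
print(data['Phrase'].tolist())
```

Grouping the steps into one function means apply() walks the column once instead of three times; the result is the same as chaining three separate apply() calls.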


Python3

# import packages
import re

import pandas as pd
from stop_words import get_stop_words

# stop words, stored in a set for fast membership tests
stop_words = set(get_stop_words('en'))

# reading the csv file
data = pd.read_csv('test.csv')

# rows used to show the effect of the preprocessing
sample = (data['PhraseId'] >= 157139) & (data['PhraseId'] <= 157141)

print('Before string processing : ')
print(data.loc[sample, 'Phrase'])

# converting all text to lower case in the Phrase column
data['Phrase'] = data['Phrase'].apply(str.lower)

# using regex to remove punctuation
data['Phrase'] = data['Phrase'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# removing stop words
data['Phrase'] = data['Phrase'].apply(lambda x: ' '.join(
    w for w in x.split() if w not in stop_words))

print('After string processing : ')
print(data.loc[sample, 'Phrase'])

Output: