How to convert categorical string data to numeric in Python?

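A dataset can contain both numerical and categorical features. Categorical features are usually stored as strings, which are easy for humans to read. However, a machine cannot interpret categorical strings directly, so categorical data must be converted to numerical data before further processing.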

There are many ways to convert categorical data to numerical data. This article discusses two of the most commonly used methods:

Dummy variable encoding
Label encoding

Both methods use the same dataset, which the code below loads from data.csv.

Method 1: Dummy variable encoding

We will use the pandas.get_dummies function to convert categorical string data to numbers.

Syntax:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Parameters:

data : pandas Series or DataFrame. The data to encode.
prefix : str, list of str, or dict of str, default None. String to prepend to the names of the new columns.
prefix_sep : str, default '_'. Separator placed between the prefix and the category value.
dummy_na : bool, default False. Add a column to indicate NaNs; if False, NaNs are ignored.
columns : list-like, default None. Column names in the DataFrame to be encoded.
sparse : bool, default False. Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
drop_first : bool, default False. Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype : dtype, default None. Data type of the new columns. When it is not given, pandas versions before 2.0 use np.uint8 and pandas 2.0 and later use bool.

Returns: DataFrame
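The parameters above can be tried on a small example. The following is a minimal sketch, assuming a made-up four-row DataFrame (the Age values and the variable names encoded and reduced are illustrative and do not come from the article's dataset). It passes dtype=int explicitly so that the result does not depend on the installed pandas version.

```python
import pandas as pd

# Hypothetical sample data, not the article's dataset.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# Encode only the 'Purchased' column. Each new column is named
# <prefix><prefix_sep><category>, e.g. 'Purchased_No'.
encoded = pd.get_dummies(
    df, columns=["Purchased"], prefix="Purchased", prefix_sep="_", dtype=int
)
print(encoded)

# drop_first=True keeps k-1 indicator columns for k categories,
# so the first category ('No') gets no column of its own.
reduced = pd.get_dummies(df, columns=["Purchased"], drop_first=True, dtype=int)
print(reduced)
```

When columns is given, get_dummies removes the original column and appends the indicator columns itself, so no separate concat or drop step is needed.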


Step-by-step implementation

Step 1: Import the library

Python3

# importing pandas as pd
import pandas as pd

Step 2: Import the data

Python3

# importing data using .read_csv() function
df = pd.read_csv('data.csv')
 
# printing DataFrame
df

Output: the contents of data.csv displayed as a DataFrame.

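Step 3: Convert the categorical column to numerical values

We will convert the 'Purchased' column from a categorical data type to a numerical data type.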

Python3

# using .get_dummies function to convert
# the categorical datatype to numerical
# and storing the returned dataFrame
# in a new variable df1
df1 = pd.get_dummies(df['Purchased'])
 
# using pd.concat to concatenate the dataframes
# df and df1 and storing the concatenated
# dataFrame in df.
df = pd.concat([df, df1], axis=1)
 
# removing the column 'Purchased' from df
# as it is of no use now.
df.drop('Purchased', axis=1, inplace=True)
 
# printing df
df

Output: the DataFrame with 'Purchased' replaced by one indicator column per category.
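The three steps of Method 1 can be run end to end without the CSV file. This is a minimal sketch under stated assumptions: the CSV text, its Country and Age columns, and the variable names csv_text, dummies and result are invented for illustration and stand in for data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv, kept in memory so the example runs anywhere.
csv_text = """Country,Age,Purchased
France,44,No
Spain,27,Yes
Germany,30,No
Spain,38,No
"""
df = pd.read_csv(io.StringIO(csv_text))

# One indicator column per category of 'Purchased'.
# dtype=int gives 0/1 values regardless of the pandas version.
dummies = pd.get_dummies(df["Purchased"], dtype=int)

# Drop the original string column and attach the indicator columns.
result = pd.concat([df.drop(columns="Purchased"), dummies], axis=1)
print(result)
```

The indicator columns are named after the category values themselves ('No' and 'Yes' here), because no prefix is applied when a single Series is encoded.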

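Method 2: Label encoding

We will use the LabelEncoder class from the sklearn.preprocessing module to convert categorical data to numerical data, calling its fit_transform() method to do so.

Syntax: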

fit_transform(y)

Parameters:

y : array-like of shape (n_samples,). Target values.

Returns: array-like of shape (n_samples,). Encoded labels.
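To see what fit_transform returns, it helps to inspect the fitted encoder as well. The sketch below assumes a short made-up list of labels (values, mapping and restored are illustrative names). It uses the classes_ attribute and the inverse_transform method of LabelEncoder to show which integer was assigned to which string.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not the article's dataset.
values = ["No", "Yes", "No", "Yes", "Yes"]

le = LabelEncoder()
codes = le.fit_transform(values)  # NumPy array of integer codes
print(codes)

# classes_ holds the unique labels in sorted order;
# the position of a label in classes_ is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original strings.
restored = le.inverse_transform(codes)
print(restored)
```

Because the codes follow the sorted order of the labels, they carry no meaning beyond identifying the category.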


Step-by-step implementation

Step 1: Import the library

Python3

# importing pandas as pd
import pandas as pd

Step 2: Import the data

Python3

# importing data using .read_csv() function
df = pd.read_csv('data.csv')

# printing DataFrame
df

Output: the contents of data.csv displayed as a DataFrame.

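Step 3: Convert the categorical column to numerical values

We will convert the 'Purchased' column from a categorical data type to a numerical data type.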

Python3

# Importing LabelEncoder from Sklearn
# library from preprocessing Module.
from sklearn.preprocessing import LabelEncoder
 
# Creating an instance of LabelEncoder.
le = LabelEncoder()
 
# Using .fit_transform function to fit label
# encoder and return encoded label
label = le.fit_transform(df['Purchased'])
 
# printing label
label

Output:

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

Step 4: Append the label array to the DataFrame

Python3

# removing the column 'Purchased' from df
# as it is of no use now.
df.drop("Purchased", axis=1, inplace=True)
 
# Appending the array to our dataFrame
# with column name 'Purchased'
df["Purchased"] = label
 
# printing Dataframe
df

Output: the DataFrame with 'Purchased' now holding the integer labels.
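The four steps of Method 2 can also be run end to end on in-memory data. This is a minimal sketch, assuming an invented CSV text (csv_text, with made-up Country and Age columns) in place of data.csv. It assigns the encoded array directly to the existing column instead of dropping the column first, which is a small simplification and not the article's exact sequence.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for data.csv, kept in memory so the example runs anywhere.
csv_text = """Country,Age,Purchased
France,44,No
Spain,27,Yes
Germany,30,No
Spain,38,No
"""
df = pd.read_csv(io.StringIO(csv_text))

# Fit the encoder on the string column and overwrite the column
# with the integer codes in a single assignment.
le = LabelEncoder()
df["Purchased"] = le.fit_transform(df["Purchased"])
print(df)

# Keep the fitted encoder around: le.classes_ records which string
# each integer code stands for.
print(list(le.classes_))
```

Overwriting the column keeps it in its original position, whereas dropping and re-adding it moves it to the end of the DataFrame.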