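How to convert categorical string data into numeric in Python?
Datasets usually contain both numerical and categorical features. Categorical features are typically stored as strings, which are easy for humans to read. Most machine learning algorithms, however, cannot work on string values directly, so categorical data has to be converted to numerical data before further processing.
There are many ways to convert categorical data to numerical data. In this article, we discuss two of the most commonly used methods:
- Dummy variable encoding
- Label encoding
Both methods use the same dataset, which the examples load from data.csv.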
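Method 1: Dummy variable encoding
We will use the pandas.get_dummies function to convert the categorical string data into numeric data. It creates one indicator (0/1) column for each distinct category.
Syntax:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)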
Parameters :
- data : Pandas Series, or DataFrame
- prefix : str, list of str, or dict of str, default None. String to prepend to the names of the new columns.
- prefix_sep : str, default '_'. Separator/delimiter placed between the prefix and the category name.
- dummy_na : bool, default False. Add a column to indicate NaNs, if False NaNs are ignored.
- columns : list-like, default None. Column names in the DataFrame to be encoded.
- sparse : bool, default False. Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
- drop_first : bool, default False. Whether to get k-1 dummies out of k categorical levels by removing the first level.
- dtype : dtype, default None. Data type for the new columns. When left as None, pandas 2.0 and later produce bool columns; earlier versions produced np.uint8.
Returns : DataFrame
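To make the prefix, drop_first, and dtype parameters concrete, here is a minimal sketch on a small made-up column. The toy DataFrame and its "Color" values are assumptions for illustration only and are not part of the article's dataset; dtype=int is passed so the result is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset
toy = pd.DataFrame({"Color": ["red", "green", "blue", "green"]})

# One indicator column per category, named <prefix><prefix_sep><category>
full = pd.get_dummies(toy["Color"], prefix="Color", dtype=int)

# drop_first=True keeps k-1 columns: the first level ("blue") is dropped
reduced = pd.get_dummies(toy["Color"], prefix="Color", drop_first=True, dtype=int)

print(full)
print(reduced)
```

With drop_first=True, a row of all zeros stands for the dropped level, which avoids redundant columns when the encoded data feeds a linear model.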
Step-by-step implementation
Step 1: Import the library
Python3
# importing pandas as pd
import pandas as pd
Step 2: Import the data
Python3
# importing data using .read_csv() function
df = pd.read_csv('data.csv')
# printing DataFrame
df
Output: the contents of data.csv displayed as a DataFrame.
Step 3: Convert the categorical column to numeric values
We will convert the "Purchased" column from a categorical data type to a numerical one.
Python3
# using .get_dummies function to convert
# the categorical datatype to numerical
# and storing the returned dataFrame
# in a new variable df1
df1 = pd.get_dummies(df['Purchased'])
# using pd.concat to concatenate the dataframes
# df and df1 and storing the concatenated
# dataFrame in df.
df = pd.concat([df, df1], axis=1).reindex(df.index)
# removing the column 'Purchased' from df
# as it is of no use now.
df.drop('Purchased', axis=1, inplace=True)
# printing df
df
Output: the DataFrame with the "Purchased" column replaced by one indicator column per category.
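As an alternative to the concat-and-drop sequence, get_dummies can encode a column and remove the original in a single call through its columns parameter. The sketch below uses a small hypothetical stand-in for data.csv: only the "Purchased" column name comes from the article, while the "Age" column and all values are made up.

```python
import pandas as pd

# Hypothetical stand-in for data.csv; the values are illustrative only
df = pd.DataFrame({
    "Age": [44, 27, 30, 38],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# columns= encodes the listed columns and drops the originals in one step;
# the new columns are prefixed with the source column name
encoded = pd.get_dummies(df, columns=["Purchased"], dtype=int)

print(encoded)
```

The resulting columns are named Purchased_No and Purchased_Yes, which keeps their origin clear when several columns are encoded at once.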
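Method 2: Label encoding
We will use the LabelEncoder class from sklearn.preprocessing to convert the categorical data into numerical data. The conversion itself is done with its fit_transform() method, which assigns each distinct category an integer.
Syntax: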
fit_transform(y)
Parameters :
- y : array-like of shape (n_samples,). Target values.
Returns : array-like of shape (n_samples,). Encoded labels.
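The following minimal sketch shows what fit_transform returns and how the mapping can be inspected afterwards. The label list is hypothetical and is not taken from the article's dataset; classes_ and inverse_transform are existing LabelEncoder members.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not the article's data
y = ["No", "Yes", "No", "Yes", "Yes"]

le = LabelEncoder()
# The distinct labels are sorted, then numbered starting from 0
codes = le.fit_transform(y)

print(codes)                        # integer code for each input label
print(le.classes_)                  # position i holds the label encoded as i
print(le.inverse_transform(codes))  # maps the codes back to the original strings
```

Because the codes follow sorted order, they carry no meaning beyond identity; treating them as ordered quantities is only appropriate when the categories really are ordinal.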
Step-by-step implementation
Step 1: Import the library
Python3
# importing pandas as pd
import pandas as pd
Step 2: Import the data
Python3
# importing data using .read_csv() function
df = pd.read_csv('data.csv')
# printing DataFrame
df
Output: the contents of data.csv displayed as a DataFrame.
Step 3: Convert the categorical column to numeric values
We will convert the "Purchased" column from a categorical data type to a numerical one.
Python3
# Importing LabelEncoder from the
# preprocessing module of scikit-learn.
from sklearn.preprocessing import LabelEncoder
# Creating an instance of LabelEncoder.
le = LabelEncoder()
# Using .fit_transform function to fit label
# encoder and return encoded label
label = le.fit_transform(df['Purchased'])
# printing label
label
Output:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
Step 4: Append the label array to our DataFrame
Python3
# removing the column 'Purchased' from df
# as it is of no use now.
df.drop("Purchased", axis=1, inplace=True)
# Appending the array to our dataFrame
# with column name 'Purchased'
df["Purchased"] = label
# printing Dataframe
df
Output: the DataFrame with the "Purchased" column now holding the integer labels.
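Dropping the column and appending it again moves "Purchased" to the end of the DataFrame. If the original column order matters, one possible variation is to assign the encoded labels directly over the existing column. The sketch below assumes a small hypothetical DataFrame in place of data.csv; only the "Purchased" column name comes from the article.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for data.csv; the values are illustrative only
df = pd.DataFrame({
    "Purchased": ["No", "Yes", "No"],
    "Age": [44, 27, 30],
})

le = LabelEncoder()
# Overwriting in place keeps the column at its original position
df["Purchased"] = le.fit_transform(df["Purchased"])

print(df)
```

Keeping the fitted encoder in a variable (le here) makes it possible to decode the integers later with le.inverse_transform.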