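How to handle missing values of categorical variables in Python?
Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. We often come across datasets in which some columns have missing values. This causes problems when we apply a machine learning model to the dataset, and it increases the chance of errors when training the model.
The dataset we use is: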
Python3
# import modules
import pandas as pd
import numpy as np
# assign dataset
df = pd.read_csv("train.csv", header=None)
df.head()
Counting the missing data:
Python3
# count the zeros in each column
# (in this dataset a 0 marks a missing value)
cnt_missing = (df[[1, 2, 3, 4, 5, 6, 7, 8]] == 0).sum()
print(cnt_missing)
We see that data is missing in columns 1, 2, 3, 4 and 5. We now replace all 0 values in those columns with NaN.
Python
from numpy import nan
df[[1, 2, 3, 4, 5]] = df[[1, 2, 3, 4, 5]].replace(0, nan)
df.head(10)
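Once the zeros are replaced, the missing entries can be counted directly as NaN rather than as zeros. The following is a minimal sketch on a small hypothetical DataFrame named `toy` (it is not the article's train.csv, and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 stands in for "missing" in columns 1 and 2
toy = pd.DataFrame({0: [6, 1, 8, 1], 1: [148, 0, 183, 89], 2: [72, 66, 0, 0]})

# Replace the placeholder zeros with NaN, only in the affected columns
toy[[1, 2]] = toy[[1, 2]].replace(0, np.nan)

# isna() marks NaN cells; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Column 0 is left untouched on purpose, since a 0 there may be a legitimate value rather than a gap.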
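Handling missing data is important, so we address the problem with the following methods. The methods are alternatives: each one is meant to be applied to the data as it stands after the NaN replacement step, not after a previous method has already been run.
Method #1
The first method is simply to drop the rows that have missing data.
Python3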
# printing initial shape
print(df.shape)
df.dropna(inplace=True)
# final shape of the data with
# missing rows removed
print(df.shape)
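The problem with this approach is that when the dataset is small, dropping the rows with missing data makes it smaller still, and machine learning models do not give good results on very small datasets.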
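To get a feel for how much data row deletion can cost, the share of rows lost can be measured before committing to it. This is a minimal sketch on a hypothetical five-row DataFrame `toy`; the column names and values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column has only one or two gaps
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan, 5.0],
})

rows_before = len(toy)
# dropna() without inplace returns a copy, so toy itself is unchanged
rows_after = len(toy.dropna())
fraction_lost = 1 - rows_after / rows_before

print(f"kept {rows_after} of {rows_before} rows ({fraction_lost:.0%} lost)")
```

Even though no single column is missing more than two values here, the gaps fall on different rows, so most rows are discarded.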
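To avoid this problem, the second approach is to impute the missing values. We do this by replacing each missing value with some substitute, such as a random value or the median or mean of the remaining data.
Method #2
We first impute the missing values with the mean of the data.
Python3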
# filling missing values
# with mean column values
df.fillna(df.mean(), inplace=True)
df.sample(10)
We can also do this with the SimpleImputer class. SimpleImputer is a scikit-learn class that helps handle missing data in the dataset of a predictive model. It replaces NaN values with a specified placeholder. It is created by calling the SimpleImputer() constructor, which takes the following parameters:
SimpleImputer(missing_values, strategy, fill_value)
- missing_values : The placeholder for the missing values that have to be imputed. Defaults to NaN.
- strategy : How the replacement for the missing values is chosen. The strategy argument can take the values 'mean' (default), 'median', 'most_frequent' and 'constant'.
- fill_value : The constant value that replaces the missing values when the 'constant' strategy is used.
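The `fill_value` parameter only matters with the 'constant' strategy, which the article does not otherwise demonstrate. As a minimal sketch, assuming a small hypothetical array `X` and an arbitrary sentinel of -1:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array with two missing entries
X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [5.0, 6.0]])

# strategy="constant" ignores the column statistics and
# writes fill_value into every missing cell
constant_imputer = SimpleImputer(missing_values=np.nan,
                                 strategy="constant",
                                 fill_value=-1)
X_filled = constant_imputer.fit_transform(X)
print(X_filled)
```

The sentinel -1 is an illustrative choice; a value that cannot occur in the real data is usually preferable so that imputed cells stay distinguishable.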
Python3
# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer
value = df.values
# defining the imputer
imputer = SimpleImputer(missing_values=nan,
                        strategy='mean')
# transform the dataset
transformed_values = imputer.fit_transform(value)
# count the NaN values that remain after imputation
print("Missing:", isnan(transformed_values).sum())
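Note that `fit_transform` returns a NumPy array, so the column labels of the DataFrame are lost. One way to keep working with a DataFrame is to wrap the result again; the sketch below assumes a hypothetical two-column frame `toy` with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 6.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Rebuild a DataFrame from the returned array, reusing the
# original column labels and index
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns,
                       index=toy.index)
print(imputed)
```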
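Method #3
Next, we impute the missing values with the median of the data. The median is the middle value of a set of data. To determine the median of a sequence of numbers, the numbers must first be arranged in ascending order.
Python3
# filling missing values
# with median column values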
df.fillna(df.median(), inplace=True)
df.head(10)
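Because the median depends only on the ordering of the values, it is less affected by extreme values than the mean. A minimal sketch of the difference, using a hypothetical Series `s` that contains one outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: four typical values, one outlier, one gap
s = pd.Series([10.0, 12.0, 11.0, 13.0, 1000.0, np.nan])

# The mean is pulled toward the outlier, the median is not
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print("mean fill:  ", mean_filled.iloc[-1])
print("median fill:", median_filled.iloc[-1])
```

Here the mean-based fill lands far from the typical values, while the median-based fill stays among them.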
We can also do this with the SimpleImputer class.
Python3
# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer
value = df.values
# defining the imputer
imputer = SimpleImputer(missing_values=nan,
                        strategy='median')
# transform the dataset
transformed_values = imputer.fit_transform(value)
# count the NaN values that remain after imputation
print("Missing:", isnan(transformed_values).sum())
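Method #4
Finally, we impute the missing values with the mode of the data. The mode is the value that occurs most often in a set of observations. For example, the mode of {6, 3, 9, 6, 6, 5, 9, 3} is 6, because it occurs most often. Unlike the mean and the median, the mode is also defined for non-numeric (categorical) columns.
Python3
# filling missing values with the most frequent value of each column
# (df.mode() returns a DataFrame; its first row holds one mode per column)
df.fillna(df.mode().iloc[0], inplace=True)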
df.sample(10)
We can also do this with the SimpleImputer class.
Python3
# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer
value = df.values
# defining the imputer
imputer = SimpleImputer(missing_values=nan,
                        strategy='most_frequent')
# transform the dataset
transformed_values = imputer.fit_transform(value)
# count the NaN values that remain after imputation
print("Missing:", isnan(transformed_values).sum())
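The examples above all use numeric columns, but the most-frequent strategy is the one that carries over to a categorical column. The following is a minimal sketch, assuming a hypothetical string column named `color` (it is not part of the article's dataset); it shows both the pandas route and the SimpleImputer route:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries;
# dtype=object is set explicitly so the gaps are plain NaN
toy = pd.DataFrame(
    {"color": pd.Series(["red", "blue", np.nan, "red", np.nan], dtype=object)}
)

# pandas: mode() ignores NaN; take the first mode and fill with it
mode_value = toy["color"].mode().iloc[0]
filled_pandas = toy["color"].fillna(mode_value)

# scikit-learn: most_frequent also accepts string data
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled_sklearn = imputer.fit_transform(toy[["color"]])

print(filled_pandas.tolist())
print(filled_sklearn.ravel().tolist())
```

The 'mean' and 'median' strategies would not apply here, since those statistics are undefined for string categories.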