How to handle missing values of categorical variables in Python?

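Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. We often encounter datasets in which some values are missing from the columns. This causes problems when we apply a machine learning model to the dataset, and it increases the chance of error when we train the model.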

The dataset we use is:

Python3

# import modules
import pandas as pd
import numpy as np
  
# load the dataset (the file has no header row)
df = pd.read_csv("train.csv", header=None)
df.head()

Counting the missing data:

Python3

# counting the zero values in each column
# (in this dataset a 0 marks a missing value)
cnt_missing = (df[[1, 2, 3, 4, 
                   5, 6, 7, 8]] == 0).sum()
print(cnt_missing)

We see that columns 1, 2, 3, 4 and 5 have missing data. Now we replace all 0 values in those columns with NaN.

Python

from numpy import nan
df[[1, 2, 3, 4, 5]] = df[[1, 2, 3, 4, 5]].replace(0, nan)
df.head(10)
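
Once the placeholders are NaN, the missing entries can be counted directly rather than by comparing against 0. The following is a minimal sketch on a hypothetical three-column frame (the values and the name `toy` are made up for illustration and are not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 stands in for "missing", as in the article.
toy = pd.DataFrame({1: [148, 0, 183], 2: [72, 66, 0], 3: [35, 0, 0]})

# Replace the 0 placeholders with NaN, then count the NaN values per column.
toy = toy.replace(0, np.nan)
nan_per_column = toy.isna().sum()
print(nan_per_column)
```

`isna().sum()` returns one count per column, which makes it easy to see which columns need attention.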

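Handling missing data is important, so we will address this problem with the following methods:

Method #1

The first method is simply to drop the rows that contain missing data.

Python3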

# printing initial shape
print(df.shape)
df.dropna(inplace=True)
  
# final shape of the data with
# missing rows removed
print(df.shape)

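The problem with this approach is that when the dataset is small, dropping the rows with missing data makes it smaller still, and machine learning models do not give good results on a small dataset.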
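
To get a feel for how quickly rows disappear, it can help to measure the loss before dropping anything. This is a rough sketch on a made-up five-row frame; the column names `a` and `b` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is only partly missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, 2.0, 3.0, np.nan, 5.0],
})

rows_before = len(toy)
rows_after = len(toy.dropna())     # rows with no NaN in any column
missing_share = toy.isna().mean()  # fraction of NaN per column

print(f"{rows_before} rows before, {rows_after} rows after dropna()")
print(missing_share)
```

Even though no column is more than 40% missing here, fewer than half of the rows survive, because a row is dropped if any one of its values is NaN.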

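To avoid this problem we have a second approach: imputing the missing values. We do this by replacing each missing value with some random value or with the median/mean of the rest of the data.

Method #2

We start by imputing the missing values with the mean of the data.

Python3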

# filling missing values 
# with mean column values
df.fillna(df.mean(), inplace=True)
df.sample(10)
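
Mean imputation only makes sense for numeric columns. The sketch below assumes a hypothetical frame that mixes a numeric column `x` with a text column `label`; passing `numeric_only=True` restricts the mean to the numeric column, which recent pandas versions require when non-numeric columns are present:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "label": ["a", "b", None]})

# Column means computed over numeric columns only (here: just "x").
column_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its name;
# columns missing from the Series ("label") are left untouched.
filled = toy.fillna(column_means)
print(filled)
```

The text column keeps its missing entry, so it still needs a different strategy such as the mode.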

We can also do this with the SimpleImputer class. SimpleImputer is a scikit-learn class that helps handle missing data in the dataset of a predictive model. It replaces NaN values with a specified placeholder. It is used by calling the SimpleImputer() constructor, which takes the following parameters:

SimpleImputer(missing_values, strategy, fill_value)

missing_values : The placeholder for the missing values that have to be imputed. The default is NaN.
strategy : How the replacement for the missing values is computed. The strategy argument can take the values 'mean' (default), 'median', 'most_frequent' and 'constant'.
fill_value : The constant value that replaces the missing values when the 'constant' strategy is used.

Python3

# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer

value = df.values

# defining the imputer
imputer = SimpleImputer(missing_values=nan,
                        strategy='mean')

# fit the imputer and transform the dataset
transformed_values = imputer.fit_transform(value)

# count the total number of NaN values left after imputation
print("Missing:", isnan(transformed_values).sum())
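
The 'constant' strategy and the fill_value parameter listed above are not exercised by the examples in this article, so here is a small sketch of how they might be used. The array `values` and the sentinel -1.0 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with two missing entries.
values = np.array([[1.0, np.nan],
                   [np.nan, 4.0],
                   [5.0, 6.0]])

# Replace every NaN with the same constant instead of a column statistic.
imputer = SimpleImputer(missing_values=np.nan,
                        strategy="constant",
                        fill_value=-1.0)
filled = imputer.fit_transform(values)
print(filled)
```

A sentinel such as -1.0 keeps the imputed cells recognizable afterwards, but it only makes sense if that value cannot occur as a real measurement.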

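Method #3

Next we impute the missing values with the median of the data. The median is the middle value of a set of data. To find the median of a sequence of numbers, the numbers must first be arranged in ascending order.

Python3

# filling missing values
# with median column values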
df.fillna(df.median(), inplace=True)
df.head(10)
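
The sort-then-take-the-middle rule can be checked by hand against pandas. This sketch uses a made-up column with one extreme value to show why the median is sometimes preferred over the mean; the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one outlier (100.0).
col = pd.Series([2.0, 3.0, np.nan, 4.0, 100.0])

# Manual median: sort the known values and take the middle one
# (or the average of the two middle ones when the count is even).
ordered = sorted(col.dropna())
mid = len(ordered) // 2
if len(ordered) % 2 == 0:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    manual_median = ordered[mid]

filled = col.fillna(col.median())
print("median:", manual_median, "mean:", col.mean())
```

The outlier pulls the mean far above the typical values while the median stays between 3 and 4, so the median gives a more representative fill value for skewed columns.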

We can also do this with the SimpleImputer class.

Python3

# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer
value = df.values

# defining the imputer
imputer = SimpleImputer(missing_values=nan,
                        strategy='median')

# fit the imputer and transform the dataset
transformed_values = imputer.fit_transform(value)

# count the total number of NaN values left after imputation
print("Missing:", isnan(transformed_values).sum())

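Method #4

Finally we impute the missing values with the mode of the data. The mode is the value that occurs most frequently in a set of observations. For example, in {6, 3, 9, 6, 6, 5, 9, 3} the mode is 6, because it occurs most often.

Python3

# filling missing values
# with the mode (most frequent value) of each column;
# df.mode() returns a DataFrame, so take its first row
df.fillna(df.mode().iloc[0], inplace=True)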
df.sample(10)
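
The mode is the one statistic here that also applies to categorical (text) columns, where a mean or median is undefined. A minimal sketch, assuming a hypothetical frame with two categorical columns named `color` and `size`:

```python
import pandas as pd

# Hypothetical categorical data with one missing entry per column.
toy = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "size": ["S", None, "M", "M"],
})

# DataFrame.mode() returns a DataFrame with one row per tied mode,
# so take the first row to get a single fill value per column.
modes = toy.mode().iloc[0]
filled = toy.fillna(modes)
print(filled)
```

Taking `.iloc[0]` matters: passing the DataFrame returned by `mode()` straight to `fillna` aligns on the row index, so only the rows whose index also appears in the mode frame would be filled.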

We can also do this with the SimpleImputer class.

Python3

# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer
value = df.values

# defining the imputer
imputer = SimpleImputer(missing_values=nan,
                        strategy='most_frequent')

# fit the imputer and transform the dataset
transformed_values = imputer.fit_transform(value)

# count the total number of NaN values left after imputation
print("Missing:", isnan(transformed_values).sum())
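
The 'most_frequent' strategy is also the natural choice for string-valued categorical columns. The sketch below assumes a hypothetical single-column object array of city names; note that `numpy.isnan` does not work on object arrays, so the check uses pandas instead:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column stored as an object array.
cities = np.array([["Paris"], [np.nan], ["Paris"], ["Rome"]], dtype=object)

# 'most_frequent' fills each column with its most common value.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(cities)

# pd.isna handles object arrays, unlike numpy.isnan.
print("Missing:", pd.isna(filled).sum())
print(filled[:, 0].tolist())
```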