Missing Data Imputation with fancyimpute

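Real-world datasets almost always contain some missing values, mostly as a result of how the data was collected. Handling missing data is an important part of building predictive models, because some algorithms perform poorly on datasets with missing values.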

fancyimpute

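fancyimpute is a library of missing-data imputation algorithms. It uses machine learning algorithms to impute missing values, and it draws on all columns of the dataset when estimating each one. Two ways of imputing missing data with fancyimpute are covered here: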

KNN, or K-Nearest Neighbors
MICE, or Multiple Imputation by Chained Equations
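
To see what drawing on all columns adds, it can help to start from a baseline that ignores the other columns entirely. The sketch below is not part of fancyimpute: it uses plain pandas on the toy table from this article's examples to count the missing values per column and to fill each gap with its own column's mean.

```python
import numpy as np
import pandas as pd

# The toy table used in this article's examples.
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5,      7,  8,     2],
                   [2,      5,  7,     9]],
                  columns=list('ABCD'))

# How many values are missing in each column?
missing_per_column = df.isna().sum()
print(missing_per_column)

# Baseline: fill every gap with the mean of its own column.
# Each column is treated in isolation; no other column is consulted.
baseline = df.fillna(df.mean())
print(baseline)
```

With this baseline every missing entry in a column receives the same value, no matter what the rest of its row looks like. KNN and MICE instead use the other columns to tailor each estimate.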

K-Nearest Neighbors

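To fill in a missing value, KNN finds the data points that are most similar across the other features, then fills the gap with the average of those neighbors' values.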

Python3

import pandas as pd
import numpy as np
# importing the KNN from fancyimpute library
from fancyimpute import KNN
  
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5,      7,  8,     2],
                   [2,      5,  7,     9]],
                  columns = list('ABCD'))
  
# printing the dataframe
print(df)
  
# create the KNN imputer with its default settings
knn_imputer = KNN()
# impute the missing values; fit_transform returns a NumPy array, not a DataFrame
df = knn_imputer.fit_transform(df)
  
# print the imputed array
print(df)

Output:

     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4
4  5.0  7.0  8.0  2
5  2.0  5.0  7.0  9
Imputing row 1/6 with 2 missing, elapsed time: 0.001
[[3.23556938 2.         7.75630267 0.]
 [3.         4.         7.825      1.]
 [3.67647071 3.46386587 7.64000033 5.]
 [3.35514006 3.         7.59183674 4.]
 [5.         7.         8.         2.]
 [2.         5.         7.         9.]]
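
The idea described above, finding similar rows and averaging their values, can be sketched in plain NumPy. This is an illustrative sketch under simplifying assumptions, not fancyimpute's implementation: the function name `knn_impute_sketch`, the choice of `k=2`, the root-mean-square distance over shared features, and the unweighted average are all choices made here for clarity. The library's own distance and weighting scheme may differ, so the numbers will not match the output above.

```python
import numpy as np

def knn_impute_sketch(X, k=2):
    """Fill each NaN with the mean of that column's value in the k most similar rows.

    Hypothetical helper for illustration only; not part of fancyimpute.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        for j in np.where(missing[i])[0]:
            candidates = []
            for r in range(n_rows):
                # A donor row must be a different row with column j observed.
                if r == i or missing[r, j]:
                    continue
                # Compare the two rows only on features both have observed.
                shared = ~missing[i] & ~missing[r]
                if not shared.any():
                    continue
                dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((dist, X[r, j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                # Unweighted average of the k nearest donors' values.
                filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

data = np.array([[np.nan, 2, np.nan, 0],
                 [3, 4, np.nan, 1],
                 [np.nan, np.nan, np.nan, 5],
                 [np.nan, 3, np.nan, 4],
                 [5,      7,  8,     2],
                 [2,      5,  7,     9]], dtype=float)

result = knn_impute_sketch(data, k=2)
print(result)
```

Only two rows have column C observed (8 and 7), so with `k=2` every imputed C value in this sketch is their mean, 7.5. Column A has three possible donors, so the two rows closest on B and D decide the estimate.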

Multiple Imputation by Chained Equations

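MICE uses multiple imputation rather than single imputation, because filling each gap with a single value hides the statistical uncertainty about what that value really was. MICE runs multiple regressions over the sample data, predicting each incomplete column from the other columns, and averages the results.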

Python3

import pandas as pd
import numpy as np
# importing IterativeImputer, the MICE-style imputer, from the fancyimpute library
from fancyimpute import IterativeImputer
  
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5,      7,  8,     2],
                   [2,      5,  7,     9]],
                  columns = list('ABCD'))
  
# printing the dataframe
print(df)
  
# create the MICE-style imputer with its default settings
mice_imputer = IterativeImputer()
# impute the missing values; fit_transform returns a NumPy array, not a DataFrame
df = mice_imputer.fit_transform(df)
  
# print the imputed array
print(df)

Output:

     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4
4  5.0  7.0  8.0  2
5  2.0  5.0  7.0  9
[[3.27262261 2.         7.9809332  0.]
 [3.         4.         7.9193547  1.]
 [2.91717117 4.35730239 7.47523962 5.]
 [2.77722048 3.         7.53760743 4.]
 [5.         7.         8.         2.]
 [2.         5.         7.         9.]]
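
The chained-equations idea can also be sketched by hand. The code below is a simplified illustration, not the algorithm `IterativeImputer` runs: the function name `chained_regression_sketch`, the number of rounds, and the small ridge penalty (added here only to keep the regressions stable when a column has very few observed values) are assumptions made for this example. It also assumes every column has at least one observed value.

```python
import numpy as np

def chained_regression_sketch(X, n_rounds=10, ridge=1.0):
    """Impute NaNs by repeatedly regressing each incomplete column on the others.

    Hypothetical helper for illustration only; not part of fancyimpute.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: start from a simple guess, the mean of each column's observed values.
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(missing)
    filled[rows, cols] = col_means[cols]

    # Step 2: cycle through the columns, re-predicting the missing entries.
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Predictors: an intercept plus the current values of all other columns.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])
            train_X, train_y = design[~miss_j], filled[~miss_j, j]
            # Ridge-penalized least squares; the intercept is left unpenalized.
            penalty = ridge * np.eye(design.shape[1])
            penalty[0, 0] = 0.0
            coef = np.linalg.solve(train_X.T @ train_X + penalty,
                                   train_X.T @ train_y)
            # Overwrite only the entries that were missing in the original data.
            filled[miss_j, j] = design[miss_j] @ coef
    return filled

data = np.array([[np.nan, 2, np.nan, 0],
                 [3, 4, np.nan, 1],
                 [np.nan, np.nan, np.nan, 5],
                 [np.nan, 3, np.nan, 4],
                 [5,      7,  8,     2],
                 [2,      5,  7,     9]], dtype=float)

result = chained_regression_sketch(data)
print(result)
```

Each column's regression uses the latest estimates of the other columns, which is what makes the equations "chained". This sketch yields a single completed table; a full multiple-imputation procedure would repeat the process with random draws to produce several completed tables, so that the spread between them reflects the uncertainty in the imputed values.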