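How to Impute Missing Values in R?
In this article, we discuss how to impute missing values in the R programming language.
In most datasets, values may be missing because they were never entered or because of some error. Replacing these missing values with another value is called data imputation. There are several imputation methods; common ones replace each missing value with the mean, minimum, or maximum of that column (feature). Different datasets and features call for different imputation methods. For example, in a company's sales performance dataset, if the feature recording losses has missing values, replacing them with the minimum value is the more logical choice.
Dataset in use: the data frame data created in the examples below, which has three numeric columns (marks1, marks2, marks3) that each contain NA values.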
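Before choosing an imputation method, it can help to see how many values are missing and where. The following is a minimal sketch that builds the same data frame used in this article's examples and counts its missing values with base R functions; the variable names na_counts and rows_with_na are only illustrative.

```r
# The data frame used throughout this article
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(data))
na_counts

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(data))
rows_with_na
```

Here marks1 has two missing values and the other two columns have one each, so every column needs some form of imputation.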
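Imputing a single column
Method 1: Imputing manually with the mean
Let us impute the missing values of one column, marks1, with the mean of that entire column.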
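Syntax:
mean(x, trim = 0, na.rm = FALSE, …)
Parameters:
- x – an R object, typically a numeric vector
- trim – the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed
- na.rm – TRUE to remove NA values before the mean is computed (the default is FALSE)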
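The effect of na.rm is easy to check on a small vector. A minimal sketch, using the same values as the marks1 column (the variable names are only illustrative):

```r
x <- c(NA, 22, NA, 49, 75)

# With the default na.rm = FALSE, a single NA makes the whole result NA
mean_default <- mean(x)

# With na.rm = TRUE, NA values are dropped first: (22 + 49 + 75) / 3
mean_no_na <- mean(x, na.rm = TRUE)

mean_default
mean_no_na
```

This is why the imputation examples pass na.rm = TRUE: without it, the replacement value itself would be NA.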
Example: Imputing missing values
R
# create a data frame
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))
# impute manually: replace each NA in marks1 with the column mean
data$marks1[is.na(data$marks1)] <- mean(data$marks1, na.rm = TRUE)
data
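The same pattern works for the other statistics mentioned in the introduction, such as the minimum or maximum. A minimal sketch, assuming a small helper named impute_with (a hypothetical name, not part of base R or any package) that takes the summary function as an argument:

```r
# Hypothetical helper: replace NA in a numeric vector with a summary statistic.
# fun must accept na.rm = TRUE, as mean(), median(), min() and max() do.
impute_with <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

marks1 <- c(NA, 22, NA, 49, 75)

by_mean <- impute_with(marks1)        # NA replaced with the mean
by_min  <- impute_with(marks1, min)   # NA replaced with the minimum, 22
by_max  <- impute_with(marks1, max)   # NA replaced with the maximum, 75

by_mean
by_min
by_max
```

Passing the function as an argument keeps the replacement logic in one place, so switching statistics does not require rewriting the indexing expression.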
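Method 2: Imputing with the median using the Hmisc library
Using the impute() function from the Hmisc library, let us impute the column marks2 with the median of that column.
Example: Imputing missing values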
R
# install (needed only once) and load the required package
install.packages("Hmisc")
library(Hmisc)
# create a data frame
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))
# fill missing values of marks2 with the column median
impute(data$marks2, median)
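Calling a function on data$marks2 returns a new vector and leaves data itself unchanged, so the result has to be assigned back if the data frame should be updated. The following is a minimal sketch of the same median imputation written in base R, so that it runs without installing Hmisc; it is an equivalent in spirit, not the Hmisc implementation.

```r
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))

# Median of the observed values 12, 14, 61, 81 is (14 + 61) / 2 = 37.5
marks2_median <- median(data$marks2, na.rm = TRUE)

# Assign back into the data frame so the change is kept
data$marks2[is.na(data$marks2)] <- marks2_median
data$marks2
```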
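Method 3: Imputing with a specific constant value
Using the impute() function from the Hmisc library, let us impute the column marks3 with a constant value.
Example: Imputing missing values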
R
# install (needed only once) and load the required package
install.packages("Hmisc")
library(Hmisc)
# create a data frame
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))
# impute with a specific number:
# replace each NA in marks3 with 2000
impute(data$marks3, 2000)
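A constant such as 2000 is far from the observed values, so it can be useful to remember which entries were filled in. A minimal base-R sketch, assuming an extra indicator column named marks3_was_na (a hypothetical name chosen for illustration):

```r
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))

# Record which entries were missing before overwriting them
data$marks3_was_na <- is.na(data$marks3)

# replace() returns a copy of the vector with the selected positions set to 2000
data$marks3 <- replace(data$marks3, is.na(data$marks3), 2000)

data[, c("marks3", "marks3_was_na")]
```

Keeping the indicator makes it possible to exclude or re-impute the filled values later.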
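Imputing the entire dataset:
This can be done by using the apply() function to compute the median of each column and then replacing the NA values in each column with that column's median.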
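Syntax:
apply(X, MARGIN, FUN, …)
Parameters:
- X – an array, including a matrix (a data frame is coerced to a matrix)
- MARGIN – a vector giving the dimensions the function is applied over: 1 for rows, 2 for columns
- FUN – the function to be applied; further arguments in … are passed on to it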
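The role of MARGIN and of the extra arguments is easiest to see on a small matrix. A minimal sketch with made-up values (the matrix m and the result names are only illustrative):

```r
m <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)   # MARGIN = 2: one result per column

# Arguments after FUN are passed on to FUN, here na.rm = TRUE for median()
m[1, 1] <- NA
col_medians <- apply(m, 2, median, na.rm = TRUE)

row_sums
col_sums
col_medians
```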
Example: Imputing the entire dataset
R
# create a data frame
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))
# get the median of each column using apply()
all_column_median <- apply(data, 2, median, na.rm = TRUE)
# replace the NA values in each column with that column's median
for (i in colnames(data)) {
  data[, i][is.na(data[, i])] <- all_column_median[i]
}
data
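Because apply() first converts the data frame to a matrix, the approach above suits data frames whose columns are all numeric. As an alternative, the following minimal sketch works column by column with lapply() and then checks that no missing values remain; it assumes, as in this article's dataset, that every column is numeric.

```r
data <- data.frame(marks1 = c(NA, 22, NA, 49, 75),
                   marks2 = c(81, 14, NA, 61, 12),
                   marks3 = c(78.5, 19.325, NA, 28, 48.002))

# data[] <- ... keeps the data frame structure while replacing every column
data[] <- lapply(data, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})

data

# Confirm that no NA values are left anywhere in the data frame
anyNA(data)
```

The column medians used here are 49 for marks1, 37.5 for marks2, and 38.001 for marks3.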