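Visualizing Missing Data with Bar Plots in R
In this article, we discuss how to visualize missing data with bar plots in the R programming language.
Missing data are data points that were never recorded, that is, values that were not entered into the dataset. Missing data are usually represented as NA or NaN, or even as an empty cell.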
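The three representations do not behave the same way in R, which matters for the counts that follow. The sketch below uses two made-up vectors (`num` and `txt` are illustrative names, not part of the article's dataset) and assumes that empty strings should be treated as missing, which may not hold for every dataset.

```r
# Hypothetical vectors, used only to compare NA, NaN and empty strings
num <- c(1, NA, NaN)
txt <- c("a", NA, "", "d")

is.na(num)    # FALSE TRUE TRUE: is.na() flags both NA and NaN
is.nan(num)   # FALSE FALSE TRUE: is.nan() flags only NaN
is.na(txt)    # FALSE TRUE FALSE FALSE: the empty string is not flagged

# Assumption: empty strings mean "missing", so recode them to NA
txt[which(txt == "")] <- NA
sum(is.na(txt))   # 2
```

Recoding empty cells before counting keeps the later bar plots from understating how much data is missing.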
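Dataset in use: a small data frame with six rows and three columns (age, name, grade), built from three vectors in the examples below.
In a large dataset, a small amount of missing data may not affect the overall information, whereas in a small dataset it can cause a substantial loss of information. Depending on the dataset, missing values are either removed or imputed. To decide how to handle them, we first look at how to visualize the missing data points.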
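As a rough illustration of the two options, the sketch below removes incomplete rows with `na.omit()` and, alternatively, fills the numeric column with its mean. Mean imputation is only one possible choice and is assumed here for brevity; `complete_rows` and `imputed` are illustrative names.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# Option 1: drop every row that contains at least one NA
complete_rows <- na.omit(df)
nrow(complete_rows)   # 2 of the 6 rows survive

# Option 2 (assumed strategy): replace missing ages with the mean age
imputed <- df
imputed$age[is.na(imputed$age)] <- mean(df$age, na.rm = TRUE)
imputed$age           # 12 34 17 7 15 17
```

Losing four of six rows shows why deletion can be costly on a small dataset.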
Let us first count the total number of missing values.
Example: counting missing values
R
# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
# count the total number of missing values
sum(is.na(df))
Output:
5
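A raw count is easier to judge next to the size of the data. As a small sketch, taking the mean of the logical matrix returned by `is.na()` gives the share of missing cells; `share_missing` is an illustrative name.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# TRUE counts as 1 and FALSE as 0, so the mean is the missing share
share_missing <- mean(is.na(df))
round(100 * share_missing, 1)   # 27.8 (5 of 18 cells)
```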
We can also find out how many missing values each attribute/column contains.
Example: counting missing values in each attribute/column
R
# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
# count number of missing values in each
# attribute/column
sapply(df, function(x) sum(is.na(x)))
Output:
  age  name grade
    2     3     0
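The same per-column counts can be obtained without an anonymous function. This sketch uses `colSums()` and `colMeans()` on the logical matrix from `is.na()`; `na_counts` and `na_share` are illustrative names.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

na_counts <- colSums(is.na(df))    # missing values per column
na_share  <- colMeans(is.na(df))   # fraction missing per column

# names of the columns that contain at least one missing value
names(na_counts)[na_counts > 0]    # "age" "name"
```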
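Visualizing all missing values
Let us first use the barplot() function in R to visualize the frequency of missing and non-missing values across the whole dataset.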
Syntax of barplot():
barplot(height, names.arg = NULL, col = NULL, main = NULL, xlab = NULL, ylab = NULL, beside = FALSE, horiz = FALSE, ...)
Parameters:
- height : vector or matrix of values describing the bars
- names.arg : label for each bar or group of bars
- col : colors for the bars
- main : title of the bar plot
- xlab : label for the x-axis
- ylab : label for the y-axis
- beside : FALSE (default) stacks the rows of a matrix within each bar; TRUE draws them side by side as grouped bars
- horiz : FALSE (default) draws vertical bars; TRUE draws horizontal bars
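To see how beside and horiz change the picture, here is a minimal sketch on a made-up 2 x 3 count matrix. The matrix `counts`, its row and column names, and the colors are illustrative assumptions, not part of the article's dataset.

```r
# Hypothetical count matrix: one column per feature, one row per category
counts <- matrix(c(2, 4, 3, 3, 0, 6), nrow = 2,
                 dimnames = list(c("missing", "present"), c("a", "b", "c")))

# Defaults (beside = FALSE, horiz = FALSE): stacked vertical bars,
# one bar per column of the matrix
mid_stacked <- barplot(counts, col = c("tomato", "steelblue"),
                       main = "Stacked", ylab = "Count")

# beside = TRUE draws one bar per cell, horiz = TRUE flips the axes;
# legend.text = TRUE builds the legend from the row names
mid_grouped <- barplot(counts, col = c("tomato", "steelblue"),
                       beside = TRUE, horiz = TRUE,
                       main = "Grouped, horizontal", xlab = "Count",
                       legend.text = TRUE)
```

barplot() returns the bar midpoints, which is handy when text labels need to be placed on the bars afterwards.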
Example: visualizing all missing values
R
# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
# converting a frequency table for missing
# values to dataframe
freqDf <- data.frame(table(is.na(df)))
# barplot for visualization
barplot(freqDf$Freq , main = "Total Missing values",
xlab = "Missing Data", ylab = "Frequency",
names.arg = c("FALSE","TRUE"),
col = c("#80dfff","lightgreen"))
# legend for barplot
legend("topright",
c("Non-Missing Values","Missing Values"),
fill = c("#80dfff","lightgreen"))
Output: a bar plot with two bars, FALSE (13 non-missing values) and TRUE (5 missing values).
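The example above hard-codes two bar labels, which only works when both TRUE and FALSE occur in the data. A slightly more defensive sketch, assuming we always want both bars, fixes the categories with `factor()` so that a zero count is still drawn; `missing_flag`, `counts` and `grade_counts` are illustrative names.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# Fixed levels keep both categories even if one of them never occurs
missing_flag <- factor(as.vector(is.na(df)),
                       levels = c("FALSE", "TRUE"),
                       labels = c("Non-missing", "Missing"))
counts <- table(missing_flag)   # Non-missing 13, Missing 5

# A one-dimensional table carries its own bar labels
barplot(counts, main = "Total Missing values",
        xlab = "Missing Data", ylab = "Frequency",
        col = c("#80dfff", "lightgreen"))

# A column without NAs still yields two categories (6 and 0)
grade_counts <- table(factor(is.na(df$grade), levels = c("FALSE", "TRUE")))
```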
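Visualizing missing data in one column
For this, we select the column we want to visualize and then apply the same steps.
Example: visualizing missing data in one column
R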
# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
# frequency table for missing data for 1 column,
# here age column is taken
freqDf2 <- data.frame(table(is.na(df$age)))
# barplot for 1 column/feature
barplot(freqDf2$Freq,
main = "Total Missing values",xlab = "Missing Data",
ylab = "Frequency",names.arg = c("FALSE","TRUE"),
col = c("#ffb3b3","#99e6ff"))
# legend for barplot
legend("topright",
c("Non-Missing Values","Missing Values"),
fill = c("#ffb3b3","#99e6ff"))
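Output: a bar plot with two bars for the age column, FALSE (4 non-missing values) and TRUE (2 missing values).
Visualizing missing data in all columns
Let us create a function that converts the data frame into a matrix of TRUE/FALSE counts (missing and non-missing values per column), and then visualize it with a bar plot in R.
Example: visualizing missing data in all columns
R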
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
# function that converts a data frame into a 2-row matrix of
# missing (TRUE) and non-missing (FALSE) counts per column
toBinaryMatrix <- function(df){
  m <- c()
  for(i in colnames(df)){
    x <- sum(is.na(df[,i]))
    # missing value count
    m <- append(m, x)
    # non-missing value count
    m <- append(m, nrow(df) - x)
  }
  # matrix() fills column by column, so each column holds one
  # feature's counts; then add row and column names
  a <- matrix(m, nrow = 2)
  rownames(a) <- c("TRUE", "FALSE")
  colnames(a) <- colnames(df)
  return(a)
}
# function call
binMat = toBinaryMatrix(df)
binMat
Output:
      age name grade
TRUE    2    3     0
FALSE   4    3     6
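The loop can also be replaced by vectorized column sums. The sketch below should produce the same counts as the matrix shown above; `na_flags` and `binMatVec` are illustrative names, not part of the original function.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# is.na() on a data frame returns a logical matrix of the same shape
na_flags <- is.na(df)

# Row 1: missing counts, row 2: non-missing counts, one column per feature
binMatVec <- rbind(colSums(na_flags), colSums(!na_flags))
rownames(binMatVec) <- c("TRUE", "FALSE")
binMatVec
```

Column names are carried over from the data frame, so no separate colnames() call is needed.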
Stacked bar plot
A stacked bar plot contrasts the missing values with the present values in each column.
Example: stacked bar plot
R
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
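# stacked barplot for missing data in all columns
# (binMat is the matrix returned by toBinaryMatrix(df) in the previous example)
barplot(binMat,
main = "Missing values in all features",
xlab = "Features", ylab = "Frequency",
col = c("#4dffd2","#ff9999"))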
# legend for barplot
legend("bottomright",
c("Missing values","Non-Missing values"),
fill = c("#4dffd2","#ff9999"))
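Output: a stacked bar plot with one bar per column (age, name, grade), each of total height 6 and split into missing and non-missing counts.
Grouped bar plot
Another useful visualization is the grouped bar plot.
Example: grouped bar plot
R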
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
# grouped barplot for missing data in all columns
# (binMat is the matrix returned by toBinaryMatrix(df) above)
barplot(binMat,
main = "Missing values in all features",xlab = "Frequency",
col = c("#ffff99","#33bbff"),beside=TRUE,
horiz = TRUE)
# legend for barplot
legend("right",c("Missing values","Non-Missing values"),
fill = c("#ffff99","#33bbff"))
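Output: a horizontal grouped bar plot with a missing and a non-missing bar for each column (age, name, grade).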
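When columns should be compared by share rather than by count, the count matrix can be rescaled to percentages before plotting. This is a sketch using `prop.table()`; the row labels, colors and the name `pctMat` are illustrative choices, and the matrix is rebuilt here so the block runs on its own.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# Count matrix: missing and non-missing values per column
countMat <- rbind(colSums(is.na(df)), colSums(!is.na(df)))
rownames(countMat) <- c("Missing values", "Non-Missing values")

# margin = 2 rescales each column so that it sums to 1
pctMat <- 100 * prop.table(countMat, margin = 2)

barplot(pctMat,
        main = "Share of missing values in all features",
        xlab = "Features", ylab = "Percent",
        col = c("#4dffd2", "#ff9999"),
        legend.text = rownames(pctMat),
        args.legend = list(x = "bottomright"))
```

Every bar then has height 100, which makes columns with different numbers of rows directly comparable.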