Visualizing Missing Data with Bar Plots in R

Last modified: 2022-05-13 01:54:49.072000    Author: Mango


In this article, we discuss how to visualize missing data with bar plots in the R programming language.

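Missing data are data points that were never recorded, i.e. never entered into the dataset. Missing data are usually represented as NA, NaN, or even as empty cells.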
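One practical detail: `is.na()` treats both NA and NaN as missing, but an empty string in a character column is an ordinary value and is not counted. The following is a minimal sketch, assuming a hypothetical column `raw$name` in which blank cells were read in as empty or whitespace-only strings; it converts them to NA so the counts used later in this article include them.

```r
# hypothetical raw column where two entries were left blank
raw <- data.frame(name = c("rob", "", "arya", " "),
                  stringsAsFactors = FALSE)

# blanks are ordinary strings, so is.na() does not flag them yet
blanks_before <- sum(is.na(raw$name))

# treat empty or whitespace-only strings as missing
raw$name[trimws(raw$name) == ""] <- NA

# now the two blank cells are counted as missing
blanks_after <- sum(is.na(raw$name))
```

Whether blank cells should really count as missing depends on the dataset, so this conversion is a choice to make deliberately rather than a default.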

Dataset in use:

[Figure: example of missing data]

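In a large dataset, a few missing values may not affect the overall information, whereas in a small dataset they can cause a substantial loss of information. Depending on the dataset, missing values are either removed or imputed. To decide how to handle them, we first look at how to visualize the missing data points.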
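To make "removed or imputed" concrete, here is a minimal sketch using the same sample data frame as the rest of this article. Mean imputation is only one possible choice and is used here purely for illustration; the names `complete_rows` and `imputed` are introduced for this example.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# Option 1: drop every row that contains at least one NA
complete_rows <- na.omit(df)

# Option 2: fill missing ages with the mean of the observed ages
imputed <- df
imputed$age[is.na(imputed$age)] <- mean(imputed$age, na.rm = TRUE)
```

With this small dataset, dropping incomplete rows keeps only 2 of the 6 rows, which shows why it helps to inspect the amount of missing data before choosing a strategy.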



Let us first count the total number of missing values.

Example: counting missing values

R
# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# count the total number of missing values
sum(is.na(df))


Output:

5

We can also find out how many missing values each attribute/column contains.

Example: counting missing values in each attribute/column

R

# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# count number of missing values in each 
# attribute/column
sapply(df, function(x) sum(is.na(x)))

Output:

  age  name grade
    2     3     0
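As an alternative to the `sapply()` call, the same per-column counts can be obtained with `colSums()`, since `is.na()` applied to a data frame returns a logical matrix. The sketch below also derives the share of missing values per column with `colMeans()`; the variable names `na_counts` and `na_percent` are introduced only for this example.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# number of missing values per column
na_counts <- colSums(is.na(df))

# percentage of missing values per column, rounded to one decimal
na_percent <- round(colMeans(is.na(df)) * 100, 1)

na_counts
na_percent
```

Percentages are often easier to compare than raw counts when columns come from datasets of different sizes.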

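Visualizing all missing values

Let us first use the barplot() function in R to visualize the frequency of missing and non-missing values across the whole dataset.

Example: visualizing all missing values

R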

# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# converting a frequency table of missing
# values to a dataframe
freqDf <- data.frame(table(is.na(df)))

# barplot for visualization
barplot(freqDf$Freq, main = "Total Missing values",
        xlab = "Missing Data", ylab = "Frequency",
        names.arg = c("FALSE", "TRUE"),
        col = c("#80dfff", "lightgreen"))

# legend for barplot
legend("topright",
       c("Non-Missing Values", "Missing Values"),
       fill = c("#80dfff", "lightgreen"))

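Output: a bar plot with one bar for non-missing values (FALSE) and one bar for missing values (TRUE).

Visualizing missing data for one column

To do this, we select the column we want to visualize and then apply the same steps to it.

Example: visualizing missing data for one column

R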

# Creating a sample dataframe using 3 vectors
age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# frequency table for missing data for 1 column,
# here age column is taken
freqDf2 <- data.frame(table(is.na(df$age)))

# barplot for 1 column/feature
barplot(freqDf2$Freq,
        main = "Total Missing values", xlab = "Missing Data",
        ylab = "Frequency", names.arg = c("FALSE", "TRUE"),
        col = c("#ffb3b3", "#99e6ff"))

# legend for barplot
legend("topright",
       c("Non-Missing Values", "Missing Values"),
       fill = c("#ffb3b3", "#99e6ff"))

Output: a bar plot of the non-missing (FALSE) and missing (TRUE) counts for the age column.
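The single-column example passes two labels to names.arg, which assumes the column contains both missing and non-missing values. For a column such as grade, which has no NA, `table(is.na(...))` returns only one count and the label vector no longer matches. A minimal sketch of one way around this is to fix the factor levels before tabulating; `countMissing` is a hypothetical helper name introduced here, not part of the original article.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# hypothetical helper: always returns counts for both FALSE and TRUE,
# even when one of them does not occur in the column
countMissing <- function(x) {
  table(factor(is.na(x), levels = c(FALSE, TRUE)))
}

# grade has no missing values, yet both categories are present
counts <- countMissing(df$grade)

barplot(counts,
        main = "Missing values in grade", xlab = "Missing Data",
        ylab = "Frequency", names.arg = c("FALSE", "TRUE"),
        col = c("#ffb3b3", "#99e6ff"))
```

The same helper can be applied to any column, so the plotting code does not need to change depending on whether missing values happen to be present.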



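Visualizing missing data for all columns

Let us create a function that summarizes the data frame as a two-row matrix holding the count of missing (TRUE) and non-missing (FALSE) values for each column, and then visualize that matrix with bar plots in R.

Example: visualizing missing data for all columns

R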

age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# function to convert a dataframe into a 2-row matrix of
# missing (TRUE) and non-missing (FALSE) counts per column
toBinaryMatrix <- function(df) {
  m <- c()
  for (i in colnames(df)) {
    # missing value count
    x <- sum(is.na(df[, i]))
    m <- append(m, x)
    # non-missing value count
    m <- append(m, nrow(df) - x)
  }

  # adding column and row names to matrix
  a <- matrix(m, nrow = 2)
  rownames(a) <- c("TRUE", "FALSE")
  colnames(a) <- colnames(df)

  return(a)
}

# function call
binMat <- toBinaryMatrix(df)
binMat

Output:

      age name grade
TRUE    2    3     0
FALSE   4    3     6
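The loop in toBinaryMatrix() can also be written without explicit iteration. The sketch below is an alternative formulation, not the article's own method: it stacks two `colSums()` results with `rbind()`, and the name `binMat2` is introduced only for this example.

```r
age <- c(12, 34, NA, 7, 15, NA)
name <- c("rob", NA, "arya", "jon", NA, NA)
grade <- c("A", "A", "D", "B", "C", "B")
df <- data.frame(age, name, grade)

# row "TRUE": missing counts, row "FALSE": non-missing counts
binMat2 <- rbind("TRUE"  = colSums(is.na(df)),
                 "FALSE" = colSums(!is.na(df)))
binMat2
```

The result has the same row names, column names, and counts as the matrix shown above, so it can be passed to barplot() in the same way.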

Stacked bar plot

A stacked bar plot lets us contrast the missing values with the present values in each column.

Example: stacked bar plot

R

age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# stacked barplot for missing data in all columns
# (binMat is the matrix returned by toBinaryMatrix(df) in the previous example)
barplot(binMat,
        main = "Missing values in all features",
        xlab = "Feature", ylab = "Frequency",
        col = c("#4dffd2", "#ff9999"))

# legend for barplot
legend("bottomright",
       c("Missing values", "Non-Missing values"),
       fill = c("#4dffd2", "#ff9999"))

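Output: a stacked bar plot with one bar per column, split into missing and non-missing counts.

Grouped bar plot

Another useful visualization is the grouped bar plot.

Example: grouped bar plot

R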

age = c(12,34,NA,7,15,NA)
name = c('rob',NA,"arya","jon",NA,NA)
grade = c("A","A","D","B","C","B")
df <- data.frame(age,name,grade)
  
# grouped barplot for missing data in all columns
# (binMat is the matrix returned by toBinaryMatrix(df) in the earlier example)
barplot(binMat,
        main = "Missing values in all features", xlab = "Frequency",
        col = c("#ffff99", "#33bbff"), beside = TRUE,
        horiz = TRUE)

# legend for barplot
legend("right", c("Missing values", "Non-Missing values"),
       fill = c("#ffff99", "#33bbff"))

Output: a horizontal grouped bar plot with a pair of bars (missing, non-missing) for each column.