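How to Find Grouped Summary Statistics of a Data Frame in R?
Finding grouped summary statistics of a data frame is very useful for understanding the data. A summary comprises the mean, median, minimum, maximum, and quartiles of the given data frame. It can be computed for a single column (variable) or for the entire data frame. In this article, we will see how to find grouped summary statistics of a data frame in the R programming language.
Importing data in R
In the code below, we use a built-in dataset: the iris dataset. We can then inspect the dataset with head() or tail(), which print the first and last rows of the data frame respectively. Here we display the first 10 rows of the sample data frame.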
R
# import data
df <- iris
# inspecting the dataset
head(df, 10)
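The text mentions tail() but the example only uses head(). As a small sketch that uses only base R functions on the built-in iris data, the following also looks at the last rows and the overall shape of the data frame. The variable name last_rows is chosen here for illustration.

```r
# import data
df <- iris

# last 5 rows of the data frame (the counterpart of head())
last_rows <- tail(df, 5)
print(last_rows)

# number of rows and columns
print(dim(df))

# compact overview of each column's type and first values
str(df)
```

Checking dim() and str() before summarizing helps confirm which columns are numeric and which are factors, since summary() treats the two differently.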
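Summary of a single variable or column
Our data frame is stored in the variable df. We want to print the summary of one column, Sepal.Length, so we pass df$Sepal.Length as the argument to the summary() function.
Syntax: summary(dataframe$column_name)
The summary() function takes a data frame column and returns:
- Central tendency -> mean and median,
- Interquartile range -> the 25th and 75th percentiles (first and third quartiles),
- Range of that column -> minimum and maximum.
Example 1:
R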
df <- iris
summary(df$Sepal.Length)
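To make the six values returned by summary() less of a black box, they can also be computed one at a time with base R functions. This is a minimal sketch, assuming the default quantile method; the names x, by_hand and iqr are illustrative only.

```r
x <- iris$Sepal.Length

# Each component of summary() computed individually
by_hand <- c(
  "Min."    = min(x),
  "1st Qu." = unname(quantile(x, 0.25)),  # 25th percentile
  "Median"  = median(x),
  "Mean"    = mean(x),
  "3rd Qu." = unname(quantile(x, 0.75)),  # 75th percentile
  "Max."    = max(x)
)
print(by_hand)

# The built-in summary for comparison
print(summary(x))

# Interquartile range: 3rd quartile minus 1st quartile
iqr <- IQR(x)
print(iqr)
```

The two printed results should agree up to display rounding, and IQR() gives the width of the middle half of the data directly.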
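Example 2: We can also pass digits as an argument; it specifies the number of significant digits to which the output values are rounded (not the number of decimal places).
Syntax: summary(dataframe$column_name, digits = number_of_significant_digits)
R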
df <- iris
summary(df$Sepal.Width, digits = 3)
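The difference between significant digits and decimal places is easy to see on the mean of Sepal.Width. The sketch below compares signif() and round() directly; it assumes a reasonably recent version of R, where summary() on a numeric vector applies significant-digit rounding when digits is supplied. The variable names are illustrative.

```r
x <- iris$Sepal.Width

# 3 significant digits versus 3 decimal places
mean_signif <- signif(mean(x), 3)
mean_round  <- round(mean(x), 3)
print(mean_signif)
print(mean_round)

# summary() with digits = 3 for comparison
print(summary(x, digits = 3))
```

If a fixed number of decimal places is what is needed, one option is to call round() on the result of summary() instead of relying on digits.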
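Summary of the entire data frame
In the code below, we pass the entire data frame as the argument to summary(), so it computes a summary of every column (variable) in the data frame.
Syntax: summary(dataframe_name)
R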
df <- iris
summary(df)
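When summary() is applied to a whole data frame, numeric columns get the six-number summary while factor columns such as Species are reported as counts per level. The following sketch reproduces both parts separately with base R; numeric_cols, col_means and species_counts are names made up for this example.

```r
df <- iris

# Logical vector marking which columns are numeric
numeric_cols <- sapply(df, is.numeric)

# Mean of every numeric column
col_means <- sapply(df[numeric_cols], mean)
print(col_means)

# Counts per level for the factor column
species_counts <- table(df$Species)
print(species_counts)
```

Splitting the columns this way is handy when only one statistic is needed for many columns rather than the full summary table.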
Grouped summary of data
For a better understanding of data frames in R, see the article R – Data Frames.
Let us first create a sample data frame:
R
# sample data frame with 20 delay observations
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed",
                         "Thurs", "Fri"), each = 4),
                   levels = c("Mon", "Tues", "Wed",
                              "Thurs", "Fri")),
  Quarter = paste0("Q", rep(1:4, each = 5)),
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8,
            11.1, 10.2, 9.3, 12.2, 10.2, 9.2, 9.7, 12.2,
            8.1, 7.9, 5.6))
df
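Summarizing grouped data for a single variable
Our data frame consists of 3 variables: Weekday, Quarter, and Delay. The variable we will summarize is Delay, grouped by Weekday; in the process, the Quarter variable is collapsed.
In the code below, we use the dplyr package. dplyr is a grammar of data manipulation that provides a consistent set of verbs for solving the most common data manipulation challenges. We perform the grouping with the group_by() function and the summary operation with the summarize() function, computing 2 summary statistics: the minimum and maximum delay.
Syntax: group_by(variable_name)
R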
library(dplyr)
# sample data frame
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs",
                         "Fri"), each = 4),
                   levels = c("Mon", "Tues", "Wed", "Thurs",
                              "Fri")),
  Quarter = paste0("Q", rep(1:4, each = 5)),
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8,
            11.1, 10.2, 9.3, 12.2, 10.2, 9.2, 9.7, 12.2,
            8.1, 7.9, 5.6))
# minimum and maximum delay per weekday
df %>%
  group_by(Weekday) %>%
  summarize(min_delay = min(Delay), max_delay = max(Delay))
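For readers who do not have dplyr installed, roughly the same per-weekday result can be obtained with base R. This is a sketch of one alternative using tapply(), not the method the article relies on; min_delay, max_delay and base_summary are illustrative names.

```r
# Rebuild the sample data frame used above
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), each = 4),
                   levels = c("Mon", "Tues", "Wed", "Thurs", "Fri")),
  Quarter = paste0("Q", rep(1:4, each = 5)),
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8,
            11.1, 10.2, 9.3, 12.2, 10.2, 9.2, 9.7, 12.2,
            8.1, 7.9, 5.6))

# tapply() splits Delay by Weekday and applies a function to each group
min_delay <- tapply(df$Delay, df$Weekday, min)
max_delay <- tapply(df$Delay, df$Weekday, max)

# Combine the two results into one data frame, one row per weekday
base_summary <- data.frame(
  Weekday   = names(min_delay),
  min_delay = as.vector(min_delay),
  max_delay = as.vector(max_delay)
)
print(base_summary)
```

Because Weekday is a factor with explicit levels, the rows come out in Mon to Fri order rather than alphabetically.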
Summarizing grouped data for multiple variables
Let us create another sample data frame, df2:
R
# sample data frame
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)),
  Week = rep(c("Weekday", "Weekend"), each = 2, times = 4),
  Direction = rep(c("Inbound", "Outbound"), times = 8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
df2
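Summarizing the data by group:
In this case, our data frame has 4 variables: Quarter, Week, Direction, and Delay. In the code below, we group by Quarter and Week and summarize; in the process, the Direction variable is collapsed.
Syntax: group_by(variable_name1, variable_name2)
R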
library(dplyr)
# sample data frame
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)),
  Week = rep(c("Weekday", "Weekend"), each = 2, times = 4),
  Direction = rep(c("Inbound", "Outbound"), times = 8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
# summarizing by group
df2 %>%
  group_by(Quarter, Week) %>%
  summarize(min_delay = min(Delay), max_delay = max(Delay))
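With more than one grouping variable, recent dplyr versions (1.0.0 and later, as far as I am aware) keep the result grouped by all but the last variable and print a message about it. The sketch below adds the group size and mean to the summary and passes .groups = "drop" to return an ungrouped result; by_quarter_week is a name chosen for illustration.

```r
library(dplyr)

# sample data frame
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)),
  Week = rep(c("Weekday", "Weekend"), each = 2, times = 4),
  Direction = rep(c("Inbound", "Outbound"), times = 8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))

# several statistics per (Quarter, Week) group, returned ungrouped
by_quarter_week <- df2 %>%
  group_by(Quarter, Week) %>%
  summarize(
    n          = n(),          # rows in each group
    min_delay  = min(Delay),
    mean_delay = mean(Delay),
    max_delay  = max(Delay),
    .groups    = "drop"
  )
print(by_quarter_week)
```

Each Quarter and Week combination contains two rows here (one Inbound, one Outbound), so n makes it visible that Direction is the variable being collapsed.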