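Data Munging in R Programming
Data munging is the general technique of transforming data from an unusable or erroneous form into a useful one. Without some degree of munging, whether performed by an expert user or by an automated system, data is not ready for downstream consumption. Cleaning data by hand is the most basic form of data munging. In R programming, the following tools support the data munging process:
- apply() family
- aggregate()
- dplyr package
- plyr package
Data munging with the apply() family
In R's apply() family, the most basic function is apply() itself; the others include lapply(), sapply(), and tapply(). The whole family can be thought of as a substitute for explicit loops. apply() is the most restrictive member: it is meant for a matrix or array whose elements all share one type. If apply() is called on a data frame or another kind of object, it first converts the object to a matrix and then carries out its operation.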
Syntax:
apply(X, MARGIN, FUN, ...)
Parameters:
X: an array or matrix
MARGIN: the dimension to apply the function over [1: rows; 2: columns; c(1, 2): both]
FUN: the function to apply
...: optional further arguments passed to FUN
Example:
R
# Using apply()
# Build a 5 x 6 matrix; 1:10 is recycled to fill all 30 cells
m <- matrix(1:10,
            nrow = 5,
            ncol = 6)
m
# MARGIN = 2: sum each column
a_m <- apply(m, 2, sum)
a_m
Output:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6    1    6    1    6
[2,]    2    7    2    7    2    7
[3,]    3    8    3    8    3    8
[4,]    4    9    4    9    4    9
[5,]    5   10    5   10    5   10
[1] 15 40 15 40 15 40
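In the example above, apply() computes the sum of each column. The same one-line call works unchanged on much larger matrices, with no explicit loop.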
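To make the role of MARGIN and the matrix conversion concrete, here is a minimal sketch. The object names (m, row_sums, col_max, df, coerced) are illustrative only and do not come from the original example.

```r
# A small integer matrix: 2 rows, 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_max  <- apply(m, 2, max)   # MARGIN = 2: one result per column

print(row_sums)
print(col_max)

# A data frame with mixed column types becomes a character matrix when
# converted, which is the conversion apply() performs before it runs
df <- data.frame(id = c("a", "b"), value = c(10, 20))
coerced <- as.matrix(df)
print(is.character(coerced))
```

Because of that conversion, apply() on a mixed-type data frame hands strings to the function, so numeric summaries are usually better done on the numeric columns only.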
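The lapply() function applies a function to each element of a list or vector and returns a list of the same length as the input. The "l" in lapply() stands for list. Unlike apply(), lapply() takes no MARGIN argument.
Syntax:
lapply(X, FUN, ...)
Parameters:
X: a list, a vector, or another object that can be coerced to a list
FUN: the function to apply to each element
Example: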
R
# Using lapply()
movies <- c("SPIDERMAN", "BATMAN",
"AVENGERS", "FROZEN")
movies
movies_lower <- lapply(movies,
tolower)
str(movies_lower)
Output:
[1] "SPIDERMAN" "BATMAN" "AVENGERS" "FROZEN"
List of 4
$ : chr "spiderman"
$ : chr "batman"
$ : chr "avengers"
$ : chr "frozen"
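The sapply() function accepts a vector, list, or similar object and performs the same operation as lapply(), with the same basic syntax. The difference is in the return value: sapply() tries to simplify the result to a vector or matrix, whereas lapply() always returns a list.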
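The difference in return type is easiest to see side by side. The following sketch reuses the movies vector from the lapply() example; the names as_list, as_vector, and lengths_checked are made up for illustration.

```r
movies <- c("SPIDERMAN", "BATMAN", "AVENGERS", "FROZEN")

as_list   <- lapply(movies, nchar)   # always a list, one element per input
as_vector <- sapply(movies, nchar)   # simplified to a named integer vector

str(as_list)
print(as_vector)

# vapply() is a stricter relative of sapply(): every result must match
# the template given in FUN.VALUE, otherwise an error is raised
lengths_checked <- vapply(movies, nchar, FUN.VALUE = integer(1))
print(lengths_checked)
```

When the result feeds into further code, vapply() is often the safer choice because its return type does not depend on the input.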
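The tapply() function computes a summary such as the mean, median, or maximum for each level of a factor. In effect, it splits a vector into subsets defined by the factor and applies a function to each subset.
Syntax:
tapply(X, INDEX, FUN = NULL, ...)
Parameters:
X: a vector, or an object that can be split
INDEX: a factor, or a list of factors, each the same length as X
FUN: the function to apply to each group
Example: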
R
# Using tapply()
data(iris)
tapply(iris$Sepal.Width,
iris$Species,
median)
Output:
    setosa versicolor  virginica
       3.4        2.8        3.0
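The "split into subsets, then apply" description of tapply() can be checked by doing the two steps by hand. This is only a sketch of the idea, not how tapply() is implemented; groups, by_tapply, and by_split are illustrative names.

```r
data(iris)

# One call: median sepal width per species
by_tapply <- tapply(iris$Sepal.Width, iris$Species, median)

# The same computation in two explicit steps
groups   <- split(iris$Sepal.Width, iris$Species)  # list of 3 numeric vectors
by_split <- sapply(groups, median)                 # median of each vector

print(by_tapply)
print(by_split)
```

Both approaches give the same three medians; tapply() simply saves the intermediate list.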
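Data munging with aggregate()
In R, the aggregate() function combines, or aggregates, an input data frame by splitting it into sub-data-frames and applying a function to each column of each one. To call aggregate(), we must supply the following:
- The input data that we wish to aggregate
- The variable(s) in the data that define the groups
- The function or calculation to apply
After applying the function, aggregate() returns a data frame with one row for each unique combination of the grouping values found in the input. aggregate() takes a single function argument. To apply several functions per group, one option is the plyr package.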
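Syntax:
aggregate(formula, data, FUN, ...)
aggregate(x, by, FUN, ...)
Parameters:
formula: a formula such as y ~ g, with the variable(s) to aggregate on the left and the grouping variable(s) on the right
data: the data frame that contains the variables named in the formula
x: the data frame (or columns) to aggregate, as used in the example below
by: a list of grouping variables
FUN: the function or calculation to apply to each group
Example:
R
# R program to illustrate
# the aggregate() function
# runif() and rnorm() draw random numbers, so values differ between runs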
assets <- data.frame(
asset.class = c("equity", "equity",
"equity", "option",
"option", "option",
"bond", "bond"),
rating = c("AAA", "A", "A",
"AAA", "BB", "BB",
"AAA", "A"),
counterparty.a = c(runif(3), rnorm(5)),
counterparty.b = c(runif(3), rnorm(5)),
counterparty.c = c(runif(3), rnorm(5)))
assets
exposures <- aggregate(
x = assets[c("counterparty.a",
"counterparty.b",
"counterparty.c")],
by = assets[c("asset.class", "rating")],
FUN = function(market.values){
sum(pmax(market.values, 0))
})
exposures
Output:
asset.class rating counterparty.a counterparty.b counterparty.c
1 equity AAA 0.08250275 0.5474595 0.9966172
2 equity A 0.33931258 0.6442402 0.2348197
3 equity A 0.68078755 0.5962635 0.6126720
4 option AAA -0.47624689 -0.4622881 -1.2362731
5 option BB -0.78860284 0.3219559 -1.2847157
6 option BB -0.59461727 -0.2840014 -0.5739735
7 bond AAA 1.65090747 1.0918564 0.6179858
8 bond A -0.05402813 0.1602164 1.1098481
asset.class rating counterparty.a counterparty.b counterparty.c
1 bond A 0.00000000 0.1602164 1.1098481
2 equity A 1.02010013 1.2405038 0.8474916
3 bond AAA 1.65090747 1.0918564 0.6179858
4 equity AAA 0.08250275 0.5474595 0.9966172
5 option AAA 0.00000000 0.0000000 0.0000000
6 option BB 0.00000000 0.3219559 0.0000000
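As the output shows, the values of the assets data frame have been aggregated by the asset.class and rating columns. Negative market values count as zero because of pmax().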
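The formula form of aggregate() can be sketched on a small, fixed data set so that the result is reproducible. The data values and the names exposure and stats below are invented for illustration and are not part of the example above.

```r
# Hypothetical, deterministic data (not the random data used above)
assets <- data.frame(
  asset.class  = c("equity", "equity", "option", "option", "bond"),
  market.value = c(10, -4, 3, 5, -2)
)

# Formula interface: value to aggregate on the left, grouping on the right
exposure <- aggregate(market.value ~ asset.class, data = assets,
                      FUN = function(v) sum(pmax(v, 0)))
print(exposure)

# FUN may return a named vector; the results are then stored
# as a matrix column with one column per returned value
stats <- aggregate(market.value ~ asset.class, data = assets,
                   FUN = function(v) c(mean = mean(v), n = length(v)))
print(stats)
```

Returning a named vector from FUN is one way to get several summaries per group out of a single aggregate() call, at the cost of a matrix-valued column in the result.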
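Data munging with the plyr package
The plyr package is used to split, apply, and combine data. plyr is a set of tools for splitting a large data set into homogeneous pieces, applying a function to each piece, and combining all the results. These operations are already possible in base R, but plyr makes them easier because it offers:
- Fully consistent names, arguments, and outputs
- Convenient parallelization
- Input from and output to data frames, matrices, and lists
- A progress bar for tracking long-running operations
- Built-in informative error messages and error recovery
- Labels that are maintained across all transformations
The two functions discussed in this section are ddply() and llply(). For each subset of a given data frame, ddply() applies a function and then combines the results into a data frame.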
Syntax:
ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE,
.drop = TRUE, .parallel = FALSE, .paropts = NULL)
Parameters:
.data: the data frame to be processed
.variables: the variables by which to split the data frame
.fun: the function to apply to each piece
...: other arguments passed on to .fun
.progress: the name of the progress bar to use
.inform: whether to produce informative error messages
.drop: whether combinations of variables that do not appear in the input data should be dropped (TRUE) or preserved (FALSE)
.parallel: whether to apply the function in parallel
.paropts: a list of additional options used when running in parallel
Example:
R
# Using ddply()
library(plyr)
dfx <- data.frame(
group = c(rep('A', 8),
rep('B', 15),
rep('C', 6)),
sex = sample(c("M", "F"),
size = 29,
replace = TRUE),
age = runif(n = 29,
min = 18,
max = 54)
)
ddply(dfx, .(group, sex), summarize,
mean = round(mean(age), 2),
sd = round(sd(age), 2))
Output:
group sex mean sd
1 A F 41.00 9.19
2 A M 35.76 12.14
3 B F 34.75 11.70
4 B M 40.01 10.10
5 C F 25.13 10.37
6 C M 43.26 7.63
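For comparison, the split, apply, and combine steps that ddply() wraps can be written out in base R. This is a rough sketch of the idea on made-up data, not plyr's actual implementation; pieces, summaries, and result are illustrative names.

```r
# Hypothetical, deterministic data with a single grouping variable
dfx <- data.frame(
  group = c("A", "A", "B", "B", "B"),
  age   = c(20, 30, 40, 50, 60)
)

# Split: one data frame per group
pieces <- split(dfx, dfx$group)

# Apply: reduce each piece to a one-row summary
summaries <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1],
             mean  = round(mean(piece$age), 2),
             sd    = round(sd(piece$age), 2))
})

# Combine: stack the per-group rows into one data frame
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```

ddply() collapses these three steps into one call and handles multiple grouping variables and labels automatically.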
Next, we look at llply(). The llply() function applies a function to each element of a list and returns the combined results as a list.
Syntax:
llply(.data, .fun = NULL,
..., .progress = "none", .inform = FALSE,
.parallel = FALSE, .paropts = NULL)
Example:
R
# Using llply()
library(plyr)
x <- list(a = 1:10, beta = exp(-3:3),
logic = c(TRUE, FALSE,
FALSE, TRUE))
llply(x, mean)
llply(x, quantile, probs = 1:3 / 4)
Output:
$a
[1] 5.5
$beta
[1] 4.535125
$logic
[1] 0.5
$a
25% 50% 75%
3.25 5.50 7.75
$beta
25% 50% 75%
0.2516074 1.0000000 5.0536690
$logic
25% 50% 75%
0.0 0.5 1.0
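Data munging with the dplyr package
The dplyr package can be thought of as a grammar of data manipulation. It provides a consistent set of verbs that solve the most common data manipulation challenges:
- arrange() changes the order of the rows.
- filter() picks cases (rows) based on their values.
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- summarise() reduces multiple values to a single summary.
dplyr provides many more functions besides these. It is built on efficient backends, which reduces the time spent waiting for computations, and it is generally faster than the plyr package.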
Syntax:
arrange(.data, ..., .by_group = FALSE)
filter(.data, ...)
mutate(.data, ...)
select(.data, ...)
summarise(.data, ...)
Example:
R
# Using dplyr package
# Import the library
library(dplyr)
# Using arrange()
starwars %>%
arrange(desc(mass))
# Using filter()
starwars %>%
filter(species == "Droid")
# Using mutate()
starwars %>%
mutate(name,
bmi = mass / ((height / 100) ^ 2)) %>%
select(name:mass, bmi)
# Using select()
starwars %>%
select(name, ends_with("color"))
# Using summarise()
starwars %>% group_by(species) %>%
summarise(n = n(),
mass = mean(mass, na.rm = TRUE)) %>%
filter(n > 1)
Output:
> starwars %>% arrange(desc(mass))
# A tibble: 87 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films vehicles starships
1 Jabba D~ 175 1358 NA green-tan, ~ orange 600 hermap~ Nal Hutta Hutt
2 Grievous 216 159 none brown, white green, ye~ NA male Kalee Kaleesh
3 IG-88 200 140 none metal red 15 none NA Droid
4 Darth V~ 202 136 none white yellow 41.9 male Tatooine Human
5 Tarfful 234 136 brown brown blue NA male Kashyyyk Wookiee
6 Owen La~ 178 120 brown, grey light blue 52 male Tatooine Human
7 Bossk 190 113 none green red 53 male Trandosha Trando~
8 Chewbac~ 228 112 brown unknown blue 200 male Kashyyyk Wookiee
9 Jek Ton~ 180 110 brown fair blue NA male Bestine ~ Human
10 Dexter ~ 198 102 none brown yellow NA male Ojom Besali~
# ... with 77 more rows
> starwars %>% filter(species == "Droid")
# A tibble: 5 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films vehicles starships
1 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid
2 R2-D2 96 32 NA white, blue red 33 NA Naboo Droid
3 R5-D4 97 32 NA white, red red NA NA Tatooine Droid
4 IG-88 200 140 none metal red 15 none NA Droid
5 BB8 NA NA none none black NA none NA Droid
> starwars %>% mutate(name, bmi = mass / ((height / 100) ^ 2)) %>% select(name:mass, bmi)
# A tibble: 87 x 4
name height mass bmi
1 Luke Skywalker 172 77 26.0
2 C-3PO 167 75 26.9
3 R2-D2 96 32 34.7
4 Darth Vader 202 136 33.3
5 Leia Organa 150 49 21.8
6 Owen Lars 178 120 37.9
7 Beru Whitesun lars 165 75 27.5
8 R5-D4 97 32 34.0
9 Biggs Darklighter 183 84 25.1
10 Obi-Wan Kenobi 182 77 23.2
# ... with 77 more rows
> starwars %>% select(name, ends_with("color"))
# A tibble: 87 x 4
name hair_color skin_color eye_color
1 Luke Skywalker blond fair blue
2 C-3PO NA gold yellow
3 R2-D2 NA white, blue red
4 Darth Vader none white yellow
5 Leia Organa brown light brown
6 Owen Lars brown, grey light blue
7 Beru Whitesun lars brown light blue
8 R5-D4 NA white, red red
9 Biggs Darklighter black light brown
10 Obi-Wan Kenobi auburn, white fair blue-gray
# ... with 77 more rows
> starwars %>% group_by(species) %>%
+ summarise(n = n(),mass = mean(mass, na.rm = TRUE)) %>%
+ filter(n > 1)
# A tibble: 9 x 3
species n mass
1 Droid 5 69.8
2 Gungan 3 74
3 Human 35 82.8
4 Kaminoan 2 88
5 Mirialan 2 53.1
6 Twi'lek 2 55
7 Wookiee 2 124
8 Zabrak 2 80
9 NA 5 48
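The verbs above are usually chained into a single pipeline. The sketch below does this on a small hand-made data frame so the result is easy to verify by eye; the people data and its values are hypothetical and assume the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data, loosely modelled on the starwars columns
people <- data.frame(
  name    = c("Ann", "Bob", "Cy", "Dee"),
  species = c("Human", "Droid", "Human", "Droid"),
  height  = c(170, 100, 180, 200),   # centimetres
  mass    = c(60, 30, 90, 140)       # kilograms
)

result <- people %>%
  filter(height > 150) %>%                       # drops the 100 cm row
  mutate(bmi = mass / (height / 100)^2) %>%      # adds a derived column
  group_by(species) %>%
  summarise(n = n(), mean_bmi = mean(bmi)) %>%   # one row per species
  arrange(desc(mean_bmi))                        # largest mean BMI first

print(result)
```

Each verb takes a data frame and returns a data frame, which is what allows the steps to be combined with %>% in any order that makes sense for the task.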