Data Munging in R Programming

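Data munging is the general technique of transforming data from an unusable or erroneous form into a useful one. Without some degree of data munging, whether performed by a specialized user or by an automated system, data is not ready for downstream consumption. Basically, the process of cleaning data by hand is known as data munging. In R programming, the following tools are geared toward the data munging process: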

The apply() family
aggregate()
The dplyr package
The plyr package

Data Munging Using the apply() Family

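In R's apply() collection, the most basic function is apply() itself. Besides it, there are lapply(), sapply(), and tapply(). The whole apply() collection can be regarded as a substitute for loops. Of these, apply() is the most restrictive: it should be run on a matrix whose elements are all of the same type. If apply() is called on a data frame or any other kind of object, the function first coerces it to a matrix (or array) and then performs its operation. It is mainly used to avoid writing explicit loop constructs.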

Syntax:

apply(X, MARGIN, FUN, ...)

Parameters:

X: an array or matrix

MARGIN: 1 or 2, deciding where the function is applied [1: rows; 2: columns]

FUN: the function to apply

Example:

R

# Using apply()
# 1:10 is recycled to fill the 5 x 6 matrix column by column
m <- matrix(1:10,
            nrow = 5,
            ncol = 6)
m
# MARGIN = 2 applies sum() to each column
a_m <- apply(m, 2, sum)
a_m

Output:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6    1    6    1    6
[2,]    2    7    2    7    2    7
[3,]    3    8    3    8    3    8
[4,]    4    9    4    9    4    9
[5,]    5   10    5   10    5   10

[1] 15 40 15 40 15 40

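In the example above, we compute the sum of the elements column by column. The same call works unchanged on much larger matrices, so the desired output is easy to produce even for large amounts of data.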
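As a complement to the column-wise example, the following minimal sketch contrasts the two margins and illustrates the coercion to a matrix described earlier. The matrix and the data frame `df` are made-up illustrative values, not data from this article.

```r
# A small numeric matrix: 2 rows, 3 columns
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)   # MARGIN = 2: one result per column

# A data frame with mixed column types (hypothetical data)
df <- data.frame(id = c(1, 2), label = c("a", "b"))

# apply() coerces df to a matrix first; because one column is
# character, every cell in each row becomes a character string
row_types <- apply(df, 1, class)

print(row_sums)
print(col_sums)
print(row_types)
```

Because of this coercion, apply() is usually a poor fit for data frames with mixed column types; lapply() or sapply() over the columns tends to be safer there.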

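The lapply() function performs an operation on each element of a list and returns a list of results that has the same length as the input. The "l" in lapply() stands for list. Unlike apply(), lapply() does not need a margin argument.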

Syntax:

lapply(X, FUN, ...)

Parameters:

X: a list, a vector, or another object

FUN: the function to apply

Example:

R

# Using lapply()
movies <- c("SPIDERMAN", "BATMAN",
            "AVENGERS", "FROZEN")
movies
movies_lower <- lapply(movies,
                       tolower)
str(movies_lower)

Output:

[1] "SPIDERMAN" "BATMAN"    "AVENGERS"  "FROZEN"   

List of 4
 $ : chr "spiderman"
 $ : chr "batman"
 $ : chr "avengers"
 $ : chr "frozen"

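The sapply() function accepts any vector, object, or list and performs the same operation as lapply(). The two share the same basic syntax; the difference is that sapply() simplifies the result to a vector or matrix where possible instead of always returning a list.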
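A minimal sketch of that difference, reusing the movie titles from the lapply() example; the variable names are chosen here purely for illustration.

```r
movies <- c("SPIDERMAN", "BATMAN", "AVENGERS", "FROZEN")

# lapply() always returns a list, one element per input
as_list <- lapply(movies, tolower)

# sapply() simplifies the same result to a character vector,
# named after the input elements
as_vector <- sapply(movies, tolower)

# With a numeric result, sapply() yields a named numeric vector
name_lengths <- sapply(movies, nchar)

str(as_list)
print(as_vector)
print(name_lengths)
```

When a plain list is wanted regardless of the result shape, lapply() remains the more predictable choice.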

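The tapply() function computes a measure such as the mean, median, or maximum, or applies any other function, for each level of a factor variable. In effect, it splits a vector into subsets and then applies the function to each subset.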

Syntax:

tapply(X, INDEX, FUN = NULL, ...)

Parameters:

X: an object or vector

INDEX: a factor, or a list of factors, each of the same length as X

FUN: the function to apply

Example:

R

# Using tapply()
data(iris)
tapply(iris$Sepal.Width,
       iris$Species,
       median)

Output:

    setosa versicolor  virginica 
       3.4        2.8        3.0 
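Since the index can also be a list of factors, tapply() can cross-tabulate a statistic over two grouping variables. The sketch below assumes the built-in mtcars dataset, which is not used elsewhere in this article, and groups fuel economy by cylinder count and transmission type.

```r
data(mtcars)

# Mean mpg for every combination of cyl (4, 6, 8) and am (0, 1);
# the result is a matrix with one dimension per grouping factor
mpg_by_cyl_am <- tapply(mtcars$mpg,
                        list(cyl = mtcars$cyl, am = mtcars$am),
                        mean)

print(mpg_by_cyl_am)
```

Individual cells can then be read by level name, for example `mpg_by_cyl_am["4", "1"]`.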

Data Munging Using aggregate()

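In R, the aggregate() function combines or aggregates an input data frame by applying a function to each column of each sub-data frame. To perform an aggregation with aggregate(), we must supply the following:

The input data we wish to aggregate
The variable(s) in the data used for grouping
The function or calculation to apply

After applying the function, aggregate() returns a data frame with one row for each unique combination of grouping values found in the input data frame. Only a single function can be passed to aggregate(). To apply several functions per group, one option is the plyr package.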

Syntax:

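aggregate(formula, data, FUN, ...)

Parameters:

formula: a formula such as y ~ g, with the variable(s) to apply the function to on the left and the grouping variable(s) on the right.

data: the data frame that contains the variables named in the formula.
FUN: the function or calculation to be applied.

The example below uses the equivalent data frame form, aggregate(x, by, FUN), where x holds the columns to aggregate and by holds the grouping columns.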

Example:

R

# R program to illustrate
# the aggregate() function
assets <- data.frame(
  asset.class = c("equity", "equity",
                  "equity", "option",
                  "option", "option",
                  "bond", "bond"),
  rating = c("AAA", "A", "A",
             "AAA", "BB", "BB",
             "AAA", "A"),
  counterparty.a = c(runif(3), rnorm(5)),
  counterparty.b = c(runif(3), rnorm(5)),
  counterparty.c = c(runif(3), rnorm(5)))
assets
# Sum only the positive market values within each
# (asset.class, rating) group
exposures <- aggregate(
  x = assets[c("counterparty.a",
               "counterparty.b",
               "counterparty.c")],
  by = assets[c("asset.class", "rating")],
  FUN = function(market.values) {
    sum(pmax(market.values, 0))
  })
exposures

Output:

  asset.class rating counterparty.a counterparty.b counterparty.c
1      equity    AAA     0.08250275      0.5474595      0.9966172
2      equity      A     0.33931258      0.6442402      0.2348197
3      equity      A     0.68078755      0.5962635      0.6126720
4      option    AAA    -0.47624689     -0.4622881     -1.2362731
5      option     BB    -0.78860284      0.3219559     -1.2847157
6      option     BB    -0.59461727     -0.2840014     -0.5739735
7        bond    AAA     1.65090747      1.0918564      0.6179858
8        bond      A    -0.05402813      0.1602164      1.1098481

  asset.class rating counterparty.a counterparty.b counterparty.c
1        bond      A     0.00000000      0.1602164      1.1098481
2      equity      A     1.02010013      1.2405038      0.8474916
3        bond    AAA     1.65090747      1.0918564      0.6179858
4      equity    AAA     0.08250275      0.5474595      0.9966172
5      option    AAA     0.00000000      0.0000000      0.0000000
6      option     BB     0.00000000      0.3219559      0.0000000

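As we can see, in the example above the values of the assets data frame have been aggregated over the "asset.class" and "rating" columns. Negative market values count as zero because of pmax(), which is why some of the aggregated cells are exactly 0.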
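The example above uses the x/by form of aggregate(). For comparison, here is a minimal sketch of the formula form, assuming the built-in iris dataset; `species_means` is just an illustrative name.

```r
data(iris)

# Left of ~ : the columns to aggregate (combined with cbind())
# Right of ~: the grouping variable
species_means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                           data = iris,
                           FUN = mean)

print(species_means)
```

The result has one row per species and one column per aggregated variable, matching the shape produced by the x/by form.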

Data Munging Using the plyr Package

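The plyr package is used for splitting, applying, and combining data. It is a set of tools that splits large data into homogeneous pieces, applies a function to each piece, and finally combines all the results. These operations are already possible in base R, but plyr makes them easier because it offers:

Totally consistent names, arguments, and outputs
Convenient parallelization
Input from and output to data frames, matrices, and lists
A progress bar to keep track of long-running operations
Built-in informative error messages and error recovery
Labels that are maintained across all transformations

The two functions discussed in this section are ddply() and llply(). For each subset of a given data frame, ddply() applies a function and then combines the results into a data frame.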

Syntax:

ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE,
      .drop = TRUE, .parallel = FALSE, .paropts = NULL)

Parameters:

.data: the data frame to be processed

.variables: the variables by which the data frame is split

.fun: the function to apply to each piece

...: other arguments that are passed to .fun

.progress: name of the progress bar to use

.inform: whether to produce informative error messages

.drop: whether combinations of variables that do not appear in the input data frame should be dropped (TRUE) or preserved (FALSE)

.parallel: whether to apply the function in parallel

.paropts: a list of additional options passed on when running in parallel

Example:

R

# Using ddply()
library(plyr)
dfx <- data.frame(
  group = c(rep('A', 8),
            rep('B', 15),
            rep('C', 6)),
  sex = sample(c("M", "F"),
               size = 29,
               replace = TRUE),
  age = runif(n = 29,
              min = 18,
              max = 54)
)
 
ddply(dfx, .(group, sex), summarize,
      mean = round(mean(age), 2),
      sd = round(sd(age), 2))

Output:

  group sex  mean    sd
1     A   F 41.00  9.19
2     A   M 35.76 12.14
3     B   F 34.75 11.70
4     B   M 40.01 10.10
5     C   F 25.13 10.37
6     C   M 43.26  7.63

Now we will see how to use llply() for data munging. The llply() function applies a function to each element of a list, and the combined result is also a list.

Syntax:

llply(.data, .fun = NULL,
      ..., .progress = "none", .inform = FALSE,
      .parallel = FALSE, .paropts = NULL)

Example:

R

# Using llply()
library(plyr)
x <- list(a = 1:10, beta = exp(-3:3),
          logic = c(TRUE, FALSE,
                    FALSE, TRUE))
llply(x, mean)
llply(x, quantile, probs = 1:3 / 4)

Output:

$a
[1] 5.5

$beta
[1] 4.535125

$logic
[1] 0.5

$a
 25%  50%  75% 
3.25 5.50 7.75 

$beta
      25%       50%       75% 
0.2516074 1.0000000 5.0536690 

$logic
25% 50% 75% 
0.0 0.5 1.0

Data Munging Using the dplyr Package

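The dplyr package can be regarded as a grammar of data manipulation. It provides a consistent set of verbs that help solve the most common data manipulation challenges:

arrange() changes the ordering of the rows.
filter() picks cases based on their values.
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
summarise() reduces multiple values down to a single summary.

dplyr offers many more functions beyond these. It uses efficient backends, which reduces the time spent waiting for computations, and it is generally faster than the plyr package.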

Syntax:

arrange(.data, ..., .by_group = FALSE)

filter(.data, ...)

mutate(.data, ...)

select(.data, ...)

summarise(.data, ...)

Example:

R

# Using dplyr package
 
# Import the library
library(dplyr)
 
# Using arrange()
starwars %>%
    arrange(desc(mass))
 
# Using filter()
starwars %>%
    filter(species == "Droid")
 
# Using mutate()
starwars %>%
    mutate(name,
    bmi = mass / ((height / 100)  ^ 2)) %>%
    select(name:mass, bmi)
 
# Using select()
starwars %>%
    select(name, ends_with("color"))
 
# Using summarise()
starwars %>% group_by(species) %>%
  summarise(n = n(),
  mass = mean(mass, na.rm = TRUE)) %>%
  filter(n > 1)

Output:

> starwars %>% arrange(desc(mass))
# A tibble: 87 x 13
   name     height  mass hair_color  skin_color   eye_color  birth_year gender  homeworld species films vehicles starships
                                                       
 1 Jabba D~    175  1358 NA          green-tan, ~ orange          600   hermap~ Nal Hutta Hutt    
 2 Grievous    216   159 none        brown, white green, ye~       NA   male    Kalee     Kaleesh 
 3 IG-88       200   140 none        metal        red              15   none    NA        Droid   
 4 Darth V~    202   136 none        white        yellow           41.9 male    Tatooine  Human   
 5 Tarfful     234   136 brown       brown        blue             NA   male    Kashyyyk  Wookiee 
 6 Owen La~    178   120 brown, grey light        blue             52   male    Tatooine  Human   
 7 Bossk       190   113 none        green        red              53   male    Trandosha Trando~ 
 8 Chewbac~    228   112 brown       unknown      blue            200   male    Kashyyyk  Wookiee 
 9 Jek Ton~    180   110 brown       fair         blue             NA   male    Bestine ~ Human   
10 Dexter ~    198   102 none        brown        yellow           NA   male    Ojom      Besali~ 
# ... with 77 more rows

> starwars %>% filter(species == "Droid")
# A tibble: 5 x 13
  name  height  mass hair_color skin_color  eye_color birth_year gender homeworld species films     vehicles  starships
                                                   
1 C-3PO    167    75 NA         gold        yellow           112 NA     Tatooine  Droid     
2 R2-D2     96    32 NA         white, blue red               33 NA     Naboo     Droid     
3 R5-D4     97    32 NA         white, red  red               NA NA     Tatooine  Droid     
4 IG-88    200   140 none       metal       red               15 none   NA        Droid     
5 BB8       NA    NA none       none        black             NA none   NA        Droid     

> starwars %>% mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>% select(name:mass, bmi)
# A tibble: 87 x 4
   name               height  mass   bmi
   <chr>               <int> <dbl> <dbl>
 1 Luke Skywalker        172    77  26.0
 2 C-3PO                 167    75  26.9
 3 R2-D2                  96    32  34.7
 4 Darth Vader           202   136  33.3
 5 Leia Organa           150    49  21.8
 6 Owen Lars             178   120  37.9
 7 Beru Whitesun lars    165    75  27.5
 8 R5-D4                  97    32  34.0
 9 Biggs Darklighter     183    84  25.1
10 Obi-Wan Kenobi        182    77  23.2
# ... with 77 more rows

> starwars %>% select(name, ends_with("color"))
# A tibble: 87 x 4
   name               hair_color    skin_color  eye_color
   <chr>              <chr>         <chr>       <chr>    
 1 Luke Skywalker     blond         fair        blue     
 2 C-3PO              NA            gold        yellow   
 3 R2-D2              NA            white, blue red      
 4 Darth Vader        none          white       yellow   
 5 Leia Organa        brown         light       brown    
 6 Owen Lars          brown, grey   light       blue     
 7 Beru Whitesun lars brown         light       blue     
 8 R5-D4              NA            white, red  red      
 9 Biggs Darklighter  black         light       brown    
10 Obi-Wan Kenobi     auburn, white fair        blue-gray
# ... with 77 more rows

> starwars %>% group_by(species) %>% 
+   summarise(n = n(),mass = mean(mass, na.rm = TRUE)) %>%
+   filter(n > 1)
# A tibble: 9 x 3
  species      n  mass
  <chr>    <int> <dbl>
1 Droid        5  69.8
2 Gungan       3  74  
3 Human       35  82.8
4 Kaminoan     2  88  
5 Mirialan     2  53.1
6 Twi'lek      2  55  
7 Wookiee      2 124  
8 Zabrak       2  80  
9 NA           5  48
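The verbs shown above are often chained into a single pipeline. The following is a minimal sketch of such a chain, assuming dplyr is installed and reusing its starwars dataset; the column name `mean_bmi` and the result name `bmi_by_species` are illustrative choices.

```r
library(dplyr)

bmi_by_species <- starwars %>%
  # keep only characters with both measurements available
  filter(!is.na(mass), !is.na(height)) %>%
  # add a derived column from existing ones
  mutate(bmi = mass / ((height / 100) ^ 2)) %>%
  # collapse each species to one summary row
  group_by(species) %>%
  summarise(n = n(),
            mean_bmi = mean(bmi)) %>%
  # keep species with more than one character, highest BMI first
  filter(n > 1) %>%
  arrange(desc(mean_bmi))

print(bmi_by_species)
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations are carried out.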