📜  tidyr Package in R Programming

📅  Last modified: 2022-05-13 01:55:18.314000             🧑  Author: Mango

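Packages in the R language are collections of R functions, compiled code, and sample data. They are stored under a directory called "library" in the R environment. By default, R installs a set of packages during installation. One of the most important packages in R is the tidyr package. Its sole purpose is to simplify the process of creating tidy data. Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse: every variable is a column, every observation is a row, and every value is a cell. If you make sure your data is tidy, you will spend less time fighting with the tools and more time on your analysis.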

Installation

To use a package in R programming, it must be installed first. This can be done with the command install.packages("packagename"). To install the whole tidyverse, type:

install.packages("tidyverse")

Alternatively, to install only the tidyr package, type:

install.packages("tidyr")

To install the development version from GitHub, type:

# install.packages("devtools")
devtools::install_github("tidyverse/tidyr")
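
Before loading the package, it can be handy to check whether it is already installed. The following is a minimal sketch using base R's requireNamespace(); the helper name ensure_package is hypothetical and not part of tidyr.

```r
# Hypothetical helper: install a package only if it is not yet available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE if the package can be loaded after the (optional) install
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("tidyr")

# report the installed version
packageVersion("tidyr")
```

Checking first avoids reinstalling the package every time a script is run.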

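Important Verb Functions in the tidyr Package

Dataset:

Before going into the important verb functions, let us first prepare the dataset. Define a data frame tidy_dataframe that contains data about the frequency of people in particular groups.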

R
# load the tidyr package
library(tidyr)
  
n = 10
# creating a data frame
tidy_dataframe = data.frame(
                      S.No = c(1:n), 
                    Group.1 = c(23, 345, 76, 212, 88, 
                                199, 72, 35, 90, 265),
                    Group.2 = c(117, 89, 66, 334, 90, 
                               101, 178, 233, 45, 200),
                    Group.3 = c(29, 101, 239, 289, 176,
                                320, 89, 109, 199, 56))
  
# print the elements of the data frame
tidy_dataframe


Output:

   S.No Group.1 Group.2 Group.3
1     1      23     117      29
2     2     345      89     101
3     3      76      66     239
4     4     212     334     289
5     5      88      90     176
6     6     199     101     320
7     7      72     178      89
8     8      35     233     109
9     9      90      45     199
10   10     265     200      56

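The tidyr package provides various important functions that can be used for data cleaning. They are:

  • gather() function: It takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. In other words, it makes "wide" data longer.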

Parameter     Description
data          The data frame.
key, value    The names of the new key and value columns, as strings or as symbols.
...           The selection of columns. If left empty, all variables are selected. You can supply bare variable names, select all variables between x and z with x:z, or exclude y with -y.
na.rm         If TRUE, rows where the value column is NA are removed from the output.
convert       If TRUE, type.convert() is run automatically on the key column. This is useful if the column types are actually numeric, integer, or logical.
factor_key    If FALSE (the default), the key values are stored as a character vector. If TRUE, they are stored as a factor, which preserves the original ordering of the columns.

Example:

For a better understanding, we will now use the gather() function to make our data longer.

R

# using gather() function on tidy_dataframe
long <- tidy_dataframe %>% 
            gather(Group, Frequency,
                   Group.1:Group.3)
  
# print the data frame in a long format
long

Output:

   S.No   Group Frequency
1     1 Group.1        23
2     2 Group.1       345
3     3 Group.1        76
4     4 Group.1       212
5     5 Group.1        88
6     6 Group.1       199
7     7 Group.1        72
8     8 Group.1        35
9     9 Group.1        90
10   10 Group.1       265
11    1 Group.2       117
12    2 Group.2        89
13    3 Group.2        66
14    4 Group.2       334
15    5 Group.2        90
16    6 Group.2       101
17    7 Group.2       178
18    8 Group.2       233
19    9 Group.2        45
20   10 Group.2       200
21    1 Group.3        29
22    2 Group.3       101
23    3 Group.3       239
24    4 Group.3       289
25    5 Group.3       176
26    6 Group.3       320
27    7 Group.3        89
28    8 Group.3       109
29    9 Group.3       199
30   10 Group.3        56
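
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a sketch (assuming tidyr >= 1.0.0 is installed, and using a smaller illustrative data frame named wide_df that is not part of the original example), the same kind of reshaping could be written as:

```r
library(tidyr)

# small illustrative data frame (first rows of the dataset above)
wide_df <- data.frame(S.No = 1:3,
                      Group.1 = c(23, 345, 76),
                      Group.2 = c(117, 89, 66))

# pivot_longer() collapses the Group.* columns into name/value pairs
long_df <- wide_df %>%
  pivot_longer(cols = Group.1:Group.2,
               names_to = "Group",
               values_to = "Frequency")

long_df
```

Unlike gather(), pivot_longer() keeps the rows belonging to one S.No together, so the row order of the result differs.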

  • separate() function: It turns a single character column into multiple columns, which makes long data wider.

Parameter     Description
data          A data frame.
col           Column name or position.
into          Names of new variables to create, as a character vector. Use NA to omit the variable in the output.
sep           The separator between the columns.
remove        If TRUE, the input column is removed from the output data frame.
convert       If TRUE, type.convert() with as.is = TRUE is run on the new columns.

Example:

The long dataset created with gather() is already usable, but we can break the Group variable down further with separate().

R

# import tidyr package
library(tidyr)
long <- tidy_dataframe %>%
            gather(Group, Frequency,
                   Group.1:Group.3)
  
# use separate() function to make data wider
separate_data <- long %>% 
            separate(Group, c("Allotment", 
                              "Number"))
  
# print the wider format
separate_data

Output:

   S.No Allotment Number Frequency
1     1     Group      1        23
2     2     Group      1       345
3     3     Group      1        76
4     4     Group      1       212
5     5     Group      1        88
6     6     Group      1       199
7     7     Group      1        72
8     8     Group      1        35
9     9     Group      1        90
10   10     Group      1       265
11    1     Group      2       117
12    2     Group      2        89
13    3     Group      2        66
14    4     Group      2       334
15    5     Group      2        90
16    6     Group      2       101
17    7     Group      2       178
18    8     Group      2       233
19    9     Group      2        45
20   10     Group      2       200
21    1     Group      3        29
22    2     Group      3       101
23    3     Group      3       239
24    4     Group      3       289
25    5     Group      3       176
26    6     Group      3       320
27    7     Group      3        89
28    8     Group      3       109
29    9     Group      3       199
30   10     Group      3        56
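
The example above relies on the default separator. As a hedged sketch of the sep and convert parameters from the table, the snippet below passes the separator explicitly and converts the new column; the data frame codes is made up for illustration.

```r
library(tidyr)

# illustrative data, not part of the original example
codes <- data.frame(Group = c("Group.1", "Group.2", "Group.3"),
                    Frequency = c(23, 117, 29))

# sep is interpreted as a regular expression, so the dot is escaped;
# convert = TRUE turns the new Number column into an integer
parts <- codes %>%
  separate(Group, into = c("Allotment", "Number"),
           sep = "\\.", convert = TRUE)

str(parts)
```

Without convert = TRUE, the Number column would stay a character column.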
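
  • unite() function: It merges two or more columns into one. unite() is a convenience function that pastes the values of several variables together; in essence, it combines two variables of a single observation into one variable.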

Parameter     Description
data          A data frame.
col           The name of the new column.
...           A selection of desired columns. If empty, all variables are selected.
sep           A separator to use between values.
remove        If TRUE, the input columns are removed from the output data frame.

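Example:

unite() is the complement of separate(). To undo separate(), we can use unite(), which merges two variables into one. Here we merge the two columns Allotment and Number using the separator ".".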

R

# import tidyr package 
library(tidyr)
  
long <- tidy_dataframe %>%
            gather(Group, Frequency, 
                   Group.1:Group.3)
  
# use separate() function to make data wider
separate_data <- long %>% 
            separate(Group, c("Allotment",
                              "Number"))
  
# use unite() function to glue 
# Allotment and Number columns
unite_data <- separate_data %>%
            unite(Group, Allotment, 
                  Number, sep = ".")
  
# print the new data frame
unite_data

Output:

   S.No   Group Frequency
1     1 Group.1        23
2     2 Group.1       345
3     3 Group.1        76
4     4 Group.1       212
5     5 Group.1        88
6     6 Group.1       199
7     7 Group.1        72
8     8 Group.1        35
9     9 Group.1        90
10   10 Group.1       265
11    1 Group.2       117
12    2 Group.2        89
13    3 Group.2        66
14    4 Group.2       334
15    5 Group.2        90
16    6 Group.2       101
17    7 Group.2       178
18    8 Group.2       233
19    9 Group.2        45
20   10 Group.2       200
21    1 Group.3        29
22    2 Group.3       101
23    3 Group.3       239
24    4 Group.3       289
25    5 Group.3       176
26    6 Group.3       320
27    7 Group.3        89
28    8 Group.3       109
29    9 Group.3       199
30   10 Group.3        56
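
By default unite() drops the columns it combines. As a small sketch of the remove parameter from the table above (the data frame parts is illustrative only), the input columns can be kept alongside the new one:

```r
library(tidyr)

# illustrative data, not part of the original example
parts <- data.frame(Allotment = c("Group", "Group"),
                    Number = c(1, 2),
                    Frequency = c(23, 117))

# remove = FALSE keeps Allotment and Number next to the new Group column
united <- parts %>%
  unite(Group, Allotment, Number, sep = ".", remove = FALSE)

united
```

Keeping the inputs can be useful when the separate parts are still needed for filtering or sorting.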

  • spread() function: It reshapes a longer format into a wider one. The spread() function spreads a key-value pair across multiple columns.

Parameter     Description
data          A data frame.
key           Column name or position.
value         Column name or position.
fill          If set, missing values are replaced with this value.
convert       If TRUE, type.convert() with as.is = TRUE is run on each of the new columns.

Example:

We can use the spread() function to convert the data from long back to wide.

R

# import tidyr package 
library(tidyr)
  
long <- tidy_dataframe %>% 
            gather(Group, Frequency,
                   Group.1:Group.3)
  
# use separate() function to make data wider
separate_data <- long %>% 
            separate(Group, c("Allotment", 
                              "Number"))
  
# use unite() function to glue
# Allotment and Number columns
unite_data <- separate_data %>% 
            unite(Group, Allotment,
                  Number, sep = ".")
  
# use spread() function to make data wider
back_to_wide <- unite_data %>% 
            spread(Group, Frequency)
  
# print the new data frame
back_to_wide

Output:

   S.No Group.1 Group.2 Group.3
1     1      23     117      29
2     2     345      89     101
3     3      76      66     239
4     4     212     334     289
5     5      88      90     176
6     6     199     101     320
7     7      72     178      89
8     8      35     233     109
9     9      90      45     199
10   10     265     200      56
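
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the equivalent call, assuming tidyr >= 1.0.0 and using a small illustrative data frame long_df, might look like this:

```r
library(tidyr)

# illustrative long data, not part of the original example
long_df <- data.frame(S.No = c(1, 2, 1, 2),
                      Group = c("Group.1", "Group.1", "Group.2", "Group.2"),
                      Frequency = c(23, 345, 117, 89))

# names_from plays the role of key, values_from the role of value
wide_df <- long_df %>%
  pivot_wider(names_from = Group, values_from = Frequency)

wide_df
```

The result has one row per S.No and one column per distinct value of Group.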
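
  • nest() function: It creates a list-column of data frames containing all the nested variables. Nesting is implicitly a summarising operation, which is useful in combination with other summaries that work on whole datasets, most notably models.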

Parameter     Description
data          A data frame.
...           A selection of columns. If empty, all variables are selected.
.key          The name of the new column, as a string or symbol.

Example: Let us nest the Group.2 column of the tidy_dataframe we created in the dataset section.

R

# import tidyr package
library(tidyr)
  
df <- tidy_dataframe
  
# nest column Group.2 in 
# tidy_dataframe using nest()
df %>% nest(data = c(Group.2))

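Output:

# A tibble: 10 x 4
    S.No Group.1 Group.3 data            
   <int>   <dbl>   <dbl> <list>          
 1     1      23      29 <tibble [1 x 1]>
 2     2     345     101 <tibble [1 x 1]>
 3     3      76     239 <tibble [1 x 1]>
 4     4     212     289 <tibble [1 x 1]>
 5     5      88     176 <tibble [1 x 1]>
 6     6     199     320 <tibble [1 x 1]>
 7     7      72      89 <tibble [1 x 1]>
 8     8      35     109 <tibble [1 x 1]>
 9     9      90     199 <tibble [1 x 1]>
10    10     265      56 <tibble [1 x 1]>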
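
Nesting becomes more useful when several rows share the same values in the columns that are not nested. As a sketch using the built-in iris data (the variable name by_species is made up for illustration), everything except Species can be nested so that each species gets one row:

```r
library(tidyr)

# keep Species as the grouping column and nest all other columns
by_species <- iris %>% nest(data = -Species)

by_species                      # one row per species

# each element of the list-column is an ordinary data frame
head(by_species$data[[1]], 3)
```

Each nested data frame can then be passed as a whole to a summary or modelling function.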
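
  • unnest() function: It essentially reverses the nest operation and makes each element of a list its own row. It can handle list-columns that contain atomic vectors, lists, or data frames (but not a mixture of the different types).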

Parameter     Description
data          A data frame.
...           Specification of columns to unnest. If omitted, defaults to all list-columns.
.drop         Should additional list-columns be dropped? By default, they are dropped if unnesting the specified columns requires the rows to be duplicated.
.id           Data frame identifier.
.sep          If non-NULL, the names of unnested data frame columns combine the name of the original list-column with the names from the nested data frame, separated by .sep.
.preserve     List-columns to preserve in the output. These are duplicated in the same way as atomic vectors.

Note: in tidyr 1.0.0 and later, .drop, .id, .sep and .preserve are deprecated.

Example:

We will nest and then unnest the Species column of the built-in iris data frame.

R

# import the tidyr package
library(tidyr)
  
df <- iris
names(df)
  
# nest the Species column of the 
# df data frame using nest()
nested <- df %>% nest(data = c(Species))
head(nested)                     # Output (i)
  
# unnest() reverses the operation and 
# restores the Species column
head(nested %>% unnest(data))    # Output (ii)

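Output (i):

# A tibble: 6 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width data            
         <dbl>       <dbl>        <dbl>       <dbl> <list>          
1          5.1         3.5          1.4         0.2 <tibble [1 x 1]>
2          4.9         3            1.4         0.2 <tibble [1 x 1]>
3          4.7         3.2          1.3         0.2 <tibble [1 x 1]>
4          4.6         3.1          1.5         0.2 <tibble [1 x 1]>
5          5           3.6          1.4         0.2 <tibble [1 x 1]>
6          5.4         3.9          1.7         0.4 <tibble [1 x 1]>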

Output (ii):

# A tibble: 6 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 
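
unnest() also works on list-columns that hold plain vectors rather than data frames. The following sketch uses a made-up tibble df_list to show how each vector element becomes its own row:

```r
library(tidyr)

# illustrative tibble with a list-column of integer vectors
df_list <- tibble(id = 1:2,
                  vals = list(1:3, 4:5))

# every element of vals gets its own row; id is repeated as needed
flat <- df_list %>% unnest(vals)
flat
```

The two input rows expand to five rows, three for id 1 and two for id 2.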
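
  • fill() function: It fills missing values in the selected columns using the previous entry. This is useful for the common output format in which values are not repeated and are recorded only when they change. Missing values are replaced in atomic vectors; NULLs are replaced in lists.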

Parameter     Description
data          A data frame.
...           A selection of columns. If empty, nothing happens.
.direction    Direction in which to fill missing values. Currently, either "down" (the default) or "up".

Example:

R

# import the tidyr package
library(tidyr)
df <- data.frame(Month = 1:6, 
                 Year = c(2000, rep(NA, 5)))
  
# print the df data frame
df                   # Output (i)
  
# use fill() to fill missing values in 
# Year column in df data frame
df %>% fill(Year)    # Output (ii)

Output (i):

  Month Year
1     1 2000
2     2   NA
3     3   NA
4     4   NA
5     5   NA
6     6   NA

Output (ii):

  Month Year
1     1 2000
2     2 2000
3     3 2000
4     4 2000
5     5 2000
6     6 2000
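
The fill direction is controlled by the .direction argument. As a brief sketch with an illustrative data frame df_up, the two directions give different results when the known value sits in the middle of the column:

```r
library(tidyr)

# illustrative data: the only known year is in the third row
df_up <- data.frame(Month = 1:4,
                    Year = c(NA, NA, 2001, NA))

# "up" copies a value into the missing entries above it
filled_up <- df_up %>% fill(Year, .direction = "up")

# "down" (the default) copies a value into the missing entries below it
filled_down <- df_up %>% fill(Year, .direction = "down")

filled_up
filled_down
```

Entries that have no value to copy from in the chosen direction stay NA.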

  • full_seq() function: It fills in the values that should have been observed in a vector but were not. The vector must be numeric.

Parameter     Description
x             A numeric vector.
period        Gap between each observation.
tol           Numerical tolerance for checking periodicity.

Example:

R

# import the tidyr package
library(tidyr)
  
# creating a numeric vector
num_vec <- c(1, 7, 9, 14, 19, 20)
  
# use full_seq() to fill missing
# values in num_vec
full_seq(num_vec, 1)

Output:

[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
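
The period argument must match the spacing of the observed values. The sketch below (the vector odd_vec and the fallback message are made up for illustration) uses a period of 2 and shows that a vector which does not fit the period is rejected with an error:

```r
library(tidyr)

# observed values are all a multiple of 2 apart from the minimum
odd_vec <- c(1, 5, 9)
full_seq(odd_vec, 2)

# 4 does not lie on the sequence 1, 3, 5, ... so full_seq() raises an error;
# tryCatch() turns it into a message for demonstration purposes
result <- tryCatch(full_seq(c(1, 4, 9), 2),
                   error = function(e) "not a regular sequence")
result
```

The first call fills in the missing values 3 and 7.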

  • drop_na() function: This function drops rows that contain missing values.

Parameter     Description
data          A data frame.
...           A selection of columns. If empty, all variables are selected.

Example:

R

# import tidyr package
library(tidyr)
  
# create a tibble df with missing values
df <- tibble(S.No = c(1:10),
             Name = c('John', 'Smith', 'Peter',
                      'Luke', 'King', rep(NA, 5)))
  
# print df tibble
df                    # Output (i)
  
# use drop_na() to drop rows 
# in df with missing values 
df %>% drop_na(Name)  # Output (ii)

Output (i):

# A tibble: 10 x 2
    S.No Name 
   <int> <chr>
 1     1 John 
 2     2 Smith
 3     3 Peter
 4     4 Luke 
 5     5 King 
 6     6 <NA> 
 7     7 <NA> 
 8     8 <NA> 
 9     9 <NA> 
10    10 <NA> 

Output (ii):

# A tibble: 5 x 2
   S.No Name 
  <int> <chr>
1     1 John 
2     2 Smith
3     3 Peter
4     4 Luke 
5     5 King 
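
As the parameter table notes, leaving the column selection empty makes drop_na() look at every column. A short sketch with a made-up tibble scores contrasts the two forms:

```r
library(tidyr)

# illustrative tibble with missing values in two different columns
scores <- tibble(id = 1:4,
                 name = c("A", NA, "C", "D"),
                 score = c(10, 20, NA, 40))

# no columns given: a row is dropped if any column is NA
all_complete <- scores %>% drop_na()

# only the name column is checked
name_only <- scores %>% drop_na(name)

all_complete
name_only
```

Restricting the check to specific columns keeps rows whose missing values are in columns you do not care about.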

  • replace_na() function: It replaces missing values.

Parameter     Description
data          A data frame or vector.
replace       If data is a data frame, a named list giving the replacement value for each column. If data is a vector, a single replacement value.

replace_na() returns a data frame if data is a data frame, and a vector if data is a vector.

Example:

R

# import tidyr package
library(tidyr)
  
df <- data.frame(S.No = c(1:10),
                 Name = c('John', 'Smith', 
                          'Peter', 'Luke',
                          'King', rep(NA, 5)))
  
df                                      # Output (i)
  
# use replace_na() to replace missing values or na
df %>% replace_na(list(Name = 'Henry')) # Output (ii)

Output (i):

   S.No  Name
1     1  John
2     2 Smith
3     3 Peter
4     4  Luke
5     5  King
6     6  <NA>
7     7  <NA>
8     8  <NA>
9     9  <NA>
10   10  <NA>

Output (ii):

   S.No  Name
1     1  John
2     2 Smith
3     3 Peter
4     4  Luke
5     5  King
6     6 Henry
7     7 Henry
8     8 Henry
9     9 Henry
10   10 Henry
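
replace_na() can also be applied to a plain vector, and a single call can cover several columns of a data frame. The following sketch uses made-up data (the names scores and filled are illustrative):

```r
library(tidyr)

# on a vector, replace is a single value
replace_na(c(1, NA, 3), 0)

# on a data frame, replace is a named list with one entry per column
scores <- data.frame(name = c("A", NA), score = c(NA, 5))
filled <- scores %>% replace_na(list(name = "unknown", score = 0))

filled
```

Columns that are not named in the list are left unchanged.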