Difference between gather, reshape, cast, etc.

Question

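What is the difference between gather, reshape, cast, and similar functions? I know they all help convert between long and wide data, but I'm having trouble using them. The documentation tends to use terms like "id" variables and "time" variables, but I'm not sure which is which.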

I have a data frame like this:

data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

I'm trying to reformat it so it looks like this:

res <- data.frame(A = 1:10,
                  B = 11:20,
                  C = 21:30)

How can I do this most easily? Any tips? I know this is an "easy" question, but I'm stumped. Thanks in advance.

Answer 1

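The tidyr package is a replacement for the reshape and reshape2 packages.

Accordingly, the tidyr functions spread() and gather() replace reshape2::dcast() and reshape2::melt(), respectively.

To spread the data as requested, you need to add another column that specifies the row number in the output data frame, as shown below.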

# Add a row index so spread() knows which output row each value belongs to
data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30,
                   row = c(1:10, 1:10, 1:10))

library(tidyr)
data %>% spread(., id, val)

...and the output:

> data %>% spread(.,id,val)
   row  A  B  C
1    1  1 11 21
2    2  2 12 22
3    3  3 13 23
4    4  4 14 24
5    5  5 15 25
6    6  6 16 26
7    7  7 17 27
8    8  8 18 28
9    9  9 19 29
10  10 10 20 30
>

To drop the row variable, load the dplyr package and use select() to remove the unneeded column.

library(tidyr)
library(dplyr)
data %>% spread(.,id,val) %>% select(-row)

...and the output:

> data %>% spread(.,id,val) %>% select(-row)
    A  B  C
1   1 11 21
2   2 12 22
3   3 13 23
4   4 14 24
5   5 15 25
6   6 16 26
7   7 17 27
8   8 18 28
9   9 19 29
10 10 20 30
>
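
For completeness, gather() goes in the opposite direction (wide to long), which is the job melt() does in reshape2. A minimal sketch, assuming the wide data frame res from the question; the column names id and val are simply chosen here to match the original long data:

```r
library(tidyr)

# Wide data frame from the question
res <- data.frame(A = 1:10,
                  B = 11:20,
                  C = 21:30)

# Stack columns A, B and C: the old column names go into `id`,
# the cell values go into `val`
long <- gather(res, key = "id", value = "val", A, B, C)

head(long, 3)
```

In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are the recommended successors to gather() and spread(), although the older functions still work.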

Answer 2

Please use the search function before posting. This has been asked many times here on SO!

In the tidyverse you can do:

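library(dplyr)
library(tidyr)

data %>%
    group_by(id) %>%
    mutate(n = 1:n()) %>%   # index the rows within each id
    ungroup() %>%
    spread(id, val) %>%     # one column per id
    select(-n)              # drop the helper index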
## A tibble: 10 x 3
#       A     B     C
#   <int> <int> <int>
# 1     1    11    21
# 2     2    12    22
# 3     3    13    23
# 4     4    14    24
# 5     5    15    25
# 6     6    16    26
# 7     7    17    27
# 8     8    18    28
# 9     9    19    29
#10    10    20    30

Comment: I suggest running the commands above line by line to see what each one does. Also note that

data %>%
    spread(id, val)

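will produce an error, because without an index column id alone does not identify a unique output row for each value (see @neilfws's explanation in the comments).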
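
To see why the row index matters, the failure can be reproduced directly. A small sketch, wrapping the call in tryCatch() so the script keeps running; the exact wording of the error message depends on the tidyr version, so it is printed rather than assumed:

```r
library(tidyr)

data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

# Without an index column, id alone does not say which output row
# each val belongs to, so spread() refuses to guess
result <- tryCatch(spread(data, id, val),
                   error = function(e) e)

inherits(result, "error")   # did the call fail?
conditionMessage(result)    # the message reported by tidyr
```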

Answer 3

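All of these functions do basically the same thing: they convert a data set from wide format to long format, or vice versa. The difference lies in how they go about it.

The reshape function is the base R approach; it has been around forever. I find it cumbersome (I need to check the examples every time I use it), but it is perfectly functional.

If you start in wide format, a simple example of converting to long format looks like this: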

df_long <- reshape(df_wide,
  direction = "long",
  ids = 1:nrow(df_wide), # required, but not very informative
  times = colnames(df_wide), # required - the factor labels for the variable differentiating a measurement from column 2 versus column 3
  varying = 1:ncol(df_wide), # required - specify which columns need to be switched to long format
  v.names = "measurement", # optional - the name for the variable which will contain all the values of the variables being converted to long format
  timevar = "times" # optional - the name for the variable containing the factor (with levels defined in the times argument)
)
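
As a concrete, runnable wide-to-long call in the same shape as the template above, here is a sketch that uses the wide data frame res from the question in place of df_wide. The names measurement and times are just the illustrative labels from the template:

```r
# Wide data frame from the question
res <- data.frame(A = 1:10,
                  B = 11:20,
                  C = 21:30)

res_long <- reshape(res,
  direction = "long",
  ids = 1:nrow(res),        # one id per original row
  times = colnames(res),    # "A", "B", "C" become the labels
  varying = 1:ncol(res),    # every column is stacked
  v.names = "measurement",  # column that receives the values
  timevar = "times"         # column that receives the labels
)

str(res_long)
```

The result should have one row per original cell, with the old column name in times, the value in measurement, and the original row number in id.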

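Going from long format to wide works similarly: set direction = "wide", the required arguments become optional, and the optional arguments (timevar, idvar, and v.names) become required. (In theory, R can sometimes infer some of these variables, but I have never had any luck with that. I treat them all as required, whether they are or not.)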
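
For the long-to-wide direction, here is a minimal base R sketch applied to the question's data. It assumes a helper column row (not in the original data) that numbers the observations within each id, and it strips the val. prefix that reshape() puts on the new column names:

```r
data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

# Number the observations within each id (1..10 for A, B and C)
data$row <- ave(data$val, data$id, FUN = seq_along)

wide <- reshape(data,
  direction = "wide",
  idvar = "row",     # identifies one output row
  timevar = "id",    # its values become the new columns
  v.names = "val"    # the values spread across those columns
)

# reshape() names the new columns val.A, val.B, val.C
names(wide) <- sub("^val\\.", "", names(wide))
wide$row <- NULL     # drop the helper index
wide
```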

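The gather/spread functions are a much simpler alternative. One big difference: they are two commands rather than one, so you don't have to worry about which arguments are relevant to each. I see at least two other answers have popped up describing how these functions work, so I won't repeat what they said.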
