Using two value columns in the spread() function in R

Question

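I recently posted a question asking how to reshape data from a long table to a wide table, and I then found that spread() is a very handy function for this. Now I need to take my previous post a step further.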

Suppose we have a table like this:

id1   |  id2 |  info  | action_time | action_comment  |
 1    | a    |  info1 |    time1    |        comment1 |
 1    | a    |  info1 |    time2    |        comment2 |
 1    | a    |  info1 |    time3    |        comment3 |
 2    | b    |  info2 |    time4    |        comment4 |
 2    | b    |  info2 |    time5    |        comment5 |

I want to change it into this:

id1   |  id2 |  info  |action_time 1|action_comment1 |action_time 2|action_comment2 |action_time 3|action_comment3  |
 1    | a    |  info1 |    time1    |      comment1  |    time2    |      comment2  |    time3    |      comment3   |
 2    | b    |  info2 |    time4    |      comment4  |    time5    |      comment5  |             |                 |

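So the difference between this question and my previous one is that I have added another column, which also needs to be reshaped.

I was thinking of using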

library(dplyr)
library(tidyr)

df %>% 
  group_by(id1) %>% 
  mutate(action_no = paste("action_time", row_number())) %>%
  spread(action_no, value = c(action_time, action_comment))

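But when I pass two columns to the value argument, it gives me an error message saying: Invalid column specification.

I really like the idea of manipulating data with the %>% operator like this, so I would like to know how to correct my code to achieve this.

Thank you very much for your help.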

Answer 1

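We can do this with the development version of data.table, whose dcast can take multiple value.var columns. Instructions for installing the development version are here

We convert the 'data.frame' to a 'data.table' (setDT(df)), create a sequence variable ('ind') within the grouping variables ('id1', 'id2', 'info'), and dcast from 'long' to 'wide' format, specifying value.var as 'action_time' and 'action_comment'.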

library(data.table)#v1.9.5+
setDT(df)[, ind:= 1:.N, .(id1, id2, info)]
dcast(df, id1 + id2 + info ~ ind,
      value.var=c('action_time', 'action_comment'), fill='')
 #    id1 id2  info 1_action_time 2_action_time 3_action_time 1_action_comment
 #1:   1   a info1         time1         time2         time3         comment1
 #2:   2   b info2         time4         time5                       comment4
 #   2_action_comment 3_action_comment
 #1:         comment2         comment3
 #2:         comment5

Or use reshape from base R. We create the sequence variable ('ind') with ave, then use reshape to change from 'long' to 'wide' format.

df$ind <- with(df, ave(seq_along(id1), id1, id2, info, FUN=seq_along))
reshape(df, idvar=c('id1', 'id2', 'info'),timevar='ind', direction='wide')
#  id1 id2  info action_time.1 action_comment.1 action_time.2 action_comment.2
#1   1   a info1         time1         comment1         time2         comment2
#4   2   b info2         time4         comment4         time5         comment5
#  action_time.3 action_comment.3
#1         time3         comment3
#4          <NA>             <NA>
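Both approaches above rely on a per-group row counter ('ind') before reshaping. As a minimal sketch of how the ave() call builds that counter, assuming the same grouping columns as in this answer (the small data frame is rebuilt here only so the snippet runs on its own):

```r
# Rebuild just the grouping columns of the example data
# (assumption: same values as the 'df' used in this answer).
df <- data.frame(
  id1  = c(1L, 1L, 1L, 2L, 2L),
  id2  = c("a", "a", "a", "b", "b"),
  info = c("info1", "info1", "info1", "info2", "info2"),
  stringsAsFactors = FALSE
)

# ave() applies FUN to the first argument within each combination of the
# grouping vectors and returns the results in the original row order.
# With FUN = seq_along this numbers the rows 1, 2, ... inside each group.
ind <- with(df, ave(seq_along(id1), id1, id2, info, FUN = seq_along))
ind
# values: 1 2 3 1 2
```

reshape() then uses this counter as timevar, which is why the wide columns end in .1, .2 and .3.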

Data

df <- structure(list(id1 = c(1L, 1L, 1L, 2L, 2L), id2 = c("a", "a", 
"a", "b", "b"), info = c("info1", "info1", "info1", "info2", 
"info2"), action_time = c("time1", "time2", "time3", "time4", 
"time5"), action_comment = c("comment1", "comment2", "comment3", 
"comment4", "comment5")), .Names = c("id1", "id2", "info", "action_time", 
"action_comment"), class = "data.frame", row.names = c(NA, -5L))

Answer 2

Try:

library(dplyr)
library(tidyr)

df %>%
  group_by(id1) %>%
  mutate(id = row_number()) %>%
  gather(key, value, -(id1:info), -id) %>%
  unite(id_key, id, key) %>%
  spread(id_key, value)

Which gives:

#Source: local data frame [2 x 9]

#  id1 id2  info 1_action_comment 1_action_time 2_action_comment 2_action_time 3_action_comment 3_action_time
#1   1   a info1         comment1         time1         comment2         time2         comment3         time3
#2   2   b info2         comment4         time4         comment5         time5               NA            NA
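On newer tidyr releases (1.0.0 and later), pivot_wider() accepts several columns in values_from, so the gather/unite step may not be needed. The following is a sketch under that version assumption rather than part of the approach above; the counter column name 'action_no' is taken from the question, and the data frame is rebuilt so the snippet is self-contained:

```r
library(dplyr)
library(tidyr)  # assumption: tidyr >= 1.0.0, which provides pivot_wider()

# Example data, same values as the 'df' used in the answers.
df <- data.frame(
  id1            = c(1L, 1L, 1L, 2L, 2L),
  id2            = c("a", "a", "a", "b", "b"),
  info           = c("info1", "info1", "info1", "info2", "info2"),
  action_time    = c("time1", "time2", "time3", "time4", "time5"),
  action_comment = c("comment1", "comment2", "comment3", "comment4", "comment5"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(id1, id2, info) %>%
  mutate(action_no = row_number()) %>%  # 1, 2, 3 within each id group
  ungroup() %>%
  pivot_wider(
    names_from  = action_no,
    values_from = c(action_time, action_comment)  # two value columns at once
  )

wide
```

With the default naming, the new columns are built as value column name, underscore, counter (for example action_time_1 and action_comment_1), and missing combinations are filled with NA.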

Answer 3

Not a direct solution, but it works

library(tidyr)
a = spread(df, action_comment, action_time); 
b = spread(df, action_time, action_comment); 

# dropping NAs and shifting the values to left row wise 
a[] = t(apply(a, 1, function(x) `length<-`(na.omit(x), length(x))))
b[] = t(apply(b, 1, function(x) `length<-`(na.omit(x), length(x))))

out = merge(a,b, by = c('id1','id2','info'))
out[, colSums(is.na(out)) != nrow(out)]

#  id1 id2  info comment1 comment2 comment3    time1    time2    time3
#1   1   a info1    time1    time2    time3 comment1 comment2 comment3
#2   2   b info2    time4    time5     <NA> comment4 comment5     <NA>
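The step that drops NAs and shifts values to the left is the least obvious part of the code above. A minimal sketch of what it does to a single row, using a made-up vector in place of a real row of 'a' or 'b':

```r
# Hypothetical row with gaps, standing in for one row of the spread() result.
x <- c("1", NA, "time1", NA, "time2")

# na.omit() removes the NAs; `length<-` then pads the shortened vector
# back to the original length with NA on the right.
shifted <- `length<-`(na.omit(x), length(x))

as.vector(shifted)
# values: "1" "time1" "time2" NA NA
```

apply(a, 1, ...) runs this on every row and t() turns the result back into row-wise layout, so the non-missing values end up packed into the leftmost columns.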
