How can I spread repeated measures of multiple variables into wide format?
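I'm trying to take columns that are in long format and spread them to wide format, as shown below. I'd like to use tidyr, since it is one of the data manipulation tools I'm investing in, but to make this question more general, please provide other solutions as well.
Here's what I have: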
library(dplyr); library(tidyr)
set.seed(10)
# tibble() replaces the deprecated data_frame()
dat <- tibble(
  Person = rep(c("greg", "sally", "sue"), each = 2),
  Time = rep(c("Pre", "Post"), 3),
  Score1 = round(rnorm(6, mean = 80, sd = 4), 0),
  Score2 = round(jitter(Score1, 15), 0),
  Score3 = 5 + (Score1 + Score2)/2
)
## Person Time Score1 Score2 Score3
## 1 greg Pre 80 78 84.0
## 2 greg Post 79 80 84.5
## 3 sally Pre 75 74 79.5
## 4 sally Post 78 78 83.0
## 5 sue Pre 81 78 84.5
## 6 sue Post 82 81 86.5
Desired wide format:
Person Pre.Score1 Pre.Score2 Pre.Score3 Post.Score1 Post.Score2 Post.Score3
1 greg 80 78 84.0 79 80 84.5
2 sally 75 74 79.5 78 78 83.0
3 sue 81 78 84.5 82 81 86.5
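I can get there by doing something like this for each score:
spread(dat %>% select(Person, Time, Score1), Time, Score1) %>%
  rename(Score1_Pre = Pre, Score1_Post = Post)
and then combining the pieces with the _join functions,
but that seems verbose, and there has to be a better way.
Related questions: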
Is it possible to use spread on multiple columns in tidyr similar to dcast?
Using reshape2:
library(reshape2)
dcast(melt(dat), Person ~ Time + variable)
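This produces the following (the values do not match the question's table, presumably because the data came from a different random draw):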
Using Person, Time as id variables
Person Post_Score1 Post_Score2 Post_Score3 Pre_Score1 Pre_Score2 Pre_Score3
1 greg 79 78 83.5 83 81 87.0
2 sally 82 81 86.5 75 74 79.5
3 sue 78 78 83.0 82 79 85.5
Using dcast from the data.table package:
library(data.table)  # v1.9.5+
dcast(setDT(dat), Person~Time, value.var=paste0("Score", 1:3))
# Person Score1_Post Score1_Pre Score2_Post Score2_Pre Score3_Post Score3_Pre
#1: greg 79 80 80 78 84.5 84.0
#2: sally 78 75 78 74 83.0 79.5
#3: sue 82 81 81 78 86.5 84.5
Or reshape from base R:
reshape(as.data.frame(dat), idvar='Person', timevar='Time',direction='wide')
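The base R call above names its columns variable-first (for example Score1.Pre), while the question asks for time-first names such as Pre.Score1. As a minimal sketch of one way to close that gap, the code below swaps the two name parts with a regular expression and then reorders the columns. The data is typed in from the question's printed table rather than regenerated, and the regular expression assumes the column names follow the Score<number>.<Pre|Post> pattern.

```r
# Question's data, hard-coded from the printed table (an assumption made
# here so the example does not depend on random draws)
dat <- data.frame(
  Person = rep(c("greg", "sally", "sue"), each = 2),
  Time   = rep(c("Pre", "Post"), 3),
  Score1 = c(80, 79, 75, 78, 81, 82),
  Score2 = c(78, 80, 74, 78, 78, 81),
  Score3 = c(84.0, 84.5, 79.5, 83.0, 84.5, 86.5),
  stringsAsFactors = FALSE
)

wide <- reshape(dat, idvar = "Person", timevar = "Time", direction = "wide")

# reshape() builds names as "<variable>.<time>", e.g. "Score1.Pre";
# swap the two parts to get "<time>.<variable>"
names(wide) <- sub("^(Score[0-9]+)\\.(Pre|Post)$", "\\2.\\1", names(wide))

# Group all Pre columns before all Post columns, as in the desired layout
wide <- wide[, c("Person",
                 paste0("Pre.Score", 1:3),
                 paste0("Post.Score", 1:3))]
rownames(wide) <- NULL
wide
```

Selecting the columns by name in the last step also fixes their order, so the result no longer depends on the order in which reshape() emits them.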
Update
Starting with the development version tidyr_0.8.3.9000 (CRAN version tidyr_1.0.0), we can use pivot_wider with multiple value columns:
library(tidyr)
library(stringr)
dat %>%
  pivot_wider(names_from = Time, values_from = str_c("Score", 1:3))
# A tibble: 3 x 7
# Person Score1_Pre Score1_Post Score2_Pre Score2_Post Score3_Pre Score3_Post
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 greg 80 79 78 80 84 84.5
#2 sally 75 78 74 78 79.5 83
#3 sue 81 82 78 81 84.5 86.5
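Edit: I'm updating this answer because pivot_wider has been around for a while now and addresses both this question and the issues raised in the comments. You can now do
pivot_wider(
  dat,
  id_cols = 'Person',
  names_from = 'Time',
  values_from = c('Score1', 'Score2', 'Score3'),
  names_glue = '{Time}.{.value}'
)
to get the desired result.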
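One detail worth checking is column order. With several values_from columns, pivot_wider by default alternates the time points within each score (Pre.Score1, Post.Score1, Pre.Score2, and so on), whereas the question's table lists all Pre columns first. The following is a minimal sketch that reorders the result by name; the data is typed in from the question's printed table rather than regenerated, and the variable name wanted is just an illustrative choice.

```r
library(tidyr)

# Question's data, hard-coded from the printed table (an assumption made
# here so the example does not depend on random draws)
dat <- data.frame(
  Person = rep(c("greg", "sally", "sue"), each = 2),
  Time   = rep(c("Pre", "Post"), 3),
  Score1 = c(80, 79, 75, 78, 81, 82),
  Score2 = c(78, 80, 74, 78, 78, 81),
  Score3 = c(84.0, 84.5, 79.5, 83.0, 84.5, 86.5),
  stringsAsFactors = FALSE
)

wide <- pivot_wider(
  dat,
  id_cols = "Person",
  names_from = "Time",
  values_from = c("Score1", "Score2", "Score3"),
  names_glue = "{Time}.{.value}"  # .value stands for the value column name
)

# Index by name to put all Pre columns before all Post columns
wanted <- c("Person", paste0("Pre.Score", 1:3), paste0("Post.Score", 1:3))
wide <- wide[, wanted]
wide
```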
The original answer was
dat %>%
  gather(temp, score, starts_with("Score")) %>%
  unite(temp1, Time, temp, sep = ".") %>%
  spread(temp1, score)
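Since gather and spread have been superseded in tidyr by pivot_longer and pivot_wider, the same three-step idea (lengthen, unite the names, widen) can also be written with the newer functions. This is a sketch under the assumption that tidyr 1.0.0 or later is installed; the data is typed in from the question's printed table, and the intermediate names long, temp, temp1, and score simply mirror the original answer.

```r
library(tidyr)

# Question's data, hard-coded from the printed table (an assumption made
# here so the example does not depend on random draws)
dat <- data.frame(
  Person = rep(c("greg", "sally", "sue"), each = 2),
  Time   = rep(c("Pre", "Post"), 3),
  Score1 = c(80, 79, 75, 78, 81, 82),
  Score2 = c(78, 80, 74, 78, 78, 81),
  Score3 = c(84.0, 84.5, 79.5, 83.0, 84.5, 86.5),
  stringsAsFactors = FALSE
)

# Step 1: stack the three score columns into one name/value pair
long <- pivot_longer(dat, cols = starts_with("Score"),
                     names_to = "temp", values_to = "score")

# Step 2: combine time point and score name into a single key, e.g. "Pre.Score1"
long <- unite(long, "temp1", Time, temp, sep = ".")

# Step 3: spread that key back out into columns
wide <- pivot_wider(long, names_from = temp1, values_from = score)
wide
```

Because pivot_wider creates columns in order of first appearance, this version lists the Pre columns before the Post columns without any extra reordering.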
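I ran a benchmark for myself and am posting it here in case anyone is interested:
Code
The setup is taken from the OP: three variables and two time points. However, the size of the data frame varies from 1,000 to 100,000 rows.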
library(magrittr)
library(data.table)
library(bench)

# tidyr: gather, unite the names, spread
f1 <- function(dat) {
  tidyr::gather(dat, key = "key", value = "value", -Person, -Time) %>%
    tidyr::unite("id", Time, key, sep = ".") %>%
    tidyr::spread(id, value)
}

# reshape2: melt is namespaced explicitly so that it is not confused with
# the data.table generic attached above
f2 <- function(dat) {
  reshape2::dcast(reshape2::melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}

# data.table: dat is a data.table here, so melt/dcast use its methods
f3 <- function(dat) {
  dcast(melt(dat, id.vars = c("Person", "Time")), Person ~ Time + variable)
}

# Build a test data frame with roughly `rows` rows (two time points per person)
create_df <- function(rows) {
  dat <- expand.grid(Person = factor(1:ceiling(rows/2)),
                     Time = c("1Pre", "2Post"))
  dat$Score1 <- round(rnorm(nrow(dat), mean = 80, sd = 4), 0)
  dat$Score2 <- round(jitter(dat$Score1, 15), 0)
  dat$Score3 <- 5 + (dat$Score1 + dat$Score2)/2
  return(dat)
}
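Results
As you can see, reshape2 is a little faster than tidyr, probably because tidyr has more overhead. More importantly, data.table pulls clearly ahead above 10,000 rows: at 100,000 rows its median time is 58.36ms, versus 176.31ms for reshape2 and 205.27ms for tidyr.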
press(
  rows = 10^(3:5),
  {
    dat <- create_df(rows)
    dat2 <- copy(dat)
    setDT(dat2)
    bench::mark(tidyr = f1(dat),
                reshape2 = f2(dat),
                datatable = f3(dat2),
                check = function(x, y) all.equal(x, y, check.attributes = FALSE),
                min_iterations = 20
    )
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 9 x 11
#> expression rows min mean median max `itr/sec` mem_alloc
#> <chr> <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 tidyr 1000 5.7ms 6.13ms 6.02ms 10.06ms 163. 2.78MB
#> 2 reshape2 1000 2.82ms 3.09ms 2.97ms 8.67ms 323. 1.7MB
#> 3 datatable 1000 3.82ms 4ms 3.92ms 8.06ms 250. 2.78MB
#> 4 tidyr 10000 19.31ms 20.34ms 19.95ms 22.98ms 49.2 8.24MB
#> 5 reshape2 10000 13.81ms 14.4ms 14.4ms 15.6ms 69.4 11.34MB
#> 6 datatable 10000 14.56ms 15.16ms 14.91ms 18.93ms 66.0 2.98MB
#> 7 tidyr 100000 197.24ms 219.69ms 205.27ms 268.92ms 4.55 90.55MB
#> 8 reshape2 100000 164.02ms 195.32ms 176.31ms 284.77ms 5.12 121.69MB
#> 9 datatable 100000 51.31ms 60.34ms 58.36ms 113.69ms 16.6 27.36MB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>
Created on 2019-02-27 by the reprex package (v0.2.1)
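To put the 100,000-row result in relative terms, the short sketch below divides each median time from the table above by the data.table median. The numbers are copied from the benchmark output; the vector names median_ms and speedup are illustrative only.

```r
# Median run times in milliseconds at 100,000 rows, copied from the
# benchmark table above
median_ms <- c(tidyr = 205.27, reshape2 = 176.31, datatable = 58.36)

# Time of each approach relative to data.table
speedup <- round(median_ms / median_ms[["datatable"]], 1)
speedup
#>     tidyr  reshape2 datatable
#>       3.5       3.0       1.0
```

So on this machine and data, data.table was roughly 3 to 3.5 times faster than the other two approaches at the largest size; results on other hardware or package versions may differ.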