Taking one-to-many differences between groups in tidy data
I have a tidy dataset similar to the heartrate example in the Introducing tidyr blog post, but with an additional "placebo" group under drug. It can be constructed like this:
library(dplyr)
library(tidyr)
messy <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
a = c(67, 80, 64),
b = c(56, 90, 50),
p = c(60, 70, 60) # this is the new 'placebo' drug
)
tidy <- messy %>%
gather(drug, heartrate, a:p)
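Assuming I start from the tidy data, my goal is to create a new variable called "diff.p" that holds the difference between the observation for each drug and the placebo observation for the same name. The result should look like this: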
tidy$diff.p <- c(7,10,4,-4,20,-10,0,0,0)
tidy
It seems that ave and/or mutate might be a good way to approach this (or perhaps building a new data frame?), but I need some additional guidance on best practice.
It looks like you could do this fairly easily with a second tidy data frame:
tidy2 <- messy %>%
mutate(a = a-p, b = b-p, p = 0) %>%
gather(drug, diff.p, a:p)
left_join(tidy, tidy2, by = c("name", "drug"))
# name drug heartrate diff.p
# 1 Wilbur a 67 7
# 2 Petunia a 80 10
# 3 Gregory a 64 4
# 4 Wilbur b 56 -4
# 5 Petunia b 90 20
# 6 Gregory b 50 -10
# 7 Wilbur p 60 0
# 8 Petunia p 70 0
# 9 Gregory p 60 0
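Since the question starts from the already tidy data, the same join idea can be sketched without going back to messy. This is only an illustration: the helper table placebo and the temporary column placebo_rate are names made up here, and the sketch assumes each name has exactly one placebo row.

```r
library(dplyr)

# Tidy data as in the question, built directly so the sketch is self-contained
tidy <- data.frame(
  name = rep(c("Wilbur", "Petunia", "Gregory"), times = 3),
  drug = rep(c("a", "b", "p"), each = 3),
  heartrate = c(67, 80, 64, 56, 90, 50, 60, 70, 60)
)

# Hypothetical helper table: one placebo heart rate per name
placebo <- tidy %>%
  filter(drug == "p") %>%
  select(name, placebo_rate = heartrate)

# Join the placebo value back onto every row, then take the difference
tidy_joined <- tidy %>%
  left_join(placebo, by = "name") %>%
  mutate(diff.p = heartrate - placebo_rate) %>%
  select(-placebo_rate)
```

If a name had no placebo row, the join would leave NA in diff.p for that name rather than raising an error, which may or may not be what you want.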
In a dplyr chain, you can group by name and then subtract heartrate[drug=="p"] from heartrate:
tidy = tidy %>% group_by(name) %>%
mutate(diff.p2 = heartrate - heartrate[drug=="p"])
# name drug heartrate diff.p diff.p2
# <fctr> <chr> <dbl> <dbl> <dbl>
# 1 Wilbur a 67 7 7
# 2 Petunia a 80 10 10
# 3 Gregory a 64 4 4
# 4 Wilbur b 56 -4 -4
# 5 Petunia b 90 20 20
# 6 Gregory b 50 -10 -10
# 7 Wilbur p 60 0 0
# 8 Petunia p 70 0 0
# 9 Gregory p 60 0 0
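As a side note, gather() is superseded by pivot_longer() in tidyr 1.0.0 and later. A minimal sketch of the same grouped subtraction with the newer function might look like the following; the result name tidy_long is made up here, and the sketch assumes one placebo row per name. Note that pivot_longer() returns the rows in a different order (all drugs for Wilbur first, then Petunia, and so on), which does not matter for the grouped mutate.

```r
library(dplyr)
library(tidyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50),
  p = c(60, 70, 60) # placebo
)

# Reshape to long format, then subtract each person's placebo value
tidy_long <- messy %>%
  pivot_longer(cols = a:p, names_to = "drug", values_to = "heartrate") %>%
  group_by(name) %>%
  mutate(diff.p = heartrate - heartrate[drug == "p"]) %>%
  ungroup() # drop the grouping so later steps are not grouped by accident
```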
Another option is data.table:
library(data.table)
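# Reshape to long format, then subtract the placebo value within each name;
# grouping with by = name avoids relying on row order and vector recycling
melt(setDT(messy), id.vars = "name", variable.name = "drug",
value.name = "heartrate")[, diff.p2 := heartrate - heartrate[drug=="p"], by = name][]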
# name drug heartrate diff.p2
#1: Wilbur a 67 7
#2: Petunia a 80 10
#3: Gregory a 64 4
#4: Wilbur b 56 -4
#5: Petunia b 90 20
#6: Gregory b 50 -10
#7: Wilbur p 60 0
#8: Petunia p 70 0
#9: Gregory p 60 0
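The question also mentions ave. For completeness, here is a rough base R sketch of that route. Because ave only passes the grouped vector to FUN, this version groups row indices instead of the heart rates so that drug can be looked up inside the function; that index trick is a choice made for this sketch, not something taken from the answers above, and it again assumes exactly one placebo row per name.

```r
# Tidy data as in the question, built directly so the sketch is self-contained
tidy <- data.frame(
  name = rep(c("Wilbur", "Petunia", "Gregory"), times = 3),
  drug = rep(c("a", "b", "p"), each = 3),
  heartrate = c(67, 80, 64, 56, 90, 50, 60, 70, 60)
)

# Group the row indices by name; within each group, subtract that
# person's placebo heart rate from all of their heart rates
tidy$diff.p <- ave(
  seq_len(nrow(tidy)), tidy$name,
  FUN = function(i) {
    hr <- tidy$heartrate[i]
    hr - hr[tidy$drug[i] == "p"]
  }
)
```

This keeps the original row order and needs no extra packages, at the cost of being less readable than the grouped mutate.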