Pivot rows into columns with counts for each measurement in R
I have a sample data frame that I am working with:
ID <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
TARG_AVG <- c(2.1,2.1,2.1,2.1,2.1,2.1,2.3,2.3,2.5,2.5,2.5,2.5,3.1,3.1,3.1,3.1,3.3,3.3,3.3,3.3,3.5,3.5)
Measurement <- c("Len","Len","Len","Wid","Ht","Ht","Dep","Brt","Ht","Ht","Dep","Dep"
,"Dep","Dep","Len","Len","Ht","Ht","Brt","Brt","Wid","Wid")
df1 <- data.frame(ID,TARG_AVG,Measurement)
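I am trying to solve 3 different problems here.
1) I want a summary of the unique measurements grouped by (ID & TARG_AVG). I currently do this with summaryBy from the doBy package: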
unique <- summaryBy(Measurement~ID+TARG_AVG, data=df1, FUN=function(x) { c(Count=length(x)) } )
This gives me the total (Measurement.Count), but I also want the count for each measurement. My desired output is:
  ID TARG_AVG Len Wid Ht Dep Brt Measurement.Count
1  A      2.1   3   1  2   0   0                 6
2  A      2.3   0   0  0   1   1                 2
3  A      2.5   0   0  2   2   0                 4
4  B      3.1   2   0  0   2   0                 4
5  B      3.3   0   0  2   0   2                 4
6  B      3.5   0   2  0   0   0                 2
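2) Once I have the above output, I want to subset the rows so that the filtered output returns only the rows that have at least 2 measurements > 2. My desired output here is:
  ID TARG_AVG Len Wid Ht Dep Brt Measurement.Count
1  A      2.1   3   1  2   0   0                 6
3  A      2.5   0   0  2   2   0                 4
4  B      3.1   2   0  0   2   0                 4
5  B      3.3   0   0  2   0   2                 4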
3) Finally, I want to pivot the columns back into rows, keeping only the measurements > 2. My desired output here is:
   ID TARG_AVG Measurement
1   A      2.1         Len
2   A      2.1         Len
3   A      2.1         Len
4   A      2.1          Ht
5   A      2.1          Ht
6   A      2.5          Ht
7   A      2.5          Ht
8   A      2.5         Dep
9   A      2.5         Dep
10  B      3.1         Len
11  B      3.1         Len
12  B      3.1         Dep
13  B      3.1         Dep
14  B      3.3          Ht
15  B      3.3          Ht
16  B      3.3         Brt
17  B      3.3         Brt
I am currently learning the reshape2, dplyr, and data.table packages, so it would be very useful if someone could point me in the right direction to solve this.
In this case you don't need tidyr. You only need dplyr:
library(dplyr)
df2 <- df1 %>%
  group_by(ID, TARG_AVG) %>%  # Group by ID and TARG_AVG
  mutate(count = n()) %>%     # Count the rows in each combination of ID and TARG_AVG
  filter(count > 2) %>%       # Only keep the groups with more than 2 rows (I think you meant > 2)
  select(-count)              # Remove the auxiliary variable count
df2
A shorter (though somewhat less readable) version is:
df2 <- df1 %>%
  group_by(ID, TARG_AVG) %>%
  filter(n() > 2)
df2
In this case I used the n() function directly instead of generating the auxiliary count variable.
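As a quick check of what filtering on n() does, here is a minimal sketch. It rebuilds the question's sample data compactly with rep() (my own shorthand, not the asker's code) and shows that n() counts all rows in an (ID, TARG_AVG) group, so whole groups are kept or dropped regardless of how often each individual measurement occurs.

```r
library(dplyr)

# Sample data from the question, rebuilt compactly with rep()
df1 <- data.frame(
  ID = rep(c("A", "B"), times = c(12, 10)),
  TARG_AVG = rep(c(2.1, 2.3, 2.5, 3.1, 3.3, 3.5), times = c(6, 2, 4, 4, 4, 2)),
  Measurement = c("Len", "Len", "Len", "Wid", "Ht", "Ht", "Dep", "Brt",
                  "Ht", "Ht", "Dep", "Dep", "Dep", "Dep", "Len", "Len",
                  "Ht", "Ht", "Brt", "Brt", "Wid", "Wid")
)

# One row per (ID, TARG_AVG) group with its total number of rows
group_sizes <- df1 %>% count(ID, TARG_AVG, name = "count")

# Groups with more than 2 rows are kept whole, rare measurements included
kept <- df1 %>%
  group_by(ID, TARG_AVG) %>%
  filter(n() > 2) %>%
  ungroup()

nrow(kept)                      # 18
sum(kept$Measurement == "Wid")  # 1: the single Wid row in group (A, 2.1) survives
```

The desired output in the question has 17 rows and no Wid row, so a per-measurement count is still needed if that exact result is the goal.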
Edit: if you really want all three steps with dplyr and tidyr, you can do this:
ID <- c("A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
TARG_AVG <- c(2.1,2.1,2.1,2.1,2.1,2.1,2.3,2.3,2.5,2.5,2.5,2.5,3.1,3.1,3.1,3.1,3.3,3.3,3.3,3.3,3.5,3.5)
Measurement <- c("Len","Len","Len","Wid","Ht","Ht","Dep","Brt","Ht","Ht","Dep","Dep"
,"Dep","Dep","Len","Len","Ht","Ht","Brt","Brt","Wid","Wid")
df0 <- data.frame(ID,TARG_AVG,Measurement)
Steps 1 and 2. Summarise, count, and filter by the number and distribution of measurements
library(dplyr)
library(tidyr)
df1 <- df0 %>%
  group_by(ID, TARG_AVG, Measurement) %>%
  summarise(count = n()) %>%
  group_by(ID, TARG_AVG) %>%                  # Step "2"
  filter(n() >= 2) %>%                        # Step "2": at least 2 distinct measurements
  spread(Measurement, count, fill = 0) %>%    # Resume step "1"
  mutate(Measurement.count = Len + Wid + Ht + Dep + Brt)
df1
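For comparison, a sketch of the same two steps with pivot_wider(), which newer tidyr versions offer in place of spread(). The filter here is my reading of the question, based on its desired output: keep a row when at least two measurement columns have a count of 2 or more. The names wide and filtered are illustrative only.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt compactly with rep()
df0 <- data.frame(
  ID = rep(c("A", "B"), times = c(12, 10)),
  TARG_AVG = rep(c(2.1, 2.3, 2.5, 3.1, 3.3, 3.5), times = c(6, 2, 4, 4, 4, 2)),
  Measurement = c("Len", "Len", "Len", "Wid", "Ht", "Ht", "Dep", "Brt",
                  "Ht", "Ht", "Dep", "Dep", "Dep", "Dep", "Len", "Len",
                  "Ht", "Ht", "Brt", "Brt", "Wid", "Wid")
)

# Step 1: count each measurement per group, then spread the counts into columns
wide <- df0 %>%
  count(ID, TARG_AVG, Measurement, name = "count") %>%
  pivot_wider(names_from = Measurement, values_from = count, values_fill = 0L) %>%
  mutate(Measurement.Count = Len + Wid + Ht + Dep + Brt)

# Step 2 (assumed interpretation): at least 2 measurements with a count >= 2
filtered <- wide %>%
  filter((Len >= 2) + (Wid >= 2) + (Ht >= 2) + (Dep >= 2) + (Brt >= 2) >= 2)

filtered
```

With the sample data this keeps the groups (A, 2.1), (A, 2.5), (B, 3.1) and (B, 3.3), which matches the rows listed in the question's second desired output.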
Step 3. Reshape again
df3 <- df1 %>%                                # df1 is the wide table built in steps 1 and 2
  select(-Measurement.count) %>%
  gather(Measurement, dummy, Brt:Wid) %>%
  select(-dummy)
df3
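Reshaping to long form alone gives one row per measurement column, including the zero counts. If the goal is the question's third output, where each measurement is repeated as many times as it was counted, one possible sketch uses pivot_longer() followed by uncount() from tidyr. The wide table below is typed in by hand from the question's second desired output, and the column name cnt is my own choice.

```r
library(dplyr)
library(tidyr)

# Wide count table copied from the question's filtered output (step 2)
wide <- data.frame(
  ID = c("A", "A", "B", "B"),
  TARG_AVG = c(2.1, 2.5, 3.1, 3.3),
  Len = c(3, 0, 2, 0),
  Wid = c(1, 0, 0, 0),
  Ht  = c(2, 2, 0, 2),
  Dep = c(0, 2, 2, 0),
  Brt = c(0, 0, 0, 2)
)

long <- wide %>%
  pivot_longer(cols = c(Len, Wid, Ht, Dep, Brt),
               names_to = "Measurement", values_to = "cnt") %>%
  filter(cnt >= 2) %>%  # drop zero counts and measurements seen only once
  uncount(cnt)          # repeat each row cnt times and drop the cnt column

nrow(long)  # 17
```

The threshold cnt >= 2 is an assumption taken from the desired output, where measurements that occur exactly twice are kept.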
Latest solution
library(data.table) #v 1.9.6+
setDT(df1)[, indx := .N, by = names(df1)
][indx > 1, if(uniqueN(Measurement) > 1) .SD, by = .(ID, TARG_AVG)]
#     ID TARG_AVG Measurement indx
#  1:  A      2.1         Len    3
#  2:  A      2.1         Len    3
#  3:  A      2.1         Len    3
#  4:  A      2.1          Ht    2
#  5:  A      2.1          Ht    2
#  6:  A      2.5          Ht    2
#  7:  A      2.5          Ht    2
#  8:  A      2.5         Dep    2
#  9:  A      2.5         Dep    2
# 10:  B      3.1         Dep    2
# 11:  B      3.1         Dep    2
# 12:  B      3.1         Len    2
# 13:  B      3.1         Len    2
# 14:  B      3.3          Ht    2
# 15:  B      3.3          Ht    2
# 16:  B      3.3         Brt    2
# 17:  B      3.3         Brt    2
Or the dplyr equivalent:
library(dplyr)
df1 %>%
  group_by(ID, TARG_AVG, Measurement) %>%
  filter(n() > 1) %>%
  group_by(ID, TARG_AVG) %>%
  filter(n_distinct(Measurement) > 1)
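The same two-stage filter can also be sketched in base R with ave(), which may help to see what each stage removes. This is an illustrative alternative rather than part of the answer above; the names n_same, step1, n_kinds and res are made up for the example.

```r
# Sample data from the question, rebuilt compactly with rep()
df1 <- data.frame(
  ID = rep(c("A", "B"), times = c(12, 10)),
  TARG_AVG = rep(c(2.1, 2.3, 2.5, 3.1, 3.3, 3.5), times = c(6, 2, 4, 4, 4, 2)),
  Measurement = c("Len", "Len", "Len", "Wid", "Ht", "Ht", "Dep", "Brt",
                  "Ht", "Ht", "Dep", "Dep", "Dep", "Dep", "Len", "Len",
                  "Ht", "Ht", "Brt", "Brt", "Wid", "Wid")
)

# Stage 1: drop measurements that occur only once within their (ID, TARG_AVG) group
n_same <- ave(seq_len(nrow(df1)), df1$ID, df1$TARG_AVG, df1$Measurement,
              FUN = length)
step1 <- df1[n_same > 1, ]

# Stage 2: keep groups that still contain more than one distinct measurement
n_kinds <- ave(as.integer(factor(step1$Measurement)), step1$ID, step1$TARG_AVG,
               FUN = function(x) length(unique(x)))
res <- step1[n_kinds > 1, ]

nrow(step1)  # 19
nrow(res)    # 17
```

Stage 1 removes the single Wid, Dep and Brt rows; stage 2 then removes group (B, 3.5), which is left with only Wid.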
Old solution
library(data.table)
## dcast the data (no need in total)
res <- dcast(df1, ID + TARG_AVG ~ Measurement)
## filter by at least 2 incidents of at least length 2
res <- res[rowSums(res[-(1:2)] > 1) > 1,]
## melt the data back and filter again by at least 2 incidents
res <- melt(setDT(res), id = 1:2)[value > 1]
## Expand the data back
res[, .SD[rep(.I, value)]]
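The last line expands each row according to its count by repeating row indices. A minimal base R sketch of that idea, on a tiny hand-made long table (the values are taken from group (A, 2.1) in the question; the object names are illustrative):

```r
# Long table with one row per measurement and its count
long <- data.frame(
  ID = c("A", "A"),
  TARG_AVG = c(2.1, 2.1),
  variable = c("Len", "Ht"),
  value = c(3, 2)
)

# Repeat each row index `value` times, then drop the count column
idx <- rep(seq_len(nrow(long)), times = long$value)
expanded <- long[idx, c("ID", "TARG_AVG", "variable")]
rownames(expanded) <- NULL

expanded$variable  # "Len" "Len" "Len" "Ht" "Ht"
```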
Solution to the original question
Here is a possible solution using reshape2.
Step 1
library(reshape2)
res <- dcast(df1, ID + TARG_AVG ~ Measurement, margins = "Measurement")
Step 2
res <- res[res$"(all)" > 2,]
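For readers without reshape2, steps 1 and 2 can be approximated in base R with table(). This is a sketch under my own naming (key, counts, wide, kept); unlike dcast, it stores the group identity in the row names as "ID.TARG_AVG" rather than in separate columns.

```r
# Sample data from the question, rebuilt compactly with rep()
df1 <- data.frame(
  ID = rep(c("A", "B"), times = c(12, 10)),
  TARG_AVG = rep(c(2.1, 2.3, 2.5, 3.1, 3.3, 3.5), times = c(6, 2, 4, 4, 4, 2)),
  Measurement = c("Len", "Len", "Len", "Wid", "Ht", "Ht", "Dep", "Brt",
                  "Ht", "Ht", "Dep", "Dep", "Dep", "Dep", "Len", "Len",
                  "Ht", "Ht", "Brt", "Brt", "Wid", "Wid")
)

# Step 1: cross-tabulate each (ID, TARG_AVG) group against Measurement
key <- interaction(df1$ID, df1$TARG_AVG, drop = TRUE)
counts <- table(key, df1$Measurement)
wide <- as.data.frame.matrix(counts)
wide$Measurement.Count <- rowSums(counts)  # total per group

# Step 2: keep groups with more than 2 measurements in total
kept <- wide[wide$Measurement.Count > 2, ]
kept
```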
Step 3
library(data.table)
setDT(df1)[, if(.N > 2) .SD, by = .(ID, TARG_AVG)]
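Here is a data.table solution that may be a bit faster. I have found that subsetting in j with by can be somewhat slow compared with splitting the task into two steps: [1] add extra columns that can be used for filtering (done here with :=), [2] do a one-shot filter (without by):
library(data.table)
# cTbl is not defined in the original post; it is assumed here to be a
# data.table built from the question's ID, TARG_AVG and Measurement vectors
cTbl <- data.table(ID, TARG_AVG, Measurement)
cTbl[, N := .N, .(ID, TARG_AVG, Measurement)
   ][N > 1, NMgt1 := uniqueN(Measurement) > 1, .(ID, TARG_AVG)
   ][N > 1 & NMgt1
   ][, c('N', 'NMgt1') := NULL
   ][]
#     ID TARG_AVG Measurement
#  1:  A      2.1         Len
#  2:  A      2.1         Len
#  3:  A      2.1         Len
#  4:  A      2.1          Ht
#  5:  A      2.1          Ht
#  6:  A      2.5          Ht
#  7:  A      2.5          Ht
#  8:  A      2.5         Dep
#  9:  A      2.5         Dep
# 10:  B      3.1         Dep
# 11:  B      3.1         Dep
# 12:  B      3.1         Len
# 13:  B      3.1         Len
# 14:  B      3.3          Ht
# 15:  B      3.3          Ht
# 16:  B      3.3         Brt
# 17:  B      3.3         Brt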