Counting occurrences of a variable without counting duplicates
I have a large data frame named data with 1,004,490 observations, and I want to analyze whether a treatment was successful.
ID POSITIONS TREATMENT
1 0 A
1 1 A
1 2 B
2 0 C
2 1 D
3 0 B
3 1 B
3 2 C
3 3 A
3 4 A
3 5 B
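First, I want to count how many patients (ID) each treatment was applied to, but a treatment can be applied to the same ID several times. Do I need to remove all the duplicates first and count afterwards, or is there a function that ignores the duplicates?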
What I want to have :
A : 2
B : 2
C : 2
D : 1
Then I want to know how many times each treatment was given at the last position, but the last position is different for each ID.
What I want to have :
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)
Thanks for your help, I am a new R user!
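Using base R, we can do,
# first aggregate: number of distinct IDs per treatment
# second aggregate: IDs whose last row has that treatment
merge(aggregate(ID ~ TREATMENT, df1, FUN = function(i) length(unique(i))),
      aggregate(ID ~ TREATMENT, df1[!duplicated(df1$ID, fromLast = TRUE), ], toString),
      by = 'TREATMENT', all = TRUE)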
which gives,
  TREATMENT ID.x ID.y
1         A    2 <NA>
2         B    2 1, 3
3         C    2 <NA>
4         D    1    2
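The merged table above lists the IDs rather than the counts asked for in the question. As a minimal sketch (not part of the original answer), the two counts could also be obtained with table() in base R. It assumes the example data is stored as df1 and that the rows are already ordered by POSITIONS within each ID; the names pairs, last_rows, n_patients and n_last are made up for this illustration.

```r
# Example data copied from the question (stored as df1, an assumption).
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
  POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L),
  TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# 1) Distinct IDs per treatment: drop repeated ID/TREATMENT pairs,
#    then tabulate the treatments that remain.
pairs <- unique(df1[, c("ID", "TREATMENT")])
n_patients <- table(pairs$TREATMENT)

# 2) Last row of each ID (assumes rows are sorted by POSITIONS within ID).
last_rows <- df1[!duplicated(df1$ID, fromLast = TRUE), ]

# A factor carrying every treatment level keeps the zero counts in the table.
all_levels <- sort(unique(df1$TREATMENT))
n_last <- table(factor(last_rows$TREATMENT, levels = all_levels))

print(n_patients)
print(n_last)
```

If the real data is not sorted, ordering it first with df1[order(df1$ID, df1$POSITIONS), ] should make the "last row" step safe.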
Here is a tidyverse approach: we get the distinct rows based on 'ID' and 'TREATMENT', then get the count of 'TREATMENT'.
library(tidyverse)

# keep one row per ID/TREATMENT pair, then count the rows per treatment
df1 %>%
  distinct(ID, TREATMENT) %>%
  count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
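For the second output, after grouping by 'ID', slice the last row (n()), create a column 'ind', and use complete to fill all the missing 'TREATMENT' combinations with 0. Then, after grouping by 'TREATMENT', get the sum of 'ind'.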
df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
# A tibble: 4 x 2
# TREATMENT n
# <chr> <dbl>
#1 A 0
#2 B 2
#3 C 0
#4 D 1
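As an alternative sketch (an assumption on my part, not the approach shown above), both outputs might also be written with n_distinct() and with count(.drop = FALSE) on a factor, which avoids the helper column 'ind'. It assumes a dplyr version that supports the .drop argument (0.8.0 or later); the names first_out and second_out are made up for this illustration.

```r
library(dplyr)

# Example data copied from the question (stored as df1).
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
  POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L),
  TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# First output: number of distinct IDs per treatment.
first_out <- df1 %>%
  group_by(TREATMENT) %>%
  summarise(n = n_distinct(ID))

# Second output: keep the last row of each ID, turn TREATMENT into a
# factor with every level, and count without dropping empty levels.
second_out <- df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  ungroup() %>%
  mutate(TREATMENT = factor(TREATMENT,
                            levels = sort(unique(df1$TREATMENT)))) %>%
  count(TREATMENT, .drop = FALSE)

print(first_out)
print(second_out)
```

Here the zero rows come from the factor levels instead of from complete(), so the result has an integer n column rather than a double.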
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L), POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L
), TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A",
"A", "B")), .Names = c("ID", "POSITIONS", "TREATMENT"),
class = "data.frame", row.names = c(NA, -11L))