Counting occurrences of a variable without counting duplicates

I have a large data frame named data with 1,004,490 observations, and I want to analyze whether a treatment was successful.

ID             POSITIONS             TREATMENT
1              0                     A
1              1                     A
1              2                     B
2              0                     C
2              1                     D
3              0                     B
3              1                     B
3              2                     C
3              3                     A
3              4                     A
3              5                     B

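First, I want to count how many patients (ID) each treatment was applied to, but a treatment can be given to the same ID several times. Do I need to remove all duplicates first and then count, or is there a function that ignores duplicates?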

What I want to have :  
A : 2
B : 2
C : 2
D : 1

Then, I want to know how many times each treatment was given at the last position, but the last position differs from one ID to another.

What I want to have :  
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)

Thanks for your help, I am a new R user!

Using base R, we can do:

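# ID.x: number of distinct IDs per treatment
# ID.y: IDs whose last row (last occurrence of the ID) has that treatment
merge(aggregate(ID ~ TREATMENT, df1, FUN = function(i) length(unique(i))), 
      aggregate(ID ~ TREATMENT, df1[!duplicated(df1$ID, fromLast = TRUE),], toString), 
      by = 'TREATMENT', all = TRUE)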

which gives,

  TREATMENT ID.x ID.y
1         A    2 <NA>
2         B    2 1, 3
3         C    2 <NA>
4         D    1    2
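
If only the first count is needed, the same idea can be written more directly. The following is a minimal base R sketch, not part of the original answer; it rebuilds the question's sample data inline, and the names `pairs` and `patients_per_treatment` are made up for illustration. It keeps one row per (ID, TREATMENT) pair and then tabulates the treatments.

```r
# Sample data copied from the question
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
  POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L),
  TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Keep one row per (ID, TREATMENT) pair, so repeated treatments
# for the same patient are counted only once
pairs <- unique(df1[, c("ID", "TREATMENT")])

# Number of distinct patients per treatment
patients_per_treatment <- table(pairs$TREATMENT)
print(patients_per_treatment)
```

On the sample data this should agree with the ID.x column above.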

Here is a tidyverse approach: we get the distinct rows based on 'ID' and 'TREATMENT', then get the count of 'TREATMENT'.

library(tidyverse)
df1 %>%
    distinct(ID, TREATMENT) %>%
    count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT     n
#      <chr> <int>
#1         A     2
#2         B     2
#3         C     2
#4         D     1

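For the second output, after grouping by 'ID', we slice the last row (n()), create a column 'ind', and use complete to fill all missing 'TREATMENT' combinations with 0. Then, after grouping by 'TREATMENT', we get the sum of 'ind'.
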
df1 %>% 
   group_by(ID) %>% 
   slice(n()) %>%
   mutate(ind = 1) %>% 
   complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind=0)) %>% 
   group_by(TREATMENT) %>%
   summarise(n = sum(ind))
# A tibble: 4 x 2
#  TREATMENT     n
#      <chr> <dbl>
#1         A     0
#2         B     2
#3         C     0
#4         D     1
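
Note that slice(n()) takes the last row of each group as the rows are stored, so it relies on the data being sorted by 'POSITIONS' within each 'ID'. If that ordering is not guaranteed, one option is to select the row with the largest 'POSITIONS' explicitly. The following is a base R sketch under that assumption, not part of the original answer; the names `is_last`, `last_rows` and `last_counts` are made up for illustration.

```r
# Sample data copied from the question
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
  POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L),
  TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Flag the row holding the maximum POSITIONS within each ID;
# this does not depend on the order of the rows
is_last <- df1$POSITIONS == ave(df1$POSITIONS, df1$ID, FUN = max)
last_rows <- df1[is_last, ]

# Tabulate with all treatment levels so that treatments
# never seen at a last position are reported as 0
all_treatments <- sort(unique(df1$TREATMENT))
last_counts <- table(factor(last_rows$TREATMENT, levels = all_treatments))
print(last_counts)
```

On the sample data this should match the tibble above. Using a factor with explicit levels plays the same role as complete in the tidyverse version.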

Data

df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L), POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L
 ), TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A", 
 "A", "B")), .Names = c("ID", "POSITIONS", "TREATMENT"),
 class = "data.frame", row.names = c(NA, -11L))