R: count distinct IDs where selected column is non-null

Question

I have the following data frame:

user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)

df <- data.frame(user_id, event_id, plan_id, treatment_id, system)

For each column, I want to count the number of distinct user_id values among the rows where that column is not NA. My desired output is:

      user_id   event_id    plan_id   treatment_id  system
  1   4         4           3         3             2

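I tried using mutate_all, but without success because my data frame is too large. In other functions I have used the following two lines to get the non-null count and the distinct count for each column: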

colSums(!is.empty(df[,]))
apply(df[,], 2, function(x) length(unique(x)))

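Ideally, I would like to combine the two with ifelse to minimise the number of mutations, since this will eventually go into a function that is applied to a list of data frames along with many other summary statistics.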

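I tried a brute-force approach: set each cell to 1 if it is not null and 0 otherwise, then copy the id into the cell wherever it is 1. I could then use the count-distinct line above to get my output. However, I get the wrong values when copying the id into the other columns, and the number of adjustments is not optimal. See the code: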

binary <- cbind(df$user_id, !is.empty(df[,2:length(df)]))
copied <- binary %>% replace(. > 0, binary[.,1])

Any help is greatly appreciated.

Answer 1

Thanks @dcarlson, I had misunderstood the question:

   apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})

Answer 2

1: base R

sapply(df, function(x){
    length(unique(df$user_id[!is.na(x)]))
})
#     user_id     event_id      plan_id treatment_id       system 
#           4            4            3            3            2
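
Since the question mentions applying this to a list of data frames, the sapply call above could be wrapped in a small helper. This is only a sketch: the function name count_ids_non_na, its id_col argument, and the df_list object are made up for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question
df <- data.frame(
  user_id      = c(97, 97, 97, 97, 96, 95, 95, 94, 94),
  event_id     = c(42, 15, 43, 12, 44, 32, 38, 10, 11),
  plan_id      = c(NA, 38, NA, NA, 30, NA, NA, 30, 25),
  treatment_id = c(NA, 20, NA, NA, NA, 28, 41, 17, 32),
  system       = c(1, 1, 1, 1, NA, 2, 2, NA, NA)
)

# Hypothetical helper: for every column, count the distinct IDs
# found in the rows where that column is not NA
count_ids_non_na <- function(data, id_col = "user_id") {
  ids <- data[[id_col]]
  vapply(data, function(x) length(unique(ids[!is.na(x)])), integer(1))
}

counts <- count_ids_non_na(df)

# Hypothetical list of data frames; here simply the same data twice
df_list <- list(a = df, b = df)
counts_list <- lapply(df_list, count_ids_non_na)
```

Using vapply instead of sapply is a design choice here: it guarantees an integer vector even if a data frame in the list has no columns.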

2: base R

aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
#           ind user_id
#1      user_id       4
#2     event_id       4
#3      plan_id       3
#4 treatment_id       3
#5       system       2

3: tidyverse

library(dplyr)
library(tidyr)

df %>%
    mutate(key = user_id) %>%
    pivot_longer(!key) %>%
    filter(!is.na(value)) %>%
    group_by(name) %>%
    summarise(value = n_distinct(key)) %>%
    pivot_wider()
## A tibble: 1 x 5
#  event_id plan_id system treatment_id user_id
#     <int>   <int>  <int>        <int>   <int>
#1        4       3      2            3       4

Answer 3

A data.table option using uniqueN

> library(data.table)
> setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
   user_id event_id plan_id treatment_id system
1:       4        4       3            3      2

Answer 4

With dplyr you can use summarise together with across:

library(dplyr)
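# Copy the IDs first so the result does not depend on the order in which columns are summarised
ids <- df$user_id
df %>% summarise(across(everything(), ~n_distinct(ids[!is.na(.x)])))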

#  user_id event_id plan_id treatment_id system
#1       4        4       3            3      2
