Count unique and ambiguous elements in each row compared to the whole dataset

Question

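I have a dataset with thousands of rows and nearly a hundred columns. Each row contains only unique elements, but those elements may also appear in other rows.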

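Basically, I would like to create two new columns in my data frame: one storing the number of Unique elements and the other the number of Ambiguous elements in a given row, where both are judged against the whole dataset.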

Note that the data frame contains NAs, which should not be considered when counting unique and ambiguous elements.

df <- data.frame(
  col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
  col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
  col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
  col4 = c(NA, NA, NA, 'Ce', 'Im')
)

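In the data frame created above, Ab is not unique (it also appears in row 3), so row 1 has 2 unique elements and 1 ambiguous element compared to the whole dataset.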

In my expected output, Unique equals 2 and Ambiguous equals 1 for row 1. For row 5, they are 4 and 0, respectively.

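I have searched for possible solutions, but most of them only deal with unique or duplicated elements within a specific row, or across multiple rows of a specific column. Any help would be appreciated.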

Answer 1

How about something like this:

library(dplyr)  # provides %>% and mutate()

df <- data.frame(
  col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
  col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
  col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
  col4 = c(NA, NA, NA, 'Ce', 'Im')
)

m <- as.matrix(df)
uvals <- avals <- integer(nrow(df))

for (i in seq_len(nrow(df))) {
  # All non-NA values found in every row except row i
  other_vals <- na.omit(unique(c(m[-i, ])))
  # TRUE where a non-NA value of row i also occurs in another row
  tmp <- na.omit(m[i, ]) %in% other_vals
  uvals[i] <- sum(!tmp)  # values seen only in row i
  avals[i] <- sum(tmp)   # values shared with at least one other row
}

df <- df %>%
  mutate(unique = uvals,
         ambiguous = avals)

df
#   col1 col2 col3 col4 unique ambiguous
# 1   Ab   Ac  Acc <NA>      2         1
# 2   Cd   Ce <NA> <NA>      1         1
# 3   Ef   Eg   Ab <NA>      2         1
# 4   Gh   Gi  Gef   Ce      3         1
# 5   Ij   Ik   Il   Im      4         0
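
If you would rather not depend on dplyr, the same row-versus-other-rows comparison can be sketched in base R. This is only a sketch: the function name `count_vs_other_rows` is made up for illustration, and it assumes every column can be compared as character data.

```r
# Hypothetical helper (name invented for this sketch): for each row, count
# the non-NA values that occur in no other row ("unique") and the ones that
# occur in at least one other row ("ambiguous").
count_vs_other_rows <- function(data) {
  m <- as.matrix(data)
  counts <- vapply(seq_len(nrow(m)), function(i) {
    row_vals <- m[i, ][!is.na(m[i, ])]   # drop NAs from the current row
    shared <- row_vals %in% m[-i, ]      # compare against all other rows
    c(unique = sum(!shared), ambiguous = sum(shared))
  }, integer(2))
  # counts is a 2 x nrow matrix; transpose it so each row gets its two counts
  cbind(data, t(counts))
}

df <- data.frame(
  col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
  col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
  col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
  col4 = c(NA, NA, NA, 'Ce', 'Im')
)

result <- count_vs_other_rows(df)
result
```

Like the loop above, this rescans the other rows for every row, so on thousands of rows it may be noticeably slower than an approach that tabulates the values once.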

Answer 2

Another approach that avoids recomputing the comparison for every row.

# First we get the duplicates to avoid recounting every time.
freqs <- table(as.matrix(df))
dupes <- names(freqs[freqs > 1])

# Check the values for (non-)duplication.
is_dupe <- rowSums(apply(df, 2, "%in%", dupes))
not_dupe <- rowSums(apply(df, 2, function(x) {!(x %in% dupes | is.na(x))}))

# Add the columns after we calculated the counts to avoid including them.
df$ambiguous <- is_dupe
df$unique <- not_dupe
df

#   col1 col2 col3 col4 ambiguous unique
# 1   Ab   Ac  Acc <NA>         1      2
# 2   Cd   Ce <NA> <NA>         1      1
# 3   Ef   Eg   Ab <NA>         1      2
# 4   Gh   Gi  Gef   Ce         1      3
# 5   Ij   Ik   Il   Im         0      4
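
Because the counts must be computed before the new columns are added, it can help to wrap the steps in a function that takes the columns to compare explicitly. The following is a rough sketch rather than a tested utility: `count_by_frequency` and its `cols` argument are names invented here. It also relies on the question's statement that each row contains only unique elements, since a value repeated within a single row would have a frequency above 1 and be counted as ambiguous.

```r
# Hypothetical helper (name and `cols` argument invented for this sketch).
# `cols` names the columns to compare; it defaults to all columns.
count_by_frequency <- function(data, cols = names(data)) {
  m <- as.matrix(data[cols])
  freqs <- table(m)                    # table() drops NA by default
  dupes <- names(freqs)[freqs > 1]     # values occurring more than once
  # %in% returns a plain vector, so restore the row/column shape
  is_dupe <- matrix(m %in% dupes, nrow = nrow(m))
  data$ambiguous <- rowSums(is_dupe)
  data$unique <- rowSums(!is_dupe & !is.na(m))
  data
}

df <- data.frame(
  col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
  col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
  col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
  col4 = c(NA, NA, NA, 'Ce', 'Im')
)

out <- count_by_frequency(df)

# Re-running on the result is safe as long as only the original columns
# are passed in `cols`, so the count columns are not tabulated themselves.
out_again <- count_by_frequency(out, cols = c('col1', 'col2', 'col3', 'col4'))
```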
