Clustering similar strings based on another column in R

I have a large data frame that shows the distances between pairs of strings, together with the count of each string.

For example, row 1 shows the distance between apple and pple, along with the number of times I counted apple (counts_col1 = 100) and the number of times I counted pple (counts_col2 = 2).

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
                 col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             distance = c(1,2,3,1,1,2),
          counts_col1 = c(100,100,2,200,200,2),
          counts_col2 = c(2,50,50,2,20,20))
df    
#> # A tibble: 6 × 5
#>   col1    col2    distance counts_col1 counts_col2
#>   <chr>   <chr>      <dbl>       <dbl>       <dbl>
#> 1 apple   pple           1         100           2
#> 2 apple   app            2         100          50
#> 3 pple    app            3           2          50
#> 4 banana  bananna        1         200           2
#> 5 banana  banan          1         200          20
#> 6 bananna banan          2           2          20

Created on 2022-03-15 by the reprex package (v2.0.1)

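Now I want to cluster the apples and the bananas around the string with the largest count in each group, i.e. apple (100) and banana (200). I would like my data to look like this: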

cluster   elements  sum_counts
 apple      apple    152
  NA        pple      NA
  NA         app      NA
 banana     banana   222
  NA       bananna    NA
  NA         banan    NA

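The output does not have to be in exactly this format. I am really struggling to break this problem down and cluster these groups together. Any help or comments are much appreciated!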

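Here is one approach. I first add a group identifier to the set (I assume your real data has one). Then, after reshaping the data into a longer format, I group by that id and pick out the "word" with the largest count in each group. Next, I inner-join the initial df to those key rows (the words with the largest count), then summarize and rename. All variants are pushed into a list column.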

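# Group identifier: rows 1-3 are the apple set, rows 4-6 the banana set
df <- df %>% mutate(id=c(1,1,1,2,2,2))

df %>% inner_join(
   # Stack col1 and col2 into one long table of (id, word, count)
   rbind(
    df %>% select(id,distance,col=col1, counts=counts_col1),
    df %>% select(id,distance,col=col2, counts=counts_col2)
  ) %>% 
  group_by(id) %>% 
  slice_max(counts) %>%   # word(s) with the largest count per id
  distinct(col),          # drop repeated rows for the same word
  by=c("col1"="col")
) %>% 
  group_by(col1) %>% 
  # Variants are the col2 partners of the key word, plus the key word itself
  summarize(variants = list(c(col2, cur_group()$col1)),
            total = min(counts_col1) + sum(counts_col2)) %>% 
  rename_all(~c("cluster", "elements", "sum_counts"))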

# A tibble: 2 x 3
  cluster elements  sum_counts
  <chr>   <list>         <dbl>
1 apple   <chr [3]>        152
2 banana  <chr [3]>        222
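
If one row per element is preferred, closer to the layout sketched in the question, the list column can be unnested. This is only a minimal sketch: `res` is a hypothetical stand-in with the same shape as the summarised result above, and `long` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the summarised result (one row per cluster)
res <- tibble(
  cluster    = c("apple", "banana"),
  elements   = list(c("pple", "app", "apple"),
                    c("bananna", "banan", "banana")),
  sum_counts = c(152, 222)
)

# Expand the list column: one row per element,
# with cluster and sum_counts repeated on every row
long <- res %>% unnest(elements)
long
```

Unlike the layout in the question, this repeats the cluster name and the total on every row instead of filling them with NA.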

A similar approach in data.table (this also relies on the id column):

library(data.table)

setDT(df)  # convert df to a data.table by reference
df[rbind(
  df[,.(id,col=col1,counts=counts_col1)],
  df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
  ,  .(elements=list(c(col2,.BY$cluster)),
       sum_counts = min(counts_col1) + sum(counts_col2)),
  by=.(cluster=col1)]


   cluster             elements sum_counts
    <char>               <list>      <num>
1:  banana bananna,banan,banana        222
2:   apple       pple,app,apple        152
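
To see what the join is keyed on, the inner expression can be run on its own. The following is a minimal sketch on the toy data, assuming the same hard-coded id column; `dt`, `long` and `key_rows` are illustrative names, not part of the answer above.

```r
library(data.table)

# Toy data from the question, with the assumed group id
dt <- data.table(
  id          = c(1, 1, 1, 2, 2, 2),
  col1        = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2        = c("pple", "app", "app", "bananna", "banan", "banan"),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# Stack both word columns into one long table of (id, word, count)
long <- rbind(
  dt[, .(id, col = col1, counts = counts_col1)],
  dt[, .(id, col = col2, counts = counts_col2)]
)

# Sort by descending count, then keep the first row of each id:
# that row holds the word with the largest count in its group
key_rows <- long[order(-counts), .SD[1], by = id]
key_rows
```

Because the rows are sorted by descending count before grouping, the group with the largest count comes first, which is why banana precedes apple in the output above.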

You could try random-walk (walktrap) clustering from igraph:

count_df <- data.table::melt(
  data.table::as.data.table(df), 
  measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
  value.name = c("col", "counts")
) %>%
  select(col, counts) %>%
  unique()

df %>%
  igraph::graph_from_data_frame(directed = FALSE) %>%
  igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
  # igraph::components() %>%
  igraph::membership() %>%
  split(names(.), .) %>%
  map_dfr(
    ~tibble(col = .x) %>% 
      semi_join(count_df, ., by = "col") %>% 
      arrange(desc(counts)) %>%
      summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
  )

  cluster               elements sum_count
1   apple       apple, app, pple       152
2  banana banana, banan, bananna       222

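This works for the toy example, but I suspect the example is too simple and may not reflect your main problem. Alternatively, if you are only interested in finding connected components (two words are in the same cluster whenever they are connected), it may be easier: in that case, replace walktrap.community with components.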
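
The connected-components variant could be sketched roughly as follows. This is a minimal sketch under assumptions: it treats any two words that share a row as connected and ignores distance entirely, and `g`, `memb`, `word_counts` and `res` are illustrative names that do not appear in the answer above.

```r
library(dplyr)

# Toy data from the question
df <- tibble(
  col1        = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2        = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance    = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# The first two columns are taken as the edge endpoints
g <- igraph::graph_from_data_frame(df, directed = FALSE)

# Named vector mapping each word to its component number
memb <- igraph::components(g)$membership

# One row per distinct word with its count
word_counts <- bind_rows(
  df %>% select(col = col1, counts = counts_col1),
  df %>% select(col = col2, counts = counts_col2)
) %>%
  distinct()

# Label each component by its most frequent word and total the counts
res <- word_counts %>%
  mutate(id = unname(memb[col])) %>%
  group_by(id) %>%
  arrange(desc(counts), .by_group = TRUE) %>%
  summarise(cluster = first(col),
            elements = list(col),
            sum_counts = sum(counts))
res
```

This also produces the group id from the data itself, so it does not depend on a hard-coded id column.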