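Clustering similar strings based on another column in R
I have a large data frame that shows the distance between pairs of strings, together with their counts.
For example, row 1 shows the distance between apple and pple, along with how many times I counted apple (counts_col1 = 100) and how many times I counted pple (counts_col2 = 2).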
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
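Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apple and banana groups around the string with the highest count in each group, i.e. apple (100) and banana (200).
I would like my data to look like this: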
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
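The output does not have to be in exactly this format. I am really struggling to break this problem down and group these strings together.
Any help or comments are much appreciated!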
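Here is one approach. I first add a group identifier to the data (I assume your real data has one). After reshaping to a longer dataset, I group by this id and identify the "word" with the largest count in each group. I then inner-join the initial df to these key rows (the ones holding the largest-count word), summarise, and rename. All variants go into a list column.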
df <- df %>% mutate(id = c(1, 1, 1, 2, 2, 2))
df %>%
  inner_join(
    # stack both word columns, then keep the highest-count word per id
    rbind(
      df %>% select(id, distance, col = col1, counts = counts_col1),
      df %>% select(id, distance, col = col2, counts = counts_col2)
    ) %>%
      group_by(id) %>%
      slice_max(counts) %>%
      distinct(col),
    by = c("col1" = "col")
  ) %>%
  group_by(col1) %>%
  # elements: the key word itself plus its variants from col2
  summarize(variants = list(c(cur_group()$col1, col2)),
            total = min(counts_col1) + sum(counts_col2)) %>%
  rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
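If the long layout from the question is preferred over a list column, the result can be unnested. The following is only a sketch: res is a hand-built stand-in for the summarised result above (not computed from df), and blanking the repeated sums with NA is an assumption about the desired layout.

```r
library(dplyr)
library(tidyr)

# Hand-built stand-in for the summarised result (one row per cluster)
res <- tibble(
  cluster    = c("apple", "banana"),
  elements   = list(c("apple", "pple", "app"),
                    c("banana", "bananna", "banan")),
  sum_counts = c(152, 222)
)

long <- res %>%
  unnest(elements) %>%            # one row per element
  group_by(cluster) %>%
  # keep the sum only on the first row of each cluster
  mutate(sum_counts = if_else(row_number() == 1, sum_counts, NA_real_)) %>%
  ungroup()

long
```

The cluster column is left filled in on every row here, since that is usually easier to work with downstream than the NA-padded version.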
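A similar approach in data.table (it also relies on the id column):
library(data.table)
setDT(df)
df[
  # stack both word columns, keep the highest-count word per id, then join
  rbind(
    df[, .(id, col = col1, counts = counts_col1)],
    df[, .(id, col = col2, counts = counts_col2)]
  )[order(-counts), .SD[1], by = id],
  on = .(col1 = col)
][
  , .(elements = list(c(col2, .BY$cluster)),
      sum_counts = min(counts_col1) + sum(counts_col2)),
  by = .(cluster = col1)
]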
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152
You could try random-walk (walktrap) clustering from igraph:
# long table with one row per distinct (word, count) pair
count_df <- data.table::melt(
  data.table::as.data.table(df),
  measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
  value.name = c("col", "counts")
) %>%
  select(col, counts) %>%
  unique()
df %>%
  igraph::graph_from_data_frame(directed = FALSE) %>%
  # note: igraph treats larger weights as stronger ties, so passing a
  # distance as the weight is a simplification
  igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
  # igraph::components() %>%
  igraph::membership() %>%
  split(names(.), .) %>%
  map_dfr(
    ~tibble(col = .x) %>%
      semi_join(count_df, ., by = "col") %>%
      arrange(desc(counts)) %>%
      summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
  )
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
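This works for the toy example, but I suspect the example is too simple and may not reflect your real problem. Alternatively, if you are only interested in finding connected components (two words belong to the same cluster whenever they are connected), it may be easier: in that case, replace walktrap.community with components.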
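The connected-components variant can be sketched as follows. This is a minimal illustration on the toy data, not a tested solution for the real data set; the names g, memb and result are made up for the example, and it assumes every string's count is the same wherever that string appears.

```r
library(dplyr)
library(igraph)

df <- tibble(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# One row per distinct (word, count) pair
count_df <- bind_rows(
  select(df, col = col1, counts = counts_col1),
  select(df, col = col2, counts = counts_col2)
) %>%
  distinct()

# Undirected graph with one edge per row of df
g <- graph_from_data_frame(select(df, col1, col2), directed = FALSE)

# Named vector: word -> component number
memb <- components(g)$membership

result <- tibble(col = names(memb), group = as.integer(memb)) %>%
  inner_join(count_df, by = "col") %>%
  group_by(group) %>%
  arrange(desc(counts), .by_group = TRUE) %>%
  summarise(cluster = first(col),      # highest-count word names the cluster
            elements = list(col),
            sum_counts = sum(counts))

result
```

Unlike the first two answers, this does not need an id column, but it ignores the distances entirely: any chain of edges, however long, merges words into one cluster.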