Find the most common combinations within each group in R
I have the following dataset, which shows the ingredients contained in each product:
data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
"INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))
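I would like to know the most common ingredient combinations across products: which ingredient is associated with which other ingredient? I applied the code found in this thread: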
library(dplyr)

combinaisons_par_PRODUCT = data %>%
  full_join(data, by = "PRODUCT") %>%
  group_by(INGREDIENT.x, INGREDIENT.y) %>%
  summarise(n = length(unique(PRODUCT))) %>%
  filter(INGREDIENT.x != INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
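It works, but there is one last flaw: I would like the order to be ignored. For example, this code gives me 1 association of HAHA and SASA, and 1 association of SASA and HAHA. To me these are the same thing. So I would like the code to ignore the order of the ingredients and give me a single association of 2 for HAHA & SASA.
I tried sorting the ingredients before applying the code, but that did not work either. Can someone help me? How can I get these combinations without regard to order?
Thank you very much!
Does this do what you want? I keep only the cases where the combination is in alphabetical order, which avoids double counting.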
library(dplyr)

data %>%
  full_join(data, by = "PRODUCT") %>%
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
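To see why the `INGREDIENT.x < INGREDIENT.y` filter drops both self-pairs and mirrored pairs, here is a minimal sketch on a made-up data frame (`toy`, its products `P1`/`P2` and ingredients `a`/`b`/`c` are hypothetical, not from the question). I also added `sort = TRUE` so the most common combinations come first. As far as I know, dplyr 1.1.0 and later may warn about the many-to-many self-join; that warning is expected here.

```r
library(dplyr)

# Hypothetical toy data: two products that share the ingredients "a" and "b"
toy <- data.frame(
  PRODUCT    = c("P1", "P1", "P1", "P2", "P2"),
  INGREDIENT = c("a", "b", "c", "b", "a")
)

pairs <- toy %>%
  # self-join: every ingredient paired with every ingredient of the same product
  full_join(toy, by = "PRODUCT") %>%
  # keep each unordered pair once (alphabetical order) and drop self-pairs
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  # count the products per pair, most frequent first
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "), sort = TRUE)

print(pairs)
```

Note that `<` on strings follows the locale's collation order, so this assumes ingredient names compare consistently (plain lowercase names like the ones in the question should be fine).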
We can use base R:
m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)
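For anyone unfamiliar with the trick: `table(data)` builds a product-by-ingredient incidence matrix, and `crossprod(X)` computes `t(X) %*% X`, so cell (i, j) counts the products containing both ingredients (assuming each ingredient is listed at most once per product). A small sketch on a made-up data frame (`toy` is hypothetical); unlike the code above it uses the strict lower triangle, so self-pairs are dropped:

```r
# Hypothetical toy data: two products that share the ingredients "a" and "b"
toy <- data.frame(
  PRODUCT    = c("P1", "P1", "P1", "P2", "P2"),
  INGREDIENT = c("a", "b", "c", "b", "a")
)

inc <- table(toy)      # rows: products, columns: ingredients (0/1 counts)
co  <- crossprod(inc)  # ingredient-by-ingredient co-occurrence counts

# lower.tri() without diag = TRUE zeroes the diagonal and the upper triangle,
# so each unordered pair is kept once and self-pairs disappear
res <- subset(as.data.frame.table(co * lower.tri(co)), Freq != 0)

print(co)
print(res)
```

The diagonal of `co` is just the number of products containing each ingredient, which is why the original version with `diag = TRUE` also returns rows pairing an ingredient with itself.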
Edit: following @ThomasIsCoding's comment, here is an igraph option using graph_from_adjacency_matrix:
library(igraph)

get.data.frame(
  graph_from_adjacency_matrix(
    crossprod(table(data)),
    mode = "undirected",
    weighted = TRUE
  )
)
which gives
     from     to weight
1    haha   haha      2
2    haha   sasa      1
3    haha   zaza      2
4    haha zemzem      1
5    haha   zizi      1
6    haha   zuzu      2
7   hayda  hayda      1
8   hayda   ozam      1
9   hayda   zaza      1
10   ozam   ozam      1
11   ozam   zaza      1
12   sasa   sasa      2
13   sasa   zaza      1
14   sasa zemzem      1
15   sasa   zuzu      2
16   zaza   zaza      5
17   zaza zemzem      2
18   zaza   zeze      1
19   zaza   zizi      1
20   zaza   zozo      2
21   zaza   zuzu      4
22 zemzem zemzem      2
23 zemzem   zozo      1
24 zemzem   zuzu      2
25   zeze   zeze      1
26   zeze   zozo      1
27   zeze   zuzu      1
28   zizi   zizi      1
29   zizi   zuzu      1
30   zozo   zozo      2
31   zozo   zuzu      2
32   zuzu   zuzu      5
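The edge list above also contains self-pairs such as haha/haha, which are just the number of products containing each ingredient. If only real combinations are wanted, one possible variation (a sketch, not part of the original answer) is to pass `diag = FALSE` to `graph_from_adjacency_matrix` and sort the edges by weight. It is shown on a made-up data frame (`toy` is hypothetical), and I call `igraph::as_data_frame` with the namespace prefix to avoid a clash with other packages.

```r
library(igraph)

# Hypothetical toy data: two products that share the ingredients "a" and "b"
toy <- data.frame(
  PRODUCT    = c("P1", "P1", "P1", "P2", "P2"),
  INGREDIENT = c("a", "b", "c", "b", "a")
)

g <- graph_from_adjacency_matrix(
  crossprod(table(toy)),  # ingredient-by-ingredient co-occurrence counts
  mode     = "undirected",
  weighted = TRUE,
  diag     = FALSE        # ignore the diagonal, i.e. no self-loops
)

# one row per unordered pair, most common combinations first
edges <- igraph::as_data_frame(g, what = "edges")
edges <- edges[order(-edges$weight), ]

print(edges)
```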