Produce two columns of word combinations from one column of words where ID column values are equal
I am trying to prepare a data frame to feed into networkD3's forceNetwork function.
Here is a sample of my data:
structure(list(Case.Number = c("127967", "127967", "127967",
"127967", "141330", "141330", "141330", "141330", "141240", "141240",
"141240"), Word = c("account", "want", "membership", "sort",
"unhappi", "vr", "info", "miss", "csrf", "unhappi", "dissatisfi"
)), .Names = c("Case.Number", "Word"), class = c("data.table",
"data.frame"), row.names = c(NA, -11L))
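For the words within each case number, I want to produce a data frame with two columns holding every possible (and unique) two-word combination, as shown below, with no repeated combinations (including the same pair in reverse order) and no word paired with itself: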
127967 account want
127967 account membership
127967 account sort
127967 want membership
127967 want sort
141330 unhappi vr
141330 unhappi info...
excluding
141330 unhappi unhappi
I tried the following to get the combinations:
source <- c("remove")
target <- c("remove")
ID <- c("remove")
df <- data.frame(ID = c("remove"), source = c("remove"), target = c("remove"))
for(i in unique(tbl$Case.Number)){
for (r in grep(i, tbl$Case.Number)) {
if(r < max(grep(i, tbl$Case.Number))){
ID <- i
source <- tbl$Word[r]
target <- tbl$Word[r+1]
rbind(df, cbind(ID, source,target))
}
}
}
View(df)
But it doesn't work.
Is there a cleaner way to do this?
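Self-join, then filter:
library(data.table)
# dd is the data.table from the question
setkey(dd, Case.Number)
# join dd to itself within each Case.Number, then keep one ordering of each pair
dd[dd, allow.cartesian = TRUE][Word < i.Word]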
# Case.Number Word i.Word
# 1: 127967 account want
# 2: 127967 membership want
# 3: 127967 sort want
# 4: 127967 account membership
# 5: 127967 account sort
# 6: 127967 membership sort
# 7: 141240 csrf unhappi
# 8: 141240 dissatisfi unhappi
# 9: 141240 csrf dissatisfi
# 10: 141330 info unhappi
# 11: 141330 miss unhappi
# 12: 141330 unhappi vr
# 13: 141330 info vr
# 14: 141330 miss vr
# 15: 141330 info miss
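To make the filtering step more concrete, here is a minimal base R sketch of the same idea, assuming the sample data from the question. It uses merge() in place of the data.table join, so the column names Word.x and Word.y come from merge's default suffixes rather than from the answer above. The strict comparison does two jobs at once: it drops every word paired with itself, and it keeps exactly one ordering of each remaining pair.

```r
# Sample data from the question, as a plain data frame
dd <- data.frame(
  Case.Number = c(rep("127967", 4), rep("141330", 4), rep("141240", 3)),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# Self-join: every word paired with every word of the same case
# (4*4 + 4*4 + 3*3 = 41 rows, including self-pairs and both orderings)
crossed <- merge(dd, dd, by = "Case.Number")

# Keep only rows where the first word sorts strictly before the second
pairs <- crossed[crossed$Word.x < crossed$Word.y, ]

nrow(crossed)
nrow(pairs)
```

If a word can occur more than once within a case, the same pair would show up more than once; calling unique() on the data before the join should take care of that.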
Update: using tidyr::expand...
df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967 account
127967 want
127967 membership
127967 sort
141330 unhappi
141330 vr
141330 info
141330 miss
141240 csrf
141240 unhappi
141240 dissatisfi
")
library(dplyr)
library(tidyr)
df %>%
group_by(Case.Number) %>%
expand(Word, i.Word = Word) %>%
filter(Word < i.Word)
Here is a tidyverse approach (simpler than the original approach below, and making use of @Gregor's very simple filtering method)...
df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967 account
127967 want
127967 membership
127967 sort
141330 unhappi
141330 vr
141330 info
141330 miss
141240 csrf
141240 unhappi
141240 dissatisfi
")
library(dplyr)
library(tidyr)
df %>%
group_by(Case.Number) %>%
mutate(i.Word = Word) %>%
complete(Word, i.Word) %>%
filter(Word < i.Word)
# A tibble: 15 x 3
# Groups: Case.Number [3]
Case.Number Word i.Word
<int> <chr> <chr>
1 127967 account membership
2 127967 account sort
3 127967 account want
4 127967 membership sort
5 127967 membership want
6 127967 sort want
7 141240 csrf dissatisfi
8 141240 csrf unhappi
9 141240 dissatisfi unhappi
10 141330 info miss
11 141330 info unhappi
12 141330 info vr
13 141330 miss unhappi
14 141330 miss vr
15 141330 unhappi vr
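Since the stated goal is input for forceNetwork, here is a rough sketch of how a pair table like the one above might be turned into link and node data frames. The names pairs, nodes, links, source, target, value, name and group are my own choices for illustration, and the three rows are a small subset of the output above. As far as I know, networkD3 expects the source and target columns to be zero-based indices into the nodes data frame, which is why 1 is subtracted.

```r
# A few of the pairs from the result above (illustrative subset)
pairs <- data.frame(
  source = c("account", "account", "membership"),
  target = c("membership", "sort", "sort"),
  stringsAsFactors = FALSE
)

# One row per distinct word; 'group' is a placeholder grouping
nodes <- data.frame(
  name = unique(c(pairs$source, pairs$target)),
  group = 1L,
  stringsAsFactors = FALSE
)

# Replace each word with its zero-based position in 'nodes'
links <- data.frame(
  source = match(pairs$source, nodes$name) - 1L,
  target = match(pairs$target, nodes$name) - 1L,
  value = 1L
)

# Assumed call, not run here:
# networkD3::forceNetwork(Links = links, Nodes = nodes,
#                         Source = "source", Target = "target", Value = "value",
#                         NodeID = "name", Group = "group")
```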
Here is a tidyverse approach (if a somewhat convoluted one)...
df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967 account
127967 want
127967 membership
127967 sort
141330 unhappi
141330 vr
141330 info
141330 miss
141240 csrf
141240 unhappi
141240 dissatisfi
")
library(dplyr)
library(tidyr)
as_tibble(df) %>%
group_by(Case.Number) %>%
mutate(Word = list(as_data_frame(t(combn(unlist(Word), 2))))) %>%
unique() %>%
unnest(Word)
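It is a little easier to follow if you run the commands below one at a time to see what each does. combn does the real work: it generates every combination of the vector's elements taken m at a time (here, m = 2).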
vec <- c("account", "want", "membership", "sort")
combn(vec, 2)
t(combn(vec, 2))
as_data_frame(t(combn(vec, 2)))
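The same combn idea can also be applied per case without any packages. This is only a sketch under a few assumptions: pairs_for_case is a hypothetical helper, the output column names source and target are borrowed from the question's attempt, and the data is the sample from the question. Note the guard for cases with fewer than two distinct words, since combn() raises an error when asked for pairs from a single element.

```r
df <- data.frame(
  Case.Number = c(rep("127967", 4), rep("141330", 4), rep("141240", 3)),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered word pairs for one case
pairs_for_case <- function(words, id) {
  words <- unique(words)                  # ignore repeated words within a case
  if (length(words) < 2) return(NULL)     # no pairs possible
  m <- t(combn(words, 2))                 # one row per pair, two columns
  data.frame(Case.Number = id, source = m[, 1], target = m[, 2],
             stringsAsFactors = FALSE)
}

# Split the words by case, build the pairs for each case, stack the results
by_case <- split(df$Word, df$Case.Number)
pairs <- do.call(rbind, Map(pairs_for_case, by_case, names(by_case)))
rownames(pairs) <- NULL

pairs
```

Because combn keeps the input order, the pairs for each case come out in the order the words appear in the data (account/want first for case 127967) rather than alphabetically, which is closer to the layout asked for in the question.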