Produce two columns of word combinations from one column of words where ID column values are equal

I am trying to prepare a data frame to feed into the forceNetwork function.

Here is a sample of my data:

tbl <- structure(list(Case.Number = c("127967", "127967", "127967", 
"127967", "141330", "141330", "141330", "141330", "141240", "141240", 
"141240"), Word = c("account", "want", "membership", "sort", 
"unhappi", "vr", "info", "miss", "csrf", "unhappi", "dissatisfi"
)), .Names = c("Case.Number", "Word"), class = c("data.table", 
"data.frame"), row.names = c(NA, -11L))

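For the words under each case number, I want to generate a data frame whose two columns hold every possible (and unique) two-word combination, as shown below. There should be no duplicate combinations (including the same pair in reverse order) and no word paired with itself: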

127967 account want
127967 account membership
127967 account sort
127967 want    membership
127967 want    sort
141330 unhappi vr
141330 unhappi info...

excluding
141330 unhappi unhappi

I tried the following to get the combinations:

source <- c("remove")
target <- c("remove")
ID <- c("remove")
df <- data.frame(ID = c("remove"), source = c("remove"), target = c("remove"))

for(i in unique(tbl$Case.Number)){
  for (r in grep(i, tbl$Case.Number)) {
    if(r < max(grep(i, tbl$Case.Number))){
      ID <- i
      source <- tbl$Word[r]
      target <- tbl$Word[r+1]
      rbind(df, cbind(ID, source,target))
    }

  }

}

View(df) 

But it did not work.

Is there a more concise way to do this?

Self-join, then filter:

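library(data.table)
# dd is the data.table from the question (called tbl there)
setkey(dd, Case.Number)
# join dd to itself within each Case.Number, then keep each unordered pair once
dd[dd, allow.cartesian = TRUE][Word < i.Word]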
#     Case.Number       Word     i.Word
#  1:      127967    account       want
#  2:      127967 membership       want
#  3:      127967       sort       want
#  4:      127967    account membership
#  5:      127967    account       sort
#  6:      127967 membership       sort
#  7:      141240       csrf    unhappi
#  8:      141240 dissatisfi    unhappi
#  9:      141240       csrf dissatisfi
# 10:      141330       info    unhappi
# 11:      141330       miss    unhappi
# 12:      141330    unhappi         vr
# 13:      141330       info         vr
# 14:      141330       miss         vr
# 15:      141330       info       miss
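
To see why the `Word < i.Word` filter is enough, here is a minimal base R sketch of the same self-join-then-filter idea. It is not the answer's data.table code: it swaps in `merge()` so it runs without any packages, and the names `all_pairs` and `pairs` are made up for illustration. It assumes that no word appears twice within a case.

```r
# Sample data copied from the question (plain data.frame, no packages needed)
tbl <- data.frame(
  Case.Number = c("127967", "127967", "127967", "127967",
                  "141330", "141330", "141330", "141330",
                  "141240", "141240", "141240"),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# Self-join on Case.Number: every word is paired with every word of the same
# case, including itself and both orderings of each pair.
# Resulting columns: Case.Number, Word.x, Word.y
all_pairs <- merge(tbl, tbl, by = "Case.Number")

# Keeping only Word.x < Word.y drops the self-pairs (equal words) and exactly
# one of the two orderings of every other pair.
pairs <- all_pairs[all_pairs$Word.x < all_pairs$Word.y, ]

nrow(all_pairs)  # 4*4 + 4*4 + 3*3 = 41 rows before filtering
nrow(pairs)      # 6 + 6 + 3 = 15 unordered pairs after filtering
```

Because the filter compares the two words as strings, the word that lands in each column is decided by sort order, not by the order of the rows in the input.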

Updated

Using tidyr::expand...

df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967    account
127967       want
127967 membership
127967       sort
141330    unhappi
141330         vr
141330       info
141330       miss
141240       csrf
141240    unhappi
141240 dissatisfi
")

library(dplyr)
library(tidyr)

df %>% 
  group_by(Case.Number) %>% 
  expand(Word, i.Word = Word) %>% 
  filter(Word < i.Word)

Here is a tidyverse approach (simpler than the original method below, and borrowing @Gregor's very simple filtering approach)...

df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967    account
127967       want
127967 membership
127967       sort
141330    unhappi
141330         vr
141330       info
141330       miss
141240       csrf
141240    unhappi
141240 dissatisfi
")

library(dplyr)
library(tidyr)

df %>% 
  group_by(Case.Number) %>% 
  mutate(i.Word = Word) %>% 
  complete(Word, i.Word) %>% 
  filter(Word < i.Word)

# A tibble: 15 x 3
# Groups: Case.Number [3]
   Case.Number Word       i.Word    
         <int> <chr>      <chr>     
 1      127967 account    membership
 2      127967 account    sort      
 3      127967 account    want      
 4      127967 membership sort      
 5      127967 membership want      
 6      127967 sort       want      
 7      141240 csrf       dissatisfi
 8      141240 csrf       unhappi   
 9      141240 dissatisfi unhappi   
10      141330 info       miss      
11      141330 info       unhappi   
12      141330 info       vr        
13      141330 miss       unhappi   
14      141330 miss       vr        
15      141330 unhappi    vr

Here is a tidyverse approach (if a somewhat convoluted one)...

df <- read.table(header = T, stringsAsFactors = F, text = "
Case.Number Word
127967    account
127967       want
127967 membership
127967       sort
141330    unhappi
141330         vr
141330       info
141330       miss
141240       csrf
141240    unhappi
141240 dissatisfi
")

library(dplyr)
library(tidyr)

as_tibble(df) %>% 
  group_by(Case.Number) %>% 
  mutate(Word = list(as_data_frame(t(combn(unlist(Word), 2))))) %>% 
  unique() %>% 
  unnest(Word)

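It is a little easier to follow if you run the commands below one at a time to see what each does. combn(vec, 2) does the heavy lifting: it expands a vector into all of its possible two-element combinations, one combination per column, and t() then turns each combination into a row.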

vec <- c("account", "want", "membership", "sort")
combn(vec, 2)
t(combn(vec, 2))
as_data_frame(t(combn(vec, 2)))
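
The same combn idea can be applied per case without the tidyverse. The following is a rough base R sketch rather than part of the original answer. The column names `source` and `target` are only borrowed from the question's own attempt, `pairs_by_case` and `result` are hypothetical names, and the code assumes every case has at least two words (combn raises an error otherwise).

```r
# Sample data copied from the question
tbl <- data.frame(
  Case.Number = c("127967", "127967", "127967", "127967",
                  "141330", "141330", "141330", "141330",
                  "141240", "141240", "141240"),
  Word = c("account", "want", "membership", "sort",
           "unhappi", "vr", "info", "miss",
           "csrf", "unhappi", "dissatisfi"),
  stringsAsFactors = FALSE
)

# One two-column matrix of word pairs per case: split the words by case,
# let combn() enumerate the pairs, and transpose so each pair is a row
pairs_by_case <- lapply(split(tbl$Word, tbl$Case.Number),
                        function(w) t(combn(w, 2)))

# Stack the per-case matrices and repeat each case number once per pair
m <- do.call(rbind, pairs_by_case)
n <- vapply(pairs_by_case, nrow, integer(1))
result <- data.frame(
  Case.Number = rep(names(pairs_by_case), times = n),
  source = m[, 1],
  target = m[, 2],
  stringsAsFactors = FALSE
)

head(result, 3)
```

Unlike the `Word < i.Word` filter, combn keeps the words in their original row order inside each pair, so the first rows here are account/want, account/membership, account/sort, as in the output requested in the question.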