Frequency of strings and their IDs in a dataframe using R

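The goal is to compute the frequency of each string in a text variable and to associate the corresponding IDs with each string.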

Suppose Sample is a data frame like this:

Sample <- data.frame(ID = c('1', '2', '3', '4', '5', '6'), 
                        Var = c('How are you', 
                                 'Do not go', 
                                 'How are you', 
                                 'Please go',  
                                 'How are you',
                                 'Do not go'))

The following command produces the frequency of each string in the Var column:

as.data.frame(table(unlist(strsplit(tolower(Sample$Var), ', '))))

Is there a way to also get the associated IDs in the same table, next to the counts?

Try this:

library(dplyr)
# Count the rows for each string and collapse the matching IDs into one string
New <- Sample %>% group_by(Var) %>%
  summarise(Freq=n(),IDS=toString(ID))

Output:

# A tibble: 3 x 3
  Var          Freq IDS    
  <chr>       <int> <chr>  
1 Do not go       2 2, 6   
2 How are you     3 1, 3, 5
3 Please go       1 4      

Here is another option, if you use data.table:

> library(data.table)
> setDT(Sample)[, .(Freq = .N, ID.asso = list(ID)), keyby = Var]
           Var Freq ID.asso
1:   Do not go    2     2,6
2: How are you    3   1,3,5
3:   Please go    1       4
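
Note that ID.asso above is a list column. If a plain comma-separated string is preferred, as in the dplyr answer, a minimal sketch could look like the following. It assumes data.table is installed, and the names dt, res and IDS are only illustrative. It uses as.data.table() instead of setDT() so that Sample itself is not modified by reference.

```r
library(data.table)

Sample <- data.frame(ID = c('1', '2', '3', '4', '5', '6'),
                     Var = c('How are you', 'Do not go', 'How are you',
                             'Please go', 'How are you', 'Do not go'),
                     stringsAsFactors = FALSE)

# Work on a copy so the original data frame is left untouched
dt <- as.data.table(Sample)

# .N counts the rows per group; toString() collapses the IDs into "a, b, c"
res <- dt[, .(Freq = .N, IDS = toString(ID)), keyby = Var]
res
```

A character column like this is easier to write to CSV, while the list column keeps the IDs as separate values for further processing.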

We can use dplyr together with stringr:

library(dplyr)
library(stringr)
Sample %>%
   group_by(Var) %>%
    summarise(Freq = n(), IDS = str_c(ID, collapse=", "))

Base R solution:

data.frame(do.call(rbind, lapply(with(Sample, split(Sample, Var)), function(x){
      with(x, data.frame(Var = unique(Var), Freq = nrow(x), ID = toString(ID)))
   }
  )
), row.names = NULL, stringsAsFactors = FALSE)
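
A shorter base R variant is also possible with table() and tapply(). This is only a sketch of an alternative, not the approach used in the answer above. The names freq, ids and result are illustrative.

```r
Sample <- data.frame(ID = c('1', '2', '3', '4', '5', '6'),
                     Var = c('How are you', 'Do not go', 'How are you',
                             'Please go', 'How are you', 'Do not go'),
                     stringsAsFactors = FALSE)

# Number of rows for each distinct string
freq <- table(Sample$Var)

# IDs belonging to each distinct string, collapsed into one string per group
ids <- tapply(Sample$ID, Sample$Var, toString)

# Combine both, indexing freq by name so the two vectors stay aligned
result <- data.frame(Var = names(ids),
                     Freq = as.integer(freq[names(ids)]),
                     IDS = as.vector(ids),
                     stringsAsFactors = FALSE)
result
```

Both table() and tapply() group by the same values of Var, so the result has one row per distinct string, with its count and its IDs.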