How to compare a List of words (chr) to values in multiple columns in a dataframe and output a binary response if there is a match in R

Question

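I want to compare each individual word in the words column to the values in columns V1 to V576, row by row. If any word from the words column matches the word in a V column, replace the word in that V column by 1, or by 0 if there is no match. Any idea how to do this? I am not sure how to iterate through all the rows and columns.

The dataframe is called Data. The column words is a list ($words :List of 42201). There are 42201 rows and about 576 columns of words to compare (V1 to V576).

Here is the dput output of just the first 3 rows and the first 20 columns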

structure(list(id = c("Te-1", "Te-2", "Te-3"), category = c("Fabric Care", 
"Fabric Care", "Home Care"), brand = c("Tide", "Tide", "Cascade"
), sub_category = c("Laundry", "Laundry", "Auto Dishwashing"), 
    market = c("US", "US", "US"), review_title = c("the best in a very crowded market", 
    "first time", "i have been using another well known brand and did not expect    "
    ), review_text = c("the best general wash detergent  convenient container that keeps the product driy ", 
    "this helped to clean our washing machine after getting it from someone else   this review was collected as part of a promotion  ", 
    "i have been using another well known brand and did not expect much difference  wow  was i ever mistaken  i will never go back "
    ), review_rating = c(5L, 5L, 5L), words = list(c("the", "best", 
    "general", "wash", "deterg", "conveni", "contain", "that", 
    "keep", "the", "product", "driy"), c("this", "help", "to", 
    "clean", "our", "wash", "machin", "after", "get", "it", "from", 
    "someon", "els", "this", "review", "was", "collect", "as", 
    "part", "of", "a", "promot"), c("i", "have", "been", "use", 
    "anoth", "well", "known", "brand", "and", "did", "not", "expect", 
    "much", "differ", "wow", "was", "i", "ever", "mistaken", 
    "i", "will", "never", "go", "back")), V1 = c("absolut", "absolut", 
    "absolut"), V2 = c("action", "action", "action"), V3 = c("actionpac", 
    "actionpac", "actionpac"), V4 = c("actual", "actual", "actual"
    ), V5 = c("addit", "addit", "addit"), V6 = c("adverti", "adverti", 
    "adverti"), V7 = c("afford", "afford", "afford"), V8 = c("agent", 
    "agent", "agent"), V9 = c("allerg", "allerg", "allerg"), 
    V10 = c("allergi", "allergi", "allergi"), V11 = c("alon", 
    "alon", "alon")), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

Please see the dataframe snapshot below to better understand my problem

Thanks a lot for your help!

Answer 1

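To show you how to create a reproducible example of your problem, I created a new data sample and provide code, using the tidyverse, that I think answers your question.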

library(tidyverse)

df <- data.frame(
  words = c("I want to compare each individual word in the words",
            "column to the values in columns V1 to V576",
            ". If any word from the words column matches any",
            "replace the word in the respective V column by 1 or else"),
  v1 = c("want", "want", "want", "want"),
  v2 = c("word", "word", "word", "word"),
  v3 = c("any", "any", "any", "any")
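  )

# reshape to long form, test each value against the row's text, reshape back
df %>%
  gather(key = key, value = value, -words) %>%
  # 1 if value occurs anywhere in words (value is treated as a regex)
  mutate(appear = as.numeric(str_detect(words, value))) %>%
  select(-value) %>%
  spread(key, appear)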

Output

                                                     words v1 v2 v3
1          . If any word from the words column matches any  0  1  1
2               column to the values in columns V1 to V576  0  0  0
3      I want to compare each individual word in the words  1  1  0
4 replace the word in the respective V column by 1 or else  0  1  0
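
As a side note, gather() and spread() are superseded in current tidyr by pivot_longer() and pivot_wider(). A minimal sketch of the same reshape with those functions follows. The two-row data frame is made up for illustration, and the sketch assumes every value in words is unique, since that column acts as the row identifier when pivoting back.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical two-row sample, not the asker's data
df <- data.frame(
  words = c("i want a word", "nothing here"),
  v1 = c("want", "want"),
  v2 = c("word", "word"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # one row per (words, column) pair
  pivot_longer(-words, names_to = "key", values_to = "value") %>%
  # 1 if value occurs in words (substring / regex match, as above)
  mutate(appear = as.numeric(str_detect(words, value))) %>%
  select(-value) %>%
  # back to one column per original v column
  pivot_wider(names_from = key, values_from = appear)

res
```

Unlike spread(), pivot_wider() keeps the rows in their order of first appearance instead of sorting them by words.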

Answer 2

In addition to @Johan Rosa's tidyverse solution, here is a base-R solution:

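# For each row, flag every column whose value appears in that row's word vector
rows <- lapply(seq_len(nrow(yourFrame)), function(row){
  out <- as.numeric(yourFrame[row,] %in% unlist(yourFrame[row,'words']))
  names(out) <- names(yourFrame)  # keep the original column names
  return(out)
})
# Stack the per-row vectors into one data frame
df <- data.frame(do.call(rbind, rows))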

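The lapply call loops over each row of the data.frame and builds a 0/1 vector for that row, marking every column whose value is found in the row's word vector, while keeping the original column names. The final call just binds the per-row vectors together.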
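
If only the V columns should be overwritten and the other columns left alone, the same row-wise idea can be applied column by column. This is a minimal sketch under assumptions: the miniature Data below is invented to mirror the question's layout (a list column words plus V1 to V3 standing in for V1 to V576), and v_cols is a helper name introduced here.

```r
# Hypothetical miniature of the question's data
Data <- data.frame(id = c("Te-1", "Te-2"), stringsAsFactors = FALSE)
Data$words <- list(c("the", "best", "wash"), c("this", "help", "clean"))
Data$V1 <- c("best", "best")
Data$V2 <- c("clean", "clean")
Data$V3 <- c("wash", "wash")

# Names of the columns to compare (assumed to follow the V<number> pattern)
v_cols <- grep("^V[0-9]+$", names(Data), value = TRUE)

# For each V column, check element i against the word vector of row i
Data[v_cols] <- lapply(Data[v_cols], function(col) {
  as.integer(mapply(function(value, w) value %in% w, col, Data$words))
})

# V1 becomes 1, 0; V2 becomes 0, 1; V3 becomes 1, 0
Data[v_cols]
```

Because %in% compares whole strings, "wash" matches only the exact word "wash" and not a longer word that contains it.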

Answer 3

I created a data frame

Data

data <- data.frame(words = c("the, best, general","i, have, been"), v1 = c("best","no"), v2 = c("have", "nothing"), stringsAsFactors = FALSE)

Using nested for loops, I pass each value to grepl: wherever it matches, the cell becomes 1, otherwise 0.

for (i in 2:ncol(data)) {
  for (j in 1:nrow(data)) {
    y <- data$words[j]    # text of row j
    ab <- data[j, i]      # candidate word in column i
    abc <- grepl(ab, y)   # TRUE if the word occurs in the text
    # write 1 on a match and 0 otherwise
    data[j, i] <- ifelse(abc, 1, 0)
  }
}

Result

print(data)
               words v1 v2
1 the, best, general  1  0
2      i, have, been  0  0
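
One caveat with grepl is that it searches for substrings, so a short word can match inside a longer one. A minimal sketch of the difference, using made-up text and splitting on non-letters to get whole words (the variable names are illustrative only):

```r
# Hypothetical inputs: one text and one candidate word per row
text <- c("the best general wash", "many things")
word <- c("best", "any")

# Substring search: "any" is found inside "many"
substring_hit <- mapply(function(w, t) grepl(w, t, fixed = TRUE), word, text)

# Whole-word search: tokenize the text, then test membership with %in%
tokens <- strsplit(text, "[^[:alpha:]]+")
word_hit <- mapply(function(w, tk) w %in% tk, word, tokens)

as.integer(substring_hit)  # 1 1
as.integer(word_hit)       # 1 0
```

Which behaviour is wanted depends on the data. With stemmed words, as in the question, exact matching against the token list is probably the safer choice.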
