How to take a word and create an indicator variable based on the word's presence in comments?

Question

I have a vector of words and a vector of comments:

word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

I want to create a data frame that looks like this:

df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad"),
               very = c(1,0,0,0,1),
               glad = c(0,1,0,0,1),
               experience = c(1,0,0,1,0))

I have more than 12,000 comments and 20 words that I want to do this with. How can I do this efficiently? With for loops? Is there another way?

Answer 1

One way is to combine the stringi and qdapTools packages, i.e.

library(stringi)
library(qdapTools)

mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
#  experience glad very
#1          1    0    1
#2          0    1    0
#3          0    0    0
#4          1    0    0
#5          0    1    1

You can then bind the result to the comments with cbind or data.frame,

cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))

Answer 2

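Using base R, this code loops over the word list and each comment, checks whether each word is present in the split comment (split on spaces and punctuation), and then recombines the results into a data frame.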

df <- as.data.frame(do.call(cbind,lapply(word.list,function(w) 
          as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)

df
                                                                        comments very experience glad
1 very good experience. first time I have been and I would definitely come back.    1          1    0
2                                               glad I scheduled an appointment.    0          0    1
3                                            the staff have become more cordial.    0          0    0
4                                      the experience i had was not good at all.    0          1    0
5                                                                 i am very glad    1          0    1
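
Because the question mentions 12,000+ comments, a variation on this answer is to split each comment only once instead of once per word. The following is a minimal sketch, not the answerer's code: the name df_tokens, the lower-casing step, and the choice to split on every non-letter character are assumptions made for illustration.

```r
word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# Tokenize every comment once (assumption: lower-case, split on non-letters)
tokens <- strsplit(tolower(comments), "[^a-z]+")

# For each word, check membership in each comment's token vector
flags <- sapply(word.list, function(w) {
  as.integer(vapply(tokens, function(tk) w %in% tk, logical(1)))
})

# flags is a matrix with one column per word; bind it to the comments
df_tokens <- data.frame(comments = comments, flags, stringsAsFactors = FALSE)
df_tokens
```

Since whole tokens are compared, a word such as "very" would not be flagged for a comment that only contains "veryX".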

Answer 3

Loop through word.list and use grepl:

sapply(word.list, function(i) as.numeric(grepl(i, comments)))

To get nicer output, convert to a data frame:

data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))

Note: grepl will match "very" inside "veryX" as well. If that is not desired, you need whole-word matching.

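# To avoid matching "very" with "veryX", wrap each word in word boundaries.
# The backslash must be doubled: in an R string "\b" is a backspace character.
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))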
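
Putting the pieces of this answer together, the whole-word grepl approach can be wrapped in a small function that returns the data frame the question asks for. This is a hedged sketch: make_indicators is a hypothetical helper name, and the ignore.case = TRUE and perl = TRUE settings are assumptions added here, not part of the original answer.

```r
word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# Hypothetical helper: one 0/1 column per word, matched as a whole word
make_indicators <- function(comments, words) {
  flags <- lapply(stats::setNames(words, words), function(w) {
    pattern <- paste0("\\b", w, "\\b")  # word boundaries on both sides
    as.integer(grepl(pattern, comments, ignore.case = TRUE, perl = TRUE))
  })
  cbind(data.frame(comments = comments, stringsAsFactors = FALSE),
        as.data.frame(flags))
}

df <- make_indicators(comments, word.list)
df
```

grepl is vectorized over comments, so the only explicit loop is over the 20 or so words, which should stay manageable for a few thousand comments.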

regex

r

grepl