How to take a word and create an indicator variable based on the word's presence in comments?
I have a vector of words and a vector of comments:
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
"glad I scheduled an appointment.",
"the staff have become more cordial.",
"the experience i had was not good at all.",
"i am very glad")
I want to create a data frame that looks like this:
df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
"glad I scheduled an appointment.",
"the staff have become more cordial.",
"the experience i had was not good at all.",
"i am very glad"),
very = c(1,0,0,0,1),
glad = c(0,1,0,0,1),
experience = c(1,0,0,1,0))
I have more than 12,000 comments and 20 words that I want to do this with. How can I do this efficiently? With for loops? Is there another approach?
One approach is to combine the stringi and qdapTools packages, i.e.
library(stringi)
library(qdapTools)
mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
# experience glad very
#1 1 0 1
#2 0 1 0
#3 0 0 0
#4 1 0 0
#5 0 1 1
You can then bind the result to the comments with cbind or data.frame:
cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))
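One thing to keep in mind with the extract-then-tabulate idea: it tabulates every match, so a comment that contains a word twice would presumably get a 2 rather than a 1. The following is a minimal base-R sketch of the same idea with the counts capped at 1. It is not the stringi/qdapTools code itself; the example comments are made up for illustration, and it still matches substrings rather than whole words.

```r
word.list <- c("very", "experience", "glad")
# Made-up comments; the first one repeats "very" on purpose.
comments <- c("very very good experience.",
              "glad I came.",
              "the staff were cordial.")

# Extract every occurrence of any listed word from each comment.
pattern <- paste(word.list, collapse = "|")
hits <- regmatches(comments, gregexpr(pattern, comments))

# Count occurrences per comment, keeping a column for every word
# even when it never appears (factor levels fix the column set).
counts <- t(vapply(hits,
                   function(h) as.integer(table(factor(h, levels = word.list))),
                   integer(length(word.list))))
colnames(counts) <- word.list

# Cap the counts at 1 to turn them into presence indicators.
indicators <- (counts > 0) * 1
```

Fixing the factor levels to word.list also keeps the columns in the order of the word list rather than alphabetical order.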
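Using base R, this code loops over the word list and over each comment, checks whether each word is among the tokens of the split comment (split on spaces, periods and commas), and then recombines the results into a data frame.
df <- as.data.frame(do.call(cbind, lapply(word.list, function(w)
  as.numeric(sapply(comments, function(v) w %in% unlist(strsplit(v, "[ .,]")))))))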
names(df) <- word.list
df <- cbind(comments,df)
df
comments very experience glad
1 very good experience. first time I have been and I would definitely come back. 1 1 0
2 glad I scheduled an appointment. 0 0 1
3 the staff have become more cordial. 0 0 0
4 the experience i had was not good at all. 0 1 0
5 i am very glad 1 0 1
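Since the question asks about efficiency, here is a hedged variant of the same split-and-check idea that tokenises each comment only once instead of once per word. It is a sketch, not a benchmarked solution: it assumes the words in word.list are lower-case, lower-cases the comments so that "Very" would also count, and splits on any run of non-alphanumeric characters. The names tokens, ind and result are just illustrative.

```r
word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

# Tokenise every comment once: lower-case, then split on runs of
# non-alphanumeric characters (spaces, periods, commas, ...).
tokens <- strsplit(tolower(comments), "[^[:alnum:]]+")

# One column per word: 1 if the word is among a comment's tokens, else 0.
ind <- vapply(word.list,
              function(w) vapply(tokens, function(tk) w %in% tk, logical(1)),
              logical(length(tokens))) * 1

result <- data.frame(comments, ind, stringsAsFactors = FALSE)
```

Because the comparison is against whole tokens, this matches complete words only.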
Loop over word.list and use grepl:
sapply(word.list, function(i) as.numeric(grepl(i, comments)))
For nicer output, convert the result to a data frame:
data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))
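Note: grepl will match "very" inside "veryX" as well. If that is not wanted, use whole-word matching with word boundaries. The backslash must be doubled inside an R string, otherwise "\b" is read as a backspace character.
# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))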
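The grepl approach can be wrapped into a small helper that returns the data frame directly. This is only a sketch: make_indicators is a hypothetical name, the ignore.case default is my assumption rather than something the question asked for, and each word is still treated as a regular expression, so it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: one 0/1 column per word, using whole-word matching.
make_indicators <- function(comments, words, ignore.case = TRUE) {
  hits <- vapply(
    words,
    function(w) grepl(paste0("\\b", w, "\\b"), comments, ignore.case = ignore.case),
    logical(length(comments))
  )
  # vapply returns a plain vector when there is a single comment,
  # so force a comments-by-words matrix in every case.
  hits <- matrix(hits, nrow = length(comments), dimnames = list(NULL, words))
  data.frame(comments = comments, hits * 1,
             stringsAsFactors = FALSE, check.names = FALSE)
}

word.list <- c("very", "experience", "glad")
comments <- c("very good experience. first time I have been and I would definitely come back.",
              "glad I scheduled an appointment.",
              "the staff have become more cordial.",
              "the experience i had was not good at all.",
              "i am very glad")

df <- make_indicators(comments, word.list)
```

With 12,000 comments and 20 words this amounts to 20 vectorised grepl calls, one per word.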