Mapping the topic of a review in R
I have two datasets: review data and topic data.
Dput of my review data:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput of my topic data:
structure(list(word = structure(2:1, .Label = c("canteen food",
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen",
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput of my desired output. I want to look up the words that appear in the topic data and map the matching topic onto the review data:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor"),
Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
What you want is something like a fuzzy join. Here is a brute-force approach that looks for strict substrings (but case-insensitive):
library(dplyr)
review %>%
  # full cartesian expansion (dplyr >= 1.1.0 provides cross_join() for this)
  full_join(topic, by = character()) %>%
  group_by(word) %>%
  mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
  ungroup() %>%
  filter(matched) %>%
  select(-word, -matched)
# # A tibble: 2 x 2
# Review Topic
# <fct> <fct>
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved "Canteen"
It is a bit brute-force, because it does a cartesian join of the frames before testing with grepl, but ... you can't really avoid some part of that.
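One caveat worth illustrating: grepl treats its pattern as a regular expression, so a topic word containing characters such as "." or "(" would not be matched literally. The following is a minimal base R sketch of a literal, case-insensitive substring match, assuming character (not factor) columns. The names hits, idx, and result are illustrative only; lower-casing both sides is used here because fixed = TRUE cannot be combined with ignore.case = TRUE.

```r
review <- data.frame(
  Review = c("Canteen Food could be improved",
             "Sports and physical exercise need to be given importance"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("canteen food", "sports and physical"),
  Topic = c("Canteen", "Sports"),
  stringsAsFactors = FALSE
)

# One logical column per topic word, one row per review (literal matching)
hits <- matrix(
  unlist(lapply(tolower(topic$word), function(w) {
    grepl(w, tolower(review$Review), fixed = TRUE)
  })),
  nrow = nrow(review)
)

# Every (review, topic) pair that matched
idx <- which(hits, arr.ind = TRUE)
result <- data.frame(
  Review = review$Review[idx[, "row"]],
  Topic  = topic$Topic[idx[, "col"]],
  stringsAsFactors = FALSE
)
result
```

This still compares every review against every topic word, so it is the same amount of work as the cartesian join; it only avoids materializing the expanded data frame.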
You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).
fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
# Review word Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports
# 2 Canteen Food could be improved canteen food Canteen
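The warning is because your columns are factor, not character; it should be harmless. If you want to hide the warning, you can use suppressWarnings (a little strong); if you want to prevent the warning, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), and the same for review$Review, but adapt it if you have numeric columns).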
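As a minimal sketch of the "adapt it if you have numeric columns" case, the conversion can be limited to factor columns only. The weight column and the is_fac name below are hypothetical, added purely for illustration.

```r
# Hypothetical topic table with a numeric column next to the factor columns
topic <- data.frame(
  word   = c("canteen food", "sports and physical"),
  Topic  = c("Canteen", "Sports"),
  weight = c(1.5, 2),
  stringsAsFactors = TRUE
)

# Find the factor columns and convert only those to character
is_fac <- vapply(topic, is.factor, logical(1))
topic[is_fac] <- lapply(topic[is_fac], as.character)

str(topic)
```

The numeric column is left untouched, whereas lapply(topic, as.character) over the whole frame would have turned it into strings.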
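Amateur here. I did this with base R rather than dplyr, because I'm not the best with join functions.
Below, initialize your data frames. I added a few more examples to make sure everything works. I also chose not to use factors, which make assigning strings later on messy.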
# initialize your dfs
review <- data.frame("Review" = c("Canteen Food could be improved",
                                  "Sports and physical exercise need to be given importance",
                                  "canteen food x2",
                                  "this is my sports and physical",
                                  "SPORTS AND PHYSICAL",
                                  "meme",
                                  "canteen and food",
                                  "this is my meme",
                                  "memethis"
                                  ),
                     stringsAsFactors = FALSE)
topic <- data.frame("word" = c("canteen food", "sports and physical", "meme"),
                    "Topic" = c("Canteen", "Sports", "meme_cat"),
                    stringsAsFactors = FALSE)
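Then I just used some nested for loops to iterate over the words you want, find the matching strings, and assign the associated topic. Everything is initialized before the for loops.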
# initialize new column to write into in loop
review <- cbind(review, "Topic" = rep(NA, nrow(review)))
# initialize the match vector before the loop (one entry per review row)
a <- rep(FALSE, nrow(review))
# loop over words in topic and find string matches in review; if any, assign review$Topic = Topic
for (i in seq_len(nrow(topic))) {
  for (j in seq_len(nrow(review))) {
    a[j] <- grepl(topic$word[i], review$Review[j], ignore.case = TRUE)
  }
  if (any(a)) {
    review$Topic[a] <- topic$Topic[i]
  }
}
review
# Review Topic
#1 Canteen Food could be improved Canteen
#2 Sports and physical exercise need to be given importance Sports
#3 canteen food x2 Canteen
#4 this is my sports and physical Sports
#5 SPORTS AND PHYSICAL Sports
#6 meme meme_cat
#7 canteen and food <NA>
#8 this is my meme meme_cat
#9 memethis meme_cat
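Since grepl is vectorized over the strings it searches, the inner loop over review rows is not strictly needed. The following is a minimal sketch of the same idea with a single loop, using a shortened version of the example data; the hit variable is an illustrative name. As with the nested loops, if a review matched more than one topic word, the last matching topic would overwrite earlier ones.

```r
review <- data.frame(
  Review = c("canteen food x2", "canteen and food", "memethis", "SPORTS AND PHYSICAL"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("canteen food", "sports and physical", "meme"),
  Topic = c("Canteen", "Sports", "meme_cat"),
  stringsAsFactors = FALSE
)

# Start with no topic assigned
review$Topic <- NA_character_

# One pass per topic word; grepl tests all reviews at once
for (i in seq_len(nrow(topic))) {
  hit <- grepl(topic$word[i], review$Review, ignore.case = TRUE)
  review$Topic[hit] <- topic$Topic[i]
}
review
```

Reviews that match no topic word, such as "canteen and food", keep NA in the Topic column.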