Mapping the topic of the review in R

Question

I have two datasets: Review Data and Topic Data.

dput of my Review Data:

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

dput of my Topic Data:

structure(list(word = structure(2:1, .Label = c("canteen food", 
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", 
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

dput of my desired output. I want to look up the words that appear in the Topic Data and map the corresponding topic onto the Review Data:

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor"), 
    Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Answer 1

What you want is something like a fuzzy join. Here is a brute-force approach that looks for strict substrings (case-insensitively):

library(dplyr)
review %>%
  full_join(topic, by = character()) %>% # full cartesian expansion; dplyr >= 1.1.0 prefers cross_join(topic)
  group_by(word) %>%
  mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
  ungroup() %>%
  filter(matched) %>%
  select(-word, -matched)
# # A tibble: 2 x 2
#   Review                                                   Topic    
#   <fct>                                                    <fct>    
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved                           "Canteen"

It is somewhat brute-force because it does a cartesian join of the two frames before testing with grepl, but you can't really avoid some form of that.
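For readers without dplyr, the same cartesian-then-filter idea can be sketched in base R. This is a minimal illustration, not the answer's original code: it assumes the question's data rebuilt with character columns rather than factors, with the trailing space in "Sports " trimmed, and the names `pairs`, `hit`, and `mapped` are made up for the example.

```r
# Sample data from the question, rebuilt with character columns (assumption:
# factors dropped and the trailing space in "Sports " trimmed).
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports", "Canteen"),
  stringsAsFactors = FALSE
)

# Step 1: cartesian product, every review paired with every topic word.
# merge() with by = NULL returns the cross join of the two frames.
pairs <- merge(review, topic, by = NULL)

# Step 2: test each pair. mapply() calls grepl(word, Review) row by row.
hit <- mapply(grepl, pairs$word, pairs$Review,
              MoreArgs = list(ignore.case = TRUE), USE.NAMES = FALSE)

# Step 3: keep only the matching pairs and the columns of interest.
mapped <- pairs[hit, c("Review", "Topic")]
rownames(mapped) <- NULL
mapped
```

The intermediate frame has nrow(review) * nrow(topic) rows, so this is only practical when that product fits comfortably in memory.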

You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).

fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
#                                                     Review                word   Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports 
# 2                           Canteen Food could be improved        canteen food Canteen

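The warning appears because your columns are factor, not character; it should be harmless. If you want to hide the warning, you can use suppressWarnings (a bit heavy-handed). If you want to prevent it, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), and the same for review$Review, but modify this if you have numeric columns).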
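If the frame also has non-factor columns, one way to convert only the factors is to select them first. The sketch below is an assumption-laden illustration: the integer column `n` is hypothetical and is added only to show that it is left untouched.

```r
# Topic data as factors, plus a hypothetical integer column `n` (not in the
# question) to show that non-factor columns keep their type.
topic <- data.frame(
  word  = factor(c("sports and physical", "canteen food")),
  Topic = factor(c("Sports", "Canteen")),
  n     = c(1L, 2L)
)

# Flag the factor columns, then convert just those to character.
is_fct <- vapply(topic, is.factor, logical(1))
topic[is_fct] <- lapply(topic[is_fct], as.character)

str(topic)
```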

Answer 2

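Hobbyist here. I did this with base R rather than dplyr, because I'm not the best with join functions.

Below, initialize your data frames. I added more examples to make sure everything works. I also chose not to use factors, which make assigning strings messy later on.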

# initialize your dfs
review <- data.frame("Review" = c("Canteen Food could be improved", 
                                  "Sports and physical exercise need to be given importance",
                                  "canteen food x2",
                                  "this is my sports and physical",
                                  "SPORTS AND PHYSICAL",
                                  "meme",
                                  "canteen and food",
                                  "this is my meme",
                                  "memethis"
                                  ),
                     stringsAsFactors = FALSE)

topic <- data.frame("word" = c("canteen food", "sports and physical", "meme"), 
                    "Topic" = c("Canteen", "Sports", "meme_cat"),
                    stringsAsFactors = FALSE)

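Then I just used nested for loops to iterate over the words you want, find the matching strings, and assign the relevant topic. Everything is initialized before the for loops.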

# initialize the new column to write into in the loop
review <- cbind(review, "Topic" = rep(NA, nrow(review)))

# initialize the match vector before the loop: one flag per review row
a <- rep(FALSE, nrow(review))

# loop over the words in topic and find string matches in review;
# where a review matches, assign review$Topic = Topic
for (i in 1:nrow(topic)) {
  for (j in 1:nrow(review)) {
    a[j] <- grepl(topic$word[i], review$Review[j], ignore.case = TRUE)
  }
  if (any(a)) {
    review$Topic[a] <- topic$Topic[i]
  }
}

review
#                                                    Review    Topic
#1                           Canteen Food could be improved  Canteen
#2 Sports and physical exercise need to be given importance   Sports
#3                                          canteen food x2  Canteen
#4                           this is my sports and physical   Sports
#5                                      SPORTS AND PHYSICAL   Sports
#6                                                     meme meme_cat
#7                                         canteen and food     <NA>
#8                                          this is my meme meme_cat
#9                                                 memethis meme_cat
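Because grepl is vectorized over the strings it searches, the inner loop over review rows is not strictly needed. The following is a sketch of that variant on a subset of the sample data above, not the answer's original code. As in the nested-loop version, a review that matches several words ends up with the topic of the last matching word.

```r
# A subset of the sample data used in this answer.
review <- data.frame(
  Review = c("Canteen Food could be improved",
             "SPORTS AND PHYSICAL",
             "canteen and food",
             "memethis"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("canteen food", "sports and physical", "meme"),
  Topic = c("Canteen", "Sports", "meme_cat"),
  stringsAsFactors = FALSE
)

# Start with a character NA so the column type never changes.
review$Topic <- NA_character_

# One pass per topic word; grepl() tests every review at once.
for (i in seq_len(nrow(topic))) {
  hit <- grepl(topic$word[i], review$Review, ignore.case = TRUE)
  review$Topic[hit] <- topic$Topic[i]
}

review
```

seq_len(nrow(topic)) is used instead of 1:nrow(topic) so the loop is skipped cleanly when topic has zero rows.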
