Searching for words in text paragraphs and then flagging them in R

I have a text data set in which I want to search for various words and flag them when they are found. Here is the sample data:

library(data.table)

df <- data.table(
  id = 1:3,
  report = c("Travel opens our eyes to art, history, and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.",
             "We quickly observed that no one in Sicily cooks with recipes (just with the heart), so we now do the same.",
             "We quickly observed that no one in Sicily cooks with recipes so we now do the same."),
  summary = c("On our first trip to Sicily to discover our family roots,",
              "If you’re not a gardener, an Internet search for where to find zucchini flowers results.",
              "add some fresh cream to make the mixture a bit more liquid,"))

So far I have been using SQL for this, but it becomes challenging when you have a long list of words to look for.

library(sqldf)

dfOne <- sqldf("select id
  , case when lower(report) like '%opens%' then 1 else 0 end as opens
  , case when lower(report) like '%cooks%' then 1 else 0 end as cooks
  , case when lower(report) like '%internet%' then 1 else 0 end as internet
  , case when lower(report) like '%zucchini%' then 1 else 0 end as zucchini
  , case when lower(report) like '%fresh%' then 1 else 0 end as fresh
  from df")

I am looking for ideas on a more efficient way to do this. With a long list of target words, this code becomes unnecessarily long.

Thanks,

SM.

Here is a tidyverse approach. It assumes you want to search the two columns separately.

library(tidyverse)

df <- tibble(
  id = 1:3,
  report = c("Travel opens our eyes to art, history, and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.",
             "We quickly observed that no one in Sicily cooks with recipes (just with the heart), so we now do the same.",
             "We quickly observed that no one in Sicily cooks with recipes so we now do the same."),
  summary = c("On our first trip to Sicily to discover our family roots,",
              "If you’re not a gardener, an Internet search for where to find zucchini flowers results.",
              "add some fresh cream to make the mixture a bit more liquid,"))


# Vector of words, collapsed into a single regex alternation
vec <- c('eyes','art','gardener','mixture','trip')
pattern <- paste(vec, collapse = '|')

# str_detect() already returns TRUE/FALSE, so no case_when() is needed.
# Each flag is computed from its own column (report vs. summary).
df %>% 
  mutate(reportFlag = str_detect(report, pattern),
         summaryFlag = str_detect(summary, pattern))
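
One caveat with a plain alternation pattern is that it matches substrings, so 'art' also matches inside "heart" in the second report. The following is a minimal base R sketch, not part of the original answer, that wraps the alternation in `\\b` word boundaries; the names `loose` and `strict` and the shortened sample sentences are illustrative assumptions.

```r
vec <- c("eyes", "art", "gardener", "mixture", "trip")

# Shortened sample sentences (illustrative only)
report <- c("Travel opens our eyes to art, history, and culture",
            "no one in Sicily cooks with recipes (just with the heart)")

# Plain alternation: matches the words anywhere, even inside other words
loose <- paste(vec, collapse = "|")

# Same alternation, but anchored at word boundaries on both sides
strict <- paste0("\\b(", loose, ")\\b")

loose_flag  <- grepl(loose, report)   # "art" is found inside "heart"
strict_flag <- grepl(strict, report)  # only whole words count

print(data.frame(loose_flag, strict_flag))
```

The same `strict` pattern could be passed to `str_detect()` in place of the plain alternation if whole-word matching is what you want.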

1) sqldf

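Define the vector of words and then convert it to SQL. Note that case when is not needed, since like already produces a 0/1 result. Prefixing sqldf with fn$ enables $like to substitute the R like string into the SQL statement. Use the verbose = TRUE argument to sqldf to view the generated SQL statement. No matter how long words is, this is still only two lines of code.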

words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")

like <- toString(sprintf("\nlower(report) like '%%%s%%' as '%s'", words, words))
fn$sqldf("select id, $like from df", verbose = TRUE)

giving:

  id opens cooks internet zucchini fresh test me
1  1     1     0        0        0     0       0
2  2     0     1        0        0     0       0
3  3     0     1        0        0     0       0
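
To see how the select list is built without running any SQL, here is a small base R sketch of the string construction on its own. It leaves out the leading "\n" used in the answer (that only affects how the generated statement is laid out), and the names `fragments` and `sql` are introduced here purely for illustration.

```r
# Words to search for (same vector as in the answer)
words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")

# sprintf() is vectorized, so this yields one SQL fragment per word.
# "%%" is a literal percent sign and "%s" is replaced by the word.
fragments <- sprintf("lower(report) like '%%%s%%' as '%s'", words, words)
print(fragments[1])

# toString() joins the fragments with ", " into a single select list
like <- toString(fragments)

# Roughly the statement that fn$sqldf() ends up with after substituting $like
sql <- paste("select id,", like, "from df")
cat(sql, "\n")
```

Quoting the alias (`as 'test me'`) is what lets a search term containing a space be used as a column name.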

2) outer

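Using words from above, we can use outer as follows. Note that the function passed to outer (its third argument) must be vectorized; we can vectorize grepl as shown. Omit check.names = FALSE if you don't mind column names for words containing spaces or punctuation being munged into syntactic R variable names. This gives the same output as (1).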

with(df, data.frame(
    id, 
    +t(outer(setNames(words, words), report, Vectorize(grepl))), 
    check.names = FALSE
))
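
To make the shape of the intermediate result clearer, here is a reduced sketch of the outer step on two words and three shortened sentences (both made up for illustration). outer passes every word/sentence pair to the function in one call and expects one result per pair, which is why grepl is wrapped in Vectorize.

```r
words  <- c("opens", "cooks")
report <- c("Travel opens our eyes",
            "no one in Sicily cooks",
            "Sicily cooks with recipes")

# One row per word, one column per report; the row names come from
# the names that setNames() attaches to the words vector
m <- outer(setNames(words, words), report, Vectorize(grepl))

# Transpose to one row per report, then unary + turns TRUE/FALSE into 1/0
flags <- +t(m)
print(flags)
```

The transposed matrix has one column per word, which is what data.frame then binds next to id.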

3) sapply

Using sapply we can get a slightly shorter solution along the same lines as (2). The output is the same as in (1) and (2).

with(df, data.frame(id, +sapply(words, grepl, report), check.names = FALSE))
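
In the sample data, "Internet", "zucchini" and "fresh" occur only in the summary column, so searching report alone leaves those flags at 0. If the intent is to flag a word when it appears in either column, one possible variation (an assumption about the goal, not something stated in the question) is to paste the columns together and lowercase the text first, mirroring lower(...) like in the SQL. The sentences below are shortened versions of the sample data.

```r
df <- data.frame(
  id = 1:3,
  report = c("Travel opens our eyes to art, history, and culture",
             "no one in Sicily cooks with recipes (just with the heart)",
             "no one in Sicily cooks with recipes"),
  summary = c("On our first trip to Sicily",
              "an Internet search for where to find zucchini flowers",
              "add some fresh cream to make the mixture"),
  stringsAsFactors = FALSE
)

words <- c("opens", "cooks", "internet", "zucchini", "fresh")

# Combine both text columns and lowercase them for case-insensitive matching
text <- tolower(paste(df$report, df$summary))

# fixed = TRUE treats each word as a literal string rather than a regex
flags <- +sapply(words, grepl, x = text, fixed = TRUE)

result <- data.frame(id = df$id, flags, check.names = FALSE)
print(result)
```

Using fixed = TRUE also avoids surprises if a search term contains regex metacharacters such as "." or "(".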