Assign another column when text matches

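I want to add another column that is assigned the value 1 when a keyword matches a word in the text. If the same word occurs several times in the text, the value should still be at most 1; if there is no match, assign 0.

Suppose I have this dataset: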

df = structure(list(text = c("I hate good cheese", "cheese that smells is the best",
                             "isn't it obvious that green cheese serves you well", 
                             "don't fight it just eat the cheese", "the last good cheese is down"), 
                    stuff = c(3, 2, 40, 4, 5) ), row.names = c(NA, 5L), 
               class = c("tbl_df", "tbl", "data.frame"))

I search with the following keywords:

keywords = structure(list(keyword_one = c("cheese", "blue", "best"),
                          keyword_two = c("smells", "final", 'south')
                          ),
                     row.names = c(NA, -3L),   
                     class = c("tbl_df", "tbl", "data.frame"))

I can do the following:

df[str_detect(df$text, keywords$keyword_one),]

to return the rows where a keyword matches, but how do I keep all rows and assign the value 1 where there is a match? Something like:

# A tibble: 5 × 4
  text                                               stuff keyword1 keyword2
* <chr>                                              <dbl>    <dbl>    <dbl>
1 I hate good cheese                                     3        1        0
2 cheese that smells is the best                         2        0        1
3 isn't it obvious that green cheese serves you well    40        0        0
4 don't fight it just eat the cheese                     4        1        0
5 the last good cheese is down                           5        0        0

Alternatively, I found that I can do this:

ifelse(str_detect(df$text, keywords$keyword_one), 1, 0)
ifelse(str_detect(df$text, keywords$keyword_two), 1, 0)

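However, this is inefficient if keywords has many columns and I want to loop over all of them.

Also, I noticed that str_detect does not seem to detect the word cheese in every text. Why is that?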

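Why are you losing "cheese"?

If I interpret your question correctly, you want to match any of the words in a keyword column.

However, many R functions are vectorized, so when you call str_detect(df$text, keywords$keyword_one) it compares the two vectors element by element and recycles the shorter one. R gives you a warning: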

Warning message: In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : longer object length is not a multiple of shorter object length

What you want is str_detect(df$text, "cheese|blue|best"), which returns TRUE whenever the string contains "cheese", "blue", or "best".
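To make the recycling concrete, here is a minimal sketch that spells out what the element-wise comparison tests. It uses rep_len() to recycle the keywords explicitly, which is an assumption made for illustration: it mimics the implicit recycling behind the warning above, and as far as I know newer stringr releases (1.5.0 onward) raise an error instead of recycling. The names recycled and any_keyword are hypothetical.

```r
library(stringr)

text <- c("I hate good cheese", "cheese that smells is the best",
          "isn't it obvious that green cheese serves you well",
          "don't fight it just eat the cheese", "the last good cheese is down")
keyword_one <- c("cheese", "blue", "best")

# Explicit recycling: the patterns become cheese, blue, best, cheese, blue,
# so rows 2, 3 and 5 are tested against "blue", "best" and "blue",
# never against "cheese".
recycled <- str_detect(text, rep_len(keyword_one, length(text)))

# A single regular expression that matches any keyword in the column.
pattern <- str_c(keyword_one, collapse = "|")
any_keyword <- str_detect(text, pattern)

print(recycled)
print(any_keyword)
```

Only rows 1 and 4 happen to be paired with "cheese", which is why the word seems to go undetected in the other rows.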

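Solution

The following code should work for any number of keyword columns. Besides stringr, it needs dplyr and tidyr (pivot_longer, crossing and pivot_wider come from tidyr). The trick is to pivot so that the data is "tidy" for each step.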

# Load the packages used below
library(dplyr)
library(tidyr)
library(stringr)

# First reorganize the keywords as suggested above

  keys <- keywords %>%
    summarize_all(function(x) str_c(x, collapse ="|")) %>%
    pivot_longer(everything())

> keys
# A tibble: 2 × 2
  name        value             
  <chr>       <chr>             
1 keyword_one cheese|blue|best  
2 keyword_two smells|final|south

# Now, for each set of keywords, test whether any of them appears in the `text` column.

  df %>% 
    crossing(keys)  %>%
    mutate(observed = 1*str_detect(text, value)) %>%
    select(-value) %>%
    pivot_wider(values_from = observed)

You get:

# A tibble: 5 × 4
  text                                               stuff keyword_one keyword_two
  <chr>                                              <dbl>       <dbl>       <dbl>
1 cheese that smells is the best                         2           1           1
2 don't fight it just eat the cheese                     4           1           0
3 I hate good cheese                                     3           1           0
4 isn't it obvious that green cheese serves you well    40           1           0
5 the last good cheese is down                           5           1           0
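
Note that the rows come back sorted by text, because crossing() sorts and de-duplicates its inputs. If the original row order matters, the following is a minimal alternative sketch, not the approach above: it builds one pattern per keyword column and adds the 0/1 columns in place. It uses plain data frames instead of tibbles to stay self-contained, and the names patterns, flags and result are hypothetical.

```r
library(stringr)

df <- data.frame(
  text = c("I hate good cheese", "cheese that smells is the best",
           "isn't it obvious that green cheese serves you well",
           "don't fight it just eat the cheese", "the last good cheese is down"),
  stuff = c(3, 2, 40, 4, 5),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  keyword_one = c("cheese", "blue", "best"),
  keyword_two = c("smells", "final", "south"),
  stringsAsFactors = FALSE
)

# One alternation pattern per keyword column, e.g. "cheese|blue|best".
patterns <- vapply(keywords, str_c, character(1), collapse = "|")

# One 0/1 integer vector per pattern, named after the keyword column.
flags <- lapply(patterns, function(p) as.integer(str_detect(df$text, p)))

# Attach the new columns without reordering the rows.
result <- df
result[names(flags)] <- flags
print(result)
```

Like the pivot-based version, this matches substrings rather than whole words, so a keyword such as "best" would also match "bestow".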