R grepl with dynamic search pattern

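I have a data frame, df, with a column of different names. I also have a varying number of data frames, e.g. search_df and search_df1, that contain search words I would like to look for in the name column using regular expressions. If a word is found, it should be written to a new column, e.g. df_final$which_word_search_df. If more than one word is found, I would like to paste the results together. The result should look like df_final.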

# load packages
pacman::p_load(tidyverse)

# words I would like to search for
search_df <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))

# data frame which is the basis for my search
df <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"))

# how I expect the final result to look like
df_final <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"),
                       which_word_search_df = c("apple", "apple; peach", "peach", "peach", NA, NA),
                       which_word_search_df1 = c(NA, NA, "peach", "peach", "banana", "banana"))

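This is my current solution, but as you can see, it is not dynamic: I type in each search word by hand instead of searching for all search words automatically.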

df_trial <- df %>% 
  mutate(which_search_word_trial = ifelse(grepl("apple", name, ignore.case = T), "apple", ""),
         which_search_word_trial = ifelse(grepl("peach", name, ignore.case = T), 
                                          paste(which_search_word_trial, "peach", sep = ";"), which_search_word_trial)
  )

The example I have shared is only a minimal one. In the real use case, df will have ~200k rows and my search_df will have ~1k rows.

We can do the following.

library(dplyr)
library(stringr)

# str_extract_all() returns list columns: one character vector of matches per row
df %>%
  mutate(which_word_search_df = str_extract_all(name, str_c(search_df$search_words, collapse = '|')),
         which_word_search_df1 = str_extract_all(name, str_c(search_df1$search_words, collapse = '|')))

#         name which_word_search_df which_word_search_df1
# 1   apple123                apple                      
# 2 applepeach         apple, peach                 peach
# 3  peachtime                peach                 peach
# 4    peachab                peach                 peach
# 5  bananarrr                                     banana
# 6   bananaxy                                     banana
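
The question asks for one string per row, with NA when nothing matches, while str_extract_all() gives list columns. Below is a minimal sketch of one way to collapse the matches and to loop over any number of search data frames. The helper name find_words and the named list searches are hypothetical, introduced only for this illustration, and the search words are assumed to contain no regex metacharacters.

```r
library(stringr)

df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))
search_df  <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))

# hypothetical helper: all words found in each name, pasted with "; ",
# or NA when no word matches
find_words <- function(x, words) {
  pattern <- regex(str_c(as.character(words), collapse = "|"), ignore_case = TRUE)
  hits <- str_extract_all(x, pattern)
  vapply(hits, function(h) {
    if (length(h) == 0) NA_character_ else str_c(unique(h), collapse = "; ")
  }, character(1))
}

# named list of search data frames; the names become the column suffixes
searches <- list(search_df = search_df, search_df1 = search_df1)

df_out <- df
for (nm in names(searches)) {
  df_out[[paste0("which_word_", nm)]] <- find_words(df$name, searches[[nm]]$search_words)
}
df_out
```

Because the alternation pattern finds non-overlapping matches, search words that overlap each other (one word contained in another) may not all be reported; in that case, matching each word separately would be the safer choice.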

Using your df as input (not df_final), here is an "automatic" way that works by supplying the names of the search data frames:

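# names of the search data frames to process
n <- c("search_df", "search_df1")

for (i in n) {
  words <- get(i)$search_words
  # for each search word, the row indices of df$name that match it
  a <- lapply(words, function(j) grep(j, df$name))
  # long format: values = row index, ind = matching word
  a <- stack(setNames(a, words))
  col <- paste0("which_word_", i)
  df[, col] <- NA
  df[a$values, col] <- as.character(a$ind)
}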

The output is stored directly in df, but you can easily change that by copying df to final_df and then using final_df in the last two lines of the loop.

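Output (row 2 reads applebum here rather than applepeach, so no name matches more than one word per search data frame):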

       name which_word_search_df which_word_search_df1
1  apple123                apple                  <NA>
2  applebum                apple                  <NA>
3 peachtime                peach                 peach
4   peachab                peach                 peach
5 bananarrr                 <NA>                banana
6  bananaxy                 <NA>                banana
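
One caveat with the loop above: it assigns a single word per row, so a name such as applepeach that matches several words would not get both of them pasted together. A minimal base R sketch that pastes all matches follows. The use of fixed = TRUE (literal, case-sensitive matching instead of regex) and of tapply() are choices made for this illustration, not part of the original answer.

```r
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))
search_df  <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))

n <- c("search_df", "search_df1")
df_final <- df

for (i in n) {
  words <- as.character(get(i)$search_words)
  # row indices of df$name that contain each word (literal match)
  hits <- lapply(words, function(w) grep(w, df$name, fixed = TRUE))
  # long format: values = row index, ind = matching word
  long <- stack(setNames(hits, words))
  col <- paste0("which_word_", i)
  df_final[[col]] <- NA_character_
  if (nrow(long) > 0) {
    # paste together all words found for the same row
    pasted <- tapply(as.character(long$ind), long$values, paste, collapse = "; ")
    df_final[[col]][as.integer(names(pasted))] <- as.vector(pasted)
  }
}
df_final
```

Rows without any match keep NA, which is the format the question asks for.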

Let me know if it works for you.