R grepl with dynamic search pattern
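I have a data frame, df, with a column of different names. I also have varying data frames, such as search_df or search_df1, that contain the search terms I would like to look for in the name column via regular expressions.
If a term is found, it should be written into a new column, e.g. df_final$which_word_search_df.
If several words are found, I would like to paste the results together.
The result should look like df_final.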
# load packages
pacman::p_load(tidyverse)
# words I would like to search for
search_df <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))
# data frame which is the basis for my search
df <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"))
# how I expect the final result to look like
df_final <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"),
which_word_search_df = c("apple", "apple; peach", "peach", "peach", NA, NA),
which_word_search_df1 = c(NA, NA, "peach", "peach", "banana", "banana"))
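This is my current solution, but as you can see, it is not dynamic: I type in every search term by hand instead of searching for all of them automatically.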
df_trial <- df %>%
  mutate(which_search_word_trial = ifelse(grepl("apple", name, ignore.case = TRUE), "apple", ""),
         which_search_word_trial = ifelse(grepl("peach", name, ignore.case = TRUE),
                                          paste(which_search_word_trial, "peach", sep = ";"),
                                          which_search_word_trial)
  )
The example I shared is only a minimal one. In the real use case, df will have ~200k rows and my search_df will have ~1k rows.
We can do the following: collapse the search words into a single alternation pattern and extract every match.
library(dplyr)
library(stringr)
df %>%
mutate(which_word_search_df = str_extract_all(name,str_c(search_df$search_words, collapse = '|')),
which_word_search_df1 = str_extract_all(name, str_c(search_df1$search_words, collapse = '|')))
# name which_word_search_df which_word_search_df1
# 1 apple123 apple
# 2 applepeach apple, peach peach
# 3 peachtime peach peach
# 4 peachab peach peach
# 5 bananarrr banana
# 6 bananaxy banana
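Note that str_extract_all() returns a list column, whereas df_final in the question holds plain strings joined by "; " and NA where nothing matched. A minimal sketch of that last step follows; collapse_matches is a hypothetical helper name, not part of stringr, and the sketch assumes the search words contain no regex metacharacters and do not overlap each other (an alternation pattern extracts non-overlapping matches only).

```r
library(stringr)

search_df  <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))

# Hypothetical helper: extract all search words found in x and paste them
# together per element; elements without any match become NA.
collapse_matches <- function(x, words, sep = "; ") {
  pattern <- str_c(as.character(words), collapse = "|")
  hits <- str_extract_all(x, pattern)   # list with one character vector per element of x
  vapply(hits, function(h) {
    if (length(h) == 0) NA_character_ else paste(unique(h), collapse = sep)
  }, character(1))
}

df$which_word_search_df  <- collapse_matches(df$name, search_df$search_words)
df$which_word_search_df1 <- collapse_matches(df$name, search_df1$search_words)
df
```

With the example data, "applepeach" also contains "peach", so which_word_search_df1 is "peach" for that row rather than the NA shown in df_final.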
Using your df as input (rather than df_final), here is an "automatic" way that works by supplying the names of the search data frames:
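# names of the search data frames to process
n <- c('search_df', 'search_df1')
for (i in n) {
  words <- as.character(get(i)$search_words)
  # for each search word, the row indices of df$name that match it
  a <- lapply(words, function(j) grep(j, df$name))
  # long format: values = row index, ind = the word that matched
  a <- stack(setNames(a, words))
  df[, paste0('which_word_', i)] <- NA
  # a row that matches several words keeps only the last word assigned
  df[a$values, paste0('which_word_', i)] <- as.character(a$ind)
}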
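The output is stored directly in df, but you can easily change that by copying df to final_df and using that in the last two lines of the loop instead.
Output: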
name which_word_search_df which_word_search_df1
1 apple123 apple <NA>
2 applebum apple <NA>
3 peachtime peach peach
4 peachab peach peach
5 bananarrr <NA> banana
6 bananaxy <NA> banana
Let me know if it works for you.
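The loop above records a single word per row. If all matching words should be pasted together, as the question asks, one option is to generalise the question's own grepl() attempt by looping over the search words. This is only a sketch under assumptions: find_words and the searches list are hypothetical names introduced here, and each search word is treated as a regular expression.

```r
search_df  <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))

# Hypothetical helper: one grepl() pass per search word; each word found in an
# element of x is appended to that element's result. No match leaves NA.
find_words <- function(x, words, sep = "; ") {
  out <- rep(NA_character_, length(x))
  for (w in as.character(words)) {
    hit <- grepl(w, x, ignore.case = TRUE)
    out[hit] <- ifelse(is.na(out[hit]), w, paste(out[hit], w, sep = sep))
  }
  out
}

# Named list of search data frames; each name becomes part of a column name.
searches <- list(search_df = search_df, search_df1 = search_df1)
for (nm in names(searches)) {
  df[[paste0("which_word_", nm)]] <- find_words(df$name, searches[[nm]]$search_words)
}
df
```

Only one logical vector of length nrow(df) is held at a time, which may matter at the ~200k rows by ~1k words mentioned in the question. If the words should be matched literally rather than as patterns, grepl() also accepts fixed = TRUE, but that cannot be combined with ignore.case = TRUE.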