Search for the inclusion of specific text across multiple dataframes, and return those values in a new column (with multiple occurrences)

Question

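I'm looking for help searching a column of body text in one dataframe for multiple specific words taken from another dataframe, and then pulling the matched values out into a new column.

To explain further:

First, I have a dataframe containing a large number of text summaries covering 14 countries.
Second, I have a second dataframe containing the names of all administrative level 2 (lvl_2) units, e.g. provinces, villages, etc.
Essentially, I want to extract every mention of these specific adm2 province/village names from the summaries and create a new column holding each matched name, pivoted longer (one row per match).

Here is some sample data you can use to reproduce my problem. It contains two dataframes: (1) test_admin, the list of administrative levels I want to search for, and (2) test_dataset, whose Summary column (test_dataset$Summary) is the column I want to search. (You can ignore the Other_Variables values; in the real dataset these columns hold a large number of values.)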

test_admin <- data.frame(adm1_name = c("Sindh"),
                   adm2_name = c("Central Karachi", "Dadu", "East Karachi", "Ghotki", "Sujawal", "Sukkur"))
                   
test_dataset <- data.frame(Summary = c("In Cox's Bazar, this and that happened.",
                                       "In Yangon, something else happened",
                                       "In Central Karachi, this happened",
                                       "In Sindh, this happened",
                                       "In Dadu AND East Karachi, this happened"),
                           Other_Variable_1 = 1:5,
                           Other_Variable_2 = 1:5)

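To make things more complicated, I'd also like to be able to search for values from both columns of the test_admin dataframe. For example, if a summary mentions "Sindh" from the adm1_name column, it would be really cool to return all of the adm2_name values under it.

But if you can solve it at a more basic level (searching only one column), I'd be happy with that too.

The output I'm looking for is something like the dataframe below, which also returns multiple rows where more than one value matches.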

                                   Summary Other_Variable_1 Other_Variable_2       Locations
1  In Cox's Bazar, this and that happened.                1                1            <NA>
2       In Yangon, something else happened                2                2            <NA>
3        In Central Karachi, this happened                3                3 Central Karachi
4                  In Sindh, this happened                4                4 Central Karachi
5                  In Sindh, this happened                4                4            Dadu
6                  In Sindh, this happened                4                4    East Karachi
7                  In Sindh, this happened                4                4          Ghotki
8                  In Sindh, this happened                4                4         Sujawal
9                  In Sindh, this happened                4                4          Sukkur
10 In Dadu AND East Karachi, this happened                5                5            Dadu
11 In Dadu AND East Karachi, this happened                5                5    East Karachi

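I tried a few things with mutate and grepl, but without much success. The other examples I found only seem to work for exact values or a single search term. Thanks for your help!

A #tidyverse solution is preferred.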

Answer 1

Here's one way to do it:

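library(tidyverse)

# For each row of test_dataset, flag the test_admin rows whose adm1_name or
# adm2_name appears in that row's Summary.
map_df(seq_len(nrow(test_dataset)), function(i) {
  inds <- str_detect(test_dataset$Summary[i], test_admin$adm1_name) |
    str_detect(test_dataset$Summary[i], test_admin$adm2_name)
  if (any(inds)) {
    # One output row per matching adm2_name; the other columns are recycled.
    tibble(test_dataset[i, ], Locations = test_admin$adm2_name[inds])
  } else {
    # No match: keep the row, with a missing Locations value.
    tibble(test_dataset[i, ], Locations = NA_character_)
  }
})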

#  Summary                                 Other_Variable_1 Other_Variable_2 Locations      
#   <chr>                                              <int>            <int> <chr>          
# 1 In Cox's Bazar, this and that happened.                1                1 NA             
# 2 In Yangon, something else happened                     2                2 NA             
# 3 In Central Karachi, this happened                      3                3 Central Karachi
# 4 In Sindh, this happened                                4                4 Central Karachi
# 5 In Sindh, this happened                                4                4 Dadu           
# 6 In Sindh, this happened                                4                4 East Karachi   
# 7 In Sindh, this happened                                4                4 Ghotki         
# 8 In Sindh, this happened                                4                4 Sujawal        
# 9 In Sindh, this happened                                4                4 Sukkur         
#10 In Dadu AND East Karachi, this happened                5                5 Dadu           
#11 In Dadu AND East Karachi, this happened                5                5 East Karachi

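For each value in Summary, we check whether it matches adm1_name or adm2_name. If any row of test_admin matches, we include the corresponding adm2_name values in the output as Locations; otherwise we return NA.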
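
The same row-by-row check can also be written without an explicit loop. The following is only a sketch of an alternative, not the method used above: it pairs every summary with every row of test_admin, keeps the pairs where either name occurs in the text, and joins the matches back so that unmatched summaries keep an NA. It assumes dplyr 1.1.0 or later (for cross_join), and the helper column row_id and the objects with_id, matches and result are names introduced here purely for illustration. Note that str_detect treats its pattern as a regular expression by default, so the sketch wraps the names in fixed() to match them literally.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question.
test_admin <- data.frame(
  adm1_name = "Sindh",
  adm2_name = c("Central Karachi", "Dadu", "East Karachi",
                "Ghotki", "Sujawal", "Sukkur")
)
test_dataset <- data.frame(
  Summary = c("In Cox's Bazar, this and that happened.",
              "In Yangon, something else happened",
              "In Central Karachi, this happened",
              "In Sindh, this happened",
              "In Dadu AND East Karachi, this happened"),
  Other_Variable_1 = 1:5,
  Other_Variable_2 = 1:5
)

# row_id is a hypothetical helper key so matches can be joined back later.
with_id <- test_dataset %>%
  mutate(row_id = row_number())

# Every summary paired with every admin row; keep pairs where either the
# adm1 or the adm2 name occurs literally in the summary text.
matches <- with_id %>%
  cross_join(test_admin) %>%
  filter(str_detect(Summary, fixed(adm1_name)) |
           str_detect(Summary, fixed(adm2_name))) %>%
  select(row_id, Locations = adm2_name)

# Join back so summaries without any match are kept with Locations = NA.
result <- with_id %>%
  left_join(matches, by = "row_id") %>%
  select(-row_id)

print(result)
```

The cross join creates nrow(test_dataset) * nrow(test_admin) intermediate rows, so on a very large dataset the per-row loop may use less memory, while this version is easier to extend with further dplyr steps.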

Tags: search, r, match, grepl, tidyverse