Search for specific text across multiple data frames and return the matched values in a new column (with multiple occurrences)
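I am looking for help searching for multiple specific words from one data frame within a column (a body of text) of another data frame, and then pulling the matched values out into a new column.
To explain further:
- First, I have a data frame containing a large number of text summaries for 14 countries.
- Second, I have a second data frame containing the names of all administrative levels (lvl_2), e.g. provinces, villages, etc.
- I basically want to extract any mention of these specific adm2 province/village names from the large summaries and create a new column holding each of those words, pivoted longer.
Here is some sample data you can use to reproduce my problem, with two data frames: (1) test_admin, the list of admin levels I want to search for, and (2) test_dataset$Summary, the column I want to run the search on. (You can ignore the values of the Other_Variable columns; in the real dataset these are populated with a wide range of values.)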
test_admin <- data.frame(adm1_name = c("Sindh"),
                         adm2_name = c("Central Karachi", "Dadu", "East Karachi",
                                       "Ghotki", "Sujawal", "Sukkur"))
test_dataset <- data.frame(Summary = c("In Cox's Bazar, this and that happened.",
                                       "In Yangon, something else happened",
                                       "In Central Karachi, this happened",
                                       "In Sindh, this happened",
                                       "In Dadu AND East Karachi, this happened"),
                           Other_Variable_1 = 1:5,
                           Other_Variable_2 = 1:5)
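To make things more complicated, I would also like to be able to search for values from both columns of the test_admin data frame. For example, if a summary contains the value "Sindh" from the adm1_name column, it would be really cool to also return all of the results under adm2_name.
But if you can solve this at a more basic level (searching only one column), I would be happy with that too.
The output I am looking for is something like the data frame below, which also returns multiple rows to show where multiple values occur.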
                                   Summary Other_Variable_1 Other_Variable_2       Locations
1  In Cox's Bazar, this and that happened.                1                1            <NA>
2       In Yangon, something else happened                2                2            <NA>
3        In Central Karachi, this happened                3                3 Central Karachi
4                  In Sindh, this happened                4                4 Central Karachi
5                  In Sindh, this happened                4                4            Dadu
6                  In Sindh, this happened                4                4    East Karachi
7                  In Sindh, this happened                4                4          Ghotki
8                  In Sindh, this happened                4                4         Sujawal
9                  In Sindh, this happened                4                4          Sukkur
10 In Dadu AND East Karachi, this happened                5                5            Dadu
11 In Dadu AND East Karachi, this happened                5                5    East Karachi
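I have tried a few mutate and grepl calls, but they did not work well. The other examples I found seem to only work for exact values or single searches. Thanks for your help!
A #tidyverse solution is preferred.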
Here is one approach:
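library(tidyverse)
# Build one small tibble per row of test_dataset; map_df() row-binds them.
map_df(seq_len(nrow(test_dataset)), function(i) {
  # TRUE for each test_admin row whose adm1_name or adm2_name occurs in this Summary
  inds <- str_detect(test_dataset$Summary[i], test_admin$adm1_name) |
    str_detect(test_dataset$Summary[i], test_admin$adm2_name)
  if (any(inds)) {
    tibble(test_dataset[i, ], Locations = test_admin$adm2_name[inds])
  } else {
    tibble(test_dataset[i, ], Locations = NA_character_)  # no match: keep the row
  }
})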
#   Summary                                 Other_Variable_1 Other_Variable_2 Locations
#   <chr>                                              <int>            <int> <chr>
# 1 In Cox's Bazar, this and that happened.                1                1 NA
# 2 In Yangon, something else happened                     2                2 NA
# 3 In Central Karachi, this happened                      3                3 Central Karachi
# 4 In Sindh, this happened                                4                4 Central Karachi
# 5 In Sindh, this happened                                4                4 Dadu
# 6 In Sindh, this happened                                4                4 East Karachi
# 7 In Sindh, this happened                                4                4 Ghotki
# 8 In Sindh, this happened                                4                4 Sujawal
# 9 In Sindh, this happened                                4                4 Sukkur
#10 In Dadu AND East Karachi, this happened                5                5 Dadu
#11 In Dadu AND East Karachi, this happened                5                5 East Karachi
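For each value in Summary, we check whether it matches adm1_name or adm2_name. If any row of test_admin matches, we include the corresponding adm2_name values as Locations in the output; otherwise we return NA.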
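The same check can probably also be written as a list-column that is then unnested, which is closer to the "pivot longer" shape asked for in the question. This is only a sketch under a few assumptions: the sample data from the question is re-created so the block runs on its own, `result` is a name chosen here for illustration, and `fixed()` is my addition so that place names are matched literally rather than as regular expressions (the approach above uses stringr's default regex matching).

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

# Sample data copied from the question
test_admin <- data.frame(
  adm1_name = "Sindh",
  adm2_name = c("Central Karachi", "Dadu", "East Karachi",
                "Ghotki", "Sujawal", "Sukkur")
)
test_dataset <- data.frame(
  Summary = c("In Cox's Bazar, this and that happened.",
              "In Yangon, something else happened",
              "In Central Karachi, this happened",
              "In Sindh, this happened",
              "In Dadu AND East Karachi, this happened"),
  Other_Variable_1 = 1:5,
  Other_Variable_2 = 1:5
)

result <- test_dataset %>%
  mutate(Locations = map(Summary, function(s) {
    # One TRUE/FALSE per test_admin row: does either name occur in this summary?
    hit <- str_detect(s, fixed(test_admin$adm1_name)) |
      str_detect(s, fixed(test_admin$adm2_name))
    test_admin$adm2_name[hit]  # character(0) when nothing matched
  })) %>%
  # One output row per matched name; keep_empty = TRUE keeps unmatched
  # summaries as a single row with NA in Locations
  unnest(Locations, keep_empty = TRUE)

print(result)
```

Note that this is still plain substring matching, so a short name would also match inside a longer word. If that matters for the real data, wrapping each name in word boundaries with a regex pattern would be one option, at the cost of having to escape any special characters in the names.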