从R中的另一个数据框中查找所有字符串匹配
Finding all string matches from another dataframe in R
我是 R 的新手。
我有一个数据框 locs
,它有 1 个变量 V1
,看起来像:
V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal
和另一个数据框 cities
,它有两个如下所示的变量:
city country
edmonton canada
san carlos spain
los angeles united states
santa maria united states
tokyo japan
madrid spain
santa maria portugal
lisbon portugal
我想在 locs
中创建两个新变量,它们与 city
中 V1
的任何字符串匹配相关,这样 locs
看起来像这样:
V1 city country
edmonton general hospital edmonton canada
hospital san carlos, madrid spain san carlos, madrid spain
hospital of santa maria, lisbon, portugal santa maria, lisbon portugal, united states
需要注意的几点:V1
可能有多个国家名称。另外,如果有一个重复的国家(例如,san carlos 和 madrid 都在西班牙),那么我只想要一个国家的实例。
请指教
谢谢。
使用 tidyverse
和 stringr
的解决方案。 locs2
是最终输出。
library(tidyverse)
library(stringr)
locs2 <- locs %>%
rowwise() %>%
mutate(city = list(str_match(V1, cities$city))) %>%
unnest() %>%
drop_na(city) %>%
left_join(cities, by = "city") %>%
group_by(V1) %>%
summarise_all(funs(toString(sort(unique(.)))))
结果
locs2 %>% as.data.frame()
V1 city country
1 cardiovascular institute, hospital san carlos, madrid spain madrid, san carlos spain
2 edmonton general hospital edmonton canada
3 hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
数据
library(tidyverse)
locs <- data_frame(V1 = c("edmonton general hospital",
"cardiovascular institute, hospital san carlos, madrid spain",
"hospital of santa maria, lisbon, portugal"))
cities <- read.table(text = "city country
edmonton canada
'san carlos' spain
'los angeles' 'united states'
'santa maria' 'united states'
tokyo japan
madrid spain
'santa maria' portugal
lisbon portugal",
header = TRUE, stringsAsFactors = FALSE)
我是 R 的新手。
我有一个数据框 locs
,它有 1 个变量 V1
,看起来像:
V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal
和另一个数据框 cities
,它有两个如下所示的变量:
city country
edmonton canada
san carlos spain
los angeles united states
santa maria united states
tokyo japan
madrid spain
santa maria portugal
lisbon portugal
我想在 locs
中创建两个新变量,它们与 city
中 V1
的任何字符串匹配相关,这样 locs
看起来像这样:
V1 city country
edmonton general hospital edmonton canada
hospital san carlos, madrid spain san carlos, madrid spain
hospital of santa maria, lisbon, portugal santa maria, lisbon portugal, united states
需要注意的几点:V1
可能有多个国家名称。另外,如果有一个重复的国家(例如,san carlos 和 madrid 都在西班牙),那么我只想要一个国家的实例。
请指教
谢谢。
使用 tidyverse
和 stringr
的解决方案。 locs2
是最终输出。
library(tidyverse)
library(stringr)
locs2 <- locs %>%
rowwise() %>%
mutate(city = list(str_match(V1, cities$city))) %>%
unnest() %>%
drop_na(city) %>%
left_join(cities, by = "city") %>%
group_by(V1) %>%
summarise_all(funs(toString(sort(unique(.)))))
结果
locs2 %>% as.data.frame()
V1 city country
1 cardiovascular institute, hospital san carlos, madrid spain madrid, san carlos spain
2 edmonton general hospital edmonton canada
3 hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
数据
library(tidyverse)
locs <- data_frame(V1 = c("edmonton general hospital",
"cardiovascular institute, hospital san carlos, madrid spain",
"hospital of santa maria, lisbon, portugal"))
cities <- read.table(text = "city country
edmonton canada
'san carlos' spain
'los angeles' 'united states'
'santa maria' 'united states'
tokyo japan
madrid spain
'santa maria' portugal
lisbon portugal",
header = TRUE, stringsAsFactors = FALSE)