How to detect names from a dataframe that are included totally or partially in names from another dataframe?

Question

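I want to detect whether the full names in df1's city1 column are contained, entirely or partially, in the names in df2's city2 column (that is, treating the cities as plain strings), so that the shared characters appear in a new column "match". Ideally the solution would use dplyr and stringr.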

city1 <- c("boston", "cambridge", "houston")
df1 <- as.data.frame(city1)
df1

city2 <- c("atlanta", "denver", "cambridge", "cambridgeuk", "london", "york")
df2 <- as.data.frame(city2)
df2

What I want (note the difference between cambridge and cambridgeuk):

  city1     city2       match    
  <chr>     <chr>       <chr>    
1 boston    atlanta     NA       
2 cambridge denver      NA       
3 houston   cambridge   cambridge
4 NA        cambridgeuk cambridge
5 NA        london      NA       
6 NA        york        NA

Thanks for the help.

Answer 1

If each city2 value matches at most one name from df1, we can try:

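library(tidyverse)

# Combine the df1 names into one regex alternation: "boston|cambridge|houston"
cities1 <- df1$city1 %>%
  str_c(collapse = '|')
# str_extract() returns the first df1 name found in each city2 value, or NA
df2 %>%
  mutate(match = str_extract(city2, cities1))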
#>         city2     match
#> 1     atlanta      <NA>
#> 2      denver      <NA>
#> 3   cambridge cambridge
#> 4 cambridgeuk cambridge
#> 5      london      <NA>
#> 6        york      <NA>
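
One caveat with the alternation pattern is that the city names are interpreted as regular expressions. The sketch below is only an illustration, not part of the original answer: it adds a hypothetical name "st. louis" and hypothetical city2 values to show a false positive, then matches literally with stringr's fixed() instead. It assumes that joining multiple hits with a space is acceptable.

```r
library(stringr)

# "st. louis" and the city2 values below are hypothetical, added for illustration
city1 <- c("boston", "cambridge", "houston", "st. louis")
city2 <- c("atlanta", "cambridgeuk", "st. louis county", "stx louis")

# Regex alternation: "." matches any character, so "stx louis" is a false positive
regex_hits <- str_extract(city2, str_c(city1, collapse = "|"))

# Literal matching: test every city1 name against each city2 value with fixed()
fixed_hits <- vapply(city2, function(target) {
  hits <- city1[str_detect(target, fixed(city1))]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = " ")
}, character(1), USE.NAMES = FALSE)

data.frame(city2, regex_hits, fixed_hits)
```

Because every name is tested separately, this version also reports several names for one city2 value, at the cost of one str_detect() call per row.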

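But suppose df2$city2 contains cambridgehouston, so that one value matches more than one name; then it gets a bit more complicated. I am sure there is a cleaner way to do this, but the following works.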

library(tidyverse)

city1 <- c("boston", "cambridge", "houston")
df1 <- as.data.frame(city1)
df1
#>       city1
#> 1    boston
#> 2 cambridge
#> 3   houston

city2 <- c("atlanta", "denver", "cambridgehouston", "cambridgeuk", "london", "york")
df2 <- as.data.frame(city2)
df2
#>              city2
#> 1          atlanta
#> 2           denver
#> 3 cambridgehouston
#> 4      cambridgeuk
#> 5           london
#> 6             york


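# One column per df1 city, holding the text it matches in df2$city2 ("" if none)
result_match <-
  map_dfc(df1$city1, ~
  tibble(!!.x := str_extract(df2$city2, .x) %>% replace_na(""))) %>%
  rowwise() %>%
  # Join each row's matches; str_squish() also collapses gaps left by "" entries
  transmute(match = c_across(all_of(df1$city1)) %>% str_c(collapse = " ") %>%
    str_squish() %>%
    na_if(""))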

bind_cols(df2, result_match)
#>              city2             match
#> 1          atlanta              <NA>
#> 2           denver              <NA>
#> 3 cambridgehouston cambridge houston
#> 4      cambridgeuk         cambridge
#> 5           london              <NA>
#> 6             york              <NA>
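
A possibly more direct route to the same result is sketched below; it is an alternative, not the answer's own method. It assumes the city names contain no regex metacharacters and that no name is a prefix of another. str_extract_all() with the alternation pattern returns every match per row, and the matches are then pasted together. Note that matches come back in the order they occur in each string rather than in df1 order.

```r
library(dplyr)
library(stringr)

city1 <- c("boston", "cambridge", "houston")
city2 <- c("atlanta", "denver", "cambridgehouston", "cambridgeuk", "london", "york")
df1 <- data.frame(city1)
df2 <- data.frame(city2)

# Single alternation pattern built from the df1 names
pattern <- str_c(df1$city1, collapse = "|")

result <- df2 %>%
  mutate(
    # str_extract_all() gives a list with all matches for each city2 value;
    # paste() joins them, and rows without any match ("") become NA
    match = str_extract_all(city2, pattern) %>%
      vapply(paste, character(1), collapse = " ") %>%
      na_if("")
  )

result
```

This avoids building one intermediate column per city, which may matter when df1 is long.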

Created on 2021-12-24 by the reprex package (v2.0.1)
