Return Matched Value to Dataframe from Uneven List

Question

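I've looked through several SO posts, but I'm at the "beating my head against a tree" stage. Thanks for your time.

I have a data frame of text strings (about 300 cases). I simply want to scan each string against a separate list of cities (7,000 of them), and if a city in the string matches the list, write the matched city name to a new data frame column.

My data: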

df<-structure(list(Item = c("1965 Wilkes College, Wilkes-Barre, PA", "1967 Spanish National Tourist Office, New York City", "1968 William Penn Memorial Museum, Harrisburg, PA", 
"2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator.","1973 The Museum of Modern Art, New York City", "1974 International Museum of Photography, George Eastman House, Rochester, NY",  "1974 Light Gallery, New York City", "1975 Art Institute of Chicago"
    )), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
    ))

citylist<-c("Barre","Sacramento","Palmer", "New York City","Chicago","Rochester")

I've tried something like:

 df$city<-sapply(df$Item,function(x) df$Item[df$Item %in% citylist])

or

citymatcher<-function(x){match(x,citylist)} 
df$city<-sapply(df$Item,citymatcher)

Ultimately, I'd like a tidy data frame with a new column indicating the city matched for a given row.

Answer 1

Is this what you're looking for?

library(tidyverse)

# collapse citylist into a regex search string
city_regex <- str_c(citylist, collapse = "|")

# extract matching values using stringr::str_extract
df <- df %>% 
  mutate(city = str_extract(Item, city_regex))

Output:

# A tibble: 7 x 2
  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             NA           
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago
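
One caveat when scaling this to the full list of 7,000 cities: a name such as "St. Louis" contains a regex metacharacter, and a name such as "New York" is a prefix of "New York City". The sketch below is a base R illustration of one way to guard against both, not part of the answer above. The city names, the sample strings, and the helpers escape_regex and extract_first are all hypothetical.

```r
# Hypothetical city names: one contains a regex metacharacter (".") and one
# is a prefix of another.
cities <- c("New York", "New York City", "St. Louis")

# Backslash-escape regex metacharacters so each name matches literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Put longer names first so "New York City" is tried before "New York".
cities_sorted <- cities[order(nchar(cities), decreasing = TRUE)]
safe_regex <- paste(escape_regex(cities_sorted), collapse = "|")

# Return the first match per string, or NA when nothing matches.
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Hypothetical sample strings.
items <- c("1974 Light Gallery, New York City",
           "1980 Hypothetical Gallery, St. Louis",
           "1981 StX Louis Hall")
city <- extract_first(items, safe_regex)
print(city)
# "New York City" "St. Louis" NA
```

Without the escaping step, the unescaped "." would also match the "X" in the third string. The same safe_regex should be usable as the pattern in str_extract.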

Answer 2

After creating a data.frame/tibble from the 'citylist' vector, we can use regex_left_join (from fuzzyjoin)

library(fuzzyjoin)
library(dplyr)  # re-exports tibble()
regex_left_join(df, tibble(city = citylist), by = c("Item" = "city"))
# A tibble: 7 × 2
  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             <NA>         
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago

To avoid any partial substring matches, use a word boundary (\\b)

library(stringr)
library(dplyr) 
# "\\b" is the regex word boundary; a single "\b" in an R string is a backspace
regex_left_join(df, tibble(city = sprintf("\\b%s\\b", 
       citylist)), by = c("Item" = "city")) %>%
    mutate(city = str_remove_all(city, fixed("\\b")))

-output

# A tibble: 8 × 2
  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             <NA>         
4 2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator.   <NA>         
5 1973 The Museum of Modern Art, New York City                                  New York City
6 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
7 1974 Light Gallery, New York City                                             New York City
8 1975 Art Institute of Chicago                                                 Chicago
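
The sample data has no row where the boundary changes the result, so here is a small base R sketch of the difference. The "Barrett" string is made up for illustration; the second string is from the question.

```r
# "Barrett" is a hypothetical example that merely contains the letters "Barre".
items <- c("1990 Barrett Gallery, Boston",
           "1965 Wilkes College, Wilkes-Barre, PA")

# Plain substring pattern: matches inside "Barrett" as well.
plain <- grepl("Barre", items, perl = TRUE)

# Word-boundary pattern: "Barre" must not be followed or preceded by a
# letter, digit, or underscore. The hyphen in "Wilkes-Barre" is not a word
# character, so that row still matches.
bounded <- grepl("\\bBarre\\b", items, perl = TRUE)

print(data.frame(items, plain, bounded))
```

Here plain is TRUE for both rows, while bounded is FALSE for the "Barrett" row and TRUE for "Wilkes-Barre".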

Answer 3

A possible base R solution:

citylist2 <- paste0(citylist, collapse = "|")

# "\\1" is the backreference to the captured city name
df$city <- ifelse(grepl(citylist2, df$Item), sub(paste0(".*(", citylist2, ").*"), "\\1", df$Item), NA)

Or we can use regexpr and regmatches:

citylist2 <- paste0(citylist, collapse = "|")

df$city <- NA
df[grepl(citylist2, df$Item),]$city <- regmatches(df$Item, regexpr(citylist2, df$Item))

Output

  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             NA           
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago
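
If building one large regex from 7,000 names is a concern, a different technique is to loop over the cities with fixed (literal) matching, so no escaping is needed. This is only a sketch of that alternative, not the approach above. match_city is a hypothetical helper, and the first city in list order wins when several appear in one string.

```r
# Sample rows and city list taken from the question.
items <- c("1965 Wilkes College, Wilkes-Barre, PA",
           "1968 William Penn Memorial Museum, Harrisburg, PA",
           "1974 Light Gallery, New York City",
           "1975 Art Institute of Chicago")
citylist <- c("Barre", "Sacramento", "Palmer", "New York City",
              "Chicago", "Rochester")

# Hypothetical helper: for each city, fill in rows that contain it as a
# literal substring and have not been assigned a city yet.
match_city <- function(text, cities) {
  out <- rep(NA_character_, length(text))
  for (city in cities) {
    hit <- is.na(out) & grepl(city, text, fixed = TRUE)
    out[hit] <- city
  }
  out
}

city <- match_city(items, citylist)
print(city)
# "Barre" NA "New York City" "Chicago"
```

With the question's data frame, the equivalent call would be df$city <- match_city(df$Item, citylist).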

Answer 4

We can use str_match:

library(stringr)
library(dplyr)

pattern <- paste(citylist, collapse = '|')

# str_match() returns a matrix; column 1 holds the full match
mutate(df, city = str_match(Item, pattern)[, 1])

  Item                                                                          city         
  <chr>                                                                         <chr>        
1 1965 Wilkes College, Wilkes-Barre, PA                                         Barre        
2 1967 Spanish National Tourist Office, New York City                           New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA                             NA           
4 1973 The Museum of Modern Art, New York City                                  New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester    
6 1974 Light Gallery, New York City                                             New York City
7 1975 Art Institute of Chicago                                                 Chicago      

Tags: r, list, match