Return Matched Value to Dataframe from Uneven List
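I've looked at several SO posts, but I'm at the "banging my head against a tree" stage. Thank you for your time.
I have a dataframe of text strings (about 300 cases). I simply want to scan a separate list of cities (about 7,000 of them) and, if a city in the string matches one in the list, write the matched city name to a new dataframe column.
My data: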
df<-structure(list(Item = c("1965 Wilkes College, Wilkes-Barre, PA", "1967 Spanish National Tourist Office, New York City", "1968 William Penn Memorial Museum, Harrisburg, PA",
"2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator.","1973 The Museum of Modern Art, New York City", "1974 International Museum of Photography, George Eastman House, Rochester, NY", "1974 Light Gallery, New York City", "1975 Art Institute of Chicago"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))
citylist<-c("Barre","Sacramento","Palmer", "New York City","Chicago","Rochester")
I've tried things like:
df$city<-sapply(df$Item,function(x) df$Item[df$Item %in% citylist])
or
citymatcher<-function(x){match(x,citylist)}
df$city<-sapply(df$Item,citymatcher)
Ultimately, I'd like a tidy dataframe with a new column indicating the city matched for a given row.
Is this what you're looking for?
library(tidyverse)
# collapse citylist into a regex search string
city_regex <- str_c(citylist, collapse = "|")
# extract matching values using stringr::str_extract
df <- df %>%
  mutate(city = str_extract(Item, city_regex))
Output:
# A tibble: 7 x 2
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA NA
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
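With a list of about 7,000 city names, two details may matter for the collapsed-regex approach above: names containing regex metacharacters (such as the period in "St. Louis") and names that are prefixes of other names. The following is a minimal base R sketch, not part of the original answer. The helper `escape_regex` and the example vectors `cities` and `items` are hypothetical and chosen only for illustration; `perl = TRUE` is used so that the first matching alternative wins, which makes the ordering step meaningful.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that
# city names are matched literally.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Made-up example names, not the asker's real list.
cities <- c("New York", "St. Louis", "New York City")

# Put longer names first so "New York City" is tried before "New York".
cities <- cities[order(nchar(cities), decreasing = TRUE)]
city_regex <- paste(escape_regex(cities), collapse = "|")

items <- c("1974 Light Gallery, New York City",
           "1980 Some Gallery, St. Louis",
           "1975 Art Institute of Chicago")

# regexpr() returns -1 where nothing matches; regmatches() returns only
# the matched elements, so fill them into a vector of NAs.
m <- regexpr(city_regex, items, perl = TRUE)
city <- rep(NA_character_, length(items))
city[m > 0] <- regmatches(items, m)
city
```

The same escaped and ordered pattern could be passed to `str_extract` in place of `city_regex` from the answer above.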
After creating a data.frame/tibble from the 'citylist' vector, we can use regex_left_join (from fuzzyjoin):
library(fuzzyjoin)
regex_left_join(df, tibble(city = citylist), by = c("Item" = "city"))
# A tibble: 7 × 2
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA <NA>
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
To avoid partial substring matches, wrap each city in word boundaries (\\b):
library(stringr)
library(dplyr)
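# "\\b" is the R string for the regex word boundary \b
regex_left_join(df, tibble(city = sprintf("\\b%s\\b",
    citylist)), by = c("Item" = "city")) %>%
  # strip the literal \b markers from the joined pattern column
  mutate(city = str_remove_all(city, fixed("\\b")))
Output: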
# A tibble: 8 × 2
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA <NA>
4 2010 Strange Evidence. Philadelphia Museum of Art. Peter Barbarie, Curator. <NA>
5 1973 The Museum of Modern Art, New York City New York City
6 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
7 1974 Light Gallery, New York City New York City
8 1975 Art Institute of Chicago Chicago
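To see what the word boundaries change, here is a small base R sketch, not taken from the original answer. The string "Palmerton, PA" is a made-up example; `perl = TRUE` is an assumption made here so that `\b` is interpreted by PCRE.

```r
# Without boundaries, "Palmer" matches inside the longer word "Palmerton".
plain <- grepl("Palmer", "Palmerton, PA")

# With \b on both sides, the pattern must start and end at a word edge,
# so "Palmerton" no longer counts as a match for "Palmer".
bounded <- grepl("\\bPalmer\\b", "Palmerton, PA", perl = TRUE)

# A hyphen is not a word character, so "Barre" in "Wilkes-Barre"
# still sits between two word boundaries and matches.
hyphen <- grepl("\\bBarre\\b", "Wilkes-Barre, PA", perl = TRUE)

c(plain = plain, bounded = bounded, hyphen = hyphen)
```

This is why row 1 still returns "Barre" in the output above even with the boundaries in place.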
A possible base R solution:
citylist2 <- paste0(citylist, collapse = "|")
# "\\1" is the backreference to the captured city name
df$city <- ifelse(grepl(citylist2, df$Item), sub(paste0(".*(", citylist2, ").*"), "\\1", df$Item), NA)
Or we can use regexpr and regmatches:
citylist2 <- paste0(citylist, collapse = "|")
df$city <- NA
df[grepl(citylist2, df$Item),]$city <- regmatches(df$Item, regexpr(citylist2, df$Item))
Output:
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA NA
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago
We can use str_match:
library(stringr)
library(dplyr)
pattern <- paste(citylist, collapse = '|')
# str_match() returns a matrix; keep column 1 (the full match) as a plain character column
mutate(df, city = str_match(Item, pattern)[, 1])
Item city
<chr> <chr>
1 1965 Wilkes College, Wilkes-Barre, PA Barre
2 1967 Spanish National Tourist Office, New York City New York City
3 1968 William Penn Memorial Museum, Harrisburg, PA NA
4 1973 The Museum of Modern Art, New York City New York City
5 1974 International Museum of Photography, George Eastman House, Rochester, NY Rochester
6 1974 Light Gallery, New York City New York City
7 1975 Art Institute of Chicago Chicago