Match same strings over two different vectors
Suppose we have two different datasets:
Dataset A:
ids     name      price
1234    bread     1.5
245r7   butter    1.2
123984  red wine  5
43498   beer      1
235897  cream     1.8
Dataset B:
ids          name    price
24908        lait    1
1234,089     pain    1.7
77289,43498  bière   1.5
245r7        beurre  1.4
My goal is to match all products that share at least one ID and merge them into a new dataset, which should look like this:
id     a_name  b_name  a_price  b_price
1234   bread   pain    1.5      1.7
245r7  butter  beurre  1.2      1.4
43498  beer    bière   1        1.5
Is this feasible with stringr or any other R package?
We can use the sqldf package here:
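library(sqldf)
# Assumes the two data frames are named df_a and df_b.
# Wrapping both sides in commas makes LIKE match whole comma-separated IDs,
# not substrings of a longer ID.
sql <- "SELECT a.ids AS id, a.name AS a_name, b.name AS b_name, a.price AS a_price,
               b.price AS b_price
        FROM df_a a
        INNER JOIN df_b b
            ON ',' || b.ids || ',' LIKE '%,' || a.ids || ',%'"
output <- sqldf(sql)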
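To see why the join condition pads both sides with commas, here is a minimal base R sketch of the same test. The helper name has_id is hypothetical and only there for illustration; it is not part of sqldf or any package.

```r
# Hypothetical helper: TRUE when `id` is one of the comma-separated
# tokens in `id_list`, mirroring the SQL condition
#   ',' || b.ids || ',' LIKE '%,' || a.ids || ',%'
has_id <- function(id, id_list) {
  grepl(paste0(",", id, ","), paste0(",", id_list, ","), fixed = TRUE)
}

# A plain substring search would wrongly match a partial ID
naive <- grepl("123", "1234,089", fixed = TRUE)

whole <- has_id("1234", "1234,089")   # full token present
partial <- has_id("123", "1234,089")  # only a prefix of a token
```

Here naive is TRUE even though "123" is not one of the IDs, while has_id() only accepts whole tokens, which is what the padded LIKE pattern achieves in the query.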
You can use separate_rows to create a long dataset (one row per ID) and then do a join.
library(dplyr)
library(tidyr)
B %>%
separate_rows(ids, sep = ',') %>%
inner_join(A, by = 'ids')
# ids name.x price.x name.y price.y
# <chr> <chr> <dbl> <chr> <dbl>
#1 1234 pain 1.7 bread 1.5
#2 43498 bière 1.5 beer 1
#3 245r7 beurre 1.4 butter 1.2
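If the column names and order from the question are wanted, one possible follow-up is to join from A and rename in a select(). This is only a sketch: the sample data is retyped from the question (with "red wine" kept as a single name and numeric prices), and the _a/_b suffixes are my own choice.

```r
library(dplyr)
library(tidyr)

# Sample data retyped from the question (an assumption about how it was read in)
A <- tibble(
  ids   = c("1234", "245r7", "123984", "43498", "235897"),
  name  = c("bread", "butter", "red wine", "beer", "cream"),
  price = c(1.5, 1.2, 5, 1, 1.8)
)
B <- tibble(
  ids   = c("24908", "1234,089", "77289,43498", "245r7"),
  name  = c("lait", "pain", "bière", "beurre"),
  price = c(1, 1.7, 1.5, 1.4)
)

result <- A %>%
  # one row per individual ID in B, then keep only IDs present in both
  inner_join(separate_rows(B, ids, sep = ","), by = "ids",
             suffix = c("_a", "_b")) %>%
  # rename and reorder to the layout asked for in the question
  select(id = ids, a_name = name_a, b_name = name_b,
         a_price = price_a, b_price = price_b)
```

Joining from A keeps the rows in A's order, so result should list 1234, 245r7 and 43498, as in the desired output.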
Since separate_rows (my favorite) has already been provided by Ronak Shah, here is another strategy using strsplit and unnest():
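library(tidyr)
library(dplyr)
df_B %>%
  # turn the comma-separated string into a list-column of individual IDs
  mutate(ids = strsplit(as.character(ids), ",")) %>%
  # name the list-column explicitly; tidyr >= 1.0.0 warns when `cols` is omitted
  unnest(ids) %>%
  inner_join(df_A, by = "ids")
  ids   name.x price.x name.y price.y
  <chr> <chr>    <dbl> <chr>  <chr>
1 1234  pain       1.7 bread  1.5
2 43498 bière      1.5 beer   1
3 245r7 beurre     1.4 butter 1.2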
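For anyone without the tidyverse, the same split-then-join idea can be sketched in base R with strsplit(), rep() and merge(). The data frames below are retyped from the question (numeric prices, "red wine" as one name), so they differ slightly from the dput output in this answer, and the names long_B and merged are just illustrative.

```r
# Sample data retyped from the question (an assumption, not the dput below)
df_A <- data.frame(
  ids   = c("1234", "245r7", "123984", "43498", "235897"),
  name  = c("bread", "butter", "red wine", "beer", "cream"),
  price = c(1.5, 1.2, 5, 1, 1.8),
  stringsAsFactors = FALSE
)
df_B <- data.frame(
  ids   = c("24908", "1234,089", "77289,43498", "245r7"),
  name  = c("lait", "pain", "bière", "beurre"),
  price = c(1, 1.7, 1.5, 1.4),
  stringsAsFactors = FALSE
)

# Split each ids string into its individual IDs
id_list <- strsplit(df_B$ids, ",", fixed = TRUE)

# Repeat each row of df_B once per ID it carries, then flatten the IDs
long_B <- df_B[rep(seq_len(nrow(df_B)), lengths(id_list)), ]
long_B$ids <- unlist(id_list)

# Inner join on the now single-valued ids column
merged <- merge(df_A, long_B, by = "ids", suffixes = c("_a", "_b"))
```

merge() sorts by the key by default, so the rows come back ordered by ids rather than in the order of either input.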
Data:
# Note: "red wine" was split across two columns when df_A was read in (see the
# `problems` attribute), so row 3 has name "red" and price "wine", and
# df_A$price is character.
df_A <- structure(list(ids = c("1234", "245r7", "123984", "43498", "235897"
), name = c("bread", "butter", "red", "beer", "cream"), price = c("1.5",
"1.2", "wine", "1", "1.8")), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L), problems = structure(list(
row = 3L, col = NA_character_, expected = "3 columns", actual = "4 columns",
file = "'test'"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")))
df_B <- structure(list(ids = c("24908", "1234,089", "77289,43498", "245r7"
), name = c("lait", "pain", "bière", "beurre"), price = c(1,
1.7, 1.5, 1.4)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))