Merge two tables on LIKE, but matching whole words rather than parts of strings
This is my first post/question, so please be kind.
I have a data frame like this:
id product
1 00109290 Wax Salt; Pepper
2 23243242 Wood Stuff
3 23242433 Magic Unicorn Powder and My Tears
4 23778899 gelatin
5 25887766 tin;
6 7786655 fart noises, and things
7 3432422 --spearmint bacon& hydrangia leaves
I have a lookup table like this:
ingredients
1 wax
2 salt
3 wood
4 my tears
5 unicorn powder
6 gelatin
7 tin
8 hydrangia leaves
9 spearmint
10 bacon
I want to merge them on whole strings, so that I get this:
id product ingredients
1 00109290 Wax Salt; Pepper wax
2 00109290 Wax Salt; Pepper salt
3 23243242 Wood Stuff wood
4 23242433 Magic Unicorn Powder and My Tears my tears
5 23242433 Magic Unicorn Powder and My Tears unicorn powder
6 23778899 gelatin gelatin
7 25887766 tin; tin
8 3432422 --spearmint bacon& hydrangia leaves hydrangia leaves
9 3432422 --spearmint bacon& hydrangia leaves spearmint
10 3432422 --spearmint bacon& hydrangia leaves bacon
Instead, I get this (note the unwanted row 7):
id product ingredients
1 00109290 Wax Salt; Pepper wax
2 00109290 Wax Salt; Pepper salt
3 23243242 Wood Stuff wood
4 23242433 Magic Unicorn Powder and My Tears my tears
5 23242433 Magic Unicorn Powder and My Tears unicorn powder
6 23778899 gelatin gelatin
7 23778899 gelatin tin
8 25887766 tin; tin
9 3432422 --spearmint bacon& hydrangia leaves hydrangia leaves
10 3432422 --spearmint bacon& hydrangia leaves spearmint
11 3432422 --spearmint bacon& hydrangia leaves bacon
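I am very close, but I am incorrectly matching 'gelatin' with 'tin'. I want to match whole words, not parts of words. I have tried many different techniques; the closest I have come is: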
library(sqldf)
id <- c('00109290', '23243242', '23242433',
'23778899', '25887766', '7786655',
'3432422')
product <- c('Wax Salt; Pepper', 'Wood Stuff',
'Magic Unicorn Powder and My Tears',
'gelatin', 'tin;', 'fart noises, and things',
'--spearmint bacon& hydrangia leaves')
ingredients <- c('wax', 'salt', 'wood', 'my tears',
'unicorn powder', 'gelatin', 'tin',
'hydrangia leaves',
'spearmint', 'bacon')
products <- data.frame(id, product)
ingred <- data.frame(ingredients)
new_df <- sqldf("SELECT * from products
join ingred on product LIKE '%' || ingredients || '%'")
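Any suggestions are greatly appreciated. Maybe a completely different approach is needed? I also welcome feedback on the quality of the question; this is my first, so you may as well set me straight right away.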
Consider adding OR conditions for a space before or after the keyword, plus an exact match, and replacing any special characters/punctuation.
new_df <- sqldf("SELECT * from products
join ingred on Replace(product, ';', '') LIKE '% ' || ingredients || '%'
OR Replace(product, ';', '') LIKE '%' || ingredients || ' %'
OR Replace(product, ';', '') = ingredients
")
You can even UNION queries for different special characters. The example below replaces semicolons and exclamation marks:
new_df <- sqldf("SELECT * from products
join ingred on Replace(product, ';', '') LIKE '% ' || ingredients || '%'
OR Replace(product, ';', '') LIKE '%' || ingredients || ' %'
OR Replace(product, ';', '') = ingredients
UNION
SELECT * from products
join ingred on Replace(product, '!', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '!', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '!', '') = ingredients
")
For many UNIONs, consider letting R concatenate the SQL statement:
sql <- paste(lapply(c("!", "#", "$", "%", "(", ")", ":", ";", ".", "?", ">", "<", "/", "\\", "|"),
function(i)
paste0("SELECT * from products
join ingred on Replace(product, '", i, "', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '", i, "', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '", i, "', '') = ingredients
")
), collapse = "UNION ")
cat(paste(sql))
SELECT * from products
join ingred on Replace(product, '!', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '!', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '!', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '#', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '#', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '#', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '$', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '$', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '$', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '%', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '%', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '%', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '(', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '(', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '(', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, ')', '') LIKE '% ' || ingredients || '%'
OR Replace(product, ')', '') LIKE '%' || ingredients || ' %'
OR Replace(product, ')', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, ':', '') LIKE '% ' || ingredients || '%'
OR Replace(product, ':', '') LIKE '%' || ingredients || ' %'
OR Replace(product, ':', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, ';', '') LIKE '% ' || ingredients || '%'
OR Replace(product, ';', '') LIKE '%' || ingredients || ' %'
OR Replace(product, ';', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '.', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '.', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '.', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '?', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '?', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '?', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '>', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '>', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '>', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '<', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '<', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '<', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '/', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '/', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '/', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '\', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '\', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '\', '') = ingredients
UNION SELECT * from products
join ingred on Replace(product, '|', '') LIKE '% ' || ingredients || '%'
OR Replace(product, '|', '') LIKE '%' || ingredients || ' %'
OR Replace(product, '|', '') = ingredients
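As an alternative to one SELECT per character, the Replace calls can be nested so that a single query strips every character at once. The sketch below is not from the answer above: the character list is illustrative, and it pads both sides with a space so that one LIKE condition covers a keyword at the start, middle, or end of the product. It only builds and prints the SQL string; passing it to sqldf is left as a commented line.

```r
# Punctuation to strip; an illustrative list, extend as needed.
# (A single quote would need SQL escaping and is left out here.)
chars <- c("!", ";", "&", "-", ",", ".")

# Wrap the column in one Replace() call per character:
# Replace(Replace(product, '!', ''), ';', '') and so on
cleaned <- Reduce(
  function(acc, ch) sprintf("Replace(%s, '%s', '')", acc, ch),
  chars,
  "product"
)

# Pad the cleaned product and the keyword with spaces so that
# only whole, space-delimited keywords can match
sql <- paste0(
  "SELECT * FROM products\n",
  "JOIN ingred ON (' ' || ", cleaned, " || ' ')\n",
  "  LIKE ('% ' || ingredients || ' %')"
)
cat(sql, "\n")

# new_df <- sqldf::sqldf(sql)  # needs the sqldf package and the data frames from the question
```

One caveat with removing characters rather than replacing them with a space: text such as 'salt,pepper' would collapse into a single word, so replacing with ' ' may be the safer choice for some data.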
A solution using the fuzzyjoin package, with str_detect from stringr:
library(fuzzyjoin)
library(stringr)
f <- function(x, y) {
# tests whether y is an ingredient of x
# "\\b" is the regex word boundary; a single "\b" in an R string is a backspace character
str_detect(x, regex(paste0("\\b", y, "\\b"), ignore_case = TRUE))
}
fuzzy_join(products,
ingred,
by = c("product" = "ingredients"),
match_fun = f)
# id product ingredients
# 1 109290 Wax Salt; Pepper wax
# 2 109290 Wax Salt; Pepper salt
# 3 23243242 Wood Stuff wood
# 4 23242433 Magic Unicorn Powder and My Tears my tears
# 5 23242433 Magic Unicorn Powder and My Tears unicorn powder
# 6 23778899 gelatin gelatin
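The same word-boundary idea can be tried without any extra packages. The following is a minimal base R sketch, assuming the ingredient strings contain no regex metacharacters; the helper name word_match is made up for illustration. It cross joins the question's full sample data and keeps only the whole-word matches.

```r
# Sample data from the question
products <- data.frame(
  id = c("00109290", "23243242", "23242433", "23778899",
         "25887766", "7786655", "3432422"),
  product = c("Wax Salt; Pepper", "Wood Stuff",
              "Magic Unicorn Powder and My Tears",
              "gelatin", "tin;", "fart noises, and things",
              "--spearmint bacon& hydrangia leaves"),
  stringsAsFactors = FALSE
)
ingred <- data.frame(
  ingredients = c("wax", "salt", "wood", "my tears", "unicorn powder",
                  "gelatin", "tin", "hydrangia leaves", "spearmint", "bacon"),
  stringsAsFactors = FALSE
)

# TRUE when `word` occurs in `text` as a whole word (case-insensitive).
# Hypothetical helper; assumes `word` has no regex metacharacters.
word_match <- function(text, word) {
  grepl(paste0("\\b", word, "\\b"), text, ignore.case = TRUE)
}

# Cross join every product with every ingredient,
# then keep only the pairs that match on a whole word
pairs <- merge(products, ingred, by = NULL)
keep <- mapply(word_match, pairs$product, pairs$ingredients, USE.NAMES = FALSE)
result <- pairs[keep, ]
print(result, row.names = FALSE)
```

The cross join grows as rows(products) × rows(ingred), so this is only reasonable for small lookup tables.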
Data
products <- read.table(text = "
id product
1 00109290 'Wax Salt; Pepper'
2 23243242 'Wood Stuff'
3 23242433 'Magic Unicorn Powder and My Tears'
4 23778899 gelatin
", stringsAsFactors = FALSE)
ingred <- read.table(text = "
ingredients
1 wax
2 salt
3 wood
4 'my tears'
5 'unicorn powder'
6 gelatin
7 tin
", stringsAsFactors = FALSE)