Recognizing synonyms in left_join in R

Question

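I have several fairly large data tables containing character strings that I want to join with entries in my database. The spelling is often not quite right, so the join fails. I know there is no way around creating a synonym table to replace some of the misspelled strings. But is there a way to detect certain anomalies automatically (see the example below)?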

My data table looks similar to this:

library(data.table)
library(dplyr)

data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))

The strings in my database look like this:

characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))

Currently, if I run a left_join, only "Apple" is joined:

data <- data %>%
  left_join(characters.database, by = c('products'))

Result:

products	ID
potatoe Chips	NA
potato Chips	NA
potato chips	NA
Potato-chips	NA
apple	NA
Apple	3
Appl	NA
Apple Gala	NA

Is it possible to have the left_join automatically ignore letter case, spaces (" "), hyphens ("-"), and an "e" at the end of a word?

This is the table I would like to get:

products	ID
potatoe Chips	1
potato Chips	1
potato chips	1
Potato-chips	1
apple	3
Apple	3
Appl	3
Apple Gala	NA

Any ideas?

Answer 1

If I were you, I would do a couple of things:

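I would strip all special characters, lowercase everything, remove spaces, and so on. This helps a lot (i.e. "potato Chips", "potato chips", and "Potato-chips" all become "potatochips", which you can then join on).
There is a package called fuzzyjoin that lets you join on regular expressions, by edit distance, and so on. This helps with cases like Apple vs. Apple Gala and with misspellings.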
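
As a rough sketch of the fuzzyjoin idea, assuming the `fuzzyjoin` package is installed: `stringdist_left_join()` joins rows whose strings are within `max_dist` edits of each other. The threshold of 2 and the column name `dist` are choices made for this illustration, not recommendations, and the tables are copied from the question as plain data frames.

```r
library(dplyr)
library(fuzzyjoin)

# Example tables from the question (plain data frames for simplicity)
data <- data.frame(
  products = c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips",
               "apple", "Apple", "Appl", "Apple Gala"),
  stringsAsFactors = FALSE
)
characters.database <- data.frame(
  products = c("Potato Chips", "Potato Chips Paprika", "Apple"),
  ID = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Join on edit distance, ignoring case.
# max_dist = 2 is an assumed threshold; tune it for your own data.
matched <- stringdist_left_join(
  data, characters.database,
  by = "products",
  method = "osa",
  max_dist = 2,
  ignore_case = TRUE,
  distance_col = "dist"
)

# products.x is the spelling from `data`, products.y the database spelling
print(matched)
```

With this threshold, "Appl" and "potatoe Chips" are close enough to match, while "Apple Gala" is five edits away from "Apple" and stays unmatched. A larger `max_dist` catches more typos but risks false matches on short names.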

You can strip special characters (keeping only letters) and lowercase the result, for example:

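library(stringr)
library(magrittr)

# example input; replace with your own character vector
string <- c("potatoe Chips", "Potato-chips")

string %>%
  str_remove_all("[^A-Za-z]+") %>%  # drop everything that is not a letter
  tolower()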
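
One way to use this cleanup for the join itself, sketched here under a few assumptions: build a normalized key in both tables, join on the key, and keep the original spelling. The helper `normalize_key()` and the column `key` are names invented for this example. Unlike the snippet above, it keeps digits and also drops an "e" at the end of each word, as the question asks.

```r
library(dplyr)

# Hypothetical helper: reduce a product name to a comparison key.
normalize_key <- function(x) {
  x <- tolower(x)                         # ignore case
  x <- gsub("e\\b", "", x, perl = TRUE)   # drop an "e" at the end of a word
  gsub("[^a-z0-9]+", "", x)               # drop spaces, hyphens, other symbols
}

# Example tables from the question (plain data frames for simplicity)
data <- data.frame(
  products = c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips",
               "apple", "Apple", "Appl", "Apple Gala"),
  stringsAsFactors = FALSE
)
characters.database <- data.frame(
  products = c("Potato Chips", "Potato Chips Paprika", "Apple"),
  ID = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Join on the normalized key, but keep the original spelling in the result.
result <- data %>%
  mutate(key = normalize_key(products)) %>%
  left_join(
    characters.database %>% transmute(key = normalize_key(products), ID),
    by = "key"
  ) %>%
  select(products, ID)

print(result)
```

Dropping every word-final "e" is aggressive and could merge names that are genuinely different, so it is worth checking the keys for collisions before relying on them.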

Answer 2

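Thanks to Matt Kaye's suggestion, I have now done something similar. Because I need the correct spelling from the database, and some of my strings contain relevant symbols and digits, I did the following: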

library(data.table)
library(dplyr)

#data
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))

#remove spaces and capital letters in data
data <- data %>%
  mutate(products= tolower(products)) %>%
  mutate(products= gsub(" ", "", products))

#add ID to database
characters.database <- characters.database %>%
  dplyr::mutate(ID = row_number())

#remove spaces and capital letters in database product names
characters.database_syn <- characters.database %>%
  mutate(products= tolower(products)) %>%
  mutate(products= gsub(" ", "", products))

#join and add correct spelling from database
data <- data %>%
  left_join(characters.database_syn, by = c('products')) %>%
  select(product_syn=products, 'ID') %>%
  left_join(characters.database, by = c('ID'))

#other synonyms have to be corrected manually or with the help of a synonym table (in MY data, special characters are relevant!)
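
For the leftovers, a synonym table could be applied roughly as follows. This is only a sketch: the tables `data_clean`, `database_clean`, and `synonyms`, and the columns `variant`, `canonical`, and `key`, are hypothetical names, and the synonym entries are made up from the example data.

```r
library(dplyr)

# Product names after the cleanup step (lowercase, no spaces); hypothetical names
data_clean <- data.frame(
  product_syn = c("potatoechips", "potatochips", "potato-chips", "appl", "applegala"),
  stringsAsFactors = FALSE
)
database_clean <- data.frame(
  product_syn = c("potatochips", "potatochipspaprika", "apple"),
  ID = 1:3,
  stringsAsFactors = FALSE
)

# Step 1: list the entries that still have no match, for manual review
unmatched <- anti_join(data_clean, database_clean, by = "product_syn")

# Step 2: hand-maintained synonym table (entries invented for this example)
synonyms <- data.frame(
  variant   = c("potatoechips", "potato-chips", "appl"),
  canonical = c("potatochips", "potatochips", "apple"),
  stringsAsFactors = FALSE
)

# Step 3: replace known variants by their canonical form, then join
resolved <- data_clean %>%
  left_join(synonyms, by = c("product_syn" = "variant")) %>%
  mutate(key = coalesce(canonical, product_syn)) %>%
  left_join(database_clean, by = c("key" = "product_syn")) %>%
  select(product_syn, ID)

print(unmatched)
print(resolved)
```

The `unmatched` table is the part that can be detected automatically; which of those entries belong in the synonym table remains a manual decision.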
