Return multiple possible matches when fuzzy joining two dataframes or vectors in R if they share a word in common

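Is there a way to join two data frames so that a row in the first data frame is matched to every row in the second data frame with which it shares a word?

For example: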

companies1 <- data.frame(company_name = c("Walmart", "Amazon", "Apple", "CVS Health", "UnitedHealth Group", "Berkshire Hathaway", "Alphabet"))
companies2 <- data.frame(company_name = c("Walmart Stores", "Walmart Inc", "Amazon Web Services", "Amazon Alexa", "Apple", "Apple Products", "CVS Health", "UnitedHealth Group", "Berkshire Hathaway", "Berkshire Hathaway Asset Management", "Meta"))

I want to match these so that every possible match between the left and right columns is returned.

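I have tried packages such as fuzzymatch and stringdist, but for matching they seem to return only the best match. However, because the matching I am actually doing is larger and not as tidy as the example above, my plan is to find the possible matches and then give each one a distance score (for example, using Jaro-Winkler distance), at which point I'll have to manually select the correct match (if there is one).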

You can do this with fuzzy_join from the fuzzyjoin package, using stringr::str_detect as the match function:

library(fuzzyjoin)
fuzzy_join(companies2, companies1, match_fun = stringr::str_detect)

                        company_name.x     company_name.y
1                       Walmart Stores            Walmart
2                          Walmart Inc            Walmart
3                  Amazon Web Services             Amazon
4                         Amazon Alexa             Amazon
5                                Apple              Apple
6                       Apple Products              Apple
7                           CVS Health         CVS Health
8                   UnitedHealth Group UnitedHealth Group
9                   Berkshire Hathaway Berkshire Hathaway
10 Berkshire Hathaway Asset Management Berkshire Hathaway

Or, if you want to keep the order of the columns:

fuzzy_join(companies1, companies2, match_fun = function(x, y) stringr::str_detect(y, x))

       company_name.x                      company_name.y
1             Walmart                      Walmart Stores
2             Walmart                         Walmart Inc
3              Amazon                 Amazon Web Services
4              Amazon                        Amazon Alexa
5               Apple                               Apple
6               Apple                      Apple Products
7          CVS Health                          CVS Health
8  UnitedHealth Group                  UnitedHealth Group
9  Berkshire Hathaway                  Berkshire Hathaway
10 Berkshire Hathaway Berkshire Hathaway Asset Management
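
The question also mentions scoring each candidate pair with a Jaro-Winkler distance before picking matches by hand. A minimal sketch of that step, assuming the stringdist package is installed, could look like the following. The shortened example data, the jw_dist column name and the prefix weight p = 0.1 are illustrative choices, not part of the original answer.

```r
library(fuzzyjoin)
library(stringdist)

# Shortened example data (illustrative subset of the names in the question)
companies1 <- data.frame(company_name = c("Walmart", "Amazon", "Apple"))
companies2 <- data.frame(company_name = c("Walmart Stores", "Walmart Inc",
                                          "Amazon Alexa", "Apple", "Meta"))

# All candidate pairs: keep a pair when the name from companies1
# is detected inside the name from companies2
candidates <- fuzzy_join(
  companies1, companies2,
  by = "company_name",
  match_fun = function(x, y) stringr::str_detect(y, x)
)

# Jaro-Winkler distance per pair: 0 means identical strings,
# values closer to 1 mean less similar (p is the prefix weight)
candidates$jw_dist <- stringdist(
  candidates$company_name.x,
  candidates$company_name.y,
  method = "jw", p = 0.1
)

# Closest candidates first within each company_name.x
candidates <- candidates[order(candidates$company_name.x, candidates$jw_dist), ]
candidates
```

Note that str_detect treats the second argument as a regular expression, so names containing characters such as "." or "(" may need escaping before they are used as patterns.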