Return multiple possible matches when fuzzy joining two dataframes or vectors in R if they share a word in common
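Is there a way to join two data frames such that a row in the first data frame is matched with every row in the second data frame that it shares a word with?
For example: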
companies1 <- data.frame(company_name = c("Walmart", "Amazon", "Apple", "CVS Health", "UnitedHealth Group", "Berkshire Hathaway", "Alphabet"))
companies2 <- data.frame(company_name = c("Walmart Stores", "Walmart Inc", "Amazon Web Services", "Amazon Alexa", "Apple", "Apple Products", "CVS Health", "UnitedHealth Group", "Berkshire Hathaway", "Berkshire Hathaway Asset Management", "Meta"))
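I would like to match these so that every possible match between the left and right columns is returned.
I have tried packages such as fuzzymatch and stringdist, but they seem to return only the best match. However, because the matching I'm actually doing is larger and not as tidy as the example above, my plan is to find the possible matches, give each one a distance score (for example, using Jaro-Winkler distance), and then manually select the correct match (if there is one).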
With fuzzy_join:
library(fuzzyjoin)
fuzzy_join(companies2, companies1, match_fun = stringr::str_detect)
   company_name.x                       company_name.y
 1 Walmart Stores                       Walmart
 2 Walmart Inc                          Walmart
 3 Amazon Web Services                  Amazon
 4 Amazon Alexa                         Amazon
 5 Apple                                Apple
 6 Apple Products                       Apple
 7 CVS Health                           CVS Health
 8 UnitedHealth Group                   UnitedHealth Group
 9 Berkshire Hathaway                   Berkshire Hathaway
10 Berkshire Hathaway Asset Management  Berkshire Hathaway
Or, if you want to preserve the column order:
fuzzy_join(companies1, companies2, match_fun = function(x, y) stringr::str_detect(y, x))
   company_name.x      company_name.y
 1 Walmart             Walmart Stores
 2 Walmart             Walmart Inc
 3 Amazon              Amazon Web Services
 4 Amazon              Amazon Alexa
 5 Apple               Apple
 6 Apple               Apple Products
 7 CVS Health          CVS Health
 8 UnitedHealth Group  UnitedHealth Group
 9 Berkshire Hathaway  Berkshire Hathaway
10 Berkshire Hathaway  Berkshire Hathaway Asset Management
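The question also mentions scoring each candidate pair with a Jaro-Winkler distance before reviewing the pairs by hand. A minimal sketch of that step, assuming the stringdist package is installed, might look like the following. The shortened data frames, the column name jw_dist, and the prefix weight p = 0.1 are illustrative assumptions rather than part of the original answer.

```r
library(fuzzyjoin)
library(stringdist)

# Reduced copies of the example data (assumption: shortened for illustration)
companies1 <- data.frame(company_name = c("Walmart", "Amazon", "Apple"))
companies2 <- data.frame(company_name = c("Walmart Stores", "Walmart Inc", "Apple", "Meta"))

# Keep every pair where the name from companies1 appears inside the name from companies2
matches <- fuzzy_join(
  companies1, companies2,
  by = "company_name",
  match_fun = function(x, y) stringr::str_detect(y, x)
)

# Score each candidate pair; method = "jw" with p > 0 gives the Jaro-Winkler distance
# (0 means identical strings, 1 means no similarity)
matches$jw_dist <- stringdist(
  matches$company_name.x, matches$company_name.y,
  method = "jw", p = 0.1
)

# Sort so the closest candidates for each company come first
matches <- matches[order(matches$company_name.x, matches$jw_dist), ]
print(matches)
```

Sorting by the score within each company keeps all candidates visible while putting the most likely ones at the top, which suits a manual review pass. Note that str_detect interprets the name as a regular expression, so names containing characters such as "(" or "." may need escaping first.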