Fuzzy comparison in R
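I have two datasets. df1 has one column with the company names registered in our CRM and another column with the name of the sales manager. df2 has one column with the names of companies that attended an IT event.
Because df2 was typed in by hand by the participants, it contains misspellings, abbreviations and so on, i.e. names that are only similar to the company names registered in the CRM.
The goal is to compare the company names in df2 (event attendees) with the registered company names in df1 and assign each match to its sales manager. Names that are not found, or that are too different, should get NA for the sales manager.
I am new to R and have tried various approaches with little success.
Could you help me build this script?
An example: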
df1                                   df2
|----------------|----------------|   |----------------|
| Company        | Sales Manager  |   | Company Event  |
|----------------|----------------|   |----------------|
|Customer 1 SA   |Erik            |   |Customer 1      |
|Customer 2 S\A  |Selma           |   |Customer 1 SA.  |
|Customer 3 Ltda.|Juca            |   |Customer2       |
|Customer 4      |Batista         |   |cUSTOIMER 3     |
|----------------|----------------|   |Customer 10     |
                                      |----------------|
The expected final result is another df with the matched data.
matched df
|----------------|----------------|----------------|
| Company Event  | Company        | Sales Manager  |
|----------------|----------------|----------------|
|Customer 1      |Customer 1 SA   |Erik            |
|Customer 1 SA.  |Customer 1 SA   |Erik            |
|Customer2       |Customer 2 S\A  |Selma           |
|cUSTOIMER 3     |Customer 3 Ltda.|Juca            |
|Customer 10     |NA              |NA              |
|----------------|----------------|----------------|
The following should work. It involves cleaning the names, finding the minimum distance, and then looking up the sales manager.
library(stringdist)

# declare data ------------------------------------------------------------
Company      <- c("Customer 1 SA", "Customer 2 S/A", "Customer 3 Ltda.", "Customer 4")
SalesManager <- c("Erik", "Selma", "Juca", "Batista")
CompanyEvent <- c("Customer 1", "Customer 1 SA.", "Customer2", "cUSTOIMER 3", "Customer 10")
df1 <- data.frame(Company, SalesManager, stringsAsFactors = FALSE)
df2 <- data.frame(CompanyEvent, stringsAsFactors = FALSE)

# clean 'dirty' names -----------------------------------------------------
# fixed = TRUE matches each pattern literally; without it the "." in "Ltda."
# is a regex wildcard that matches any character
df1$cleannames <- gsub("S/A", "", df1$Company, fixed = TRUE)
df1$cleannames <- gsub("SA", "", df1$cleannames, fixed = TRUE)
df1$cleannames <- gsub("Ltda.", "", df1$cleannames, fixed = TRUE)
df1$cleannames <- gsub(" ", "", df1$cleannames, fixed = TRUE)
df1$cleannames <- tolower(df1$cleannames)

df2$cleannames <- gsub("S/A", "", df2$CompanyEvent, fixed = TRUE)
df2$cleannames <- gsub("SA", "", df2$cleannames, fixed = TRUE)
df2$cleannames <- gsub("Ltda.", "", df2$cleannames, fixed = TRUE)
df2$cleannames <- gsub(" ", "", df2$cleannames, fixed = TRUE)
df2$cleannames <- tolower(df2$cleannames)

# Get the closest matches and distances -----------------------------------
# for each event name, take the CRM name with the smallest string distance
df2$closestentry <- apply(df2, 1, function(x) df1$cleannames[which.min(stringdist(x["cleannames"], df1$cleannames))])
df2$levdistance  <- apply(df2, 1, function(x) min(stringdist(x["cleannames"], df1$cleannames)))

# Get sales mgr data using closest matches --------------------------------
df2$salesmgr <- df1$SalesManager[match(df2$closestentry, df1$cleannames)]
df2
> df2
    CompanyEvent cleannames closestentry levdistance salesmgr
1     Customer 1  customer1    customer1           0     Erik
2 Customer 1 SA. customer1.    customer1           1     Erik
3      Customer2  customer2    customer2           0    Selma
4    cUSTOIMER 3 custoimer3    customer3           1     Juca
5    Customer 10 customer10    customer1           1     Erik
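Fuzzy string matching is... well, fuzzy, so you may run into some cases that are not what you expect, but with a few tweaks you should be fine (here that would mean adding customer10 to df1, for example).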
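The question asks for NA when there is no close match, whereas the script above assigns Customer 10 to Erik. One possible tweak is sketched below: discard matches whose distance exceeds a cutoff and, because customer10 is only one edit away from customer1, also require the digits in both names to agree. The cutoff `max_dist`, the helper `digits` and the digit rule are assumptions made for this toy data, not part of the original answer. The cleaned names are hard-coded so the snippet runs on its own.

```r
library(stringdist)

# Cleaned names as produced by the script above (hard-coded for a standalone example)
crm_company <- c("Customer 1 SA", "Customer 2 S/A", "Customer 3 Ltda.", "Customer 4")
crm_manager <- c("Erik", "Selma", "Juca", "Batista")
crm_clean   <- c("customer1", "customer2", "customer3", "customer4")
event_raw   <- c("Customer 1", "Customer 1 SA.", "Customer2", "cUSTOIMER 3", "Customer 10")
event_clean <- c("customer1", "customer1.", "customer2", "custoimer3", "customer10")

max_dist <- 2  # assumed cutoff; tune it on real data

# Distance matrix: one row per event name, one column per CRM name
d <- stringdistmatrix(event_clean, crm_clean, method = "osa")
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(best), best)]

# Assumed extra rule: the digits contained in both names must be identical
digits <- function(x) gsub("[^0-9]", "", x)
ok <- best_dist <= max_dist & digits(event_clean) == digits(crm_clean[best])
best[!ok] <- NA  # rejected matches become NA

matched <- data.frame(
  CompanyEvent = event_raw,
  Company      = crm_company[best],
  SalesManager = crm_manager[best],
  stringsAsFactors = FALSE
)
matched
```

On the sample data this keeps the first four matches and leaves Company and SalesManager as NA for Customer 10. The digit rule is specific to names of the form "Customer N"; real company names will probably need a different guard.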
The distance referred to here is a string distance; see ?stringdist.
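To get a feel for what these distances measure, here is a small sketch comparing a few of the methods that `stringdist()` accepts. The example strings, including the misspelling "cutsomer", are made up for illustration.

```r
library(stringdist)

# Default method is "osa" (optimal string alignment): number of edits needed
d_typo <- stringdist("custoimer3", "customer3")   # one extra letter
d_ten  <- stringdist("customer10", "customer1")   # one extra digit

# Swapping two adjacent characters costs 2 with plain Levenshtein ("lv")
# but only 1 with "osa"
d_lv  <- stringdist("cutsomer", "customer", method = "lv")
d_osa <- stringdist("cutsomer", "customer", method = "osa")

# Jaro-Winkler ("jw") is scaled to the range 0..1 instead of counting edits
d_jw <- stringdist("customer10", "customer1", method = "jw")

c(typo = d_typo, ten = d_ten, lv = d_lv, osa = d_osa, jw = d_jw)
```

Edit-count methods treat the typo in custoimer3 and the extra digit in customer10 the same way (distance 1 each), which is why a cutoff alone cannot separate the two cases in the example data.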