R function to identify non-matching rows

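I am trying to compare two data.frames: "V1" represents my CRM and "V2" represents the leads I would like to send out.

V1 has roughly 8k elements and V2 has roughly 25k elements.

I need to compare every row in V2 with every row in V1 and throw out every V2 element that already exists in V1.

I would then like to return only the elements that do not appear in V1, either exactly or loosely, in a Leads column.

The goal is to send out leads (V2) that do not already exist in the CRM (V1).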

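I have made some decent progress with the stringdist package, and I divided 'soundex' by 'osa' to improve my chances, but this method still returns elements that are in V1. :(

Here is the expected result I am looking for in the "Leads" column, based on this example:

Leads:
J. Jones Restoration
A.W. Builders
C & C Contractors

Any help would be appreciated, and I apologize if anything is unclear.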

library(reprex)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
library(tidystringdist)

df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "Paragon Const.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")

tidy_e <- tidy_stringdist(df) %>% 
  filter(soundex>=1) %>% 
  select(-V1, V2) %>% 
  arrange(V2,osa) %>% 
  mutate(V2, sim = soundex/ osa) %>% 
  distinct(V2, osa, soundex, sim) %>% 
  rename('Leads'= 'V2')

Created on 2020-04-13 by the reprex package (v0.3.0)
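
For comparison, here is a minimal base R sketch of an exact-match baseline on the sample names above. The vector names `crm` and `leads` and the helper `normalize()` are made up for this illustration, and lowercasing plus trimming whitespace is only an assumed normalization. It shows why exact matching alone is not enough: near-duplicates such as "A & L Const." are still kept as leads.

```r
# Sample names from the question, as plain character vectors
crm <- c("5th Generation Builder", "5th Generation Builders Inc.",
         "89 Contractors LLC", "906 Studio Architects LLC", "A & A Glass Co.",
         "A & E Farm", "A & H GLASS",
         "A & J Homeworks,Painting, and Restoration", "Paragon Const.",
         "A & L Construction")
leads <- c("5th Generation Builder, LLC", "5th Generation Builders",
           "89 Contractors LLC", "906 Studio Architects", "Paragon Const.",
           "A & E Farm", "C & C Contractors", "A.W. Builders",
           "J. Jones Restoration", "A & L Const.")

# Hypothetical helper: compare names case-insensitively, ignoring outer whitespace
normalize <- function(x) tolower(trimws(x))

# Keep only the leads with no exact (normalized) match in the CRM
exact_new <- leads[!normalize(leads) %in% normalize(crm)]
exact_new
```

Only the three names that are identical in both lists are dropped here, which is why some form of loose matching is needed.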

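You can use the fuzzyjoin package, which is designed for joining tables based on inexact matching such as string distance. (Disclaimer: I'm the maintainer.)

If your data is in two separate tables, V1 and V2: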

library(dplyr)      # tibble() and %>%
library(fuzzyjoin)  # stringdist_*_join()

V1 <- tibble(name = c("5th Generation Builder", "5th Generation Builders Inc.", "89 Contractors LLC", 
                      "906 Studio Architects LLC", "A & A Glass Co.", "A & E Farm", 
                      "A & H GLASS", "A & J Homeworks,Painting, and Restoration", "Paragon Const.", 
                      "A & L Construction"))

V2 <- tibble(name = c("5th Generation Builder, LLC", "5th Generation Builders", "89 Contractors LLC", 
                      "906 Studio Architects", "Paragon Const.", "A & E Farm", "C & C Contractors", 
                      "A.W. Builders", "J. Jones Restoration", "A & L Const."))

You can then use stringdist_anti_join() to find the rows in V2 that have no soundex match in V1:

V2 %>%
  stringdist_anti_join(V1, by = "name", method = "soundex")

Result:

# A tibble: 3 x 1
  name                
  <chr>               
1 C & C Contractors   
2 A.W. Builders       
3 J. Jones Restoration

For more on the stringdist_ joins, see this vignette.
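
As far as I understand, with method = "soundex" two names count as a match when their soundex codes are equal. Under that assumption, the same filter can be sketched without a join by computing the codes once with stringdist::phonetic() and doing an exact lookup, which may be worth trying if comparing all pairs of the full 25k x 8k tables is too slow. The vector names `crm`, `leads`, and `new_leads` are made up for this sketch, and it has not been checked against fuzzyjoin on the full data.

```r
library(stringdist)

crm <- c("5th Generation Builder", "5th Generation Builders Inc.",
         "89 Contractors LLC", "906 Studio Architects LLC", "A & A Glass Co.",
         "A & E Farm", "A & H GLASS",
         "A & J Homeworks,Painting, and Restoration", "Paragon Const.",
         "A & L Construction")
leads <- c("5th Generation Builder, LLC", "5th Generation Builders",
           "89 Contractors LLC", "906 Studio Architects", "Paragon Const.",
           "A & E Farm", "C & C Contractors", "A.W. Builders",
           "J. Jones Restoration", "A & L Const.")

# One soundex code per name (computed once per table, not once per pair)
crm_codes   <- phonetic(crm, method = "soundex")
leads_codes <- phonetic(leads, method = "soundex")

# Inspect the codes side by side to see why two names are treated as a match
data.frame(name = leads, code = leads_codes)

# Keep the leads whose code does not occur anywhere in the CRM
new_leads <- leads[!leads_codes %in% crm_codes]
new_leads
```

Soundex codes are short, so this is a deliberately loose criterion: it errs on the side of dropping a lead rather than sending a duplicate.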


Note that if you want to see which rows matched, you can use stringdist_left_join():

V2 %>%
  stringdist_left_join(V1, by = "name", method = "soundex")
# A tibble: 12 x 2
   name.x                      name.y                      
   <chr>                       <chr>                       
 1 5th Generation Builder, LLC 5th Generation Builder      
 2 5th Generation Builder, LLC 5th Generation Builders Inc.
 3 5th Generation Builders     5th Generation Builder      
 4 5th Generation Builders     5th Generation Builders Inc.
 5 89 Contractors LLC          89 Contractors LLC          
 6 906 Studio Architects       906 Studio Architects LLC   
 7 Paragon Const.              Paragon Const.              
 8 A & E Farm                  A & E Farm                  
 9 C & C Contractors           NA                          
10 A.W. Builders               NA                          
11 J. Jones Restoration        NA                          
12 A & L Const.                A & L Construction
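
If soundex turns out to be too loose or too strict for the real data, the same join accepts other stringdist methods. Below is a minimal sketch using Jaro-Winkler distance on a subset of the sample names. The 0.15 cutoff and the column name "dist" are arbitrary choices for illustration, not recommended values, so inspect the distances on your own data before settling on a threshold.

```r
library(dplyr)
library(fuzzyjoin)

# A subset of the sample names from above
V1 <- tibble(name = c("5th Generation Builder", "89 Contractors LLC",
                      "906 Studio Architects LLC", "Paragon Const.",
                      "A & L Construction"))
V2 <- tibble(name = c("5th Generation Builder, LLC", "906 Studio Architects",
                      "C & C Contractors", "J. Jones Restoration",
                      "A & L Const."))

# Jaro-Winkler distance ranges from 0 (identical) to 1 (nothing in common).
# distance_col adds the computed distance for each matched pair;
# rows of V2 without any match keep NA in name.y.
res <- V2 %>%
  stringdist_left_join(V1, by = "name", method = "jw", max_dist = 0.15,
                       ignore_case = TRUE, distance_col = "dist")
res
```

Once the threshold looks reasonable, swapping stringdist_left_join() for stringdist_anti_join() (without distance_col) returns only the unmatched leads.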