Text Match combinations
My data is as follows:
# dummy data
library(data.table)
ID <- 1:12
addrs <- c("3 xx road sg", "4 yy road sg", "5 apt 04-3 sg", "Bung 2 , kl road sg",
           "4 yy road sg", "3 xx road sg", "Bung 2 , kl road sg", "5 apt 04-3 sg",
           "3 xx road sg", "Bung 2 , sg kl road", "3xx Road sg", "4 yy sg")
data.1 <- data.table(ID, addrs)
The data looks like this (first 9 rows shown):
   ID               addrs
1:  1        3 xx road sg
2:  2        4 yy road sg
3:  3       5 apt 04-3 sg
4:  4 Bung 2 , kl road sg
5:  5        4 yy road sg
6:  6        3 xx road sg
7:  7 Bung 2 , kl road sg
8:  8       5 apt 04-3 sg
9:  9        3 xx road sg
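I want to get the matching combinations (based on addrs). The desired output is shown below (only for "3 xx road sg" as an example). If an address matches for A and B, the table should contain both the A-B match and the B-A match.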
ID.1 ID.2 Match.1       Match.2       Accuracy
1    6    3 xx road sg  3 xx road sg  100%
1    9    3 xx road sg  3 xx road sg  100%
6    9    3 xx road sg  3 xx road sg  100%
9    6    3 xx road sg  3 xx road sg  100%
9    1    3 xx road sg  3 xx road sg  100%
6    1    3 xx road sg  3 xx road sg  100%
The output should also include matches where the text differs in spacing, word order, or individual characters:
ID.1 ID.2 Match.1              Match.2              Accuracy
1    11   3 xx road sg         3xx Road sg          100%
2    12   4 yy road sg         4 yy sg              70%
4    10   Bung 2 , kl road sg  Bung 2 , sg kl road  100%
Any further input on how to handle text matching when the data may be similar but written differently?
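# Self-join on addrs so every ID is paired with every ID sharing that exact address
r <- merge(data.1, data.1, by = "addrs", all = TRUE, suffixes = c(".1", ".2"))
# Drop the rows where an ID is paired with itself
r[r$ID.1 != r$ID.2, ]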
                 addrs ID.1 ID.2
2         3 xx road sg    1    6
3         3 xx road sg    1    9
4         3 xx road sg    6    1
6         3 xx road sg    6    9
7         3 xx road sg    9    1
8         3 xx road sg    9    6
11        4 yy road sg    2    5
12        4 yy road sg    5    2
15       5 apt 04-3 sg    3    8
16       5 apt 04-3 sg    8    3
19 Bung 2 , kl road sg    7    4
20 Bung 2 , kl road sg    4    7
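The merge above only pairs addresses that are identical character for character. For the second part of the question (differences in spacing, case, or word order), one possible approach is to compare a normalised key instead of the raw text. The following is a rough sketch in base R, not a definitive solution: `normalise_addr()` and `match_pairs()` are hypothetical helper names, the 0.9 threshold is an arbitrary assumption, and the similarity score is simply one minus the Levenshtein distance (from `utils::adist`) divided by the longer key length.

```r
# Sketch only: normalise_addr() and match_pairs() are hypothetical helpers,
# and the default threshold of 0.9 is an assumption, not a recommendation.
ID <- 1:12
addrs <- c("3 xx road sg", "4 yy road sg", "5 apt 04-3 sg", "Bung 2 , kl road sg",
           "4 yy road sg", "3 xx road sg", "Bung 2 , kl road sg", "5 apt 04-3 sg",
           "3 xx road sg", "Bung 2 , sg kl road", "3xx Road sg", "4 yy sg")

# Build a key that ignores case, spacing, punctuation and word order:
# lower-case, keep only letters and digits, then sort the characters.
normalise_addr <- function(x) {
  x <- gsub("[^a-z0-9]", "", tolower(x))
  vapply(strsplit(x, ""),
         function(ch) paste(sort(ch, method = "radix"), collapse = ""),
         character(1))
}

# Return every ordered pair (A-B and B-A) whose keys are similar enough.
match_pairs <- function(id, addr, threshold = 0.9) {
  key <- normalise_addr(addr)
  d   <- utils::adist(key)                    # pairwise Levenshtein distances
  len <- nchar(key)
  sim <- 1 - d / pmax(outer(len, len, pmax), 1)
  # Keep pairs above the threshold, excluding each address paired with itself
  idx <- which(sim >= threshold & row(sim) != col(sim), arr.ind = TRUE)
  out <- data.frame(ID.1 = id[idx[, 1]], ID.2 = id[idx[, 2]],
                    Match.1 = addr[idx[, 1]], Match.2 = addr[idx[, 2]],
                    Accuracy = round(100 * sim[idx]),
                    stringsAsFactors = FALSE)
  out[order(out$ID.1, out$ID.2), ]
}

pairs <- match_pairs(ID, addrs)
print(pairs[pairs$ID.1 %in% c(1, 4), ])
```

With this key, IDs 1 and 11 ("3 xx road sg" / "3xx Road sg") and IDs 4 and 10 (the reordered "Bung 2" address) come out as full matches. Sorting characters is crude, since any two anagrams get the same key, and lowering the threshold to catch shortened addresses such as "4 yy sg" will also pull in unrelated addresses that merely share most of their letters. For real data a dedicated string-distance package is probably a better fit.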