Computing similarity % in text strings by excluding the identical entries in R
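The R script below computes the similarity percentage between two names, as shown in the snapshot. There are two columns, "names1" and "names2", with their respective ids in id1 and id2. My requirement is that when the script runs, every name in "names1" is compared with every name in the "names2" column, except that an (id1, names1) entry must not be compared with the identical entry in (id2, names2). For illustration, the first (id1, names1) entry (1, Prabhudev Ramanujam) should be compared with all (id2, names2) entries except the first one, which has the same id and name. The same applies to every other pair. Also, could the formula
percent(sapply(names1, function(i)RecordLinkage::levenshteinSim(i,names2)))
be tweaked to produce similar results faster? It becomes slow on large data. A snapshot is attached; please help.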
library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
id1 <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2 <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")
Name_Data <- data.frame(id1, names1, id2, names2)
# Compare every name in names1 against every name in names2 (no exclusions yet)
Percent <- percent(sapply(names1, function(i)
  RecordLinkage::levenshteinSim(i, names2)))
Total_Value <- data.frame(id2, names2, Percent)
It's not much faster, but my suggestion is:
percent(unlist(lapply(seq_along(names1), function(x) {
  # drop only the names2 entries that match both the name and the id of row x
  levenshteinSim(names1[x], names2[!(names2 == names1[x] & id2 == id1[x])])
})))
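To make the exclusion rule concrete, here is a minimal base-R sketch of the mask for a single row, using the data from the question. The names `x` and `keep` are only illustrative. Row 2 is "Deepak Subramaniam" with id 2; that name appears three times in names2, but only the entry whose id is also 2 gets dropped.

```r
id1 <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2 <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")

x <- 2  # illustrative row: (2, "Deepak Subramaniam")

# TRUE for every names2 entry that should still be compared with row x
keep <- !(names2 == names1[x] & id2 == id1[x])

which(!keep)   # position of the excluded entry: 2
names2[keep]   # the 7 names that row x is compared against
```

The same-named entries with ids 11 and 10 stay in, so they still show up with a similarity of 100%.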
Edit:
Alternatively, this might be faster, though I suspect the gain will vary:
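# full similarity matrix: rows follow names1, columns follow names2
sim <- 1 - (stringdistmatrix(names1, names2, method = "lv") /
  outer(nchar(names1), nchar(names2), pmax))
# same exclusion mask as above, one block of length(names2) per names1 entry
keep <- unlist(lapply(seq_along(names1), function(x) !(names2 == names1[x] & id2 == id1[x])))
as.vector(t(sim))[keep]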
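If it helps to see which pair each value belongs to, the matrix approach can also be sketched so that it returns a long data frame. This is only a sketch under some assumptions: it swaps in base R's `utils::adist` for `stringdistmatrix` so that it runs without extra packages (with default costs it should give the same plain Levenshtein distances), it assumes no empty strings (which would cause a division by zero), and the names `sim_mat`, `same_entry`, `pairs`, and `result` are made up for illustration.

```r
id1 <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2 <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")

# Levenshtein distances for all pairs (rows: names1, columns: names2)
dist_mat <- utils::adist(names1, names2)

# Normalise by the longer string of each pair, as levenshteinSim does
max_len <- outer(nchar(names1), nchar(names2), pmax)
sim_mat <- 1 - dist_mat / max_len

# TRUE where both the name and the id are identical (the pairs to exclude)
same_entry <- outer(names1, names2, "==") & outer(id1, id2, "==")

# Row/column indices of the pairs that are kept, in column-major order
pairs <- which(!same_entry, arr.ind = TRUE)

result <- data.frame(
  id1        = id1[pairs[, "row"]],
  names1     = names1[pairs[, "row"]],
  id2        = id2[pairs[, "col"]],
  names2     = names2[pairs[, "col"]],
  similarity = sim_mat[!same_entry]  # same column-major order as `pairs`
)

nrow(result)  # 60: 64 pairs minus the 4 identical entries
```

Building the mask with `outer` avoids the per-row `lapply`, and `scales::percent(result$similarity)` can be applied afterwards if the formatted percentages are needed.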