Vectorized text mining over multiple columns

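I have some code that I would like to vectorize, but I'm not sure how. The code below sets up some sample data consisting of names and addresses.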

name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md", 
         "811 quincy st washington dc", "1911 1st st rockville md")

source1 <- data.frame(name, address)

name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
      "joes crag shack", "mike lowry place", "holiday inn", "zummer")

name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
         "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
         "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
         "400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address) 

This block computes the edit distance between two text columns using R's native adist function, then applies min to the result.

dist.name<- adist(source1$name,source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

min.name<-apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)

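I would like to do the following:

  1. Match source1$name to source2$name based on the minimum edit distance.
  2. If step 1 yields NA, match source1$address to source2$address using the edit distance instead.

I tried a for loop, which works for step 1 but not for step 2. This is the code I used to try to combine the two: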

    match.s1.s2 <- NULL
    for (i in 1:nrow(dist.name)) {
      for (j in 1:nrow(dist.address)) {
        if (is.na(match(min.name[i], dist.name[i, ]))) {
          s2.i <- match(min.address[j], dist.address[j, ])
          s1.i <- i
          match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i, s2name = source2[s2.i, ]$name,
                                          s1name = source1[s1.i, ]$name, adist = min.name[j],
                                          s1.i.address = source1[s1.i, ]$address,
                                          s2.i.address = source2[s2.i, ]$address), match.s1.s2)
        } else {
          s2.i <- match(min.name[i], dist.name[i, ])
          s1.i <- i
          match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i, s2name = source2[s2.i, ]$name,
                                          s1name = source1[s1.i, ]$name, adist = min.name[i],
                                          s1.i.address = source1[s1.i, ]$address,
                                          s2.i.address = source2[s2.i, ]$address), match.s1.s2)
        }
      }
    }

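My problem is that this is slow and the resulting data frame is far too large. The final data frame, match.s1.s2, should have the same number of rows as source1. Any suggestions or help would be greatly appreciated. Thanks.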

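It would be more effective to use normalized scores (between 0 and 1). That way the name and address scores are on the same scale, and you can use the vectorized ifelse to replace only the NA entries with the corresponding address scores. With non-normalized scores, you have to replace the whole row. Try this approach: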

dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

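# Option 1: non-normalized distances. Name and address distances are on
# different scales, so replace the whole row whenever it contains an NA.
dist.mat <- dist.mat.nm
for (i in seq_len(nrow(dist.mat))) {
  if (anyNA(dist.mat[i, ])) dist.mat[i, ] <- dist.mat.ad[i, ]
}

# Option 2: normalized distances. Scores are comparable, so replace only the
# NA cells. Run one option or the other; this line overwrites option 1.
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)

# Return the source2 name with the smallest distance (the first one on ties)
which.match <- function(x, nm) nm[which(x == min(x))[1]]

matches <- apply(dist.mat, 1, which.match, nm = source2$name)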
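
For reference, here is a minimal base R sketch of the normalized variant that ends with one row per source1 entry. Several things here are assumptions rather than part of the code above: the helper norm_adist is hypothetical, it uses the full-string distance (no partial = TRUE) so that dividing by the longer string's length keeps scores between 0 and 1, and one name is blanked out on purpose so the address fallback is actually exercised.

```r
# Sample data copied from the question (name2 is left out for brevity)
source1 <- data.frame(
  name = c("holiday inn", "geico", "zgf", "morton phillips"),
  address = c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
              "811 quincy st washington dc", "1911 1st st rockville md"),
  stringsAsFactors = FALSE
)
source2 <- data.frame(
  name = c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
           "joes crag shack", "mike lowry place", "holiday inn", "zummer"),
  address = c("2 reads way new castle de", "248 w 4th st newark de",
              "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
              "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
              "400 lafayette pl tupelo ms", "811 quincy st washington dc"),
  stringsAsFactors = FALSE
)

# Assumption for illustration only: blank out one name to trigger the fallback
s1 <- source1
s1$name[3] <- NA

# Hypothetical helper: Levenshtein distance divided by the longer string's length
norm_adist <- function(x, y) {
  d <- adist(x, y, ignore.case = TRUE)
  d / pmax(outer(nchar(x), nchar(y), pmax), 1)
}

score_nm <- norm_adist(s1$name, source2$name)
score_ad <- norm_adist(s1$address, source2$address)

# Use the address score only where the name score is missing
score <- ifelse(is.na(score_nm), score_ad, score_nm)

# Column index of the best (smallest) score for each source1 row
best <- apply(score, 1, which.min)

result <- data.frame(
  s1.i   = seq_len(nrow(s1)),
  s2.i   = best,
  s1name = s1$name,
  s2name = source2$name[best],
  score  = score[cbind(seq_len(nrow(s1)), best)]
)
```

Because the best match is picked once per row of the score matrix, result has exactly as many rows as source1, which is the shape the question asks for.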

This may improve performance and solve your problem. If you are willing to switch to a normalized distance (instead of Levenshtein), I would recommend Jaro-Winkler.
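
As a rough sketch of what a Jaro-Winkler version might look like: this assumes the stringdist package is installed, which the answer does not name, so treat the package choice and the prefix weight p = 0.1 as assumptions. The names nm1, nm2, and jw_matches are made up for the example.

```r
# Names copied from the question's sample data
nm1 <- c("holiday inn", "geico", "zgf", "morton phillips")
nm2 <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
         "joes crag shack", "mike lowry place", "holiday inn", "zummer")

# Only runs if the (assumed) stringdist package is available
if (requireNamespace("stringdist", quietly = TRUE)) {
  # Jaro-Winkler distance matrix: rows follow nm1, columns follow nm2,
  # values lie between 0 (identical) and 1 (nothing in common)
  jw <- stringdist::stringdistmatrix(nm1, nm2, method = "jw", p = 0.1)

  # Best (smallest distance) candidate for each name in nm1
  best <- apply(jw, 1, which.min)

  jw_matches <- data.frame(
    s1name = nm1,
    s2name = nm2[best],
    jw     = jw[cbind(seq_along(nm1), best)]
  )
}
```

Since these distances are already on a 0 to 1 scale, a name matrix and an address matrix built this way can be combined cell by cell with ifelse, as in the normalized option above.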