R 文本挖掘:对来自数据帧的相似模式进行分组。

R text mining : Grouping of similar patterns from a dataframe.

我应用了 tm 包中的各种清理功能,如删除标点符号、数字、特殊字符、常用英文单词等,并得到如下所示的数据框。请记住,我没有主键,例如 cust_id 或 account_number 可以依赖

sno        names
001        SIRIS BLACK
002        JOHN DOE
003        STEPHEN HRYY
004        SIRIUS BLACK
005        SIRUS BLACK
006        JON DOE
007        STEPHEN HARRY
008        STIPHEN HURRY
009        JHN DOE 

看了上面的数据,真切的感受到了模式的相似性,名字也很相近。如何使用 R 的可用文本挖掘函数计算模式相等的百分比,以便最终获得具有所有唯一名称的数据框?

假设和缺点:

  1. 直截了当地假设唯一名称可能是字符数最多的名称,因为我拥有的原始数据在名称上有很多拼写错误。 (逻辑假设,也许会减少错别字的数量)

  2. agrep()函数在大字符串中搜索模式的近似匹配,问题是我实际上不知道模式是什么。

像这样对相似的字符串进行分组:

sno        names
001        SIRIS BLACK          
002        SIRIUS BLACK
003        SIRUS BLACK
004        JHN DOE
005        JOHN DOE
006        JON DOE
007        STEPHEN HARRY
008        STIPHEN HURRY
009        STEPHEN HRYY

终于明白了:

001     JOHN DOE
002     STEPHEN HARRY
003     STIPHEN HURRY
004     SIRIUS BLACK

对于 agrep 部分,这是一种方法 - 您可以使用参数来调整结果:

sim <- setNames(lapply(1:nrow(df), function(i) agrep(df$names[i], df$names, max.distance = list(all=2, insertions=2, deletions=2, substitutions=0))), df$names)
sim <- lapply(sim, function(x) unique(df$names[x]))
df$names2 <- sapply(sim, "[", 1)
df[!duplicated(df$names2), ]
#   sno         names        names2
# 1   1   SIRIS BLACK   SIRIS BLACK
# 2   2      JOHN DOE      JOHN DOE
# 3   3  STEPHEN HRYY  STEPHEN HRYY
# 8   8 STIPHEN HURRY STIPHEN HURRY

这是另一种方法。它使用 RecordLinkage 包并找到排序向量的最短形式。您可以调整阈值水平。

structure(list(sno = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
7L, 8L), .Label = c("JHN", "JOHN", "JON", "SIRIS", "SIRIUS", 
"SIRUS", "STEPHEN", "STIPHEN"), class = "factor"), names = structure(c(2L, 
2L, 2L, 1L, 1L, 1L, 3L, 4L, 5L), .Label = c("BLACK", "DOE", "HARRY", 
"HRYY", "HURRY"), class = "factor"), both.names = c("JHN DOE", 
"JOHN DOE", "JON DOE", "SIRIS BLACK", "SIRIUS BLACK", "SIRUS BLACK", 
"STEPHEN HARRY", "STEPHEN HRYY", "STIPHEN HURRY")), .Names = c("sno", 
"names", "both.names"), row.names = c("009", "002", "006", "001", 
"004", "005", "007", "003", "008"), class = "data.frame")

library("RecordLinkage")
compareJW <- function(string, vec, cutoff) {
  require(RecordLinkage)
  jarowinkler(string, vec) > cutoff
}

shortenFirms <- function(firms, cutoff) {
  shortnames <- firms[1]
  firms <- firms[-1]

  for (firm in firms) {
    if (is.na(firm)) { # no firm name, so short-circuit and add an NA
      shortnames <- c(shortnames, NA)
      next

    }
    unique.short <- unique(shortnames[!is.na(shortnames)])
    hits <- compareJW(firm, unique.short, cutoff)
    if (sum(hits) > 1) {
      warning(paste("cassifyFirms: more than one match for", firm))
      shortnames <- c(shortnames, NA)
    } else if (sum(hits) == 0) {
      shortnames <- c(shortnames, firm)
    } else {
      shortnames <- c(shortnames, unique.short[hits])
    }
  }
  shortnames
}

shortenFirms(df$both.names, 0.8)

shortenFirms(df$both.names, 0.8)

[1] "JHN DOE"       "JHN DOE"       "JHN DOE"       "SIRIS BLACK"   "SIRIS BLACK"   "SIRIS BLACK"   "STEPHEN HARRY"
[8] "STEPHEN HARRY" "STEPHEN HARRY"