R 文本挖掘:对来自数据帧的相似模式进行分组。
R text mining : Grouping of similar patterns from a dataframe.
我应用了 tm
包中的各种清理功能,如删除标点符号、数字、特殊字符、常用英文单词等,并得到如下所示的数据框。请记住,我没有主键,例如 cust_id 或 account_number 可以依赖
sno names
001 SIRIS BLACK
002 JOHN DOE
003 STEPHEN HRYY
004 SIRIUS BLACK
005 SIRUS BLACK
006 JON DOE
007 STEPHEN HARRY
008 STIPHEN HURRY
009 JHN DOE
看了上面的数据,真切的感受到了模式的相似性,名字也很相近。如何使用 R 的可用文本挖掘函数计算模式相等的百分比,以便最终获得具有所有唯一名称的数据框?
假设和缺点:
直截了当地假设唯一名称可能是字符数最多的名称,因为我拥有的原始数据在名称上有很多拼写错误。 (逻辑假设,也许会减少错别字的数量)
agrep()
函数在大字符串中搜索模式的近似匹配,问题是我实际上不知道模式是什么。
像这样对相似的字符串进行分组:
sno names
001 SIRIS BLACK
002 SIRIUS BLACK
003 SIRUS BLACK
004 JHN DOE
005 JOHN DOE
006 JON DOE
007 STEPHEN HARRY
008 STIPHEN HURRY
009 STEPHEN HRYY
终于明白了:
001 JOHN DOE
002 STEPHEN HARRY
003 STIPHEN HURRY
004 SIRIUS BLACK
对于 agrep
部分,这是一种方法 - 您可以使用参数来调整结果:
sim <- setNames(lapply(1:nrow(df), function(i) agrep(df$names[i], df$names, max.distance = list(all=2, insertions=2, deletions=2, substitutions=0))), df$names)
sim <- lapply(sim, function(x) unique(df$names[x]))
df$names2 <- sapply(sim, "[", 1)
df[!duplicated(df$names2), ]
# sno names names2
# 1 1 SIRIS BLACK SIRIS BLACK
# 2 2 JOHN DOE JOHN DOE
# 3 3 STEPHEN HRYY STEPHEN HRYY
# 8 8 STIPHEN HURRY STIPHEN HURRY
这是另一种方法。它使用 RecordLinkage 包并找到排序向量的最短形式。您可以调整阈值水平。
structure(list(sno = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
7L, 8L), .Label = c("JHN", "JOHN", "JON", "SIRIS", "SIRIUS",
"SIRUS", "STEPHEN", "STIPHEN"), class = "factor"), names = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 3L, 4L, 5L), .Label = c("BLACK", "DOE", "HARRY",
"HRYY", "HURRY"), class = "factor"), both.names = c("JHN DOE",
"JOHN DOE", "JON DOE", "SIRIS BLACK", "SIRIUS BLACK", "SIRUS BLACK",
"STEPHEN HARRY", "STEPHEN HRYY", "STIPHEN HURRY")), .Names = c("sno",
"names", "both.names"), row.names = c("009", "002", "006", "001",
"004", "005", "007", "003", "008"), class = "data.frame")
library("RecordLinkage")
compareJW <- function(string, vec, cutoff) {
require(RecordLinkage)
jarowinkler(string, vec) > cutoff
}
shortenFirms <- function(firms, cutoff) {
shortnames <- firms[1]
firms <- firms[-1]
for (firm in firms) {
if (is.na(firm)) { # no firm name, so short-circuit and add an NA
shortnames <- c(shortnames, NA)
next
}
unique.short <- unique(shortnames[!is.na(shortnames)])
hits <- compareJW(firm, unique.short, cutoff)
if (sum(hits) > 1) {
warning(paste("cassifyFirms: more than one match for", firm))
shortnames <- c(shortnames, NA)
} else if (sum(hits) == 0) {
shortnames <- c(shortnames, firm)
} else {
shortnames <- c(shortnames, unique.short[hits])
}
}
shortnames
}
shortenFirms(df$both.names, 0.8)
shortenFirms(df$both.names, 0.8)
[1] "JHN DOE" "JHN DOE" "JHN DOE" "SIRIS BLACK" "SIRIS BLACK" "SIRIS BLACK" "STEPHEN HARRY"
[8] "STEPHEN HARRY" "STEPHEN HARRY"
我应用了 tm
包中的各种清理功能,如删除标点符号、数字、特殊字符、常用英文单词等,并得到如下所示的数据框。请记住,我没有主键,例如 cust_id 或 account_number 可以依赖
sno names
001 SIRIS BLACK
002 JOHN DOE
003 STEPHEN HRYY
004 SIRIUS BLACK
005 SIRUS BLACK
006 JON DOE
007 STEPHEN HARRY
008 STIPHEN HURRY
009 JHN DOE
看了上面的数据,真切的感受到了模式的相似性,名字也很相近。如何使用 R 的可用文本挖掘函数计算模式相等的百分比,以便最终获得具有所有唯一名称的数据框?
假设和缺点:
直截了当地假设唯一名称可能是字符数最多的名称,因为我拥有的原始数据在名称上有很多拼写错误。 (逻辑假设,也许会减少错别字的数量)
agrep()
函数在大字符串中搜索模式的近似匹配,问题是我实际上不知道模式是什么。
像这样对相似的字符串进行分组:
sno names
001 SIRIS BLACK
002 SIRIUS BLACK
003 SIRUS BLACK
004 JHN DOE
005 JOHN DOE
006 JON DOE
007 STEPHEN HARRY
008 STIPHEN HURRY
009 STEPHEN HRYY
终于明白了:
001 JOHN DOE
002 STEPHEN HARRY
003 STIPHEN HURRY
004 SIRIUS BLACK
对于 agrep
部分,这是一种方法 - 您可以使用参数来调整结果:
sim <- setNames(lapply(1:nrow(df), function(i) agrep(df$names[i], df$names, max.distance = list(all=2, insertions=2, deletions=2, substitutions=0))), df$names)
sim <- lapply(sim, function(x) unique(df$names[x]))
df$names2 <- sapply(sim, "[", 1)
df[!duplicated(df$names2), ]
# sno names names2
# 1 1 SIRIS BLACK SIRIS BLACK
# 2 2 JOHN DOE JOHN DOE
# 3 3 STEPHEN HRYY STEPHEN HRYY
# 8 8 STIPHEN HURRY STIPHEN HURRY
这是另一种方法。它使用 RecordLinkage 包并找到排序向量的最短形式。您可以调整阈值水平。
structure(list(sno = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
7L, 8L), .Label = c("JHN", "JOHN", "JON", "SIRIS", "SIRIUS",
"SIRUS", "STEPHEN", "STIPHEN"), class = "factor"), names = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 3L, 4L, 5L), .Label = c("BLACK", "DOE", "HARRY",
"HRYY", "HURRY"), class = "factor"), both.names = c("JHN DOE",
"JOHN DOE", "JON DOE", "SIRIS BLACK", "SIRIUS BLACK", "SIRUS BLACK",
"STEPHEN HARRY", "STEPHEN HRYY", "STIPHEN HURRY")), .Names = c("sno",
"names", "both.names"), row.names = c("009", "002", "006", "001",
"004", "005", "007", "003", "008"), class = "data.frame")
library("RecordLinkage")
compareJW <- function(string, vec, cutoff) {
require(RecordLinkage)
jarowinkler(string, vec) > cutoff
}
shortenFirms <- function(firms, cutoff) {
shortnames <- firms[1]
firms <- firms[-1]
for (firm in firms) {
if (is.na(firm)) { # no firm name, so short-circuit and add an NA
shortnames <- c(shortnames, NA)
next
}
unique.short <- unique(shortnames[!is.na(shortnames)])
hits <- compareJW(firm, unique.short, cutoff)
if (sum(hits) > 1) {
warning(paste("cassifyFirms: more than one match for", firm))
shortnames <- c(shortnames, NA)
} else if (sum(hits) == 0) {
shortnames <- c(shortnames, firm)
} else {
shortnames <- c(shortnames, unique.short[hits])
}
}
shortnames
}
shortenFirms(df$both.names, 0.8)
shortenFirms(df$both.names, 0.8)
[1] "JHN DOE" "JHN DOE" "JHN DOE" "SIRIS BLACK" "SIRIS BLACK" "SIRIS BLACK" "STEPHEN HARRY"
[8] "STEPHEN HARRY" "STEPHEN HARRY"