R: Finding elements that match each other within a vector
I have a list of addresses. They were entered by different users, so the same address is written in many different ways. For example:
"andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh andheri pump house","weh andheri pump house et","weh, nr. pump house"
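The vector above has 6 addresses, and almost all of them refer to the same place. I am trying to find the matches among these addresses so that I can group them together and recode them.
I have tried agrep and the stringdist package. With agrep, I am not sure whether I should use each address as a pattern and match it against the rest. With the stringdist package I did the following: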
library(stringdist)
nsrpatt <- df$Address
x <- scan(what=character(), text = nsrpatt, sep=",")
x <- x[trimws(x)!= ""]
y <- ave(x, phonetic(x), FUN = function(.x) .x[1])
The above gives the following warning:
In phonetic(x) : soundex encountered 111 non-printable ASCII or non-ASCII
characters.
I am not sure whether I should remove these elements from the character vector or convert them to some other format.
I also tried agrep:
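for (i in seq_along(nsrpatt)) {
  # approximate-match each address against all addresses
  # (note: npat is overwritten on every pass)
  npat <- agrep(nsrpatt[i], df$Address, max.distance = 1, value = TRUE)
}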
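The character vector has about 25,000 elements, so this keeps running and brings the machine to a halt.
How can I efficiently find the closest match for each address?
You could run a small cluster analysis on your data.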
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
"weh, nr. pump house", "Wallstreet", "weh andheri pump house",
"Wall Street", "weh andheri pump house et", "andheri at weh pump house",
"andheri pump house(mt)")
First, you need a distance matrix.
# Levenshtein distance
e <- adist(na.omit(tolower(x)))
rownames(e) <- na.omit(x)
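To make the entries of that matrix concrete: adist() counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch on two pairs taken from the example vector (the names d_near and d_far are only illustrative):

```r
# Generalized Levenshtein distance between pairs of strings (utils::adist)
d_near <- adist("wall street", "wallstreet")          # differs by one deleted space
d_far  <- adist("wall street", "andheri pump house")  # a different address

d_near[1, 1]                 # 1
d_near[1, 1] < d_far[1, 1]   # TRUE: spelling variants sit closer than distinct addresses
```

Small distances therefore indicate spelling variants of one address, which is what the clustering step relies on.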
Then the cluster analysis can be run.
hc <- hclust(as.dist(e)) # find distance clusters
Work out the best cut point, for example graphically, and "cut the tree".
plot(hc)
# cut the tree at height 16, i.e. assign a cluster code to each group of similar strings
smly <- cutree(hc, h=16)
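If the number of distinct addresses is known in advance, cutree() can also be given a cluster count through its k argument instead of a height. A minimal sketch on the same example vector, assuming two clusters are wanted (by_k is an illustrative name; na.omit() is left out here because the example contains no missing values):

```r
x <- c("wall street", "Wall-street", "Wall ST", "andheri pump house",
       "weh, nr. pump house", "Wallstreet", "weh andheri pump house",
       "Wall Street", "weh andheri pump house et", "andheri at weh pump house",
       "andheri pump house(mt)")

e  <- adist(tolower(x))        # pairwise Levenshtein distances
hc <- hclust(as.dist(e))       # complete linkage is the hclust() default

by_k <- cutree(hc, k = 2)      # ask for exactly two clusters instead of a height
table(by_k)                    # cluster sizes, a quick sanity check
split(x, by_k)                 # members of each cluster, for inspection
```

Inspecting split(x, by_k) before recoding is a cheap way to catch a cut that merges or splits addresses incorrectly.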
Then you can build a key data frame, which you can use to check whether the matches are correct.
key <- data.frame(x=na.omit(x),
                  smly=factor(smly, labels=c("Wall Street", "Andheri Pump House")),
                  row.names=NULL) # key data frame
key
#                            x               smly
#  1               wall street        Wall Street
#  2               Wall-street        Wall Street
#  3                   Wall ST        Wall Street
#  4        andheri pump house Andheri Pump House
#  5       weh, nr. pump house Andheri Pump House
#  6                Wallstreet        Wall Street
#  7    weh andheri pump house Andheri Pump House
#  8               Wall Street        Wall Street
#  9 weh andheri pump house et Andheri Pump House
# 10 andheri at weh pump house Andheri Pump House
# 11    andheri pump house(mt) Andheri Pump House
Finally, replace your vector like this:
x <- key$smly
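For a vector of about 25,000 addresses, a full distance matrix has roughly 625 million entries (around 5 GB if stored as doubles), so it may help to cluster only the distinct spellings and then map the result back to the full vector. The following is a sketch under stated assumptions, not a tested recipe for the real data: addr is a hypothetical stand-in for df$Address, k = 2 only suits this toy input, and labelling each cluster by its most frequent spelling is one possible choice instead of typing the labels by hand.

```r
# Hypothetical raw input with repeated spellings (stand-in for df$Address)
addr <- c("wall street", "Wall Street", "wall street", "Wall-street",
          "andheri pump house", "weh andheri pump house",
          "andheri pump house", "andheri pump house(mt)")

clean <- trimws(tolower(addr))
u     <- unique(clean)                 # cluster each distinct spelling only once

hc    <- hclust(as.dist(adist(u)))
grp_u <- cutree(hc, k = 2)             # k = 2 is an assumption for this toy data

# Label every cluster with its most frequent spelling in the raw data
freq    <- table(clean)
label_u <- ave(u, grp_u, FUN = function(s) s[which.max(freq[s])])

# Map the labels back onto the full-length vector
recoded <- label_u[match(clean, u)]
recoded
```

The distance matrix then grows with the number of distinct spellings rather than the number of rows, which only helps if many addresses are exact repeats after lowercasing and trimming.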