Metaphone functions (like Soundex) and their use in R?
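I want to use Metaphone, Double Metaphone, Caverphone, Metaphone 3, Soundex and, if anyone has implemented it yet, NameX functions in R, so that I can categorize and summarize similar values and minimize data-cleansing work before analysis.
I am fully aware that each algorithm has its own strengths and weaknesses, and I would strongly prefer not to use Soundex, although it could still work if I can't find an alternative. As mentioned in this post, Harper matches a whole list of unrelated names under Soundex but should not under Metaphone, which gives better matches.
I am not sure which algorithm would best serve my purposes while still preserving some flexibility, so I want to try several of them and, before looking at the values, generate a table comparing the codes each one produces.
Surnames are not the subject of my initial analysis, but I think they make a good example: what I am really trying to do is treat all like-sounding words as the same value, simply by calling a function as the values are evaluated.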
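A few things I have already looked at:
- I am aware that I could write and call a C package with Rcpp, and there are even C solutions for Soundex on SE, but I have not written an R package before and would prefer to avoid re-inventing the wheel. Is there a simpler way, either directly in R or through an existing package that provides these functions?
- I am aware that the RecordLinkage and now stringdist packages have a Soundex function, but they have no form of Metaphone function.
So I am specifically looking for an answer on how to run a Metaphone / Caverphone function in R and get the resulting "value", so that I can group data values by it.
An additional caveat: I still consider myself pretty new to R, as I am not a daily user of it.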
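The algorithm is fairly straightforward, but I couldn't find an existing R package either. If you really need to do this in R, one short-term option is to install the Python module metaphone (pip install metaphone) and then use the rPython bridge to call it from R: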
library(rPython)
python.exec("from metaphone import doublemetaphone")
python.call("doublemetaphone", "architect")
[1] "ARKTKT" ""
It's not the most elegant solution, but it does get you metaphone operations in R.
Apache Commons has a codec library that also implements the metaphone algorithm:
library(rJava)
.jinit() # need to have commons-codec-1.10.jar in your CLASSPATH
mp <- .jnew("org.apache.commons.codec.language.Metaphone")
.jcall(mp,"S","metaphone", "architect")
[1] "ARXT"
You can wrap the above .jcall in an R function and use it like any other R function:
metaphone <- function(x) {
  .jcall(mp, "S", "metaphone", x)
}
sapply(c("abridgement", "stupendous"), metaphone)
## abridgement stupendous
## "ABRJ" "STPN"
The Java interface may also be more compatible across platforms.
Here is a more complete view of using the Java interface:
library(rJava)
.jinit()
mp <- .jnew("org.apache.commons.codec.language.Metaphone")
dmp <- .jnew("org.apache.commons.codec.language.DoubleMetaphone")
metaphone <- function(x) {
  .jcall(mp, "S", "metaphone", x)
}
double_metaphone <- function(x) {
  .jcall(dmp, "S", "doubleMetaphone", x)
}
words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan',
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith',
           'Smyth', 'Jessica', 'Joshua')
data.frame(metaphone = sapply(words, metaphone),
           double = sapply(words, double_metaphone))
##           metaphone double
## Catherine      K0RN   K0RN
## Katherine      K0RN   K0RN
## Katarina       KTRN   KTRN
## Johnathan      JN0N   JN0N
## Jonathan       JN0N   JN0N
## John             JN     JN
## Teresa          TRS    TRS
## Theresa         0RS    0RS
## Smith           SM0    SM0
## Smyth           SM0    SM0
## Jessica         JSK    JSK
## Joshua           JX     JX
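Since the question is ultimately about grouping values by their phonetic code, here is a minimal base R sketch of that step. It assumes the codes have already been computed; to stay self-contained, the metaphone column from the output above is typed in by hand rather than recomputed through rJava. The variable names (codes, groups, shared, counts) are only illustrative.

```r
# Metaphone codes copied from the table above (not recomputed here)
codes <- c(Catherine = "K0RN", Katherine = "K0RN", Katarina = "KTRN",
           Johnathan = "JN0N", Jonathan = "JN0N", John = "JN",
           Teresa = "TRS", Theresa = "0RS", Smith = "SM0",
           Smyth = "SM0", Jessica = "JSK", Joshua = "JX")

# Group the original spellings by their phonetic code
groups <- split(names(codes), codes)

# Keep only the codes shared by more than one spelling
shared <- groups[lengths(groups) > 1]
print(shared)

# Count how many values fall under each code
counts <- table(codes)
print(counts)
```

With real data, the same split() or table() call should work on a column of codes produced by whichever implementation you end up using.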
There is now an implementation of Double Metaphone in R, in the package PGRdup.
install.packages("PGRdup")  # the package name must be a quoted string
library(PGRdup)
words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan',
'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith',
'Smyth', 'Jessica', 'Joshua')
DoubleMetaphone(words)
$primary
[1] "K0RN" "K0RN" "KTRN" "JN0N" "JN0N" "JN" "TRS" "0RS" "SM0" "SM0" "JSK" "JX"
$alternate
[1] "KTRN" "KTRN" "KTRN" "ANTN" "ANTN" "AN" "TRS" "TRS" "XMT" "XMT" "ASK" "AX"
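Double Metaphone returns two codes per word, so one possible way to use them (an assumption about how you might want to match, not something PGRdup does for you) is to treat two words as alike when any of their codes coincide. The sketch below copies the primary and alternate vectors from the output above by hand; sounds_alike is a hypothetical helper name.

```r
# Codes copied from the DoubleMetaphone() output above
words <- c("Catherine", "Katherine", "Katarina", "Johnathan",
           "Jonathan", "John", "Teresa", "Theresa", "Smith",
           "Smyth", "Jessica", "Joshua")
primary   <- c("K0RN", "K0RN", "KTRN", "JN0N", "JN0N", "JN",
               "TRS", "0RS", "SM0", "SM0", "JSK", "JX")
alternate <- c("KTRN", "KTRN", "KTRN", "ANTN", "ANTN", "AN",
               "TRS", "TRS", "XMT", "XMT", "ASK", "AX")

# Hypothetical helper: TRUE when words i and j share at least one code
sounds_alike <- function(i, j) {
  length(intersect(c(primary[i], alternate[i]),
                   c(primary[j], alternate[j]))) > 0
}

# Pairwise comparison of all words
idx <- seq_along(words)
m <- outer(idx, idx, Vectorize(sounds_alike))
dimnames(m) <- list(words, words)

# Primaries differ here (TRS vs 0RS) but the alternates agree
print(m["Teresa", "Theresa"])
# Catherine's alternate equals Katarina's primary
print(m["Catherine", "Katarina"])
```

Matching on either code is more permissive than matching on the primary code alone, so it is worth checking the extra pairs it produces against your own data.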
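Here is a Caverphone interpretation. While it captures the cascading-rules approach, keep in mind that Caverphone was always intended as an example for a specific regional-accent context (though people do use it in a general way, since it makes a different set of trade-offs from most other algorithms, based on its own region). It was built to model the various accent groups of New Zealand in the late 1800s and early 1900s and the ways they might mistranscribe what each other said. So I would suggest: a) getting the unique characters in your data source to make sure you are handling them, b) considering a change to the final length limit to suit the names you are working with, and c) thinking about modelling your own regional accent mix.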
caverphonise <- function(x) {
# Convert to lowercase
x <- tolower(x)
# Remove anything not A-Z
x <- gsub("[^a-z]", "", x)
# If the name starts with
## cough make it cou2f
x <- gsub("^cough", "cou2f", x)
## rough make it rou2f
x <- gsub("^rough", "rou2f", x)
## tough make it tou2f
x <- gsub("^tough", "tou2f", x)
## enough make it enou2f
x <- gsub("^enough", "enou2f", x)
## gn make it 2n
x <- gsub("^gn", "2n", x)
# If the name ends with
## mb make it m2
x <- gsub("mb$", "m2", x)
# Replace
## cq with 2q
x <- gsub("cq", "2q", x)
## ci with si
x <- gsub("ci", "si", x)
## ce with se
x <- gsub("ce", "se", x)
## cy with sy
x <- gsub("cy", "sy", x)
## tch with 2ch
x <- gsub("tch", "2ch", x)
## c with k
x <- gsub("c", "k", x)
## q with k
x <- gsub("q", "k", x)
## x with k
x <- gsub("x", "k", x)
## v with f
x <- gsub("v", "f", x)
## dg with 2g
x <- gsub("dg", "2g", x)
## tio with sio
x <- gsub("tio", "sio", x)
## tia with sia
x <- gsub("tia", "sia", x)
## d with t
x <- gsub("d", "t", x)
## ph with fh
x <- gsub("ph", "fh", x)
## b with p
x <- gsub("b", "p", x)
## sh with s2
x <- gsub("sh", "s2", x)
## z with s
x <- gsub("z", "s", x)
## any initial vowel with an A
x <- gsub("^[aeiou]", "A", x)
## all other vowels with a 3
x <- gsub("[aeiou]", "3", x)
## 3gh3 with 3kh3
x <- gsub("3gh3", "3kh3", x)
## gh with 22
x <- gsub("gh", "22", x)
## g with k
x <- gsub("g", "k", x)
## groups of the letter s with a S
x <- gsub("s+", "S", x)
## groups of the letter t with a T
x <- gsub("t+", "T", x)
## groups of the letter p with a P
x <- gsub("p+", "P", x)
## groups of the letter k with a K
x <- gsub("k+", "K", x)
## groups of the letter f with a F
x <- gsub("f+", "F", x)
## groups of the letter m with a M
x <- gsub("m+", "M", x)
## groups of the letter n with a N
x <- gsub("n+", "N", x)
## w3 with W3
x <- gsub("w3", "W3", x)
## wy with Wy
x <- gsub("wy", "Wy", x)
## wh3 with Wh3
x <- gsub("wh3", "Wh3", x)
## why with Why
x <- gsub("why", "Why", x)
## w with 2
x <- gsub("w", "2", x)
## any initial h with an A
x <- gsub("^h", "A", x)
## all other occurrences of h with a 2
x <- gsub("h", "2", x)
## r3 with R3
x <- gsub("r3", "R3", x)
## ry with Ry
x <- gsub("ry", "Ry", x)
## r with 2
x <- gsub("r", "2", x)
## l3 with L3
x <- gsub("l3", "L3", x)
## ly with Ly
x <- gsub("ly", "Ly", x)
## l with 2
x <- gsub("l", "2", x)
## j with y
x <- gsub("j", "y", x)
## y3 with Y3
x <- gsub("y3", "Y3", x)
## y with 2
x <- gsub("y", "2", x)
# remove all
## 2s
x <- gsub("2", "", x)
## 3s
x <- gsub("3", "", x)
# put six 1s on the end
x <- paste(x,"111111", sep="")
# take the first six characters as the code (substr is vectorised over x)
substr(x, 1, 6)
}
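Suggestion (a) above, checking which characters actually occur in your data, could be sketched roughly as follows. The surnames vector is made-up sample data, and the check assumes the same lowercasing and a-z filter that caverphonise() applies in its first two steps.

```r
# Hypothetical sample data; replace with your own column of names
surnames <- c("O'Brien", "Smith-Jones", "Te Kanawa", "D'Arcy")

# Every distinct character, after the same lowercasing caverphonise() applies
chars <- sort(unique(unlist(strsplit(tolower(surnames), ""))))

# Characters outside a-z, which the gsub("[^a-z]", "", x) step would drop
dropped <- setdiff(chars, letters)
print(dropped)
```

If dropped contains characters that matter for pronunciation in your data, you would probably want to add rules for them before the filtering step rather than let them disappear silently.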
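I have been working on a package for this, called phonics, for a few months. I have implemented several common and not-so-common algorithms, including Caverphone, Caverphone2, Metaphone, and Soundex, along with a few others. There are a few more I plan to implement before calling it 1.0, but I have just submitted a release of the package to CRAN.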