R: stringdist or levenshtein.distance to replace strings
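I have a large dataset of about one million observations, keyed by a defined observation type. About 900,000 of those observations have a malformed observation type, with roughly 850 (incorrect) variants of the 50 accepted observation types.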
keys <- c("DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")
entries <- c("Day", "day", "SUNSET/DUSK", "DAYS", "dayy", "EVEN", "Evening", "early dusk", "late day", "nite", "red dawn", "Evening Sunset", "mid-night", "midnight", "midnite","DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")
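Using gsub for this is like digging a basement with a hand shovel, and in my case a broken shovel, because I am new to R and to complex regular expressions. The simple fallback (for me) is to write one gsub statement per accepted observation type, but that seems needlessly laborious since it would take 50 statements.
I would like to use levenshtein.distance or stringdist to replace each offending entry with the key at the shortest distance. Running z <- for (i in length(y)) { z[i] = levenshtein.distance(y[i], x)} does not work, because it tries to pass length(x) results to each y[i].
How can I return only the result with the minimum distance? I have seen that function(x) x[2] returns the second result in a series, but how do I get the lowest one?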
You could try:
library(stringdist)
m <- stringdistmatrix(entries, keys, method = "lv")
a <- keys[apply(m, 1, which.min)]
If you want to try different algorithms, see ?'stringdist-metrics'.
Alternatively, as @RHertel mentioned in the comments:
b <- keys[apply(adist(entries, keys), 1, which.min)]
From the adist() documentation:
Compute the approximate string distance between character vectors. The
distance is a generalized Levenshtein (edit) distance, giving the
minimal possibly weighted number of insertions, deletions and
substitutions needed to transform one string into another.
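For comparison, the loop from the question can also be repaired instead of replaced. The original has three problems: for (i in length(y)) runs only once (with i equal to length(y)), z <- for (...) assigns NULL because a for loop returns nothing, and the distance call returns one value per key while z[i] can hold only one. The following is a minimal sketch using base R's adist() in place of levenshtein.distance; the shortened entries vector is only for illustration.

```r
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT", "MIDNIGHT",
          "TWILIGHT", "DAWN", "SUNRISE", "MORNING")
# Illustrative subset of the entries from the question
entries <- c("Day", "dayy", "midnite", "DAY")

# Pre-allocate the result instead of assigning the loop itself to z
z <- character(length(entries))

# seq_along() visits every index; length() alone would give a single value
for (i in seq_along(entries)) {
  d <- adist(entries[i], keys)   # 1 x length(keys) matrix of edit distances
  z[i] <- keys[which.min(d)]     # keep only the key with the smallest distance
}
z
```

Note that which.min() returns the first minimum, so when two keys are equally close the one listed earlier in keys wins.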
The stringdistmatrix and adist approaches give the same result:
> identical(a, b)
#[1] TRUE
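With about a million rows but only around 850 distinct spellings, it may be worth computing distances on the unique values only and mapping the result back. The sketch below also upper-cases the entries before comparing (the keys are all upper case, and edit distance is case-sensitive) and leaves an entry as NA when even its nearest key is far away. The raw vector, the lookup name, and the max_dist cutoff of 3 are assumptions made for illustration, not values from the question.

```r
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT", "MIDNIGHT",
          "TWILIGHT", "DAWN", "SUNRISE", "MORNING")
# Hypothetical raw column; the real one would hold ~1e6 values
raw <- c("Day", "day", "midnite", "Day", "EVENING", "midnite", "dayy", "red dawn")

u <- unique(raw)                        # distinct spellings only
d <- adist(toupper(u), keys)            # rows: spellings, columns: keys
idx <- apply(d, 1, which.min)           # index of the nearest key per spelling
best <- d[cbind(seq_along(idx), idx)]   # distance to that nearest key

max_dist <- 3                           # assumed cutoff; tune on real data
lookup <- ifelse(best <= max_dist, keys[idx], NA_character_)
names(lookup) <- u                      # spelling -> cleaned key

cleaned <- unname(lookup[raw])          # map back to the full-length column
cleaned
```

Entries that come back as NA can then be reviewed by hand or handled with a few targeted rules, instead of being silently assigned to whichever key happens to be nearest.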