R - mgsub 问题:被替换的子字符串不是整个字符串
R - mgsub problem: substrings being replaced not whole strings
我已经从 USPS 下载了街道缩写。这是数据:
dput(usps_streets)
structure(list(common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"), usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls")), class = "data.frame", row.names = c(NA,
-503L))
我想用它们来处理街道地址和州。玩具数据:
a <- c("10900 harper ave", "12235 davis annex", "24 van cortland parkway")
为了将常用缩写转换为 usps 缩写(标准化数据),我构建了一个小函数:
mr_zip <- function(x){
x <-textclean::mgsub(usps_streets$common_abbrev, usps_streets$usps_abbrev, x, fixed = T,
order.pattern = T)
return(x)
}
当我将我的函数应用于我的数据时出现问题:
f <- sapply(a, mr_zip)
我得到了错误的结果:
"10900 harper avee" "1235 davis anx" "24 van cortland pkway"
因为我应该得到的是:
"10900 harper ave" "1235 davis anx" "24 van cortland pkwy"
我的问题:
- 当我在
mgsub
函数中指定 order.pattern = T
和 fixed = T
时,为什么会发生这种情况?
- 我该如何解决?
- 是否有替代方法在文本的多个替换模式中使用向量?
在此先致谢,欢迎提出所有建议。
编辑:感谢@RichieSacramento,我发现使用边界词确实有帮助,但在大型数据帧(> 400,000 行)上使用时该功能仍然非常慢。在 mgsub
中使用 safe = TRUE
会导致函数正常工作,但速度非常慢。需要一些快速的东西——因此赏金。
更新
这是 OP 现有问题的基准测试(从 借用测试数据,但使用 n <- 10000
)
> mb1
Unit: milliseconds
expr min lq mean median
f_MK_conv2(df$addresses) 1409.0643 1470.3992 1612.09037 1631.3014
f_MK_replaceString(df, addresses) 50.1582 54.3035 94.53149 62.5772
f_TIC1(df$addresses) 394.5972 420.3283 461.50675 447.6186
f_TIC2(df$addresses) 1579.1868 1852.6873 2052.28388 1964.8845
f_TIC3(df$addresses) 65.8436 71.5448 93.36210 84.9698
uq max neval
1710.3459 1898.6773 20
116.3108 264.2616 20
499.4052 626.9240 20
2246.5562 2916.2253 20
102.7689 183.5121 20
其中基准代码给出如下
f_MK_conv2 <- function(x) {
USPSv <- array(
data = USPS$usps_abbrev,
dimnames = list(USPS$common_abbrev)
)
USPS_conv2 <- function(x) {
t <- str_split(x, " ")
comm <- t[[1]][length(t[[1]])]
str_replace(x, comm, USPSv[comm])
}
Vectorize(USPS_conv2)(x)
}
f_MK_replaceString <- function(.data, value) {
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
ht.delete <- function(ht, key) rm(list = key, envir = ht, inherits = FALSE)
ht.delete <- Vectorize(ht.delete, "key")
addHashTable2 <- function(.x, .y, key, value) {
key <- enquo(key)
value <- enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.y))) {
stop(paste0(
"`.y` must contain `", as_label(key),
"` and `", as_label(value), "` columns"
))
}
if ((.y %>% distinct(!!key, !!value) %>% nrow()) !=
(.y %>% distinct(!!key) %>% nrow())) {
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value), " pairs!\n",
"The dictionary will only return the last values for a given key!"
))
}
ht <- ht.create()
ht %>% ht.insert(
.y %>% distinct(!!key, !!value) %>% pull(!!key),
.y %>% distinct(!!key, !!value) %>% pull(!!value)
)
attr(.x, "hashTab") <- ht
.x
}
.data <- .data %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
value <- enquo(value)
# Test whether the value variable is in .data
if (!(as_label(value) %in% names(.data))) {
stop(paste(
"The", as_label(value),
"variable does not exist in the .data table!"
))
}
# Dictionary attribute presence test
if (!("hashTab" %in% names(attributes(.data)))) {
stop(paste0(
"\nThere is no dictionary attribute in the .data table!\n",
"Use addHashTable or addHashTable2 to add a dictionary attribute."
))
}
txt <- .data %>% pull(!!value)
i <- sapply(strsplit(txt, ""), function(x) max(which(x == " ")))
txt <- paste0(
str_sub(txt, end = i),
ht.lookup(
attr(.data, "hashTab"),
str_sub(txt, start = i + 1)
)
)
.data %>% mutate(!!value := txt)
}
f_TIC1 <- function(x) {
sapply(
strsplit(x, " "),
function(x) {
with(USPS, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
}
f_TIC2 <- function(x) {
res <- c()
for (s in x) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, USPS$common_abbrev)
if (!is.na(k)) {
s <- with(
USPS,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
res
}
f_TIC3 <- function(x) {
x.split <- strsplit(x, " ")
lut <- with(USPS, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
)
}
f_TIC4 <- function(x) {
xb <- gsub("^.*\s+", "", x)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
paste0(gsub("\w+$", "", x), replace(xb, !is.na(rp), na.omit(rp)))
}
f_JM <- function(x) {
x$abbreviation <- gsub("^.* ", "", x$addresses)
setDT(x)
setDT(USPS)
x[USPS, abbreviation := usps_abbrev, on = .(abbreviation = common_abbrev)]
x$usps_abbreviation <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
}
set.seed(1111)
df <- randomAddresses(10000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df$addresses),
f_MK_replaceString(df, addresses),
f_JM(df),
f_TIC1(df$addresses),
f_TIC2(df$addresses),
f_TIC3(df$addresses),
f_TIC4(df$addresses),
times = 20L
)
ggplot2::autoplot(mb1)
可能的解决方案
也许以下基本 R 选项之一可以提供帮助
- 解决方案 1
f_TIC1 <- function(x) {
sapply(
strsplit(x, " "),
function(x) {
with(USPS, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
}
- 解决方案 2
f_TIC2 <- function(x) {
res <- c()
for (s in x) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, USPS$common_abbrev)
if (!is.na(k)) {
s <- with(
USPS,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
res
}
- 解决方案 3
f_TIC3 <- function(x) {
x.split <- strsplit(x, " ")
lut <- with(USPS, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
)
}
- 方案4(这是一个特例,即最后一个词的缩写)
f_TIC4 <- function(x) {
xb <- gsub("^.*\s+", "", x)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
paste0(gsub("\w+$", "", x), replace(xb, !is.na(rp), na.omit(rp)))
}
产出
[1] "10900 harper ave" "12235 davis anx" "24 van cortland pkwy"
那么让我们开始玩吧。
步骤 1
首先,我们会将您的数据加载到名为 USPS
.
的 tibble
中
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
USPS
输出
# A tibble: 503 x 2
common_abbrev usps_abbrev
<chr> <chr>
1 allee aly
2 alley aly
3 ally aly
4 aly aly
5 anex anx
6 annex anx
7 annx anx
8 anx anx
9 arc arc
10 arcade arc
# ... with 493 more rows
步骤 2
现在我们将把您的 USPS
table 转换为具有命名元素的向量。
USPSv = array(data = USPS$usps_abbrev,
dimnames= list(USPS$common_abbrev))
让我们看看它给我们带来了什么
USPSv['viadct']
# viadct
# "via"
USPSv['coves']
# coves
# "cvs"
看起来很吸引人。
步骤 3
现在让我们创建一个转换(矢量化)函数,它使用我们的 USPSv
向量和命名元素。
USPS_conv = function(x) {
comm = str_split(x, " ") %>% .[[1]] %>% .[length(.)]
str_replace(x, comm, USPSv[comm])
}
USPS_conv = Vectorize(USPS_conv)
让我们看看我们的 USPS_conv
是如何工作的。
USPS_conv("10900 harper coves")
# 10900 harper coves
# "10900 harper cvs"
USPS_conv("10900 harper viadct")
# 10900 harper viadct
# "10900 harper via"
很好,但是它会处理向量吗?
USPS_conv(c("10900 harper coves", "10900 harper viadct", "10900 harper ave"))
# 10900 harper coves 10900 harper viadct 10900 harper ave
# "10900 harper cvs" "10900 harper via" "10900 harper ave"
到目前为止一切都很顺利。
步骤 4
现在是时候在 mutate
函数中使用我们的 USPS_conv
函数了。
但是,我们需要一些输入数据。我们会自己生成它们。
n=10
set.seed(1111)
df = tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
df
输出
# A tibble: 10 x 1
addresses
<chr>
1 8995 davis crk
2 8527 davis tunnl
3 7663 von brown wall
4 3043 harper lake
5 9192 von brown grdn
6 120 marry rvr
7 72 von brown locks
8 8752 marry gardn
9 7754 davis corner
10 3745 davis jcts
让我们进行一次变异
df %>% mutate(addresses = USPS_conv(addresses))
输出
# A tibble: 10 x 1
addresses
<chr>
1 8995 davis crk
2 8527 davis tunl
3 7663 von brown wall
4 3043 harper lk
5 9192 von brown gdn
6 120 marry riv
7 72 von brown lcks
8 8752 marry gdn
9 7754 davis cor
10 3745 davis jcts
看起来还好吗?好像是最多的。
步骤 5
所以是时候对 1,000,000 个地址进行大测试了!
我们将像以前一样生成数据。
n=1000000
set.seed(1111)
df = tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
df
输出
# A tibble: 1,000,000 x 1
addresses
<chr>
1 8995 marry pass
2 8527 davis spng
3 7663 marry loaf
4 3043 davis common
5 9192 marry bnd
6 120 von brown corner
7 72 van cortland plains
8 8752 van cortland crcle
9 7754 von brown sqrs
10 3745 marry key
# ... with 999,990 more rows
那我们走吧。但是让我们立即测量需要多长时间。
start_time =Sys.time()
df %>% mutate(addresses = USPS_conv(addresses))
Sys.time()-start_time
#Time difference of 3.610211 mins
如您所见,我只用了不到 4 分钟。我不知道您是否期待更快的速度以及您是否对这次满意。我会等你的评论。
最后一分钟更新
事实证明,如果我们稍微更改其代码,USPS_conv
可以稍微加快。
USPS_conv2 = function(x) {
t = str_split(x, " ")
comm = t[[1]][length(t[[1]])]
str_replace(x, comm, USPSv[comm])
}
USPS_conv2 = Vectorize(USPS_conv2)
新的 USPS_conv2
函数运行速度稍快。
所有这些转化为将一百万条记录的变异时间减少到 3.3 分钟。
超级速度的大更新!!
我意识到我的第一个版本的答案虽然结构简单,但有点慢:-(。所以我决定想出更快的东西。我将在这里分享我的想法,但请注意,一些解决方案会有点“神奇”。
魔法辞典-环境
为了加快运算速度,我们需要创建一个字典,将键快速转换为值。我们将使用 R 中的环境创建它。
这是我们词典的一个小界面。
#Simple Dictionary (hash Table) Interface for R
ht.create = function() new.env()
ht.insert = function(ht, key, value) ht[[key]] <- value
ht.insert = Vectorize(ht.insert, c("key", "value"))
ht.lookup = function(ht, key) ht[[key]]
ht.lookup = Vectorize(ht.lookup, "key")
ht.delete = function(ht, key) rm(list=key,envir=ht,inherits=FALSE)
ht.delete = Vectorize(ht.delete, "key")
它是怎么发生的。我已经显示了。下面我将创建一个新的字典环境 ht.create()
,我将向其中添加两个元素“a1”和“a2”ht.insert
,其值分别为“va1”和“va2”。最后,我将使用这些 ht.lookup
键的值询问我的环境字典。
ht1 = ht.create()
ht.insert(ht1, "a1", "va1" )
ht1 %>% ht.insert("a2", "va2")
ht.lookup(ht1, "a1")
# a1
# "va1"
ht1 %>% ht.lookup("a2")
# a2
# "va2"
请注意函数 ht.insert
和 ht.lookup
是向量化的,这意味着我可以将整个向量添加到字典中。以同样的方式,我将能够通过给出整个向量来查询我的字典。
ht.insert(ht1, paste0("a", 1:10),paste0("va", 1:10))
ht1 %>% ht.insert( paste0("a", 11:20),paste0("va", 11:20))
ht.lookup(ht1, paste0("a", 10:1))
# a10 a9 a8 a7 a6 a5 a4 a3 a2 a1
# "va10" "va9" "va8" "va7" "va6" "va5" "va4" "va3" "va2" "va1"
ht1 %>% ht.lookup(paste0("a", 20:11))
# a20 a19 a18 a17 a16 a15 a14 a13 a12 a11
# "va20" "va19" "va18" "va17" "va16" "va15" "va14" "va13" "va12" "va11"
魔法属性
现在我们将执行一个函数,向选定的字典环境添加一个附加属性 table。
#Functions that add a dictionary attribute to tibble
addHashTable = function(.data, key, value){
key = enquo(key)
value = enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.data))) {
stop(paste0("`.data` must contain `", as_label(key),
"` and `", as_label(value), "` columns"))
}
if((.data %>% distinct(!!key, !!value) %>% nrow)!=
(.data %>% distinct(!!key) %>% nrow)){
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value)," pairs!\n",
"The dictionary will only return the last values for a given key!"))
}
ht = ht.create()
ht %>% ht.insert(.data %>% distinct(!!key, !!value) %>% pull(!!key),
.data %>% distinct(!!key, !!value) %>% pull(!!value))
attr(.data, "hashTab") = ht
.data
}
addHashTable2 = function(.x, .y, key, value){
key = enquo(key)
value = enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.y))) {
stop(paste0("`.y` must contain `", as_label(key),
"` and `", as_label(value), "` columns"))
}
if((.y %>% distinct(!!key, !!value) %>% nrow)!=
(.y %>% distinct(!!key) %>% nrow)){
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value)," pairs!\n",
"The dictionary will only return the last values for a given key!"))
}
ht = ht.create()
ht %>% ht.insert(.y %>% distinct(!!key, !!value) %>% pull(!!key),
.y %>% distinct(!!key, !!value) %>% pull(!!value))
attr(.x, "hashTab") = ht
.x
}
那里实际上有两个功能。 addHashTable
函数将 dictionary-environment 属性添加到从中获取键值对的同一 table。 addHashTable2
函数同样添加到字典环境 table,但从另一个 table.
检索密钥对
让我们看看 addHashTable
是如何工作的。
USPS = USPS %>% addHashTable(common_abbrev, usps_abbrev)
str(USPS)
# tibble [503 x 2] (S3: tbl_df/tbl/data.frame)
# $ common_abbrev: chr [1:503] "allee" "alley" "ally" "aly" ...
# $ usps_abbrev : chr [1:503] "aly" "aly" "aly" "aly" ...
# - attr(*, "hashTab")=<environment: 0x000000001591bbf0>
如您所见,USPS
table 中添加了一个指向 0x000000001591bbf0
环境的属性。
替换函数
我们需要创建一个函数,该函数将使用以这种方式添加的字典环境来替换,在这种情况下,将指定变量中的最后一个单词替换为字典中的相应值。在这里。
replaceString = function(.data, value){
value = enquo(value)
#Test whether the value variable is in .data
if(!(as_label(value) %in% names(.data))){
stop(paste("The", as_label(value),
"variable does not exist in the .data table!"))
}
#Dictionary attribute presence test
if(!("hashTab" %in% names(attributes(.data)))) {
stop(paste0(
"\nThere is no dictionary attribute in the .data table!\n",
"Use addHashTable or addHashTable2 to add a dictionary attribute."))
}
txt = .data %>% pull(!!value)
i = sapply(strsplit(txt, ""), function(x) max(which(x==" ")))
txt = paste0(str_sub(txt, end=i),
ht.lookup(attr(.data, "hashTab"),
str_sub(txt, start=i+1)))
.data %>% mutate(!!value := txt)
}
第一次测试
第一篇文字的时间到了。为了避免复制代码,我添加了一个 returns 一个带有随机选择地址的 table 的小函数。
randomAddresses = function(n){
tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
}
set.seed(1111)
df = randomAddresses(10)
df
# # A tibble: 10 x 1
# addresses
# <chr>
# 1 74 marry forges
# 2 787 von brown knol
# 3 2755 van cortland summit
# 4 9405 harper plaza
# 5 5376 marry pass
# 6 1857 marry trailer
# 7 9810 von brown drv
# 8 7984 davis garden
# 9 9110 marry alley
# 10 6458 von brown row
是时候使用我们神奇的文本替换功能了。但是,请记住先将字典环境添加到 table。
df = df %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
df %>% replaceString(addresses)
# A tibble: 10 x 1
# addresses
# <chr>
# 1 74 marry frgs
# 2 787 von brown knl
# 3 2755 van cortland smt
# 4 9405 harper plz
# 5 5376 marry pass
# 6 1857 marry trlr
# 7 9810 von brown dr
# 8 7984 davis gdn
# 9 9110 marry aly
# 10 6458 von brown row
看起来可行!
大考验
嗯,没什么好等的。现在让我们在具有 百万行 的 table 上尝试一下。
让我们立即测量绘制地址和添加字典环境需要多长时间。
start_time =Sys.time()
df = randomAddresses(1000000)
df = df %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
Sys.time()-start_time
#Time difference of 1.56609 secs
输出
df
# A tibble: 1,000,000 x 1
# addresses
# <chr>
# 1 8995 marry pass
# 2 8527 davis spng
# 3 7663 marry loaf
# 4 3043 davis common
# 5 9192 marry bnd
# 6 120 von brown corner
# 7 72 van cortland plains
# 8 8752 van cortland crcle
# 9 7754 von brown sqrs
# 10 3745 marry key
# # ... with 999,990 more rows
1.6 秒可能不算多。然而,最大的问题是需要多长时间来替换缩写。
start_time =Sys.time()
df = df %>% replaceString(addresses)
Sys.time()-start_time
#Time difference of 8.316476 secs
输出
# A tibble: 1,000,000 x 1
# addresses
# <chr>
# 1 8995 marry pass
# 2 8527 davis spg
# 3 7663 marry lf
# 4 3043 davis cmn
# 5 9192 marry bnd
# 6 120 von brown cor
# 7 72 van cortland plns
# 8 8752 van cortland cir
# 9 7754 von brown sqs
# 10 3745 marry ky
# # ... with 999,990 more rows
砰!!我们还有 8 秒 !!
我确信 R 中无法实现更快的机制。
@ThomasIsCoding 的小更新
下面是一个小的基准测试。请注意,我从@ThomasIsCoding.
那里借用了函数 f_MK_conv2
、 f_TIC1
和 f_TIC2
的代码
set.seed(1111)
df = randomAddresses(10000)
df = df %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
library(microbenchmark)
mb1 = microbenchmark(
f_MK_conv2(df$addresses),
f_TIC1(df$addresses),
f_TIC2(df$addresses),
replaceString(df, addresses),
times = 20L
)
ggplot2::autoplot(mb1)
更新:
我花了一些时间调整我现有的答案(如下),我相信这是最快的方法。此外,值得注意的是,如果您将 perl = TRUE
添加到 f_JM 和 TIC4 中的 gsub,您会在本示例中明显提高速度(可能不适用于 'real world' 数据)。我的回答还有一个重要的警告,因为它基于地址中最后一个术语的缩写词(例如 TIC1、TIC2 和 TIC3 不依赖于该假设)。
非常感谢@Marek 和@TIC 提供的基准测试代码和建设性意见:
## Benchmarking with updated f_JM() and TIC4()
library(data.table)
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
randomAddresses = function(n){
tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
}
set.seed(1111)
df = randomAddresses(10)
USPS_conv2 = function(x, y) {
t = str_split(x, " ")
comm = t[[1]][length(t[[1]])]
str_replace(x, comm, y[comm])
}
USPS_conv2 = Vectorize(USPS_conv2, "x")
f_MK_conv2 <- function(x, y) {
x %>% mutate(
addresses = USPS_conv2(addresses,
array(data = y$usps_abbrev, dimnames = list(y$common_abbrev))))
}
f_MK_conv2(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
ht.delete <- function(ht, key) rm(list = key, envir = ht, inherits = FALSE)
ht.delete <- Vectorize(ht.delete, "key")
f_MK_replaceString <- function(x, y) {
ht <- ht.create()
ht.insert(ht, y$common_abbrev, y$usps_abbrev)
txt <- x$addresses
i <- sapply(strsplit(txt, ""), function(x) max(which(x == " ")))
txt <- paste0(
str_sub(txt, end = i),
ht.lookup(ht, str_sub(txt, start = i + 1))
)
x %>% mutate(addresses = txt)
}
f_MK_replaceString(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC1 <- function(x, y) {
x %>% mutate(addresses = sapply(
strsplit(x$addresses, " "),
function(x) {
with(y, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
)
}
f_TIC1(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC2 <- function(x, y) {
res <- c()
for (s in x$addresses) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, y$common_abbrev)
if (!is.na(k)) {
s <- with(
y,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
x %>% mutate(addresses = res)
}
f_TIC2(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC3 <- function(x, y) {
x.split <- strsplit(x$addresses, " ")
lut <- with(y, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
x %>% mutate(addresses = tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
))
}
f_TIC3(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC4 <- function(x, y) {
xb <- gsub("^.*\s+", "", x$addresses, perl = TRUE)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
x %>% mutate(addresses = paste0(gsub("\w+$", "", x$addresses), replace(xb, !is.na(rp), na.omit(rp))))
}
f_TIC4(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_JM <- function(x, y) {
x$abbreviation <- gsub("^.* ", "", x$addresses, perl = TRUE)
setDT(x)
setDT(y)
x[y, abbreviation := usps_abbrev, on = .(abbreviation = common_abbrev)]
x$addresses <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
x$abbreviation <- NULL
return(as_tibble(x))
}
f_JM(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
set.seed(1111)
df = randomAddresses(100)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(1000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(10000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(100000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(1000000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
由 reprex package (v2.0.1)
于 2021-11-04 创建
原文:
出色的答案 and !经过一些调整和基准测试后,我认为这种 data.table 'split/lookup-replace/paste' 方法可能更快:
library(tidyverse)
library(data.table)
n=1000000
set.seed(1111)
df = tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
df
#> # A tibble: 1,000,000 × 1
#> addresses
#> <chr>
#> 1 8995 marry pass
#> 2 8527 davis spng
#> 3 7663 marry loaf
#> 4 3043 davis common
#> 5 9192 marry bnd
#> 6 120 von brown corner
#> 7 72 van cortland plains
#> 8 8752 van cortland crcle
#> 9 7754 von brown sqrs
#> 10 3745 marry key
#> # … with 999,990 more rows
start_time =Sys.time()
df$abbreviation <- gsub("^.* ", "", df$addresses)
setDT(df)
setDT(USPS)
df[USPS, abbreviation:=usps_abbrev, on=.(abbreviation=common_abbrev)]
df$usps_abbreviation <- paste(str_extract(df$addresses, "^.* "), df$abbreviation, sep = "")
Sys.time()-start_time
#> Time difference of 2.804245 secs
df
#> addresses abbreviation usps_abbreviation
#> 1: 8995 marry pass pass 8995 marry pass
#> 2: 8527 davis spng spg 8527 davis spg
#> 3: 7663 marry loaf lf 7663 marry lf
#> 4: 3043 davis common cmn 3043 davis cmn
#> 5: 9192 marry bnd bnd 9192 marry bnd
#> ---
#> 999996: 1379 marry vdct via 1379 marry via
#> 999997: 237 harper avnue ave 237 harper ave
#> 999998: 7592 davis riv riv 7592 davis riv
#> 999999: 4963 marry junction jct 4963 marry jct
#> 1000000: 813 harper bluf blf 813 harper blf
由 reprex package (v2.0.1)
于 2021-11-03 创建
编辑
我更改了 dt_func()
以产生与 Marek 的函数相同的输出(更公平的比较)并且它仍然非常快:
set.seed(1111)
df <- randomAddresses(10000)
dt_func <- function(x) {
x$abbreviation <- gsub("^.* ", "", x$addresses)
setDT(x)
setDT(USPS)
x[USPS, abbreviation:=usps_abbrev, on=.(abbreviation=common_abbrev)]
x$addresses <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
x$abbreviation <- NULL
return(as_tibble(x))
}
比较输出:
df2 <- f_MK_replaceString(df, addresses)
df3 <- dt_func(df)
dplyr::all_equal(df2, df3)
#> [1] TRUE
所有感兴趣的人的最新更新
我正在写一个额外的答案,因为我原来的答案不能容纳这么长的文本和代码了。
亲爱的同事们,下面我将这里创建的所有函数收集在一个集体代码块中,这样任何人都可以尝试一下,而不必将其与多个答案结合起来。
首先,我统一了所有函数,使每个函数在输入端接受两个参数,在输出端 returns 修改后的 tibble。我还将所有内部函数移到了处理函数之外。
最后,我对包含 100、1,000、10,000、100,000 和 1,000,000 行的表执行了基准测试。
这是全部代码
library(tidyverse)
library(data.table)
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
randomAddresses = function(n){
tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
}
set.seed(1111)
df = randomAddresses(10)
USPS_conv2 = function(x, y) {
t = str_split(x, " ")
comm = t[[1]][length(t[[1]])]
str_replace(x, comm, y[comm])
}
USPS_conv2 = Vectorize(USPS_conv2, "x")
f_MK_conv2 <- function(x, y) {
x %>% mutate(
addresses = USPS_conv2(addresses,
array(data = y$usps_abbrev, dimnames = list(y$common_abbrev))))
}
f_MK_conv2(df, USPS)
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
ht.delete <- function(ht, key) rm(list = key, envir = ht, inherits = FALSE)
ht.delete <- Vectorize(ht.delete, "key")
f_MK_replaceString <- function(x, y) {
ht <- ht.create()
ht.insert(ht, y$common_abbrev, y$usps_abbrev)
txt <- x$addresses
i <- sapply(strsplit(txt, ""), function(x) max(which(x == " ")))
txt <- paste0(
str_sub(txt, end = i),
ht.lookup(ht, str_sub(txt, start = i + 1))
)
x %>% mutate(addresses = txt)
}
f_MK_replaceString(df, USPS)
f_TIC1 <- function(x, y) {
x %>% mutate(addresses = sapply(
strsplit(x$addresses, " "),
function(x) {
with(y, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
)
}
f_TIC1(df, USPS)
f_TIC2 <- function(x, y) {
res <- c()
for (s in x$addresses) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, y$common_abbrev)
if (!is.na(k)) {
s <- with(
y,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
x %>% mutate(addresses = res)
}
f_TIC2(df, USPS)
f_TIC3 <- function(x, y) {
x.split <- strsplit(x$addresses, " ")
lut <- with(y, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
x %>% mutate(addresses = tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
))
}
f_TIC3(df, USPS)
f_TIC4 <- function(x, y) {
xb <- gsub("^.*\s+", "", x$addresses)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
x %>% mutate(addresses = paste0(gsub("\w+$", "", x$addresses), replace(xb, !is.na(rp), na.omit(rp))))
}
f_TIC4(df, USPS)
f_JM <- function(x, y) {
x$abbreviation <- gsub("^.* ", "", x$addresses)
setDT(x)
setDT(y)
x[y, abbreviation := usps_abbrev, on = .(abbreviation = common_abbrev)]
x$addresses <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
x$abbreviation <- NULL
return(as_tibble(x))
}
f_JM(df, USPS)
set.seed(1111)
df = randomAddresses(100)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(1000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(10000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(100000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(1000000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
现在以图表形式显示结果
结论和总结时间
@jared_mamrot - 你完全正确。 data.table
太棒了!!
@ThomasIsCoding - f_TIC4
太棒了。它的简单是美丽的!!
@AnyoneWhoComesBy - 恭喜你读完了这篇文章。相信你也能在这里学到很多东西!!
特别是@jvalenti
这是一个特殊的答案,您可以在其中找到修改后的函数和您的任务所需的所有代码。
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
randomAddresses = function(n){
tibble(
addresses =
replicate(
n,
sample(c(sample(10:10000, 1, replace = TRUE) %>% paste0,
sample(c("harper", "davis", "van cortland", "marry", "von brown"), 1),
sample(USPS$common_abbrev, 1)), 3) %>% paste(collapse = " ")
)
)
}
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
addHashTable2 = function(.x, .y, key, value){
key = enquo(key)
value = enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.y))) {
stop(paste0("`.y` must contain `", as_label(key),
"` and `", as_label(value), "` columns"))
}
if((.y %>% distinct(!!key, !!value) %>% nrow)!=
(.y %>% distinct(!!key) %>% nrow)){
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value)," pairs!\n",
"The dictionary will only return the last values for a given key!"))
}
ht = ht.create()
ht %>% ht.insert(.y %>% distinct(!!key, !!value) %>% pull(!!key),
.y %>% distinct(!!key, !!value) %>% pull(!!value))
attr(.x, "hashTab") = ht
.x
}
replaceString = function(.data, value){
value = enquo(value)
#Test whether the value variable is in .data
if(!(as_label(value) %in% names(.data))){
stop(paste("The", as_label(value),
"variable does not exist in the .data table!"))
}
#Dictionary attribute presence test
if(!("hashTab" %in% names(attributes(.data)))) {
stop(paste0(
"\nThere is no dictionary attribute in the .data table!\n",
"Use addHashTable or addHashTable2 to add a dictionary attribute."))
}
ht = attr(.data, "hashTab")
txtRep = function(txt){
txt = str_split(txt, " ")[[1]]
httxt = ht.lookup(ht, txt)
txt[httxt!="NULL"] = httxt[httxt!="NULL"]
paste(txt, collapse = " ")
}
.data %>% rowwise(!!value) %>%
mutate(!!value := txtRep(!!value))
}
replaceString
功能已修改为替换缩写,无论它们在句子中的什么位置。
查看使用方法。
set.seed(1111)
df=randomAddresses(10)
df
输出
# A tibble: 10 x 1
addresses
<chr>
1 marry wall 8995
2 cen 9192 marry
3 bayoo 3745 davis
4 marry hollows 4104
5 grdn 7162 marry
6 lck harper 1211
7 9405 van cortland knol
8 7984 von brown viadct
9 4365 von brown rue
10 6399 von brown mssn
现在我们要修改这个tibble
。
df %>% addHashTable2(USPS, common_abbrev, usps_abbrev) %>%
replaceString(addresses)
输出
# A tibble: 10 x 1
# Rowwise: addresses
addresses
<chr>
1 marry wall 8995
2 ctr 9192 marry
3 byu 3745 davis
4 marry holw 4104
5 gdn 7162 marry
6 lck harper 1211
7 9405 van cortland knl
8 7984 von brown via
9 4365 von brown rue
10 6399 von brown msn
祝你好运,大数据快速突变!!
我已经从 USPS 下载了街道缩写。这是数据:
dput(usps_streets)
structure(list(common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"), usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls")), class = "data.frame", row.names = c(NA,
-503L))
我想用它们来处理街道地址和州。玩具数据:
a <- c("10900 harper ave", "12235 davis annex", "24 van cortland parkway")
为了将常用缩写转换为 usps 缩写(标准化数据),我构建了一个小函数:
mr_zip <- function(x){
x <-textclean::mgsub(usps_streets$common_abbrev, usps_streets$usps_abbrev, x, fixed = T,
order.pattern = T)
return(x)
}
当我将我的函数应用于我的数据时出现问题:
f <- sapply(a, mr_zip)
我得到了错误的结果:
"10900 harper avee" "1235 davis anx" "24 van cortland pkway"
因为我应该得到的是:
"10900 harper ave" "1235 davis anx" "24 van cortland pkwy"
我的问题:
- 当我在
mgsub
函数中指定order.pattern = T
和fixed = T
时,为什么会发生这种情况? - 我该如何解决?
- 是否有替代方法在文本的多个替换模式中使用向量?
在此先致谢,欢迎提出所有建议。
编辑:感谢@RichieSacramento,我发现使用边界词确实有帮助,但在大型数据帧(> 400,000 行)上使用时该功能仍然非常慢。在 mgsub
中使用 safe = TRUE
会导致函数正常工作,但速度非常慢。需要一些快速的东西——因此赏金。
更新
这是 OP 现有问题的基准测试(从 n <- 10000
)
> mb1
Unit: milliseconds
expr min lq mean median
f_MK_conv2(df$addresses) 1409.0643 1470.3992 1612.09037 1631.3014
f_MK_replaceString(df, addresses) 50.1582 54.3035 94.53149 62.5772
f_TIC1(df$addresses) 394.5972 420.3283 461.50675 447.6186
f_TIC2(df$addresses) 1579.1868 1852.6873 2052.28388 1964.8845
f_TIC3(df$addresses) 65.8436 71.5448 93.36210 84.9698
uq max neval
1710.3459 1898.6773 20
116.3108 264.2616 20
499.4052 626.9240 20
2246.5562 2916.2253 20
102.7689 183.5121 20
其中基准代码给出如下
f_MK_conv2 <- function(x) {
USPSv <- array(
data = USPS$usps_abbrev,
dimnames = list(USPS$common_abbrev)
)
USPS_conv2 <- function(x) {
t <- str_split(x, " ")
comm <- t[[1]][length(t[[1]])]
str_replace(x, comm, USPSv[comm])
}
Vectorize(USPS_conv2)(x)
}
f_MK_replaceString <- function(.data, value) {
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
ht.delete <- function(ht, key) rm(list = key, envir = ht, inherits = FALSE)
ht.delete <- Vectorize(ht.delete, "key")
addHashTable2 <- function(.x, .y, key, value) {
key <- enquo(key)
value <- enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.y))) {
stop(paste0(
"`.y` must contain `", as_label(key),
"` and `", as_label(value), "` columns"
))
}
if ((.y %>% distinct(!!key, !!value) %>% nrow()) !=
(.y %>% distinct(!!key) %>% nrow())) {
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value), " pairs!\n",
"The dictionary will only return the last values for a given key!"
))
}
ht <- ht.create()
ht %>% ht.insert(
.y %>% distinct(!!key, !!value) %>% pull(!!key),
.y %>% distinct(!!key, !!value) %>% pull(!!value)
)
attr(.x, "hashTab") <- ht
.x
}
.data <- .data %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
value <- enquo(value)
# Test whether the value variable is in .data
if (!(as_label(value) %in% names(.data))) {
stop(paste(
"The", as_label(value),
"variable does not exist in the .data table!"
))
}
# Dictionary attribute presence test
if (!("hashTab" %in% names(attributes(.data)))) {
stop(paste0(
"\nThere is no dictionary attribute in the .data table!\n",
"Use addHashTable or addHashTable2 to add a dictionary attribute."
))
}
txt <- .data %>% pull(!!value)
i <- sapply(strsplit(txt, ""), function(x) max(which(x == " ")))
txt <- paste0(
str_sub(txt, end = i),
ht.lookup(
attr(.data, "hashTab"),
str_sub(txt, start = i + 1)
)
)
.data %>% mutate(!!value := txt)
}
f_TIC1 <- function(x) {
sapply(
strsplit(x, " "),
function(x) {
with(USPS, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
}
f_TIC2 <- function(x) {
res <- c()
for (s in x) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, USPS$common_abbrev)
if (!is.na(k)) {
s <- with(
USPS,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
res
}
f_TIC3 <- function(x) {
x.split <- strsplit(x, " ")
lut <- with(USPS, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
)
}
f_TIC4 <- function(x) {
xb <- gsub("^.*\s+", "", x)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
paste0(gsub("\w+$", "", x), replace(xb, !is.na(rp), na.omit(rp)))
}
f_JM <- function(x) {
x$abbreviation <- gsub("^.* ", "", x$addresses)
setDT(x)
setDT(USPS)
x[USPS, abbreviation := usps_abbrev, on = .(abbreviation = common_abbrev)]
x$usps_abbreviation <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
}
set.seed(1111)
df <- randomAddresses(10000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df$addresses),
f_MK_replaceString(df, addresses),
f_JM(df),
f_TIC1(df$addresses),
f_TIC2(df$addresses),
f_TIC3(df$addresses),
f_TIC4(df$addresses),
times = 20L
)
ggplot2::autoplot(mb1)
可能的解决方案
也许以下基本 R 选项之一可以提供帮助
- 解决方案 1
f_TIC1 <- function(x) {
sapply(
strsplit(x, " "),
function(x) {
with(USPS, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
}
- 解决方案 2
f_TIC2 <- function(x) {
res <- c()
for (s in x) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, USPS$common_abbrev)
if (!is.na(k)) {
s <- with(
USPS,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
res
}
- 解决方案 3
f_TIC3 <- function(x) {
x.split <- strsplit(x, " ")
lut <- with(USPS, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
)
}
- 方案4(这是一个特例,即最后一个词的缩写)
f_TIC4 <- function(x) {
xb <- gsub("^.*\s+", "", x)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
paste0(gsub("\w+$", "", x), replace(xb, !is.na(rp), na.omit(rp)))
}
产出
[1] "10900 harper ave" "12235 davis anx" "24 van cortland pkwy"
那么让我们开始玩吧。
步骤 1
首先,我们会将您的数据加载到名为 USPS
.
tibble
中
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
USPS
输出
# A tibble: 503 x 2
common_abbrev usps_abbrev
<chr> <chr>
1 allee aly
2 alley aly
3 ally aly
4 aly aly
5 anex anx
6 annex anx
7 annx anx
8 anx anx
9 arc arc
10 arcade arc
# ... with 493 more rows
步骤 2
现在我们将把您的 USPS
table 转换为具有命名元素的向量。
USPSv = array(data = USPS$usps_abbrev,
dimnames= list(USPS$common_abbrev))
让我们看看它给我们带来了什么
USPSv['viadct']
# viadct
# "via"
USPSv['coves']
# coves
# "cvs"
看起来很吸引人。
步骤 3
现在让我们创建一个转换(矢量化)函数,它使用我们的 USPSv
向量和命名元素。
USPS_conv = function(x) {
comm = str_split(x, " ") %>% .[[1]] %>% .[length(.)]
str_replace(x, comm, USPSv[comm])
}
USPS_conv = Vectorize(USPS_conv)
让我们看看我们的 USPS_conv
是如何工作的。
USPS_conv("10900 harper coves")
# 10900 harper coves
# "10900 harper cvs"
USPS_conv("10900 harper viadct")
# 10900 harper viadct
# "10900 harper via"
很好,但是它会处理向量吗?
USPS_conv(c("10900 harper coves", "10900 harper viadct", "10900 harper ave"))
# 10900 harper coves 10900 harper viadct 10900 harper ave
# "10900 harper cvs" "10900 harper via" "10900 harper ave"
到目前为止一切都很顺利。
步骤 4
现在是时候在 mutate
函数中使用我们的 USPS_conv
函数了。
但是,我们需要一些输入数据。我们会自己生成它们。
n=10
set.seed(1111)
df = tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
df
输出
# A tibble: 10 x 1
addresses
<chr>
1 8995 davis crk
2 8527 davis tunnl
3 7663 von brown wall
4 3043 harper lake
5 9192 von brown grdn
6 120 marry rvr
7 72 von brown locks
8 8752 marry gardn
9 7754 davis corner
10 3745 davis jcts
让我们进行一次变异
df %>% mutate(addresses = USPS_conv(addresses))
输出
# A tibble: 10 x 1
addresses
<chr>
1 8995 davis crk
2 8527 davis tunl
3 7663 von brown wall
4 3043 harper lk
5 9192 von brown gdn
6 120 marry riv
7 72 von brown lcks
8 8752 marry gdn
9 7754 davis cor
10 3745 davis jcts
看起来还好吗?好像是最多的。
步骤 5 所以是时候对 1,000,000 个地址进行大测试了! 我们将像以前一样生成数据。
n=1000000
set.seed(1111)
df = tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
df
输出
# A tibble: 1,000,000 x 1
addresses
<chr>
1 8995 marry pass
2 8527 davis spng
3 7663 marry loaf
4 3043 davis common
5 9192 marry bnd
6 120 von brown corner
7 72 van cortland plains
8 8752 van cortland crcle
9 7754 von brown sqrs
10 3745 marry key
# ... with 999,990 more rows
那我们走吧。但是让我们立即测量需要多长时间。
start_time =Sys.time()
df %>% mutate(addresses = USPS_conv(addresses))
Sys.time()-start_time
#Time difference of 3.610211 mins
如您所见,我只用了不到 4 分钟。我不知道您是否期待更快的速度以及您是否对这次满意。我会等你的评论。
最后一分钟更新
事实证明,如果我们稍微更改其代码,USPS_conv
可以稍微加快。
USPS_conv2 = function(x) {
t = str_split(x, " ")
comm = t[[1]][length(t[[1]])]
str_replace(x, comm, USPSv[comm])
}
USPS_conv2 = Vectorize(USPS_conv2)
新的 USPS_conv2
函数运行速度稍快。
所有这些转化为将一百万条记录的变异时间减少到 3.3 分钟。
超级速度的大更新!!
我意识到我的第一个版本的答案虽然结构简单,但有点慢:-(。所以我决定想出更快的东西。我将在这里分享我的想法,但请注意,一些解决方案会有点“神奇”。
魔法辞典-环境
为了加快运算速度,我们需要创建一个字典,将键快速转换为值。我们将使用 R 中的环境创建它。
这是我们词典的一个小界面。
#Simple Dictionary (hash Table) Interface for R
ht.create = function() new.env()
ht.insert = function(ht, key, value) ht[[key]] <- value
ht.insert = Vectorize(ht.insert, c("key", "value"))
ht.lookup = function(ht, key) ht[[key]]
ht.lookup = Vectorize(ht.lookup, "key")
ht.delete = function(ht, key) rm(list=key,envir=ht,inherits=FALSE)
ht.delete = Vectorize(ht.delete, "key")
它是怎么发生的。我已经显示了。下面我将创建一个新的字典环境 ht.create()
,我将向其中添加两个元素“a1”和“a2”ht.insert
,其值分别为“va1”和“va2”。最后,我将使用这些 ht.lookup
键的值询问我的环境字典。
ht1 = ht.create()
ht.insert(ht1, "a1", "va1" )
ht1 %>% ht.insert("a2", "va2")
ht.lookup(ht1, "a1")
# a1
# "va1"
ht1 %>% ht.lookup("a2")
# a2
# "va2"
请注意函数 ht.insert
和 ht.lookup
是向量化的,这意味着我可以将整个向量添加到字典中。以同样的方式,我将能够通过给出整个向量来查询我的字典。
ht.insert(ht1, paste0("a", 1:10),paste0("va", 1:10))
ht1 %>% ht.insert( paste0("a", 11:20),paste0("va", 11:20))
ht.lookup(ht1, paste0("a", 10:1))
# a10 a9 a8 a7 a6 a5 a4 a3 a2 a1
# "va10" "va9" "va8" "va7" "va6" "va5" "va4" "va3" "va2" "va1"
ht1 %>% ht.lookup(paste0("a", 20:11))
# a20 a19 a18 a17 a16 a15 a14 a13 a12 a11
# "va20" "va19" "va18" "va17" "va16" "va15" "va14" "va13" "va12" "va11"
魔法属性
现在我们将执行一个函数,向选定的字典环境添加一个附加属性 table。
#Functions that add a dictionary attribute to tibble
addHashTable = function(.data, key, value){
key = enquo(key)
value = enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.data))) {
stop(paste0("`.data` must contain `", as_label(key),
"` and `", as_label(value), "` columns"))
}
if((.data %>% distinct(!!key, !!value) %>% nrow)!=
(.data %>% distinct(!!key) %>% nrow)){
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value)," pairs!\n",
"The dictionary will only return the last values for a given key!"))
}
ht = ht.create()
ht %>% ht.insert(.data %>% distinct(!!key, !!value) %>% pull(!!key),
.data %>% distinct(!!key, !!value) %>% pull(!!value))
attr(.data, "hashTab") = ht
.data
}
addHashTable2 = function(.x, .y, key, value){
key = enquo(key)
value = enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.y))) {
stop(paste0("`.y` must contain `", as_label(key),
"` and `", as_label(value), "` columns"))
}
if((.y %>% distinct(!!key, !!value) %>% nrow)!=
(.y %>% distinct(!!key) %>% nrow)){
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value)," pairs!\n",
"The dictionary will only return the last values for a given key!"))
}
ht = ht.create()
ht %>% ht.insert(.y %>% distinct(!!key, !!value) %>% pull(!!key),
.y %>% distinct(!!key, !!value) %>% pull(!!value))
attr(.x, "hashTab") = ht
.x
}
那里实际上有两个功能。 addHashTable
函数将 dictionary-environment 属性添加到从中获取键值对的同一 table。 addHashTable2
函数同样添加到字典环境 table,但从另一个 table.
让我们看看 addHashTable
是如何工作的。
USPS = USPS %>% addHashTable(common_abbrev, usps_abbrev)
str(USPS)
# tibble [503 x 2] (S3: tbl_df/tbl/data.frame)
# $ common_abbrev: chr [1:503] "allee" "alley" "ally" "aly" ...
# $ usps_abbrev : chr [1:503] "aly" "aly" "aly" "aly" ...
# - attr(*, "hashTab")=<environment: 0x000000001591bbf0>
如您所见,USPS
table 中添加了一个指向 0x000000001591bbf0
环境的属性。
替换函数
我们需要创建一个函数,该函数将使用以这种方式添加的字典环境来替换,在这种情况下,将指定变量中的最后一个单词替换为字典中的相应值。在这里。
replaceString = function(.data, value){
value = enquo(value)
#Test whether the value variable is in .data
if(!(as_label(value) %in% names(.data))){
stop(paste("The", as_label(value),
"variable does not exist in the .data table!"))
}
#Dictionary attribute presence test
if(!("hashTab" %in% names(attributes(.data)))) {
stop(paste0(
"\nThere is no dictionary attribute in the .data table!\n",
"Use addHashTable or addHashTable2 to add a dictionary attribute."))
}
txt = .data %>% pull(!!value)
i = sapply(strsplit(txt, ""), function(x) max(which(x==" ")))
txt = paste0(str_sub(txt, end=i),
ht.lookup(attr(.data, "hashTab"),
str_sub(txt, start=i+1)))
.data %>% mutate(!!value := txt)
}
第一次测试
第一篇文字的时间到了。为了避免复制代码,我添加了一个 returns 一个带有随机选择地址的 table 的小函数。
randomAddresses = function(n){
tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
}
set.seed(1111)
df = randomAddresses(10)
df
# # A tibble: 10 x 1
# addresses
# <chr>
# 1 74 marry forges
# 2 787 von brown knol
# 3 2755 van cortland summit
# 4 9405 harper plaza
# 5 5376 marry pass
# 6 1857 marry trailer
# 7 9810 von brown drv
# 8 7984 davis garden
# 9 9110 marry alley
# 10 6458 von brown row
是时候使用我们神奇的文本替换功能了。但是,请记住先将字典环境添加到 table。
df = df %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
df %>% replaceString(addresses)
# A tibble: 10 x 1
# addresses
# <chr>
# 1 74 marry frgs
# 2 787 von brown knl
# 3 2755 van cortland smt
# 4 9405 harper plz
# 5 5376 marry pass
# 6 1857 marry trlr
# 7 9810 von brown dr
# 8 7984 davis gdn
# 9 9110 marry aly
# 10 6458 von brown row
看起来可行!
大考验
嗯,没什么好等的。现在让我们在具有 百万行 的 table 上尝试一下。 让我们立即测量绘制地址和添加字典环境需要多长时间。
start_time =Sys.time()
df = randomAddresses(1000000)
df = df %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
Sys.time()-start_time
#Time difference of 1.56609 secs
输出
df
# A tibble: 1,000,000 x 1
# addresses
# <chr>
# 1 8995 marry pass
# 2 8527 davis spng
# 3 7663 marry loaf
# 4 3043 davis common
# 5 9192 marry bnd
# 6 120 von brown corner
# 7 72 van cortland plains
# 8 8752 van cortland crcle
# 9 7754 von brown sqrs
# 10 3745 marry key
# # ... with 999,990 more rows
1.6 秒可能不算多。然而,最大的问题是需要多长时间来替换缩写。
start_time =Sys.time()
df = df %>% replaceString(addresses)
Sys.time()-start_time
#Time difference of 8.316476 secs
输出
# A tibble: 1,000,000 x 1
# addresses
# <chr>
# 1 8995 marry pass
# 2 8527 davis spg
# 3 7663 marry lf
# 4 3043 davis cmn
# 5 9192 marry bnd
# 6 120 von brown cor
# 7 72 van cortland plns
# 8 8752 van cortland cir
# 9 7754 von brown sqs
# 10 3745 marry ky
# # ... with 999,990 more rows
砰!!我们还有 8 秒 !!
我确信 R 中无法实现更快的机制。
@ThomasIsCoding 的小更新
下面是一个小的基准测试。请注意,我从@ThomasIsCoding.
那里借用了函数f_MK_conv2
、 f_TIC1
和 f_TIC2
的代码
set.seed(1111)
df = randomAddresses(10000)
df = df %>% addHashTable2(USPS, common_abbrev, usps_abbrev)
library(microbenchmark)
mb1 = microbenchmark(
f_MK_conv2(df$addresses),
f_TIC1(df$addresses),
f_TIC2(df$addresses),
replaceString(df, addresses),
times = 20L
)
ggplot2::autoplot(mb1)
更新:
我花了一些时间调整我现有的答案(如下),我相信这是最快的方法。此外,值得注意的是,如果您将 perl = TRUE
添加到 f_JM 和 TIC4 中的 gsub,您会在本示例中明显提高速度(可能不适用于 'real world' 数据)。我的回答还有一个重要的警告,因为它基于地址中最后一个术语的缩写词(例如 TIC1、TIC2 和 TIC3 不依赖于该假设)。
非常感谢@Marek 和@TIC 提供的基准测试代码和建设性意见:
## Benchmarking with updated f_JM() and TIC4()
library(data.table)
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
randomAddresses = function(n){
tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
}
set.seed(1111)
df = randomAddresses(10)
USPS_conv2 = function(x, y) {
t = str_split(x, " ")
comm = t[[1]][length(t[[1]])]
str_replace(x, comm, y[comm])
}
USPS_conv2 = Vectorize(USPS_conv2, "x")
f_MK_conv2 <- function(x, y) {
x %>% mutate(
addresses = USPS_conv2(addresses,
array(data = y$usps_abbrev, dimnames = list(y$common_abbrev))))
}
f_MK_conv2(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
ht.delete <- function(ht, key) rm(list = key, envir = ht, inherits = FALSE)
ht.delete <- Vectorize(ht.delete, "key")
f_MK_replaceString <- function(x, y) {
ht <- ht.create()
ht.insert(ht, y$common_abbrev, y$usps_abbrev)
txt <- x$addresses
i <- sapply(strsplit(txt, ""), function(x) max(which(x == " ")))
txt <- paste0(
str_sub(txt, end = i),
ht.lookup(ht, str_sub(txt, start = i + 1))
)
x %>% mutate(addresses = txt)
}
f_MK_replaceString(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC1 <- function(x, y) {
x %>% mutate(addresses = sapply(
strsplit(x$addresses, " "),
function(x) {
with(y, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
)
}
f_TIC1(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC2 <- function(x, y) {
res <- c()
for (s in x$addresses) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, y$common_abbrev)
if (!is.na(k)) {
s <- with(
y,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
x %>% mutate(addresses = res)
}
f_TIC2(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC3 <- function(x, y) {
x.split <- strsplit(x$addresses, " ")
lut <- with(y, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
x %>% mutate(addresses = tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
))
}
f_TIC3(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_TIC4 <- function(x, y) {
xb <- gsub("^.*\s+", "", x$addresses, perl = TRUE)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
x %>% mutate(addresses = paste0(gsub("\w+$", "", x$addresses), replace(xb, !is.na(rp), na.omit(rp))))
}
f_TIC4(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
f_JM <- function(x, y) {
x$abbreviation <- gsub("^.* ", "", x$addresses, perl = TRUE)
setDT(x)
setDT(y)
x[y, abbreviation := usps_abbrev, on = .(abbreviation = common_abbrev)]
x$addresses <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
x$abbreviation <- NULL
return(as_tibble(x))
}
f_JM(df, USPS)
#> # A tibble: 10 × 1
#> addresses
#> <chr>
#> 1 8995 davis crk
#> 2 8527 davis tunl
#> 3 7663 von brown wall
#> 4 3043 harper lk
#> 5 9192 von brown gdn
#> 6 120 marry riv
#> 7 72 von brown lcks
#> 8 8752 marry gdn
#> 9 7754 davis cor
#> 10 3745 davis jcts
set.seed(1111)
df = randomAddresses(100)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(1000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(10000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(100000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
set.seed(1111)
df = randomAddresses(1000000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
由 reprex package (v2.0.1)
于 2021-11-04 创建原文:
出色的答案
library(tidyverse)
library(data.table)
n=1000000
set.seed(1111)
df = tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
df
#> # A tibble: 1,000,000 × 1
#> addresses
#> <chr>
#> 1 8995 marry pass
#> 2 8527 davis spng
#> 3 7663 marry loaf
#> 4 3043 davis common
#> 5 9192 marry bnd
#> 6 120 von brown corner
#> 7 72 van cortland plains
#> 8 8752 van cortland crcle
#> 9 7754 von brown sqrs
#> 10 3745 marry key
#> # … with 999,990 more rows
start_time =Sys.time()
df$abbreviation <- gsub("^.* ", "", df$addresses)
setDT(df)
setDT(USPS)
df[USPS, abbreviation:=usps_abbrev, on=.(abbreviation=common_abbrev)]
df$usps_abbreviation <- paste(str_extract(df$addresses, "^.* "), df$abbreviation, sep = "")
Sys.time()-start_time
#> Time difference of 2.804245 secs
df
#> addresses abbreviation usps_abbreviation
#> 1: 8995 marry pass pass 8995 marry pass
#> 2: 8527 davis spng spg 8527 davis spg
#> 3: 7663 marry loaf lf 7663 marry lf
#> 4: 3043 davis common cmn 3043 davis cmn
#> 5: 9192 marry bnd bnd 9192 marry bnd
#> ---
#> 999996: 1379 marry vdct via 1379 marry via
#> 999997: 237 harper avnue ave 237 harper ave
#> 999998: 7592 davis riv riv 7592 davis riv
#> 999999: 4963 marry junction jct 4963 marry jct
#> 1000000: 813 harper bluf blf 813 harper blf
由 reprex package (v2.0.1)
于 2021-11-03 创建编辑
我更改了 dt_func()
以产生与 Marek 的函数相同的输出(更公平的比较)并且它仍然非常快:
set.seed(1111)
df <- randomAddresses(10000)
dt_func <- function(x) {
x$abbreviation <- gsub("^.* ", "", x$addresses)
setDT(x)
setDT(USPS)
x[USPS, abbreviation:=usps_abbrev, on=.(abbreviation=common_abbrev)]
x$addresses <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
x$abbreviation <- NULL
return(as_tibble(x))
}
比较输出:
df2 <- f_MK_replaceString(df, addresses)
df3 <- dt_func(df)
dplyr::all_equal(df2, df3)
#> [1] TRUE
所有感兴趣的人的最新更新
我正在写一个额外的答案,因为我原来的答案不能容纳这么长的文本和代码了。
亲爱的同事们,下面我将这里创建的所有函数收集在一个集体代码块中,这样任何人都可以尝试一下,而不必将其与多个答案结合起来。
首先,我统一了所有函数,使每个函数在输入端接受两个参数,在输出端 returns 修改后的 tibble。我还将所有内部函数移到了处理函数之外。
最后,我对包含 100、1,000、10,000、100,000 和 1,000,000 行的表执行了基准测试。
这是全部代码
library(tidyverse)
library(data.table)
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
randomAddresses = function(n){
tibble(
addresses = paste(
sample(10:10000, n, replace = TRUE),
sample(c("harper", "davis", "van cortland", "marry", "von brown"), n, replace = TRUE),
sample(USPS$common_abbrev, n, replace = TRUE)
)
)
}
set.seed(1111)
df = randomAddresses(10)
USPS_conv2 = function(x, y) {
t = str_split(x, " ")
comm = t[[1]][length(t[[1]])]
str_replace(x, comm, y[comm])
}
USPS_conv2 = Vectorize(USPS_conv2, "x")
f_MK_conv2 <- function(x, y) {
x %>% mutate(
addresses = USPS_conv2(addresses,
array(data = y$usps_abbrev, dimnames = list(y$common_abbrev))))
}
f_MK_conv2(df, USPS)
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
ht.delete <- function(ht, key) rm(list = key, envir = ht, inherits = FALSE)
ht.delete <- Vectorize(ht.delete, "key")
f_MK_replaceString <- function(x, y) {
ht <- ht.create()
ht.insert(ht, y$common_abbrev, y$usps_abbrev)
txt <- x$addresses
i <- sapply(strsplit(txt, ""), function(x) max(which(x == " ")))
txt <- paste0(
str_sub(txt, end = i),
ht.lookup(ht, str_sub(txt, start = i + 1))
)
x %>% mutate(addresses = txt)
}
f_MK_replaceString(df, USPS)
f_TIC1 <- function(x, y) {
x %>% mutate(addresses = sapply(
strsplit(x$addresses, " "),
function(x) {
with(y, {
idx <- match(x, common_abbrev)
paste0(ifelse(is.na(idx), x, usps_abbrev[idx]),
collapse = " "
)
})
}
)
)
}
f_TIC1(df, USPS)
f_TIC2 <- function(x, y) {
res <- c()
for (s in x$addresses) {
v <- unlist(strsplit(s, "\W+"))
for (p in v) {
k <- match(p, y$common_abbrev)
if (!is.na(k)) {
s <- with(
y,
gsub(
sprintf("\b%s\b", common_abbrev[k]),
usps_abbrev[k],
s
)
)
}
}
res <- append(res, s)
}
x %>% mutate(addresses = res)
}
f_TIC2(df, USPS)
f_TIC3 <- function(x, y) {
x.split <- strsplit(x$addresses, " ")
lut <- with(y, setNames(usps_abbrev, common_abbrev))
grp <- rep(seq_along(x.split), lengths(x.split))
xx <- unlist(x.split)
r <- lut[xx]
x %>% mutate(addresses = tapply(
replace(xx, !is.na(r), na.omit(r)),
grp,
function(s) paste0(s, collapse = " ")
))
}
f_TIC3(df, USPS)
f_TIC4 <- function(x, y) {
xb <- gsub("^.*\s+", "", x$addresses)
rp <- with(USPS, usps_abbrev[match(xb, common_abbrev)])
x %>% mutate(addresses = paste0(gsub("\w+$", "", x$addresses), replace(xb, !is.na(rp), na.omit(rp))))
}
f_TIC4(df, USPS)
f_JM <- function(x, y) {
x$abbreviation <- gsub("^.* ", "", x$addresses)
setDT(x)
setDT(y)
x[y, abbreviation := usps_abbrev, on = .(abbreviation = common_abbrev)]
x$addresses <- paste(str_extract(x$addresses, "^.* "), x$abbreviation, sep = "")
x$abbreviation <- NULL
return(as_tibble(x))
}
f_JM(df, USPS)
set.seed(1111)
df = randomAddresses(100)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(1000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(10000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_conv2(df, USPS),
f_MK_replaceString(df, USPS),
f_TIC1(df, USPS),
f_TIC2(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(100000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC3(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
set.seed(1111)
df = randomAddresses(1000000)
library(microbenchmark)
mb1 <- microbenchmark(
f_MK_replaceString(df, USPS),
f_TIC4(df, USPS),
f_JM(df, USPS),
times = 20L
)
ggplot2::autoplot(mb1)
现在以图表形式显示结果
结论和总结时间
@jared_mamrot - 你完全正确。 data.table
太棒了!!
@ThomasIsCoding - f_TIC4
太棒了。它的简单是美丽的!!
@AnyoneWhoComesBy - 恭喜你读完了这篇文章。相信你也能在这里学到很多东西!!
特别是@jvalenti
这是一个特殊的答案,您可以在其中找到修改后的函数和您的任务所需的所有代码。
library(tidyverse)
USPS = tibble(
common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", "circle", "crcl", "crcle", "circles", "clf",
"cliff", "clfs", "cliffs", "clb", "club", "common", "commons",
"cor", "corner", "corners", "cors", "course", "crse", "court",
"ct", "courts", "cts", "cove", "cv", "coves", "creek", "crk",
"crescent", "cres", "crsent", "crsnt", "crest", "crossing", "crssng",
"xing", "crossroad", "crossroads", "curve", "dale", "dl", "dam",
"dm", "div", "divide", "dv", "dvd", "dr", "driv", "drive", "drv",
"drives", "est", "estate", "estates", "ests", "exp", "expr",
"express", "expressway", "expw", "expy", "ext", "extension",
"extn", "extnsn", "exts", "fall", "falls", "fls", "ferry", "frry",
"fry", "field", "fld", "fields", "flds", "flat", "flt", "flats",
"flts", "ford", "frd", "fords", "forest", "forests", "frst",
"forg", "forge", "frg", "forges", "fork", "frk", "forks", "frks",
"fort", "frt", "ft", "freeway", "freewy", "frway", "frwy", "fwy",
"garden", "gardn", "grden", "grdn", "gardens", "gdns", "grdns",
"gateway", "gatewy", "gatway", "gtway", "gtwy", "glen", "gln",
"glens", "green", "grn", "greens", "grov", "grove", "grv", "groves",
"harb", "harbor", "harbr", "hbr", "hrbor", "harbors", "haven",
"hvn", "ht", "hts", "highway", "highwy", "hiway", "hiwy", "hway",
"hwy", "hill", "hl", "hills", "hls", "hllw", "hollow", "hollows",
"holw", "holws", "inlt", "is", "island", "islnd", "islands",
"islnds", "iss", "isle", "isles", "jct", "jction", "jctn", "junction",
"junctn", "juncton", "jctns", "jcts", "junctions", "key", "ky",
"keys", "kys", "knl", "knol", "knoll", "knls", "knolls", "lk",
"lake", "lks", "lakes", "land", "landing", "lndg", "lndng", "lane",
"ln", "lgt", "light", "lights", "lf", "loaf", "lck", "lock",
"lcks", "locks", "ldg", "ldge", "lodg", "lodge", "loop", "loops",
"mall", "mnr", "manor", "manors", "mnrs", "meadow", "mdw", "mdws",
"meadows", "medows", "mews", "mill", "mills", "missn", "mssn",
"motorway", "mnt", "mt", "mount", "mntain", "mntn", "mountain",
"mountin", "mtin", "mtn", "mntns", "mountains", "nck", "neck",
"orch", "orchard", "orchrd", "oval", "ovl", "overpass", "park",
"prk", "parks", "parkway", "parkwy", "pkway", "pkwy", "pky",
"parkways", "pkwys", "pass", "passage", "path", "paths", "pike",
"pikes", "pine", "pines", "pnes", "pl", "plain", "pln", "plains",
"plns", "plaza", "plz", "plza", "point", "pt", "points", "pts",
"port", "prt", "ports", "prts", "pr", "prairie", "prr", "rad",
"radial", "radiel", "radl", "ramp", "ranch", "ranches", "rnch",
"rnchs", "rapid", "rpd", "rapids", "rpds", "rest", "rst", "rdg",
"rdge", "ridge", "rdgs", "ridges", "riv", "river", "rvr", "rivr",
"rd", "road", "roads", "rds", "route", "row", "rue", "run", "shl",
"shoal", "shls", "shoals", "shoar", "shore", "shr", "shoars",
"shores", "shrs", "skyway", "spg", "spng", "spring", "sprng",
"spgs", "spngs", "springs", "sprngs", "spur", "spurs", "sq",
"sqr", "sqre", "squ", "square", "sqrs", "squares", "sta", "station",
"statn", "stn", "stra", "strav", "straven", "stravenue", "stravn",
"strvn", "strvnue", "stream", "streme", "strm", "street", "strt",
"st", "str", "streets", "smt", "suite", "sumit", "sumitt", "summit",
"ter", "terr", "terrace", "throughway", "trace", "traces", "trce",
"track", "tracks", "trak", "trk", "trks", "trafficway", "trail",
"trails", "trl", "trls", "trailer", "trlr", "trlrs", "tunel",
"tunl", "tunls", "tunnel", "tunnels", "tunnl", "trnpk", "turnpike",
"turnpk", "underpass", "un", "union", "unions", "valley", "vally",
"vlly", "vly", "valleys", "vlys", "vdct", "via", "viadct", "viaduct",
"view", "vw", "views", "vws", "vill", "villag", "village", "villg",
"villiage", "vlg", "villages", "vlgs", "ville", "vl", "vis",
"vist", "vista", "vst", "vsta", "walk", "walks", "wall", "wy",
"way", "ways", "well", "wells", "wls"),
usps_abbrev = c("aly",
"aly", "aly", "aly", "anx", "anx", "anx", "anx", "arc", "arc",
"ave", "ave", "ave", "ave", "ave", "ave", "ave", "byu", "byu",
"bch", "bch", "bnd", "bnd", "blf", "blf", "blf", "blfs", "btm",
"btm", "btm", "btm", "blvd", "blvd", "blvd", "blvd", "br", "br",
"br", "brg", "brg", "brg", "brk", "brk", "brks", "bg", "bgs",
"byp", "byp", "byp", "byp", "byp", "cp", "cp", "cp", "cyn", "cyn",
"cyn", "cpe", "cpe", "cswy", "cswy", "cswy", "ctr", "ctr", "ctr",
"ctr", "ctr", "ctr", "ctr", "ctr", "ctrs", "cir", "cir", "cir",
"cir", "cir", "cir", "cirs", "clf", "clf", "clfs", "clfs", "clb",
"clb", "cmn", "cmns", "cor", "cor", "cors", "cors", "crse", "crse",
"ct", "ct", "cts", "cts", "cv", "cv", "cvs", "crk", "crk", "cres",
"cres", "cres", "cres", "crst", "xing", "xing", "xing", "xrd",
"xrds", "curv", "dl", "dl", "dm", "dm", "dv", "dv", "dv", "dv",
"dr", "dr", "dr", "dr", "drs", "est", "est", "ests", "ests",
"expy", "expy", "expy", "expy", "expy", "expy", "ext", "ext",
"ext", "ext", "exts", "fall", "fls", "fls", "fry", "fry", "fry",
"fld", "fld", "flds", "flds", "flt", "flt", "flts", "flts", "frd",
"frd", "frds", "frst", "frst", "frst", "frg", "frg", "frg", "frgs",
"frk", "frk", "frks", "frks", "ft", "ft", "ft", "fwy", "fwy",
"fwy", "fwy", "fwy", "gdn", "gdn", "gdn", "gdn", "gdns", "gdns",
"gdns", "gtwy", "gtwy", "gtwy", "gtwy", "gtwy", "gln", "gln",
"glns", "grn", "grn", "grns", "grv", "grv", "grv", "grvs", "hbr",
"hbr", "hbr", "hbr", "hbr", "hbrs", "hvn", "hvn", "hts", "hts",
"hwy", "hwy", "hwy", "hwy", "hwy", "hwy", "hl", "hl", "hls",
"hls", "holw", "holw", "holw", "holw", "holw", "inlt", "is",
"is", "is", "iss", "iss", "iss", "isle", "isle", "jct", "jct",
"jct", "jct", "jct", "jct", "jcts", "jcts", "jcts", "ky", "ky",
"kys", "kys", "knl", "knl", "knl", "knls", "knls", "lk", "lk",
"lks", "lks", "land", "lndg", "lndg", "lndg", "ln", "ln", "lgt",
"lgt", "lgts", "lf", "lf", "lck", "lck", "lcks", "lcks", "ldg",
"ldg", "ldg", "ldg", "loop", "loop", "mall", "mnr", "mnr", "mnrs",
"mnrs", "mdw", "mdws", "mdws", "mdws", "mdws", "mews", "ml",
"mls", "msn", "msn", "mtwy", "mt", "mt", "mt", "mtn", "mtn",
"mtn", "mtn", "mtn", "mtn", "mtns", "mtns", "nck", "nck", "orch",
"orch", "orch", "oval", "oval", "opas", "park", "park", "park",
"pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pkwy", "pass",
"psge", "path", "path", "pike", "pike", "pne", "pnes", "pnes",
"pl", "pln", "pln", "plns", "plns", "plz", "plz", "plz", "pt",
"pt", "pts", "pts", "prt", "prt", "prts", "prts", "pr", "pr",
"pr", "radl", "radl", "radl", "radl", "ramp", "rnch", "rnch",
"rnch", "rnch", "rpd", "rpd", "rpds", "rpds", "rst", "rst", "rdg",
"rdg", "rdg", "rdgs", "rdgs", "riv", "riv", "riv", "riv", "rd",
"rd", "rds", "rds", "rte", "row", "rue", "run", "shl", "shl",
"shls", "shls", "shr", "shr", "shr", "shrs", "shrs", "shrs",
"skwy", "spg", "spg", "spg", "spg", "spgs", "spgs", "spgs", "spgs",
"spur", "spur", "sq", "sq", "sq", "sq", "sq", "sqs", "sqs", "sta",
"sta", "sta", "sta", "stra", "stra", "stra", "stra", "stra",
"stra", "stra", "strm", "strm", "strm", "st", "st", "st", "st",
"sts", "smt", "ste", "smt", "smt", "smt", "ter", "ter", "ter",
"trwy", "trce", "trce", "trce", "trak", "trak", "trak", "trak",
"trak", "trfy", "trl", "trl", "trl", "trl", "trlr", "trlr", "trlr",
"tunl", "tunl", "tunl", "tunl", "tunl", "tunl", "tpke", "tpke",
"tpke", "upas", "un", "un", "uns", "vly", "vly", "vly", "vly",
"vlys", "vlys", "via", "via", "via", "via", "vw", "vw", "vws",
"vws", "vlg", "vlg", "vlg", "vlg", "vlg", "vlg", "vlgs", "vlgs",
"vl", "vl", "vis", "vis", "vis", "vis", "vis", "walk", "walk",
"wall", "way", "way", "ways", "wl", "wls", "wls"))
randomAddresses = function(n){
tibble(
addresses =
replicate(
n,
sample(c(sample(10:10000, 1, replace = TRUE) %>% paste0,
sample(c("harper", "davis", "van cortland", "marry", "von brown"), 1),
sample(USPS$common_abbrev, 1)), 3) %>% paste(collapse = " ")
)
)
}
ht.create <- function() new.env()
ht.insert <- function(ht, key, value) ht[[key]] <- value
ht.insert <- Vectorize(ht.insert, c("key", "value"))
ht.lookup <- function(ht, key) ht[[key]]
ht.lookup <- Vectorize(ht.lookup, "key")
addHashTable2 = function(.x, .y, key, value){
key = enquo(key)
value = enquo(value)
if (!all(c(as_label(key), as_label(value)) %in% names(.y))) {
stop(paste0("`.y` must contain `", as_label(key),
"` and `", as_label(value), "` columns"))
}
if((.y %>% distinct(!!key, !!value) %>% nrow)!=
(.y %>% distinct(!!key) %>% nrow)){
warning(paste0(
"\nThe number of unique values of the ", as_label(key),
" variable is different\n",
" from the number of unique values of the ",
as_label(key), " and ", as_label(value)," pairs!\n",
"The dictionary will only return the last values for a given key!"))
}
ht = ht.create()
ht %>% ht.insert(.y %>% distinct(!!key, !!value) %>% pull(!!key),
.y %>% distinct(!!key, !!value) %>% pull(!!value))
attr(.x, "hashTab") = ht
.x
}
replaceString = function(.data, value){
value = enquo(value)
#Test whether the value variable is in .data
if(!(as_label(value) %in% names(.data))){
stop(paste("The", as_label(value),
"variable does not exist in the .data table!"))
}
#Dictionary attribute presence test
if(!("hashTab" %in% names(attributes(.data)))) {
stop(paste0(
"\nThere is no dictionary attribute in the .data table!\n",
"Use addHashTable or addHashTable2 to add a dictionary attribute."))
}
ht = attr(.data, "hashTab")
txtRep = function(txt){
txt = str_split(txt, " ")[[1]]
httxt = ht.lookup(ht, txt)
txt[httxt!="NULL"] = httxt[httxt!="NULL"]
paste(txt, collapse = " ")
}
.data %>% rowwise(!!value) %>%
mutate(!!value := txtRep(!!value))
}
replaceString
功能已修改为替换缩写,无论它们在句子中的什么位置。
查看使用方法。
set.seed(1111)
df=randomAddresses(10)
df
输出
# A tibble: 10 x 1
addresses
<chr>
1 marry wall 8995
2 cen 9192 marry
3 bayoo 3745 davis
4 marry hollows 4104
5 grdn 7162 marry
6 lck harper 1211
7 9405 van cortland knol
8 7984 von brown viadct
9 4365 von brown rue
10 6399 von brown mssn
现在我们要修改这个tibble
。
df %>% addHashTable2(USPS, common_abbrev, usps_abbrev) %>%
replaceString(addresses)
输出
# A tibble: 10 x 1
# Rowwise: addresses
addresses
<chr>
1 marry wall 8995
2 ctr 9192 marry
3 byu 3745 davis
4 marry holw 4104
5 gdn 7162 marry
6 lck harper 1211
7 9405 van cortland knl
8 7984 von brown via
9 4365 von brown rue
10 6399 von brown msn
祝你好运,大数据快速突变!!