Improve performance of replacing multiple strings
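I am masking phone numbers and personal names in my raw data. I have already asked about the phone numbers and got an answer.
For masking the personal names, I have the following code: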
x = c("010-1234-5678",
      "John 010-8888-8888",
      "Phone: 010-1111-2222",
      "Peter 018.1111.3333",
      "Year(2007,2019,2020)",
      "Alice 01077776666")
df = data.frame(
  phoneNumber = x
)
delName = c("John", "Peter", "Alice")
# One gsub() pass over the whole column per name
for (name in delName) {
  df$phoneNumber <- gsub(name, "anonymous", df$phoneNumber)
}
That code works fine for me:
> df
phoneNumber
1 010-1234-5678
2 anonymous 010-8888-8888
3 Phone: 010-1111-2222
4 anonymous 018.1111.3333
5 Year(2007,2019,2020)
6 anonymous 01077776666
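But I have more than 10,000 personal names to mask, and R is currently working on the 789th one. Time will solve it, but I would like to know how to reduce the processing time. I searched for foreach, but I do not know how to adapt my original code above.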
You could first try it without a loop: paste the names together with the regex OR operator | and call gsub once.
(delNamec <- paste(delName, collapse='|'))
# [1] "John|Peter|Alice"
gsub(delNamec, 'anonymous', df$phoneNumber)
# [1] "010-1234-5678"
# [2] "anonymous 010-8888-8888"
# [3] "Phone: 010-1111-2222"
# [4] "anonymous 018.1111.3333"
# [5] "Year(2007,2019,2020)"
# [6] "anonymous 01077776666"
This runs in the blink of an eye, even with 100,000 rows.
df2 <- df[sample(nrow(df), 1e5, replace = TRUE), , drop = FALSE]
dim(df2)
# [1] 100000 1
system.time(gsub(delNamec, 'anonymous', df2$phoneNumber))
# user system elapsed
# 0.129 0.000 0.129
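One thing to keep in mind with the pasted pattern: every name is interpreted as a regular expression, so a name containing a character such as . or ( would change what is matched, and a short name such as John would also match inside Johnson. Here is a minimal sketch of one way to guard against both, assuming the names should match literally and only as whole words. The sample data and the variable names texts, names_to_mask and masked are hypothetical, not from the original post.

```r
# Hypothetical sample data (not from the original post).
texts <- c("John 010-8888-8888",
           "Johnson 010-1234-5678",
           "A.Lee 018.1111.3333",
           "AxLee 01077776666")
names_to_mask <- c("John", "A.Lee")

# \\Q...\\E quotes each name literally (PCRE syntax, so perl = TRUE is needed);
# \\b restricts matches to whole words.
# Assumption: every name begins and ends with a letter or digit and does not
# contain the sequence \E.
pattern <- paste0("\\b(?:",
                  paste0("\\Q", names_to_mask, "\\E", collapse = "|"),
                  ")\\b")

masked <- gsub(pattern, "anonymous", texts, perl = TRUE)
print(masked)
```

With this pattern, "Johnson" and "AxLee" are left alone, while the plain John|A.Lee pattern would alter both.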
Here is another option using stringr, which is faster than gsub in the benchmark below.
library(stringr)
str_replace_all(
  string = df$phoneNumber,
  pattern = paste(delName, collapse = '|'),
  replacement = "anonymous"
)
# [1] "010-1234-5678"
# [2] "anonymous 010-8888-8888"
# [3] "Phone: 010-1111-2222"
# [4] "anonymous 018.1111.3333"
# [5] "Year(2007,2019,2020)"
# [6] "anonymous 01077776666"
Benchmark (thanks to @jay.sf for df2!)
df2 <- df[sample(nrow(df), 1e5, replace = TRUE), , drop = FALSE]
dim(df2)
# [1] 100000 1
bench::mark(
  stringr = str_replace_all(
    string = df2$phoneNumber,
    pattern = paste(delName, collapse = '|'),
    replacement = "anonymous"
  ),
  gsub = gsub(delNamec, 'anonymous', df2$phoneNumber)
)
# A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 stringr 45.4ms 46.7ms 20.9 781KB 0 11 0 525ms
# 2 gsub 97ms 111.8ms 9.18 781KB 0 5 0 544ms
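The timings above use only three names, whereas the question mentions more than 10,000. How well a single alternation of that size performs depends on the regex engine and is not measured here. If one huge pattern turns out to be a problem, a possible middle ground is to process the names in chunks, which still cuts 10,000 passes down to a few dozen. A rough sketch follows; the helper mask_in_chunks and its chunk_size default are hypothetical and not part of either answer.

```r
# Hypothetical helper (not from the original answers): mask names chunk by
# chunk, using one alternation pattern per chunk instead of one gsub() per name.
mask_in_chunks <- function(x, names, chunk_size = 500, replacement = "anonymous") {
  # Group the names into consecutive chunks of at most chunk_size elements.
  chunks <- split(names, ceiling(seq_along(names) / chunk_size))
  for (chunk in chunks) {
    x <- gsub(paste(chunk, collapse = "|"), replacement, x)
  }
  x
}

# Usage with the names from the question; chunk_size = 2 just to force two chunks.
x <- c("010-1234-5678",
       "John 010-8888-8888",
       "Peter 018.1111.3333",
       "Alice 01077776666")
result <- mask_in_chunks(x, c("John", "Peter", "Alice"), chunk_size = 2)
print(result)
```

The chunk size is a tuning knob: larger chunks mean fewer passes over the data but longer patterns, so it is worth timing a few values on the real data.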