Millions of tiny matches in R: need performance
I have a word vector of length one million, called WORDS. I also have a list of 9 million objects, called SENTENCES. Each object in my list is a sentence, represented by a character vector of 10 to 50 words. Here is an example:
head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"
SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"
I want to convert each sentence in my list into a numeric vector whose elements correspond to the positions of the sentence's words in the big WORDS vector.
Actually, I know how to do it with this command:
convert <- function(sentence){
return(which(WORDS %in% sentence))
}
SENTENCES_NUM <- lapply(SENTENCES, convert)
The problem is that it takes far too long: my RStudio blows up, even though I have a computer with 16 GB of RAM. So the question is: do you have any ideas to speed up this computation?
fastmatch, a small package written by an R core member, hashes the lookups, so the initial and, especially, subsequent searches are faster.
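For illustration, a minimal sketch of fmatch on a made-up toy vocabulary (fastmatch caches the hash table on the table argument, which is why repeated lookups against the same table skip the build step):

library(fastmatch)
vocab <- c("aba", "accra", "ada")   # toy vocabulary for illustration
fmatch(c("ada", "aba"), vocab)      # first call builds and caches the hash
# [1] 3 1
fmatch("accra", vocab)              # later calls reuse the cached hash table
# [1] 2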
What you are really doing is making a factor with predefined levels for each sentence. The slow step in the package's C code is sorting the factor levels, which you can avoid by providing a (unique) list of factor levels to its fast version of the factor function.
If you just want the integer positions, you can easily convert from factor to integer: many people do this inadvertently.
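As a minimal base-R sketch of that conversion (again with made-up words): the integer codes of a factor are exactly the positions of its values in the level vector.

f <- factor(c("to", "how", "that"),
            levels = c("how", "interpret", "picture", "that", "to"))
as.integer(f)   # positions of the values in the levels vector
# [1] 5 1 4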
You don't actually need a factor at all for what you want, just match. Your code also generates a logical vector and then recalculates the positions from it: match goes straight to the positions.
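Note that the two approaches don't even return the same thing, which is worth keeping in mind before benchmarking. A small sketch on toy data: which(WORDS %in% sentence) yields the sorted, de-duplicated positions, whereas match(sentence, WORDS) returns one position per word, in sentence order, repeats included.

WORDS_toy <- c("how", "interpret", "picture", "that", "to")
sent <- c("to", "interpret", "to")
which(WORDS_toy %in% sent)   # [1] 2 5    (sorted, unique positions)
match(sent, WORDS_toy)       # [1] 5 2 5  (one position per word, in order)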
library(fastmatch)
library(microbenchmark)
WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
words_factor <- as.factor(WORDS)
# generate 100 sentences; note that size is drawn only once here,
# so all sentences share a single random length between 5 and 15 words:
SENTENCES <- lapply(c(1:100), sample, x = WORDS, size = sample(c(5:15), size = 1))
bench_fun <- function(fun)
lapply(SENTENCES, fun)
# poster's slow solution:
hg_convert <- function(sentence)
return(which(WORDS %in% sentence))
jw_convert_match <- function(sentence)
match(sentence, WORDS)
jw_convert_match_factor <- function(sentence)
match(sentence, words_factor)
jw_convert_fastmatch <- function(sentence)
fmatch(sentence, WORDS)
jw_convert_fastmatch_factor <- function(sentence)
fmatch(sentence, words_factor)
message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
bench_fun(jw_convert_match),
bench_fun(jw_convert_match_factor),
bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(c(1:1e6), sample, x = WORDS, size = sample(c(5:15), size = 1))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
I have put this in a gist: https://gist.github.com/jackwasey/59848d84728c0f55ef11
The results don't format well here; suffice it to say that fastmatch, with or without factor input, is dramatically faster.
# starting benchmark one
Unit: microseconds
expr min lq mean median uq max neval
bench_fun(hg_convert) 665167.953 678451.008 704030.2427 691859.576 738071.699 777176.143 10
bench_fun(jw_convert_match) 878269.025 950580.480 962171.6683 956413.486 990592.691 1014922.639 10
bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764 10
bench_fun(jw_convert_fastmatch) 203.031 220.134 462.1246 289.647 305.070 2196.906 10
bench_fun(jw_convert_fastmatch_factor) 251.474 300.729 1351.6974 317.439 362.127 10604.506 10
# starting benchmark two, compare with factor vs vector of words
Unit: seconds
expr min lq mean median uq max neval
bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648 10
bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907 10
So I wouldn't bother with a parallel implementation just yet.
This won't be faster, but it is the tidy way of doing things.
library(dplyr)
library(tidyr)

# one row per word, tagged with the ID of the sentence it came from
sentence <-
  data_frame(word.name = SENTENCES,
             sentence.ID = seq_along(SENTENCES)) %>%
  unnest(word.name)

# a lookup table mapping each word to its position in WORDS
word <-
  data_frame(word.name = WORDS,
             word.ID = seq_along(WORDS))

# attach the word positions to every word of every sentence
sentence__word <-
  sentence %>%
  left_join(word)
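If you still need the result in the original shape, a list with one numeric vector per sentence, one way (a sketch, assuming the join above has run) is to split the joined table back out by sentence ID:

# regroup the matched positions into one numeric vector per sentence
SENTENCES_NUM <- split(sentence__word$word.ID, sentence__word$sentence.ID)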