R - slowly working lapply with sort on ordered factor
Building on the question More efficient means of creating a corpus and DTM, I prepared my own method for building a term-document matrix from a large corpus, which (I hope) does not need Terms x Documents memory.
# requires dplyr (%>%, group_by(), tally()), slam (simple_triplet_matrix())
# and tm (as.TermDocumentMatrix(), weightTf)
library(dplyr)
library(slam)
library(tm)

sparseTDM <- function(vc) {
  # pull document ids and raw text out of the corpus
  id      <- unlist(lapply(vc, function(x) x$meta$id))
  content <- unlist(lapply(vc, function(x) x$content))
  # tokenize on whitespace (note the escaped "\\s")
  out <- strsplit(content, "\\s", perl = TRUE)
  names(out) <- id
  lev.terms <- sort(unique(unlist(out)))
  lev.docs  <- id
  # v1: per document, the sorted integer positions of its tokens
  # within the global term vocabulary
  v1 <- lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )
  # v2: the matching document index, repeated once per token
  v2 <- lapply(
    seq_along(v1),
    function(i, x, n) {
      rep(i, length(x[[i]]))
    },
    x = v1,
    n = names(v1)  # n is unused inside the function; kept from the original
  )
  # count (term, doc) pairs to build the triplet representation
  stm <- data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()
  tmp <- simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )
  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
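Not part of the original post, but for concreteness, a minimal usage sketch, assuming vc is a tm VCorpus (whose documents expose $meta$id and $content the way the function expects):

library(tm)
# tiny hypothetical corpus; VectorSource assigns ids "1", "2", ...
vc <- VCorpus(VectorSource(c("the quick brown fox", "the lazy dog")))
tdm <- sparseTDM(vc)
inspect(tdm)  # expect 6 unique terms x 2 docs, term-frequency weighted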
It slows down while computing v1. A run lasted 30 minutes and I stopped it.
I prepared a small example:
library(microbenchmark)

b <- paste0("string", 1:200000)
a <- sample(b, 80)
microbenchmark(
  lapply(
    list(a = a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)
The result is:
Unit: milliseconds
expr min lq mean median uq max neval
... 25.80961 28.79981 31.59974 30.79836 33.02461 98.02512 100
id and content have 126,522 elements each, lev.terms has 155,591 elements, so it seems I stopped the processing too early (at ~30 ms per document, the v1 step alone would take over an hour). Since I will ultimately be working with ~6M documents, I have to ask... Is there any way to speed up this code?
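A side note on where the time likely goes: factor(x, levels = lev) has to match() every token against the entire level vector, so each document pays the cost of the full ~155k-term vocabulary. A minimal sketch isolating that step, reusing a and b from the example above:

library(microbenchmark)
# factor() is essentially match(x, levels) plus attribute bookkeeping,
# so both calls below scale with length(b), not length(a)
microbenchmark(
  full_factor = factor(a, levels = b, ordered = TRUE),
  match_only  = match(a, b)
)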
Have you tried the sort method (algorithm) argument, specifying quicksort or shell sort?
Something like:

sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "shell")

Or:

sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "quick")

Also, if those steps get re-executed again and again, you could try evaluating the nested functions through some intermediate variables:

foo <- factor(x, levels = lev, ordered = TRUE)
bar <- as.integer(foo)
sort(bar, method = "quick")

or

sort(bar)

Good luck!
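A quick sketch for timing those variants on the toy data above (reusing a and b; note that with only 80 integers to sort, the factor() call rather than the sort() tends to dominate):

library(microbenchmark)
v <- as.integer(factor(a, levels = b, ordered = TRUE))
microbenchmark(
  default_sort = sort(v),
  shell_sort   = sort(v, method = "shell")
)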
Now I've sped it up by replacing

sort(as.integer(factor(x, levels = lev, ordered = TRUE)))

with

ind <- which(lev %in% x)
cnt <- as.integer(factor(x, levels = lev[ind], ordered = TRUE))
sort(ind[cnt])

(so factor() only has to match against the handful of levels actually present in the document, instead of the full vocabulary.)
Now the timing is:
expr min lq mean median uq max neval
... 5.248479 6.202161 6.892609 6.501382 7.313061 10.17205 100
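For a side-by-side view, a sketch benchmarking both variants on the same toy data (a and b as above):

library(microbenchmark)
microbenchmark(
  original = sort(as.integer(factor(a, levels = b, ordered = TRUE))),
  faster   = {
    # restrict the levels to terms actually present before calling factor()
    ind <- which(b %in% a)
    cnt <- as.integer(factor(a, levels = b[ind], ordered = TRUE))
    sort(ind[cnt])
  }
)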
I've been through several iterations of solving this problem in creating quanteda::dfm() (see the GitHub repo here), and by far the fastest solution involves using the data.table and Matrix packages to index the documents and tokenized features, count the features within documents, and plug the result straight into a sparse matrix, like this:
require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
    docIndex <- 1:length(x)
    if (is.null(names(x)))
        names(docIndex) <- factor(paste("text", 1:length(x), sep = "")) else
        names(docIndex) <- names(x)

    # one row per token: the document it came from, and the token itself
    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features
    alltokens[, "n" := 1L]
    # collapse to counts per (document, feature) pair
    alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

    uniqueFeatures <- unique(alltokens$features)
    uniqueFeatures <- sort(uniqueFeatures)
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)
    # join the counts to the feature index on the "features" key
    setkey(alltokens, features)
    setkey(featureTable, features)
    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

    # build the sparse document-feature matrix directly from the triplets
    sparseMatrix(i = alltokens$docIndex,
                 j = alltokens$featureIndex,
                 x = alltokens$V1,
                 dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}
require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
## user system elapsed
## 0.060 0.005 0.064
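A quick sanity check of the returned object might look like this (a sketch; the exact feature count depends on the tokenizer settings):

m <- dfm_quanteda(tokenizedTexts)
dim(m)       # 57 documents x number of unique features
m[1:3, 1:5]  # top-left corner of the sparse document-feature matrix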
That's just a snippet of course, but the full source code is easily found in the GitHub repository (dfm-main.R).
I also encourage you to use the full dfm() from the package. You can install it from CRAN with install.packages("quanteda"), or the development version with:

devtools::install_github("kbenoit/quanteda")

and see how it works performance-wise on your texts.