Is there an R package to assist in large data processing?
I am working with a large dataset (after cleaning). The dataset is then processed to create an adjacency matrix, which is passed to logicEval for the id obs containing each uniqueID.
When the code snippet is run to create the adjacency matrix, the process takes a very long time (sometimes it simply freezes).
Apparently this is because the function checks every unique element (n=10901) and marks TRUE/FALSE depending on whether it appears in the observation. An example (greatly reduced):
|Obs_1 |Obs_2 |Obs_3 |Obs_4 |Obs_5 | logEval|
|:-----|:-----|:-----|:-----|:-----|-------:|
|TRUE |FALSE |FALSE |FALSE |FALSE | 1|
|FALSE |TRUE |FALSE |FALSE |FALSE | 1|
|FALSE |FALSE |TRUE |FALSE |FALSE | 1|
|FALSE |FALSE |FALSE |TRUE |FALSE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |TRUE |FALSE |FALSE | 1|
|TRUE |FALSE |FALSE |FALSE |FALSE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
Actual Obs = 43, with > 100,000 comparisons.
Question: R crashes. Is there a better way to run this so that it does not crash because of the size?
Code snippet:
df1<-data.table(col1=sample(500000:500900,700,replace = T),
col2=sample(500000:500900,700,replace = T),
col3=sample(500000:500900,700,replace = T),
col4=sample(500000:500900,700,replace = T),
col5 = sample(500000:500900,700,replace = T),
col6 = sample(500000:500900,700,replace = T),
col7 = sample(500000:500900,700,replace = T),
col8 = sample(500000:500900,700,replace = T),
col9 = sample(500000:500900,700,replace = T),
col10 = sample(500000:500900,700,replace = T),
col11 = sample(500000:500900,700,replace = T),
col12 = sample(500000:500900,700,replace = T),
col13 = sample(500000:500900,700,replace = T),
col14 = sample(500000:500900,700,replace = T),
col15 = sample(500000:500900,700,replace = T),
col16 = sample(500000:500900,700,replace = T),
col17 = sample(500000:500900,700,replace = T),
col18 = sample(500000:500900,700,replace = T),
col19 = sample(500000:500900,700,replace = T),
col20 = sample(500000:500900,700,replace = T),
col21 = sample(500000:500900,700,replace = T),
col22 = sample(500000:500900,700,replace = T),
col23 = sample(500000:500900,700,replace = T),
col24 = sample(500000:500900,700,replace = T),
col25 = sample(500000:500900,700,replace = T),
col26 = sample(500000:500900,700,replace = T),
col27 = sample(500000:500900,700,replace = T),
col28 = sample(500000:500900,700,replace = T),
col29 = sample(500000:500900,700,replace = T),
col30 = sample(500000:500900,700,replace = T),
col31 = sample(500000:500900,700,replace = T),
col32 = sample(500000:500900,700,replace = T),
col33 = sample(500000:500900,700,replace = T),
col34 = sample(500000:500900,700,replace = T),
col35 = sample(500000:500900,700,replace = T),
col36 = sample(500000:500900,700,replace = T),
col37 = sample(500000:500900,700,replace = T),
col38 = sample(500000:500900,700,replace = T),
col39 = sample(500000:500900,700,replace = T),
col40 = sample(500000:500900,700,replace = T),
col41 = sample(500000:500900,700,replace = T),
col42 = sample(500000:500900,700,replace = T),
col43 = sample(500000:500900,700,replace = T))
#find all ids via table
uniqueIDs <- as.character(unique(unlist(df1)))
#creating adjacency matrix (note: df1, not dt1 -- the original snippet mixed the two names)
mat <- sapply(uniqueIDs, function(s) apply(df1, 1, function(x) s %in% x))
#clean-up
colnames(mat) <- uniqueIDs
rownames(mat) <- paste0("row", seq(nrow(df1)))
mat <- data.table(t(mat))
#apply logical evaluation to count number of TRUE
mat$logEval <- rowSums(mat == TRUE)
I want to make a small update to make my overall goal clear:
- The dataset has x (43) obs, and each obs has y (200) nbrids.
The goal of running the code above is to create an adjacency matrix identifying which nbrids (y) appear in each column. [E.g., from the unique nbrids: does y(1) appear in x(i); does y(2) ... does y(900).]
I do not care about x itself. The end goal is: from the unique ids throughout the matrix, which uniqueids appear together and how often [this is why I create the logic test to count .n(i)==TRUE]. For those > 2, I can filter, as such rows likely share nbrids.
Example end matrix:
From To Weight
50012 50056 5
50012 50032 3
…
50063 50090 9
That's quite a mouthful!
If I understand your requirements correctly, the following should work:
df1 = …
tdf1 = as.data.frame(t(df1))
unique_ids = as.character(unique(unlist(df1)))
# mat = sapply(tdf1, `%in%`, x = unique_ids)
mat = vapply(tdf1, `%in%`, logical(length(unique_ids)), x = unique_ids)
rownames(mat) = unique_ids
colnames(mat) = paste0('row', seq_len(ncol(mat))) # ??? Really?!
log_eval = rowSums(mat)
Note in particular that mat in my code does not need to be transposed, since it is already in the correct orientation. The commented-out sapply line is equivalent to the vapply line, but the latter is more explicit and performs stricter type checking, which makes it less error-prone if the data changes unexpectedly. vapply may also be more efficient, although for your example data the difference is not noticeable.
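As a toy illustration of that type check (made-up values, not your data), vapply fails loudly when a result does not match the declared shape, whereas sapply silently infers one:

```r
x <- list(a = 1:3, b = 4:6)
sapply(x, mean)              # shape of the result is inferred from the data
vapply(x, mean, numeric(1))  # same values, but errors immediately if any
                             # element's result is not exactly one numeric
# vapply(x, range, numeric(1))  # would throw an error: range() returns 2 values
```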
Incidentally, to generate the random df1, you can shorten the 43 lines of code to
df1 = as.data.frame(replicate(43, sample(500000 : 500900, 700, replace = TRUE)))
Second edit:
These options appear to reach the expected output from your edit. Both rely on self-joins to find the combinations. The first option uses lapply() to self-join one column at a time, while the second melt()s and then self-joins the whole dataset. On smaller datasets lapply() is slower, but when I tried 7,000 rows it still went through, whereas the melt-and-self-join version created too large a data frame.
Also note that this dataset does not actually have many unique values. If I knew it was sparse, I would probably add a line to filter out values that are not duplicated anywhere in the dataset.
library(data.table)
# generate data -----------------------------------------------------------
set.seed(1234)
dt1<- data.table(replicate(43, sample(500000:500900,700, replace = TRUE)))
rbindlist(
lapply(dt1
, function(x) {
nbrid_dt = data.table(nbrid = unique(x))
nbrid_dt[nbrid_dt
, on = .(nbrid < nbrid)
, j = .(From = x.nbrid, To = i.nbrid)
, nomatch = 0L
, allow.cartesian = TRUE]
}
)
)[, .N, keyby = .(From, To)]
From To N
1: 500000 500001 11
2: 500000 500002 11
3: 500000 500003 7
4: 500000 500004 9
5: 500000 500005 13
---
405446: 500897 500899 12
405447: 500897 500900 10
405448: 500898 500899 13
405449: 500898 500900 12
405450: 500899 500900 9
#all at once
molten_dt <- unique(melt(dt1))
setkey(molten_dt, variable)
molten_dt[molten_dt
, on = .(value < value
,variable = variable
)
, .(From = x.value, To = i.value)
, allow.cartesian = TRUE
, nomatch = 0L
][!is.na(From), .N, keyby = .(From, To)]
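As noted above, a pre-filter could help when the data are sparse. A minimal sketch (my own addition, assuming dt1 as generated earlier and that you only care about pairs seen in more than one column): an id found in a single column can never yield a pair with N > 1, so it can be dropped before the join.

```r
library(data.table)
molten_dt <- unique(melt(dt1, measure.vars = names(dt1)))
# ids present in more than one column; singletons cannot produce pairs with N > 1
dup_ids <- molten_dt[, .N, by = value][N > 1L, value]
molten_dt <- molten_dt[value %in% dup_ids]
# ...then proceed with the self-join shown above
```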
Original answer:
I don't entirely follow, but if you are mostly counting occurrences across your 43 columns, gathering/melting the data may be beneficial.
molten_dt <- melt(dt1)
molten_dt[, N := length(unique(variable)), by = value]
variable value N
1: V1 500102 9
2: V1 500560 8
3: V1 500548 9
4: V1 500561 12
5: V1 500775 9
---
8596: V43 500096 7
8597: V43 500320 6
8598: V43 500205 14
8599: V43 500711 7
8600: V43 500413 11
#or you can aggregate instead of mutate-in-place
molten_dt[, .(N = length(unique(variable))), by = value]
value N
1: 500102 9
2: 500560 8
3: 500548 9
4: 500561 12
5: 500775 9
---
897: 500753 4
898: 500759 4
899: 500816 6
900: 500772 4
901: 500446 2
Also, my answer does not agree 100% with @Konrad's. @Konrad's solution seems to produce an extra count when duplicate values are present.
Data:
set.seed(1234)
dt1<- as.data.table(replicate(43, sample(500000 : 500900, 200, replace = TRUE)))
#h/t to @Konrad for the quick way to make 43 columns
First edit:
If you are only interested in the count of each value, you can do the following:
mat_data <- matrix(replicate(43, sample(500000 : 500900, 700, replace = TRUE)), ncol = 43)
table(unlist(apply(mat_data, 2, unique)))
This is the fastest approach, but the problem is that you lose the information about which column each value came from.
Unit: milliseconds
expr min lq mean median uq max neval
melt_and_count 53.3914 53.8926 57.38576 55.95545 58.55605 79.2055 20
table_version 11.0566 11.1814 12.24900 11.56760 12.82110 16.4351 20
vapply_version 63.1623 64.8274 69.86041 67.84505 71.40635 108.2279 20
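For completeness, timings like those above could have been produced along these lines (a sketch only; the exact benchmarked expressions are my assumption, reconstructed from the code in this answer):

```r
library(data.table)
library(microbenchmark)

set.seed(1234)
dt1  <- as.data.table(replicate(43, sample(500000:500900, 700, replace = TRUE)))
tdf1 <- as.data.frame(t(dt1))
unique_ids <- as.character(unique(unlist(dt1)))

microbenchmark(
  melt_and_count = melt(dt1, measure.vars = names(dt1))[
    , .(N = length(unique(variable))), by = value],
  table_version  = table(unlist(apply(as.matrix(dt1), 2, unique))),
  vapply_version = vapply(tdf1, `%in%`, logical(length(unique_ids)), x = unique_ids),
  times = 20
)
```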