R - ff package: find the most frequent element in an ffdf and delete the rows where it occurs
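I need advice on finding the most frequent element in an ffdf and then deleting the rows in which it occurs.
I decided to try the ff package because I am working with very large data and I run out of memory with base R.
Here is a small example: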
# create a base R Matrix
> z<-matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),nrow=5,ncol=2,byrow = TRUE)
> z
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "b" "b"
[4,] "c" "c"
[5,] "b" "a"
# convert z to ffdf
> u=as.data.frame(z, stringsAsFactors=TRUE)
> u=as.ffdf(u)
> u
ffdf data
V1 V2
1 a b
2 a c
3 b b
4 c c
5 b a
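I am looking for a way to:
- extract the most frequent element in the ffdf ("b" in this example)
- delete all rows of the ffdf that contain "b"
The new ffdf should therefore look like this: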
V1 V2
1 a c
2 c c
In base R I found the "table" function:
temp <- table(as.vector(z))
t1<-names(temp)[temp == max(temp)]
z1<- z[rowSums(z== t1[1]) == 0, ]
But to handle large data I need something like the ff package.
require(ff)
z <- matrix(c("a","b","f","c","f","b","e","c","b","e"),nrow=5,ncol=2,byrow = TRUE)
u <- as.data.frame(z, stringsAsFactors=TRUE)
u <- as.ffdf(u)
u
The following should work for a dataset of any size. It uses table.ff and ffwhich from ffbase, ffrowapply from ff, and indexing based on an ff integer vector.
require(ffbase)
require(plyr)
## Detect most frequent item (assuming the levels of all columns can be different)
columnfreqs <- lapply(colnames(u), FUN=function(column) table(u[[column]]))
columnfreqs <- lapply(columnfreqs, FUN=function(x) as.data.frame(t(as.matrix(x))))
itemfreqs <- colSums(do.call(rbind.fill, columnfreqs), na.rm=TRUE)
mostfrequent <- names(sort(itemfreqs, decreasing = TRUE))[1]
## Flag, chunk by chunk, the rows of the ffdf that contain the most frequent item
idx <- ffrowapply(
  EXPR = apply(u[i1:i2,], MARGIN=1, FUN=function(row) any(row %in% mostfrequent)),
  X=u,
  RETURN = TRUE, FF_RETURN = TRUE, RETCOL = NULL, VMODE = "logical")
idx <- ffwhich(idx, idx != TRUE) # positions of rows without the item (logical -> integer index)
## Keep only the rows that do not contain the most frequent item
u[idx, ]
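To check the logic on a small data set without ff installed, the same two-pass idea (count in chunks, then filter in chunks) can be sketched in base R. This is only an illustration under stated assumptions: `drop_most_frequent` and `chunk_size` are hypothetical names, not part of ff or ffbase, and an ordinary data.frame stands in for the ffdf, so it does not save memory by itself.

```r
# Hypothetical helper (not part of ff/ffbase): base R sketch of the two-pass idea.
drop_most_frequent <- function(df, chunk_size = 2L) {
  n <- nrow(df)
  starts <- seq(1L, n, by = chunk_size)

  # Pass 1: accumulate item counts one chunk of rows at a time.
  counts <- integer(0)
  for (s in starts) {
    e <- min(s + chunk_size - 1L, n)
    values <- unlist(lapply(df[s:e, , drop = FALSE], as.character),
                     use.names = FALSE)
    tab <- table(values)
    new_items <- setdiff(names(tab), names(counts))
    counts[new_items] <- 0L
    counts[names(tab)] <- counts[names(tab)] + as.integer(tab)
  }
  # On ties, which.max() returns the first item reaching the maximum.
  most <- names(counts)[which.max(counts)]

  # Pass 2: flag, chunk by chunk, the rows that do not contain that item.
  keep <- logical(n)
  for (s in starts) {
    e <- min(s + chunk_size - 1L, n)
    block <- as.matrix(df[s:e, , drop = FALSE])
    keep[s:e] <- rowSums(block == most, na.rm = TRUE) == 0
  }
  list(most_frequent = most, result = df[keep, , drop = FALSE])
}

# Usage on the example from the question
z <- matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),
            nrow = 5, ncol = 2, byrow = TRUE)
u <- as.data.frame(z, stringsAsFactors = TRUE)
out <- drop_most_frequent(u, chunk_size = 2L)
out$most_frequent  # "b"
out$result         # original rows 2 and 4: (a, c) and (c, c)
```

The result keeps the original row names (2 and 4); reset them with `rownames(out$result) <- NULL` if consecutive numbering is wanted.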