Get rows with common value into lists

Question

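I'm trying to collect rows into lists based on the value in the "Type of region" column, and then put those lists into another data structure (a vector or a list). The data looks like this (~700,000 rows):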

chr CS  CE  CloneName   score   strand  # locs per clone    # capReg alignments Type of region  
chr1    10027684    10028042    clone_11546 1   +   1   1   chr1_10027880_10028380_DNaseI
chr1    10027799    10028157    clone_11547 1   +   1   1   chr1_10027880_10028380_DNaseI
chr1    10027823    10028181    clone_11548 1   -   1   1   chr1_10027880_10028380_DNaseI
chr1    10027841    10028199    clone_11549 1   +   1   1   chr1_10027880_10028380_DNaseI

This is what I tried:

typeReg=dat[!duplicated(dat$`Type of region`),]

res=list()  # container for one data frame per region
for(i in 1:nrow(typeReg)){
    res[[i]]=dat[dat$`Type of region`==typeReg[i,]$`Type of region`,]
}

The for loop took too long, so I tried using apply:

res=apply(typeReg, 1, function(x){
    tmp=dat[dat$`Type of region`==x[9],]
})

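But this is also very slow (the "Type of region" column has 300,000 unique values). Do you have a solution to my problem? Or is it normal for this to take so long?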

Answer 1

You can use split():

type <- as.factor(dat$`Type of region`)
split(dat, type)
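
To illustrate what split() returns, here is a minimal sketch on a toy data frame. The data frame and its values (regA, regB, regC, clone_1 and so on) are made up for illustration and only stand in for the real `dat`.

```r
# Toy stand-in for `dat`; all values here are hypothetical.
dat <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2"),
  CloneName = c("clone_1", "clone_2", "clone_3", "clone_4"),
  `Type of region` = c("regA", "regA", "regB", "regC"),
  check.names = FALSE,       # keep the column name with spaces as-is
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per distinct value.
res <- split(dat, dat$`Type of region`)

names(res)       # "regA" "regB" "regC"
nrow(res$regA)   # 2
```

Each element can then be accessed by region name, for example res[["regB"]], without scanning the whole data frame again.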

However, as mentioned in the comments, dplyr::group_by() may be a better choice, depending on what you want to do afterwards.
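
As a rough sketch of the group_by() route, assuming the dplyr package is installed and that the goal is a per-region summary rather than a list of data frames, something like the following could work. The toy data and the summary columns (n_clones, min_start, max_end) are assumptions chosen for illustration.

```r
library(dplyr)

# Toy stand-in for `dat`; all values here are hypothetical.
dat <- data.frame(
  CS = c(100, 120, 300, 500),
  CE = c(200, 220, 400, 600),
  `Type of region` = c("regA", "regA", "regB", "regC"),
  check.names = FALSE        # keep the column name with spaces as-is
)

# One summary row per region, without materialising a list of data frames.
per_region <- dat %>%
  group_by(`Type of region`) %>%
  summarise(n_clones = n(), min_start = min(CS), max_end = max(CE))
```

This keeps everything in a single grouped table, which may avoid creating 300,000 separate objects if only aggregated values are needed.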

Answer 2

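OK, so split works, but subsetting does not drop the unused factor levels in my df. So basically, every data frame created by split carried along the 300,000 levels from the original df, hence the huge size of the resulting list. Possible solutions are to call droplevels() on each data frame in the list (not optimal if the list is too big to fit in RAM), to use a for loop (this solution is really slow), or to remove the columns causing the problem, which is what I did.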

res=split(dat[,c(-4,-9)], dat$`Type of region`, drop=TRUE)
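
The factor-level behaviour described above can be reproduced on a small scale. This is a sketch on made-up data, assuming the CloneName and "Type of region" columns were read in as factors; it is not the original data.

```r
# Toy stand-in for `dat` with factor columns; all values are hypothetical.
dat <- data.frame(
  CloneName = factor(c("clone_1", "clone_2", "clone_3", "clone_4")),
  `Type of region` = factor(c("regA", "regA", "regB", "regC")),
  check.names = FALSE
)

res <- split(dat, dat$`Type of region`)

# Each piece holds only its own rows, but its factor columns
# still carry every level of the original data frame.
nlevels(res$regA$CloneName)         # 4

# droplevels() on a data frame drops unused levels from all factor columns.
res_small <- lapply(res, droplevels)
nlevels(res_small$regA$CloneName)   # 2
```

Another option, if the factor encoding is not needed, would be to read those columns as character vectors in the first place (for example with stringsAsFactors = FALSE), so that no levels are attached to the pieces at all.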

r

bioinformatics