rbindlist a list column of data.frames and select unique values

Question

I have a data.table 'DT' in which one column ('col2') is a list of data.frames:

require(data.table)
DT <- data.table(col1 = c('A','A','B'),
                 col2 = list(data.frame(colA = c(1,3,54, 23), 
                                        colB = c("aa", "bb", "cc", "hh")),
                             data.frame(colA =c(23, 1),
                                       colB = c("hh", "aa")), 
                             data.frame(colA = 1,
                                       colB = "aa")))

> DT
   col1         col2
1:    A <data.frame>
2:    A <data.frame>
3:    B <data.frame>

> DT$col2
[[1]]
  colA colB
1    1   aa
2    3   bb
3   54   cc
4   23   hh

[[2]]
  colA colB
1   23   hh
2    1   aa

[[3]]
  colA colB
1    1   aa

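Each data.frame in col2 has two columns, colA and colB. I would like a data.table output that binds the unique rows of those data.frames, grouped by col1 of DT. I guess it is like using rbindlist inside the aggregate function of a data.table.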

This is the desired output:

> #desired output
> output
   colA colB col1
1:    1   aa    A
2:    3   bb    A
3:   54   cc    A
4:   23   hh    A
5:    1   aa    B

The data.frame in the second row of DT (DT[2, col2]) has duplicate entries, and only unique entries are desired for each unique col1.

I tried the following, but it throws an error:

desired_output <- DT[, lapply(col2, function(x) unique(rbindlist(x))), by = col1]
# Error in rbindlist(x) : 
#   Item 1 of list input is not a data.frame, data.table or list

This 'works', although it is not the desired output:

unique(rbindlist(DT$col2))
   colA colB
1:    1   aa
2:    3   bb
3:   54   cc
4:   23   hh

Is it possible to use rbindlist inside the aggregate function of a data.table?

Answer 1

You could do something hacky like this:

nDT <- cbind(rbindlist(DT[[2]]), col1 = rep(DT[[1]], sapply(DT[[2]], nrow)))
nDT[!duplicated(nDT)]
   colA colB col1
1:    1   aa    A
2:    3   bb    A
3:   54   cc    A
4:   23   hh    A
5:    1   aa    B

Or use tidyr (inspired by PKumar's comment):

unique(tidyr::unnest(DT, cols = col2))  # tidyr >= 1.0.0 expects the list column to be named via `cols`

Or, more generally, in base R (note that substr(..., 1, 1) assumes the col1 values are a single character):

names(DT[[2]]) <- DT[[1]]
ndf <- do.call(rbind, DT[[2]])
ndf$col1 <- substr(row.names(ndf), 1, 1)
unique(ndf)
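As a variation on the base R idea above, here is a minimal sketch that attaches the key to each data.frame before binding, so it does not depend on parsing row names. The names `pieces` and `key` are illustrative only, and the inputs are rebuilt as plain vectors and lists so the example runs without data.table.

```r
# Rebuild the question's inputs as a plain vector and list (base R only)
col1 <- c("A", "A", "B")
col2 <- list(data.frame(colA = c(1, 3, 54, 23), colB = c("aa", "bb", "cc", "hh")),
             data.frame(colA = c(23, 1), colB = c("hh", "aa")),
             data.frame(colA = 1, colB = "aa"))

# Add the grouping value as a column of each data.frame, element by element
pieces <- Map(function(df, key) cbind(df, col1 = key), col2, col1)

# Bind all pieces and keep only the distinct rows
ndf <- unique(do.call(rbind, unname(pieces)))
ndf
```

Because the key travels with each row, this should also cope with col1 values longer than one character.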

Answer 2

This works:

DT1<-apply(DT, 1, function(x){cbind(col1=x$col1,x$col2)})
unique(rbindlist(DT1))
#   col1 colA colB
#1:    A    1   aa
#2:    A    3   bb
#3:    A   54   cc
#4:    A   23   hh
#5:    B    1   aa

Answer 3

Group by 'col1' and run rbindlist on 'col2':

unique(DT[ , rbindlist(col2), by = col1]) # trimmed thanks to @snoram
#    col1 colA colB
# 1:    A    1   aa
# 2:    A    3   bb
# 3:    A   54   cc
# 4:    A   23   hh
# 5:    B    1   aa
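A short sketch of why this form succeeds where the attempt in the question failed, as far as I can tell: within each group, col2 is already a list of data.frames, so it can be handed to rbindlist directly. Wrapping it in lapply passes one data.frame at a time, and rbindlist then treats that data.frame's columns as the list items. The variable names `per_group`, `out`, and `single_df_fails` are illustrative only.

```r
library(data.table)

DT <- data.table(col1 = c("A", "A", "B"),
                 col2 = list(data.frame(colA = c(1, 3, 54, 23),
                                        colB = c("aa", "bb", "cc", "hh")),
                             data.frame(colA = c(23, 1), colB = c("hh", "aa")),
                             data.frame(colA = 1, colB = "aa")))

# Per group, col2 is a list of data.frames: bind them, one block per col1 value
per_group <- DT[, rbindlist(col2), by = col1]

# De-duplicate across all columns, including col1
out <- unique(per_group)

# Passing a single data.frame fails, as in the question: its items are plain vectors
single_df_fails <- inherits(try(rbindlist(DT$col2[[1]]), silent = TRUE), "try-error")
```

Here `per_group` holds every row tagged with its group, and `unique` then removes the rows repeated within group A.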

Answer 4

"only unique entries are desired for each unique col1"

If you add a column for col1, the expression above simply means "unique entries" (not conditional on any column).

Henrik's answer is one way to keep col1. Another is:

unique(DT[, rbindlist(setNames(col2, col1), idcol="col1")])  # idcol spelled out; id= only worked via partial matching

I would guess this is more efficient than

bycols = "col1"
unique(DT[, rbindlist(col2), by=bycols])   # Henrik's

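though the extension to cases where (1) col1 is not a character column (and so not suited to setNames) or (2) there are multiple by= columns is not so obvious. For either of these cases, I would make the .id column equal to the row number of DT and then copy the columns over: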

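bycols = "col1"
res = DT[, rbindlist(col2, idcol="DT_row")]   # DT_row = row number of DT each piece came from
res[, (bycols) := DT[DT_row, ..bycols]]       # copy the by= columns over
res = unique(res[, DT_row := NULL])           # drop the helper id first, otherwise rows repeated across DT rows survive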
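To illustrate the multi-column case mentioned above, here is a minimal sketch of the row-number approach with two by= columns, one of them an integer. The table `DT2`, its columns `g1` and `g2`, and the helper `keys` are hypothetical and not part of the question.

```r
library(data.table)

# Hypothetical input with two grouping columns, one of them an integer
DT2 <- data.table(g1 = c(1L, 1L, 2L),
                  g2 = c("x", "x", "y"),
                  col2 = list(data.frame(colA = c(1, 3), colB = c("aa", "bb")),
                              data.frame(colA = 1, colB = "aa"),
                              data.frame(colA = 1, colB = "aa")))
bycols <- c("g1", "g2")

# Stack the list column, recording which row of DT2 each piece came from
res <- DT2[, rbindlist(col2, idcol = "DT_row")]

# Look up the grouping columns by row number and copy them over
keys <- DT2[res$DT_row, ..bycols]
res[, (bycols) := keys]

# Drop the helper id, then de-duplicate within each (g1, g2) combination
res[, DT_row := NULL]
res <- unique(res)
res
```

Dropping `DT_row` before calling `unique` matters: with it in place, a row repeated across two rows of `DT2` in the same group would still count as distinct.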

To put those columns first/leftmost, I think setcolorder(res, bycols) should work, but my version of data.table is too old to see it doing so.
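A small sketch of that reordering, assuming a reasonably recent data.table in which setcolorder accepts a subset of the column names and moves them to the front. The table `res` here is a stand-in built by hand.

```r
library(data.table)

# Stand-in result with the grouping column last
res <- data.table(colA = c(1, 3), colB = c("aa", "bb"), col1 = c("A", "A"))
bycols <- "col1"

# Move the by= columns to the front, by reference; the other columns keep their order
setcolorder(res, bycols)
names(res)
```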

There is also an open issue for a tidyr::unnest-like function.
