Find a row in a data.table that is the same as the header

Question

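I am reading a huge CSV file with fread. The data is somehow mis-formatted, and the header is repeated every now and then. I now want to remove these repeated headers from the file, so I have to search for rows whose content equals the header.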

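I can think of two solutions, but neither is ideal:

Option 1 assumes that no two non-header rows are identical, i.e. that they differ in at least one column.
Option 2 is very verbose and requires a lot of typing.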

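Basically, I need a way to loop over all columns and compare them with the header.

So the whole thing boils down to one question:

How do I find a specific row in a data.table without hard-coding the filter?

Code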

library(data.table)
foo <- data.frame(a = c(1:2, "a", 1:2, "a"), b = c(letters[1:2], "b", letters[2:1], "b"),
                  stringsAsFactors = FALSE)
setDT(foo)

## option 1: use duplicates, assuming that each row is otherwise unique
foo[-(which(duplicated(rbind(as.list(names(foo)), foo))) - 1)]

## option 2: compare directly, but becomes very cumbersome with growing number of columns
foo[!(a == names(foo)[1] & b == names(foo)[2])]

Answer 1

Anti-join:

setkeyv(foo, names(foo)) # reorders the data, though
foo[!list(names(foo))]

   a b
1: 1 a
2: 1 b
3: 2 a
4: 2 b

Without setting a key:

nfoo <- names(foo)
foo[!setNames(as.list(nfoo), nfoo), on = nfoo]

Answer 2

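Since the misplaced headers are exact repeats of the actual header, we only need to compare a single column. This is your option 2, but checking only the first (or any other) column: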

foo[ !(a == names(foo)[1]), ]
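To avoid hard-coding the column name `a`, the same check can be written by position. This is only a minimal sketch on the toy data from the question; it assumes the first column never legitimately contains its own name as a value.

```r
library(data.table)

# Toy data from the question: the header row ("a", "b") recurs in the body.
foo <- data.table(a = c(1:2, "a", 1:2, "a"),
                  b = c(letters[1:2], "b", letters[2:1], "b"))

# Compare the first column (taken by position) with its own name.
keep <- foo[[1]] != names(foo)[1]
res <- foo[keep]
```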

Or use grep to remove the headers outside of R, for example:

# grep -v drops every line containing "myCol1", including the first header line
fread(cmd = "grep -v myCol1 myfile.txt")
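If grep is not available, roughly the same pre-filtering can be sketched in plain R: read the raw lines, drop every later line that equals the first one, and hand the rest to fread through its `text` argument. The temporary file and its contents below are made up for illustration, and `readLines` holds the whole file in memory, so this may not suit a truly huge file.

```r
library(data.table)

# Hypothetical small file that mimics the problem: the header "a,b" repeats.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,a", "2,b", "a,b", "1,b", "2,a", "a,b"), tmp)

lines <- readLines(tmp)

# Keep the first line (the real header) and drop later exact copies of it.
keep <- c(TRUE, lines[-1] != lines[1])

# fread now sees clean input, so it can detect the column types itself.
clean <- fread(text = lines[keep])
```

Because the repeated headers are gone before parsing, column `a` is read as integer here rather than character.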

Or paste each row together and compare it with the header:

foo[ do.call(paste, c(foo, list(sep = "_"))) != paste(colnames(foo), collapse = "_"), ]
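A variant of the same idea without paste, and so without any risk of the separator colliding with the data, is to compare every column with its own name and combine the results. This is a sketch of an alternative technique, not part of the original answer; it uses `%in%` instead of `==` so that NA values are simply treated as "not the header".

```r
library(data.table)

# Toy data from the question.
foo <- data.table(a = c(1:2, "a", 1:2, "a"),
                  b = c(letters[1:2], "b", letters[2:1], "b"))

# For each column, flag the cells equal to that column's name,
# then require the match in all columns at once.
is_header <- Reduce(`&`, Map(`%in%`, foo, names(foo)))

res <- foo[!is_header]
```

This scales to any number of columns without writing out one condition per column.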

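I would prefer the second option (removing the headers with grep before reading). When header rows are embedded in the data, fread reads the affected columns as character, so the other "after-fread" solutions leave us with a column class problem that the grep approach avoids.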
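If you do end up with one of the "after-fread" solutions, the column classes can be repaired afterwards. A minimal sketch, assuming base R's `type.convert` guesses the types you actually want:

```r
library(data.table)

# Toy data from the question: both columns are character because of the headers.
foo <- data.table(a = c(1:2, "a", 1:2, "a"),
                  b = c(letters[1:2], "b", letters[2:1], "b"))

# Drop the repeated header rows first.
res <- foo[!(a == names(foo)[1])]

# Re-guess each column's type; as.is = TRUE keeps strings as character.
cols <- names(res)
res[, (cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = cols]
```

After this, `a` is integer again while `b` stays character.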
