Preventing column-class inference in fread()

Question

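Is there a way to make fread mimic the behaviour of read.table, whereby the class of each variable is set by the data that is actually read in?

I have numeric data with a few comments underneath the main data. When I use fread to read the data, the columns are converted to character. With read.table, however, I can stop this behaviour by setting nrows. Is this possible with fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks.

An example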

d <- data.frame(x=c(1:100, NA, NA, "fff"), y=c(1:100, NA, NA, NA))
write.csv(d, "test.csv", row.names=FALSE)

in_d <- read.csv("test.csv", nrows=100, header=TRUE)
in_dt <- data.table::fread("test.csv", nrows=100)

produces

> str(in_d)
'data.frame':   100 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10 ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>

As a workaround, I thought I could use read.table to read in one row, get the classes, and pass them to colClasses, but I seem to be misunderstanding something.

cl <- read.csv("test.csv", nrows=1, header=TRUE)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrows=100, colClasses=cols)
str(in_dt)

Using Windows 8.1, R version 3.1.2 (2014-10-31), platform: x86_64-w64-mingw32/x64 (64-bit)

Answer 1

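Option 1: Using a system command

fread() allows a system command as its first argument. We can use that to strip the quotes from the first column of the file.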

indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr>

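Explanation of the system command cat test.csv | tr -d '\"':

cat test.csv writes the file to standard output
| is a pipe, which feeds the output of the previous command to the next command as input
tr -d '\"' deletes (-d) every occurrence of the double-quote character ('\"') from its input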
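
Where cat and tr are not available (plain Windows, for instance), the same pipeline can be approximated in R alone. The following is a minimal sketch and not part of the original answer: it assumes a data.table version whose fread() accepts a text= argument, and it writes the question's example to a temporary file so that it is self-contained. Trimming the input to the header plus 100 data lines stands in for nrows = 100.

```r
library(data.table)

# Recreate the question's example file in a temporary location.
path <- tempfile(fileext = ".csv")
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, path, row.names = FALSE)

# Equivalent of `cat`: read every line of the file.
raw_lines <- readLines(path)

# Equivalent of `tr -d '"'`: delete all double quotes.
unquoted <- gsub('"', "", raw_lines, fixed = TRUE)

# Keep the header plus the first 100 data rows, so the trailing
# comment rows never reach fread()'s type detection.
indt <- fread(text = unquoted[1:101])
str(indt)
```

Because the comment rows are dropped before parsing, both columns should come back as integer without any post-processing.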

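Option 2: Coerce after reading

Since option 1 doesn't seem to work on your system, another possibility is to read the file as you did and then convert the x column with type.convert().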

library(data.table)
# as.is = TRUE is given explicitly; newer R versions warn when it is omitted
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]
str(indt2)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr>
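
If several columns come back as character, the same coercion can be applied to all of them in one step. This is a minimal sketch under stated assumptions: the small table and its columns x, y and z are made up for illustration and stand in for the result of fread().

```r
library(data.table)

# Hypothetical stand-in for a table in which fread() returned
# a numeric-looking column as character.
dt <- data.table(
  x = as.character(1:5),          # digits stored as text
  y = 1:5,                        # already integer
  z = c("a", "b", "c", "d", "e")  # genuinely character
)

# Find the character columns and convert each one by reference.
char_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (char_cols) := lapply(.SD, type.convert, as.is = TRUE),
   .SDcols = char_cols]

str(dt)
```

With as.is = TRUE, columns that do not look numeric (z here) stay character instead of becoming factors.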

Side note: I usually prefer type.convert() over as.numeric() because it avoids the "NAs introduced by coercion" warning that as.numeric() triggers in some cases. For example,

x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1]  1  4 NA  6
# Warning message:
# NAs introduced by coercion 
type.convert(x, as.is = TRUE)
# [1]  1  4 NA  6

But of course you can use as.numeric() as well.

Note: This answer assumes data.table dev v1.9.5.
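
The difference between the two functions shows up most clearly on input that is not numeric at all. A small base-R sketch (the vectors clean and dirty are made-up examples):

```r
clean <- c("1", "4", "NA", "6")
dirty <- c("1", "4", "fff")

# type.convert() treats the string "NA" as missing and picks the
# narrowest suitable type, here integer.
clean_converted <- type.convert(clean, as.is = TRUE)

# If any value is not numeric, the vector is left as character
# rather than being turned into NA.
dirty_converted <- type.convert(dirty, as.is = TRUE)

# as.numeric() always returns numeric; "fff" becomes NA, and the
# warning is suppressed here only to keep the output quiet.
dirty_numeric <- suppressWarnings(as.numeric(dirty))

class(clean_converted)
class(dirty_converted)
dirty_numeric
```

So type.convert() is the safer choice when a column may still contain stray text, since it does not silently replace that text with NA.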

Answer 2

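OK, so a client is abusing the CSV format by deliberately writing trailing string rows into an integer column, without starting those rows with the comment.char (#).

You then somehow expect to override fread()'s type inference and read the column as integer by using nrow to restrict it to the integer rows only. read.csv(..., nrow) will accept that, but fread() always uses all rows for type inference (not just those selected by nrow, skip and header), even rows that begin with comment.char (that part is a bug).

This sounds like CSV abuse. Your comment rows should be prefixed with #.
Yes, fread() needs a fix/enhancement so that comment rows are ignored during type inference.
For now, you can work around fread() by post-processing the data.table that gets read in.
Whether fread() should be changed to support the behaviour you want, namely using nrows to limit what is exposed to type inference, is debatable. It might fix your (pretty unique) case and break others.

I don't see why you (EDIT: the client) can't write the comments to a separate .txt/README/data-dictionary file that accompanies the .csv. Using a separate data-dictionary file is a well-established practice. I have never seen anyone do this to a CSV file before. At the very least, move the comments to the header instead of the footer.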
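
To illustrate the first point, here is a minimal base-R sketch of what prefixing the comment rows with # buys you. The file contents are made up for illustration, and the example uses read.csv(), whose comment.char argument is documented; it makes no claim about fread().

```r
# Hypothetical file: three data rows followed by a note that is
# marked as a comment with a leading '#'.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "x,y",
  "1,1",
  "2,2",
  "3,3",
  "# fff: free-text note added below the data"
), path)

# read.csv() disables comment handling by default (comment.char = ""),
# so it has to be switched on explicitly.
in_d <- read.csv(path, comment.char = "#")
str(in_d)
```

With the note marked as a comment, it is dropped before type conversion and both columns are read as integer without nrows.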
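
As a rough illustration of why comments in the header are easier to handle than comments in the footer, the sketch below skips a known number of leading lines. The file contents are hypothetical, and base R's read.csv() is used; fread() has a skip argument as well.

```r
# Hypothetical file with two free-text lines above the real header.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Notes: x is a sample id",
  "Notes: y is a count",
  "x,y",
  "1,10",
  "2,20"
), path)

# Skip the two note lines; the third line is then used as the header.
in_d <- read.csv(path, skip = 2)
str(in_d)
```

The notes never take part in type conversion, so no row limit or post-processing is needed.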
