Simple fread operation with fill=TRUE fails

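The following code generates data files in which each row has a different number of columns. The option fill=TRUE seems to work only up to a certain limit. For instance, compare lines 1-3 of the code with lines 9-11, noting that both of those examples work as expected. How can I read the entire notworking1.dat with fill=TRUE enabled, and not just the first 100 rows?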

for (i in seq(1000,1099,by=1)) 
    cat(file="working1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working1.dat", fill=TRUE)

for (i in seq(1000,1101,by=1)) 
    cat(file="notworking1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "notworking1.dat", fill=TRUE)

for (i in seq(1,101,by=1)) 
    cat(file="working2.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working2.dat", fill=TRUE)

The following attempted solution also fails:

df <- fread(input = "notworking1.dat", fill=TRUE, col.names=paste0("V", seq_len(1101)))

I get this warning message:

Warning message:
In data.table::fread(input = "notworking1.dat", fill = TRUE) :
  Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3 4 ...

We can find the maximum number of columns, write a header line with that many column names, and then fread:

library(data.table)

x <- readLines("notworking1.dat")
myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")

# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)

# check output
dim(d1)
# [1]  102 1101
d1[100:102, 1101]
#    V1101
# 1:    NA
# 2:    NA
# 3:  1101

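But since we have already imported the data with readLines, we can just parse it directly. Note that the resulting columns are character, not integer: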

x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)

# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))

# check output
dim(d2)
# [1]  102 1101
d2[100:102, 1101]
#    V1101
# 1:  <NA>
# 2:  <NA>
# 3:  1101
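
The readLines approach above can be wrapped into a small helper. This is only a sketch in base R: the function name read_ragged and the temporary file are made up for illustration, and it returns a plain data.frame rather than a data.table. It uses lapply plus rbind instead of sapply so the result is always a matrix, and type.convert to turn the character columns back into numbers.

```r
# Hypothetical helper: read a file whose lines have differing field counts.
read_ragged <- function(path, sep = " ") {
  lines  <- readLines(path)
  fields <- strsplit(lines, sep, fixed = TRUE)
  n_col  <- max(lengths(fields))

  # Indexing past the end of a vector yields NA, which pads short rows
  padded <- lapply(fields, `[`, seq_len(n_col))
  mat    <- do.call(rbind, padded)

  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(df) <- paste0("V", seq_len(n_col))

  # Columns are character at this point; convert them where possible
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

# Small synthetic file built the same way as the files in the question
tmp <- tempfile(fileext = ".dat")
for (i in 1:4) cat(1:i, "\n", file = tmp, append = TRUE)

res <- read_ragged(tmp)
dim(res)
```

If a data.table is needed, the result could be passed to data.table::as.data.table afterwards.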

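This is a known issue, GitHub issue 5119. It was not implemented at the time of writing, but the suggestion there is that fill should also accept an integer as input. The solution would then look like this: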

d <- fread(input = "notworking1.dat", fill = 1101)
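
The column count does not have to be hard-coded. As a rough sketch, base R's count.fields can report the number of fields on each line. The temporary file below is made up for illustration, and the final fread call is left as a comment because it assumes a data.table version in which fill accepts an integer, as proposed in the issue.

```r
# Small synthetic ragged file, built like the files in the question
tmp <- tempfile(fileext = ".dat")
for (i in 1:5) cat(1:i, "\n", file = tmp, append = TRUE)

# Number of whitespace-separated fields on each line
n_fields <- utils::count.fields(tmp)
max_cols <- max(n_fields)

# Assumption: fread accepts an integer for fill (see the issue above)
# d <- data.table::fread(tmp, fill = max_cols)
```

Looking at n_fields also shows which line first has more fields than the lines before it, which is handy when reading the "Stopped early" warning.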