Simple fread operation with fill=TRUE fails
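The following code generates data files in which every line has a different number of columns. The option fill=TRUE seems to work only up to a certain limit. For example, compare the working1.dat example with the working2.dat example: both work as expected. How can I read the whole of notworking1.dat with fill=TRUE enabled, and not just the first 100 lines?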
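library(data.table)

# 100 lines with 1000 to 1099 fields: read as expected
for (i in seq(1000, 1099, by = 1))
  cat(file = "working1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working1.dat", fill = TRUE)

# 102 lines with 1000 to 1101 fields: reading stops early
for (i in seq(1000, 1101, by = 1))
  cat(file = "notworking1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "notworking1.dat", fill = TRUE)

# 101 lines with 1 to 101 fields: read as expected
for (i in seq(1, 101, by = 1))
  cat(file = "working2.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working2.dat", fill = TRUE)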
The following attempt also fails:
df <- fread(input = "notworking1.dat", fill=TRUE, col.names=paste0("V", seq_len(1101)))
I get this warning message:
Warning message: In data.table::fread(input = "notworking1.dat", fill = TRUE) : Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3 4 ...
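The numbers in the warning line up with the layout of the file: the first 100 lines hold at most 1099 fields, and line 101 is the first one with 1100. This is consistent with the column count having been settled from the earlier lines. A minimal base R sketch to check this, assuming a temporary copy of the file (the path is an assumption, the question writes to the working directory):

```r
# Rebuild a file shaped like notworking1.dat in a temporary location
path <- tempfile(fileext = ".dat")
lines <- vapply(1000:1101,
                function(i) paste(seq_len(i), collapse = " "),
                character(1))
writeLines(lines, path)

# count.fields() returns the number of whitespace-separated fields per line
counts <- count.fields(path)

length(counts)      # 102 lines in total
max(counts[1:100])  # 1099: widest of the first 100 lines
counts[101]         # 1100: the line named in the warning
max(counts)         # 1101: columns needed to hold every line
```

The last value is the column count that any workaround has to supply.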
We can find the maximum number of columns, write a header with that many columns, and then fread:
x <- readLines("notworking1.dat")
myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")
# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)
# check output
dim(d1)
# [1] 102 1101
d1[100:102, 1101]
# V1101
# 1: NA
# 2: NA
# 3: 1101
But since we have already imported the data with readLines, we can simply parse it directly:
x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)
# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))
# check output
dim(d2)
# [1] 102 1101
d2[100:102, 1101]
# V1101
# 1: <NA>
# 2: <NA>
# 3: 1101
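The `<NA>` in the output above indicates that the parsed columns are character, because strsplit returns strings. If integer columns are wanted, one possible variation is to convert while padding. This is a sketch on a small stand-in input (the three-line vector `x` and the name `d3` are assumptions for illustration):

```r
library(data.table)

# Small ragged input standing in for readLines("notworking1.dat")
x <- c("1", "1 2", "1 2 3")
xSplit <- strsplit(x, " ", fixed = TRUE)
n_cols <- max(lengths(xSplit))

# Convert each row to integer, then index past its end to pad with NA
rows <- lapply(xSplit, function(r) as.integer(r)[seq_len(n_cols)])

# Row-bind into an integer matrix and convert to data.table (columns V1..Vn)
d3 <- as.data.table(do.call(rbind, rows))

dim(d3)
sapply(d3, class)
```

Indexing a vector beyond its length yields NA, which is what pads the short rows here.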
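This is a known issue, GitHub issue 5119. It has not been implemented yet, but the suggestion there is that fill should also accept an integer as input. The solution would then look like this: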
d <- fread(input = "notworking1.dat", fill = 1101)
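Because the integer form of fill may not be available in every installed version of data.table, a defensive sketch is to try it first and fall back to the header workaround when the call errors or returns too few rows. The helper name `read_ragged` is hypothetical and not part of data.table, and the sketch assumes a space-separated file:

```r
library(data.table)

# Hypothetical helper: prefer fread(fill = <integer>), otherwise pad with a header
read_ragged <- function(path) {
  counts <- count.fields(path)   # whitespace-separated fields per line
  n_cols <- max(counts)

  # Try the integer form of fill; treat an error as "not supported here"
  out <- tryCatch(
    fread(path, sep = " ", header = FALSE, fill = n_cols),
    error = function(e) NULL
  )

  # Fall back if the call failed or stopped before the end of the file
  if (is.null(out) || nrow(out) < length(counts)) {
    tmp <- tempfile(fileext = ".txt")
    on.exit(unlink(tmp))
    header <- paste0("V", seq_len(n_cols), collapse = " ")
    writeLines(c(header, readLines(path)), tmp)
    out <- fread(tmp, sep = " ", header = TRUE, fill = TRUE)
  }
  out
}

# Usage on a small ragged file
path <- tempfile(fileext = ".dat")
writeLines(c("1", "1 2", "1 2 3"), path)
d <- read_ragged(path)
dim(d)
```

Either branch should give one row per input line and as many columns as the widest line.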