Fread unusual line ending causing error

Question

I am trying to download a large database of NYC taxi data, publicly available on the NYC TLC website.

library(data.table)
feb14 <- fread('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv', header = T)

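Running the code above successfully downloads the data (which takes a few minutes), but parsing then fails with an internal error. I also tried removing header = T.

Is there a workaround for handling "unusual line endings" in fread?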

Error in fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv",  : 
  Internal error. No eol2 immediately before line 3 after sep detection.
In addition: Warning message:
In fread("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv",  :
  Detected eol as \n\r, a highly unusual line ending. According to Wikipedia the Acorn BBC used this. If it is intended that the first column on the next row is a character column where the first character of the field value is \r (why?) then the first column should start with a quote (i.e. 'protected'). Proceeding with attempt to read the file.

Answer 1

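Sometimes other options such as read.csv/read.table behave differently, so you can always give those a try. (Perhaps the source code explains why; I have not looked into it.)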

Another option is to read a file like this with readLines(). As far as I know, no parsing/formatting happens there; it is the most basic way to read a file.
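A minimal sketch of the readLines() idea, using a small stand-in file rather than the real download. The file contents and the names tmp, first_lines and blank_idx are made up for illustration; the point is only that readLines() returns the raw lines, so an empty line near the top is easy to spot.

```r
# Stand-in for the downloaded CSV: a header, an empty line, then data.
# (Hypothetical contents, not the real TLC file.)
tmp <- tempfile(fileext = ".csv")
writeLines(c("vendor_id,passenger_count,trip_distance",
             "",
             "CMT,1,0.7",
             "VTS,2,3.1"), tmp)

# Read only the first few lines, without any parsing into columns
first_lines <- readLines(tmp, n = 3)
print(first_lines)

# Positions of lines that are empty or whitespace-only
# (trimws() also strips a stray \r)
blank_idx <- which(!nzchar(trimws(first_lines)))
print(blank_idx)
```

Using n = 3 keeps this cheap even on a very large file, since only the first lines are read.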

Finally, a quick fix: use the 'skip = ...' option in fread, or control where reading stops with 'nrows = ...'.
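As a rough sketch of the skip/nrows suggestion, assuming the data.table package is installed and that the troublesome part is the first two lines (header plus an empty line): the stand-in file, the column names and the value skip = 2 are assumptions for illustration, not something verified against the real file.

```r
library(data.table)

# Stand-in file (hypothetical contents): header, empty line, three data rows
tmp <- tempfile(fileext = ".csv")
writeLines(c("vendor_id,passenger_count,trip_distance",
             "",
             "CMT,1,0.7",
             "VTS,2,3.1",
             "CMT,4,1.2"), tmp)

# Skip the header and the empty line, read only the first two data rows,
# and supply the column names by hand since the header was skipped
first_rows <- fread(tmp,
                    skip = 2,
                    nrows = 2,
                    header = FALSE,
                    col.names = c("vendor_id", "passenger_count", "trip_distance"))
print(first_rows)
```

Skipping the header means the column names have to be supplied manually, which is the price of this quick fix.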

Answer 2

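There is something fishy with fread here. data.table is the faster, more performance-oriented way to read large files, but in this case the behavior is not optimal. You may want to raise this as an issue on github.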

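I was able to reproduce the issue on the downloaded file even with nrows = 5 or nrows = 1, but only when sticking to the original file. If I copy-paste the first few lines into a new file and try again, the issue goes away. The issue also goes away if I read directly from the web with a small nrows. This is not even an encoding issue, hence my recommendation to raise an issue.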
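One way to check what the line endings actually look like is to inspect the first raw bytes of the file. This is only a sketch on a tiny stand-in file written with \n\r endings (the sequence named in the fread warning); the file contents and variable names are assumptions, and on the real download you would point readBin() at the saved file instead.

```r
# Stand-in file written byte-for-byte with "\n\r" after each line
# (hypothetical contents, chosen to mimic the warning message)
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("vendor_id,passenger_count\n\rCMT,1\n\rVTS,2\n\r"), tmp)

# Read the first bytes without any newline translation
head_bytes <- readBin(tmp, what = "raw", n = 64)
print(head_bytes)  # 0a is \n (LF), 0d is \r (CR)

# Check which two-byte sequences occur in the sample
head_text <- rawToChar(head_bytes)
has_lf_cr <- grepl("\n\r", head_text, fixed = TRUE)  # the unusual ending
has_cr_lf <- grepl("\r\n", head_text, fixed = TRUE)  # the usual Windows ending
print(c(lf_cr = has_lf_cr, cr_lf = has_cr_lf))
```

Copy-pasting lines through an editor usually normalizes the line endings, which may be why the problem disappears on a pasted copy.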

I tried reading the file with read.csv and 100,000 rows; it worked without any problem and took under 6 seconds.

feb14_2 <- read.csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2014-02.csv", header = T, nrows = 100000)

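header = T is a redundant argument for fread, so it makes no difference there. Note that read.csv also defaults to header = TRUE, so the argument is only strictly needed with read.table.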

Answer 3

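The problem seems to be caused by a blank line between the header and the data in the original .csv file. Deleting that line from the .csv using notepad++ seems to have fixed it for me.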
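The same cleanup can be done from R instead of a text editor. The following is a sketch under the assumption that the only problem is empty lines, demonstrated on a small stand-in file; the file contents and the names src, cleaned and trips are illustrative, and data.table is assumed to be installed.

```r
library(data.table)

# Stand-in for the original file: header, empty line, data rows
# (hypothetical contents, not the real TLC file)
src <- tempfile(fileext = ".csv")
writeLines(c("vendor_id,passenger_count,trip_distance",
             "",
             "CMT,1,0.7",
             "VTS,2,3.1"), src)

# Read the raw lines and drop those that are empty or whitespace-only
all_lines <- readLines(src)
kept_lines <- all_lines[nzchar(trimws(all_lines))]

# Write the cleaned copy and parse that with fread
cleaned <- tempfile(fileext = ".csv")
writeLines(kept_lines, cleaned)
trips <- fread(cleaned)
print(trips)
```

Note that readLines() loads the whole file into memory, so for a multi-gigabyte download this may be slow or need a lot of RAM.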


r

fread

data.table