Read huge .csv file with some columns in single quotes but not all with fread from the data.table package

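I'm sorry that I can't really create a reproducible example (or at least, I guess, not one that fits the rules), but I still hope to get help. I am using the data from here: American Housing Survey 2013 data

Because the data files are huge, I would like to use the fread command instead of the read.csv command. With read.csv I can do the following: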

homimp <- read.csv("homimp.csv", quote = "'")
head(homimp)
       CONTROL RAS RAH  RAD JRAS JRAD
1 100003130103  74   2   96   -9    9
2 100006110249  35   2 8358   -9    9
3 100006110249  36   2 5970   -9    9
4 100006110249  37   2 6567   -9    9
5 100006110249  40   2  716   -9    9
6 100006110249  45   2 1910   -9    9

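and it removes the quotes (note that one column, RAD, is not in quotes in the first place). With fread, however, I can't seem to get rid of the quotes. The quote argument returns an error: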

homimpdt <- fread("homimp.csv", quote = "'")
Error in fread("homimp.csv", quote = "'") : unused argument (quote = "'")

Without the argument, the quotes are not removed:

homimpdt <- fread("homimp.csv")
head(homimpdt)
          CONTROL  RAS RAH  RAD JRAS JRAD
1: '100003130103' '74' '2'   96 '-9'  '9'
2: '100006110249' '35' '2' 8358 '-9'  '9'
3: '100006110249' '36' '2' 5970 '-9'  '9'
4: '100006110249' '37' '2' 6567 '-9'  '9'
5: '100006110249' '40' '2'  716 '-9'  '9'
6: '100006110249' '45' '2' 1910 '-9'  '9'

Why I want to do this:

> system.time(newhouse <- read.csv('newhouse.csv', quote = "'"))
   user  system elapsed 
  24.86    0.68   25.77 
> system.time(newhousedt <- fread('newhouse.csv'))
Read 84355 rows and 760 (of 760) columns from 0.273 GB file in 00:00:04
   user  system elapsed 
   3.33    0.07    3.41 

Thanks a lot for your help!

Re Psidom's comment:

homimpdt <- fread("homimp.csv", quote = "\'")
Error in fread("homimp.csv", quote = "'") : unused argument (quote = "'")

Summary of the answers given in the comments:

Solution #1: thanks to @Psidom and @jangorecki

Install data.table v. 1.9.7:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

Then run:

homimpdt <- fread("homimp.csv", quote = "\'")

Edit: at the time of writing, the current version of data.table on CRAN is 1.9.6, which does not accept the quote argument.
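
To try Solution #1 without the full survey file, here is a minimal sketch. It assumes a data.table version whose fread accepts the quote argument (1.9.7 or later, per the above). The miniature file and its values are made up for illustration, with RAD left unquoted as in the real data.

```r
library(data.table)

# Hypothetical miniature of homimp.csv: most fields are wrapped in
# single quotes, while the RAD column is not quoted at all
csv_lines <- c(
  "CONTROL,RAS,RAH,RAD,JRAS,JRAD",
  "'1001','74','2',96,'-9','9'",
  "'1002','35','2',8358,'-9','9'"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Tell fread that the single quote is the quoting character
dt <- fread(tmp, quote = "'")
print(dt)

unlink(tmp)
```

If the printed values still show quotes, the installed data.table is probably older than the version that introduced the argument; packageVersion("data.table") shows which one is loaded.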

Solution #2 (Linux only): thanks to @RichScriven

It can be found here:

and set as.is = TRUE in the type.convert() function
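
The linked code for Solution #2 is not reproduced here, so the following is only a sketch of how the type.convert() step might look, not necessarily what that answer does. It assumes the columns were read as character with the quotes still attached, strips them with gsub(), and then converts each column. The table contents are hypothetical.

```r
library(data.table)

# Hypothetical columns as fread returns them when the quotes are kept
dt <- data.table(
  RAS  = c("'74'", "'35'"),
  RAD  = c(96L, 8358L),
  JRAS = c("'-9'", "'-9'")
)

# Only the character columns carry quotes
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Drop the single quotes, then let type.convert() choose a type;
# as.is = TRUE keeps any remaining strings as character, not factor
strip_and_convert <- function(x) {
  type.convert(gsub("'", "", x, fixed = TRUE), as.is = TRUE)
}
dt[, (chr_cols) := lapply(.SD, strip_and_convert), .SDcols = chr_cols]

str(dt)
```

This variant runs inside R and so does not depend on Linux tools, at the cost of an extra pass over the character columns.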