Read huge .csv file with some columns in single quotes but not all with fread from the data.table package
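I'm sorry that I can't really create a reproducible example (or at least, I guess, not one that follows the rules), but I'm still hoping for help.
I am using the data from here:
American Housing Survey 2013 data
Since the data files are large, I'd like to use the "fread" command rather than the "read.csv" command. With read.csv I can do the following: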
homimp <- read.csv("homimp.csv", quote = "'")
head(homimp)
       CONTROL RAS RAH  RAD JRAS JRAD
1 100003130103  74   2   96   -9    9
2 100006110249  35   2 8358   -9    9
3 100006110249  36   2 5970   -9    9
4 100006110249  37   2 6567   -9    9
5 100006110249  40   2  716   -9    9
6 100006110249  45   2 1910   -9    9
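and it removes the quotes (note that one column, RAD, is not in quotes in the first place).
However, if I read the file in with fread, I can't seem to remove the quotes.
The quote argument returns an error: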
homimpdt <- fread("homimp.csv", quote = "'")
Error in fread("homimp.csv", quote = "'") : unused argument (quote = "'")
Without the argument, the quotes are not removed:
homimpdt <- fread("homimp.csv")
head(homimpdt)
          CONTROL  RAS RAH  RAD JRAS JRAD
1: '100003130103' '74' '2'   96 '-9'  '9'
2: '100006110249' '35' '2' 8358 '-9'  '9'
3: '100006110249' '36' '2' 5970 '-9'  '9'
4: '100006110249' '37' '2' 6567 '-9'  '9'
5: '100006110249' '40' '2'  716 '-9'  '9'
6: '100006110249' '45' '2' 1910 '-9'  '9'
Why I want to do this:
> system.time(newhouse <- read.csv('newhouse.csv', quote = "'"))
   user  system elapsed
  24.86    0.68   25.77
> system.time(newhousedt <- fread('newhouse.csv'))
Read 84355 rows and 760 (of 760) columns from 0.273 GB file in 00:00:04
   user  system elapsed
   3.33    0.07    3.41
Thanks a lot for your help!
Re Psidom's comment:
homimpdt <- fread("homimp.csv", quote = "\'")
Error in fread("homimp.csv", quote = "'") : unused argument (quote = "'")
Summary of the answers given in the comments:
Solution #1:
Credit to @Psidom and @jangorecki
Install data.table v. 1.9.7:
install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")
Then run:
homimpdt <- fread("homimp.csv", quote = "\'")
Edit: The current version of data.table on CRAN is 1.9.6, which is the version that rejects the quote argument above.
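Since the quote argument only exists in newer builds of fread, it may be worth checking for it before relying on it. The following is a minimal sketch and not part of the original answers: the temporary file and its two rows are made-up stand-ins for homimp.csv, and the fallback branch is only illustrative.

```r
library(data.table)

# Hypothetical stand-in for homimp.csv: RAD is unquoted,
# the other columns are wrapped in single quotes.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "CONTROL,RAS,RAH,RAD,JRAS,JRAD",
  "'100003130103','74','2',96,'-9','9'",
  "'100006110249','35','2',8358,'-9','9'"
), tmp)

# Check whether the installed fread() accepts a quote argument at all.
has_quote <- "quote" %in% names(formals(fread))

if (has_quote) {
  # Treat the single quote as the quoting character so it is stripped on read.
  homimpdt <- fread(tmp, quote = "'")
} else {
  message("This data.table build has no fread(quote = ); installed version: ",
          as.character(packageVersion("data.table")))
  homimpdt <- fread(tmp)  # the quotes stay in the values
}

print(homimpdt)
```

Inspecting formals(fread) rather than comparing version numbers avoids guessing which release first shipped the argument.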
Solution #2 (Linux only):
Credit to @RichScriven
It can be found here:
and by setting as.is = TRUE in the type.convert() function
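The type.convert() step mentioned above could look roughly like this. It is one possible reading of that suggestion, not the linked answer itself: the helper name strip_single_quotes is hypothetical, and the input is built by hand to mimic what fread("homimp.csv") returns when the quotes are left in.

```r
# Hypothetical helper: remove single quotes from character columns, then
# let type.convert() choose a type for each cleaned column.
strip_single_quotes <- function(df) {
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      # Note: this drops every single quote, including embedded apostrophes.
      cleaned <- gsub("'", "", df[[col]], fixed = TRUE)
      # as.is = TRUE keeps non-numeric text as character instead of factor.
      df[[col]] <- type.convert(cleaned, as.is = TRUE)
    }
  }
  df
}

# Made-up rows shaped like the quoted output shown in the question.
quoted <- data.frame(
  CONTROL = c("'100003130103'", "'100006110249'"),
  RAS     = c("'74'", "'35'"),
  RAD     = c(96L, 8358L),
  JRAS    = c("'-9'", "'-9'"),
  stringsAsFactors = FALSE
)

clean <- strip_single_quotes(quoted)
str(clean)
```

This runs after the file has been read, so it works with any fread version and on any platform, at the cost of an extra pass over the character columns. CONTROL ends up as a double here because its values exceed the integer range.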