Problems importing .csv file due to presence of \ symbol

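I apologize if the problem described below is vague in the sense that it lacks a reproducible example, but the problem is related to the format of the .csv file.

I am trying to open several .csv files from my laptop into R using the read.csv function. The arguments currently used in the call are: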

read.csv(filename, row.names = NULL, stringsAsFactors = F) 

row.names is set to NULL because I previously ran into another error (duplicate row.names).

When importing some of the files, I get the following warning:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

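I inspected the data frames produced from the inputs that gave this warning and noticed that many rows (in the case I describe, about 10% of the total number of rows) had been replaced by NA. In the following ASCII text, generated with dput(), I show a row with NA values together with the rows above and below it (the link is an image of what the data frame looks like).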

data <- structure(list(borough_17 = c("MN", "0", "MN"), block_17 = c(1602L,  NA, 1602L), lot_17 = c(57L, NA, 1L), cd_17 = c(111L, NA, 111L ), ct2010_17 = c(160.02, NA, 160.02), cb2010_17 = c(1001L, NA,  1001L), schooldist_17 = c(2L, NA, 2L), council_17 = c(4L, NA,  4L), zipcode_17
= c(10029L, NA, 10128L), firecomp_17 = c("E053",  "", "E053"), policeprct_17 = c(23, NA, 23), healthcent_17 = c(12L,  NA, 12L), healtharea_17 = c(2820L, NA, 2820L), sanitboro_17 = c(1L,  NA, 1L), sanitdistr_17 = c(11L, NA, 11L), sanitsub_17 = c("1A",  "", "1A"), address_17 = c("1392 MADISON AVENUE", "", "1150 5 AVENUE" ), zonedist1_17 = c("R7-2", "", "R10"), zonedist2_17 = c(NA,  "", NA), zonedist3_17 = c(NA_integer_, NA_integer_, NA_integer_ ), zonedist4_17
= c(NA, "", NA), overlay1_17 = c("C1-5", "",  NA), overlay2_17 = c(NA, "", NA), spdist1_17 = c(NA, "", "PI" ), spdist2_17 = c(NA, "", NA), spdist3_17 = c(NA_integer_, NA_integer_,  NA_integer_), ltdheight_17 = c(NA, NA, NA), splitzone_17 = c("N",  "", "N"), bldgclass_17 = c("C7", "", "D4"), landuse_17 = c(4L,  NA, 3L), easements_17 = c(0, NA, 0), ownertype_17 = c("P", "",  NA), ownername_17 = c("\1392 ASSOCIATES", "", "1150 FIFTH AVE OWNERS" ), lotarea_17 = c(" LLC,4892,20488,1300,19188,0,1300,0,0,0,0,2,1,6,27,30,50,103.67,50,86,NA,3,Y,5,2,234000,1781100,0,0,1910,1988,0,NA,NA,4.19,3.44,0,6.5,1,1016020057,0,16002,996952,226284,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,303.48084987,5045.80705517,0,0\nMN,1602,40,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1Q,68 EAST 97 STREET,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C1,02,0,NA,MSMC RESIDENTIAL REAL,5046,21288,2767,18521,0,0,0,0,0,2767,2,1,6,30,30,50,100.92,50,88,NA,3,N,5,2,98100,722250,19620,144450,1920,1988,0,NA,NA,4.22,3.44,0,6.5,1,1016020040,0,16002,997361,226090,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,311.056597351,5263.55635194,0,0\nMN,1602,12,111,160.02,1001,2,4,10128,E053,23,12,2820,1,11,1A,15 EAST 96 STREET,R10,NA,NA,NA,NA,NA,PI,NA,NA,NA,N,A7,01,0,P,ESI PROPERTIES L.L.C.,3784,10899,0,10899,0,0,0,0,0,0,2,1,6,1,1,37.5,100.92,37,100,NA,3,N,5,2,99694,312592,0,0,1915,0,0,Expanded Carnegie Hill Historic District,LUCY D. 
DAHLGREN HOUSE,2.88,10,0,10,1,1016020012,0,16002,996835,226261,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,284.454033558,3945.34293573,0,0\nMN,1602,50,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1A,1391 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,D7,04,0,P,MSMC RESIDENTIAL REAL,10000,44184,8673,35511,0,6000,0,0,0,2673,2,1,6,43,47,100,100,100,91,NA,2,N,3,2,293400,3357000,50792,695697,1920,0,0,NA,NA,4.42,3.44,0,6.5,1,1016020050,0,16002,997119,226224,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,420.450320951,11048.1944701,0,0\nMN,1602,58,111,160.02,1001,2,4,10029,E053,23,12,2820,1,11,1A,1396 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C7,04,0,NA,\1392 ASSOCIATES, LLC",  "", "15138"), bldgarea_17 = c("4415", "", "163969"), comarea_17 = c(21072,  NA, 1000), resarea_17 = c(1500, NA, 162969), officearea_17 = c(19572,  NA, 1000), retailarea_17 = c(0L, NA, 0L), garagearea_17 = c(1500L,  NA, 0L), strgearea_17 = c(0L, NA, 0L), factryarea_17 = c(0L,  NA, 0L), otherarea_17 = c(0L, NA, 0L), areasource_17 = c(0L,  NA, 2L), numbldgs_17 = c("2", "", "1"), numfloors_17 = c(1, NA,  15), unitsres_17 = c("6", "", "74"), unitstotal_17 = c(28L, NA,  77L), lotfront_17 = c(32, NA, 100.92), lotdepth_17 = c(50.92,  NA, 150), bldgfront_17 = c(81.42, NA, 100), bldgdepth_17 = c(50,  NA, 142), ext_17 = c("73", "", NA), proxcode_17
= c(NA, "", "1" ), irrlotcode_17 = c("2", "", "N"), lottype_17 = c("Y", "", "3" ), bsmtcode_17 = c("3", "", "2"), assessland_17 = c(2, NA, 769500 ), assesstot_17 = c(278100L, NA, 18590400L), exemptland_17
= c(2135700,  NA, 9440), exempttot_17 = c(0L, NA, 9440L), yearbuilt_17 = c(0L,  NA, 1924L), yearalter1_17 = c(1910L, NA, 1988L), yearalter2_17 = c(0L,  NA, 0L), histdist_17 = c("0", "", "Expanded Carnegie Hill Historic District" ), landmark_17 = c(NA, "", NA), builtfar_17 = c(NA, "", "10.83" ), residfar_17 = c(4.77, NA, 10), commfar_17 = c(3.44, NA, 0), 
    facilfar_17 = c(0, NA, 10), borocode_17 = c(6.5, NA, 1), 
    bbl_17 = c(1, NA, 1016020001), condono_17 = c(1016020058L, 
    NA, 0L), tract2010_17 = c(0L, NA, 16002L), xcoord_17 = c(16002L, 
    NA, 996653L), ycoord_17 = c(996984L, NA, 226361L), zonemap_17 = c("226326", 
    "", "6b"), zmcode_17 = c("6b", "", NA), sanborn_17 = c(NA, 
    "", "108S043"), taxmap_17 = c("108S043", "", "10602"), edesignum_17 = c("10602", 
    "", NA), appbbl_17 = c(NA, "", "0"), appdate_17 = c("0", 
    "", NA), plutomapid_17 = c(NA, "", "1"), version_17 = c("1", 
    "", "17V1"), mappluto_f_17 = c("17V1", "", "0"), shape_leng_17 = c("0", 
    "", "514.691112825"), shape_area_17 = c(285.694168916, NA, 
    15970.2343838), sfha_07_17 = c("4794.40305456", "", "0"), 
    sfha_15_17 = c(0, NA, 0)), .Names = c("borough_17", "block_17",  "lot_17", "cd_17", "ct2010_17", "cb2010_17", "schooldist_17",  "council_17", "zipcode_17", "firecomp_17", "policeprct_17", "healthcent_17",  "healtharea_17", "sanitboro_17", "sanitdistr_17", "sanitsub_17",  "address_17", "zonedist1_17", "zonedist2_17", "zonedist3_17",  "zonedist4_17", "overlay1_17", "overlay2_17", "spdist1_17", "spdist2_17",  "spdist3_17", "ltdheight_17", "splitzone_17", "bldgclass_17",  "landuse_17", "easements_17", "ownertype_17", "ownername_17",  "lotarea_17", "bldgarea_17", "comarea_17", "resarea_17", "officearea_17",  "retailarea_17", "garagearea_17", "strgearea_17", "factryarea_17",  "otherarea_17", "areasource_17", "numbldgs_17", "numfloors_17",  "unitsres_17", "unitstotal_17", "lotfront_17", "lotdepth_17",  "bldgfront_17", "bldgdepth_17", "ext_17", "proxcode_17", "irrlotcode_17",  "lottype_17", "bsmtcode_17", "assessland_17", "assesstot_17",  "exemptland_17", "exempttot_17", "yearbuilt_17", "yearalter1_17",  "yearalter2_17", "histdist_17", "landmark_17", "builtfar_17",  "residfar_17", "commfar_17", "facilfar_17", "borocode_17", "bbl_17",  "condono_17", "tract2010_17", "xcoord_17", "ycoord_17", "zonemap_17",  "zmcode_17", "sanborn_17", "taxmap_17", "edesignum_17", "appbbl_17",  "appdate_17", "plutomapid_17", "version_17", "mappluto_f_17",  "shape_leng_17", "shape_area_17", "sfha_07_17", "sfha_15_17"), row.names = 13:15, class = "data.frame")

Row with NA values

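I examined my data frame and found that the field "lotarea_17" seems to be the problem. In this field, the first record is a single string containing many values. The commas were not treated as separators, so many columns ended up in one single column (the image below shows what the data frame looks like):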

A look at the row on top of the NA-filled row

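I checked the .csv file and realized that the problem occurs when there is a \ symbol in a particular column of the preceding row. This symbol probably cancels out a " symbol, causing whole lines to be treated as one string. Below, I show two images of a .csv file: one from the dataset with the problem described, and another that contains the same record written differently and imports fine. The comma-separated values of the row that causes the problem appear to read: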

...,"\"1392 ASSOCIATES, LLC",4892,20488,1300,19188,...

While the one that works shows:

"...,1392 ASSOCIATES, LLC",4892,20488,1300,19188,...

csv with the problem

A csv file with the same row, that worked when importing it

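So, my question is whether there is a quick fix in R or Excel (preferably within R's read.csv function) that automatically handles the "\" fragment that is messing up my data structure. Fixing it manually is not an option, because we are talking about more than 200 cases in a single df (and I have more files with this problem).

Addition - The .csv files were created by writing out the data frames of several sf objects. The command was `write.table(as.data.frame(sf_object), filename, sep = ",", row.names = F)`. The sf objects were created by spatially joining the MapPLUTO shapefile (New York City's land use map database) with another shapefile (but I know the join is not the problem, because the columns inherited from it are perfect).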

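The csv file you are trying to read does not appear to conform to the official csv standard (RFC 4180), so I think brute force is the simplest approach.

We can read the raw data with readLines, edit the non-compliant parts with gsub, and then convert the result to a data.frame with read.csv(text=...). I am not 100% sure it will work on the first try, but it is easier to debug this way. What works on a small example: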

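# Read the file as plain lines, without any csv parsing
raw <- readLines(filename)
# Replace each literal \" with the RFC 4180 escape "".
# fixed = TRUE is needed: as a regex, '\\"' would match every double quote.
raw <- gsub('\\"', '""', raw, fixed = TRUE)
# Parse the repaired lines as csv
result <- read.csv(text = raw)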
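
To make this easier to try out, here is a minimal, self-contained sketch of the same approach on a toy file. The data frame `df`, its two columns, and the temporary files are made up for illustration; only the owner name is taken from the question. It assumes the backslashes come from `write.table`, whose default `qmethod = "escape"` writes an embedded quote as `\"`. If that is the cause, rewriting the files with `qmethod = "double"` (the `write.csv` behaviour) may avoid the repair step entirely.

```r
# Hypothetical toy data: one owner name contains an embedded double quote
df <- data.frame(
  ownername = c('"1392 ASSOCIATES, LLC', "MSMC RESIDENTIAL REAL"),
  lotarea = c(4892L, 5046L),
  stringsAsFactors = FALSE
)

# Reproduce the problematic file: the default qmethod = "escape"
# writes the embedded quote as \"
tmp <- tempfile(fileext = ".csv")
write.table(df, tmp, sep = ",", row.names = FALSE)

# Repair: turn each literal \" into the doubled quote "" and parse the text
raw <- readLines(tmp)
repaired <- gsub('\\"', '""', raw, fixed = TRUE)
result <- read.csv(text = repaired, stringsAsFactors = FALSE)

# Alternative, if the files can be regenerated: double the quotes at write time
tmp2 <- tempfile(fileext = ".csv")
write.table(df, tmp2, sep = ",", row.names = FALSE, qmethod = "double")
result2 <- read.csv(tmp2, stringsAsFactors = FALSE)
```

Note that the replacement is applied blindly to every `\"` in the file, so it is worth spot-checking a few repaired lines in `repaired` before trusting the result on the full data.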