readr::read_csv() - parsing failure with nested quotations

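I have a CSV in which some quoted fields contain another quoted string nested inside them:

"blah blah "nested quote"" and this produces parsing failures. I'm not sure whether this is a bug or whether there is an argument that handles it?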

Reprex (the file is here, or use the contents pasted below):

readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   INSTNM = col_character(),
#>   ADDR = col_character(),
#>   CITY = col_character(),
#>   STABBR = col_character(),
#>   ZIP = col_character(),
#>   CHFNM = col_character(),
#>   CHFTITLE = col_character(),
#>   EIN = col_character(),
#>   OPEID = col_character(),
#>   WEBADDR = col_character(),
#>   ADMINURL = col_character(),
#>   FAIDURL = col_character(),
#>   APPLURL = col_character(),
#>   ACT = col_character(),
#>   IALIAS = col_character(),
#>   INSTCAT = col_character(),
#>   CCBASIC = col_character(),
#>   CCIPUG = col_character(),
#>   CCSIZSET = col_character(),
#>   CARNEGIE = col_character()
#>   # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row    col           expected      actual                      file
#>   2 IALIAS delimiter or quote C           '~/temp/shittyquotes.csv'
#>   2 IALIAS delimiter or quote D           '~/temp/shittyquotes.csv'
#>   2 NA     59 columns         100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#>   UNITID INSTNM ADDR  CITY  STABBR ZIP    FIPS OBEREG CHFNM CHFTITLE
#>    <dbl> <chr>  <chr> <chr> <chr>  <chr> <dbl>  <dbl> <chr> <chr>   
#> 1 441238 City … 1500… Duar… CA     9101…     6      8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA     9535…     6      8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> #   OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> #   APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> #   HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> #   HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> #   MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> #   NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> #   POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> #   IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> #   CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> #   CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> #   CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>

Created on 2018-12-04 by the reprex package (v0.2.1)

Here are the CSV contents as well:

UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID 
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46

Jim Hester provided this answer:

You need to use read_delim() with the escape_double = FALSE argument. read_csv() does not expose this argument, because Excel-style CSVs escape inner quotes by doubling them.
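A minimal sketch of that suggestion, assuming readr is installed. The three-column file written here is a hypothetical cut-down version of the data above (only UNITID, IALIAS and INSTCAT are kept), and the temporary path stands in for the real file. How the inner quotes end up in the parsed value may vary between readr versions, so it is worth inspecting the result and problems() rather than relying on a particular output.

```r
library(readr)

# Hypothetical miniature of the problem file: the IALIAS field contains
# unescaped inner quotes, as in the second data row of the original CSV.
path <- tempfile(fileext = ".csv")
writeLines(c(
  'UNITID,IALIAS,INSTCAT',
  '441247,"formerly "Community Business School"",6'
), path)

# escape_double = FALSE tells readr not to treat a doubled quote ("")
# as an escaped quote character inside a quoted field.
dat <- read_delim(path, delim = ",", escape_double = FALSE)

print(dat)
# List any parsing problems that readr still recorded for this read.
print(problems(dat))
```

For the real file, the same call would take the original path in place of the temporary one.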

data.table::fread() parses the file fine. It issues a warning about the quoting, but you can ignore it.

library(data.table)
data.table::fread("./temp.csv")

Warning message: In data.table::fread("./temp.csv") : Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.