How to prevent the delimiter from being part of the input when using `read.csv`?
I have a csv file like this:
id,age,note
241,18,I am handsome,I am 18.
242,19, <ul>
<li>
<strong>I like music</strong><br />
Talor swift:I kike Talor,I like her concert{smile}.<br />
Beyonce:I have 3 albums.</li>
<ul>
243,17,<I write something sn2370292kl@$^&,hahhaha
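The header is `id,age,note`. The `note` field is a string entered by students and can contain any characters.
In `read.csv("qlist.csv",header=TRUE, sep=",",quote ="\"",na.strings = c(""," "),check.names=TRUE,fill=FALSE,strip.white = FALSE,comment.char = "",allowEscapes = FALSE,stringsAsFactors =FALSE,skipNul = FALSE)` I think I cannot use `,` as the separator, because of values such as `I am handsome,I am 18.`
So which separator can I use to prevent the separator from being part of the input?
Also, I tried G. Grothendieck's answer: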
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.3
year 2019
month 03
day 11
svn rev 76217
language R
version.string R version 3.5.3 (2019-03-11)
nickname Great Truth
> Lines <- 'id,age,note
+ 241,18,I am handsome,I am "18".
+ 242,19, <ul>
+ <li>
+ <strong>I like music</strong><br />
+ Talor swift:I kike Talor,I like her concert{smile}.<br />
+ Beyonce:I have 3 albums.</li>
+ <ul>
+ 243,17,<I write something sn2370292kl@$^&,hahhaha'
> L <- readLines(textConnection(Lines))
> L2 <- lapply(split(L, cumsum(grepl("^\\S", L))), function(x) {
+ x <- gsub('"', '""', x)
+ x[1] <- sub('^(.*?,.*?,)', '\\1"', x[1])
+ x[length(x)] <- paste0(x[length(x)], '"')
+ x
+ })
> DF <- read.csv(text = unname(unlist(L2)))
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
> dim(DF)
[1] 5 3
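We need to put quotes around the multi-line field. First read the file in with `readLines`, split it into logical records, and put double quotes around the third field. Then read the result with `read.csv`.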
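As a minimal sketch of why quoting is enough (the one-record input `txt` below is made up for illustration and is not the asker's file), `read.csv` with its default `quote = "\""` should treat a quoted field as a single value even when it contains commas, a line break, or a doubled double quote:

```r
# Hypothetical one-record input: the third field is wrapped in double quotes
# and contains a comma, a line break, and a literal quote written as "".
txt <- 'id,age,note
241,18,"I am handsome,
I am ""18""."'

DF1 <- read.csv(text = txt, stringsAsFactors = FALSE)

dim(DF1)    # number of rows and columns that were parsed
DF1$note    # the note, with the comma and line break kept inside one field
```

So the comma can stay as the separator as long as the free-text field is quoted and any quotes inside it are doubled.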
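We assume that lines matching the grep pattern defined in the variable `pat` are lines that do not continue from a prior line. In particular, we assume that a line starting with digits, a comma, digits and a comma does not continue from the previous line. If that assumption does not hold, modify `pat` accordingly.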
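# L <- readLines("myfile.csv")
L <- readLines(textConnection(Lines))
# pat <- "^\\S" # this pattern worked for the original input shown in the question
pat <- "^\\d+,\\d+,"
# Group each record-start line with the continuation lines that follow it.
L2 <- lapply(split(L, cumsum(grepl(pat, L))), function(x) {
  x <- gsub('"', '""', x)                    # escape embedded double quotes by doubling them
  x[1] <- sub('^(.*?,.*?,)', '\\1"', x[1])   # open a quote right after the second comma
  x[length(x)] <- paste0(x[length(x)], '"')  # close the quote at the end of the record
  x
})
DF <- read.csv(text = unname(unlist(L2)))
dim(DF)
## [1] 3 3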
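To see how the grouping step works on its own, here is a small sketch on a shortened, made-up vector (`L_demo`, `starts`, `group` and `records` are illustrative names, not part of the answer). `grepl` flags the lines that start a record, and `cumsum` turns those flags into a running record number that `split` then uses as the grouping key:

```r
# Hypothetical shortened input: a header, two single-line records,
# and one record that spans three physical lines.
L_demo <- c("id,age,note",
            "241,18,first note",
            "242,19, <ul>",
            "<li>",
            "<ul>",
            "243,17,last note")

pat <- "^\\d+,\\d+,"           # a record starts with digits, comma, digits, comma

starts  <- grepl(pat, L_demo)  # TRUE where a new record begins
group   <- cumsum(starts)      # running record number; the header stays in group 0
records <- split(L_demo, group)

group
lengths(records)
```

Because the header does not match `pat`, it ends up in its own group and is quoted like any other record, which is harmless for `read.csv`.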
Note
Lines <- 'id,age,note
241,18,I am handsome,I am "18".
242,19, <ul>
<li>
<strong>I like music</strong><br />
Talor swift:I kike Talor,I like her concert{smile}.<br />
Beyonce:I have 3 albums.</li>
<ul>
243,17,<I write something sn2370292kl@$^&,hahhaha'