Is there a way in R to join broken lines of a csv file?
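I have a program that exports csv files, but it neither quotes embedded new lines nor uses a different line ending for them (for example \n instead of \r\n). It uses the same end-of-line marker in the middle of a record as it does at the end. The program does use a comma as the separator between variables, however. How do I tell R to remove all eol markers until it reaches the number of variables in the data?
My data looks like this: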
name, rank, serial number, age, height, weight
mike, noob, 123456, 22, 6, 34.4
bob, officer, 345
323, 24, 6, 2
3.5
ted, officer, 34234, 2
5, 6, 35.2
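How do I basically remove the CR after the 5 in line 2, after the 2 in line 3, and after the 2 in line 5 (counting data lines, not the header)? Each line should have 5 commas and 6 variables. My data does not have extra lines between rows; I just couldn't get it to stop putting everything on one line without doing that. My data has 43 variables, and new rows are generated constantly. Most of the time it is read in with a few thousand rows. About 20% of them have the CR problem.
I would also add that a new record always starts on a new line, if that makes sense; it never follows the previous record on the same line.
The data frame should look like this: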
name, rank, serial number, age, height, weight
mike, noob, 123456, 22, 6, 34.4
bob, officer, 345323, 24, 6, 23.5
ted, officer, 34234, 25, 6, 35.2
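The rule stated above (five commas, six fields per record) can be checked before any repair is attempted. The following is a minimal sketch, assuming the small example has been written to a temporary file; `tmp`, `fields` and `broken` are names made up for this illustration. It uses `count.fields()` from the utils package to report how many comma-separated fields each physical line holds.

```r
# The small example from the question, written to a temporary file (assumed setup).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "name, rank, serial number, age, height, weight",
  "mike, noob, 123456, 22, 6, 34.4",
  "bob, officer, 345",
  "323, 24, 6, 2",
  "3.5",
  "ted, officer, 34234, 2",
  "5, 6, 35.2"
), tmp)

# Number of comma-separated fields on each physical line; quoting is disabled
# because the exporting program does not quote anything.
fields <- count.fields(tmp, sep = ",", quote = "")

# Physical lines that hold fewer fields than the header are fragments.
broken <- which(fields < fields[1])
```

Any line number in `broken` is part of a record that was split by a stray line break.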
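If it helps, this is what my data looks like. The first line is the header and it should be followed by 6 records, but read.csv, fread, and everything else I have tried give me 10 records. The 6th record has extra CRs but still has 42 variables; it is just split across 5 lines.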
EFPCName,EFUseAPPE,log pdl,pdl error,device pretty name,num pages,num sheets,copies printed,total pages printed,total sheets printed,total color pages printed,total bw pages printed,total tab pages printed,total sample pages printed,num copies,print status,instructions,notes1,notes2,username,noneutf8lastuser,non utf8 submitted by,title,size,logical printer,fiery,time,date,total rip duration,timestamp spooling,timestamp done spooling,timestamp waiting to rip,timestamp ripping,timestamp done ripping,timestamp waiting to print,timestamp printing,timestamp done printing,media weight,input slot,media size,media type,interpreter,
LZX Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004491,Canon hold,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,3,2013 06 07 19 37 23,2013 06 07 19 37 24,2013 06 07 19 38 02 118342,2013 06 07 19 38 02 118342,2013 06 07 19 38 09,2013 06 07 19 38 09,2013 06 07 19 38 38,2013 06 07 19 39 19 124419,,Tray5,Tabloid,Plain,PS,
LZX Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926744,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926090,2013 06 07 19 44 07 926744,2013 06 07 19 44 07,2013 06 07 19 44 11,2013 06 07 19 44 53 141084,,Tray5,Tabloid,Plain,PS,
LZX Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 551451,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 550964,2013 06 07 19 46 01 551451,2013 06 07 19 46 01,2013 06 07 19 46 05,2013 06 07 19 46 46 911557,,Tray5,Tabloid,Plain,PS,
LZX80 Color Copy Cover - 11 x 17 Tabloid,,postscript,,Canon,1,2,2,2,2,2,0,0,0,2,OK,,,,TeamMember,,TeamMember,78011.01.pdf,874486,Canon hold,SERVER-Shredder,2013 06 07 19 47 07,2013 06 07 19 47 00,3,2013 06 07 19 47 17,2013 06 07 19 47 17 507576,2013 06 07 19 47 47 960542,2013 06 07 19 47 47 960542,2013 06 07 19 47 51,2013 06 07 19 47 51,2013 06 07 19 47 54,2013 06 07 19 48 25 77595,,Tray3,Tabloid,Heavy5,PS,
LZX Laser 24 - 11 x 17 Tabloid,,postscript,,Canon,2,1,1,2,1,1,1,0,0,1,OK,,,,TeamMember,,TeamMember,78053.01.pdf,4004520,none,SERVER-Shredder,2013 06 07 19 37 13,2013 06 07 19 37 00,,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 502522,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 501212,2013 06 07 19 48 04 502522,2013 06 07 19 48 04,2013 06 07 19 48 07,2013 06 07 19 48 48 188474,,Tray5,Tabloid,Plain,PS,
EX32 Laser 32 - 11 x 17 Tabloid,,pdf,,Canon,63,64,1,63,64,4,59,0,0,1,OK,Size: 11 x 17
Finishing: Coil Binding Cutting Punching
Pages:
1-63 4/0 EX32 Laser 32 - 11 x 17 11 x 17
,Color 77992:01Employee Handbook REVISED_2up(NFC).pdf, McAllen TX,EFI Pace,,,Color 77992:01Employee Handbook REVISED_2up(NFC).pdf,518880,none,SERVER-Shredder,2013 06 07 20 01 52,2013 06 07 20 01 00,3,2013 06 07 20 02 41 495216,2013 06 07 20 02 44 780196,2013 06 07 20 02 41 871208,2013 06 07 20 02 41 871208,2013 06 07 20 02 45,2013 06 07 20 02 45,2013 06 07 20 03 25,2013 06 07 20 05 45 741386,,Tray4,Tabloid,Heavy1,PS,
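If you want blank fields to be added implicitly when rows have unequal length, set fill = TRUE in your read.table call.
If that is not what you are asking, could you provide a clearer, reproducible example?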
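For reference, here is a minimal sketch of that suggestion applied to the small example from the question. The temporary file and the names `tmp` and `padded` are assumptions made for illustration. Note that fill = TRUE only pads short rows with blank or NA fields; each fragment stays on its own row, so the broken records are not rejoined.

```r
# The small example from the question, written to a temporary file (assumed setup).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "name, rank, serial number, age, height, weight",
  "mike, noob, 123456, 22, 6, 34.4",
  "bob, officer, 345",
  "323, 24, 6, 2",
  "3.5",
  "ted, officer, 34234, 2",
  "5, 6, 35.2"
), tmp)

# fill = TRUE pads rows that have too few fields instead of stopping with an error.
padded <- read.table(tmp, header = TRUE, sep = ",", fill = TRUE,
                     strip.white = TRUE, stringsAsFactors = FALSE)

# One row per physical line, not one row per record.
dim(padded)
```

On this input the result has as many rows as there are physical data lines, which is why padding alone does not answer the question.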
This is what I have so far. See how it works on your data.
dat <- readLines("temp.txt")                  # read the file, one physical line per element
dat <- dat[nzchar(dat)]                       # drop empty lines
varnames <- unlist(strsplit(dat[1], ","))     # extract variable names from the header
nvar <- length(varnames)
# helper: number of literal commas in a string (0 when there are none)
n_commas <- function(x) lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
rows <- list()                                # completed records are collected here
k <- 2                                        # next unread line (line 1 is the header)
while (k <= length(dat)) {
  temp <- dat[k]
  # too few commas means the line was broken: append following lines,
  # but never read past the end of the file
  while (n_commas(temp) < nvar - 1 && k < length(dat)) {
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  fields <- unlist(strsplit(temp, ","))
  length(fields) <- nvar                      # pad or trim so every row has nvar fields
  rows[[length(rows) + 1]] <- fields
  k <- k + 1
}
dat1 <- do.call(rbind, rows)                  # character matrix, one row per record
colnames(dat1) <- varnames
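The general idea is to keep collapsing the text until the string contains enough commas. Once it does, the string is split on commas and added to the matrix as a single row. The code is quite clumsy and will be slow on large data files. This is the best I could do.
For the original data example, the code works and creates a character matrix with 42 columns and 6 rows. For the smaller example, the code cannot handle the break in the last column.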
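The same idea (glue fragments together until the comma count is reached) can also be written without an explicit loop, which may help with the speed concern. This is a sketch under stated assumptions, not a general csv repair tool: `count_commas` and `join_broken_lines` are hypothetical helper names, the header line is assumed to be intact, no field may contain a literal comma, and a fragment without any comma is treated as a continuation of the record before it. That last assumption is what lets it cope with a break in the last column of the small example.

```r
# Count the literal commas in each element of a character vector.
count_commas <- function(x) {
  lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
}

# Glue physical lines back into logical records (hypothetical helper).
# Assumptions: intact header, no commas inside fields, and a fragment
# without commas continues the previous record.
join_broken_lines <- function(lines) {
  lines  <- lines[nzchar(lines)]        # ignore empty lines
  header <- lines[1]
  body   <- lines[-1]
  n_sep  <- count_commas(header)        # commas in one complete record
  # The running comma total divided by n_sep, rounded up, is the number of
  # the record that each fragment belongs to.
  record <- pmax(1, ceiling(cumsum(count_commas(body)) / n_sep))
  joined <- vapply(split(body, record), paste, character(1), collapse = "")
  c(header, unname(joined))
}

# The small example from the question.
raw <- c(
  "name, rank, serial number, age, height, weight",
  "mike, noob, 123456, 22, 6, 34.4",
  "bob, officer, 345",
  "323, 24, 6, 2",
  "3.5",
  "ted, officer, 34234, 2",
  "5, 6, 35.2"
)

fixed <- read.csv(text = join_broken_lines(raw), strip.white = TRUE)
```

For a real file, `raw` would come from `readLines()`. If a fragment without commas can also occur at the start of a record, the comma count alone cannot tell where it belongs, and this grouping rule would need to change.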