How can I read a file with multiple headers?
I have a file that contains multiple headers, and I need to keep the headers as well.
The head of my file:
>1 Len = 254
13 112 1 18
15 112 1 30
22 11 3 25
>1 Reverse Len = 254
14 11 1 15
>2 Len = 186
19 15 2 34
25 11 3 25
....
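How can I read this file and import the values into an R variable such as a data frame?
Alternatively, could someone help me remove the headers and add another column that holds the table number (or otherwise shows which table each row belongs to)?
I would rather not read it as a string and parse it.
In case it helps, the data is a report from the MUMmer package.
I have uploaded an example here:
http://m.uploadedit.com/ba3c/1429271308686.txt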
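It is not really easy to do this without reading the whole thing in as a string and parsing it, but you can easily turn that kind of operation into a function, as I have done with the read.mtable function in my "SOfun" package.
Here it is applied to your sample data: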
## library(devtools)
## install_github("mrdwab/SOfun")
library(SOfun)
X <- read.mtable("http://m.uploadedit.com/ba3c/1429271308686.txt", ">")
X <- X[!grepl("Reverse", names(X))]
names(X)
# [1] "> 1 Len = 354" "> 2 Len = 127" "> 3 Len = 109" "> 4 Len = 52"
# [5] "> 5 Len = 1189" "> 6 Len = 1007" "> 7 Len = 918" "> 10 Len = 192"
# [9] "> 11 Len = 169" "> 13 Len = 248" "> 14 Len = 2500"
X[1]
# $`> 1 Len = 354`
# V1 V2 V3 V4
# 1 203757 1 1 35
# 2 122132 1 1 87
# 3 203756 1 1 354
# 4 1 1 1 354
# 5 42364 12 1 89
# 6 203757 37 37 91
# 7 122132 90 90 38
# 8 42364 102 91 37
# 9 203757 129 129 168
# 10 42364 140 129 212
# 11 122132 129 129 212
# 12 203757 298 298 43
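As you can see, it has created a list of 11 data.frames, each named after its "Len =" header line.
The two arguments used here are the file location (a URL in this case) and chunkID, which can be set to a regular expression or a fixed pattern that you want to match. Here, we want to match any line that starts with ">" to mark where a new dataset begins.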
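For readers who would rather not install a package, the same chunking idea can be sketched in base R. This is not the SOfun implementation, only a minimal illustration. It assumes that the first line is a header, that every header line starts with ">", and that every other line holds whitespace-separated numbers. The function name `split_by_header` and the inline sample lines are made up for the example.

```r
# Hypothetical helper (not part of SOfun): one data.frame per header line.
split_by_header <- function(lines, pattern = "^>") {
  is_header <- grepl(pattern, lines)
  # cumsum() over the header flags gives each line the running index of its chunk
  chunk <- cumsum(is_header)
  body <- split(lines[!is_header], chunk[!is_header])
  tables <- lapply(body, function(txt) read.table(text = txt))
  # name each table after the header line that opened its chunk
  names(tables) <- lines[is_header][as.integer(names(body))]
  tables
}

# Made-up sample in the same layout as the file in the question
sample_lines <- c("> 1 Len = 254", "13 112 1 18", "15 112 1 30", "22 11 3 25",
                  "> 1 Reverse Len = 254", "14 11 1 15",
                  "> 2 Len = 186", "19 15 2 34", "25 11 3 25")
tables <- split_by_header(sample_lines)
names(tables)
```

With a real file, the lines would come from something like `split_by_header(readLines("1429271308686.txt"))`. Headers that are followed by no data rows are simply dropped from the result.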
Or, if you want a long and cumbersome approach...
# if you just want the data and not the header information
x<-read.table("1429271308686.txt",comment.char=">")
# in case all else fails, my somewhat cumbersome solution...
x<-scan("1429271308686.txt",what="raw")
# extract the lengths, ind1 has all the lengths
ind1<-x=="="
ind1<-c(ind1[length(ind1)],ind1[-length(ind1)]) # take the value that comes after "="
cumsum(ind1)
lengths<-as.numeric(x[ind1])[c(TRUE,FALSE)] # only want one of the lengths
# remove the unwanted characters
ind2<-x==">"
ind2<-c(ind2[length(ind2)],ind2[-length(ind2)]) # take the value that comes after ">"
ind3<-x==">"|x=="Len"|x=="="|x=="Reverse"
dat<-as.numeric(x[!(ind1|ind2|ind3)]) # remove the unwanted
# arrange as matrix
mat<-matrix(dat,length(dat)/4,4,byrow=TRUE)
# the number of rows for each block
block<-(c(1:length(x))[duplicated(cumsum(!ind2))][c(FALSE,TRUE)]-c(1:length(x))[duplicated(cumsum(!ind2))][c(TRUE,FALSE)]-5)/4
# the number for each block
id<-as.numeric(x[ind2])[c(TRUE,FALSE)]
# new vector
mat<-cbind(rep(id,block),mat) # note, this assumes that the last line is again "> Reverse"
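In the end I parsed the data and imported it into R with a few lines of code.
I merged all the tables into a single table and added a new column that holds the name of the table each row came from.
Here it is: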
lns <- readLines("filename.txt")                # read the file as character lines
idx <- grepl("^>", lns)                         # TRUE for header lines (those starting with ">")
df <- read.table(text = lns[!idx])              # parse every non-header line as table rows
wd <- diff(c(which(idx), length(idx) + 1)) - 1  # number of data rows under each header
df$label <- rep(lns[idx], wd)                   # label each row with its table's header line
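The question also asked for a column holding the table number. Assuming the labels follow the "> N Len = M" or "> N Reverse Len = M" layout shown in the output above, a numeric id and a "Reverse" flag can be pulled out of the header text with regular expressions. The labels below are made-up stand-ins for the label column, and the names `table_id`, `reverse` and `len` are only illustrative.

```r
# Made-up labels in the layout of the question's headers (stand-in for df$label)
label <- c("> 1 Len = 254", "> 1 Len = 254", "> 1 Reverse Len = 254", "> 2 Len = 186")

# The first run of digits after ">" is the table number
table_id <- as.integer(sub("^> *([0-9]+).*$", "\\1", label))
# Headers containing the word "Reverse" mark the "Reverse" tables
reverse <- grepl("Reverse", label, fixed = TRUE)
# The digits after "Len =" give the reported length
len <- as.integer(sub("^.*Len *= *([0-9]+) *$", "\\1", label))

parsed <- data.frame(table_id, reverse, len)
parsed
```

Keeping the "Reverse" flag as its own column makes it easy to drop those rows later, for example with `parsed[!parsed$reverse, ]`.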
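Another way to handle this particular case is a Perl one-liner that someone suggested to me on another forum. I don't know how it works, but it does:
https://support.bioconductor.org/p/66724/#66767
Thanks to everyone else for the useful answers and comments that helped me arrive at this answer :)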