How to split a FASTA file imported as a data.frame on ">"
I have imported a FASTA file into R as a single-column data frame, like this:
dna.sequences <- data.frame(c(">ID1", "sequence1", ">ID2" , "sequence2", ...))
I would like to split this data frame into two columns and remove the ">" in front of each ID, so that I end up with something like this:
new_dna <- data.frame(
ID = c("ID1", "ID2" ... ),
sequence = c("sequence1", "sequence2" ... )
)
Thanks in advance, Jose
If your ID and sequence values always alternate, you can take advantage of vector recycling in the logical index:
transform(data.frame(ID = dna.sequences$col[c(TRUE, FALSE)],
sequence = dna.sequences$col[c(FALSE, TRUE)]),
ID = sub('^>', '', ID))
# ID sequence
#1 ID1 sequence1
#2 ID2 sequence2
Data
dna.sequences <- data.frame(col = c(">ID1", "sequence1", ">ID2" , "sequence2"))
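To see why this works: a logical index shorter than the vector is recycled along its length, so c(TRUE, FALSE) picks positions 1, 3, 5, ... and c(FALSE, TRUE) picks positions 2, 4, 6, .... Below is a minimal sketch on a plain character vector. The name `fasta_lines` and the sanity checks are additions for illustration, not part of the answer above.

```r
# Hypothetical stand-in for the single column of the data frame.
fasta_lines <- c(">ID1", "sequence1", ">ID2", "sequence2")

# A length-2 logical index is recycled to the length of the vector.
headers   <- fasta_lines[c(TRUE, FALSE)]   # positions 1, 3, ...
sequences <- fasta_lines[c(FALSE, TRUE)]   # positions 2, 4, ...

# Guard the "strictly alternating" assumption before trusting the result.
stopifnot(length(fasta_lines) %% 2 == 0,
          all(startsWith(headers, ">")),
          !any(startsWith(sequences, ">")))

new_dna <- data.frame(ID = sub("^>", "", headers),
                      sequence = sequences)
new_dna
```

If any of the checks fail, the input is probably not one header line followed by exactly one sequence line, and the odd/even split should not be used as-is.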
Assuming your file looks like this:
writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")
dna.sequences
V1
1 >ID1
2 GAGA
3 >ID2
4 TATA
Assuming it was read in correctly:
rows = 1:nrow(dna.sequences)
data.frame(ID = gsub(">","",as.character(dna.sequences[rows %% 2==1,1])),
sequences = dna.sequences[rows %% 2==0,1])
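The odd/even indexing assumes exactly one sequence line per header. FASTA files often wrap a sequence across several lines, and in that case the rows no longer alternate. Here is a sketch that splits on the ">" headers instead, using only base R. The file name `wrapped.fa` and the variable names are made up for illustration, and it assumes every header is followed by at least one sequence line.

```r
# Hypothetical example file in which the first sequence spans two lines.
writeLines(c(">ID1", "GAGA", "CCTT", ">ID2", "TATA"), "wrapped.fa")

fa_lines <- readLines("wrapped.fa")
fa_lines <- fa_lines[nzchar(fa_lines)]   # drop blank lines, if any

is_header <- startsWith(fa_lines, ">")
record    <- cumsum(is_header)           # record number for every line

# Paste together the non-header lines belonging to each record.
sequences <- tapply(fa_lines[!is_header], record[!is_header],
                    paste, collapse = "")

dna.sequences <- data.frame(ID = sub("^>", "", fa_lines[is_header]),
                            sequence = as.vector(sequences))
dna.sequences
```

The running count from cumsum() increases by one at every header, so all lines of a record share the same group number regardless of how many lines the sequence occupies.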
Or better yet, read it directly with a package designed for this purpose:
library(Biostrings)
data = readDNAStringSet("test.fa")
data
A DNAStringSet instance of length 2
width seq names
[1] 4 GAGA ID1
[2] 4 TATA ID2
dna.sequences = data.frame(ID=names(data),sequences=as.character(data))
dna.sequences
ID sequences
ID1 ID1 GAGA
ID2 ID2 TATA
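The row names in the result repeat the IDs because as.character() on a DNAStringSet returns a named character vector, and data.frame() reuses those names. If that is unwanted, one possible tweak is to strip the names first with unname(). The sketch below uses a plain named character vector (`seqs`, a made-up name) as a stand-in for as.character(data), so it runs without Biostrings.

```r
# Stand-in for as.character(data): a named character vector.
seqs <- c(ID1 = "GAGA", ID2 = "TATA")

# unname() drops the names, so data.frame() falls back to
# automatic row names instead of repeating the IDs.
dna.sequences <- data.frame(ID = names(seqs),
                            sequences = unname(seqs))
dna.sequences
```

With the real object, the equivalent call would be unname(as.character(data)).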
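Using 'seqinr':
library(seqinr)
seqs = read.fasta('filename', as.string = TRUE)
# read.fasta() returns a named list; flatten it into a character vector
dna_sequences = data.frame(ID = names(seqs), sequence = unlist(seqs))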
Using a 'magrittr' pipe we can get rid of the temporary variable:
library(magrittr)
dna_sequences = read.fasta('filename', as.string = TRUE) %>%
  {data.frame(ID = names(.), sequence = unlist(.))}