How to split a FASTA file imported as a data.frame on ">"

I have a FASTA file imported into R as a single-column data frame, like this:

dna.sequences <- data.frame(c(">ID1", "sequence1", ">ID2" , "sequence2", ...))

I want to split this data frame into two columns and remove the ">" in front of each ID, so that I end up with something like this:

    new_dna <- data.frame(
        ID = c("ID1", "ID2" ... ),
        sequence = c("sequence1", "sequence2" ... )
    )

Thanks in advance, Jose

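If your data always alternates between an ID line and a sequence line, you can use logical vector recycling to pick out every other row.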

transform(data.frame(ID = dna.sequences$col[c(TRUE, FALSE)], 
                     sequence = dna.sequences$col[c(FALSE, TRUE)]), 
          ID = sub('^>', '', ID))

#   ID  sequence
#1 ID1 sequence1
#2 ID2 sequence2

data

dna.sequences <- data.frame(col = c(">ID1", "sequence1", ">ID2" , "sequence2"))
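To see why the logical indices above work, here is a minimal sketch on a plain character vector. The names `x`, `ids` and `seqs` are made up for illustration. R recycles a short logical index along the whole vector, so `c(TRUE, FALSE)` keeps positions 1, 3, 5, ... and `c(FALSE, TRUE)` keeps positions 2, 4, 6, ...

```r
# Illustrative vector with strictly alternating header / sequence entries
x <- c(">ID1", "sequence1", ">ID2", "sequence2")

# The length-2 logical index is recycled over all four elements
ids  <- x[c(TRUE, FALSE)]   # odd positions: the header lines
seqs <- x[c(FALSE, TRUE)]   # even positions: the sequence lines
```

This only holds if every record occupies exactly two lines.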

Suppose your file looks like this:

writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")

dna.sequences
    V1
1 >ID1
2 GAGA
3 >ID2
4 TATA

Assuming it was read in correctly:

rows = 1:nrow(dna.sequences)
data.frame(ID = gsub(">","",as.character(dna.sequences[rows %% 2==1,1])),
sequences = dna.sequences[rows %% 2==0,1])
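The odd/even indexing above relies on each sequence sitting on a single line. FASTA files often wrap a sequence over several lines, and in that case a rough base-R sketch like the following might help. It assumes every header is followed by at least one sequence line, and the names `lines`, `is_header` and `record` are only illustrative.

```r
# Hypothetical input: the first record's sequence is wrapped over two lines
lines <- c(">ID1", "GAGA", "TTCC", ">ID2", "TATA")

# Flag the header lines; the running sum of the flags numbers the records,
# so every line gets the index of the record it belongs to
is_header <- startsWith(lines, ">")
record <- cumsum(is_header)

new_dna <- data.frame(
  ID = sub("^>", "", lines[is_header]),
  # Concatenate the non-header lines of each record into one string
  sequence = as.vector(tapply(lines[!is_header], record[!is_header],
                              paste, collapse = "")),
  stringsAsFactors = FALSE
)
```

A record with a header but no sequence lines would make the two columns differ in length, so this is a sketch under that assumption rather than a general parser.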

Or better, read it directly with a package dedicated to this purpose:

library(Biostrings)
data = readDNAStringSet("test.fa")

data
  A DNAStringSet instance of length 2
    width seq                                               names               
[1]     4 GAGA                                              ID1
[2]     4 TATA                                              ID2

dna.sequences = data.frame(ID=names(data),sequences=as.character(data))

dna.sequences
     ID sequences
ID1 ID1      GAGA
ID2 ID2      TATA

Using 'seqinr':

library(seqinr)
seqs = read.fasta('filename', as.string = TRUE)
# read.fasta() returns a list, so flatten it to a character vector first
dna_sequences = data.frame(ID = names(seqs), sequence = unlist(seqs))

With the 'magrittr' pipe we can get rid of the temporary variable:

library(seqinr)
library(magrittr)
dna_sequences = read.fasta('filename', as.string = TRUE) %>%
    {data.frame(ID = names(.), sequence = unlist(.))}