如何在 R 中将 fasta 文件拆分为所需的核苷酸长度?

How to split a fasta file into desired nucelotide length in R?

作为编程语言的完全初学者,特别是 R,我几天以来一直被这个问题困扰

我有一个随机的 fasta 文件:

>header AAAATGGGGCTTTTACCCCGATA

My desired output is:

segment1 AAAAT
segment2 GGGGC
segment3 TTTTA
segment4 CCCCG
segment5 ATA

我需要一点帮助来解决这个问题。任何帮助将不胜感激!

这是我的做法:

# Use the Biostrings package to parse fasta file and store long strings
# https://bioconductor.org/packages/release/bioc/html/Biostrings.html 
require(Biostrings)

# Create test data
fileConn<-file("test.fa")
writeLines(c(
    ">header1",
    "AAAATGGGGC",
    "TTTTACCCCG",
    "ATA",
    ">header2",
    "ACGTACG"
), fileConn)
close(fileConn)

# Read fasta file
string_set <- readBStringSet("test.fa", format="fasta")
string_set

# Segment the first sequence
sequence = string_set[[1]]
sequence_length = length(sequence)
segment_length = 5

segment_count = ceiling(sequence_length/segment_length) # number of segments
segments <- lapply(1:segment_count, function(segment_id) {
    start_position <- (segment_id-1)*segment_length + 1
    end_position <- min(start_position + segment_length - 1, sequence_length)
    segment <- as.character(subseq(sequence,start_position,end_position))
    segment_name <- sprintf("Segment%s", segment_id)
    c(`Segment Name`=segment_name, `Segment`=segment)
})
segments <- as.data.frame(do.call(rbind, segments))
segments