如何在 R 中将 fasta 文件拆分为所需的核苷酸长度?
How to split a fasta file into desired nucelotide length in R?
作为编程语言的完全初学者,特别是 R
,我几天以来一直被这个问题困扰
我有一个随机的 fasta 文件:
>header AAAATGGGGCTTTTACCCCGATA
My desired output is:
segment1 AAAAT
segment2 GGGGC
segment3 TTTTA
segment4 CCCCG
segment5 ATA
我需要一点帮助来解决这个问题。任何帮助将不胜感激!
这是我的做法:
# Use the Biostrings package to parse fasta file and store long strings
# https://bioconductor.org/packages/release/bioc/html/Biostrings.html
require(Biostrings)
# Create test data
fileConn<-file("test.fa")
writeLines(c(
">header1",
"AAAATGGGGC",
"TTTTACCCCG",
"ATA",
">header2",
"ACGTACG"
), fileConn)
close(fileConn)
# Read fasta file
string_set <- readBStringSet("test.fa", format="fasta")
string_set
# Segment the first sequence
sequence = string_set[[1]]
sequence_length = length(sequence)
segment_length = 5
segment_count = ceiling(sequence_length/segment_length) # number of segments
segments <- lapply(1:segment_count, function(segment_id) {
start_position <- (segment_id-1)*segment_length + 1
end_position <- min(start_position + segment_length - 1, sequence_length)
segment <- as.character(subseq(sequence,start_position,end_position))
segment_name <- sprintf("Segment%s", segment_id)
c(`Segment Name`=segment_name, `Segment`=segment)
})
segments <- as.data.frame(do.call(rbind, segments))
segments
作为编程语言的完全初学者,特别是 R
,我几天以来一直被这个问题困扰
我有一个随机的 fasta 文件:
>header AAAATGGGGCTTTTACCCCGATA
My desired output is:
segment1 AAAAT
segment2 GGGGC
segment3 TTTTA
segment4 CCCCG
segment5 ATA
我需要一点帮助来解决这个问题。任何帮助将不胜感激!
这是我的做法:
# Use the Biostrings package to parse fasta file and store long strings
# https://bioconductor.org/packages/release/bioc/html/Biostrings.html
require(Biostrings)
# Create test data
fileConn<-file("test.fa")
writeLines(c(
">header1",
"AAAATGGGGC",
"TTTTACCCCG",
"ATA",
">header2",
"ACGTACG"
), fileConn)
close(fileConn)
# Read fasta file
string_set <- readBStringSet("test.fa", format="fasta")
string_set
# Segment the first sequence
sequence = string_set[[1]]
sequence_length = length(sequence)
segment_length = 5
segment_count = ceiling(sequence_length/segment_length) # number of segments
segments <- lapply(1:segment_count, function(segment_id) {
start_position <- (segment_id-1)*segment_length + 1
end_position <- min(start_position + segment_length - 1, sequence_length)
segment <- as.character(subseq(sequence,start_position,end_position))
segment_name <- sprintf("Segment%s", segment_id)
c(`Segment Name`=segment_name, `Segment`=segment)
})
segments <- as.data.frame(do.call(rbind, segments))
segments