How to find a certain sequence in a file (.fasta) in R if it can be separated by "-"

Question

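I want to find the positions of a sequence, for example "atgcgctcgactcca", in a fasta file. I already found a way to do this with the following function, which I got from this question: how use matchpattern() to find certain aminoacid in a file with many sequence(.fasta) in R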

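library(Biostrings)  # provides BString() and matchPattern()

asdf <- read.table(file = "TSS_00001_ONACali.fa")

# For each column, join all lines into one string and search it
SequenzPosition <- lapply(asdf, function(x) {
  string <- BString(paste(x, collapse = ""))
  matchPattern("atgcgctcgactcca", string)
})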

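My problem is that the sequence may also be split by "-" in the file, for example: "atgc---gctcgact--cca".

Is there a way to make the function ignore the "-"?

Thanks in advance!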

Answer 1

You can write your own simple code to ignore the "-".

This is the core of it:

> temp = s2c(sequence)

> newsequence = c2s( temp[temp != "-"] )

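c2s() and s2c() are functions from the "seqinr" package: s2c() splits a string into a vector of single characters, and c2s() joins such a vector back into one string.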
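If you would rather not load seqinr, the same gap-stripping step can be sketched in base R. This is only an illustration: it assumes the sequence is already available as a single string, and it uses the example from the question as input.

```r
# Example gapped sequence from the question (assumed to be one string)
sequence <- "atgc---gctcgact--cca"

# Base-R counterpart of s2c(): split the string into single characters
temp <- strsplit(sequence, "")[[1]]

# Base-R counterpart of c2s(): drop the gaps and paste the rest together
newsequence <- paste(temp[temp != "-"], collapse = "")

newsequence
```

The resulting gap-free string can then be passed to matchPattern() in the same way as in the question.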

You can also use tools outside of R, such as MUMmer or BLAST+, which produce readable output.

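If the positions matter, you can use the code below to map an index in the gap-free sequence back to the correct index in the original (gapped) sequence:

> which(temp != "-")[i]    # replace i with the index in newsequence; the result is the matching index in temp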
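Putting the pieces together, the following is a minimal base-R sketch of the whole idea: strip the gaps, search the gap-free string, then translate the hits back to gapped coordinates. The input string here is hypothetical (the example from the question with a few flanking bases added), and gregexpr() stands in for matchPattern() so that the sketch runs without Biostrings. Note that gregexpr() reports only non-overlapping matches.

```r
# Hypothetical gapped sequence: the question's example with flanking bases
gapped  <- "ttatgc---gctcgact--ccagg"
pattern <- "atgcgctcgactcca"

chars    <- strsplit(gapped, "")[[1]]
keep     <- which(chars != "-")   # original index of every non-gap character
ungapped <- paste(chars[keep], collapse = "")

# Start of each match in the gap-free string (gregexpr gives -1 if none)
hits <- gregexpr(pattern, ungapped, fixed = TRUE)[[1]]
hits <- as.integer(hits[hits > 0])

# Translate gap-free coordinates back to coordinates in the gapped string
start_gapped <- keep[hits]
end_gapped   <- keep[hits + nchar(pattern) - 1]

# Show the matched region, gaps included
substr(gapped, start_gapped, end_gapped)
```

With matchPattern() you would apply the same lookup to the start and end positions it reports, for example keep[start(m)] and keep[end(m)] for a match object m.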
