(v)matchPattern DNAStringSet 密码子列表以参考 DNAString

(v)matchPattern DNAStringSetList of Codons to Reference DNAString

我正在评估下一代测序 (NGS) 实验中热点单核苷酸多态性 (SNP) 对病毒蛋白质序列的影响。我有参考 DNA 序列和热点列表。我需要先弄清楚这些热点所在位置的阅读框架。为此,我生成了一个包含所有人类密码子的 DNAStringSetList,并希望使用 Biostrings 包中的 vmatchpatternmatchpattern 来找出热点落在密码子中的位置阅读框架。

我经常纠结于 lapply 和其他 apply 函数,所以我倾向于使用 for 循环。我正在努力改进这方面的内容,欢迎提供 apply 解决方案(如果有的话)。

这是密码子列表的代码:

alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))  
asparginine <- DNAStringSet("AAY") 
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR") 
methionine <- DNAStringSet("ATG") 
phenylalanine <- DNAStringSet("TTY") 
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))

codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
                           cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
                           histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
                           proline, serine, threonine, tyrosine, tryptophan, valine, stop))

当前for循环代码:

reference_stringset <-  DNAStringSet(covid)

codon_locations <- list()

for (i in 1:length(codons)) {
  pattern <- codons[[i]]
  codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}

当前错误代码。我正在过滤密码子 DNAStringSetList 使其成为 DNAStringSet.

Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object

我无法给出确切的核苷酸序列,但这里是 COVID 基因组(link:https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta)用作代表:

#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)

对于当前代码,更改 codons 的构成方式。当前 codons 的输出如下所示:

DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements> 

将其从 DNAStringSetList 更改为 DNAStringSet 氨基酸团聚体。

codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
                                cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
                                histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
                                proline, serine, threonine, tyrosine, tryptophan, valine, stop))

codons
DNAStringSet object of length 32:
     width seq
 [1]     3 GCN
 [2]     3 CGN
 [3]     3 AGR
 [4]     3 CGY
 [5]     3 MGR
 ...   ... ...
[28]     3 TGG
[29]     3 TAY
[30]     3 GTN
[31]     3 TRA
[32]     3 TAR

当我 运行 脚本时,我得到以下输出,其中列出了示例的 SARS-CoV-2 分离株(我显示的是一小部分)

codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>


[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
    [1]        89        91         3
    [2]       267       269         3
    [3]       283       285         3
    [4]       352       354         3
    [5]       358       360         3
    ...       ...       ...       ...
  [550]     29261     29263         3
  [551]     29289     29291         3
  [552]     29472     29474         3
  [553]     29559     29561         3
  [554]     29793     29795         3

查看那些有输出的,只有那些具有标准核苷酸(“ATCG”,无摆动)的匹配。这些也需要更改才能搜索。

如果你在推特上,我建议使用#rstats、#bioconductor 和#bioinformatics 主题标签链接问题以产生更多的吸引力,我注意到关于 SO 的生物信息学特定问题不会生成为很多嗡嗡声。