(v)matchPattern DNAStringSet 密码子列表以参考 DNAString

Question

我正在评估下一代测序 (NGS) 实验中热点单核苷酸多态性 (SNP) 对病毒蛋白质序列的影响。我有参考 DNA 序列和热点列表。我需要先弄清楚这些热点所在位置的阅读框架。为此，我生成了一个包含所有人类密码子的 DNAStringSetList，并希望使用 Biostrings 包中的 vmatchpattern 或 matchpattern 来找出热点落在密码子中的位置阅读框架。

我经常纠结于 lapply 和其他 apply 函数，所以我倾向于使用 for 循环。我正在努力改进这方面的内容，欢迎提供 apply 解决方案（如果有的话）。

这是密码子列表的代码：

alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))  
asparginine <- DNAStringSet("AAY") 
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR") 
methionine <- DNAStringSet("ATG") 
phenylalanine <- DNAStringSet("TTY") 
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))

codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
                           cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
                           histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
                           proline, serine, threonine, tyrosine, tryptophan, valine, stop))

当前for循环代码：

reference_stringset <-  DNAStringSet(covid)

codon_locations <- list()

for (i in 1:length(codons)) {
  pattern <- codons[[i]]
  codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}

当前错误代码。我正在过滤密码子 DNAStringSetList 使其成为 DNAStringSet.

Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object

我无法给出确切的核苷酸序列，但这里是 COVID 基因组（link：https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta）用作代表：

#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)

Answer 1

对于当前代码，更改 codons 的构成方式。当前 codons 的输出如下所示：

DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>

将其从 DNAStringSetList 更改为 DNAStringSet 氨基酸团聚体。

codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
                                cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
                                histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
                                proline, serine, threonine, tyrosine, tryptophan, valine, stop))

codons
DNAStringSet object of length 32:
     width seq
 [1]     3 GCN
 [2]     3 CGN
 [3]     3 AGR
 [4]     3 CGY
 [5]     3 MGR
 ...   ... ...
[28]     3 TGG
[29]     3 TAY
[30]     3 GTN
[31]     3 TRA
[32]     3 TAR

当我运行脚本时，我得到以下输出，其中列出了示例的 SARS-CoV-2 分离株（我显示的是一小部分）

codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
       start       end     width
   <integer> <integer> <integer>


[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
    [1]        89        91         3
    [2]       267       269         3
    [3]       283       285         3
    [4]       352       354         3
    [5]       358       360         3
    ...       ...       ...       ...
  [550]     29261     29263         3
  [551]     29289     29291         3
  [552]     29472     29474         3
  [553]     29559     29561         3
  [554]     29793     29795         3

查看那些有输出的，只有那些具有标准核苷酸（“ATCG”，无摆动）的匹配。这些也需要更改才能搜索。

如果你在推特上，我建议使用#rstats、#bioconductor 和#bioinformatics 主题标签链接问题以产生更多的吸引力，我注意到关于 SO 的生物信息学特定问题不会生成为很多嗡嗡声。

(v)matchPattern DNAStringSet 密码子列表以参考 DNAString

(v)matchPattern DNAStringSetList of Codons to Reference DNAString

r

bioconductor