(v)matchPattern DNAStringSet 密码子列表以参考 DNAString
(v)matchPattern DNAStringSetList of Codons to Reference DNAString
我正在评估下一代测序 (NGS) 实验中热点单核苷酸多态性 (SNP) 对病毒蛋白质序列的影响。我有参考 DNA 序列和热点列表。我需要先弄清楚这些热点所在位置的阅读框架。为此,我生成了一个包含所有人类密码子的 DNAStringSetList
,并希望使用 Biostrings
包中的 vmatchpattern
或 matchpattern
来找出热点落在密码子中的位置阅读框架。
我经常纠结于 lapply
和其他 apply
函数,所以我倾向于使用 for
循环。我正在努力改进这方面的内容,欢迎提供 apply
解决方案(如果有的话)。
这是密码子列表的代码:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
当前for循环代码:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
当前错误代码。我正在过滤密码子 DNAStringSetList
使其成为 DNAStringSet
.
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
我无法给出确切的核苷酸序列,但这里是 COVID 基因组(link:https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta)用作代表:
#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)
对于当前代码,更改 codons
的构成方式。当前 codons
的输出如下所示:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
将其从 DNAStringSetList
更改为 DNAStringSet
氨基酸团聚体。
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
当我 运行 脚本时,我得到以下输出,其中列出了示例的 SARS-CoV-2 分离株(我显示的是一小部分)
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
查看那些有输出的,只有那些具有标准核苷酸(“ATCG”,无摆动)的匹配。这些也需要更改才能搜索。
如果你在推特上,我建议使用#rstats、#bioconductor 和#bioinformatics 主题标签链接问题以产生更多的吸引力,我注意到关于 SO 的生物信息学特定问题不会生成为很多嗡嗡声。
我正在评估下一代测序 (NGS) 实验中热点单核苷酸多态性 (SNP) 对病毒蛋白质序列的影响。我有参考 DNA 序列和热点列表。我需要先弄清楚这些热点所在位置的阅读框架。为此,我生成了一个包含所有人类密码子的 DNAStringSetList
,并希望使用 Biostrings
包中的 vmatchpattern
或 matchpattern
来找出热点落在密码子中的位置阅读框架。
我经常纠结于 lapply
和其他 apply
函数,所以我倾向于使用 for
循环。我正在努力改进这方面的内容,欢迎提供 apply
解决方案(如果有的话)。
这是密码子列表的代码:
alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparginine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tyrosine <- DNAStringSet("TGG")
tryptophan <- DNAStringSet("TAY")
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))
codons <- DNAStringSetList(list(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
当前for循环代码:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
pattern <- codons[[i]]
codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
当前错误代码。我正在过滤密码子 DNAStringSetList
使其成为 DNAStringSet
.
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
我无法给出确切的核苷酸序列,但这里是 COVID 基因组(link:https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta)用作代表:
#for those not used to using .fasta files, first copy and past genome into notepad and save as a .fasta file
#use readDNAStringSet from Biostrings package to read in the .fasta file
filepath = #insert file path
covid <- readDNAStringSet(filepath)
对于当前代码,更改 codons
的构成方式。当前 codons
的输出如下所示:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
将其从 DNAStringSetList
更改为 DNAStringSet
氨基酸团聚体。
codons <- DNAStringSet(c(alanine, arginine, asparginine, aspartic_acid, asparagine_or_aspartic_acid,
cysteine, glutamine, glutamic_acid, glutamine_or_glutamic_acid, glycine,
histidine, start, isoleucine, leucine, lysine, methionine, phenylalanine,
proline, serine, threonine, tyrosine, tryptophan, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
当我 运行 脚本时,我得到以下输出,其中列出了示例的 SARS-CoV-2 分离株(我显示的是一小部分)
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
查看那些有输出的,只有那些具有标准核苷酸(“ATCG”,无摆动)的匹配。这些也需要更改才能搜索。
如果你在推特上,我建议使用#rstats、#bioconductor 和#bioinformatics 主题标签链接问题以产生更多的吸引力,我注意到关于 SO 的生物信息学特定问题不会生成为很多嗡嗡声。