How to find the specific frequency of a codon?
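I want to create a function in R that calculates the frequency of each codon.
We know that methionine is an amino acid encoded by only one codon, ATG, so its fraction in any sequence is 1. Glycine, by contrast, can be encoded by GGT, GGC, GGA, or GGG, so the fraction for each of those codons is 0.25.
The input is a DNA sequence such as ATGGGTGGCGGAGGG, and with the help of a codon table the function should calculate the fraction for each codon that occurs in the input.
Please help me by suggesting ways to implement this function.
For example, if my argument is ATGTGTTGCTGG, then my result would be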
ATG=1
TGT=0.5
TGC=0.5
TGG=1
Data for R:
codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T",
ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K",
AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L",
CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P",
CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R",
CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V",
GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D",
GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G",
GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F",
TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop",
TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
First, I set up my lookup list and my sequence.
codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T",
ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K",
AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L",
CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P",
CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R",
CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V",
GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D",
GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G",
GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F",
TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop",
TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
MySeq <- "ATGTGTTGCTGG"
Next, I load the stringi library and split the sequence into chunks of three characters.
# Load library
library(stringi)
# Break into 3 bases
seq_split <- stri_sub(MySeq, seq(1, stri_length(MySeq), by=3), length=3)
Then I use table to count the letters associated with these three-base chunks.
# Get associated letters
letter_count <- table(unlist(codon[seq_split]))
Finally, I bind the sequence together with the reciprocal of the counts and rename my data frame columns.
# Bind into a data frame
res <- data.frame(seq_split,
1/letter_count[match(unlist(codon[seq_split]), names(letter_count))])
# Rename columns
colnames(res) <- c("Sequence", "Letter", "Percentage")
# Sequence Letter Percentage
#1 ATG M 1.0
#2 TGT C 0.5
#3 TGC C 0.5
#4 TGG W 1.0
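The three steps above can also be wrapped into one function without stringi. This is only a minimal sketch: the helper name codon_share and the abbreviated table codon_small are assumptions made for illustration, not part of the answer above.

```r
# Abbreviated lookup table with only the codons used in this example (assumption)
codon_small <- c(ATG = "M", TGT = "C", TGC = "C", TGG = "W")

# Hypothetical helper: for each codon in the input, report 1 / (number of
# input codons that code for the same amino acid)
codon_share <- function(dna, lookup) {
  starts <- seq(1, nchar(dna), by = 3)
  codons <- substring(dna, starts, starts + 2)  # three-character chunks
  aa <- unname(lookup[codons])                  # associated amino-acid letters
  counts <- table(aa)                           # how often each letter occurs
  data.frame(Sequence = codons,
             Letter = aa,
             Percentage = as.vector(1 / counts[aa]),
             stringsAsFactors = FALSE)
}

res_base <- codon_share("ATGTGTTGCTGG", codon_small)
res_base
```

Because the counts come from the input rather than from the full codon table, a sequence that contains TGT but not TGC would report 1 for TGT with this approach.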
There are two things to solve here:
Convert codon into a fraction for each letter
( fracs <- 1/table(unlist(codon)) )
# A C D E F G H I
# 0.2500000 0.5000000 0.5000000 0.5000000 0.5000000 0.2500000 0.5000000 0.3333333
# K L M N P Q R S
# 0.5000000 0.1666667 1.0000000 0.5000000 0.2500000 0.5000000 0.1666667 0.1666667
# stop T V W Y
# 0.3333333 0.2500000 0.2500000 1.0000000 0.5000000
codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon))
str(head(codonfracs))
# List of 6
# $ ATA: num 0.333
# $ ATC: num 0.333
# $ ATT: num 0.333
# $ ATG: num 1
# $ ACA: num 0.25
# $ ACC: num 0.25
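One quick sanity check on fractions built this way is that the codons for any one amino acid sum to 1. The sketch below uses a shortened table (codon_small, an assumption for brevity) and a named numeric vector instead of a list.

```r
# Shortened lookup table, for illustration only (assumption)
codon_small <- c(ATG = "M", TGT = "C", TGC = "C", TGG = "W",
                 GGA = "G", GGC = "G", GGG = "G", GGT = "G")

# One fraction per amino-acid letter: 1 / number of codons for that letter
fracs <- 1 / table(codon_small)

# One fraction per codon, stored as a named numeric vector
codonfracs_vec <- setNames(as.vector(fracs[codon_small]), names(codon_small))

# Summing the codon fractions within each amino acid should give 1 everywhere
per_aa <- tapply(codonfracs_vec, codon_small, sum)
per_aa
```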
Convert the sequence string into a vector of length-3 substrings
s <- 'ATGTGTTGCTGG'
strsplit3 <- function(s, k=3) {
starts <- seq.int(1, nchar(s), by=k)
stops <- c(starts[-1] - 1, nchar(s))
mapply(substr, s, starts, stops, USE.NAMES=FALSE)
}
strsplit3(s)
# [1] "ATG" "TGT" "TGC" "TGG"
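If the input length might not be a multiple of three, the last chunk from the splitter above would be shorter than a codon. A possible variant that keeps only complete codons is sketched here; split_codons is a hypothetical name, and dropping the trailing bases is an assumption about what is wanted.

```r
# Hypothetical variant: return only complete k-mers, dropping leftover bases
split_codons <- function(s, k = 3) {
  n <- nchar(s) %/% k                      # number of complete codons
  if (n == 0) return(character(0))
  starts <- seq.int(1, by = k, length.out = n)
  substring(s, starts, starts + k - 1)
}

split_codons("ATGTGTTGCTGGA")  # the trailing "A" is dropped
# [1] "ATG" "TGT" "TGC" "TGG"
```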
From here, it is just a lookup:
codonfracs[ strsplit3(s) ]
# $ATG
# [1] 1
# $TGT
# [1] 0.5
# $TGC
# [1] 0.5
# $TGG
# [1] 1
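The lookup returns a list. If a named numeric vector is more convenient, and if the input might contain chunks that are not in the table, something like the following sketch could be used. The helper lookup_fracs and the small list codonfracs_small are assumptions for illustration.

```r
# Small stand-in for the full codonfracs list (assumption)
codonfracs_small <- list(ATG = 1, TGT = 0.5, TGC = 0.5, TGG = 1)

# Hypothetical helper: named numeric vector of fractions, NA for unknown codons
lookup_fracs <- function(codons, fracs) {
  vapply(codons, function(cd) {
    value <- fracs[[cd]]                   # NULL when the codon is not in the list
    if (is.null(value)) NA_real_ else value
  }, numeric(1))
}

found <- lookup_fracs(c("ATG", "TGT", "NNN"), codonfracs_small)
found
```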
Edit
Since you want the status of the other codons as well, try this:
x <- codonfracs
x[ ! names(x) %in% strsplit3(s) ] <- 0
str(x)
# List of 64
# $ ATA: num 0
# $ ATC: num 0
# $ ATT: num 0
# $ ATG: num 1
# $ ACA: num 0
# $ ACC: num 0
# $ ACG: num 0
# ...snip...
# $ TAT: num 0
# $ TAA: num 0
# $ TAG: num 0
# $ TGC: num 0.5
# $ TGT: num 0.5
# $ TGA: num 0
# $ TGG: num 1
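The path to this solution is slightly different:
f0 <- function(dna, weight) {
  # Requires library(tidyverse); weight is a named vector of per-codon weights
  codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
  tibble(id = seq_along(codons), codons = codons) %>%
    unnest(cols = codons) %>%
    mutate(weight = as.vector(weight[codons]))
}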
First, codon is just a named vector, not a list; here are the weights
codon <- unlist(codon)
weight <- setNames(1 / table(codon)[codon], names(codon))
Second, there may be a vector of DNA sequences rather than just one.
dna <- c("ATGTGTTGCTGG", "GGTCGTTGTGTA")
To develop the solution, the codons can be found by searching for any nucleotide [ACGT] repeated {3} times
codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
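One thing to keep in mind with the regular-expression approach is that matching resumes after any character outside [ACGT], which can shift the reading frame. The sketch below illustrates this with a made-up sequence containing an N, and shows one possible up-front check; the names dna_bad and is_clean are assumptions.

```r
# Made-up input with an ambiguous base "N" in position 4 (assumption)
dna_bad <- "ATGNTGTTGC"

# The regex skips the "N" and continues matching from the next base
codons_bad <- regmatches(dna_bad, gregexpr("[ACGT]{3}", dna_bad))
codons_bad[[1]]
# [1] "ATG" "TGT" "TGC"

# Possible guard: TRUE only for inputs made entirely of complete ACGT codons
is_clean <- grepl("^([ACGT]{3})+$", c(dna_bad, "ATGTGTTGCTGG"))
is_clean
# [1] FALSE  TRUE
```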
It seems convenient to work in the tidyverse, creating a tibble (data.frame) where id indicates which sequence the codon came from
library(tidyverse)
tbl <- tibble(id = seq_along(codons), codon = codons) %>% unnest(cols = codon)
Then add the weights
tbl <- mutate(tbl, weight = as.vector(weight[codon]))
So we have
> tbl
# A tibble: 8 x 3
id codon weight
<int> <chr> <dbl>
1 1 ATG 1
2 1 TGT 0.5
3 1 TGC 0.5
4 1 TGG 1
5 2 GGT 0.25
6 2 CGT 0.167
7 2 TGT 0.5
8 2 GTA 0.25
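For anyone who prefers not to depend on the tidyverse, roughly the same table can be built in base R. This is a sketch under assumptions: codon_small is a shortened lookup table, and the two input sequences are shortened examples rather than the ones used above.

```r
# Shortened lookup table and derived weights, for illustration (assumption)
codon_small <- c(ATG = "M", TGT = "C", TGC = "C", TGG = "W")
weight_small <- setNames(as.vector(1 / table(codon_small)[codon_small]),
                         names(codon_small))

# Two short example sequences (assumption)
dna_small <- c("ATGTGT", "TGCTGG")
codons_small <- regmatches(dna_small, gregexpr("[ACGT]{3}", dna_small))

# id repeats the sequence index once per codon found in that sequence
tbl_base <- data.frame(
  id = rep(seq_along(codons_small), lengths(codons_small)),
  codon = unlist(codons_small),
  stringsAsFactors = FALSE
)
tbl_base$weight <- unname(weight_small[tbl_base$codon])
tbl_base
```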
Standard tidyverse operations can be used for further summarization, in particular when the same codon appears multiple times
tbl %>% group_by(id, codon) %>%
summarize(wt = sum(weight))
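The same kind of summary can be sketched in base R with aggregate. The data frame below is made up for illustration (TGT appears twice in sequence 1) and is not output from the code above.

```r
# Made-up per-codon table in which TGT occurs twice in sequence 1 (assumption)
tbl_dup <- data.frame(
  id = c(1, 1, 1, 1, 1),
  codon = c("ATG", "TGT", "TGT", "TGC", "TGG"),
  weight = c(1, 0.5, 0.5, 0.5, 1),
  stringsAsFactors = FALSE
)

# Sum the weights within each (id, codon) group
wt_sum <- aggregate(weight ~ id + codon, data = tbl_dup, FUN = sum)
wt_sum
```

Here the two TGT rows collapse into a single row with a summed weight of 1.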