How to manipulate two pieces of a column?
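I'm working with some genetic data, but one of my columns isn't in the format I want. I don't know how much biology gets discussed here, but I'm trying to fix the way amino acids are displayed in my data.
Amino acids obviously have a full name, but they also have a 3-letter abbreviation and a 1-letter abbreviation. My data has the amino acids in 3-letter form, but I'd like to change them to the 1-letter abbreviation. Here is an example of my data.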
chr location effect impact AA_change
1 12543 missense_variant MODERATE p.Ala12Val
1 52367 missense_variant MODERATE p.Leu54Pro
1 752347 missense_variant MODERATE p.Met99Ser
1 984645 missense_variant MODERATE p.Lys34Ile
1 989845 missense_variant MODERATE p.Arg4Cys
1 999854 missense_variant MODERATE p.His43Gly
1 999855 missense_variant MODERATE p.Glu14Phe
dat <- structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), location = c(12543L,
52367L, 752347L, 984645L, 989845L, 999854L, 999855L), effect = c("missense_variant",
"missense_variant", "missense_variant", "missense_variant", "missense_variant",
"missense_variant", "missense_variant"), impact = c("MODERATE",
"MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE"
), AA_change = c("Ala12Val", "Leu54Pro", "Met99Ser", "Lys34Ile",
"Arg4Cys", "His43Gly", "Glu14Phe")), .Names = c("chr", "location",
"effect", "impact", "AA_change"), row.names = c(NA, -7L), class = "data.frame")
Here is a list of the 3-letter amino acid codes and their 1-letter abbreviations.
Ala == A
Arg == R
Asn == N
Asp == D
Cys == C
Glu == E
Gln == Q
Gly == G
His == H
Ile == I
Leu == L
Lys == K
Met == M
Phe == F
Pro == P
Ser == S
Thr == T
Trp == W
Tyr == Y
Val == V
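I feel like there is a simple function that does this, but I'm struggling to work out how. I'm used to changing only one part of a column, not two things at once. So what I'm asking is how to change this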
Ala12Val
Leu54Pro
Met99Ser
Lys34Ile
Arg4Cys
His43Gly
Glu14Phe
to this
A12V
L54P
M99S
K34I
R4C
H43G
E14F
Is this possible?
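Build a lookup for the amino acids, then take the first 3 letters as a substring and map them, extract the digits, and take the last 3 letters as a substring and map them. Then paste it all together.
# lookup map: names are the 3-letter codes, values are the 1-letter codes
AAmap <- setNames(c("A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"),
                  c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile","Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"))
# map the first 3 letters, keep only the digits, map the last 3 letters
# (note the doubled backslash: "\d" on its own is not a valid R string escape)
dat$AA_change_short <-
  paste0(AAmap[ substr(dat$AA_change, 1, 3) ],
         gsub("[^\\d]+", "", dat$AA_change, perl = TRUE),
         AAmap[ substr(dat$AA_change, nchar(dat$AA_change) - 2, nchar(dat$AA_change)) ])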
dat
# chr location effect impact AA_change AA_change_short
# 1 1 12543 missense_variant MODERATE Ala12Val A12V
# 2 1 52367 missense_variant MODERATE Leu54Pro L54P
# 3 1 752347 missense_variant MODERATE Met99Ser M99S
# 4 1 984645 missense_variant MODERATE Lys34Ile K34I
# 5 1 989845 missense_variant MODERATE Arg4Cys R4C
# 6 1 999854 missense_variant MODERATE His43Gly H43G
# 7 1 999855 missense_variant MODERATE Glu14Phe E14F
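The table in the question shows the values with a `p.` prefix, while the `dput` data does not have one. As a rough sketch of how the same lookup idea could cope with both, the helper below strips an optional `p.` and returns `NA` for anything that does not look like {acid, numbers, acid}. The names `aa_map` and `shorten_aa` are made up for illustration and are not from any package.

```r
# Lookup: names are 3-letter codes, values are 1-letter codes (same 20 as in the question)
aa_map <- c(Ala = "A", Arg = "R", Asn = "N", Asp = "D", Cys = "C",
            Glu = "E", Gln = "Q", Gly = "G", His = "H", Ile = "I",
            Leu = "L", Lys = "K", Met = "M", Phe = "F", Pro = "P",
            Ser = "S", Thr = "T", Trp = "W", Tyr = "Y", Val = "V")

# Hypothetical helper: convert e.g. "p.Ala12Val" or "Ala12Val" to "A12V"
shorten_aa <- function(x) {
  x <- sub("^p\\.", "", x)  # drop the optional "p." prefix
  # capture groups: 3 letters, one or more digits, 3 letters
  m <- regmatches(x, regexec("^([A-Za-z]{3})([0-9]+)([A-Za-z]{3})$", x))
  vapply(m, function(p) {
    if (length(p) != 4L) return(NA_character_)   # string did not match the pattern
    codes <- unname(aa_map[p[c(2, 4)]])          # look up both 3-letter codes
    if (anyNA(codes)) return(NA_character_)      # unknown 3-letter code
    paste0(codes[1], p[3], codes[2])
  }, character(1))
}

shorten_aa(c("p.Ala12Val", "Lys34Ile", "Arg4Cys"))
```

Returning `NA` rather than pasting a partial result is a design choice here: it makes rows that are not simple substitutions easy to spot afterwards.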
# mapping table: V1 = 3-letter code, V3 = 1-letter code (the "==" column is dropped)
dat2 <- read.table(text = "Ala == A
Arg == R
Asn == N
Asp == D
Cys == C
Glu == E
Gln == Q
Gly == G
His == H
Ile == I
Leu == L
Lys == K
Met == M
Phe == F
Pro == P
Ser == S
Thr == T
Trp == W
Tyr == Y
Val == V", stringsAsFactors = FALSE)[-2]
# replace each run of non-digits with its 1-letter code;
# match() keeps the codes in the order they appear within each string
m <- gregexpr("\\D+", dat$AA_change)
dat$AA_change1 <- `regmatches<-`(dat$AA_change, m,
  value = lapply(regmatches(dat$AA_change, m), function(x) dat2$V3[match(x, dat2$V1)]))
dat
chr location effect impact AA_change AA_change1
1 1 12543 missense_variant MODERATE Ala12Val A12V
2 1 52367 missense_variant MODERATE Leu54Pro L54P
3 1 752347 missense_variant MODERATE Met99Ser M99S
4 1 984645 missense_variant MODERATE Lys34Ile K34I
5 1 989845 missense_variant MODERATE Arg4Cys R4C
6 1 999854 missense_variant MODERATE His43Gly H43G
7 1 999855 missense_variant MODERATE Glu14Phe E14F
If it always has the form {acid, numbers, acid}, you can split it into three columns and use match
or a join to substitute. With data.table, this looks like...
library(data.table)
setDT(dat)
# put your mapping into a nicer format
abbrDT = fread(header = FALSE,"
Ala == A
Arg == R
Asn == N
Asp == D
Cys == C
Glu == E
Gln == Q
Gly == G
His == H
Ile == I
Leu == L
Lys == K
Met == M
Phe == F
Pro == P
Ser == S
Thr == T
Trp == W
Tyr == Y
Val == V")[, .(abbr3 = V1, abbr1 = V3)]
# split the column
patt = "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"
dat[, c("AA1", "num", "AA2") := tstrsplit(AA_change, patt, perl=TRUE)]
# substitute for each part
dat[abbrDT, on=.(AA1 = abbr3), AA1 := abbr1]
dat[abbrDT, on=.(AA2 = abbr3), AA2 := abbr1]
This gives
chr location effect impact AA_change AA1 num AA2
1: 1 12543 missense_variant MODERATE Ala12Val A 12 V
2: 1 52367 missense_variant MODERATE Leu54Pro L 54 P
3: 1 752347 missense_variant MODERATE Met99Ser M 99 S
4: 1 984645 missense_variant MODERATE Lys34Ile K 34 I
5: 1 989845 missense_variant MODERATE Arg4Cys R 4 C
6: 1 999854 missense_variant MODERATE His43Gly H 43 G
7: 1 999855 missense_variant MODERATE Glu14Phe E 14 F
Optionally, combine the columns again and drop the ones you no longer need:
dat[, AA_change := paste0(AA1, num, AA2)]
dat[, c("AA1", "num", "AA2") := NULL]
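The answer above shows the join route. For the `match` route it mentions, a minimal base R sketch might look like the following. It assumes every value really is {acid, numbers, acid}, and the vector names `abbr3`, `abbr1`, `x`, `parts`, and `short` are only illustrative.

```r
# 3-letter codes and their 1-letter counterparts, in the same order
abbr3 <- c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile",
           "Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val")
abbr1 <- c("A","R","N","D","C","E","Q","G","H","I",
           "L","K","M","F","P","S","T","W","Y","V")

x <- c("Ala12Val", "Lys34Ile", "Arg4Cys")

# split at every letter/digit boundary, giving three pieces per string
patt <- "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"
parts <- do.call(rbind, strsplit(x, patt, perl = TRUE))

# match() returns the position of each 3-letter code in abbr3,
# which is then used to index the 1-letter codes
parts[, 1] <- abbr1[match(parts[, 1], abbr3)]
parts[, 3] <- abbr1[match(parts[, 3], abbr3)]

short <- paste0(parts[, 1], parts[, 2], parts[, 3])
short
```

If some values do not split into exactly three pieces, `rbind` will recycle or warn, so this sketch is only suitable once the format has been checked.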