Split uneven string in R - variable substring and delimiters
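Before posting my question, I would like to stress that I did find similar questions here, but none that covered what I need.
I am working with FASTA files, more precisely with FASTA headers, which look like this:
sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2
I need to extract the text in bold. The first bold text is the protein name (Complement C3 above). The second is the gene name (the C3 after GN=). Note that both vary from entry to entry, and each string I start from contains several FASTA headers. Only the first header matters; the rest is garbage. Here is an example: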
> proteinGroups$Fasta.headers
[1] "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2;tr|A0A0F6QNP7|A0A0F6QNP7_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=2 SV=1;tr|A0A3Q1MHV6|A0A3Q1MHV6_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=1 SV=1;tr|A0A3Q1M2B2|A0A3Q1M2B2_B"
[2] "tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=3 SV=1;sp|Q2TBU0|HPT_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|A0A0M4MD57|A0A0M4MD57_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|G3X6K8|G3X6K8_BOVIN H"
[3] "tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protein OS=Bos taurus OX=9913 GN=SLC4A7 PE=3 SV=1"
[4] "sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;sp|P81282-3|CSPG2_BOVIN Isoform V2 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;tr|F1MZ83|F1MZ83_BOVIN Versican core protein OS=Bos taurus OX=9913 GN=VCAN P"
[5] "tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris) OS=Bos taurus OX=9913 GN=KRT10 PE=2 SV=1;sp|P06394|K1C10_BOVIN Keratin, type I cytoskeletal 10 OS=Bos taurus OX=9913 GN=KRT10 PE=3 SV=1"
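As you may have noticed, some protein names are almost a full phrase, while others are a single word. The same goes for the gene name: it is not always 2 characters long, and in this example it goes up to 6 characters.
Using information I found here, I was able to build a Frankenstein of a code, which is probably far from ideal: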
library(stringr)
library(reshape2)
#split the protein name from the other delimiters
fasta.header <- str_split(proteinGroups$Fasta.headers, "(?=OS=)")
#discard the additional fasta headers
protGene <- sapply(fasta.header, "[", c(1,2))
#invert the orientation and change to DF
protGene <- as.data.frame(t(protGene))
#rename columns
colnames(protGene) <- c("protein.name", "gene")
#discard the extra info and keep protein name only
protGene$protein.name <- colsplit(protGene$protein.name, " ", c("X1","X2"))[2]
#split the crap that came along with the additional headers in the first step
temp1 <- strsplit(protGene$gene, ";")
#assign cleaner values to the table
protGene$gene <- sapply(temp1, "[", 1)
#split the rest of the annotation
temp2 <- strsplit(protGene$gene, "OS=| OX=| GN=| PE=| SV=")
#assign gene name to the table
protGene$gene <- sapply(temp2, "[", 4)
I am able to get the data, but I feel this is far from robust or optimized. Any ideas on what to change?
Thanks in advance!
I am not sure if this is what you are looking for. Suppose your data is stored in a data.frame named proteinGroups and the headers are in the column Fasta.headers.
library(stringr)
library(dplyr)
library(tidyr)  # provides unnest_longer()

proteinGroups %>%
  tibble() %>%
  # split each string into one chunk per header; rn remembers the source row
  mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
         rn = row_number()) %>%
  unnest_longer(string) %>%
  mutate(
    # protein name: text between "_BOVIN " and " OS="
    protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"),
                          str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                          NA_character_),
    # gene: uppercase letters and digits right after "GN="
    gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                  str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                  NA_character_),
    .keep = "unused"
  )
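We split each string into smaller chunks at patterns like ;tr|A0A0F6QNP7| or ;sp|P81282-3|.
- We extract everything between _BOVIN and OS=. That is the protein name.
- We extract everything after GN= that matches uppercase letters and digits. That is the gene.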
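The two extraction patterns can be checked in isolation on a single chunk. Below is a minimal sketch in base R, using sub() and grepl() rather than stringr only so that it runs without extra packages. The helper name extract_field is hypothetical and not part of any library.

```r
# One complete header chunk and one truncated chunk, both taken from the example data
chunk     <- "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2"
truncated <- "A0A3Q1M2B2_B"

# Hypothetical helper: return capture group 1 if the pattern matches, otherwise NA
extract_field <- function(x, pattern) {
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

protein_pattern <- ".*_BOVIN\\s(.*?)\\sOS=.*"  # text between "_BOVIN " and " OS="
gene_pattern    <- ".*GN=([A-Z0-9]*).*"        # uppercase letters/digits after "GN="

protein_name <- extract_field(chunk, protein_pattern)
gene         <- extract_field(chunk, gene_pattern)

print(protein_name)                           # "Complement C3"
print(gene)                                   # "C3"
print(extract_field(truncated, gene_pattern)) # NA, the chunk has no GN= field
```

The NA branch is what produces the NA rows for headers that were cut off in the input.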
This returns
# A tibble: 14 x 4
Fasta.headers rn protein_name gene
<chr> <int> <chr> <chr>
1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 Complement C3 C3
2 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 C3-beta-c C3
3 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 C3-beta-c C3
4 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 NA NA
5 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 Haptoglobin HP
6 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 Haptoglobin HP
7 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 Haptoglobin HP
8 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 NA NA
9 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange pr~ 3 Anion exchange protein SLC4~
10 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~ 4 Isoform V3 of Versican core protein VCAN
11 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~ 4 Isoform V2 of Versican core protein VCAN
12 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~ 4 Versican core protein VCAN
13 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~ 5 Keratin 10 (Epidermolytic hyperkerat~ KRT10
14 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~ 5 Keratin, type I cytoskeletal 10 KRT10
Since only the first header matters and the rest is garbage, we keep only the first row for each original string:
proteinGroups %>%
  tibble() %>%
  mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
         rn = row_number()) %>%
  unnest_longer(string) %>%
  mutate(
    protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"),
                          str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                          NA_character_),
    gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                  str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                  NA_character_),
    .keep = "unused"
  ) %>%
  # keep only the first header of each original string
  group_by(rn) %>%
  slice(1) %>%
  ungroup() %>%
  select(-rn)
This gives
# A tibble: 5 x 3
Fasta.headers protein_name gene
<chr> <chr> <chr>
1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9~ Complement C3 C3
2 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos ta~ Haptoglobin HP
3 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protei~ Anion exchange protein SLC4~
4 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core ~ Isoform V3 of Versican core protein VCAN
5 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyp~ Keratin 10 (Epidermolytic hyperkeratosi~ KRT10
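Because only the first header is needed, the split-and-unnest step can also be skipped. The following is a rough sketch of that alternative in base R, not the approach used above: it cuts each string at the first ";xx|" that starts another entry and then captures the two fields. The helper name capture1 is hypothetical, and the sketch assumes every first header begins with a space-free identifier followed by the protein name.

```r
# Headers [3] and [5] from the example data
headers <- c(
  "tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protein OS=Bos taurus OX=9913 GN=SLC4A7 PE=3 SV=1",
  "tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris) OS=Bos taurus OX=9913 GN=KRT10 PE=2 SV=1;sp|P06394|K1C10_BOVIN Keratin, type I cytoskeletal 10 OS=Bos taurus OX=9913 GN=KRT10 PE=3 SV=1"
)

# Keep only the first header: drop everything from the first ";xx|" onward.
# A plain "; " inside a protein name does not match, so it is left alone.
first <- sub(";[a-z]{2}\\|.*$", "", headers, perl = TRUE)

# Hypothetical helper: first capture group of `pattern`, or NA when nothing matches
capture1 <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

result <- data.frame(
  protein_name = capture1(first, "^\\S+\\s+(.*?)\\s+OS="),  # after the identifier, before " OS="
  gene         = capture1(first, "GN=([A-Z0-9]+)"),         # same character class as above
  stringsAsFactors = FALSE
)
print(result)
```

Cutting to the first header before matching means a GN= field from a later entry can never be picked up by mistake when the first entry has none.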