Create a dataframe from a file with different delimiters
I need to build a table based on different delimiters and keywords. I have the following file:
>>>ENST00000370225_4
>7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
7E7O_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
Length=2317
Score = 4711 bits (12220), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
>NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo
sapiens]
P78363.3 RecName: Full=Retinal-specific phospholipid-transporting ATPase
ABCA4; AltName: Full=ATP-binding cassette sub-family A member
4; AltName: Full=RIM ABC transporter; Short=RIM proteinv;
Short=RmP; AltName: Full=Retinal-specific ATP-binding cassette
transporter; AltName: Full=Stargardt disease protein [Homo
sapiens]
7LKP_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
7M1P_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
7M1Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
EAW73056.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_a [Homo sapiens]
Length=2273
Score = 4711 bits (12219), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
>>>ENST00000460514_1
>CAH10486.1 hypothetical protein [Homo sapiens]
Length=1065
Score = 301 bits (772), Expect = 2e-96, Method: Compositional matrix adjust.
Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)
>CAA75729.1 ABCR [Homo sapiens]
Length=2273
Score = 300 bits (769), Expect = 2e-94, Method: Compositional matrix adjust.
Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)
The desired output is:
Transcript Protein Length Score Identity Percent
1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2317 4711 bits (12220) 2273/2273 100%
2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 2273 4711 bits (12219) 2273/2273 100%
3 ENST00000460514_1 CAH10486.1 hypothetical protein 1065 301 bits (772) 146/146 100%
4 ENST00000460514_1 CAA75729.1 ABCR 2273 300 bits (769) 146/146 100%
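Each required column is delimited in the original file by keywords such as Length, Identities, Score, ">" and ">>>".
I tried the following script, but I am missing the transcript (the first column), which is marked by ">>>".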
my_txt <- readLines(con = "gene_filt_perc.txt")
# lines starting with ">>>" hold the transcript IDs
transcript <- my_txt[grepl("^\\s*>>>", my_txt)]
lengths <- my_txt[grepl("^\\s*Length", my_txt)]
lengths <- gsub("Length=", "", lengths)
scores <- my_txt[grepl("^\\s*Score", my_txt)]
scores <- gsub(" Score = ", "", scores)
scores <- gsub(", Expect = ..*", "", scores)
identities <- my_txt[grepl("^\\s*Identities", my_txt)]
identities <- gsub(" Identities = ", "", identities)
identities <- gsub(", Positives = ..*", "", identities)
# lines starting with a single ">" hold the protein descriptions
protein <- my_txt[grepl("^\\s*>[[:alnum:]]", my_txt)]
result <- data.frame("protein" = protein, "identities" = identities, "scores" = scores, "lengths" = lengths)
result
protein identities scores lengths
1 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2273/2273 (100%) 4711 bits (12220) 2317
2 7LKZ_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2271/2273 (99%) 4711 bits (12220) 2273
3 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo 2273/2273 (100%) 4711 bits (12219) 2273
4 BAE06122.2 ABCA4 variant protein [Homo sapiens] 2272/2273 (99%) 4710 bits (12218) 2273
5 7E7Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2271/2273 (99%) 4709 bits (12214) 2317
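Is there an easier way to construct the data.frame?
Running gsub over all of the data at once destroys the hierarchical nesting. Keep your data together instead. Start from the whole file, data.txt, read as one big string, and split it into one element per transcript. The fields can then be extracted with regular expressions. Using unnest, a list column can be turned into multiple rows of the table, which automatically repeats the transcript ID for each of its proteins.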
library(tidyverse)
read_file("data.txt") %>%
  str_split(">>>") %>%
  simplify() %>%
  discard(~ .x == "") %>%
  # parse all transcripts
  map(function(Transcript) {
    list(
      Transcript = Transcript %>% str_extract("ENST.*"),
      Protein = {
        Transcript %>%
          str_split(">") %>%
          simplify() %>%
          discard(~ .x %>% str_detect("^ENST"))
      }
    )
  }) %>%
  enframe() %>%
  select(value) %>%
  unnest_wider(value) %>%
  unnest(Protein) %>%
  mutate(
    # parse all proteins
    Protein = Protein %>% map(function(Protein) {
      list(
        # first line
        Protein = Protein %>% str_split("\n") %>% simplify() %>% first() %>% str_trim(),
        # numbers after pattern 'Length='
        Length = Protein %>% str_extract("(?<=Length=)[0-9 ]+") %>% as.numeric(),
        # numbers after pattern 'Score = '
        Score = Protein %>% str_extract("(?<=Score = )[0-9 ]+") %>% as.numeric()
      )
    })
  ) %>%
  unnest_wider(Protein)
#> Transcript Protein Length Score
#> <chr> <chr> <dbl> <dbl>
#>1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2317 4711
#>2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo 2273 4711
#>3 ENST00000460514_1 CAH10486.1 hypothetical protein [Homo sapiens] 1065 301
#>4 ENST00000460514_1 CAA75729.1 ABCR [Homo sapiens] 2273 300
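The output above does not yet include the Identity and Percent columns from the desired table. As a rough sketch of one way to get them, here is a base R alternative (not the approach used above) that stays line-based like the script in the question and carries the transcript ID forward with cumsum. It assumes the file starts with a ">>>" line and that every ">" entry has exactly one Length, Score and Identities line, which may not hold for every BLAST report. The inline my_txt vector is a shortened stand-in for readLines("gene_filt_perc.txt").

```r
# Shortened stand-in for: my_txt <- readLines("gene_filt_perc.txt")
my_txt <- c(
  ">>>ENST00000370225_4",
  ">7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4",
  "[Homo sapiens]",
  "Length=2317",
  " Score = 4711 bits (12220), Expect = 0.0, Method: Compositional matrix adjust.",
  " Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)",
  ">>>ENST00000460514_1",
  ">CAH10486.1 hypothetical protein [Homo sapiens]",
  "Length=1065",
  " Score = 301 bits (772), Expect = 2e-96, Method: Compositional matrix adjust.",
  " Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)",
  ">CAA75729.1 ABCR [Homo sapiens]",
  "Length=2273",
  " Score = 300 bits (769), Expect = 2e-94, Method: Compositional matrix adjust.",
  " Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)"
)

my_txt <- trimws(my_txt)

is_transcript <- grepl("^>>>", my_txt)    # ">>>" lines start a transcript
is_protein    <- grepl("^>[^>]", my_txt)  # single ">" lines start a protein hit

# cumsum() counts how many ">>>" lines have been seen so far, so indexing the
# transcript IDs with it labels every line with the transcript it belongs to.
# Assumption: the first line of the file is a ">>>" line.
transcript_of_line <- sub("^>>>", "", my_txt[is_transcript])[cumsum(is_transcript)]

length_lines   <- my_txt[grepl("^Length=", my_txt)]
score_lines    <- my_txt[grepl("^Score", my_txt)]
identity_lines <- my_txt[grepl("^Identities", my_txt)]

# Assumption: one Length/Score/Identities line per protein, in file order.
result <- data.frame(
  Transcript = transcript_of_line[is_protein],
  # drop the leading ">" and any trailing "[Homo sapiens]" style suffix
  Protein    = sub("\\s*\\[.*$", "", sub("^>", "", my_txt[is_protein])),
  Length     = as.integer(sub("^Length=", "", length_lines)),
  Score      = sub("^Score = ([^,]+),.*", "\\1", score_lines),
  Identity   = sub("^Identities = ([0-9]+/[0-9]+).*", "\\1", identity_lines),
  Percent    = sub("^Identities = [0-9]+/[0-9]+ \\(([0-9]+%)\\).*", "\\1", identity_lines),
  stringsAsFactors = FALSE
)
result
```

If a hit can have several Score blocks, data.frame() will stop with a "differing number of rows" error rather than silently misaligning the columns; in that case the per-transcript splitting shown above is the safer route.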