从 prokka gff table 中拆分额外分隔的列,将不同数量的条目分成带有 NA 的新列(splitstackshape / R)
split extra-delimited column from prokka gff table with varying number of entries into new columns with NAs (splitstackshape / R)
我有一个包含制表符分隔和分号分隔数据的文件(.gff 格式的 prokka 注释文件)。不幸的是,分号分隔的部分在条目数上不一致。
幸运的是,分号后的前导部分(例如 ID=
或 gene=
)是一致的。我想准备此数据作为 R(或 R 内)的输入,而没有不同的列号或空字段。这些是 prokka 文件的第一行,删除了一些列:
A1 contig_10 16 192 ID=PROKKA_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00004;product=hypothetical protein
A1 contig_100 147 353 ID=PROKKA_00036;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00036;product=hypothetical protein
A1 contig_1000 60 434 ID=PROKKA_00892;inference=ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8;locus_tag=PROKKA_00892;product=Acyl-CoA reductase (LuxC)
A1 contig_10000 132 434 ID=PROKKA_11822;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_11822;product=hypothetical protein
A1 contig_100003 368 784 ID=PROKKA_96005;gene=fusA_29;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09;locus_tag=PROKKA_96005;product=Elongation factor G
A1 contig_100026 38 355 ID=PROKKA_96016;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96016;product=hypothetical protein
A1 contig_100027 38 493 ID=PROKKA_96018;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96018;product=hypothetical protein
A1 contig_100028 121 1131 ID=PROKKA_96019;eC_number=3.1.-.-;gene=rnjA_3;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q45493;locus_tag=PROKKA_96019;product=Ribonuclease J 1
A1 contig_10003 1028 3307 ID=PROKKA_11824;eC_number=1.1.1.40;gene=maeB_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P76558;locus_tag=PROKKA_11824;product=NADP-dependent malic enzyme
期望的输出是:
V1 V2 V3 V4 eC_number gene ID inference locus_tag note product
1 A1 contig_10 16 192 <NA> <NA> PROKKA_00004 ab initio prediction:Prodigal:2.6 PROKKA_00004 <NA> hypothetical protein
2 A1 contig_100 147 353 <NA> <NA> PROKKA_00036 ab initio prediction:Prodigal:2.6 PROKKA_00036 <NA> hypothetical protein
3 A1 contig_1000 60 434 <NA> <NA> PROKKA_00892 ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8 PROKKA_00892 <NA> Acyl-CoA reductase (LuxC)
4 A1 contig_10000 132 434 <NA> <NA> PROKKA_11822 ab initio prediction:Prodigal:2.6 PROKKA_11822 <NA> hypothetical protein
5 A1 contig_100003 368 784 <NA> fusA_29 PROKKA_96005 ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09 PROKKA_96005 <NA> Elongation factor G
使用tidyverse
和splitstackshape
的选项可以实现。首先使用 say read.table
(带参数 sep="\t"
)读取文件数据。然后使用 splitstackshape::splitstackshape
将列 V5
拆分为不同的列。现在数据已准备好更改为长格式并进行处理。
library(tidyverse)
library(splitstackshape)
# If first 4 columns of "textdata" is separated by "multiple spaces" than read it as
df <- read.table(text = gsub("\s{2,}","\t",textdata), stringsAsFactors = FALSE, sep = "\t")
# If first 4 columns of "textdata" is separated by "tab" than read it as
df <- read.table(text = textdata, stringsAsFactors = FALSE, sep = "\t")
# Now, process data (Based on feedback from `@crazysantaclaus`)
df %>% cSplit("V5", sep=";") %>%
gather(Key, value, -c(V1,V2,V3,V4)) %>%
separate(value, c("Col","Value"), sep="=") %>%
select(-Key) %>%
filter(!(is.na(Col) & is.na(Value))) %>%
spread(Col, Value)
结果:
# V1 V2 V3 V4 col1 col2 col3 col4 col5 col6
#1 A1 something_101 789 910 STRING_2 string_integer string with whitespace and:colon STRING string with whitespace and special chars <NA>
#2 A1 something_100 123 456 STRING_1 <NA> string with whitespace and:colon STRING string with whitespace and special chars string
数据:
textdata <- "A1 something_100 123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1 something_101 789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"
数据#2:
第二组数据。前 4 列不是由 \t
分隔,而是由多个 spaces
分隔
textdata <- "A1 something_100 123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1 something_101 789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"
我有一个包含制表符分隔和分号分隔数据的文件(.gff 格式的 prokka 注释文件)。不幸的是,分号分隔的部分在条目数上不一致。
幸运的是,分号后的前导部分(例如 ID=
或 gene=
)是一致的。我想准备此数据作为 R(或 R 内)的输入,而没有不同的列号或空字段。这些是 prokka 文件的第一行,删除了一些列:
A1 contig_10 16 192 ID=PROKKA_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00004;product=hypothetical protein
A1 contig_100 147 353 ID=PROKKA_00036;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00036;product=hypothetical protein
A1 contig_1000 60 434 ID=PROKKA_00892;inference=ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8;locus_tag=PROKKA_00892;product=Acyl-CoA reductase (LuxC)
A1 contig_10000 132 434 ID=PROKKA_11822;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_11822;product=hypothetical protein
A1 contig_100003 368 784 ID=PROKKA_96005;gene=fusA_29;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09;locus_tag=PROKKA_96005;product=Elongation factor G
A1 contig_100026 38 355 ID=PROKKA_96016;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96016;product=hypothetical protein
A1 contig_100027 38 493 ID=PROKKA_96018;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96018;product=hypothetical protein
A1 contig_100028 121 1131 ID=PROKKA_96019;eC_number=3.1.-.-;gene=rnjA_3;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q45493;locus_tag=PROKKA_96019;product=Ribonuclease J 1
A1 contig_10003 1028 3307 ID=PROKKA_11824;eC_number=1.1.1.40;gene=maeB_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P76558;locus_tag=PROKKA_11824;product=NADP-dependent malic enzyme
期望的输出是:
V1 V2 V3 V4 eC_number gene ID inference locus_tag note product
1 A1 contig_10 16 192 <NA> <NA> PROKKA_00004 ab initio prediction:Prodigal:2.6 PROKKA_00004 <NA> hypothetical protein
2 A1 contig_100 147 353 <NA> <NA> PROKKA_00036 ab initio prediction:Prodigal:2.6 PROKKA_00036 <NA> hypothetical protein
3 A1 contig_1000 60 434 <NA> <NA> PROKKA_00892 ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8 PROKKA_00892 <NA> Acyl-CoA reductase (LuxC)
4 A1 contig_10000 132 434 <NA> <NA> PROKKA_11822 ab initio prediction:Prodigal:2.6 PROKKA_11822 <NA> hypothetical protein
5 A1 contig_100003 368 784 <NA> fusA_29 PROKKA_96005 ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09 PROKKA_96005 <NA> Elongation factor G
使用tidyverse
和splitstackshape
的选项可以实现。首先使用 say read.table
(带参数 sep="\t"
)读取文件数据。然后使用 splitstackshape::splitstackshape
将列 V5
拆分为不同的列。现在数据已准备好更改为长格式并进行处理。
library(tidyverse)
library(splitstackshape)
# If first 4 columns of "textdata" is separated by "multiple spaces" than read it as
df <- read.table(text = gsub("\s{2,}","\t",textdata), stringsAsFactors = FALSE, sep = "\t")
# If first 4 columns of "textdata" is separated by "tab" than read it as
df <- read.table(text = textdata, stringsAsFactors = FALSE, sep = "\t")
# Now, process data (Based on feedback from `@crazysantaclaus`)
df %>% cSplit("V5", sep=";") %>%
gather(Key, value, -c(V1,V2,V3,V4)) %>%
separate(value, c("Col","Value"), sep="=") %>%
select(-Key) %>%
filter(!(is.na(Col) & is.na(Value))) %>%
spread(Col, Value)
结果:
# V1 V2 V3 V4 col1 col2 col3 col4 col5 col6
#1 A1 something_101 789 910 STRING_2 string_integer string with whitespace and:colon STRING string with whitespace and special chars <NA>
#2 A1 something_100 123 456 STRING_1 <NA> string with whitespace and:colon STRING string with whitespace and special chars string
数据:
textdata <- "A1 something_100 123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1 something_101 789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"
数据#2:
第二组数据。前 4 列不是由 \t
分隔,而是由多个 spaces
textdata <- "A1 something_100 123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1 something_101 789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"