从 Gencode 解析 GTF 文件
Parse GTF File from Gencode
我使用 data.table
包编写了一个脚本来解析 GENCODE gtf 文件的最后一列。对于那些不知道的人,该列包含一些键值项,每行用分号分隔。我正在处理的特定文件包含约 250 万行。我索引了前 100 行,然后是前 1000 行,只是为了测试脚本,输出正是我需要的。然而,尽管使用了 set
函数,运行-time 并没有我预期的那么快。前 100 行是即时的,但前 1000 行大约需要一两分钟。这是脚本。
#LOAD DATA.TABLE LIBRARY
require(data.table)
#READ GTF ANNOTATION FILE
info <- fread("gencodeAnnotation.gtf")
colnames(info)[9] <- "AdditionalInfo"
info <- info[1:1000]
#CREATE LIST OF 'KEYS' TO PARSE OUT
pars <- as.character(list("gene_id", "gene_type", "gene_status", "gene_name", " level ", "transcript_name", "transcript_id", "transcript_type", "transcript_support_level", "havana_gene"))
#NESTED FOR LOOP TO PARSE KEY-VALUE PAIR
for (i in 1:length(pars)) {
for (j in 1:nrow(info)) {
infoRow <- info[,tstrsplit(AdditionalInfo, ';', fixed = T)][j]
headerCheck <- like(infoRow, pars[i])
if (any(headerCheck) == TRUE) {
keyVal <- length(tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T))
set(info, i = j, j = toupper(pars[i]), value = tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T)[[keyVal]])
} else {
set(info, i = j, j = toupper(pars[i]), value = NA)
}
}
}
正如我之前所说,在前 100、1000 行上测试时输出是完美的。根据代码,它必须遍历所有行乘以要添加的列数,或 pars
中的项目。我的问题是,我的脚本中缺少什么或者我可以进行哪些编辑以减少 运行 时间?这是正在使用的 gtf 文件的 link:http://www.gencodegenes.org/releases/current.html。它是第一个 link 标记为 "Comprehensive gene annotation"。提前致谢。
每行的样例:
gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2; remap_status full_contig; remap_num_mappings 1; remap_target_status overlap;
MRE:
> dput(info[1:5,])
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1"),
V2 = c("HAVANA", "HAVANA", "HAVANA", "HAVANA", "HAVANA"),
V3 = c("gene", "transcript", "exon", "exon", "exon"), V4 = c(11869L,
11869L, 11869L, 12613L, 13221L), V5 = c(14409L, 14409L, 12227L,
12721L, 14409L), V6 = c(".", ".", ".", ".", "."), V7 = c("+",
"+", "+", "+", "+"), V8 = c(".", ".", ".", ".", "."), AdditionalInfo = c("gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; level 2; havana_gene \"OTTHUMG00000000961.2\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 1; exon_id \"ENSE00002234944.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 2; exon_id \"ENSE00003582793.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 3; exon_id \"ENSE00002312635.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "AdditionalInfo"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x16489e8>)
使用 lapply
之类的矢量化操作,而不是 for
循环。
keys <- lapply(info$AdditionalInfo, function(x)
unlist(lapply(unlist(strsplit(x, "; ")),
function(y) unlist(strsplit(y, " "))[1])) )
values <- lapply(info$AdditionalInfo, function(x)
unlist(lapply(unlist(strsplit(x, "; ")),
function(y) gsub("\"", "", unlist(strsplit(y, " "))[2]))) )
您可以弄清楚如何处理结果 keys
和 values
。
> keys[[1]]
[1] "gene_id" "gene_type" "gene_status" "gene_name" "level"
[6] "havana_gene"
> values[[1]]
[1] "ENSG00000223972.5" "transcribed_unprocessed_pseudogene"
[3] "KNOWN" "DDX11L1"
[5] "2" "OTTHUMG00000000961.2;"
没关系,所有生物信息学家都必须从某个地方开始。 :)
基于fanli的MRE回答:
rbindlist(info[, .(list(eval(parse(text=
paste0('list(',
sub(',$', '',
gsub('([^ ]+) ([^;]+); *', '\1=\2,', AdditionalInfo)),
')')), .SD)))
, by = AdditionalInfo]$V1, fill = T)
# gene_id gene_type gene_status gene_name level havana_gene transcript_id transcript_type
#1: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 NA NA
#2: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#3: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#4: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#5: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
# transcript_status transcript_name tag transcript_support_level havana_transcript exon_number exon_id
#1: NA NA NA NA NA NA NA
#2: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 NA NA
#3: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 1 ENSE00002234944.1
#4: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 2 ENSE00003582793.1
#5: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 3 ENSE00002312635.1
上面基本上用=
替换空格,用逗号替换;
,在该字符串前面打一个list
并将其解析为R表达式,将其转换为实际列表。剩下的只是将它按摩成正确的形状。
我发现 bioconductor 包 rtracklayer 中的 readGFF
函数在这里非常合适。
require(rtracklayer)
system.time(gtf <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L))
# user system elapsed
# 34.558 1.541 36.737
dim(gtf)
# [1] 2572840 26
您也可以select只添加您喜欢的标签。
system.time(gtf_tags <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L,
tags = c("gene_id", "transcript_id")))
# user system elapsed
# 16.830 0.733 17.780
dim(gtf_tags)
# [1] 2572840 10
如果你懂 Julia 语言,你可以试试 OpenGene(https://github.com/OpenGene/OpenGene.jl),它有内置的 gencode 解析。
using OpenGene, OpenGene.Reference
# load the gencode dataset, it will download a file from gencode website if it not downloaded before
# once it's loaded, it will be cached so future loads will be fast
index = gencode_load("GRCh37")
# locate which gene chr:pos is in
gencode_locate(index, "chr5", 149526621)
# it will return
# 1-element Array{Any,1}:
# Dict{ASCIIString,Any}("gene"=>"PDGFRB","number"=>1,"transcript"=>"ENST00000261799.4","type"=>"intron")
genes = gencode_genes(index, "TP53")
# return an array with only one record
genes[1].name, genes[1].chr, genes[1].start_pos, genes[1].end_pos
# ("TP53","chr17",7565097,7590856)
我使用 data.table
包编写了一个脚本来解析 GENCODE gtf 文件的最后一列。对于那些不知道的人,该列包含一些键值项,每行用分号分隔。我正在处理的特定文件包含约 250 万行。我索引了前 100 行,然后是前 1000 行,只是为了测试脚本,输出正是我需要的。然而,尽管使用了 set
函数,运行-time 并没有我预期的那么快。前 100 行是即时的,但前 1000 行大约需要一两分钟。这是脚本。
#LOAD DATA.TABLE LIBRARY
require(data.table)
#READ GTF ANNOTATION FILE
info <- fread("gencodeAnnotation.gtf")
colnames(info)[9] <- "AdditionalInfo"
info <- info[1:1000]
#CREATE LIST OF 'KEYS' TO PARSE OUT
pars <- as.character(list("gene_id", "gene_type", "gene_status", "gene_name", " level ", "transcript_name", "transcript_id", "transcript_type", "transcript_support_level", "havana_gene"))
#NESTED FOR LOOP TO PARSE KEY-VALUE PAIR
for (i in 1:length(pars)) {
for (j in 1:nrow(info)) {
infoRow <- info[,tstrsplit(AdditionalInfo, ';', fixed = T)][j]
headerCheck <- like(infoRow, pars[i])
if (any(headerCheck) == TRUE) {
keyVal <- length(tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T))
set(info, i = j, j = toupper(pars[i]), value = tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T)[[keyVal]])
} else {
set(info, i = j, j = toupper(pars[i]), value = NA)
}
}
}
正如我之前所说,在前 100、1000 行上测试时输出是完美的。根据代码,它必须遍历所有行乘以要添加的列数,或 pars
中的项目。我的问题是,我的脚本中缺少什么或者我可以进行哪些编辑以减少 运行 时间?这是正在使用的 gtf 文件的 link:http://www.gencodegenes.org/releases/current.html。它是第一个 link 标记为 "Comprehensive gene annotation"。提前致谢。
每行的样例:
gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2; remap_status full_contig; remap_num_mappings 1; remap_target_status overlap;
MRE:
> dput(info[1:5,])
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1"),
V2 = c("HAVANA", "HAVANA", "HAVANA", "HAVANA", "HAVANA"),
V3 = c("gene", "transcript", "exon", "exon", "exon"), V4 = c(11869L,
11869L, 11869L, 12613L, 13221L), V5 = c(14409L, 14409L, 12227L,
12721L, 14409L), V6 = c(".", ".", ".", ".", "."), V7 = c("+",
"+", "+", "+", "+"), V8 = c(".", ".", ".", ".", "."), AdditionalInfo = c("gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; level 2; havana_gene \"OTTHUMG00000000961.2\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 1; exon_id \"ENSE00002234944.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 2; exon_id \"ENSE00003582793.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 3; exon_id \"ENSE00002312635.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "AdditionalInfo"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x16489e8>)
使用 lapply
之类的矢量化操作,而不是 for
循环。
keys <- lapply(info$AdditionalInfo, function(x)
unlist(lapply(unlist(strsplit(x, "; ")),
function(y) unlist(strsplit(y, " "))[1])) )
values <- lapply(info$AdditionalInfo, function(x)
unlist(lapply(unlist(strsplit(x, "; ")),
function(y) gsub("\"", "", unlist(strsplit(y, " "))[2]))) )
您可以弄清楚如何处理结果 keys
和 values
。
> keys[[1]]
[1] "gene_id" "gene_type" "gene_status" "gene_name" "level"
[6] "havana_gene"
> values[[1]]
[1] "ENSG00000223972.5" "transcribed_unprocessed_pseudogene"
[3] "KNOWN" "DDX11L1"
[5] "2" "OTTHUMG00000000961.2;"
没关系,所有生物信息学家都必须从某个地方开始。 :)
基于fanli的MRE回答:
rbindlist(info[, .(list(eval(parse(text=
paste0('list(',
sub(',$', '',
gsub('([^ ]+) ([^;]+); *', '\1=\2,', AdditionalInfo)),
')')), .SD)))
, by = AdditionalInfo]$V1, fill = T)
# gene_id gene_type gene_status gene_name level havana_gene transcript_id transcript_type
#1: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 NA NA
#2: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#3: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#4: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#5: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
# transcript_status transcript_name tag transcript_support_level havana_transcript exon_number exon_id
#1: NA NA NA NA NA NA NA
#2: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 NA NA
#3: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 1 ENSE00002234944.1
#4: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 2 ENSE00003582793.1
#5: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 3 ENSE00002312635.1
上面基本上用=
替换空格,用逗号替换;
,在该字符串前面打一个list
并将其解析为R表达式,将其转换为实际列表。剩下的只是将它按摩成正确的形状。
我发现 bioconductor 包 rtracklayer 中的 readGFF
函数在这里非常合适。
require(rtracklayer)
system.time(gtf <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L))
# user system elapsed
# 34.558 1.541 36.737
dim(gtf)
# [1] 2572840 26
您也可以select只添加您喜欢的标签。
system.time(gtf_tags <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L,
tags = c("gene_id", "transcript_id")))
# user system elapsed
# 16.830 0.733 17.780
dim(gtf_tags)
# [1] 2572840 10
如果你懂 Julia 语言,你可以试试 OpenGene(https://github.com/OpenGene/OpenGene.jl),它有内置的 gencode 解析。
using OpenGene, OpenGene.Reference
# load the gencode dataset, it will download a file from gencode website if it not downloaded before
# once it's loaded, it will be cached so future loads will be fast
index = gencode_load("GRCh37")
# locate which gene chr:pos is in
gencode_locate(index, "chr5", 149526621)
# it will return
# 1-element Array{Any,1}:
# Dict{ASCIIString,Any}("gene"=>"PDGFRB","number"=>1,"transcript"=>"ENST00000261799.4","type"=>"intron")
genes = gencode_genes(index, "TP53")
# return an array with only one record
genes[1].name, genes[1].chr, genes[1].start_pos, genes[1].end_pos
# ("TP53","chr17",7565097,7590856)