Convert a hmmer --tblout output to a pandas dataframe
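- Is there a way to convert hmmer output to a pandas dataframe?
- I'm also not sure how to load a hmmer tblout table into Python through the Biopython module.
I believe you would use SeqIO.parse or SeqIO.search with a format name for hmmer. The table appears to be tab-delimited, but it is actually padded with a variable number of spaces, so even if I remove the headers and the # lines and leave only the table rows, there is no straightforward way to split the table on a tab delimiter.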
A small example of a hmmer --tblout file is below:
# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
3300000568@Draft_10015026@Draft_1001502652 - Bacteria_NODE_1_length_628658_cov_8.291329_24 - 7.1e-07 29.3 0.0 1.9e-05 24.6 0.0 2.0 1 1 1 2 2 2 2 -
7000000546@SRS019910_WUGC_scaffold_3948@SRS019910_WUGC_scaffold_3948_gene_2890 - Bacteria_NODE_1_length_628658_cov_8.291329_53 - 1.6e-07 31.7 0.0 0.00051 20.3 0.0 2.2 2 0 0 2 2 2 2 -
#
# Program: hmmscan
# Version: 3.1b2 (February 2015)
# Pipeline mode: SCAN
# Query file: ../Exponential_High_Complexity_Simulation.faa
# Target file: final_list.hmm
# Option settings: hmmscan --tblout Exponential_Earth.txt -E 1e-5 --cpu 8 final_list.hmm ../Exponential_High_Complexity_Simulation.faa
# Current dir: /Strong/home/glickmanc/Programs/EarthVirome
# Date: Mon Feb 24 10:47:51 2020
# [ok]
I would build a dictionary from the attributes you are interested in and make a DataFrame from that dictionary. Suppose you are interested in the attributes of the hits:
from collections import defaultdict

import pandas as pd
from Bio import SearchIO

filename = 'test.hmmer'

# Hit attributes to collect, one DataFrame column per attribute
attribs = ['accession', 'bias', 'bitscore', 'description', 'cluster_num',
           'domain_exp_num', 'domain_included_num', 'domain_obs_num',
           'domain_reported_num', 'env_num', 'evalue', 'id', 'overlap_num',
           'region_num']

hits = defaultdict(list)

with open(filename) as handle:
    for queryresult in SearchIO.parse(handle, 'hmmer3-tab'):
        #print(queryresult.id)
        #print(queryresult.accession)
        #print(queryresult.description)
        for hit in queryresult.hits:
            for attrib in attribs:
                hits[attrib].append(getattr(hit, attrib))

pd.DataFrame.from_dict(hits)
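If you would rather not go through Biopython, the spacing problem described in the question can also be handled directly. A --tblout row has 18 whitespace-separated fields followed by a free-text description, so splitting on runs of whitespace at most 18 times keeps the description in one piece. The following is only a minimal sketch under that assumption: the function name read_tblout and the column labels are made up for illustration and are not names defined by HMMER or pandas.

```python
import io

import pandas as pd

# Illustrative column labels (an assumption, not HMMER identifiers); they follow
# the order of the header line in the --tblout file shown in the question.
COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "dom_evalue", "dom_score", "dom_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
    "description",
]


def read_tblout(handle):
    """Read hmmer --tblout rows from an open text handle into a DataFrame."""
    rows = []
    for line in handle:
        line = line.strip()
        # Skip the header, the trailing run summary, and blank lines
        if not line or line.startswith("#"):
            continue
        # Split on runs of whitespace at most 18 times so that a description
        # containing spaces stays together in the last field
        rows.append(line.split(maxsplit=len(COLUMNS) - 1))
    df = pd.DataFrame(rows, columns=COLUMNS)
    # The 14 fields from the full-sequence E-value through "inc" are numeric
    numeric = COLUMNS[4:18]
    df[numeric] = df[numeric].apply(pd.to_numeric)
    return df


# One row from the question's example, wrapped in comment lines
sample = """\
# target name accession query name accession E-value score bias ...
3300000568@Draft_10015026@Draft_1001502652 - Bacteria_NODE_1_length_628658_cov_8.291329_24 - 7.1e-07 29.3 0.0 1.9e-05 24.6 0.0 2.0 1 1 1 2 2 2 2 -
# [ok]
"""

df = read_tblout(io.StringIO(sample))
print(df[["target_name", "query_name", "full_evalue", "full_score"]])
```

For a real file, pass an open handle instead of the StringIO object, for example `with open('test.hmmer') as handle: df = read_tblout(handle)`. Unlike the SearchIO approach, this keeps the query name on every row, which may be convenient when one file holds hits for many queries.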