Filtering a TSV file to get the top 3 occurrences based on a column value
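I need to filter BLAST results stored in a .tsv file.
The filter criteria are:
- Keep only hits with an E-value < 10E-20 and ignore the rest.
- For each contig, keep the top 3 BLAST hits. A contig does not necessarily have 3 hits, and many contigs have more than 3.
The E-values are in the third column.
The file is saved as a .tsv in this format: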
contig-001 [Enterobacteria phage G4 sensu lato] 9.01988e-168 5418 GCATAC
contig-001 [Enterobacteria phage ID18 sensu lato] 9.97265e-167 5418 GCATACGAAAAGACAGAATCTC
contig-002 [Enterobacteria phage ID2 Moscow/ID/2001] 1.10261e-165 5418 GCATACGAAAAGAC
contig-002 [Enterobacteria phage phiX174 sensu lato] 3.31985e-162 5418 GACTGATCGCAGT
contig-002 [Enterobacteria phage ID2 Moscow/ID/2001] 7.92015e-156 5418 GCATACGAAAAGAC
contig-002 [Enterobacteria phage ID18 sensu lato] 2.38469e-152 5418 GCATACGAAAAGAC
contig-003 [Enterobacteria phage ID2 Moscow/ID/2001] 1.08293e-112 5418 GCATACGAAAAGAC
contig-003 [Sweetpotato badnavirus A] 0.000593081 6592 CATCGTAGCTGAT
contig-003 [Dahlia mosaic virus] 0.000593081 6592 CAAGAAGATAGAGAGTCCCACA
Assuming the result you want to keep is the nucleotide sequence (the last column), this should work:
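import csv
from collections import defaultdict

threshold = 10E-20  # same as 1e-19; only hits below this are kept
data = defaultdict(list)  # contig number -> list of (E-value, sequence)

with open('path/to/file') as infile:
    for contig, _ignore, e, _id, nuc in csv.reader(infile, delimiter='\t'):
        contig = int(contig.split('-')[1])
        e = float(e)
        if e >= threshold: continue  # skip hits that are not significant enough
        data[contig].append((e, nuc))
        # keep only the 3 lowest (best) E-values seen so far for this contig
        if len(data[contig]) > 3: data[contig].remove(max(data[contig]))

for contig, hits in data.items():
    for e, nuc in sorted(hits):  # best hit first
        print(contig, e, nuc)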
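If you would rather have this as a reusable function, here is a minimal sketch of the same filter built on `heapq.nsmallest`. The function name `top_hits`, its `n` parameter, and the in-memory `sample` are my own choices for illustration, not something from the question. It assumes the columns really are separated by tabs and that "top" means the lowest E-values.

```python
import csv
import heapq
import io
from collections import defaultdict


def top_hits(rows, threshold=10e-20, n=3):
    """Map each contig to its (up to) n rows with the lowest E-values below threshold."""
    by_contig = defaultdict(list)
    for row in rows:
        e = float(row[2])  # the E-value is the third column
        if e < threshold:
            by_contig[row[0]].append((e, row))
    # nsmallest with a key compares only the E-value, so tied hits are all kept
    return {contig: heapq.nsmallest(n, hits, key=lambda hit: hit[0])
            for contig, hits in by_contig.items()}


# Hypothetical sample: a few rows from the question, joined with real tabs.
sample = "\n".join([
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.10261e-165\t5418\tGCATACGAAAAGAC",
    "contig-002\t[Enterobacteria phage phiX174 sensu lato]\t3.31985e-162\t5418\tGACTGATCGCAGT",
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t7.92015e-156\t5418\tGCATACGAAAAGAC",
    "contig-002\t[Enterobacteria phage ID18 sensu lato]\t2.38469e-152\t5418\tGCATACGAAAAGAC",
    "contig-003\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.08293e-112\t5418\tGCATACGAAAAGAC",
    "contig-003\t[Sweetpotato badnavirus A]\t0.000593081\t6592\tCATCGTAGCTGAT",
    "contig-003\t[Dahlia mosaic virus]\t0.000593081\t6592\tCAAGAAGATAGAGAGTCCCACA",
])

result = top_hits(csv.reader(io.StringIO(sample), delimiter="\t"))
for contig, hits in result.items():
    for e, row in hits:
        print(contig, e, row[4])
```

For a real file, pass `csv.reader(infile, delimiter='\t')` instead of the `io.StringIO` wrapper. This version collects every passing hit per contig before trimming, which is simpler but uses more memory on very large files than trimming as you go.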