Filtering a tsv file to get the top 3 occurrences based on a column value

I need to filter BLAST results from a .tsv file. The filter criteria are:

  1. Keep only hits with an E-value < 10E-20 and ignore all others.
  2. For each contig, keep the top 3 BLAST hits. Not every contig has 3 hits, and many contigs have more than 3.

The E-values are in the third column.

The file is saved as a .tsv in this format:

contig-001      [Enterobacteria phage G4 sensu lato]          9.01988e-168    5418    GCATAC
contig-001      [Enterobacteria phage ID18 sensu lato]        9.97265e-167    5418    GCATACGAAAAGACAGAATCTC
contig-002      [Enterobacteria phage ID2 Moscow/ID/2001]     1.10261e-165    5418    GCATACGAAAAGAC
contig-002      [Enterobacteria phage phiX174 sensu lato]     3.31985e-162    5418    GACTGATCGCAGT
contig-002      [Enterobacteria phage ID2 Moscow/ID/2001]     7.92015e-156    5418    GCATACGAAAAGAC
contig-002      [Enterobacteria phage ID18 sensu lato]        2.38469e-152    5418    GCATACGAAAAGAC
contig-003      [Enterobacteria phage ID2 Moscow/ID/2001]     1.08293e-112    5418    GCATACGAAAAGAC
contig-003      [Sweetpotato badnavirus A]                    0.000593081     6592    CATCGTAGCTGAT
contig-003      [Dahlia mosaic virus]                         0.000593081     6592    CAAGAAGATAGAGAGTCCCACA

Assuming the result you want to save is the nucleotide sequence (the last column), this should work:

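import csv
from collections import defaultdict

threshold = 10E-20  # E-value cutoff from the question (numerically 1e-19)

# contig number -> list of (E-value, sequence) pairs, at most 3 per contig
data = defaultdict(list)
with open('path/to/file') as infile:
    for contig, _ignore, e, _id, nuc in csv.reader(infile, delimiter='\t'):
        contig = int(contig.split('-')[1])
        e = float(e)
        if e >= threshold: continue  # keep only hits with E-value < threshold
        # a list (not a dict keyed by E-value) so hits with equal E-values are not overwritten
        data[contig].append((e, nuc))
        # keep the 3 smallest (best) E-values seen so far
        data[contig] = sorted(data[contig])[:3]

for contig, hits in data.items():
    for e, nuc in hits:
        print(contig, e, nuc)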
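
If you would rather have something you can test without touching the file, here is a minimal sketch of the same idea wrapped in a function, using `heapq.nsmallest` to pick the lowest E-values per contig. The function name `top_hits`, its `limit` parameter, and the in-memory `sample` list are my own names for illustration. It assumes the real file is tab-separated with exactly five columns, as in the sample above, and it keeps the whole row rather than only the sequence.

```python
import csv
import heapq
from collections import defaultdict

def top_hits(lines, threshold=10E-20, limit=3):
    """Return {contig: [(e_value, row), ...]} holding the best `limit` hits per contig."""
    hits = defaultdict(list)
    for row in csv.reader(lines, delimiter='\t'):
        if len(row) != 5:
            continue  # skip blank or malformed lines (assumes 5 columns per hit)
        e = float(row[2])  # E-value is in the third column
        if e < threshold:
            hits[row[0]].append((e, row))
    # nsmallest returns the hits sorted by ascending E-value (best first)
    return {contig: heapq.nsmallest(limit, rows, key=lambda hit: hit[0])
            for contig, rows in hits.items()}

# A few rows taken from the question, joined with real tab characters
sample = [
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.10261e-165\t5418\tGCATACGAAAAGAC",
    "contig-002\t[Enterobacteria phage phiX174 sensu lato]\t3.31985e-162\t5418\tGACTGATCGCAGT",
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t7.92015e-156\t5418\tGCATACGAAAAGAC",
    "contig-002\t[Enterobacteria phage ID18 sensu lato]\t2.38469e-152\t5418\tGCATACGAAAAGAC",
    "contig-003\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.08293e-112\t5418\tGCATACGAAAAGAC",
    "contig-003\t[Sweetpotato badnavirus A]\t0.000593081\t6592\tCATCGTAGCTGAT",
]

for contig, rows in top_hits(sample).items():
    for e, row in rows:
        print(contig, e, row[1])
```

To run it on the real data, pass an open file object instead of the list, for example `top_hits(open('path/to/file', newline=''))`. Contigs whose hits all fail the cutoff simply do not appear in the result.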