How to filter out duplicated fasta sequences from a file

I have this code:

import sys
import argparse
import operator

def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', help='file to process')
    parser.add_argument('outfile', help='file to produce')
    args = parser.parse_args()


    with open(args.infile, "r") as f:
        with open(args.outfile,"w+") as of:
            seen=set()
            for line in f:
                line_lower = line.lower()
                if line_lower not in seen:
                    of.write(line_lower)
                else:
                    pass


if __name__ == "__main__":
    main(sys.argv)

Example infile:

M03972:51:000000000-BJVL8:1:1103:20083:5527 CATGTTCGGCTTGGCCTACTTCTCTATGCAGGGAGCGTGGGCGAGAGTCGTTGTCATCCTTCTGCTGGCCGCCGGGGTGGACGCGCGCACCCATACTGTTGGGGGTTCTGCCGCGCAGACCACCGGGCGCCTCACCAGCTTATTTGACATGGGCCCCAGGCAGAAAATCCAGCTCGTTAACACCAATGGCAGCTGGCACATCAACCGCACCGCCCTGAACTGCAATGACTCCTTGCACACCGGCTTTATCG

Sometimes there are duplicate sequences. I want to remove them, but my code doesn't seem to work. It basically just copies the file, and no errors are thrown. Does anyone know why?

Thanks

You never add anything to seen, so the membership test always fails and every line gets written. Add each new line to seen the first time you encounter it. Here is the fixed part of the code:

seen = set()
for line in f:
    line_lower = line.lower()
    if line_lower not in seen:
        seen.add(line_lower)   # remember this line so later duplicates are skipped
        of.write(line_lower)
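The same set-based deduplication can be pulled into a standalone function, which makes it easy to test without files. This sketch keeps the first occurrence of each line in its original case (a small variation on the code above, which writes the lowercased line); the sample records below are made up for illustration:

```python
def dedupe_lines(lines):
    """Yield each line once, comparing case-insensitively.

    The lowercased line is used only as the dedup key; the first
    occurrence is returned unchanged.
    """
    seen = set()
    out = []
    for line in lines:
        key = line.lower()
        if key not in seen:
            seen.add(key)      # mark this line as seen
            out.append(line)   # keep only the first occurrence
    return out


# Hypothetical records: the third line duplicates the first, ignoring case.
records = ["ID1 CATGTT\n", "ID2 GGCTTA\n", "id1 catgtt\n"]
print(dedupe_lines(records))
```

Note that this treats each physical line as one record, which matches the one-line-per-sequence format shown above; for multi-line FASTA files you would need to group each header with its sequence before deduplicating.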