打印匹配项和匹配项后的行

Print match and line after match

我有这个包含 82 对 ID 的文件:

EmuJ_000063620.1    EgrG_000063620.1    253 253
EmuJ_000065200.1    EgrG_000065200.1    128 128
EmuJ_000081200.1    EgrG_000081200.1    1213    1213
EmuJ_000096200.1    EgrG_000096200.1    295 298
EmuJ_000114700.1    EgrG_000114700.1    153 153
EmuJ_000133800.1    EgrG_000133800.1    153 153
EmuJ_000139900.1    EgrG_000144400.1    2937    2937
EmuJ_000164600.1    EgrG_000164600.1    167 167

我还有另外两个文件,其中 EmuJ_* ID 和 EgrG_* ID 的序列如下:

EgrG_sequences.fasta:

>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*

等等。 EmuJ_sequences.fasta 也一样 我需要获取每一对的序列,然后一个接一个地写,保持这样的顺序:

>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*

以此类推

我在 bash 中编写了一个脚本来执行此操作,它按我想要的方式工作,非常简单。现在我正在尝试在 Python(我正在学习)中做同样的事情,但我很难以 pythonic 方式做同样的事情。 我已经试过了,但我只有第一对,然后就停止了:

rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')

for l in rbh:
    emid=l.split('\t')[0]
    egid=l.split('\t')[1]
    # ids=emid+'\n'+egid 
    # print ids # just to check if split worked
    for lm in ems:      
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()

我尝试了一些变体,但我只有 ID,没有序列。 那么,我如何在序列文件中找到 ID,打印它和它后面的行(序列引用 ID 的行)? 如果我解释清楚了,请告诉我。

迭代文件会移动文件指针,直到它到达文件末尾(最后一行),因此在外循环的第一次迭代之后,emsegs 文件累死了。

快速而肮脏的解决方法是在外循环结束时将 emsegs 指针重置为零,即:

for line in rbh:
    # no need to split twice
    parts = line.split("\t") 
    emid, egid = parts[0].strip(), parts[1].strip()

    for lm in ems:      
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    ems.seek(0) # reset the file pointer

    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()
    egs.seek(0) # reset the file pointer

请注意,在已经迭代 iterator 的同时调用 next(iterator) 将消耗迭代器的一项,如下所示:

>>> it = iter(range(20))
>>> for x in it:
...     print x, next(it)
... 
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19

如您所见,我们不会在此处对范围的每个元素进行迭代...鉴于您的文件格式,这应该不是一个大问题,但我想我仍然会警告您。

现在您的算法效率很低 - 对于 rbh 文件的每一行,它将一次又一次地扫描整个 emsegs 文件。

_NB : 以下假设每个 emid / egid 将在 fasta 最多 出现一次 files._

如果您的 emsegs 文件不是太大并且您有足够的可用内存,您可以将它们加载到一对字典中并进行简单的字典查找(这是 O( 1) 并且可能是 Python)

中最优化的操作之一
# warning: totally untested code

def fastamap(path):
    d = dict()
    with open(path) as f:
        for num, line in enumerate(f, 1):
            line = line.strip()
            # skip empty lines.
            if not line:
                continue

            # sanity check: we should only see 
            # lines starting with ">", the "value"
            # lines being consumed by the `next(f)` call
            if not line.startswith(">"):
                raise ValueError(
                    "in file %s: line %s doesn't start with '>'" % (
                      path, num
                    ))

            # ok, proceed
            d[line.lstrip(">")] = next(f).strip()

    return d

ems = fastamap('em_seq.fasta') 
egs = fastamap('eg_seq.fasta')
with open('rbh_res_eg-not-sec.txt') as rhb:
    for line in rhb:
        parts = line.split("\t") 
        emid, egid = parts[0].strip(), parts[1].strip()

        if emid in ems:
            print emid
            print ems[emid]

        if egid in egs:
            print egid
            print egs[egid]

如果由于内存问题而无法正常运行,那么不幸的是你被顺序扫描困住了(除非你想使用一些数据库系统,但这可能有点矫枉过正),但是 - 总是假设 emid/egid 每个只在 fasta 文件中出现一次 - 一旦找到目标,您至少可以退出内部循环:

for l in rbh:
    # no need to split twice, you can just unpack
    emid, egid = l.split('\t')

    for lm in ems:      
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
            break # no need to go further

    ems.seek(0) # reset the file pointer

    # etc...