Python splitting with string as delimiter

I have a file that looks like this:

AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC

I need to split it into the "non-N" sequences, so two separate files like this:

AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCT

AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGC
TATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACA
TGTGTTGC

What I currently have is:

UMfile = open ("C:\Users\Manuel\Desktop\sequence.txt","r")
contignumber = 1
contigfile = open ("contig "+str(contignumber), "w")

DNA = UMfile.read()
DNAstring = str(DNA)

for s in DNAstring:
    DNAstring.split("NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN",1)
contigfile.write(DNAstring)

contigfile.close()
contignumber = contignumber+1
contigfile = open ("contig "+str(contignumber), "w")

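The problem is that I realized there are line breaks between the "Ns", which is why my file is not being split. Also, the "file" I showed is just one part of a much larger one. So sometimes the "Ns" look like "NNNNNN\n" and sometimes like "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n", but there is always a count of 1000 Ns between the sequences I need to split.

So my question is: how can I tell Python to split at every run of 1000 Ns and write each piece to a different file, given that each line will contain a different number of Ns?

Thanks a lot, everyone. I really have no informatics background, and my Python skills are basic at best.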

Just split your string on 'N' and then drop every string that is empty or contains only a newline. Like this:

#!/usr/bin/env python

DNAstring = '''AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACT
TCTCAATGGGCAGTACATATCATCTCTNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAAAACGTGTGCATGAACAAAAAA
CGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCAC
TAAATCAACGGACATGTGTTGC'''

sequences = [u for u in DNAstring.split('N') if u and u != '\n']

for i, seq in enumerate(sequences):
    print(i)
    print(seq.replace('\n', '') + '\n')

Output

0
AAACAACAGGGTACAAAGAGTCACGCTTATCCTGTTGATACTTCTCAATGGGCAGTACATATCATCTCT

1
AAAACGTGTGCATGAACAAAAAACGTAGCAGATCGTGACTGGCTATTGTATTGTGTCAATTTCGCTTCGTCACTAAATCAACGGACATGTGTTGC

The snippet above also uses .replace('\n', '') to remove the newlines inside each sequence.
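The question asked for one file per sequence, so here is a minimal sketch of that last step, assuming the list comprehension above. The helper name write_contigs, the "contig<n>.txt" naming scheme, and the temporary output directory are illustrative assumptions, not something from the question.

```python
import os
import tempfile


def write_contigs(sequences, out_dir, base="contig"):
    """Write each sequence to out_dir/<base><n>.txt and return the paths."""
    paths = []
    for number, seq in enumerate(sequences, 1):
        path = os.path.join(out_dir, "%s%d.txt" % (base, number))
        with open(path, "w") as handle:
            # Join the wrapped lines of the sequence into a single line
            handle.write(seq.replace("\n", "") + "\n")
        paths.append(path)
    return paths


# Tiny stand-in for the real data: two sequences separated by wrapped Ns
dna = "AAAC\nGGTNNN\nNNNN\nNNCCGT\nTTA"
sequences = [u for u in dna.split("N") if u and u != "\n"]

out_dir = tempfile.mkdtemp()  # hypothetical output location
paths = write_contigs(sequences, out_dir)
```

Keep in mind that splitting on every single 'N' also cuts at any short run of Ns inside a sequence, so this only fits data where Ns occur nowhere except in the separators.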

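Here are a few programs that you may find useful.

First, a line buffer class. You initialize it with a file name and a line width. You can then feed it strings of arbitrary length, and it will automatically save them to a text file line by line, with every line (except possibly the last) having the given length. You can use this class in other programs to make your output look neat.

Save this file as linebuffer.py somewhere on your Python path; the simplest way is to save it wherever you keep your Python programs and make that the current directory when you run them.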

linebuffer.py

#! /usr/bin/env python

''' Text output buffer 

    Write fixed width lines to a text file

    Written by PM 2Ring 2015.03.23
'''

class LineBuffer(object):
    ''' Text output buffer

        Write fixed width lines to file fname
    '''
    def __init__(self, fname, width):
        self.fh = open(fname, 'wt')
        self.width = width
        self.buff = []
        self.bufflen = 0

    def write(self, data):
        ''' Write a string to the buffer '''
        self.buff.append(data)
        self.bufflen += len(data)
        if self.bufflen >= self.width:
            self._save()

    def _save(self):
        ''' Write the buffer to the file '''
        buff = ''.join(self.buff)

        #Split buff into lines
        lines = []
        while len(buff) >= self.width:
            lines.append(buff[:self.width])
            buff = buff[self.width:]

        #Add an empty line so we get a trailing newline
        lines.append('')
        self.fh.write('\n'.join(lines))  

        self.buff = [buff]
        self.bufflen = len(buff)

    def close(self):
        ''' Flush the buffer & close the file '''
        if self.bufflen > 0:
            self.fh.write(''.join(self.buff) + '\n')
        self.fh.close()


def testLB():
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    fname = 'linebuffer_test.txt'
    lb = LineBuffer(fname, 27)
    for _ in range(30):
        lb.write(alpha)
    lb.write(' bye.')
    lb.close()


if __name__ == '__main__':
    testLB()
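When the whole string already fits in memory, the same fixed-width layout can be sketched without a buffer class. This is only an illustration of what LineBuffer produces; the helper name wrap_fixed is an assumption and is not part of linebuffer.py.

```python
def wrap_fixed(text, width):
    """Cut text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# 20 characters wrapped at width 8: two full lines and one shorter last line
lines = wrap_fixed("ACGT" * 5, 8)
print("\n".join(lines))
```

LineBuffer is still the better fit when the data arrives in many small pieces, because it writes full lines as soon as they are available instead of holding everything in memory.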

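Here is a program that generates random DNA sequences of the form you describe in your question. It uses linebuffer.py to handle the output. I wrote it so that I could test my DNA sequence splitter properly.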

Random_DNA0.py

#! /usr/bin/env python

''' Make random DNA sequences

    Sequences consist of random subsequences of the letters 'ACGT'
    as well as short sequences of 'N', of random length up to 200.
    Exactly 1000 'N's separate sequence blocks. 
    All sequences may contain newline chars

    Takes approx 3 seconds per megabyte generated and saved 
    on a 2GHz CPU single core machine.

    Written by PM 2Ring 2015.03.23
'''

import sys
import random
from linebuffer import LineBuffer

#Set seed to None to seed randomizer from system time
random.seed(37)

#Output line width
linewidth = 120

#Subsequence base length ranges
minsub, maxsub = 15, 300

#Subsequences per sequence ranges
minseq, maxseq = 5, 50

#random 'N' sequence ranges
minn, maxn = 5, 200

#Probability that a random 'N' sequence occurs after a subsequence
randn = 0.2

#Sequence separator
nsepblock = 'N' * 1000

def main():
    #Get number of sequences from the command line
    numsequences = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    outname = 'DNA_sequence.txt'

    lb = LineBuffer(outname, linewidth)
    for i in range(numsequences):
        #Write the 1000*'N' separator between sequences
        if i > 0:
            lb.write(nsepblock)

        for j in range(random.randint(minseq, maxseq)):
            #Possibly make a short run of 'N's in the sequence
            if j > 0 and random.random() < randn:
                lb.write('N' * random.randint(minn, maxn))

            #Create a single subsequence
            r = range(random.randint(minsub, maxsub))
            lb.write(''.join([random.choice('ACGT') for _ in r]))
    lb.close()


if __name__ == '__main__':
    main()

Finally, here is a program that splits your random DNA sequences. Again, it uses linebuffer.py to handle the output.

DNA_Splitter0.py

#! /usr/bin/env python

''' Split DNA sequences and save to separate files

    Sequences consist of random subsequences of the letters 'ACGT'
    as well as short sequences of 'N', of random length up to 200.
    Exactly 1000 'N's separate sequence blocks. 
    All sequences may contain newline chars

    Written by PM 2Ring 2015.03.23
'''

import sys
from linebuffer import LineBuffer

#Output line width
linewidth = 120

#Sequence separator
nsepblock = 'N' * 1000

def main():
    iname = 'DNA_sequence.txt'
    outbase = 'contig'

    with open(iname, 'rt') as f:
        data = f.read()

    #Remove all newlines
    data = data.replace('\n', '')

    sequences = data.split(nsepblock)

    #Save each sequence to a series of files
    for i, seq in enumerate(sequences, 1):
        outname = '%s%05d' % (outbase, i)
        print(outname)

        #Write sequence data, with line breaks
        lb = LineBuffer(outname, linewidth)
        lb.write(seq)
        lb.close()


if __name__ == '__main__':
    main()
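To see why the splitter cuts on the full 1000-N block rather than on every 'N', here is a small self-contained check. The names split_contigs and wrap and the toy sequence are illustrative assumptions; the logic mirrors the replace-then-split steps in main() above.

```python
SEPARATOR = "N" * 1000


def wrap(text, width=60):
    """Insert a newline every `width` characters, like the input file."""
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))


def split_contigs(text, separator=SEPARATOR):
    """Remove all newlines, then split on the exact separator block."""
    return text.replace("\n", "").split(separator)


# One contig with a short internal run of 7 Ns, the separator, then a second contig
raw = wrap("ACGT" + "N" * 7 + "TTGA" + SEPARATOR + "GGCC")

contigs = split_contigs(raw)

# For comparison: splitting on every single N also cuts at the short run
naive = [u for u in raw.replace("\n", "").split("N") if u]
```

Here contigs keeps the 7-N run inside the first sequence, while the naive split breaks that sequence in two.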

You can simply replace every N and every \n with a space, and then split.

result = DNAstring.replace("\n", " ").replace("N", " ").split()

This gives you a list of strings. Note that the 'ACGT' sequences are also split at every line break.

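If that is not what you want, and you would rather keep the \n inside the 'ACGT' sequences instead of splitting on it, you can do the following. Because split() with no argument also splits on newlines, this version splits on a single space and drops the empty strings:

result = [s for s in DNAstring.replace("N\n", " ").replace("N", " ").split(" ") if s]

This only removes the \n characters that sit inside a run of Ns.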
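A small worked example of the two variants on a toy string (the sample data is made up for illustration). The second variant splits on a single space and filters out empty strings, since split() with no argument would treat the remaining newlines as separators too.

```python
dna = "ACGT\nTTNN\nNNNN\nNNGG\nCCAA"

# Variant 1: every N and every newline becomes a separator
per_line = dna.replace("\n", " ").replace("N", " ").split()

# Variant 2: only Ns (and newlines directly after an N) become separators,
# so line breaks inside the ACGT sequences are kept
keep_breaks = [
    piece
    for piece in dna.replace("N\n", " ").replace("N", " ").split(" ")
    if piece
]

print(per_line)
print(keep_breaks)
```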

To split the string at exactly 1000 Ns:

# 1/ Get rid of line breaks in the N sequence
result = DNAstring.replace("N\n", "N")
# 2/ split every 1000 Ns
result = result.split(1000*"N")
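The same idea can be sketched as a single regular expression that matches 1000 Ns while tolerating a line break after any of them. This is an alternative to the two-step replace and split, not part of the original answer; it assumes Unix-style "\n" line endings and that separators are exactly 1000 Ns long.

```python
import re

# 1000 repetitions of "an N, optionally followed by a newline"
SEPARATOR_RE = re.compile(r"(?:N\n?){1000}")


def split_on_block(text):
    """Split text at every 1000-N block, even if the block is line-wrapped."""
    return SEPARATOR_RE.split(text)


# The separator is wrapped across two lines: 300 Ns, newline, 700 Ns, newline
text = "ACGT\nTT" + "N" * 300 + "\n" + "N" * 700 + "\nGGCC\nAA"
parts = split_on_block(text)

# A short run of Ns is left alone because it never reaches 1000
short_run = split_on_block("AC" + "N" * 5 + "GT")
```

Line breaks inside the sequences themselves are preserved, so the pieces can be written out without re-wrapping them.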

Assuming you can read the whole file in one go:

s = DNAstring.replace("\n", "")          # first remove the nasty linebreaks
l = [x for x in s.split("N") if x]       # split and drop empty strings

for x in l:                              # print in chunks of 10 characters
    while x:
        print(x[:10])
        x = x[10:]
    print()                              # extra linebreak between chunks
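If the file is too large to read in one go, the splitting can be sketched as a generator that consumes the input line by line and holds only one sequence in memory at a time. The function name iter_contigs and its sep_len parameter are assumptions for illustration; unlike the snippet above, it splits only on runs of 1000 Ns and keeps shorter runs inside the sequence.

```python
from itertools import groupby


def iter_contigs(lines, sep_len=1000):
    """Yield sequences separated by blocks of sep_len Ns, reading line by line."""
    current = []    # pieces of the sequence being built
    pending_n = 0   # Ns seen in the current run that are not yet assigned
    for line in lines:
        # Group the line into alternating runs of 'N' and non-'N' characters
        for is_n, group in groupby(line.rstrip("\n"), key=lambda ch: ch == "N"):
            chunk = "".join(group)
            if is_n:
                pending_n += len(chunk)
                # A full separator block ends the current sequence
                while pending_n >= sep_len:
                    yield "".join(current)
                    current = []
                    pending_n -= sep_len
            else:
                # The N run was shorter than a separator: keep it in the sequence
                if pending_n:
                    current.append("N" * pending_n)
                    pending_n = 0
                current.append(chunk)
    if pending_n:
        current.append("N" * pending_n)
    yield "".join(current)


# In-memory stand-in for a file; an open file object yields lines the same way
lines = ["AAAC\n", "GG" + "N" * 5 + "\n", "TT" + "N" * 400 + "\n",
         "N" * 600 + "CC\n", "GT\n"]
contigs = list(iter_contigs(lines))
```

Passing an open file instead of the list works the same way, because iterating over a file object yields one line at a time.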