Python: Rosalind Consensus and Profile

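I am trying to solve the Rosalind "Consensus and Profile" challenge. The challenge reads as follows:

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

My code is below (I got most of it from another user on this site). My only problem is that some of the DNA strands are broken across multiple separate lines, so they get appended to the "allstrings" list as separate strings. I am trying to figure out how to write each run of consecutive lines that does not contain ">" as a single string.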

import numpy as np

seq = []
allstrings = []
temp_seq = []
matrix = []
C = []
G = []
T = []
A = []
P = []
consensus = []
position = 1

file = open("C:/Users/knigh/Documents/rosalind_cons (3).txt", "r")
conout = open("C:/Users/knigh/Documents/consensus.txt", "w")

# Right now, this is reading and writing each as an individual line. Thus, it
#  is splitting each sequence into multiple small sequences. You need to figure
#  out how to read this in FASTA format to prevent this from occurring
desc = file.readlines()

for line in desc:
    allstrings.append(line)

for string in range(1, len(allstrings)):
    if ">" not in allstrings[string]:
        temp_seq.append(allstrings[string])
    else:
        seq.insert(position, temp_seq[0])
        temp_seq = []
        position += 1

# This last insertion into the sequence must be performed after the loop to empty
#  out the last remaining string from temp_seq
seq.insert(position, temp_seq[0])

for base in seq:
    matrix.append([pos for pos in base])

M = np.array(matrix).reshape(len(seq), len(seq[0]))

for base in range(len(seq[0])):
    A_count = 0
    C_count = 0
    G_count = 0
    T_count = 0
    for pos in M[:, base]:
        if pos == "A":
            A_count += 1
        elif pos == "C":
            C_count += 1
        elif pos == "G":
            G_count += 1
        elif pos == "T":
            T_count += 1
    A.append(A_count)
    C.append(C_count)
    G.append(G_count)
    T.append(T_count)

profile_matrix = {"A": A, "C": C, "G": G, "T": T}

P.append(A)
P.append(C)
P.append(G)
P.append(T)

profile = np.array(P).reshape(4, len(A))

for pos in range(len(A)):
    if max(profile[:, pos]) == profile[0, pos]:
        consensus.append("A")
    elif max(profile[:, pos]) == profile[1, pos]:
        consensus.append("C")
    elif max(profile[:, pos]) == profile[2, pos]:
        consensus.append("G")
    elif max(profile[:, pos]) == profile[3, pos]:
        consensus.append("T")

conout.write("".join(consensus) + "\n")

for k, v in profile_matrix.items():
    conout.write(k + ": " + " ".join(str(x) for x in v) + "\n")

conout.close()

There are several ways to iterate over a FASTA file as records. You can use a prebuilt library or write your own parser.

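A widely used library for working with sequence data is Biopython. This snippet builds a list of strings, one per record (record.seq is a Seq object, so it is converted with str()).

from Bio import SeqIO

file = "path/to/your/file.fa"
sequences = []

with open(file, "r") as file_handle:
    for record in SeqIO.parse(file_handle, "fasta"):
        # record.seq is a Bio.Seq.Seq object; str() gives a plain string
        sequences.append(str(record.seq))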

Alternatively, you can write your own FASTA parser. Something like this should work:

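def read_fasta(fh):
    """Yield (name, sequence) tuples from an open FASTA file handle."""
    # Iterate to get the first FASTA header
    for line in fh:
        if line.startswith(">"):
            name = line[1:].strip()
            break
    else:
        # No header found: the input holds no FASTA records
        return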

    # This list will hold the sequence lines
    fa_lines = []

    # Now iterate to collect each (possibly multi-line) FASTA sequence
    for line in fh:
        if line.startswith(">"):
            # When in this block we have reached
            #  the next FASTA record

            # yield the previous record's name and
            #  sequence as tuple that we can unpack
            yield name, "".join(fa_lines)

            # Reset the sequence lines and save the
            #  name of the next record
            fa_lines = []
            name = line[1:].strip()

            # skip to next line
            continue

        fa_lines.append(line.strip())

    yield name, "".join(fa_lines)

You can use this function like this:

file = "path/to/your/file.fa"
sequences = []

with open(file, "r") as file_handle:
    for name, seq in read_fasta(file_handle):
        sequences.append(seq)
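
Once every record is a single string, the profile matrix and consensus string can be computed without NumPy. The following is only a minimal sketch, not the only way to do it: the function name profile_and_consensus and the example data are made up for illustration, and it assumes every sequence has the same length and contains only uppercase A, C, G and T.

```python
def profile_and_consensus(sequences):
    """Return (profile, consensus) for equal-length DNA strings.

    Hypothetical helper for illustration; assumes only A, C, G, T occur.
    """
    bases = "ACGT"
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    # profile[base][i] counts how often `base` occurs at column i
    profile = {base: [0] * length for base in bases}
    for seq in sequences:
        for i, base in enumerate(seq):
            profile[base][i] += 1

    # Pick the most frequent base per column; max() returns the first
    # maximum it sees, so ties resolve in A, C, G, T order
    consensus = "".join(
        max(bases, key=lambda b: profile[b][i]) for i in range(length)
    )
    return profile, consensus


# Made-up example data, standing in for the `sequences` list built above
example = ["ATGC", "ATGG", "CTGC"]
profile, consensus = profile_and_consensus(example)

print(consensus)
for base in "ACGT":
    print(base + ": " + " ".join(str(n) for n in profile[base]))
```

To write the result to a file instead of printing it, replace the print calls with conout.write(...) as in your original code. Since the challenge accepts any valid consensus string, the tie-breaking order does not matter.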