不使用 BioPython 读取 FASTA 中的核苷酸
Read nucleotides in FASTA without using BioPython
我需要获得与以下代码相同的输出,但不使用 BioPython。我卡住了...有人可以帮助我吗?谢谢!!!
from Bio import SeqIO
records = SeqIO.parse("data/assembledSeqs.fa", "fasta")
for i, seq_record in enumerate(records):
print("Sequence %d:" % i)
print("Number of A's: %d" % seq_record.seq.count("A"))
print("Number of C's: %d" % seq_record.seq.count("C"))
print("Number of G's: %d" % seq_record.seq.count("G"))
print("Number of T's: %d" % seq_record.seq.count("T"))
print()
FASTA 文件如下所示:
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTG
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGC
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86
CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
CAAACCACTTTCACCGCCACACGACC
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120
TGAACCTACGACTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCG
我试过了,但根本行不通
def count_bases (fasta_file_name):
with open(fasta_file_name) as file_content:
for seqs in file_content:
if seqs.startswith('>'):
for i, seq in enumerate('>'):
print("Sequence %d:" % i)
else:
print("Number of A's: %d" % seqs.count("A"))
print("Number of C's: %d" % seqs.count("C"))
print("Number of G's: %d" % seqs.count("G"))
print("Number of T's: %d" % seqs.count("T"))
print()
return bases
result = count_bases('data/assembledSeqs.fa')
这行得通。首先我们要将你的fasta文件解析为headers和序列,然后将序列列表列表展平为字符串列表,然后计算字符串中每个核苷酸的数量,然后打印:
import sys
fasta = sys.argv[1]
def fastaParser(infile):
seqs = []
headers = []
with open(infile) as f:
sequence = ""
header = None
for line in f:
if line.startswith('>'):
headers.append(line[1:-1])
if header:
seqs.append([sequence])
sequence = ""
header = line[1:]
else:
sequence += line.rstrip()
seqs.append([sequence])
return headers, seqs
headers, seqs = fastaParser(fasta)
flat_seqs = [item for sublist in seqs for item in sublist]
def countNucs(instring):
# will count upper and lower case sequences, if do not want lower case remove .upper()
g = instring.upper().count('G')
c = instring.upper().count('C')
a = instring.upper().count('A')
t = instring.upper().count('T')
return 'G = {}, C = {}, A = {}, T = {}'.format(g, c, a, t)
for header, seq in zip(headers, flat_seqs):
print(header, countNucs(seq))
在 command-line:
上使用你的示例 fasta 到 运行
[me]$ python3.5 fasta.py infasta.fa
输出:
chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79 G = 24, C = 17, A = 14, T = 24
chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80 G = 16, C = 30, A = 17, T = 17
chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86 G = 12, C = 31, A = 27, T = 16
chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120 G = 20, C = 41, A = 31, T = 28
我会使用 Counter 字典,如下所示:
from collections import Counter
with open('temp.fasta', 'r') as f:
seq_name, d = '', ''
for line in f:
if line.startswith('>'):
print(seq_name, d)
seq_name = line.rstrip('\n')
d = Counter()
else:
for i in line.rstrip('\n'):
d[i] += 1
我把你给出的例子保存为 temp.fasta
并且在 运行 上面显示的代码之后,输出看起来如下:
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120 Counter({'C': 41, 'A': 31, 'T': 28, 'G': 20})
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79 Counter({'T': 24, 'G': 24, 'C': 17, 'A': 14})
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80 Counter({'C': 30, 'T': 17, 'A': 17, 'G': 16})
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86 Counter({'C': 31, 'A': 27, 'T': 16, 'G': 12})
格式由您决定。
我觉得@Kayvee 关于 collections 库中 Counter
的建议是正确的,但该建议附带的示例代码在很多方面都有不足,即不会 运行 也不会很好地使用 Counter
。以下是我将如何解决这个问题:
from collections import Counter
def display(index, counter):
print("Sequence {}:".format(index))
for nucleotide in 'ACGT':
print("Number of {}'s: {}".format(nucleotide, counter[nucleotide]))
print()
with open('assembledSeqs.fa') as file:
index = 0
counter = Counter()
for line in file:
string = line.rstrip('\n')
if string.startswith('>'):
if index > 0:
display(index, counter)
index += 1
counter.clear()
else:
counter.update(string)
if counter: # pop out the last one in the now closed file
display(index, counter)
此输出应与您从 BioPython 获得的输出相匹配:
> python3 test.py
Sequence 1:
Number of A's: 14
Number of C's: 17
Number of G's: 24
Number of T's: 24
Sequence 2:
Number of A's: 17
Number of C's: 30
Number of G's: 16
Number of T's: 17
Sequence 3:
Number of A's: 27
Number of C's: 31
Number of G's: 12
Number of T's: 16
Sequence 4:
Number of A's: 31
Number of C's: 41
Number of G's: 20
Number of T's: 28
>
我需要获得与以下代码相同的输出,但不使用 BioPython。我卡住了...有人可以帮助我吗?谢谢!!!
from Bio import SeqIO
records = SeqIO.parse("data/assembledSeqs.fa", "fasta")
for i, seq_record in enumerate(records):
print("Sequence %d:" % i)
print("Number of A's: %d" % seq_record.seq.count("A"))
print("Number of C's: %d" % seq_record.seq.count("C"))
print("Number of G's: %d" % seq_record.seq.count("G"))
print("Number of T's: %d" % seq_record.seq.count("T"))
print()
FASTA 文件如下所示:
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTG
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGC
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86
CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
CAAACCACTTTCACCGCCACACGACC
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120
TGAACCTACGACTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCG
我试过了,但根本行不通
def count_bases (fasta_file_name):
with open(fasta_file_name) as file_content:
for seqs in file_content:
if seqs.startswith('>'):
for i, seq in enumerate('>'):
print("Sequence %d:" % i)
else:
print("Number of A's: %d" % seqs.count("A"))
print("Number of C's: %d" % seqs.count("C"))
print("Number of G's: %d" % seqs.count("G"))
print("Number of T's: %d" % seqs.count("T"))
print()
return bases
result = count_bases('data/assembledSeqs.fa')
这行得通。首先我们要将你的fasta文件解析为headers和序列,然后将序列列表列表展平为字符串列表,然后计算字符串中每个核苷酸的数量,然后打印:
import sys
fasta = sys.argv[1]
def fastaParser(infile):
seqs = []
headers = []
with open(infile) as f:
sequence = ""
header = None
for line in f:
if line.startswith('>'):
headers.append(line[1:-1])
if header:
seqs.append([sequence])
sequence = ""
header = line[1:]
else:
sequence += line.rstrip()
seqs.append([sequence])
return headers, seqs
headers, seqs = fastaParser(fasta)
flat_seqs = [item for sublist in seqs for item in sublist]
def countNucs(instring):
# will count upper and lower case sequences, if do not want lower case remove .upper()
g = instring.upper().count('G')
c = instring.upper().count('C')
a = instring.upper().count('A')
t = instring.upper().count('T')
return 'G = {}, C = {}, A = {}, T = {}'.format(g, c, a, t)
for header, seq in zip(headers, flat_seqs):
print(header, countNucs(seq))
在 command-line:
上使用你的示例 fasta 到 运行[me]$ python3.5 fasta.py infasta.fa
输出:
chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79 G = 24, C = 17, A = 14, T = 24
chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80 G = 16, C = 30, A = 17, T = 17
chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86 G = 12, C = 31, A = 27, T = 16
chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120 G = 20, C = 41, A = 31, T = 28
我会使用 Counter 字典,如下所示:
from collections import Counter
with open('temp.fasta', 'r') as f:
seq_name, d = '', ''
for line in f:
if line.startswith('>'):
print(seq_name, d)
seq_name = line.rstrip('\n')
d = Counter()
else:
for i in line.rstrip('\n'):
d[i] += 1
我把你给出的例子保存为 temp.fasta
并且在 运行 上面显示的代码之后,输出看起来如下:
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120 Counter({'C': 41, 'A': 31, 'T': 28, 'G': 20})
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79 Counter({'T': 24, 'G': 24, 'C': 17, 'A': 14})
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80 Counter({'C': 30, 'T': 17, 'A': 17, 'G': 16})
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86 Counter({'C': 31, 'A': 27, 'T': 16, 'G': 12})
格式由您决定。
我觉得@Kayvee 关于 collections 库中 Counter
的建议是正确的,但该建议附带的示例代码在很多方面都有不足,即不会 运行 也不会很好地使用 Counter
。以下是我将如何解决这个问题:
from collections import Counter
def display(index, counter):
print("Sequence {}:".format(index))
for nucleotide in 'ACGT':
print("Number of {}'s: {}".format(nucleotide, counter[nucleotide]))
print()
with open('assembledSeqs.fa') as file:
index = 0
counter = Counter()
for line in file:
string = line.rstrip('\n')
if string.startswith('>'):
if index > 0:
display(index, counter)
index += 1
counter.clear()
else:
counter.update(string)
if counter: # pop out the last one in the now closed file
display(index, counter)
此输出应与您从 BioPython 获得的输出相匹配:
> python3 test.py
Sequence 1:
Number of A's: 14
Number of C's: 17
Number of G's: 24
Number of T's: 24
Sequence 2:
Number of A's: 17
Number of C's: 30
Number of G's: 16
Number of T's: 17
Sequence 3:
Number of A's: 27
Number of C's: 31
Number of G's: 12
Number of T's: 16
Sequence 4:
Number of A's: 31
Number of C's: 41
Number of G's: 20
Number of T's: 28
>