按 python 计算 fasta 中的 20 聚体数
count 20-mers number in a fasta by python
读取长度为 120 nt 的常规 fasta 文件:'single_mapped.fa'
一个 CSV 文件包含 10000 个 20 聚体和每个 20 聚体的计数:'20frequent_20mers.txt',如下所示:
AAAAAGTATAGGAGATAGAA 35
AAAAATAGGAGGACTATTCA 26
AAAAATAGGAGGACTATTTA 24
AAAAATAGGAGGCCTATTCA 62
我想遍历single_mapped.fa,计算每个reads在20frequent_20mers.txt中所有20-mers的累计计数,即read:
AAAAAAGTATAGGAGATAGAA AAAAATAGGAGGACTATTCA,我想要 61 (35+26)
我的代码:
file2 = open('20frequent_20mers.txt','r')
kmer_list = csv.reader(file2, delimiter='\t')
for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
print(seq_record.id)
score_fre = 0
sequence_string = str(seq_record.seq)
for i in range(0,101):
seq = sequence_string[i:i+20]
for row in kmer_list:
if row[0] == seq:
score_fre = score_fre + int(row[1])
print(score_fre)
当我分别 运行 时,每个循环都运行良好,但没有像上面那样运行,谁能告诉我错误来自哪里?或者是否有更聪明、更有效的方法来做到这一点?提前致谢!
使用现有代码,您需要从头开始重新读取每个序列和 i
值的 kmer 文件。这将非常缓慢,应该避免。由于您没有将文件指针移回开头,因此它只会工作一次。
可以通过在 for row in kmer_list:
行之前添加来移动文件指针:
file2.seek(0)
更好的方法是首先将所有 kmer 条目连同相应的计数加载到字典中。这样可以快速查找它们:
import csv
kmers = {}
with open('20frequent_20mers.txt') as f_kmers:
for kmer, count in csv.reader(f_kmers, delimiter='\t'):
kmers[kmer] = int(count)
for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
print(seq_record.id)
score_fre = 0
sequence_string = str(seq_record.seq)
for i in range(0, 101):
seq = sequence_string[i:i+20]
score_fre += kmers.get(seq, 0)
print(score_fre)
如果在字典中找不到seq
,则返回默认值0
。
使用@MartinEvans 字典的替代实现(不一定更好或更快)但使用 re.findall()
生成 kmers 进行测试并使用 map
和 sum
而不是(显式)内部循环:
from Bio import SeqIO
from re import findall
from itertools import repeat
kmers = {}
with open('20frequent_20mers.txt') as f_kmers:
for line in f_kmers:
kmer, count = line.strip().split('\t')
kmers[kmer] = int(count)
for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
print(seq_record.id)
# use forward lookahead to make findall() find overlapping results;
score_fre = sum(map(kmers.get, findall(r'(?=([ACTG]{20}))', str(seq_record.seq)), repeat(0)))
print(score_fre)
读取长度为 120 nt 的常规 fasta 文件:'single_mapped.fa'
一个 CSV 文件包含 10000 个 20 聚体和每个 20 聚体的计数:'20frequent_20mers.txt',如下所示:
AAAAAGTATAGGAGATAGAA 35
AAAAATAGGAGGACTATTCA 26
AAAAATAGGAGGACTATTTA 24
AAAAATAGGAGGCCTATTCA 62
我想遍历single_mapped.fa,计算每个reads在20frequent_20mers.txt中所有20-mers的累计计数,即read:
AAAAAAGTATAGGAGATAGAA AAAAATAGGAGGACTATTCA,我想要 61 (35+26)
我的代码:
file2 = open('20frequent_20mers.txt','r')
kmer_list = csv.reader(file2, delimiter='\t')
for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
print(seq_record.id)
score_fre = 0
sequence_string = str(seq_record.seq)
for i in range(0,101):
seq = sequence_string[i:i+20]
for row in kmer_list:
if row[0] == seq:
score_fre = score_fre + int(row[1])
print(score_fre)
当我分别 运行 时,每个循环都运行良好,但没有像上面那样运行,谁能告诉我错误来自哪里?或者是否有更聪明、更有效的方法来做到这一点?提前致谢!
使用现有代码,您需要从头开始重新读取每个序列和 i
值的 kmer 文件。这将非常缓慢,应该避免。由于您没有将文件指针移回开头,因此它只会工作一次。
可以通过在 for row in kmer_list:
行之前添加来移动文件指针:
file2.seek(0)
更好的方法是首先将所有 kmer 条目连同相应的计数加载到字典中。这样可以快速查找它们:
import csv
kmers = {}
with open('20frequent_20mers.txt') as f_kmers:
for kmer, count in csv.reader(f_kmers, delimiter='\t'):
kmers[kmer] = int(count)
for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
print(seq_record.id)
score_fre = 0
sequence_string = str(seq_record.seq)
for i in range(0, 101):
seq = sequence_string[i:i+20]
score_fre += kmers.get(seq, 0)
print(score_fre)
如果在字典中找不到seq
,则返回默认值0
。
使用@MartinEvans 字典的替代实现(不一定更好或更快)但使用 re.findall()
生成 kmers 进行测试并使用 map
和 sum
而不是(显式)内部循环:
from Bio import SeqIO
from re import findall
from itertools import repeat
kmers = {}
with open('20frequent_20mers.txt') as f_kmers:
for line in f_kmers:
kmer, count = line.strip().split('\t')
kmers[kmer] = int(count)
for seq_record in SeqIO.parse("single_mapped.fa", "fasta"):
print(seq_record.id)
# use forward lookahead to make findall() find overlapping results;
score_fre = sum(map(kmers.get, findall(r'(?=([ACTG]{20}))', str(seq_record.seq)), repeat(0)))
print(score_fre)