Search k-mers of one file in k-mers of another file and count occurrences in Python
I have this function, which generates all possible k-mers over the four bases in Python:
import itertools

def generate_kmers(k):
    bases = ['A', 'C', 'T', 'G']  # in task (a) we only need to write a function that generates k-mers of the four bases
    # itertools.product returns the Cartesian product of the input iterables; here it yields every
    # length-k tuple of bases, and ''.join turns each tuple into one k-mer string
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer
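What I want now is to go through a list of sequences from a FASTQ file (e.g. ['GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'CCTGTAGAGTCATAAAGACCTCTTGGGTCCATCCTAGAAATTTTTCAGCTGAGAATAACGGGTCTGTTTCAGTTATTGCTTCTACTATNNNNNNNNNNNNNNNNNNNNNNNNNNN']), count how often each k-mer returned by generate_kmers occurs in my list of sequences, and save the result in a dictionary (e.g. {AAAA: 2, AAAC: 1...}).
First I tried to modify generate_kmers so that it returns all k-mers of the sequence file, and then to iterate over kmerSequences and kmerBases, but that did not work.
Does anyone know how I could do this?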
You can try using str.count:
import itertools

def generate_kmers(k):
    bases = ['A', 'C', 'T', 'G']  # in task (a) we only need to write a function that generates k-mers of the four bases
    # itertools.product returns the Cartesian product of the input iterables; here it yields every
    # length-k tuple of bases, and ''.join turns each tuple into one k-mer string
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

seqs = ['GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'CCTGTAGAGTCATAAAGACCTCTTGGGTCCATCCTAGAAATTTTTCAGCTGAGAATAACGGGTCTGTTTCAGTTATTGCTTCTACTATNNNNNNNNNNNNNNNNNNNNNNNNNNN']
k = 4
mers4 = generate_kmers(k)
# one dictionary per sequence, mapping every possible k-mer to its count in that sequence
dcts = [{kmer: seq.count(kmer) for kmer in mers4} for seq in seqs]
print(dcts)
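One caveat: str.count only counts non-overlapping occurrences, so 'AAAAA'.count('AAAA') is 1 even though the sequence contains two overlapping 4-mers. If you need overlapping windows, and a single dictionary over the whole list rather than one per sequence, a sliding window is one way to get there. This is only a rough sketch: count_kmers is a name I made up, and skipping windows that contain N is my assumption about what you want.

```python
import itertools


def generate_kmers(k):
    """Return all 4**k k-mers over the bases A, C, T, G."""
    return [''.join(p) for p in itertools.product('ACTG', repeat=k)]


def count_kmers(seqs, k):
    """Count every overlapping k-mer window across all sequences (hypothetical helper)."""
    # Start every possible k-mer at 0 so that absent k-mers still appear in the result.
    counts = dict.fromkeys(generate_kmers(k), 0)
    for seq in seqs:
        # Slide a window of width k over the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            # Assumption: windows containing N (or any other non-ACGT symbol) are skipped.
            if window in counts:
                counts[window] += 1
    return counts


counts = count_kmers(['AAAAA', 'ACGTN'], 4)
print(counts['AAAA'])         # 2: the windows at positions 0 and 1 overlap
print(counts['ACGT'])         # 1
print('AAAAA'.count('AAAA'))  # 1: str.count skips the overlapping match
```

This walks each sequence once instead of scanning it once per k-mer, which should matter as k grows, since there are 4**k candidate k-mers.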
Edit:
import itertools
import re

def generate_kmers(k):
    bases = ['A', 'C', 'T', 'G']  # in task (a) we only need to write a function that generates k-mers of the four bases
    # itertools.product returns the Cartesian product of the input iterables; here it yields every
    # length-k tuple of bases, and ''.join turns each tuple into one k-mer string
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

k = 4
mers4 = generate_kmers(k)
# given sequence
s = 'GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

# function that returns the dictionary with occurrences
def dct_count(seq):
    # search the argument seq, not the global s
    return {mer: len(re.findall(mer, seq)) for mer in mers4}

dc = dct_count(s)
print(dc)
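Like str.count, re.findall with a plain pattern returns non-overlapping matches only. If overlapping k-mers should be counted too, one option is to wrap each k-mer in a zero-width lookahead. The following is a sketch of that idea, not a tuned solution; dct_count_overlapping is a hypothetical name.

```python
import itertools
import re


def generate_kmers(k):
    """Return all 4**k k-mers over the bases A, C, T, G."""
    return [''.join(p) for p in itertools.product('ACTG', repeat=k)]


def dct_count_overlapping(seq, k):
    """Hypothetical variant of dct_count that also counts overlapping matches."""
    # (?=...) is a zero-width lookahead: it matches at a position without
    # consuming characters, so the search resumes one base further on.
    return {mer: len(re.findall('(?=' + mer + ')', seq)) for mer in generate_kmers(k)}


dc = dct_count_overlapping('AAAAAC', 4)
print(dc['AAAA'])                         # 2
print(dc['AAAC'])                         # 1
print(len(re.findall('AAAA', 'AAAAAC')))  # 1 without the lookahead
```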