Search k-mers of one file in the k-mers of another file and count occurrences in Python

I have this function, which generates all possible k-mers over the four bases in Python:

import itertools

def generate_kmers(k):
    # Task (a) only asks for a function that generates the k-mers of the four bases.
    bases = ['A', 'C', 'T', 'G']
    # itertools.product returns the Cartesian product of its input iterables; with repeat=k
    # it yields every length-k tuple of bases, and each tuple is joined into a string.
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

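What I want now is to go through the list of sequences from a fastq file (e.g. ['GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'CCTGTAGAGTCATAAAGACCTCTTGGGTCCATCCTAGAAATTTTTCAGCTGAGAATAACGGGTCTGTTTCAGTTATTGCTTCTACTATNNNNNNNNNNNNNNNNNNNNNNNNNNN']), count how often each k-mer returned by generate_kmers occurs in my list of sequences, and save the counts in a dictionary (e.g. {AAAA: 2, AAAC: 1...}). First I tried to modify generate_kmers so that it returns all k-mers of the sequence file, and then to iterate over kmerSequences and kmerBases, but that did not work.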

Does anyone know how I could do this?

You could try using str.count:

import itertools

def generate_kmers(k):
    # Task (a) only asks for a function that generates the k-mers of the four bases.
    bases = ['A', 'C', 'T', 'G']
    # itertools.product returns the Cartesian product of its input iterables; with repeat=k
    # it yields every length-k tuple of bases, and each tuple is joined into a string.
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

seqs = ['GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'CCTGTAGAGTCATAAAGACCTCTTGGGTCCATCCTAGAAATTTTTCAGCTGAGAATAACGGGTCTGTTTCAGTTATTGCTTCTACTATNNNNNNNNNNNNNNNNNNNNNNNNNNN']
k = 4
mers4 = generate_kmers(k)
# one dictionary per sequence, mapping each 4-mer to its count in that sequence
dcts = [{kmer: seq.count(kmer) for kmer in mers4} for seq in seqs]
print(dcts)
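
Two things are worth keeping in mind with this approach: str.count only counts non-overlapping matches, and the list comprehension builds one dictionary per sequence rather than a single one. If a single dictionary over all sequences with overlapping occurrences is what is needed, a sliding window is one possible alternative. This is only a minimal sketch: the name count_kmers_sliding is hypothetical, and skipping windows that contain N (or any other non-ACGT character) is an assumption about how such reads should be handled.

```python
import itertools

def generate_kmers(k):
    bases = ['A', 'C', 'T', 'G']
    return [''.join(p) for p in itertools.product(bases, repeat=k)]

def count_kmers_sliding(seqs, k):
    """Count every k-mer window (overlaps included) across all sequences.

    Hypothetical helper, not part of the original answer.
    """
    # Start every possible k-mer at 0 so absent k-mers still appear in the result.
    counts = dict.fromkeys(generate_kmers(k), 0)
    for seq in seqs:
        # Slide a window of width k over the sequence, one position at a time.
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            # Assumption: windows containing N or other symbols are ignored.
            if window in counts:
                counts[window] += 1
    return counts

# 'AAAAA' contains 'AAAA' at positions 0 and 1.
print(count_kmers_sliding(['AAAAA'], 4)['AAAA'])  # 2
print('AAAAA'.count('AAAA'))                      # 1 (non-overlapping)
```

The sliding window also scans each sequence once instead of once per k-mer, which may matter for larger k or many reads.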

Edit:

import itertools
import re

def generate_kmers(k):
    # Task (a) only asks for a function that generates the k-mers of the four bases.
    bases = ['A', 'C', 'T', 'G']
    # itertools.product returns the Cartesian product of its input iterables; with repeat=k
    # it yields every length-k tuple of bases, and each tuple is joined into a string.
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

k = 4
mers4 = generate_kmers(k)

# given sequence
s = 'GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

# function that returns a dictionary with the number of occurrences of each k-mer
def dct_count(seq):
    # use the seq argument (not the global s) so the function works for any sequence
    return {mer: len(re.findall(mer, seq)) for mer in mers4}

dc = dct_count(s)
print(dc)
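
Like str.count, re.findall with a plain pattern only reports non-overlapping matches. If overlapping occurrences should be counted with the regex approach, one option is to wrap each k-mer in a zero-width lookahead. A minimal sketch, assuming that overlapping counts are what is wanted; the name dct_count_overlapping and the short test sequence are made up for illustration:

```python
import itertools
import re

def generate_kmers(k):
    bases = ['A', 'C', 'T', 'G']
    return [''.join(p) for p in itertools.product(bases, repeat=k)]

def dct_count_overlapping(seq, kmers):
    """Hypothetical variant of dct_count that also counts overlapping matches."""
    # (?=...) is a lookahead: it matches at every position where the k-mer starts
    # without consuming any characters, so consecutive matches may overlap.
    return {mer: len(re.findall('(?=' + re.escape(mer) + ')', seq)) for mer in kmers}

mers4 = generate_kmers(4)
dc = dct_count_overlapping('GAAAAAT', mers4)
print(dc['AAAA'])                          # 2 (starting at positions 1 and 2)
print(len(re.findall('AAAA', 'GAAAAAT')))  # 1 (non-overlapping)
```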