如何使用 Python 随机提取 FASTA 序列？

Question

我有以下序列，它们是带有序列头及其核苷酸的 fasta 格式。我怎样才能随机提取序列。例如，我想从总序列中随机 select 2 个序列。有工具提供这样做是根据百分比而不是序列数来提取。谁能帮帮我？

A.fasta

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT

预期输出

>chr1:1154147-1154167
GAGATCGTCCGGGACCTGGGT
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG

Answer 1

如果您使用的是 fasta 文件，请使用 BioPython, to get n sequences use random.sample:

from Bio import SeqIO
from random import sample
with open("foo.fasta") as f:
    seqs = SeqIO.parse(f,"fasta")
    print(sample(list(seqs), 2))

输出：

[SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]), SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]

如有必要，您可以提取字符串：

 print([(seq.name,str(seq.seq)) for seq in  sample(list(seqs),2)])
 [('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]

如果行总是成对出现并且您跳过了顶部的元数据，您可以压缩：

from random import sample

with open("foo.fasta") as f:
    print(sample(list(zip(f, f)), 2))

这将为您提供元组中的成对行：

[('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]

要准备好要写入的行：

from Bio import SeqIO
from random import sample
with open("foo.fasta") as f:
    seqs = SeqIO.parse(f, "fasta")
    samps = ((seq.name, seq.seq) for seq in  sample(list(seqs),2))
    for samp in samps:
        print(">{}\n{}".format(*samp))

输出：

>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG

Answer 2

对Fasta了解不多，但是Python有Fasta模块（需要先安装）

>>> from pyfasta import Fasta

>>> f = Fasta('tests/test1.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

然后您可以使用 Python 的随机模块中的样本功能，并随机选择任意数量...

from random import sample
sample(f, how_many_you_want)

Answer 3

鉴于您显示的文件格式，并且假设文件不是太大，您不需要任何外部模块（例如 biopython）来执行此操作：

import random

with open('A.fasta') as f:
    data = f.read().splitlines()
    for i in random.sample(range(0, len(data), 2), 2):
        print data[i]
        print data[i+1]

示例输出：

>chr1:984333-984353
CTGGAATTCCGGGCGCTGGAG
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG

这只是选择 2 个随机序列 headers（来自 A.fasta 的那些行在 data 中具有偶数索引）和它后面的行。

如果您的文件很大，那么外部模块可能会进行优化以处理更大的数据集。

Answer 4

取决于您是否安装了 unix sort 或 shuf。如果是这样，那很容易 Select random 3000 lines from a file with awk codes

创建 headers

grep '>' A.fasta >FILE

然后 select 来自该文件的随机 2 行

CONTIGS=sort -R FILE | head -n2|tr "\n" " "

或

CONTIGS=shuf -n2 FILE|tr "\n" " "

然后，使用samtools提取

samtools faidx A.fasta $CONTIGS

Answer 5

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

randseq = int(sys.argv[1])                          #Number of random sequences desired
print "Number of random sequences desired = ", randseq

if randseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseq = []
outlist = []
print "Randomly chosen output sequences:"

for i in range(randseq):
  choose = random.randint(1,totseq-1)               #Choose a random sequence record number
  for j in range(len(outrandseq)):                  #Test to see if the random sequence record number has already been chosen
    if choose == outrandseq[j]:
      choose = random.randint(1,totseq-1)           #Choose a new random sequence record number if the current one has already been chosen
  outrandseq.append(choose)
  print choose
  outseq = seq[choose]
  outlist.append(outseq)                            #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

Answer 6

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine. 
# For very large sequence sets it may be too slow -- I just have not tried.

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

numrandseq = int(sys.argv[1])                       #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq

if numrandseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseqset = []
i = 1
for i in range(numrandseq):                         #Create a list of random sequence record numbers for output
  choice = random.randint(1,totseq)
  outrandseqset.append(choice)

i = 1
j = 1
duplicate = 1
while duplicate:                                    #Make sure no sequences are duplicated in the list
    duplicate = 0
    for i in range(numrandseq):
      for j in range(i+1, numrandseq):
        if outrandseqset[i] == outrandseqset[j]:
            outrandseqset[j] = random.randint(1,totseq)
            duplicate = 1


i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
  print outrandseqset[i]

outlist = []
i = 1
for i in range(numrandseq):                         #Create the list of seq records to be written to the output file
  seqnum = outrandseqset[i]
  outseq = seq[seqnum]
  outlist.append(outseq)

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

如何使用 Python 随机提取 FASTA 序列？

How to randomly extract FASTA sequences using Python?

python

extract

bioinformatics

extraction

fasta