根据biopython的序列从fasta文件中提取基因
gene extraction from a fasta file according to sequence with biopython
我有一个正在解析的 fasta 文件。该文件由多个序列组成,这些序列属于来自不同细菌菌株的同一基因。我想做的和脚本所做的是检查序列是否与参考序列相同或不同。有了这些信息,我想生成一个新文件,但我每个文件只有一个序列。
def checking_sequences():
gene_records=list(SeqIO.parse('/files/gene_A.fasta', 'fasta'))
ref_id=gene_records[-1].id
ref_seq=gene_records[-1].seq
#print gene_records[-1].description
output_handle=open('//files/' + 'corrected_gene_1', 'a')
print len(gene_records)
count=0
dif_count=0
reference_list=[]
for gene_record in gene_records:
#count+=1
if len(gene_record.seq) == len(ref_seq):
#print len(gene_records.seq)
#print len(ref_seq)
print 'Found all lengths are equal'
else:
print 'Found %s sequence with different lengths' % (gene_records.description)
###checking sequence equality
if gene_record.seq==ref_seq:
count+=1
gene_record.id=gene_record.id +'_0'
reference_list.append(gene_record)
ref_count=reference_list.count(gene_record.seq)
print 'There are %i sequences are equal to the reference sequence' %(count)
else:
dif_count+=1
reference_list.append(gene_record.seq)
seq_count=reference_list.count(gene_record.seq)
gene_record.id=gene_record.id +'_'+ str(dif_count)
print 'Found %i different that ref_seq' % (seq_count)
print 'xxxxxxxxxxx'
#print seq_count
#print len(reference_list)
SeqIO.write(gene_record, output_handle, 'fasta')
output_handle.close()
checking_sequences()
澄清一下:
original file desire output
>gene_1 strainIDx >gen1_strainIDx
seqA seqA
>gene_1 strainIDy >gene_1 strainIDy
seqB seqB
>gene_1 strainIDz
seqA
我不介意 ID,只是我希望每个 ID 都有一个序列。我曾尝试使用“break”,但没有得到我想要的输出。帮助将不胜感激
Comment: I have never heard of hash so I didn't know what it does or doesn't.
参考:
Built-in Functions hash(object)
Return the hash value as integers.
They are used to quickly compare dictionary keys during a dictionary lookup.
SO QA Built in python hash() function
Question: I want to produce a new file BUT only one sequence of each.
使用查找 Table,例如:
lookup = Set()
with open('/files/' + 'corrected_gene_1', "w") as handle:
for record in SeqIO.parse('/files/gene_A.fasta', "fasta"):
seq_hash = hash(str(record.seq))
if not seq_hash in lookup:
# Not in lookup, save
lookup.add(seq_hash)
SeqIO.write(records, handle, "fasta")
else:
# already saved - Skip
pass
我有一个正在解析的 fasta 文件。该文件由多个序列组成,这些序列属于来自不同细菌菌株的同一基因。我想做的和脚本所做的是检查序列是否与参考序列相同或不同。有了这些信息,我想生成一个新文件,但我每个文件只有一个序列。
def checking_sequences():
gene_records=list(SeqIO.parse('/files/gene_A.fasta', 'fasta'))
ref_id=gene_records[-1].id
ref_seq=gene_records[-1].seq
#print gene_records[-1].description
output_handle=open('//files/' + 'corrected_gene_1', 'a')
print len(gene_records)
count=0
dif_count=0
reference_list=[]
for gene_record in gene_records:
#count+=1
if len(gene_record.seq) == len(ref_seq):
#print len(gene_records.seq)
#print len(ref_seq)
print 'Found all lengths are equal'
else:
print 'Found %s sequence with different lengths' % (gene_records.description)
###checking sequence equality
if gene_record.seq==ref_seq:
count+=1
gene_record.id=gene_record.id +'_0'
reference_list.append(gene_record)
ref_count=reference_list.count(gene_record.seq)
print 'There are %i sequences are equal to the reference sequence' %(count)
else:
dif_count+=1
reference_list.append(gene_record.seq)
seq_count=reference_list.count(gene_record.seq)
gene_record.id=gene_record.id +'_'+ str(dif_count)
print 'Found %i different that ref_seq' % (seq_count)
print 'xxxxxxxxxxx'
#print seq_count
#print len(reference_list)
SeqIO.write(gene_record, output_handle, 'fasta')
output_handle.close()
checking_sequences() 澄清一下:
original file desire output
>gene_1 strainIDx >gen1_strainIDx
seqA seqA
>gene_1 strainIDy >gene_1 strainIDy
seqB seqB
>gene_1 strainIDz
seqA
我不介意 ID,只是我希望每个 ID 都有一个序列。我曾尝试使用“break”,但没有得到我想要的输出。帮助将不胜感激
Comment: I have never heard of hash so I didn't know what it does or doesn't.
参考:
Built-in Functions hash(object)
Return the hash value as integers.
They are used to quickly compare dictionary keys during a dictionary lookup.SO QA Built in python hash() function
Question: I want to produce a new file BUT only one sequence of each.
使用查找 Table,例如:
lookup = Set()
with open('/files/' + 'corrected_gene_1', "w") as handle:
for record in SeqIO.parse('/files/gene_A.fasta', "fasta"):
seq_hash = hash(str(record.seq))
if not seq_hash in lookup:
# Not in lookup, save
lookup.add(seq_hash)
SeqIO.write(records, handle, "fasta")
else:
# already saved - Skip
pass