How to filter out sequences based on given data using Python?
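I want to filter out unwanted sequences based on a given file, A.fasta. The original file contains all the sequences. A FASTA file is essentially a file in which each record starts with a sequence ID, followed by the nucleotides, represented by A, T, C, and G. Can anyone help me?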
A.fasta
>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
Original.fasta
>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA
Expected output for C.fasta
>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA
Code
import sys
import warnings
from Bio import SeqIO
from Bio import BiopythonDeprecationWarning
warnings.simplefilter('ignore', BiopythonDeprecationWarning)

fasta_file = sys.argv[1]   # Input fasta file
remove_file = sys.argv[2]  # Input wanted file, one gene name per line
result_file = sys.argv[3]  # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nuc = str(seq.seq)  # was seq.seq.tostring(), which newer Biopython releases no longer provide
        if nuc not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")
The code above filters out every record with a duplicate sequence, but I want to keep such records if they belong in the output.
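To see why records that merely share a sequence disappear, here is a minimal sketch using the sample data from the question. It assumes A.fasta itself is passed as the removal file, so every non-empty line (headers and sequences alike) ends up in the set. The names `a_fasta`, `original`, `kept_by_sequence`, and `kept_by_id` are illustrative only and are not part of the original script.

```python
# Every non-empty line of A.fasta lands in the set, as in the question's loop.
a_fasta = """>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
"""
remove = {line.strip() for line in a_fasta.splitlines() if line.strip()}

# (ID, sequence) pairs taken from Original.fasta in the question.
original = [
    ("chr3:99679938-99679945", "TGACGTAA"),
    ("chr9:135822160-135822167", "TGACCTCA"),
    ("chr12:15747942-15747949", "TGACATCA"),
    ("chr2:130918058-130918065", "TGACCTCA"),
    ("chr2:38430457-38430464", "TGACCTCA"),
    ("chr1:112381724-112381731", "TGACATCA"),
]

# Matching on the sequence string drops every record whose sequence
# also occurs anywhere in A.fasta, regardless of its ID.
kept_by_sequence = [rec_id for rec_id, seq in original if seq not in remove]

# Matching on the ID (the header without '>') only drops the listed records.
remove_ids = {line[1:] for line in remove if line.startswith(">")}
kept_by_id = [rec_id for rec_id, seq in original if rec_id not in remove_ids]

print(kept_by_sequence)
print(kept_by_id)
```

With this data, matching on the sequence leaves only the chr3 record, while matching on the ID leaves the four records listed in the expected output.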
Take a look at BioPython. Here is a solution using it:
from Bio import SeqIO

input_file = 'a.fasta'
merge_file = 'original.fasta'
output_file = 'results.fasta'

# Collect the IDs of the records to exclude.
exclude = set()
with open(input_file) as handle:
    for fasta in SeqIO.parse(handle, 'fasta'):
        exclude.add(fasta.id)

# Write every record whose ID is not in the exclusion set.
with open(merge_file) as handle, open(output_file, 'w') as output_handle:
    for fasta in SeqIO.parse(handle, 'fasta'):
        if fasta.id not in exclude:
            SeqIO.write([fasta], output_handle, "fasta")
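If installing BioPython is not an option, the same ID-based filtering can be sketched with the standard library alone. This is only a rough illustration, not a replacement for a real FASTA parser. The helpers `read_fasta` and `filter_by_id` are hypothetical names introduced here, and the sketch assumes the whole header line after `>` is the ID, which holds for the sample data but differs from BioPython, where the ID is only the first whitespace-delimited token.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_by_id(original_lines, exclude_lines):
    """Return the records of original_lines whose header is not in exclude_lines."""
    exclude = {header for header, _ in read_fasta(exclude_lines)}
    return [(h, s) for h, s in read_fasta(original_lines) if h not in exclude]


# Sample data from the question, inlined so the sketch runs without files.
a_fasta = """>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
""".splitlines()

original_fasta = """>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA
""".splitlines()

result = filter_by_id(original_fasta, a_fasta)
for header, sequence in result:
    print(f">{header}\n{sequence}")
```

To work on real files, pass open file handles instead of the inlined lists, for example `filter_by_id(open('original.fasta'), open('a.fasta'))`, since both helpers only iterate over lines.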