使用字符列表在输出中重复写入 - python
Repetitive writing in output using list of characters - python
在一个文件中,我有一些字符要被替换。
字母 = ["B", "Z", "J", "U", "O"]
for record in SeqIO.parse(inFile, "fasta"):
for letter in letters:
if letters in str(record.seq):
print record.id
record.seq = str(record.seq).replace(letter, "X")
outFile.write(">%s\n%s\n" % (record.description, record.seq))
else:
outFile.write(">%s\n%s\n" % (record.description, record.seq))
#pass
问题是输出看起来像这样,用字母写出尽可能多的字符输出:
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
你打错了。
if letters in str(record.seq):
而不是
if letter in str(record.seq)
因此,您的检查总是失败,并为每个字母打印 else
部分。
我认为您想要做的是用 'X'
.
替换模棱两可的 IUPAC amino acid codes(加上您不知何故获得的一些额外的字母?)
最好使用 str.translate()
(在 Python 3 中)一次完成所有替换。此外,由于您正在使用 Biopython 读取文件,因此您也可以使用 Biopython 轻松编写输出文件。
from Bio import SeqIO
from Bio.Seq import Seq
letters = ["B", "Z", "J", "U", "O"]
trans_tab = str.maketrans(''.join(letters), 'X'*len(letters))
def yield_seqs(in_file):
for record in SeqIO.parse(in_file, 'fasta'):
record.seq = Seq(str(record.seq).translate(trans_tab))
yield record
SeqIO.write(yield_seqs('input.fasta'), 'output.fasta', 'fasta')
示例:
$ cat input.fasta
>1
MBZJ
$ python3 myscript.py
$ cat output.fasta
>1
MXXX
在一个文件中,我有一些字符要被替换。
字母 = ["B", "Z", "J", "U", "O"]
for record in SeqIO.parse(inFile, "fasta"):
for letter in letters:
if letters in str(record.seq):
print record.id
record.seq = str(record.seq).replace(letter, "X")
outFile.write(">%s\n%s\n" % (record.description, record.seq))
else:
outFile.write(">%s\n%s\n" % (record.description, record.seq))
#pass
问题是输出看起来像这样,用字母写出尽可能多的字符输出:
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
你打错了。
if letters in str(record.seq):
而不是
if letter in str(record.seq)
因此,您的检查总是失败,并为每个字母打印 else
部分。
我认为您想要做的是用 'X'
.
最好使用 str.translate()
(在 Python 3 中)一次完成所有替换。此外,由于您正在使用 Biopython 读取文件,因此您也可以使用 Biopython 轻松编写输出文件。
from Bio import SeqIO
from Bio.Seq import Seq
letters = ["B", "Z", "J", "U", "O"]
trans_tab = str.maketrans(''.join(letters), 'X'*len(letters))
def yield_seqs(in_file):
for record in SeqIO.parse(in_file, 'fasta'):
record.seq = Seq(str(record.seq).translate(trans_tab))
yield record
SeqIO.write(yield_seqs('input.fasta'), 'output.fasta', 'fasta')
示例:
$ cat input.fasta
>1
MBZJ
$ python3 myscript.py
$ cat output.fasta
>1
MXXX