Change sequence names in a FASTA file with a dataframe
I have a problem; let me explain.
I have a FASTA file:
>seqA
AAAAATTTGG
>seqB
ATTGGGCCG
>seqC
ATTGGCC
>seqD
ATTGGACAG
and a dataframe:
seq name    New name seq
seqB        BOBO
seqC        JOHN
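I want to change the sequence IDs in the FASTA file: whenever an ID matches a seq name in my dataframe, it should be replaced by the corresponding new name. The result would be:
New FASTA file: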
>seqA
AAAAATTTGG
>BOBO
ATTGGGCCG
>JOHN
ATTGGCC
>seqD
ATTGGACAG
Thank you very much.
Edit:
I used this script:
blast = pd.read_table("matches_Busco_0035_0042.m8", header=None)
blast.columns = ["qseqid", "Busco_ID", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
repl = blast[blast.pident > 95]
print(repl)
# substitution dataframe
newfile = []
count = 0
for rec in SeqIO.parse("concatenate_0035_0042_aa2.fa", "fasta"):
    # get corresponding value for record ID from dataframe
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
    # change record, if not empty
    if x.any():
        rec.name = rec.description = rec.id = x.iloc[0]
        count += 1
    # append record to list
    newfile.append(rec)
# write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
# tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
I get the following error:
Traceback (most recent call last):
  File "Get_busco_blast.py", line 74, in <module>
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'seq'
If you have Biopython installed, you can use SeqIO to read and write FASTA files:
import numpy as np
import pandas as pd
from Bio import SeqIO

# substitution dataframe
repl = pd.DataFrame(np.asarray([["seqB_3652_i36", "Bob"], ["seqC_123_6XXX1", "Patrick"]]), columns=["seq", "newseq"])
newfile = []
count = 0
for rec in SeqIO.parse("test.faa", "fasta"):
    # get corresponding value for record ID from dataframe
    # repl["seq"] and repl["newseq"] are the pandas columns with the old and new sequence names, respectively
    x = repl.loc[repl["seq"] == rec.id, "newseq"]
    # change record, if a match was found
    if not x.empty:
        # append old identifier number to the new id name
        rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):]
        count += 1
    # append record to list
    newfile.append(rec)
# write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
# tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
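Note that this script does not check for multiple entries in the replacement table. It just takes the first matching element, or changes nothing if the record ID is not in the dataframe.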
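For the BLAST table in the edited question, the AttributeError comes from `repl.seq`: the column names assigned there do not include one called `seq`, and the query IDs are stored in `qseqid`. Below is a minimal sketch of one way to build the lookup, assuming the FASTA record IDs match `qseqid` and that keeping the highest-`bitscore` hit per query is an acceptable way to resolve multiple entries. The toy rows are made up for illustration and only the columns needed here are included.

```python
import pandas as pd

# Toy stand-in for the BLAST table (hypothetical values, subset of columns)
blast = pd.DataFrame(
    {
        "qseqid": ["seqB", "seqB", "seqC", "seqD"],
        "Busco_ID": ["BOBO", "BOBO_alt", "JOHN", "LOW"],
        "pident": [99.0, 97.0, 98.5, 80.0],
        "bitscore": [250.0, 120.0, 300.0, 90.0],
    }
)

# Keep only hits above the identity threshold, as in the question
repl = blast[blast["pident"] > 95]

# Sort best hits first, then keep one row per query ID
best = repl.sort_values("bitscore", ascending=False).drop_duplicates("qseqid")

# Plain dict: old ID -> new ID
mapping = dict(zip(best["qseqid"], best["Busco_ID"]))
print(mapping)
```

Inside the SeqIO loop, `mapping.get(rec.id)` could then replace the `.loc` lookup; it returns `None` for IDs that have no entry.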
This is easier to do with something like BioPython.
First create a dictionary:
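import pandas as pd
# keys are the old names, values are the new names
names = pd.Series(df['New seq name'].values, index=df['seq name']).to_dict()
Now iterate:
from Bio import SeqIO
outs = []
for record in SeqIO.parse("orig.fasta", "fasta"):
    new_id = names.get(record.id, record.id)
    if new_id != record.id:
        # clear the description too, so the old ID is not written after the new one
        record.id = record.name = new_id
        record.description = ""
    outs.append(record)
SeqIO.write(outs, "new.fasta", "fasta")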
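If Biopython is not available, the same dictionary lookup can be applied to the header lines directly. This is only a sketch using the standard library, not something taken from the answers above: `rename_fasta` is a hypothetical helper name, the ID is assumed to be the first whitespace-delimited token after `>`, and any description on a renamed header is dropped.

```python
import io


def rename_fasta(in_handle, out_handle, names):
    """Copy FASTA text, replacing header IDs that appear as keys in `names`."""
    changed = 0
    for line in in_handle:
        if line.startswith(">"):
            # The ID is taken to be the first token after ">"
            parts = line[1:].split(None, 1)
            old_id = parts[0] if parts else ""
            if old_id in names:
                line = ">" + names[old_id] + "\n"
                changed += 1
        # Sequence lines and unmatched headers are copied unchanged
        out_handle.write(line)
    return changed


# Example data from the question
fasta_text = (
    ">seqA\nAAAAATTTGG\n"
    ">seqB\nATTGGGCCG\n"
    ">seqC\nATTGGCC\n"
    ">seqD\nATTGGACAG\n"
)
names = {"seqB": "BOBO", "seqC": "JOHN"}

out = io.StringIO()
n_changed = rename_fasta(io.StringIO(fasta_text), out, names)
print(out.getvalue(), end="")
print("I changed {} entries!".format(n_changed))
```

With real files, the two `StringIO` objects would be replaced by handles from `open("orig.fasta")` and `open("new.fasta", "w")`.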