Switch fasta seq depending on a dataframe
I have 2 fasta files named:
result1_aa.fasta
result2_aa.fasta
In these files, I have sequences like this:
File result1_aa.fasta:
>gene1_B
ATTGGACCA
>gene2_A
ATTAGGAC
>gene90_B
ATTAGCCACA
>gene65_B
ATTGAG
File result2_aa.fasta:
>gene78_A
ATTGGACCA
>gene45_B
ATTAGGAC
>gene93_B
ATTAGCCACA
>gene54_A
ATTGACA
I have a dataframe like this:
geneA geneB
gene78_A gene1_B
gene2_A gene45_B
gene90_A gene93_B
gene54_A gene65_B
They are actually in order (see the _number).
What I want is to get 2 new fasta files whose records follow the same order as the dataframe above, that is:
File result1_aa_new.fasta:
>gene78_A
ATTGGACCA
>gene2_A
ATTAGGAC
>gene90_A
ATTAGCCACA
>gene54_A
ATTGACA
File result2_new_aa.fasta:
>gene1_B
ATTGGACCA
>gene45_B
ATTAGGAC
>gene93_B
ATTAGCCACA
>gene65_B
ATTGAG
I tried some solutions, but I could not manage to keep the dataframe's order in my fasta files...
With Ami's solution:
from Bio import SeqIO
import sys
from Bio.SeqRecord import SeqRecord
import pandas as pd
# Output handles opened by the script; nothing below writes to them.
seq_0042_aa = open("seq_0042_aa.fasta", "w")
seq_0042_dna = open("seq_0042_dna.fasta", "w")
seq_0035_aa = open("seq_0035_aa.fasta", "w")
seq_0035_dna = open("seq_0035_dna.fasta", "w")
dN_dS_sorted = pd.read_table("dn_ds.out_sorted", sep='\t')
seq1_id = dN_dS_sorted["seq1_id"]  # first column
seq2_id = dN_dS_sorted["seq2_id"]  # second column
# First file: merge on seq1_id, keep matched records, write them out.
results1 = list(SeqIO.parse("result1_aa.fasta", "fasta"))
results1 = pd.DataFrame({'f_id': [r.id for r in results1], 'f_seq': results1})
results1 = pd.merge(dN_dS_sorted, results1, left_on="seq1_id", right_on='f_id', how='left').dropna()
results1 = list(results1.f_seq.values)
with open("out.fasta", "w") as output_handle:
    SeqIO.write(results1, output_handle, "fasta")
# Second file: same steps, merging on seq2_id.
results2 = list(SeqIO.parse("result2_aa.fasta", "fasta"))
results2 = pd.DataFrame({'f_id': [r.id for r in results2], 'f_seq': results2})
results2 = pd.merge(dN_dS_sorted, results2, left_on="seq2_id", right_on='f_id', how='left').dropna()
results2 = list(results2.f_seq.values)
with open("out2.fasta", "w") as output_handle:
    SeqIO.write(results2, output_handle, "fasta")
Here is the head of my dataframe:
seq1_id seq2_id dN dS
g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104
g45594.t1_0035_0035 g1464.t1_0042_0042 0.5208761055250978 5.430485421797574
g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867
g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516 26.834927363887587
g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165 0.1181241496387666
Then, in the output I should get:
Output 1:
>g66097.t1_0035_0035
ATTGGAGATA
>g45594.t1_0035_0035
TAGGAGGAGA
>g34020.t1_0035_0035
ATGGGAT
>g28436.t1_0035_0042
ATTGGAGA
and output 2:
>g13600.t1_0042_0042
ATGGGAGAGA
>g1464.t1_0042_0042
ATGGAGGAGA
>g12096.t1_0042_0042
ATGGAGGAA
>g35222.t1_0042_0035
ATGGAGAG
But what I actually get is:
Output 1:
>g28436.t1_0035_0042
ATGAGAGAGA
>g1005.t1_0035_0035
ATAGGAGATA
>g28456.t1_0035_0035
ATGGAGATA
>g30148.t1_0035_0042
ATGGAGA
and output 2:
>g35222.t1_0042_0035
ATAGGAGA
>g11524.t1_0042_0042
ATAGGAGA
>g31669.t1_0042_0035
ATGAGAGA
>g37790.t1_0042_0035
ATGAGGAGA
Here is the head of fastafile1:
>g13600.t1_0042_0042
AGATAGAGA
>g1464.t1_0042_0042
AGATTAGA
>g34744.t1_0042_0035
ATAGAGGA
>g12096.t1_0042_0042
AGATATGA
And here is the head of fastafile2:
>g66097.t1_0035_0035
AGATTAGAGA
>g45594.t1_0035_0035
AGTATAGAGA
>g50055.t1_0035_0035
ATAGGAGAGA
>g34020.t1_0035_0035
ATAGGAGAG
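Let's do the first file, again using BioPython:
import pandas as pd
from Bio import SeqIO
# Parse the records, then pair each record with its ID in a dataframe.
results1 = list(SeqIO.parse("result1_aa.fasta", "fasta"))
results1 = pd.DataFrame({'f_id': [r.id for r in results1], 'f_seq': results1})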
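Now merge them:
# right_on must name the ID column created above, which is 'f_id'.
results1 = pd.merge(df, results1, left_on='results_on', right_on='f_id', how='left').dropna()
(This assumes the column name is results_on; you did not specify it.)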
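Since the column name above is a guess, it may help to check which dataframe column actually matches which fasta file before merging. The following is only a minimal sketch using plain Python; `fasta_ids` and `match_counts` are hypothetical helper names, not part of BioPython or pandas, and the ID is assumed to be the first word after `>`.

```python
def fasta_ids(path):
    """Collect record IDs (first word after '>') from a FASTA file."""
    with open(path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }


def match_counts(columns, fasta_paths):
    """For every (column name, file) pair, count how many of the
    column's IDs occur as record IDs in that file.

    columns: mapping of column name -> iterable of IDs
             (a pandas Series such as df["seq1_id"] also works).
    """
    counts = {}
    for path in fasta_paths:
        ids = fasta_ids(path)
        for name, values in columns.items():
            counts[(name, path)] = sum(1 for value in values if value in ids)
    return counts


# Hypothetical usage with the names from the question:
# match_counts(
#     {"seq1_id": dN_dS_sorted["seq1_id"], "seq2_id": dN_dS_sorted["seq2_id"]},
#     ["result1_aa.fasta", "result2_aa.fasta"],
# )
```

A column that scores close to zero against a file is probably being merged with the wrong file, in which case the left merge followed by dropna() would discard nearly every row.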
Now get the records, which are in the dataframe's order:
results1 = list(results1.f_seq.values)
Write them out:
with open("out.fasta", "w") as output_handle:
    SeqIO.write(results1, output_handle, "fasta")
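As a cross-check on the merge-based steps, the same reordering can be sketched with a plain dictionary lookup. This is an alternative in pure Python, not the BioPython/pandas method above: `read_fasta` and `write_in_order` are hypothetical names, the record ID is assumed to be the first word of the header, and each sequence is written on a single line rather than wrapped.

```python
def _record_id(header):
    """Return the first word after '>' (empty string for a bare '>')."""
    body = header[1:].strip()
    return body.split()[0] if body else ""


def read_fasta(path):
    """Return {record_id: (header_line, sequence)} for a FASTA file."""
    records = {}
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records[_record_id(header)] = (header, "".join(chunks))
                header, chunks = line, []
            elif line.strip():
                chunks.append(line.strip())
    if header is not None:
        records[_record_id(header)] = (header, "".join(chunks))
    return records


def write_in_order(ids, fasta_in, fasta_out):
    """Write records from fasta_in to fasta_out in the order given by ids.

    IDs without a record in fasta_in are skipped and returned as a list,
    so that mismatches are visible instead of silently dropped.
    """
    records = read_fasta(fasta_in)
    missing = []
    with open(fasta_out, "w") as out:
        for record_id in ids:
            if record_id in records:
                header, seq = records[record_id]
                out.write(f"{header}\n{seq}\n")
            else:
                missing.append(record_id)
    return missing


# Hypothetical usage with the names from the question:
# missing = write_in_order(dN_dS_sorted["seq1_id"], "result1_aa.fasta", "out.fasta")
```

Because the output loop iterates over the dataframe column itself, the written order follows the dataframe by construction, and the returned list shows which IDs had no matching record.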