如何在每次循环迭代时不覆盖文件的情况下写入文件

How to write files without overwriting them with each loop iteration

以下代码是分析FASTA序列(.faa文件)的氨基酸组成

from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis
import fastaparser
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

filename = input("Please enter the full path of the amino acid sequence file!: ")
pH_input = input("At which pH should the analysis be conducted? ")
flexibility_ynu = input("Do you wish a flexibility analysis?\n   (1) Yes\n   (0) No\n")

if pH_input == "":
    pH= 7.4
elif pH_input != "":
    pH = pH_input

f = open(filename + "_analysis.txt","w+")
for record in SeqIO.parse(filename, 'fasta'):
    X = ProteinAnalysis(str(record.seq))
    print("ANALYSIS OF", record, "\n ----------- \n -----------", file=f)
    #
    pd_count_amino_acids = pd.DataFrame(X.count_amino_acids(), index=[1])
    print("number of amino acids: \n",pd_count_amino_acids , file=f)
    plt_acc = pd_count_amino_acids.plot.bar()
    plt.savefig(filename + "_count_amino_acids_plot.pdf")
    #
    pd_get_amino_acids_percent = pd.DataFrame(X.get_amino_acids_percent(), index=[1])
    print("\n percentage of amino acids: \n", pd_get_amino_acids_percent, file=f)
    plt_acp = pd_get_amino_acids_percent.plot.bar()
    plt.savefig(filename + "amino_acids_percent_plot.pdf")
    #
    print("\n molecular weight: {:.2f}".format(X.molecular_weight()), file=f)
    print("\n aromaticity: {:.2f}".format(X.aromaticity()), file=f)
    print("\n instability index: {:.2f}".format(X.instability_index()), file=f)
    if flexibility_ynu == "1":
        print("\n flexibility: ", X.flexibility(), file=f)
    print("\n IEP: ", X.isoelectric_point(), file=f)
    print("therefore its charge at pH = ",pH," is {:.2f}".format(X.charge_at_pH(pH)), file=f)
    print("secondary structure fraction: (Helix, Turn, Sheet): ", X.secondary_structure_fraction(), "\n\n\n", file=f)
f.close()
print("done")

我现在想要氨基酸的绝对数量和相对数量的条形图,但要为每个 FASTA ID 创建一个独特的图。

例如,NC_011544 有 5 个 ID,所以我想生成 10 个独特的图(每个 ID 2 个,一个用于绝对数,一个用于相对数)。

有什么办法吗?

NC_011544.faa

>gi|212671454|ref|YP_002308464.1| replicase [Hosta virus X]
MARLREVFSSFTEPNLKTIVQQETYKLAKAELKTIQTYNPYAQTKDAADLLEDLGINTNPHAVTAHTHAA
AKSIENDLYGITSHYLPKTPITFLFMKRGKLQFFKRGPQHNDLFFYTTHEPKDVIRYQSEDQTADMFRVP
TSTGFIGDTLHFLSLKYLHRLFLKNPNLNTLYATMVLPPEAMYRMASIYPEIYQIQYQEDGFLYIPGGHG
GAAYFHTYDTLTWLRVGQFQAKEFTAHLPKVGDKGANHLFIIQRADLKTPKYRTFVPRRKWVTLPNIFLP
STQANHLFIIQRADLKTPKYRTFVPRRKWVTSNIFLPKHTNARKPILKQTMMQLFLYEKSVKEITFRDVF
AKIRQLIQTKDLEQFDPDELVRLANYVMHTSKLLEKDPYELIEGQGKLQDLVNPIKTWVSEKWQNWFGWK
DYTRLIRALKWVDVDLVLRVMNTRSTPTGIQTSELLPDEAGPPKSKKKRGGKKIPSPEPSRNCRSKSKRT
RGNRAQREKEPHRRKLRWQKENFQRVTVQVHQAPKGDPSPLARFSQSLKELPRRSQPRRLSKFQDFLMSS
TQTRFQIPSSLNRRAGHWRPKQQGTPPTTQEAGTEGPPTTQPGKPTASSPRAAPQPTANAETMEKGSQAS
SATTRGRDPVTDRTREQAPTNLTPEEEALPWKHWLKQLKAVGFKGNETQMDGDGTSISPIEQIKSCPGKP
KSVSKEILETLRSGHAPNFWKPDASRARAYTSDIKNRRTGAAVHMAPQAWKETMDFIAENAERTLHILRH
PWRRRFREEQMSSRDAHKFHFLFDETLVVCPTNELRRDWIDKLPLSEPGSVLTFERALMNPAKGTVIFDD
YTKLPAGFIEAYSICQPNVELVILTGDAKQASHHESNDNAMIAGLDPAAFEFSKFCRYYLNATHRNPRNL
ANALGIYSEKPGNLKVTFTNHLLPEMHILVPSLLKKATLEELGHKCSTYAGCQGVTLSKVQIYLDSNTTL
CSNEVLYTALSRAVEQINFVNSGPFNGPFWAKLEATPYLKTFLRLTREEKINEITPEEPKPKEPEPPKTH
FPVETSAHLYSSITEEMPEKHAREIYNKTHGHTNCVQTDEPLVQMFAHQQAKDEALFWETIEARLRITTS
EANVQELNEKRDIGDLLFHAYHKAMGLPKDPIPFENDLWETCAQEVQQTYLSKPINLIKNGEKRQGPDFD
KNAIMLFLKSQWVKKMEKLGAPTIKPGQTIASFHQITVMLYGTMARYMRRIRDRFCPKHILINCEKTPTQ
ISDFVKAQWDFSDFAYANDFTAFDQSQDGAMLQFEIIKAKFHNIPEDIILGYMDIKTNAKIFLGTLAIMR
LTGEGPTFDANTECNIAYTHLRFNVPENVAQVYAGDDSALSKVCPEKDSFKQFADRLTLKSKPQVFPQTQ
GAWAEFCGLLITPRGIIKDPVKLHASWVLATKLGTLQQIKCVNSYGEDLKLSYDLGDHLQELLSESQCRT
HQVTVRELVKFAGKVEKHQAEIRSVANGNIRQLPFFY
>gi|212671455|ref|YP_002308465.1| 26 kDa protein [Hosta virus X]
MATFASFLSSTRPDFERTNTPLTKPLVIHAVAGAGKTTLLRDFLRANPLTNAQTLGTPDCPTLDGAYIRP
FSGPVANLVNILDEYTAHRHGSWDVLIADPLQHYERAKLPHYICKRSHRLCPATARLLRKLGLDIHSYRE
DESEISFSDIFSGQLEGTVLPLTPLCKDLLERHSCPFKCPSEFIGEQDDIITVVSEIPLSKHPDKTALYR
ALTRHTRRLNVLAPPPYPTP
>gi|212671456|ref|YP_002308466.1| 13 kDa protein [Hosta virus X]
MSSPHRLTPPPNYTPVLLAVVIGVGLAVVTNQLTRSTLPHVGDNIHSLPHGGNYKDGTKSVIYRGPAPFQ
RSHSTAPPFNAVLLLTFAIWFLSCRTRRAAIGIHVCHTCSQTREQQ
>gi|212671457|ref|YP_002308467.1| 8 kDa protein [Hosta virus X]
MQSFCSHLRSGSFPVVLGALLLAFTCATLVLRLGNNNSNNCLIYVDGARAFLEGNCAGISAEVVAALRPH
SHAG
>gi|212671458|ref|YP_002308468.1| coat protein [Hosta virus X]
MASDAPTPPAAPSPVTFTAPTQEQLTSLALPIISTRLPSPDVLNQISVKWQELGVPTASISSTAIALCMA
CYHSGSSGSTLIPGLAPGTTVNYTSLAAAVKSLATLREFARYFAPIIWNYAIEHKIPPANWAAMGYKENT
KYAAFDTFDSILNPAALQPTGGLIRQPTEEELLAHQANSALHIFDSLRNDFASTDGRVTRGHITSNVNSL
NYLPAPEGSS
  • 代码已经为每条记录创建了一个图,但是,所有内容都使用了相同的文件名,因此每个新图和 analysis.txt 都会覆盖之前的图。
  • record 中提取 id 作为 record_id,并使用它在每个循环中创建唯一的文件名。
  • 下面的代码现在将为每个 id
  • 生成一个 percent_plot.pdfcount_plot.pdf 和一个 analysis.txt
  • 使用with open,会自动关闭文件f
  • .T用于在绘图时转置数据框,将氨基酸名称放在轴上,从而使绘图更易于阅读。
import pandas
import matplotlib.pyplot as plt
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# set pandas display option
pd.set_option('display.max_columns', None)

# input selections were removed for testing
filename = 'NC_011544.faa'
flexibility_ynu =  '1'
pH = 7.4

for record in SeqIO.parse(filename, 'fasta'):
    # get each record id to be used for unique file names
    record_id = record.id.split('|')[1]
    print(record)
    # open the file and use the record id as part of the name, so there will be a unique file for each id
    with open(f"{filename}_{record_id}_analysis.txt","w+") as f:

        X = ProteinAnalysis(str(record.seq))
        print("ANALYSIS OF", record, "\n ----------- \n -----------", file=f)
        #
        pd_count_amino_acids = pd.DataFrame(X.count_amino_acids(), index=[1])
        display(pd_count_amino_acids)  # display is for jupyter notebook, otherwise use print
        print("number of amino acids: \n",pd_count_amino_acids , file=f)
        plt_acc = pd_count_amino_acids.T.plot.bar(legend=False, figsize=(7, 5))
        plt.title(f'Number of Amino Acids for {record_id} in {filename}')
        plt.xticks(rotation=0)
        # add the record id as part of the file name
        plt.savefig(f'{filename}_{record_id}_count_amino_acids_plot.pdf')
        plt.show()
        #
        pd_get_amino_acids_percent = pd.DataFrame(X.get_amino_acids_percent(), index=[1])
        display(pd_get_amino_acids_percent.round(2))  # display is for jupyter notebook, otherwise use print
        print("\n percentage of amino acids: \n", pd_get_amino_acids_percent, file=f)
        plt_acp = pd_get_amino_acids_percent.T.plot.bar(legend=False, figsize=(7, 5))
        plt.title(f'Percentage of Amino Acids for {record_id} in {filename}')
        plt.xticks(rotation=0)
        # add the record id as part of the file name
        plt.savefig(f'{filename}_{record_id}_amino_acids_percent_plot.pdf')
        plt.show()
        #
        print("\n molecular weight: {:.2f}".format(X.molecular_weight()), file=f)
        print("\n aromaticity: {:.2f}".format(X.aromaticity()), file=f)
        print("\n instability index: {:.2f}".format(X.instability_index()), file=f)

        if flexibility_ynu == "1":
            print("\n flexibility: ", X.flexibility(), file=f)

        print("\n IEP: ", X.isoelectric_point(), file=f)
        print("therefore its charge at pH = ",pH," is {:.2f}".format(X.charge_at_pH(pH)), file=f)
        print("secondary structure fraction: (Helix, Turn, Sheet): ", X.secondary_structure_fraction(), "\n\n\n", file=f)
        print('\n')

第一个ID

第二个ID