Parsing a specific string from the header of a FASTA file
I want to get the organism name from the headers of a FASTA file. Specifically, I want to extract the value of the OS= field (the organism name) from each description.
FASTA header:
>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
MPICEFSATSKSRKIDVHAHVLPKNIPDFQEKFGYPGFVRLDHKEDGTTHMVKDGKLFRV
VEPNCFDTETRIADMNRANVNVQCLSTVPVMFSYWAKPADTEIVARFVNDDLLAECQKFP
GKEHIVLGTDYPFPLGEL
EVGRVVEEYKPFSAKDREDLLWKNAVKMLDIDENLLFNKDF
>sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2
MNSLLRLSHLAGPAHYRALHSSSSIWSKVAISKFEPKSYLPYEKLSQTVKIVKDRLKRPL
TLSEKILYGHLDQPKTQDIERGVSYLRLRPDRVAMQDATAQMAMLQFISSGLPKTAVPST
IHCDHLIEAQKGGAQDLARAKDLNKEVFNFLATAGSKYGVGFWKPGSGIIHQIILENYAF
Code used to fetch the FASTA headers:
from Bio import SeqIO
import re
import pandas as pd

input_file = "ANIMAL.fasta"

# SeqIO.parse accepts a filename and yields one SeqRecord per FASTA entry
fasta_sequences = SeqIO.parse(input_file, 'fasta')
for fasta in fasta_sequences:
    fasta_id, sequence = fasta.id, str(fasta.seq)
    print(fasta.description)
Current output:
>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
>sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2
Expected output:
Caenorhabditis elegans
Caenorhabditis elegans
You can use a regular expression to search for that information:
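import re
example = "sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2"
# Match "OS=" rather than "OS" so a stray "OS" in an ID or protein name is not picked up
start = re.search("OS=", example).start()
# Skip the 3 characters of "OS=", cut at the next " GN=" tag, and trim whitespace
result = example[start+3:].split(" GN=")[0].strip()
print(result)
>> Caenorhabditis elegans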
This code takes the text after "OS=" up to "GN" and strips the surrounding whitespace.
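Not every UniProt-style header carries a GN= field, so it may be safer to stop at whichever two-letter tag comes next. The following is only a sketch under that assumption: `extract_organism` and `OS_PATTERN` are hypothetical names, not part of Biopython, and the pattern assumes field tags always look like two capital letters followed by "=".

```python
import re

# Hypothetical helper: capture the text after "OS=" up to the next
# " XX=" style field tag (e.g. OX=, GN=, PE=, SV=) or the end of the string.
OS_PATTERN = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)")

def extract_organism(description):
    """Return the organism name from a header description, or None if OS= is absent."""
    match = OS_PATTERN.search(description)
    return match.group(1).strip() if match else None

headers = [
    "sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1",
    "sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2",
]

for header in headers:
    print(extract_organism(header))
```

In the loop from the question, the same function could be called as `extract_organism(fasta.description)`; records without an OS= field would give `None` instead of raising an error.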