Extracting the gene start position from a .fasta file in Python using Biopython
I have a .fasta file containing multiple genes. They all have similar descriptions, for example:
>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
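I am trying to extract the gene start position for all of these genes (i.e. the "1" in the example above). I tried the following code, but it doesn't seem to work.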
from Bio import SeqIO
genes = fasta_file.fasta
records = SeqIO.parse(open(genes), 'fasta')
record = next(records)
parts = record.description.split("..")
print(parts[0])
Any help or resources would be greatly appreciated!
This works for me. Hope it helps.
import re
from Bio import SeqIO
genes = "fasta_file.fasta"
records = SeqIO.parse(genes, 'fasta')
# fasta_file.fasta file has this line only.
>lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
You can use SeqIO.parse(filename, "fasta") to get the records.
To check this,
for record in SeqIO.parse(genes, 'fasta'):
    print(record)
prints the following. record.description holds the full header line as a string.
ID: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Name: lcl|NZ_LN831034.1_cds_WP_002987659.1_1
Description: lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] [locus_tag=B6D67_RS00005] [db_xref=GeneID:46805773] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_002987659.1] [location=1..1356] [gbkey=CDS]
Number of features: 0
Seq('', SingleLetterAlphabet())
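As a side note on the attempt in the question: apart from the unquoted filename, splitting the description on ".." returns everything before the first "..", not just the number. The sketch below illustrates this on a shortened copy of the header (shortened here purely for illustration) and shows one way to peel the number off. It assumes ".." appears only inside the location tag, which may not hold for every file.

```python
# Shortened copy of the example header, used only for illustration.
description = (
    "lcl|NZ_LN831034.1_cds_WP_002987659.1_1 [gene=dnaA] "
    "[location=1..1356] [gbkey=CDS]"
)

parts = description.split("..")
prefix = parts[0]  # the whole header up to the first "..", ending in "[location=1"
print(prefix)

# The start coordinate is only the text after the last "=" in that prefix.
start = int(prefix.rsplit("=", 1)[-1])
print(start)
```

This works for a plain `start..end` location, but it would likely break on wrapped forms such as `complement(...)`, so a regular expression is the sturdier choice.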
Use a regular expression to get the number right after "location=".
ma = re.search(r"location=(\d+)\.\.\d+", record.description)
ma.group(1)  # '1' (a string; wrap it in int() if you need a number)
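Since the question asks for the start of every gene in the file, the same idea can be applied in a loop. The following is a minimal sketch, not a definitive implementation: the helper names `start_from_description` and `gene_starts` are made up for this example, and the pattern assumes NCBI-style `[location=...]` tags, where the coordinates may be wrapped in `complement(...)` or `join(...)` or prefixed with `<`. It returns the first coordinate that appears in the tag; for a gene on the complement strand that is the lower coordinate, which may not be the biological start.

```python
import re

# First run of digits inside "[location=...]". Skips non-digit wrappers such as
# "complement(" or "join(" and a leading "<" (assumed NCBI-style header format).
LOCATION_START = re.compile(r"\[location=[^\]\d]*(\d+)")


def start_from_description(description):
    """Return the first coordinate in the [location=...] tag as an int, or None."""
    match = LOCATION_START.search(description)
    return int(match.group(1)) if match else None


def gene_starts(fasta_path):
    """Map each record ID in a FASTA file to its first location coordinate."""
    # Imported here so the helper above can be used without Biopython installed.
    from Bio import SeqIO

    return {
        record.id: start_from_description(record.description)
        for record in SeqIO.parse(fasta_path, "fasta")
    }
```

With the file from the question, `gene_starts("fasta_file.fasta")` would give a dictionary keyed by record ID. Records without a location tag map to `None` instead of raising an error.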