如何检索包含 header 中任何字符串列表的所有 fasta 记录
How to retrieve all fasta records that contain any of a list of strings in the header
我想要一个脚本,它可以从包含 header.
中任何字符串列表的文件中提取所有 fasta 记录
所以如果我在一个文件中有这样一个列表:
2.A.1.13.5
3.A.1.208.23
还有这样的 fasta 文件:
>gnl|TC-DB|O60645|1.F.2.1.2 Exocyst complex component 3 OS=Homo sapiens GN=EXOC3 PE=1 SV=2
MQCEDSTSFFTMKETDREAVATAVQRVAGMLQRPDQLDKVEQYRRREARKKASVEARLKA
>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60683|3.A.20.1.1 Peroxisome biogenesis factor 10 OS=Homo sapiens GN=PEX10 PE=1 SV=1
MAPAAASPPEVIRAAQKDEYYRGGLRSAAGGALHSLAGARKWLEWRKEVELLSDVAYFGL
>gnl|TC-DB|O60684|1.I.1.1.3 Importin subunit alpha-7 OS=Homo sapiens GN=KPNA6 PE=1 SV=1
METMASPGKDNYRMKSYKNNALNPEEMRRRREEEGIQLRKQKREQQLFKRRNVELINEEA
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP
然后脚本将打印第 2、5、6 和 7 条记录,如下所示:
>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP
这是我正在采用的方法,但我可能离得很远,因为我不太确定我在做什么。有人告诉我 BioPython 很适合处理 fasta 格式文件,但我仍在努力掌握它。
from Bio import SeqIO
import sys
headers = []
search_list = open(sys.argv[1])
for line in search_list:
headers.append(line.rstrip())
search_list.close()
print search_list
for seq_record in SeqIO.parse(sys.argv[2], "fasta"):
#print seq_record
for a in headers:
head = str(a)
if head in seq_record:
print seq_record
提前为任何帮助干杯!
import re
list_items = '|'.join(list(open('list.txt').read().split('\n')))
with open('fasta.txt') as f:
for line in f.read().split('\n'):
if re.search(list_items, line ):
print line
输出:
gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
您的代码几乎包含所有内容。您只需要检查 header 是否在 description
中,然后打印 SeqRecord 元素即可。
from Bio import SeqIO
import sys
headers = list()
with open(sys.argv[1], 'r') as search_list:
for line in search_list:
headers.append(line.rstrip())
for seq_record in SeqIO.parse(sys.argv[2], "fasta"):
for head in headers:
if head in seq_record.description:
print(seq_record.format('fasta'))
break
小提示:我在第一个找到的元素后添加了一个 break
语句,如果找到两个 header 它将打印两次。
我想要一个脚本,它可以从包含 header.
中任何字符串列表的文件中提取所有 fasta 记录所以如果我在一个文件中有这样一个列表:
2.A.1.13.5
3.A.1.208.23
还有这样的 fasta 文件:
>gnl|TC-DB|O60645|1.F.2.1.2 Exocyst complex component 3 OS=Homo sapiens GN=EXOC3 PE=1 SV=2
MQCEDSTSFFTMKETDREAVATAVQRVAGMLQRPDQLDKVEQYRRREARKKASVEARLKA
>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60683|3.A.20.1.1 Peroxisome biogenesis factor 10 OS=Homo sapiens GN=PEX10 PE=1 SV=1
MAPAAASPPEVIRAAQKDEYYRGGLRSAAGGALHSLAGARKWLEWRKEVELLSDVAYFGL
>gnl|TC-DB|O60684|1.I.1.1.3 Importin subunit alpha-7 OS=Homo sapiens GN=KPNA6 PE=1 SV=1
METMASPGKDNYRMKSYKNNALNPEEMRRRREEEGIQLRKQKREQQLFKRRNVELINEEA
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP
然后脚本将打印第 2、5、6 和 7 条记录,如下所示:
>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP
这是我正在采用的方法,但我可能离得很远,因为我不太确定我在做什么。有人告诉我 BioPython 很适合处理 fasta 格式文件,但我仍在努力掌握它。
from Bio import SeqIO
import sys
headers = []
search_list = open(sys.argv[1])
for line in search_list:
headers.append(line.rstrip())
search_list.close()
print search_list
for seq_record in SeqIO.parse(sys.argv[2], "fasta"):
#print seq_record
for a in headers:
head = str(a)
if head in seq_record:
print seq_record
提前为任何帮助干杯!
import re
list_items = '|'.join(list(open('list.txt').read().split('\n')))
with open('fasta.txt') as f:
for line in f.read().split('\n'):
if re.search(list_items, line ):
print line
输出:
gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human). gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2 gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1 gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
您的代码几乎包含所有内容。您只需要检查 header 是否在 description
中,然后打印 SeqRecord 元素即可。
from Bio import SeqIO
import sys
headers = list()
with open(sys.argv[1], 'r') as search_list:
for line in search_list:
headers.append(line.rstrip())
for seq_record in SeqIO.parse(sys.argv[2], "fasta"):
for head in headers:
if head in seq_record.description:
print(seq_record.format('fasta'))
break
小提示:我在第一个找到的元素后添加了一个 break
语句,如果找到两个 header 它将打印两次。