Reading a FASTA file from a URL
I'm using Python 3.4.
I wrote some code to read a FASTA file from a website, but it didn't work.
http://www.uniprot.org/uniprot/B5ZC00.fasta
(I can download and read it as a text file, but I intend to read multiple FASTA files from the given site.)
(1) First attempt
# read FASTA file
def read_fasta(filename_as_string):
    """
    open text file with FASTA format
    read it and convert it into string list
    convert the list to dictionary
    >>> read_fasta('sample.txt')
    {'Rosalind_0000':'GTAT....ATGA', ... }
    """
    f = open(filename_as_string,'r')
    content = [line.strip() for line in f]
    f.close()
    new_content = []
    for line in content:
        if '>Rosalind' in line:
            new_content.append(line.strip('>'))
            new_content.append('')
        else:
            new_content[-1] += line
    dict = {}
    for i in range(len(new_content)-1):
        if i % 2 == 0:
            dict[new_content[i]] = new_content[i+1]
    return dict
This code can read any FASTA file on my desktop computer, but it fails to read the same FASTA file from the Internet.
>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> print (read_fasta(html))
TypeError: invalid file: <http.client.HTTPResponse object at 0x02A62EF0>
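The TypeError happens because `read_fasta` passes the `HTTPResponse` object to `open()`, which expects a filename. A minimal sketch of the fix: `urlopen()` returns a binary file-like object, so wrapping it in `io.TextIOWrapper` yields decoded text lines that can be iterated just like a local file. Here `io.BytesIO` stands in for the HTTP response so the example runs offline (an assumption; with a real URL you would wrap the `urlopen(...)` result the same way).

```python
import io

# io.BytesIO simulates the binary HTTP response object (assumption for
# an offline example); urlopen() returns a similar binary stream.
fake_response = io.BytesIO(b">Rosalind_0001\nGTAT\nATGA\n")

# TextIOWrapper decodes the byte stream into text lines on the fly.
text = io.TextIOWrapper(fake_response, encoding="utf-8")
lines = [line.strip() for line in text]
print(lines)  # ['>Rosalind_0001', 'GTAT', 'ATGA']
```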
(2) Second attempt
>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> lines = [x.strip() for x in html.readlines()]
>>> print (lines)
[b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1', b'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ', b'KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS', b'NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN', b'FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY', b'LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD', b'LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM', b'DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY', b'CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK']
I thought I could modify my code to read the online FASTA file as a list of strings, but I soon realized it wasn't that easy.
>>> print (type(lines[0]))
<class 'bytes'>
I couldn't remove the dirty 'b' character at the head of each element in the list.
>>> print (lines[0])
b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
>>> print (lines[0][1:])
b'sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
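The `b'` prefix is not a character in the data at all; it is Python's notation for a `bytes` object. Calling `.decode()` on each element converts it to a normal `str`. A short sketch using literal bytes (standing in for the downloaded lines, so it runs offline):

```python
# Literal bytes stand in for the lines read from the HTTP response.
raw = [b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase',
       b'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ']

# Decoding turns each bytes object into a plain string; the b' disappears.
decoded = [x.decode('utf-8') for x in raw]
print(decoded[0])  # >sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase
```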
(3) Questions
How do I remove the dirty 'b' character?
Is there a better way to read a FASTA file from a given URL?
With some help, I think I can revise and polish my code.
Thank you.
Late to the party, but here's my answer in case it's useful.
In Python 2:
import urllib2
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urllib2.urlopen(url)
fasta = response.read()
print fasta
In Python 3:
from urllib.request import urlopen
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urlopen(url)
fasta = response.read().decode("utf-8", "ignore")
print(fasta)
You get:
>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1
MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ
KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS
NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN
FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY
LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD
LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM
DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY
CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK
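Once the downloaded FASTA is a single string, turning it into the `{header: sequence}` dictionary the question asks for is plain string processing. A minimal sketch, assuming every record header starts with `>` (the helper name `fasta_to_dict` and the sample string are illustrative, not from the original):

```python
def fasta_to_dict(fasta_text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            header = line[1:]          # drop the leading '>'
            records[header] = ''
        elif header is not None:
            records[header] += line    # append sequence lines

    return records

# Hypothetical two-record sample to show the shape of the result.
sample = ">seq1\nGTAT\nATGA\n>seq2\nCCCC\n"
print(fasta_to_dict(sample))  # {'seq1': 'GTATATGA', 'seq2': 'CCCC'}
```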
Bonus
It's better to use Biopython (example for Python 2):
from Bio import SeqIO
import urllib2
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urllib2.urlopen(url)
fasta_iterator = SeqIO.parse(response, "fasta")
for seq in fasta_iterator:
    print seq.format("fasta")
If you're only interested in the primary amino-acid sequence (and want to ignore the header), try the following:
import sys
import urllib

link = str(sys.argv[1])  # fasta file URL provided as command line argument
FASTA = urllib.urlopen(link).readlines()[1:]  # as list without header (">...")
FASTA = "".join(FASTA).replace("\n","")  # as a string free of new line markers
print FASTA
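The same header-stripping idea works in Python 3 once the response bytes are decoded. A sketch applied to already-downloaded text so it runs offline (the sample string is illustrative; with a real URL you would first do `fasta = urlopen(link).read().decode()`):

```python
# Illustrative already-downloaded FASTA text (assumption for an offline example).
fasta = ">sp|B5ZC00|SYG_UREU1 example header\nMKNKFKTQ\nKDVVGLDS\n"

# Drop the header line, then join the remaining lines into one sequence.
sequence = "".join(fasta.splitlines()[1:])
print(sequence)  # MKNKFKTQKDVVGLDS
```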
A bit late to the party, but Jose's Biopython answer no longer works in Python 3. Here's an alternative:
from Bio import SeqIO
import requests
from io import StringIO
link = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
data = requests.get(link).text
fasta_iterator = SeqIO.parse(StringIO(data), "fasta")
# Pretty print the fasta info
for seq in fasta_iterator:
    print(seq.format("fasta"))