Python: extract strings from the web text of a list of URLs
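I have a list of URLs pointing to web text, and I need to extract information from each one and store it in a list.
The strings I need to extract always start with P:, C: or F: and always end with ";".
I am having trouble putting this together; any help would be appreciated.
Example of the web text from one of the URLs: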
DR Proteomes; UP000005640; Chromosome 3.
DR Bgee; C9J872; -.
DR ExpressionAtlas; C9J872; baseline and differential.
DR GO; GO:0005634; C:nucleus; IBA:GO_Central.
DR GO; GO:0005667; C:transcription factor complex; IEA:InterPro.
DR GO; GO:0003677; F:DNA binding; IEA:UniProtKB-KW.
DR GO; GO:0000981; F:sequence-specific DNA binding RNA polymerase II transcription factor activity; IBA:GO_Central.
DR GO; GO:0003712; F:transcription cofactor activity; IEA:InterPro.
DR GO; GO:0000278; P:mitotic cell cycle; IEA:InterPro.
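Here is the expected result when searching for what comes after C:
['nucleus', 'transcription factor complex']
But it also needs to go through the different URLs and append to the same list.
An example of what I have tried so far, without success: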
import urllib2
import sys
import re

IDlist = ['C9JVZ1', 'C9JLN0', 'C9J872']
URLlist = ["http://www.uniprot.org/uniprot/"+x+".txt" for x in IDlist]

function_list = []
for item in URLlist:
    textfile = urllib2.urlopen(item)
    myfile = textfile.read()
    for line in myfile:
        function = re.search('P:(.+?);', line).group(1)
        function_list.append(function)
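For reference, a small offline sketch of why the loop above does not work as intended. `read()` returns the whole page as one string, so `for line in myfile` walks over single characters rather than lines, and `re.search` returns `None` when nothing matches, so calling `.group(1)` on it raises `AttributeError`. The two sample lines are copied from the example text above and stand in for the downloaded page; nothing is fetched here.

```python
import re

# Two lines copied from the sample above, standing in for textfile.read().
myfile = (
    "DR GO; GO:0005634; C:nucleus; IBA:GO_Central.\n"
    "DR GO; GO:0000278; P:mitotic cell cycle; IEA:InterPro.\n"
)

# Iterating over a string yields characters, not lines.
first_items = [item for item in myfile][:3]

# A single character can never match 'P:(.+?);', so re.search returns None.
no_match = re.search('P:(.+?);', first_items[0])

# Splitting into lines and checking for None first avoids both problems.
function_list = []
for line in myfile.splitlines():
    found = re.search('P:(.+?);', line)
    if found:
        function_list.append(found.group(1))
```

With only these two lines as input, `function_list` ends up holding the single P: term.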
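Here is the updated file, including your dictionary. Note that I changed the loop control to key on the file ID: that ID is used as the dictionary key.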
import urllib2
import re

IDlist = ['C9JVZ1', 'C9JLN0', 'C9J872']
function_dict = {}

# Cycle through the data files, keyed by ID
for id in IDlist:
    # Start a new list of functions for this file.
    # Open the file and read line by line.
    function_list = []
    textfile = urllib2.urlopen("http://www.uniprot.org/uniprot/"+id+".txt")
    myfile = textfile.readlines()
    for line in myfile:
        # When you find a function tag, extract the function and add it to the list.
        found = re.search(' [PCF]:(.+?);', line)
        if found:
            function = found.group(1)
            function_list.append(function)
    # At end of file, insert the list into the dictionary.
    function_dict[id] = function_list

print function_dict
The output I get from your data is:
{'C9JVZ1': [], 'C9J872': ['nucleus', 'transcription factor complex', 'DNA binding', 'sequence-specific DNA binding RNA polymerase II transcription factor activity', 'transcription cofactor activity', 'mitotic cell cycle', 'regulation of transcription from RNA polymerase II promoter', 'transcription, DNA-templated'], 'C9JLN0': ['cytosol']}
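The code above is Python 2 (`urllib2`, the `print` statement). Below is a minimal Python 3 sketch of the same idea that also covers the two things asked for in the question: filtering on one prefix such as C:, and appending the results from every URL to a single list. The helper names `extract_terms`, `fetch_text` and `collect_terms`, the `fetch` parameter, and the UTF-8 decoding are assumptions made for illustration. The URL pattern is taken from the code above; whether it still resolves is not verified here, so the demonstration at the bottom parses the sample text from the question instead of downloading anything.

```python
import re
from urllib.request import urlopen

# Matches e.g. " C:nucleus;" and captures the category letter and the term.
GO_TERM = re.compile(r" ([PCF]):(.+?);")


def extract_terms(text, categories="PCF"):
    """Return the terms in text whose category letter is in categories."""
    terms = []
    for line in text.splitlines():
        found = GO_TERM.search(line)
        if found and found.group(1) in categories:
            terms.append(found.group(2))
    return terms


def fetch_text(url):
    """Download url and decode it as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def collect_terms(ids, categories="PCF", fetch=fetch_text):
    """Append the terms found for every ID to one flat list."""
    all_terms = []
    for uniprot_id in ids:
        url = "http://www.uniprot.org/uniprot/" + uniprot_id + ".txt"
        all_terms.extend(extract_terms(fetch(url), categories))
    return all_terms


# Offline demonstration on the sample text from the question.
sample = """DR Proteomes; UP000005640; Chromosome 3.
DR Bgee; C9J872; -.
DR ExpressionAtlas; C9J872; baseline and differential.
DR GO; GO:0005634; C:nucleus; IBA:GO_Central.
DR GO; GO:0005667; C:transcription factor complex; IEA:InterPro.
DR GO; GO:0003677; F:DNA binding; IEA:UniProtKB-KW.
DR GO; GO:0003712; F:transcription cofactor activity; IEA:InterPro.
DR GO; GO:0000278; P:mitotic cell cycle; IEA:InterPro.
"""
component_terms = extract_terms(sample, "C")
print(component_terms)
```

To run it against the real pages, something like `collect_terms(['C9JVZ1', 'C9JLN0', 'C9J872'], "C")` would be the call; passing a different `fetch` function makes it possible to test the loop without network access.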