Opening the connection and getting the response takes too much time
I wrote a Python script that queries this endpoint with SPARQL to get some information about genes. The script works like this:
Get genes
Foreach gene:
    Get proteins
    Foreach protein:
        Get the protein function
        .....
Get Taxons
....
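The nested-loop structure above can be sketched as follows. This is a minimal sketch only: `run_query` stands in for the real SPARQL call (each invocation of it corresponds to one round trip to the endpoint), and the data it returns is invented placeholder data, not from the real endpoint.

```python
# Stub standing in for one SPARQL query round trip to the endpoint.
# Each call to run_query corresponds to one HTTP request in the real script.
FAKE_DATA = {
    "genes": ["gene:1", "gene:2"],
    "gene:1": ["prot:A"],
    "gene:2": ["prot:B", "prot:C"],
    "prot:A": "kinase activity",
    "prot:B": "transport",
    "prot:C": "binding",
}

def run_query(key):
    return FAKE_DATA[key]

def collect_protein_functions():
    functions = {}
    for gene in run_query("genes"):            # Get genes
        for prot in run_query(gene):           # Get proteins for this gene
            functions[prot] = run_query(prot)  # Get the protein function
    return functions

print(collect_protein_functions())
```

With this shape, the number of queries grows with the number of genes times proteins per gene, which is why the per-query connection cost dominates the profile below.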
But the script takes too long to run. I profiled it with pyinstrument and got the following results:
39.481 <module> extracting_genes.py:10
`- 39.282 _main extracting_genes.py:750
|- 21.629 create_prot_func_info_dico extracting_genes.py:613
| `- 21.609 get_prot_func_info extracting_genes.py:216
| `- 21.596 query build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:780
| `- 21.596 _query build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:750
| `- 21.588 urlopen urllib2.py:131
| `- 21.588 open urllib2.py:411
| `- 21.588 _open urllib2.py:439
| `- 21.588 _call_chain urllib2.py:399
| `- 21.588 http_open urllib2.py:1229
| `- 21.588 do_open urllib2.py:1154
| |- 11.207 request httplib.py:1040
| | `- 11.207 _send_request httplib.py:1067
| | `- 11.205 endheaders httplib.py:1025
| | `- 11.205 _send_output httplib.py:867
| | `- 11.205 send httplib.py:840
| | `- 11.205 connect httplib.py:818
| | `- 11.205 create_connection socket.py:541
| | `- 9.552 meth socket.py:227
| `- 10.379 getresponse httplib.py:1084
| `- 10.379 begin httplib.py:431
| `- 10.379 _read_status httplib.py:392
| `- 10.379 readline socket.py:410
|- 6.045 create_gene_info_dico extracting_genes.py:323
| `- 6.040 ...
|- 3.957 create_prots_info_dico extracting_genes.py:381
| `- 3.928 ...
|- 3.414 create_taxons_info_dico extracting_genes.py:668
| `- 3.414 ...
|- 3.005 create_prot_parti_info_dico extracting_genes.py:558
| `- 2.999 ...
`- 0.894 create_prot_loc_info_dico extracting_genes.py:504
`- 0.893 ...
Basically, I execute many queries (60,000+), so my understanding is that opening the connection and getting the response happen over and over, which slows down execution.
Does anyone know how to fix this?
As @Stanislav mentioned, SPARQLWrapper doesn't support persistent connections, but I found a way to keep the connection alive, using the setUseKeepAlive() function defined in SPARQLWrapper/Wrapper.py, which uses urllib2.
I first had to install the keepalive package:
pip install keepalive
It cut the execution time by nearly 40%.
from SPARQLWrapper import SPARQLWrapper

def get_all_genes_uri(endpoint, the_offset):
    sparql = SPARQLWrapper(endpoint)
    sparql.setUseKeepAlive() # <--- Added this line
    sparql.setQuery("""
    #My_query
    """)
    ....
And got the following results:
24.673 <module> extracting_genes.py:10
`- 24.473 _main extracting_genes.py:750
|- 12.314 create_prot_func_info_dico extracting_genes.py:613
| `- 12.068 get_prot_func_info extracting_genes.py:216
| |- 11.428 query build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:780
| | `- 11.426 _query build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:750
| | `- 11.353 urlopen urllib2.py:131
| | `- 11.353 open urllib2.py:411
| | `- 11.339 _open urllib2.py:439
| | `- 11.338 _call_chain urllib2.py:399
| | `- 11.338 http_open keepalive/keepalive.py:343
| | `- 11.338 do_open keepalive/keepalive.py:213
| | `- 11.329 _reuse_connection keepalive/keepalive.py:264
| | `- 11.280 getresponse httplib.py:1084
| | `- 11.262 begin httplib.py:431
| | `- 11.207 _read_status httplib.py:392
| | `- 11.204 readline socket.py:410
| `- 0.304 __init__ build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:261
| `- 0.292 resetQuery build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:301
| `- 0.288 setQuery build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:516
|- 4.894 create_gene_info_dico extracting_genes.py:323
| `- 4.880 ...
|- 2.631 create_prots_info_dico extracting_genes.py:381
| `- 2.595 ...
|- 1.933 create_taxons_info_dico extracting_genes.py:668
| `- 1.923 ...
|- 1.804 create_prot_parti_info_dico extracting_genes.py:558
| `- 1.780 ...
`- 0.514 create_prot_loc_info_dico extracting_genes.py:504
`- 0.510 ...
Honestly, the execution time is still not as fast as I would like, so I will keep looking for other options.
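One further lever worth trying (an untested assumption on my part, not something I have run against this endpoint) is to reduce the number of round trips altogether by batching many URIs into a single query with a SPARQL VALUES clause. The sketch below only builds the query string; the variable names and the predicate IRI are placeholders, not the real schema.

```python
def build_batched_query(prot_uris):
    """Build one SELECT query covering many protein URIs via a VALUES clause.

    The predicate <http://example.org/hasFunction> is a placeholder; the real
    query would use the endpoint's actual predicate for protein function.
    """
    values = " ".join("<%s>" % uri for uri in prot_uris)
    return (
        "SELECT ?prot ?func WHERE {\n"
        "  VALUES ?prot { %s }\n"
        "  ?prot <http://example.org/hasFunction> ?func .\n"
        "}" % values
    )

print(build_batched_query(["http://example.org/protA",
                           "http://example.org/protB"]))
```

One query per batch of N proteins instead of N queries removes N-1 connection/response cycles, which is where the profile above shows the time going.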