How to read a .net file downloaded by urllib3?

I am using urllib3 to download the file airports.net from GitHub and networkx.read_pajek to read it as a graph object, as follows:

import urllib3
import networkx as nx


http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)
G = nx.read_pajek(f.data(), encoding = 'UTF-8')
print(G)

Then I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-7728c1228755> in <module>
     13 url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
     14 f = http.request('GET', url)
---> 15 G = nx.read_pajek(f.data(), encoding = 'UTF-8')
     16 print(G)
     17 

TypeError: 'bytes' object is not callable

Could you please explain how to do this?

Update: If I change f.data() to f.data, I get a new error:

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-e96ad6eb1bfb> in <module>()
      6 url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
      7 f = http.request('GET', url)
----> 8 G = nx.read_pajek(f.data, encoding = 'UTF-8')
      9 print(G)

<decorator-gen-781> in read_pajek(path, encoding)

4 frames
/usr/local/lib/python3.6/dist-packages/networkx/readwrite/pajek.py in <genexpr>(.0)
    159     for format information.
    160     """
--> 161     lines = (line.decode(encoding) for line in path)
    162     return parse_pajek(lines)
    163 

AttributeError: 'int' object has no attribute 'decode'

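As can be inferred from the error message, and as the urllib3 docs also state, HTTPResponse.data is an attribute of type bytes, not a method. So you need f.data rather than f.data() to retrieve the value.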
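The first error can be reproduced without any network access. The snippet below is a minimal sketch that uses a hypothetical bytes literal as a stand-in for f.data and shows that adding parentheses to a bytes value raises the same TypeError:

```python
# Hypothetical stand-in for f.data: any bytes object behaves the same way.
payload = b"*Vertices 2\n"

try:
    payload()  # same mistake as f.data(): trying to call a bytes object
except TypeError as exc:
    message = str(exc)

print(message)  # 'bytes' object is not callable
```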

Update

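Regarding the AttributeError: as the networkx docs confirm, read_pajek expects its first argument to be a file name or an open file object, not the raw data itself. When it is given a bytes object, it iterates over it as if it were a file, and iterating over bytes yields integers, hence "'int' object has no attribute 'decode'". So you can dump the bytes to a file and pass the path to that file as the argument, or wrap the bytes in a file-like object. There are a few options: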

  1. Just use a hard-coded file name. This is arguably the simplest option and needs no extra imports.
import urllib3
import networkx as nx

FILE_NAME = "/tmp/test.net"

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

# Write the raw bytes unchanged; read_pajek decodes them with the given encoding.
with open(FILE_NAME, "wb") as fh:
    fh.write(f.data)

G = nx.read_pajek(FILE_NAME, encoding='UTF-8')
print(f"G='{G}', G.size={G.size()}")
  2. Use the tempfile standard library module to manage the file for you (i.e. give it a random name, then delete it once it is no longer in use).
import tempfile

import urllib3
import networkx as nx

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

with tempfile.NamedTemporaryFile() as fh:
    fh.write(f.data)
    # Flush so the bytes are on disk before networkx reopens the file by name.
    fh.flush()
    G = nx.read_pajek(fh.name, encoding='UTF-8')

print(f"G='{G}', G.size={G.size()}")
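  3. Use io.BytesIO / io.StringIO (an "in-memory file"). This creates an object that lives in memory (RAM) but has the same API as a regular file stored on disk. Accessing data in RAM is much faster than accessing the disk, so this is useful for performance. Of course, you cannot always use it, since RAM is limited, but in your case the data is already in memory, so dumping it to disk only for networkx to read it back into memory is wasted work. You probably will not notice the difference here, since you download a single, fairly small file once, but it may come in handy in the future.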
import io

import urllib3
import networkx as nx

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

data = io.BytesIO(f.data)

G = nx.read_pajek(data, encoding = 'UTF-8')
print(f"G='{G}', G.size={G.size()}")
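To see why passing f.data directly fails while the io.BytesIO wrapper works, the sketch below mimics the generator shown in the traceback (line.decode(encoding) for line in path) using only the standard library. The payload is a made-up, Pajek-style stand-in for f.data, not the real airports.net contents:

```python
import io

# Hypothetical stand-in for f.data: a tiny Pajek-style payload as bytes.
payload = b'*Vertices 2\n1 "A"\n2 "B"\n*Edges\n1 2\n'

# Iterating over a bytes object yields integers (one per byte), which is
# why read_pajek fails with "'int' object has no attribute 'decode'".
first_item = next(iter(payload))
print(type(first_item).__name__, first_item)  # int 42 (the byte value of '*')

# Iterating over a BytesIO wrapper yields one bytes object per line, so
# each item can be decoded the way the traceback's generator does.
lines = [line.decode("UTF-8") for line in io.BytesIO(payload)]
print(len(lines), repr(lines[0]))
```

The same reasoning applies to the file-based options: an open file, like BytesIO, is iterated line by line rather than byte by byte.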