How to read a .net file downloaded with urllib3?

Question

I am using urllib3 to download the file airports.net from GitHub and networkx.read_pajek to read it as a graph object, as follows:

import urllib3
import networkx as nx


http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)
G = nx.read_pajek(f.data(), encoding = 'UTF-8')
print(G)

Then I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-7728c1228755> in <module>
     13 url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
     14 f = http.request('GET', url)
---> 15 G = nx.read_pajek(f.data(), encoding = 'UTF-8')
     16 print(G)
     17 

TypeError: 'bytes' object is not callable

Could you please explain how to do this?

Update: If I change f.data() to f.data, I get a new error:

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-e96ad6eb1bfb> in <module>()
      6 url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
      7 f = http.request('GET', url)
----> 8 G = nx.read_pajek(f.data, encoding = 'UTF-8')
      9 print(G)

<decorator-gen-781> in read_pajek(path, encoding)

4 frames
/usr/local/lib/python3.6/dist-packages/networkx/readwrite/pajek.py in <genexpr>(.0)
    159     for format information.
    160     """
--> 161     lines = (line.decode(encoding) for line in path)
    162     return parse_pajek(lines)
    163 

AttributeError: 'int' object has no attribute 'decode'

Answer 1

As the error message suggests, and as the urllib3 docs confirm, HTTPResponse.data is an attribute of type bytes, not a method. So you need f.data rather than f.data() to retrieve the value.
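The difference between reading an attribute and calling it can be reproduced without any network access. The sketch below uses a hypothetical FakeResponse class as a stand-in for urllib3's HTTPResponse (it is not part of urllib3), with a made-up payload:

```python
class FakeResponse:
    """Hypothetical stand-in for urllib3's HTTPResponse, for illustration only."""

    def __init__(self, body: bytes):
        # .data is a plain attribute that holds the response body as bytes
        self.data = body


resp = FakeResponse(b"*Vertices 2\n")  # made-up payload

# Correct: read the attribute
payload = resp.data

# Incorrect: calling the attribute tries to call a bytes object
try:
    resp.data()
except TypeError as exc:
    error_message = str(exc)

print(payload)
print(error_message)
```

The TypeError raised here is the same kind as the one in the question: the parentheses attempt to call the bytes value itself.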

Update

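Regarding the AttributeError: as the networkx docs confirm, read_pajek expects its first argument to be a file path (or an open file handle) containing the data, not the data itself. Given a bytes object, it iterates over it directly, and iterating over bytes yields integers, hence 'int' object has no attribute 'decode'. So you can dump the bytes to a file and pass that file's path as the argument. There are several options: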

Use a hard-coded file name. This is arguably the simplest option and needs no extra imports.

import urllib3
import networkx as nx

FILE_NAME = "/tmp/test.net"

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

# Write the raw bytes unchanged; binary mode avoids a platform-dependent re-encoding
with open(FILE_NAME, "wb") as fh:
    fh.write(f.data)

G = nx.read_pajek(FILE_NAME, encoding='UTF-8')
print(f"G='{G}', G.size={G.size()}")

Use the tempfile standard library module to manage the file for you (i.e. give it a random name and delete it once it is no longer in use).

import tempfile

import urllib3
import networkx as nx

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

with tempfile.NamedTemporaryFile() as fh:
    fh.write(f.data)
    fh.flush()  # push buffered bytes to disk before read_pajek reopens the file by name
    G = nx.read_pajek(fh.name, encoding='UTF-8')

print(f"G='{G}', G.size={G.size()}")
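Reopening a NamedTemporaryFile by name while it is still open works on Unix but not on Windows, so a temporary directory is one possible alternative. The sketch below is only an illustration: it uses a made-up Pajek-style payload in place of f.data and a plain read in place of nx.read_pajek so that it runs without network access, and the file name sample.net is an arbitrary choice.

```python
import os
import tempfile

# Made-up payload standing in for f.data (an assumption for illustration)
payload = b'*Vertices 2\n1 "A"\n2 "B"\n*Edges\n1 2\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.net")

    # Closing the file guarantees every byte is on disk before it is reopened
    with open(path, "wb") as fh:
        fh.write(payload)

    # A path-based reader such as nx.read_pajek(path) would go here
    with open(path, "rb") as fh:
        round_trip = fh.read()

# The directory and the file inside it are removed when the with block exits
print(round_trip == payload, os.path.exists(path))
```

Because the file is closed before it is read back, this variant does not depend on flushing or on platform-specific reopen behaviour.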

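Use io.BytesIO or io.StringIO ("in-memory files"). This creates an object that lives in memory (RAM) but has the same API as a regular file stored on disk. Accessing something in RAM is much faster than going to disk, so this is useful for performance reasons. You can't always use it, since RAM is limited, but in your case the data is already in memory, so dumping it to disk only for networkx to read it back into memory would be wasted work. You probably won't notice the difference here, since you appear to download a single, not very large file just once, but it may come in handy in the future. Because read_pajek decodes each line itself (the line.decode(encoding) call in the traceback), io.BytesIO is the variant that fits here.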

import io

import urllib3
import networkx as nx

http = urllib3.PoolManager()
url = 'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/airports.net'
f = http.request('GET', url)

data = io.BytesIO(f.data)

G = nx.read_pajek(data, encoding = 'UTF-8')
print(f"G='{G}', G.size={G.size()}")
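Why io.BytesIO works where the raw bytes did not can be checked without networkx. The sketch below mimics the line.decode(encoding) step shown in the traceback, using a made-up Pajek-style payload in place of f.data:

```python
import io

# Made-up payload standing in for f.data (an assumption for illustration)
raw = b'*Vertices 2\n1 "A"\n2 "B"\n*Edges\n1 2\n'

# Iterating over bytes yields integers, which have no .decode() method
first_item = next(iter(raw))

# Iterating over a BytesIO yields one bytes object per line, which can be decoded
lines = [line.decode("UTF-8") for line in io.BytesIO(raw)]

print(type(first_item).__name__, first_item)
print(lines[0].strip())
```

The first case reproduces the cause of the AttributeError in the question; the second is the line-by-line iteration a file-like object provides.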
