Copying bytes in Python from Numpy array into string or bytearray

Question

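I am reading data from a UDP socket in a while loop. I need the most efficient way to

1) Read the data (*) (that is more or less solved, but comments are welcome)

2) Dump the (manipulated) data periodically to a file (**) (the question)

I am anticipating a bottleneck in numpy's "tostring" method. Let's consider the following piece of (incomplete) code: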

import socket
import numpy

nbuf=4096
buf=numpy.zeros(nbuf,dtype=numpy.uint8) # i.e., an array of bytes
f=open('dump.data','wb') # binary mode, because raw bytes are written

datasocket=socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# ETC.. (code missing here) .. the datasocket is, of course, non-blocking

while True:
  gotsome=True
  try:
    N=datasocket.recv_into(buf) # no memory-allocation here .. (*)
  except(socket.error):
    # do nothing ..
    gotsome=False

  if (gotsome):
    # the bytes in "buf" will be manipulated in various ways ..
    # the following write is done frequently (not necessarily in each pass of the while loop):
    f.write(buf[:N].tostring())  # (**) The question: what is the most efficient way to do this?

f.close()

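Now, at (**), as I understand it:

1) buf[:N] allocates memory for a new array object, having the length N+1, right? (maybe not)

.. and after that:

2) buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string

That seems like a lot of memory allocation and copying. In this same loop, in the future, I will read several sockets and write into several files.

Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?

That is, to do this in the spirit of the buffer interface and avoid those two extra memory allocations?

P.S. f.write(buf[:N].tostring()) is equivalent to buf[:N].tofile(f)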

Answer 1

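Basically, it sounds like you want to use the array's tofile method or use the ndarray.data buffer object directly.

For your exact use case, using the array's data buffer is the most efficient, but there are a lot of caveats to be aware of for general use. I'll elaborate in a bit.

First, though, let me answer a couple of your questions and provide some clarification: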

buf[:N] allocates memory for a new array object, having the length N+1, right?

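That depends on what you mean by "new array object". Very little additional memory is allocated, regardless of the size of the arrays involved.

It does allocate memory for a new array object (a few bytes), but it does not allocate additional memory for the array's data. Instead, it creates a "view" that shares the original array's data buffer. Any changes you make to y = buf[:N] will affect buf as well.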
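The view behavior described above can be checked directly. This is a minimal sketch, assuming NumPy is imported as np; the names buf, N, and y follow the question, and the sizes are chosen only for illustration.

```python
import numpy as np

buf = np.zeros(4096, dtype=np.uint8)
N = 16

y = buf[:N]                       # new array object, same data buffer
assert np.shares_memory(buf, y)   # no copy of the data was made
assert y.base is buf              # y is a view onto buf

y[0] = 255                        # writing through the view...
first = int(buf[0])               # ...is visible in the original array
```

Only the small array header is allocated for y; the 4096-byte data block is not duplicated.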

buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string

Yes, that's correct.

As a side note, you can actually go the opposite way (string to array) without allocating any additional memory:

import numpy as np
somestring = b'This could be a big string'  # a bytes object (a plain str in Python 2)
arr = np.frombuffer(somestring, dtype=np.uint8)  # wraps the existing memory, no copy

However, because Python strings are immutable, arr will be read-only.
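A short sketch of the read-only behavior, assuming Python 3, where the immutable source is a bytes object. The bytearray case is added here for contrast and is not part of the original answer; the variable names are illustrative.

```python
import numpy as np

somebytes = b"This could be a big string"
arr = np.frombuffer(somebytes, dtype=np.uint8)    # wraps the bytes, no copy
readonly_flag = arr.flags.writeable               # False: the source is immutable

try:
    arr[0] = 0
    write_failed = False
except ValueError:
    write_failed = True                           # assignment is rejected

# A mutable bytearray, by contrast, yields a writeable array over the same memory.
mutable = bytearray(b"abc")
arr2 = np.frombuffer(mutable, dtype=np.uint8)
arr2[0] = ord("z")                                # modifies the bytearray in place
```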

Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?

Yes!

Basically, you'd want:

f.write(buf[:N].data)

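This is very efficient and will work with any file-like object. It's almost certainly what you want in this exact case. However, there are several caveats!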
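As a rough illustration of writing the view's buffer to a file-like object, the sketch below assumes Python 3, where .data is a memoryview, and uses io.BytesIO as a stand-in for whatever binary sink is actually in use. The array contents are made up for the example.

```python
import io
import numpy as np

buf = np.arange(32, dtype=np.uint8)
N = 8

sink = io.BytesIO()                    # any file-like object in binary mode
written = sink.write(buf[:N].data)     # passes the view's memory to write()
```

No intermediate bytes object is built from the array here; write() reads straight from the memory exposed by the view.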

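First, note that N is in items of the array, not in bytes directly. The two are equivalent in your example code (because of dtype=numpy.uint8, or any other 8-bit data type).

If you did want to write a number of bytes, you could do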

f.write(buf.data[:N])

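...but slicing the arr.data buffer will allocate a new string (this is how Python 2's buffer object behaves), so it is functionally similar to buf[:N].tostring(). In any case, be aware that f.write(buf[:N].tostring()) is different from f.write(buf.data[:N]) for most dtypes, but both will allocate a new string.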
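The difference between counting items and counting bytes shows up as soon as the dtype is wider than 8 bits. The sketch below assumes Python 3, where .data is a memoryview that is indexed by item and can be reinterpreted as raw bytes with cast("B") without copying; the uint16 array is a hypothetical example, not taken from the question.

```python
import numpy as np

arr = np.arange(8, dtype=np.uint16)       # 2 bytes per item
N = 4

n_item_bytes = arr[:N].nbytes             # first N items occupy 8 bytes
first_n_bytes = arr.data.cast("B")[:N]    # first N bytes, still a view (no copy)
as_bytes = arr[:N].tobytes()              # copies the first N items into new bytes
```

tobytes is the current name for the tostring method used in the question.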

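Next, numpy arrays can share data buffers. In your example case you don't need to worry about this, but in general, using somearr.data can lead to surprises for this reason.

As an example: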

x = np.arange(10, dtype=np.uint8)
y = x[::2]

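Now, y shares the same memory buffer as x, but it is not contiguous in memory (have a look at x.flags vs y.flags). Instead, it references every other item in x's memory buffer (compare x.strides to y.strides).

If we try to access y.data, we'll get an error telling us that this is not a contiguous array in memory, and we can't get a single-segment buffer for it: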

In [5]: y.data
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-54-364eeabf8187> in <module>()
----> 1 y.data

AttributeError: cannot get single-segment buffer for discontiguous array
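The layout difference between x and y can be inspected without touching .data at all. The following sketch uses only flags, strides, and np.ascontiguousarray; the packed copy at the end is one possible way to obtain a single contiguous block and is an addition for illustration.

```python
import numpy as np

x = np.arange(10, dtype=np.uint8)
y = x[::2]                               # every other item of x, no copy

x_contig = x.flags["C_CONTIGUOUS"]       # x owns one contiguous block
y_contig = y.flags["C_CONTIGUOUS"]       # y skips bytes, so it is not contiguous
x_stride = x.strides                     # step of 1 byte between items
y_stride = y.strides                     # step of 2 bytes between items

packed = np.ascontiguousarray(y)         # explicit copy into contiguous memory
payload = packed.tobytes()
```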

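This is a large part of the reason that numpy arrays have a tofile method (it also pre-dates Python's buffers, but that's another story).

tofile will write the data in the array to a file without allocating additional memory. However, because it is implemented at the C level, it only works for real file objects, not file-like objects (e.g. sockets, StringIO, etc.).

For example: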

buf[:N].tofile(f)

This does, however, allow you to use arbitrary array indexing:

buf[someslice].tofile(f)

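will create a new view (same memory buffer) and write it to disk efficiently. In your exact case, it will be slightly slower than slicing the arr.data buffer and writing it directly to disk. If you'd prefer to use array indexing (rather than a number of bytes), then the ndarray.tofile method will be more efficient than f.write(arr.tostring()).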
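A small end-to-end sketch of tofile with a real file on disk, covering both a contiguous slice and a strided one. The temporary directory and the array contents are assumptions made for the example; only the file name dump.data comes from the question.

```python
import os
import tempfile
import numpy as np

buf = np.arange(16, dtype=np.uint8)
N = 8

path = os.path.join(tempfile.mkdtemp(), "dump.data")
with open(path, "wb") as f:      # tofile needs a real file object
    buf[:N].tofile(f)            # contiguous view: items 0..7
    buf[::4].tofile(f)           # strided view: items 0, 4, 8, 12

with open(path, "rb") as f:
    contents = f.read()
```

Reading the file back confirms that both views were written as raw bytes in index order.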
