Numpy matrix of strings converted to shared array of strings creates type mismatch

Question

I am experimenting with multiprocessing in Python, but I am having trouble creating some shared memory. The following example illustrates my problem:

With reference to the following (slightly different, as it uses a matrix full of floats, but the same principle), I want to convert a numpy matrix of strings into a shared memory space for processes to use. I have the following:

from ctypes import c_wchar_p
import numpy as np
from multiprocessing.sharedctypes import Array

input_array = np.array([['Red', 'Green', 'Blue', 'Yellow'],
                        ['Purple', 'Orange', 'Cyan', 'Pink']]).T

shared_memory = Array(c_wchar_p, input_array.size, lock=False) # Equivalent to just using a RawArray
np_wrapper = np.frombuffer(shared_memory, dtype='<U1').reshape(input_array.shape)
np.copyto(np_wrapper, input_array)
print(np_wrapper)

However, np_wrapper contains only the first character of each string:

[['R' 'P']
 ['G' 'O']
 ['B' 'C']
 ['Y' 'P']]

What I have tried:

I tried changing the dtype passed to frombuffer from <U1 to <U6, which is the dtype of input_array. However, it throws the following exception:

ValueError: buffer size must be a multiple of element size

I tried using a dtype of int64 with frombuffer, since my shared_memory array is of type c_wchar_p (i.e. string pointers) and I am on a 64-bit Windows 10 system. However, it throws the following exception:

ValueError: cannot reshape array of size 4 into shape (4,2)

I am very confused about why my types are wrong here. Does anyone know how to fix this?

Answer 1

It may help to understand what this string array contains:

In [643]: input_array = np.array([['Red', 'Green', 'Blue', 'Yellow'],
     ...:                         ['Purple', 'Orange', 'Cyan', 'Pink']]).T
     ...: 
     ...:                         
In [644]: input_array.size
Out[644]: 8
In [645]: input_array.itemsize
Out[645]: 24
In [646]: input_array.nbytes
Out[646]: 192

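Because of the transpose, the shape and strides differ from those of the input list, but the strings remain in their original order in memory.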

In [647]: input_array.__array_interface__
Out[647]: 
{'data': (139792902236880, False),
 'strides': (24, 96),
 'descr': [('', '<U6')],
 'typestr': '<U6',
 'shape': (4, 2),
 'version': 3}

My guess is that the Array should be sized with nbytes rather than size.
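To make that guess concrete, here is a minimal sketch comparing the bytes NumPy needs with the bytes an array of `size` string pointers provides. The variable names `needed_bytes` and `pointer_bytes` are mine, and the pointer size depends on the interpreter build.

```python
import ctypes
import numpy as np

input_array = np.array([['Red', 'Green', 'Blue', 'Yellow'],
                        ['Purple', 'Orange', 'Cyan', 'Pink']]).T

# Bytes NumPy actually needs: one fixed-width slot per element.
needed_bytes = input_array.size * input_array.itemsize

# Bytes provided by an array of `size` string pointers (platform dependent).
pointer_bytes = input_array.size * ctypes.sizeof(ctypes.c_wchar_p)

print(needed_bytes, input_array.nbytes, pointer_bytes)
```

The first two numbers agree, and the pointer-based allocation is much smaller, so the buffer cannot hold the string data.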

Answer 2

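Preface

Before I detail my solution, I'd like to preface my answer with some useful information. The memoryview() function in Python proved very useful for getting the full picture. For example, run the following after creating input_array with dtype='S6' (b/c there are fewer bytes to inspect):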

print(bytes(memoryview(input_array)))

This yields the following result:

b'Red\x00\x00\x00PurpleGreen\x00OrangeBlue\x00\x00Cyan\x00\x00YellowPink\x00\x00'

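We can see from the output above that each entry in input_array is 6 bytes long and that the entries sit in one contiguous block of memory. This tells us that our Numpy array is not just 8 pointers to strings elsewhere in memory.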
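As a small sketch of that observation, the byte string can be cut into fixed-width slots of itemsize bytes. The names `raw`, `width` and `chunks` are only illustrative.

```python
import numpy as np

input_array = np.array([['Red', 'Green', 'Blue', 'Yellow'],
                        ['Purple', 'Orange', 'Cyan', 'Pink']], dtype='S6').T

raw = bytes(memoryview(input_array))
width = input_array.itemsize  # 6 bytes per entry for dtype 'S6'

# Slice the flat byte string into one fixed-width slot per element.
chunks = [raw[i:i + width] for i in range(0, len(raw), width)]
print(chunks)
```

Shorter strings are padded with NUL bytes up to the width of the longest one, which is why every slot has the same length.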

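Going back to the case where the dtype is left unspecified, @hpaulj also provided further useful insight. According to the dtype documentation, our array has type <U6, which breaks down as follows:

<  -- Little-endian (b/c I am on an Intel-based system)
U  -- Unicode string (remember: 4 bytes per Unicode character)
6  -- 6 characters per entry, i.e. 6 * 4 = 24 bytes per entry in the array.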
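The same breakdown can be read off the dtype object itself. This is just a sketch; `bytes_per_char` and `chars_per_entry` are names I made up for the example.

```python
import numpy as np

dt = np.array(['Red', 'Yellow']).dtype      # inferred from the longest string

bytes_per_char = np.dtype('U1').itemsize    # NumPy stores one UCS-4 code point per character
chars_per_entry = dt.itemsize // bytes_per_char

print(dt.kind, bytes_per_char, chars_per_entry, dt.itemsize)
```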

Solution

TL;DR: here is the solution:

from ctypes import c_char
import numpy as np
from multiprocessing.sharedctypes import Array

input_array = np.array([['Red', 'Green', 'Blue', 'Yellow'],
                        ['Purple', 'Orange', 'Cyan', 'Pink']]).T

# Allocate one shared byte per byte of string data (8 elements * 24 bytes).
shared_memory = Array(c_char, input_array.size * input_array.itemsize, lock=False)
# View the shared bytes with the same dtype and shape as the source array.
np_wrapper = np.frombuffer(shared_memory, dtype=input_array.dtype).reshape(input_array.shape)
# Copy the strings into the shared buffer through the Numpy view.
np.copyto(np_wrapper, input_array)

print(shared_memory[:])
print(np_wrapper)

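Explanation of the solution:

The first mistake in the original code is the type given to the shared_memory array. Our Numpy array is not an array of pointers but 8 strings laid out next to each other (with some padding determined by the longest element). Therefore, using the c_wchar_p type (i.e. a string pointer) is incorrect. I chose c_char rather than c_wchar because c_char is guaranteed to be one byte, whereas c_wchar is not (see documentation for further details).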
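If you want to check the sizes on your own machine, ctypes.sizeof reports them. This is only a quick sketch; the values for c_wchar and c_wchar_p vary by platform and interpreter build.

```python
import ctypes

char_size = ctypes.sizeof(ctypes.c_char)        # one byte
wchar_size = ctypes.sizeof(ctypes.c_wchar)      # platform dependent (commonly 2 or 4)
pointer_size = ctypes.sizeof(ctypes.c_wchar_p)  # pointer size of the running interpreter

print(char_size, wchar_size, pointer_size)
```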

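Next, the size of the whole shared memory block needs to be specified. Because I chose c_char as my type, I am specifying a number of bytes. The length is given by the following:

There are 8 elements (input_array.size), each containing 24 bytes (input_array.itemsize). Therefore, there are 8 * 24 = 192 bytes in total in our shared memory.

Finally, when using the frombuffer function in Numpy, be sure to specify the correct dtype, because that is how Numpy divides up and interprets the raw bytes it is given. Simply using the same dtype as input_array gets the translation right.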
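The size and dtype rules above also seem to explain the errors in the question. The sketch below uses a hypothetical helper, `check_frombuffer`, that mimics the two checks involved (it is not part of Numpy). The 32-byte case assumes 4-byte pointers, e.g. a 32-bit Python build; that is my inference from the reported error messages, not something stated in the question.

```python
import numpy as np


def check_frombuffer(buffer_bytes, dtype, shape):
    """Predict whether frombuffer(...).reshape(shape) can succeed (hypothetical helper)."""
    itemsize = np.dtype(dtype).itemsize
    if buffer_bytes % itemsize:
        return "buffer size is not a multiple of element size"
    count = buffer_bytes // itemsize
    if count != int(np.prod(shape)):
        return f"cannot reshape {count} elements into {shape}"
    return "ok"


shape = (4, 2)
print('<U6', check_frombuffer(192, '<U6', shape))  # correctly sized buffer

# 32 bytes = 8 pointers * 4 bytes, an ASSUMED pointer size (see the note above).
for dtype in ('<U1', '<U6', 'int64'):
    print(dtype, check_frombuffer(32, dtype, shape))
```

Under that assumption, <U1 happens to fit (8 slots of 4 bytes), which would explain why only the first character of each string survived.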

Once copyto has run, shared_memory is fully set up!
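For completeness, here is a rough sketch of how a child process might use the buffer. The helpers `to_shared` and `worker` are hypothetical names for this example. No lock is used, so it is only safe here because a single child writes a single entry.

```python
import ctypes
import numpy as np
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray


def to_shared(array):
    """Copy a fixed-width Numpy array into a new shared byte buffer."""
    shared = RawArray(ctypes.c_char, array.nbytes)
    view = np.frombuffer(shared, dtype=array.dtype).reshape(array.shape)
    np.copyto(view, array)
    return shared


def worker(shared, dtype_str, shape):
    """Hypothetical child task: overwrite one entry in place."""
    view = np.frombuffer(shared, dtype=dtype_str).reshape(shape)
    view[0, 0] = 'Black'  # must fit within the fixed width of the dtype


if __name__ == '__main__':
    input_array = np.array([['Red', 'Green', 'Blue', 'Yellow'],
                            ['Purple', 'Orange', 'Cyan', 'Pink']]).T
    shared = to_shared(input_array)

    p = Process(target=worker, args=(shared, input_array.dtype.str, input_array.shape))
    p.start()
    p.join()

    # The parent sees the child's write because both views share the same bytes.
    result = np.frombuffer(shared, dtype=input_array.dtype).reshape(input_array.shape)
    print(result)
```

The dtype string and shape are passed along with the buffer because the raw bytes carry no type information of their own.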


