C-Numpy: How to create fixed-width ndarray of strings from existing data

Question

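I am writing a Python extension module in C++ with Boost Python. I want to return numpy arrays from the module to Python. This works well for numeric data types such as double, but at some point I need to create a string array from existing data.

For numeric arrays I used PyArray_SimpleNewFromData, which worked well. Since strings do not have a fixed length, I used PyArray_New for them, where I can pass the item size, which is 4 in my case. Here is a minimal example: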

bool initNumpy()
{
    Py_Initialize();
    import_array();
    return true;
}

class Foo {
    public:            
        Foo() {
            initNumpy();
            data.reserve(10);
            data = {"Rx", "Rx", "Rx", "RxTx", "Tx", "Tx", "Tx", "RxTx", "Rx", "Tx"};                
        }

        PyObject* getArray() {
            npy_intp dims[] = { data.size() };            
            return (PyObject*)PyArray_New(&PyArray_Type, 1, dims, NPY_STRING, NULL, &data[0], 4, NPY_ARRAY_OWNDATA, NULL);
        }
    private:
        std::vector<std::string> data;             
};

I expect the output of getArray() to equal the output of numpy.array(["Rx", "Rx" ...], dtype="S4"), which is:

array([b'Rx', b'Rx', b'Rx', b'RxTx', b'Tx', b'Tx', b'Tx', b'RxTx', b'Rx',
       b'Tx'], dtype='|S4')

But it looks like this:

array([b'Rx', b'', b'\xcc\xb3b\xd9', b'\xfe\x07', b'\x02', b'', b'\x0f',
       b'', b'Rx\x00\x03', b''], dtype='|S4')

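I tried using the npy_intp const* strides argument, because I suspected the problem lies in the memory layout of the underlying data. Unfortunately, it did not produce the result I wanted.

I am using Microsoft Build Tools 2017, Boost Python, distutils and Python 3.7 to build the extension.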

Answer 1

When using PyArray_New, the data passed in must already have the memory layout that a numpy array expects. That is the case for simple data types such as np.float64, but not for std::vector<std::string> and dtype='|S4'.

First, what memory layout does PyArray_New expect for |S4?

Take, for example,

array([b'Rx', b'RxTx', b'T'], dtype='|S4')

The expected memory layout is:

| R| x|\0|\0| R| x| T| x| T|\0|\0|\0|
|           |           |           |
|- 1. word -|- 2. word -|- 3. word -|

The following details are worth noting:

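The memory is one contiguous block, with the elements stored back to back.
Every element is exactly 4 bytes long. A string that fills the whole element is stored without a NUL terminator (see 2. word); the terminator is not needed.
If a word is shorter than 4 characters, the remaining bytes must be set to \0, the NUL character. If someone wants to store strings with trailing \0 characters, they are out of luck, but that is another story.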
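
To make this layout concrete, here is a minimal sketch in plain C++ (standard library only, no NumPy involved) that packs strings into such a fixed-width buffer. The name pack_fixed_width is hypothetical, and truncating strings longer than the item size is an assumption about the desired behaviour, not something the question specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: lay the strings out back to back in fields of
// `itemsize` bytes each, i.e. the layout described for '|S<itemsize>'.
std::string pack_fixed_width(const std::vector<std::string>& vals,
                             std::size_t itemsize) {
    assert(itemsize > 0);
    // Start with all NUL bytes, so short strings are padded automatically.
    std::string buf(vals.size() * itemsize, '\0');
    for (std::size_t i = 0; i < vals.size(); ++i) {
        // Copy at most itemsize bytes; longer strings are truncated (assumption).
        const std::size_t n = std::min(vals[i].size(), itemsize);
        std::memcpy(&buf[i * itemsize], vals[i].data(), n);
    }
    return buf;
}
```

For {"Rx", "RxTx", "T"} and an item size of 4 this yields the 12 bytes shown in the diagram above. It also illustrates the last point: "T" and "T\0" pack to the same 4 bytes, so trailing NUL characters cannot be recovered.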

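A std::vector<std::string> has a completely different memory layout. In addition, the memory layout of std::string is not prescribed by the C++ standard, so it can differ between implementations.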
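
One way to see the mismatch is to compare where a '|S4' array would look for element i (i * 4 bytes past &data[0]) with where the characters of data[i] actually live. The helper below is a hypothetical diagnostic that uses only the standard library; it is a sketch for illustration and not part of the NumPy API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical diagnostic: returns true only if the characters of every
// string sit exactly i * itemsize bytes past the address of the first
// std::string object, which is where a fixed-width array would read them.
bool matches_fixed_width_layout(const std::vector<std::string>& data,
                                std::size_t itemsize) {
    assert(itemsize > 0);
    if (data.empty()) return true;
    const auto base = reinterpret_cast<std::uintptr_t>(&data[0]);
    for (std::size_t i = 0; i < data.size(); ++i) {
        const auto chars = reinterpret_cast<std::uintptr_t>(data[i].data());
        if (chars != base + i * itemsize) return false;
    }
    return true;
}
```

For a vector with more than one string this check fails on mainstream implementations, because consecutive std::string objects are sizeof(std::string) bytes apart rather than 4. If the standard library in the question stores short strings inside a 32-byte string object (an assumption about that particular build), this would plausibly explain why b'Rx' shows up at index 0 and again at index 8 of the garbled output, since 8 * 4 = 32.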

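The consequence of these observations is that if the strings are given as a std::vector<std::string>, there is no way around copying the data. The function consists of three steps: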

Allocate memory.
Copy the strings to the new location.
Create the numpy array from the memory constructed above.

Below is an example implementation in C++11, with error handling left as an exercise for the reader:

PyObject* create_np_array(const std::vector<std::string>& vals, size_t itemsize){

    //1. step: allocate memory
    size_t mem_size = vals.size()*itemsize;
    void * mem = PyDataMem_NEW(mem_size);
    //ToDo: check mem!=nullptr
    //ToDo: make code exception safe

    //2. step: initialize memory/copy data
    char * out = static_cast<char*>(mem);
    size_t cur_index=0;
    for(const auto& val : vals){
        for(size_t i=0;i<itemsize;i++){
            //fill with NUL if string too short
            out[cur_index++] = i<val.size() ? val[i] : 0;
        }
    }

    //3. step: create numpy array, passing itemsize instead of a hard-coded 4
    npy_intp dim = static_cast<npy_intp>(vals.size());
    return (PyObject*)PyArray_New(&PyArray_Type, 1, &dim, NPY_STRING, NULL, mem, static_cast<int>(itemsize), NPY_ARRAY_OWNDATA, NULL);
}

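One last important point: memory that is to be owned by the resulting numpy array (the NPY_ARRAY_OWNDATA flag) should be allocated with PyDataMem_NEW rather than malloc. This has two advantages: memory tracing works properly, and we don't (mis)use an implementation detail. For other ways of handing over ownership of the data, see this SO post.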
