NumPy "record array" or "structured array" or "recarray"

Question

What, if any, is the difference between a NumPy "structured array", a "record array" and a "recarray"?

The NumPy docs imply that the first two are the same: if they are, which is the preferred term for this object?

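The same documentation says (at the bottom of the page): You can find some more information on recarrays and structured arrays (including the difference between the two) here. Is there a simple explanation of this difference?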

Answer 1

Records/recarrays are implemented in

https://github.com/numpy/numpy/blob/master/numpy/core/records.py

Some relevant quotes from this file:

Record Arrays Record arrays expose the fields of structured arrays as properties. The recarray is almost identical to a standard array (which supports named fields already) The biggest difference is that it can use attribute-lookup to find the fields and it is constructed using a record.

recarray is a subclass of ndarray (in the same way that matrix and masked arrays are). But note that its constructor is not the same as np.array. It is more like np.empty(size, dtype).

class recarray(ndarray):
    """Construct an ndarray that allows field access using attributes.
    This constructor can be compared to ``empty``: it creates a new record
       array but does not fill it with data.
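
To make the constructor difference concrete, here is a minimal sketch. The field names x and y and the sample values are made up for illustration. It contrasts np.recarray, which allocates uninitialized storage from a shape and a dtype, with two common ways of getting a recarray from existing data: viewing a structured array, and np.rec.array.

```python
import numpy as np

# Illustrative dtype; the field names "x" and "y" are assumptions.
dt = np.dtype([("x", "f8"), ("y", "i4")])

# Like np.empty: takes a shape and a dtype, contents are uninitialized.
r = np.recarray((3,), dtype=dt)
r["x"] = [1.0, 2.0, 3.0]
r["y"] = [10, 20, 30]

# From existing data: build a structured array, then view it as a recarray.
s = np.array([(1.0, 10), (2.0, 20)], dtype=dt)
r2 = s.view(np.recarray)

# Or let np.rec.array build the recarray from the records directly.
r3 = np.rec.array([(1.0, 10), (2.0, 20)], dtype=dt)

print(type(r).__name__, r.x, r2.y, r3.x)
```

The view in r2 does not copy: it shares memory with s, so a write through one is visible through the other.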

The key function for implementing the distinctive field-as-attribute behavior is __getattribute__ (__getitem__ implements indexing):

def __getattribute__(self, attr):
    # See if ndarray has this attr, and return it if so. (note that this
    # means a field with the same name as an ndarray attr cannot be
    # accessed by attribute).
    try:
        return object.__getattribute__(self, attr)
    except AttributeError:  # attr must be a fieldname
        pass

    # look for a field with this name
    fielddict = ndarray.__getattribute__(self, 'dtype').fields
    try:
        res = fielddict[attr][:2]
    except (TypeError, KeyError):
        raise AttributeError("recarray has no attribute %s" % attr)
    obj = self.getfield(*res)

    # At this point obj will always be a recarray, since (see
    # PyArray_GetField) the type of obj is inherited. Next, if obj.dtype is
    # non-structured, convert it to an ndarray. If obj is structured leave
    # it as a recarray, but make sure to convert to the same dtype.type (eg
    # to preserve numpy.record type if present), since nested structured
    # fields do not inherit type.
    if obj.dtype.fields:
        return obj.view(dtype=(self.dtype.type, obj.dtype.fields))
    else:
        return obj.view(ndarray)

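It first tries to get a regular attribute: things like .shape, .strides, .data, as well as all the methods (.sum, .reshape, etc.). If that fails, it looks the name up in the dtype field names. So it is really just a structured array with some redefined access methods.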
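
The lookup order described above can be seen in a small sketch. The dtype here is hypothetical: the field name "mean" is chosen on purpose to collide with ndarray.mean, and "pos" is a nested structured field.

```python
import numpy as np

# Hypothetical dtype: "mean" deliberately collides with the ndarray.mean method,
# and "pos" is a nested structured field.
dt = np.dtype([("x", "f8"), ("mean", "f8"), ("pos", [("u", "f8"), ("v", "f8")])])
arr = np.array([(1.0, 10.0, (0.1, 0.2)), (2.0, 20.0, (0.3, 0.4))], dtype=dt)
rec = arr.view(np.recarray)

plain_field = rec.x      # no ndarray attribute named "x", so the field is returned
shadowed = rec.mean      # ndarray attribute wins: this is the bound method
by_key = rec["mean"]     # indexing still reaches the field
nested = rec.pos         # structured sub-field stays a recarray

print(type(plain_field).__name__)   # non-structured field comes back as a plain ndarray
print(callable(shadowed))
print(by_key)
print(type(nested).__name__, nested.u)
```

Indexing with rec["mean"] is the reliable way to reach a field whose name clashes with an ndarray attribute.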

As best I can tell, record array and recarray are the same thing.

Another file shows something of the history:

https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py

Collection of utilities to manipulate structured arrays. Most of these functions were initially implemented by John Hunter for matplotlib. They have been rewritten and extended for convenience.

Many of the functions in this file end with:

    if asrecarray:
        output = output.view(recarray)

The fact that you can return the array as a recarray view shows how 'thin' this layer is.
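
As a rough illustration of how thin the layer is, the sketch below views a structured array as a recarray and back without copying, then calls one recfunctions helper with asrecarray=True. The field names a, b, c and the values are assumptions chosen for the example.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Illustrative structured array; field names are assumptions.
struct = np.zeros(3, dtype=[("a", "i4"), ("b", "f8")])

# Viewing as a recarray does not copy the data.
rec = struct.view(np.recarray)
rec.a[0] = 7                  # write through the attribute-style field access
print(struct["a"])            # the original structured array sees the change

# Viewing back gives a plain ndarray over the same memory.
back = rec.view(np.ndarray)
print(type(back).__name__, back.dtype.names)

# recfunctions helpers can hand back a recarray view of their result.
out = rfn.append_fields(struct, "c", [1, 2, 3], usemask=False, asrecarray=True)
print(type(out).__name__, out.c)
```

Unlike the views, append_fields builds a new array, so out does not share memory with struct.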

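numpy has a long history and merges several independent projects. My impression is that recarray is an older idea, and structured arrays are the current implementation, built on a generalized dtype. recarrays seem to be kept for convenience and backward compatibility rather than for any new development. But I'd have to study the github file history, and any recent issues/pull requests, to be sure.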

Answer 2

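The answer in a nutshell is that you should generally use structured arrays rather than recarrays, because structured arrays are faster. The only advantage of recarrays is that they let you write arr.x instead of arr['x'], which can be a convenient shortcut but is also error-prone if your column names conflict with numpy methods/attributes.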

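See this excerpt from @jakevdp's book for a more detailed explanation. In particular, he notes that simply accessing columns of structured arrays can be around 20x to 30x faster than accessing columns of recarrays. However, his example uses a very small dataframe with just 4 rows and doesn't perform any standard operations.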

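For simple operations on larger dataframes, the difference is likely to be much smaller, although structured arrays are still faster. For example, here is a structured array and a record array, each with 10,000 rows (the code to create the arrays from a dataframe is borrowed from @jpp's answer here).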

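import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({'x': np.random.randn(n)})
df['y'] = df.x.astype(int)

# Record array built directly from the dataframe
rec_array = df.to_records(index=False)

# Structured array with the same field names and dtypes
s = df.dtypes
struct_array = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))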

If we do a standard operation such as multiplying a column by 2, it's about 50% faster for the structured array:

%timeit struct_array['x'] * 2
9.18 µs ± 88.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit rec_array.x * 2
14.2 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
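
For anyone who wants to repeat the comparison outside IPython, here is a rough, self-contained sketch using the standard timeit module. It is not the code that produced the numbers above: it builds the arrays directly in NumPy rather than through pandas, the seed and repetition counts are arbitrary, and the timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

n = 10_000
rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Structured array with the same two columns as in the example above.
struct_array = np.empty(n, dtype=[("x", "f8"), ("y", "i8")])
struct_array["x"] = rng.standard_normal(n)
struct_array["y"] = struct_array["x"].astype("i8")

# Record array over the same data (a view, no copy).
rec_array = struct_array.view(np.recarray)

reps = 10_000

# Bare column access: this is where the attribute lookup overhead shows most.
t_struct_get = timeit.timeit(lambda: struct_array["x"], number=reps)
t_rec_get = timeit.timeit(lambda: rec_array.x, number=reps)

# Column access plus a simple operation on 10,000 values.
t_struct_mul = timeit.timeit(lambda: struct_array["x"] * 2, number=reps)
t_rec_mul = timeit.timeit(lambda: rec_array.x * 2, number=reps)

print(f"access:   struct {t_struct_get / reps * 1e6:.2f} us, rec {t_rec_get / reps * 1e6:.2f} us")
print(f"multiply: struct {t_struct_mul / reps * 1e6:.2f} us, rec {t_rec_mul / reps * 1e6:.2f} us")
```

The recarray's extra cost is a roughly fixed per-access overhead, so it should matter less as the work done on each column grows.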
