How can I convert an ndarray/multi-dimensional array to a parquet file?

I have an array that I want to save to a parquet file so I can pass it to an ML model I'm building. My array contains 159573 arrays, and each of those has 1395 entries.

Here is a sample of my data:

[[0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 [0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 [0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 ...
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]]

I tried to convert it with this code:

import pyarrow as pa
pa_table = pa.table({"data": Main_x})
pa.parquet.write_table(pa_table, "full_data.parquet")

I get this stack trace:

5 frames
/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.table()

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

Is there a way to save a multi-dimensional array to parquet?

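Parquet/Arrow isn't the best fit for this kind of data. It is better at handling tabular data with a well-defined schema and specific column names and types. In particular, the numpy conversion API only supports one-dimensional data.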

That said, you can easily convert a 2-D numpy array to parquet, but you need to massage it first.

Your best option is to save it as a table with n columns of m doubles each.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

matrix = np.random.rand(10, 100)

# Iterating over a 2-D array yields its rows, so each of the 10 matrix rows
# becomes one 1-D arrow array, i.e. one table column of 100 doubles.
arrays = [
    pa.array(row)
    for row in matrix
]

table = pa.Table.from_arrays(
    arrays,
    names=[str(i) for i in range(len(arrays))]  # give each column a name
)
# Save it:
pq.write_table(table, 'table.pq')

# Read it back as numpy (transpose to restore the original 10 x 100 shape):
table_from_parquet = pq.read_table('table.pq')
matrix_from_parquet = table_from_parquet.to_pandas().T.to_numpy()

The intermediate table has 10 columns and 100 rows:

|         0 |          1 |          2 |         3 |          4 |          5 |          6 |         7 |         8 |          9 |
|----------:|-----------:|-----------:|----------:|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
| 0.45774   | 0.92753    | 0.252345   | 0.982261  | 0.503732   | 0.543526   | 0.22827    | 0.347948  | 0.654259  | 0.152693   |
| 0.287813  | 0.793067   | 0.972282   | 0.739047  | 0.0689906  | 0.102235   | 0.110273   | 0.166839  | 0.907481  | 0.427729   |
| 0.523928  | 0.511737   | 0.473887   | 0.771607  | 0.707633   | 0.276726   | 0.943073   | 0.788174  | 0.305119  | 0.511876   |
| 0.67563   | 0.947449   | 0.895125   | 0.246979  | 0.703503   | 0.256418   | 0.93113    | 0.116715  | 0.330746  | 0.566704   |
| 0.471526  | 0.45332    | 0.546384   | 0.822873  | 0.333542   | 0.518933   | 0.229525   | 0.381977  | 0.893204  | 0.932781   |
...