How can I convert an ndarray/multi-dimensional array to a parquet file?
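I have an ndarray that I want to save to a parquet file so I can pass it to an ML model I'm building.
My array contains 159573 arrays, each of which has 1395 values.
Here is a sample of my data: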
[[0. 0. 0. ... 0.24093714 0.75547471 0.74532781]
[0. 0. 0. ... 0.24093714 0.75547471 0.74532781]
[0. 0. 0. ... 0.24093714 0.75547471 0.74532781]
...
[0. 0. 0. ... 0.89473684 0.29282009 0.29277004]
[0. 0. 0. ... 0.89473684 0.29282009 0.29277004]
[0. 0. 0. ... 0.89473684 0.29282009 0.29277004]]
I tried to convert it with this code:
import pyarrow as pa
pa_table = pa.table({"data": Main_x})
pa.parquet.write_table(pa_table, "full_data.parquet")
I get this stack trace:
5 frames
/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.table()
/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()
/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()
/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()
/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: only handle 1-dimensional arrays
Is there a way to save a multi-dimensional array to parquet?
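Parquet/Arrow is not well suited to storing this kind of data.
It is better at handling tabular data with a well-defined schema and specific column names and types.
In particular, the numpy conversion API only supports one-dimensional data.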
That said, you can easily convert a 2-D numpy array to parquet, but you need to massage it first.
Your best option is to save it as a table with n columns of m doubles each.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

matrix = np.random.rand(10, 100)

# Iterating over a 2-D array yields its rows, so each row of the matrix
# becomes one Arrow array, i.e. one column of the table.
arrays = [pa.array(row) for row in matrix]

table = pa.Table.from_arrays(
    arrays,
    names=[str(i) for i in range(len(arrays))],  # give each column a name
)

# Save it:
pq.write_table(table, 'table.pq')

# Read it back as numpy; the transpose undoes the row-to-column mapping:
table_from_parquet = pq.read_table('table.pq')
matrix_from_parquet = table_from_parquet.to_pandas().T.to_numpy()
The intermediate table has 10 columns and 100 rows:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----------:|-----------:|-----------:|----------:|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
| 0.45774 | 0.92753 | 0.252345 | 0.982261 | 0.503732 | 0.543526 | 0.22827 | 0.347948 | 0.654259 | 0.152693 |
| 0.287813 | 0.793067 | 0.972282 | 0.739047 | 0.0689906 | 0.102235 | 0.110273 | 0.166839 | 0.907481 | 0.427729 |
| 0.523928 | 0.511737 | 0.473887 | 0.771607 | 0.707633 | 0.276726 | 0.943073 | 0.788174 | 0.305119 | 0.511876 |
| 0.67563 | 0.947449 | 0.895125 | 0.246979 | 0.703503 | 0.256418 | 0.93113 | 0.116715 | 0.330746 | 0.566704 |
| 0.471526 | 0.45332 | 0.546384 | 0.822873 | 0.333542 | 0.518933 | 0.229525 | 0.381977 | 0.893204 | 0.932781 |
...