Is there a more idiomatic way to select rows from a PyArrow table based on contents of a column?
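I have a large PyArrow table with a column named index that I would like to use to partition the table; each distinct value of index represents a different quantity in the table.
Is there an idiomatic way to select rows from a PyArrow table based on the contents of a column?
Here is an example table: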
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
# Example table for data schema
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt, dtype=np.int64),
'index': np.array(irow%2, dtype=np.int16),
'value': np.array(irow*0, dtype=np.int32)},
columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)
table0 = pa.Table.from_pandas(df0)
print(df0)
# prints the following:
timestamp index value
0 0 0 0
1 0 1 0
2 17 0 1
3 17 1 1691
4 34 0 2
... ... ... ...
1048571 8912845 1 9945
1048572 8912862 0 524286
1048573 8912862 1 9978
1048574 8912879 0 524287
1048575 8912879 1 9723
[1048576 rows x 3 columns]
This selection is easy to do in Pandas:
print(df0[df0['index']==1])
# prints the following
timestamp index value
1 0 1 0
3 17 1 1691
5 34 1 3334
7 51 1 4881
9 68 1 6287
... ... ... ...
1048567 8912811 1 9028
1048569 8912828 1 9625
1048571 8912845 1 9945
1048573 8912862 1 9978
1048575 8912879 1 9723
[524288 rows x 3 columns]
But with PyArrow I have to go back and forth between PyArrow and numpy or pandas:
value_index = table0.column('index').to_numpy()
# get values of the index column, convert to numpy format
row_indices = np.nonzero(value_index==1)[0]
# find matches and get their indices
selected_table = table0.take(pa.array(row_indices))
# use take() with those indices
v = selected_table.column('value')
print(v.to_numpy())
# which prints
[ 0 1691 3334 ... 9945 9978 9723]
Is there a more direct way?
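You don't need to convert to numpy to perform a boolean filter. Use the equal function from the pyarrow.compute module to build a boolean mask, then pass that mask to the table's filter method: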
import pyarrow.compute as pc
value_index = table0.column('index')
row_mask = pc.equal(value_index, pa.scalar(1, value_index.type))
selected_table = table0.filter(row_mask)