Converting a whole table/dataframe to dictionary-encoded columns in pyarrow
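I am loading Parquet files with Apache Arrow (pyarrow). So far I have had to convert the table to pandas, cast the string columns to categorical, and then convert the result back to an Arrow table (which I later save as a Feather file).
The code looks like this: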
import pyarrow as pa
import pyarrow.feather as ft
import pyarrow.parquet as pq

df = pq.read_table(inputFile)
# convert to pandas
df2 = df.to_pandas()
# get all columns that need to be transformed and cast
list_str_obj_cols = df2.columns[df2.dtypes == "object"].tolist()
for str_obj_col in list_str_obj_cols:
    df2[str_obj_col] = df2[str_obj_col].astype("category")
print(df2.dtypes)
# get back from pandas to arrow
table = pa.Table.from_pandas(df2)
# write the file to the file system
ft.write_feather(table, outputFile, compression='lz4')
Is there a way to do this directly with pyarrow? Would it be faster?
Thanks in advance.
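In pyarrow, "categorical" is called "dictionary encoded". So I think your question is whether the columns of an existing table can be dictionary-encoded. You can use the pyarrow.compute.dictionary_encode function to do this. Putting it all together: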
import pyarrow as pa
import pyarrow.compute as pc


def dict_encode_all_str_columns(table):
    new_arrays = []
    for index, field in enumerate(table.schema):
        if field.type == pa.string():
            new_arr = pc.dictionary_encode(table.column(index))
            new_arrays.append(new_arr)
        else:
            new_arrays.append(table.column(index))
    return pa.Table.from_arrays(new_arrays, names=table.column_names)


table = pa.Table.from_pydict({'int': [1, 2, 3, 4], 'str': ['x', 'y', 'x', 'y']})
print(table)
print(dict_encode_all_str_columns(table))